Your weekly dose of actionable cloud wisdom to start the week right
The Problem
Your BigQuery queries are taking 10 minutes to complete what should be 30-second operations, you’re burning through your monthly slot quota by Tuesday, and your data team is getting frustrated with “query exceeded resource limits” errors. Meanwhile, your BigQuery bill shows you’re paying premium prices for queries that scan entire tables just to return a handful of rows.
The Solution
Optimise BigQuery performance using proper table design, intelligent query patterns, and slot management techniques. Most BigQuery performance issues stem from poor partitioning, inefficient SQL patterns, and misunderstanding how BigQuery’s columnar storage works. Well-optimised queries run 10-50x faster whilst using a fraction of the compute resources.
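Because BigQuery's storage is columnar, a query pays for the columns it reads across the whole scanned range, not the rows it returns. A back-of-envelope sketch of that effect (the column widths and row count below are made up for illustration):

```python
# Rough model of columnar scanning: bytes scanned = row count times the
# combined width of the columns the query touches, across the whole
# (unpartitioned) table. All figures are illustrative, not measured.

COLUMN_BYTES_PER_ROW = {      # hypothetical average widths per row
    "transaction_id": 36,
    "customer_id": 12,
    "product_category": 16,
    "amount": 8,
    "notes": 400,             # one wide free-text column dominates
}

def bytes_scanned(selected_columns, row_count):
    """Bytes a columnar engine reads for a query touching these columns."""
    return row_count * sum(COLUMN_BYTES_PER_ROW[c] for c in selected_columns)

rows = 100_000_000
full_scan = bytes_scanned(COLUMN_BYTES_PER_ROW.keys(), rows)   # SELECT *
narrow_scan = bytes_scanned(["customer_id", "amount"], rows)   # two columns

print(f"SELECT *         : {full_scan / 1e9:.1f} GB")
print(f"Two columns only : {narrow_scan / 1e9:.1f} GB")
print(f"Reduction        : {1 - narrow_scan / full_scan:.0%}")
```

The same row count costs a fraction of the bytes once the wide columns drop out of the SELECT list, which is why trimming columns and pruning partitions compound so well.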
Essential Performance Optimization Techniques:
1. Table Partitioning and Clustering Strategy
-- Create properly partitioned and clustered table
CREATE TABLE `project.dataset.optimized_sales_data`
(
transaction_id STRING,
customer_id STRING,
product_category STRING,
transaction_date DATE,
transaction_timestamp TIMESTAMP,
amount DECIMAL(10,2),
region STRING,
sales_rep_id STRING
)
PARTITION BY transaction_date
CLUSTER BY customer_id, product_category
OPTIONS(
description="Sales data optimized for performance",
partition_expiration_days=365,
require_partition_filter=true
);
-- Load data with optimal performance
INSERT INTO `project.dataset.optimized_sales_data`
SELECT
transaction_id,
customer_id,
product_category,
DATE(transaction_timestamp) as transaction_date,
transaction_timestamp,
amount,
region,
sales_rep_id
FROM `project.dataset.raw_sales_data`
WHERE DATE(transaction_timestamp) >= '2024-01-01';
2. Query Optimization Patterns
-- BAD: Scans entire table (expensive and slow)
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.sales_data`
WHERE customer_id IN ('C001', 'C002', 'C003')
GROUP BY customer_id;
-- GOOD: Uses partition pruning and clustering
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= '2024-01-01' -- Partition filter
AND transaction_date <= '2024-12-31' -- Partition filter
AND customer_id IN ('C001', 'C002', 'C003') -- Clustering benefit
GROUP BY customer_id;
-- ADVANCED: Use approximate aggregation for large datasets
SELECT
product_category,
APPROX_COUNT_DISTINCT(customer_id) as unique_customers,
APPROX_QUANTILES(amount, 100)[OFFSET(50)] as median_amount,
APPROX_QUANTILES(amount, 100)[OFFSET(95)] as p95_amount
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= CURRENT_DATE() - 30
GROUP BY product_category;
3. Efficient JOIN Strategies
-- OPTIMIZED: Use proper JOIN order and filtering
WITH recent_customers AS (
SELECT DISTINCT customer_id
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= CURRENT_DATE() - 7
),
customer_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.tier,
COUNT(s.transaction_id) as recent_transactions,
SUM(s.amount) as recent_spend
FROM `project.dataset.customers` c
INNER JOIN recent_customers rc ON c.customer_id = rc.customer_id
INNER JOIN `project.dataset.optimized_sales_data` s
ON c.customer_id = s.customer_id
AND s.transaction_date >= CURRENT_DATE() - 7 -- Partition filter
GROUP BY c.customer_id, c.customer_name, c.tier
)
SELECT *
FROM customer_metrics
WHERE recent_spend > 1000
ORDER BY recent_spend DESC;
-- Use ARRAY_AGG for denormalization when appropriate
-- (aggregate per category first; nesting SUM inside ARRAY_AGG is not allowed)
SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(product_category, category_spend)
  ) as category_breakdown
FROM (
  SELECT
    customer_id,
    product_category,
    SUM(amount) as category_spend
  FROM `project.dataset.optimized_sales_data`
  WHERE transaction_date >= CURRENT_DATE() - 30
  GROUP BY customer_id, product_category
)
GROUP BY customer_id;
4. Materialized Views for Performance
-- Create materialized view for frequently accessed aggregations
CREATE MATERIALIZED VIEW `project.dataset.daily_sales_summary`
PARTITION BY sales_date
CLUSTER BY region, product_category
AS
SELECT
  transaction_date AS sales_date,  -- partition column must come directly from the base table's partition column
  region,
  product_category,
  COUNT(*) as transaction_count,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_transaction_value,
  APPROX_COUNT_DISTINCT(customer_id) as unique_customers  -- exact COUNT(DISTINCT) is not supported in materialized views
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= '2024-01-01'
GROUP BY
  sales_date,
  region,
  product_category;
-- Query the materialized view (much faster)
SELECT
region,
SUM(total_revenue) as monthly_revenue,
AVG(avg_transaction_value) as avg_order_value
FROM `project.dataset.daily_sales_summary`
WHERE sales_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY region
ORDER BY monthly_revenue DESC;
Slot Management and Cost Control
5. Reserved Slots vs On-Demand Optimization
# Python script to analyze slot usage and recommend reservations
import pandas as pd  # required by to_dataframe()
from google.cloud import bigquery

def analyze_slot_usage(project_id, days_back=30):
    """Analyze BigQuery slot usage to optimize reservations."""
    client = bigquery.Client(project=project_id)

    # Query job history for slot analysis
    query = f"""
    SELECT
      creation_time,
      job_id,
      total_slot_ms,
      total_bytes_processed,
      query,
      state
    FROM `{project_id}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days_back} DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
    ORDER BY creation_time DESC
    """
    jobs_df = client.query(query).to_dataframe()

    # Calculate slot usage patterns
    jobs_df['slots_used'] = jobs_df['total_slot_ms'] / 1000  # convert to slot-seconds
    jobs_df['hour'] = jobs_df['creation_time'].dt.hour
    jobs_df['day_of_week'] = jobs_df['creation_time'].dt.dayofweek

    # Hourly slot usage analysis
    hourly_usage = jobs_df.groupby('hour')['slots_used'].agg(['mean', 'max', 'count'])

    # Peak usage analysis
    peak_hours = hourly_usage.nlargest(3, 'mean')
    avg_slots_needed = hourly_usage['mean'].quantile(0.95)

    # Cost analysis -- illustrative rates only; check current BigQuery
    # pricing for your edition and region before acting on the numbers
    on_demand_cost_per_tb = 5.0    # example $ per TB scanned
    reserved_cost_per_slot = 1200  # example $ per year per 100 slots
    monthly_tb_processed = jobs_df['total_bytes_processed'].sum() / (1024**4) * (30 / days_back)
    monthly_on_demand_cost = monthly_tb_processed * on_demand_cost_per_tb

    recommended_slots = int(avg_slots_needed)
    annual_reservation_cost = (recommended_slots / 100) * reserved_cost_per_slot
    monthly_reservation_cost = annual_reservation_cost / 12

    print("=== BigQuery Slot Usage Analysis ===")
    print(f"Analysis period: {days_back} days")
    print(f"Total queries analyzed: {len(jobs_df)}")
    print(f"Average daily queries: {len(jobs_df) / days_back:.0f}")
    print()
    print("=== Slot Usage Patterns ===")
    print("Peak usage hours:")
    for hour, data in peak_hours.iterrows():
        print(f"  {hour:02d}:00 - Avg: {data['mean']:.0f} slots, Max: {data['max']:.0f} slots")
    print()
    print("=== Cost Analysis ===")
    print(f"Monthly TB processed: {monthly_tb_processed:.2f} TB")
    print(f"Current on-demand cost: ${monthly_on_demand_cost:.2f}/month")
    print(f"Recommended reservation: {recommended_slots} slots")
    print(f"Reservation cost: ${monthly_reservation_cost:.2f}/month")

    if monthly_reservation_cost < monthly_on_demand_cost:
        savings = monthly_on_demand_cost - monthly_reservation_cost
        print(f"💰 Potential savings: ${savings:.2f}/month (${savings * 12:.2f}/year)")
    else:
        print("💡 On-demand pricing is more cost-effective for your usage pattern")

    return {
        'recommended_slots': recommended_slots,
        'monthly_savings': max(0, monthly_on_demand_cost - monthly_reservation_cost),
        'usage_patterns': hourly_usage
    }
# Run the analysis
project_id = "your-project-id"
analysis = analyze_slot_usage(project_id)
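The reservation-versus-on-demand decision in the script above boils down to a single comparison. Isolated as a pure helper, using the same illustrative rates as the script (verify both against current BigQuery pricing):

```python
# Breakeven check: slot reservation vs on-demand billing.
# Rates mirror the illustrative values in the analysis script above:
# $5 per TB on-demand, $1200 per year per 100 reserved slots.
def reservation_saves_money(monthly_tb, slots,
                            price_per_tb=5.0,
                            annual_cost_per_100_slots=1200.0):
    """Return True when reserving `slots` undercuts on-demand spend."""
    on_demand_monthly = monthly_tb * price_per_tb
    reservation_monthly = (slots / 100) * annual_cost_per_100_slots / 12
    return reservation_monthly < on_demand_monthly

# 50 TB/month with 200 reserved slots: $250 on-demand vs $200 reservation,
# so the reservation wins; at 1 TB/month the same reservation does not pay off.
print(reservation_saves_money(50, 200))
print(reservation_saves_money(1, 200))
```

The takeaway: reservations pay off only when your scanned volume is steady and high; spiky, low-volume workloads usually stay cheaper on-demand.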
6. Query Performance Monitoring
-- Create performance monitoring dashboard queries
-- Query to identify slow queries
SELECT
job_id,
user_email,
query,
total_slot_ms / 1000 as slot_seconds,
total_bytes_processed / POW(10, 12) as tb_processed,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds,
creation_time
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
AND state = 'DONE'
AND total_slot_ms / 1000 > 1000 -- Queries using more than 1000 slot-seconds
ORDER BY total_slot_ms DESC
LIMIT 20;
-- Query to identify expensive queries
SELECT
DATE(creation_time) as query_date,
user_email,
COUNT(*) as query_count,
SUM(total_bytes_processed) / POW(10, 12) as total_tb_processed,
AVG(total_slot_ms / 1000) as avg_slot_seconds,
SUM(total_slot_ms / 1000) as total_slot_seconds
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
GROUP BY DATE(creation_time), user_email
HAVING total_tb_processed > 0.1 -- Users processing more than 100GB
ORDER BY total_tb_processed DESC;
7. Query Optimization Recommendations Engine
# Python function to analyze and recommend query optimizations
def analyze_query_performance(project_id, query_text, table_schema):
    """Analyze a query and provide optimization recommendations."""
    recommendations = []

    # Check for common anti-patterns
    query_lower = query_text.lower()

    # 1. Check for SELECT *
    if 'select *' in query_lower:
        recommendations.append({
            'priority': 'HIGH',
            'issue': 'SELECT * anti-pattern',
            'recommendation': 'Specify only needed columns to reduce data scanned',
            'potential_savings': '50-90% reduction in bytes processed'
        })

    # 2. Check for missing filters on partition columns
    for table in table_schema:
        partition_column = table.get('partition_column', '_partitiontime')
        if table.get('partitioned') and partition_column.lower() not in query_lower:
            recommendations.append({
                'priority': 'HIGH',
                'issue': 'Missing partition filter',
                'recommendation': f'Add a WHERE clause on {partition_column} for table {table["name"]}',
                'potential_savings': '80-99% reduction in bytes processed'
            })

    # 3. Check for inefficient JOINs
    if 'cross join' in query_lower:
        recommendations.append({
            'priority': 'CRITICAL',
            'issue': 'CROSS JOIN detected',
            'recommendation': 'Replace CROSS JOIN with proper JOIN conditions',
            'potential_savings': 'Prevents query timeout and excessive costs'
        })

    # 4. Check for string operations in WHERE clauses
    if any(func in query_lower for func in ['upper(', 'lower(', 'substr(']):
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': 'String functions in WHERE clause',
            'recommendation': 'Move string transformations to SELECT or use clustered columns',
            'potential_savings': '20-40% performance improvement'
        })

    # 5. Check for ORDER BY without LIMIT
    if 'order by' in query_lower and 'limit' not in query_lower:
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': 'ORDER BY without LIMIT',
            'recommendation': 'Add LIMIT clause or use window functions for large result sets',
            'potential_savings': '30-60% reduction in processing time'
        })

    return recommendations
# Example usage
query_to_analyze = """
SELECT *
FROM sales_data
WHERE UPPER(customer_name) LIKE '%ACME%'
ORDER BY transaction_date
"""
table_schema = [
{'name': 'sales_data', 'partitioned': True, 'partition_column': 'transaction_date'}
]
recommendations = analyze_query_performance('project-id', query_to_analyze, table_schema)
for rec in recommendations:
    print(f"🚨 {rec['priority']}: {rec['issue']}")
    print(f"   💡 {rec['recommendation']}")
    print(f"   💰 {rec['potential_savings']}")
    print()
Advanced Performance Techniques
8. BI Engine Optimization
-- Create tables optimized for BI Engine
CREATE TABLE `project.dataset.bi_engine_optimized`
(
date_key DATE,
product_category STRING,
region STRING,
revenue DECIMAL(15,2),
units_sold INT64,
customer_count INT64
)
PARTITION BY date_key
CLUSTER BY product_category, region
OPTIONS(
description="Pre-aggregated data optimized for BI Engine",
max_staleness=INTERVAL 1 HOUR
);
-- Enable BI Engine reservation for dashboards
-- Note: Done through Console or API, not SQL
9. Streaming Insert Optimization
# Optimized streaming insert pattern
import uuid

from google.cloud import bigquery

def optimized_streaming_insert(project_id, dataset_id, table_id, rows_to_insert):
    """Efficiently stream data to BigQuery with error handling."""
    client = bigquery.Client(project=project_id)
    table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

    # Batch rows for efficiency (BigQuery recommends roughly 500 rows
    # per request, with a hard limit of 10,000)
    batch_size = 500
    batches = [rows_to_insert[i:i + batch_size]
               for i in range(0, len(rows_to_insert), batch_size)]

    errors = []
    successful_inserts = 0

    for batch in batches:
        # Unique insert IDs let BigQuery de-duplicate retried rows
        row_ids = [str(uuid.uuid4()) for _ in batch]
        result = client.insert_rows_json(table, batch, row_ids=row_ids)
        if result:
            errors.extend(result)
        else:
            successful_inserts += len(batch)

    print(f"Successfully inserted {successful_inserts} rows")
    if errors:
        print(f"Encountered {len(errors)} errors:")
        for error in errors[:5]:  # Show first 5 errors
            print(f"  {error}")
    return successful_inserts, errors
Why It Matters
- Query Performance: Optimised queries run 10-50x faster than poorly written ones
- Cost Control: Proper partitioning and clustering can reduce costs by 80-95%
- User Experience: Fast queries = happy data analysts and better decision making
- Resource Efficiency: Better slot utilisation = more concurrent queries possible
Try This Week
- Audit your slowest queries – Use the INFORMATION_SCHEMA queries above
- Add partitioning – Start with your largest, most-queried table
- Implement clustering – Add clustering keys to frequently filtered columns
- Create a materialized view – For your most common aggregation query
Quick BigQuery Performance Assessment
#!/bin/bash
# BigQuery performance assessment script
PROJECT_ID="your-project-id"
echo "=== BigQuery Performance Assessment ==="
echo
echo "📊 Recent query performance (last 24 hours):"
bq query --use_legacy_sql=false --format=table \
"SELECT
COUNT(*) as total_queries,
AVG(total_slot_ms / 1000) as avg_slot_seconds,
SUM(total_bytes_processed) / POW(10, 12) as total_tb_processed,
COUNTIF(total_slot_ms / 1000 > 1000) as slow_queries
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT\`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
AND state = 'DONE'"
echo
echo "🔍 Large tables without partitioning:"
bq query --use_legacy_sql=false --format=table \
"SELECT
  ts.table_schema,
  ts.table_name,
  ts.total_rows,
  ts.total_logical_bytes / POW(10, 9) as size_gb
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.TABLE_STORAGE\` ts
WHERE ts.total_rows > 1000000
AND NOT EXISTS (
  SELECT 1
  FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.COLUMNS\` c
  WHERE c.table_schema = ts.table_schema
  AND c.table_name = ts.table_name
  AND c.is_partitioning_column = 'YES')
ORDER BY ts.total_logical_bytes DESC
LIMIT 10"
echo
echo "💰 Most expensive queries (by bytes processed):"
bq query --use_legacy_sql=false --format=table \
"SELECT
job_id,
user_email,
total_bytes_processed / POW(10, 12) as tb_processed,
total_slot_ms / 1000 as slot_seconds,
SUBSTR(query, 1, 100) as query_preview
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT\`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 10"
echo
echo "🎯 Optimization recommendations:"
echo "1. Add partitioning to large tables (> 1M rows)"
echo "2. Use clustering on frequently filtered columns"
echo "3. Replace SELECT * with specific column lists"
echo "4. Add partition filters to queries on partitioned tables"
echo "5. Consider materialized views for frequent aggregations"
echo "6. Monitor slot usage patterns for reservation opportunities"
Common Performance Killers
- SELECT * queries: Scanning unnecessary columns wastes slots and money
- Missing partition filters: Querying entire tables instead of relevant partitions
- Poor JOIN order: Large table JOINs without proper filtering
- String operations in WHERE: Functions that prevent partition pruning
- No clustering: Missing opportunities for data locality optimisation
Advanced Optimization Tools
- Query validator: Check queries before running for cost estimation
- Slot monitoring: Track slot usage patterns for reservation planning
- BI Engine: In-memory analytics for sub-second dashboard queries
- Materialized views: Pre-computed aggregations for common queries
Pro Tip: Use BigQuery’s query validator and cost estimator before running expensive queries. The “DRY RUN” option shows exactly how much data will be processed without actually running the query, helping you optimise before you pay.
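As a companion to that tip, the sketch below turns a dry-run byte count into an estimated on-demand cost. The `dry_run` flag on `QueryJobConfig` is the Python client's equivalent of the console's dry-run option; the $5/TB rate (treating a TB as 10^12 bytes for simplicity) is illustrative only.

```python
# Estimate the on-demand cost of a query before running it for real.
# Pricing here is illustrative ($5 per 10^12 bytes); check current rates.
def estimate_cost_usd(bytes_processed, price_per_tb=5.0):
    """Convert a dry-run byte count into an approximate on-demand cost."""
    return bytes_processed / 1e12 * price_per_tb

def dry_run_cost(client, sql):
    """Validate SQL and return (bytes_scanned, estimated_cost) at zero cost."""
    # Deferred import so estimate_cost_usd works without the client installed
    from google.cloud import bigquery
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # dry run: nothing executes
    return job.total_bytes_processed, estimate_cost_usd(job.total_bytes_processed)

# Example (requires credentials and a real table):
#   client = bigquery.Client()
#   scanned, cost = dry_run_cost(client,
#       "SELECT customer_id FROM `project.dataset.optimized_sales_data` "
#       "WHERE transaction_date = CURRENT_DATE()")
#   print(f"{scanned / 1e9:.2f} GB scanned, ~${cost:.2f}")
```

Wiring this into a CI check or a pre-submit hook is a cheap way to catch accidental full-table scans before they hit your bill.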