Your weekly dose of actionable cloud wisdom to start the week right
The Problem
Your BigQuery queries are taking 10 minutes to complete what should be 30-second operations, you’re burning through your monthly slot quota by Tuesday, and your data team is getting frustrated with “query exceeded resource limits” errors. Meanwhile, your BigQuery bill shows you’re paying premium prices for queries that scan entire tables just to return a handful of rows.
The Solution
Optimise BigQuery performance using proper table design, intelligent query patterns, and slot management techniques. Most BigQuery performance issues stem from poor partitioning, inefficient SQL patterns, and misunderstanding how BigQuery’s columnar storage works. Well-optimised queries run 10-50x faster whilst using a fraction of the compute resources.
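Because BigQuery's storage is columnar, a query pays for the columns it reads across the whole scanned range, not the rows it returns. A back-of-envelope sketch of that effect (the column widths and row count below are made up for illustration):

```python
# Rough model of columnar scanning: bytes scanned = row count times the
# combined width of the columns the query touches, across the whole
# (unpartitioned) table. All figures are illustrative, not measured.

COLUMN_BYTES_PER_ROW = {      # hypothetical average widths per row
    "transaction_id": 36,
    "customer_id": 12,
    "product_category": 16,
    "amount": 8,
    "notes": 400,             # one wide free-text column dominates
}

def bytes_scanned(selected_columns, row_count):
    """Bytes a columnar engine reads for a query touching these columns."""
    return row_count * sum(COLUMN_BYTES_PER_ROW[c] for c in selected_columns)

rows = 100_000_000
full_scan = bytes_scanned(COLUMN_BYTES_PER_ROW.keys(), rows)   # SELECT *
narrow_scan = bytes_scanned(["customer_id", "amount"], rows)   # two columns

print(f"SELECT *         : {full_scan / 1e9:.1f} GB")
print(f"Two columns only : {narrow_scan / 1e9:.1f} GB")
print(f"Reduction        : {1 - narrow_scan / full_scan:.0%}")
```

The same row count costs a fraction of the bytes once the wide columns drop out of the SELECT list, which is why trimming columns and pruning partitions compound so well.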
Essential Performance Optimization Techniques:
1. Table Partitioning and Clustering Strategy
-- Create properly partitioned and clustered table
CREATE TABLE `project.dataset.optimized_sales_data`
(
transaction_id STRING,
customer_id STRING,
product_category STRING,
transaction_date DATE,
transaction_timestamp TIMESTAMP,
amount DECIMAL(10,2),
region STRING,
sales_rep_id STRING
)
PARTITION BY transaction_date
CLUSTER BY customer_id, product_category
OPTIONS(
description="Sales data optimized for performance",
partition_expiration_days=365,
require_partition_filter=true
);
-- Load data with optimal performance
INSERT INTO `project.dataset.optimized_sales_data`
SELECT
transaction_id,
customer_id,
product_category,
DATE(transaction_timestamp) as transaction_date,
transaction_timestamp,
amount,
region,
sales_rep_id
FROM `project.dataset.raw_sales_data`
WHERE DATE(transaction_timestamp) >= '2024-01-01';
2. Query Optimization Patterns
-- BAD: Scans entire table (expensive and slow)
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.sales_data`
WHERE customer_id IN ('C001', 'C002', 'C003')
GROUP BY customer_id;
-- GOOD: Uses partition pruning and clustering
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= '2024-01-01' -- Partition filter
AND transaction_date <= '2024-12-31' -- Partition filter
AND customer_id IN ('C001', 'C002', 'C003') -- Clustering benefit
GROUP BY customer_id;
-- ADVANCED: Use approximate aggregation for large datasets
SELECT
product_category,
APPROX_COUNT_DISTINCT(customer_id) as unique_customers,
APPROX_QUANTILES(amount, 100)[OFFSET(50)] as median_amount,
APPROX_QUANTILES(amount, 100)[OFFSET(95)] as p95_amount
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= CURRENT_DATE() - 30
GROUP BY product_category;
3. Efficient JOIN Strategies
-- OPTIMIZED: Use proper JOIN order and filtering
WITH recent_customers AS (
SELECT DISTINCT customer_id
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= CURRENT_DATE() - 7
),
customer_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.tier,
COUNT(s.transaction_id) as recent_transactions,
SUM(s.amount) as recent_spend
FROM `project.dataset.customers` c
INNER JOIN recent_customers rc ON c.customer_id = rc.customer_id
INNER JOIN `project.dataset.optimized_sales_data` s
ON c.customer_id = s.customer_id
AND s.transaction_date >= CURRENT_DATE() - 7 -- Partition filter
GROUP BY c.customer_id, c.customer_name, c.tier
)
SELECT *
FROM customer_metrics
WHERE recent_spend > 1000
ORDER BY recent_spend DESC;
-- Use ARRAY_AGG for denormalization when appropriate
-- (aggregate per category first; nesting SUM inside ARRAY_AGG is not allowed)
SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(product_category, category_spend)
  ) as category_breakdown
FROM (
  SELECT
    customer_id,
    product_category,
    SUM(amount) as category_spend
  FROM `project.dataset.optimized_sales_data`
  WHERE transaction_date >= CURRENT_DATE() - 30
  GROUP BY customer_id, product_category
)
GROUP BY customer_id;
4. Materialized Views for Performance
-- Create materialized view for frequently accessed aggregations
CREATE MATERIALIZED VIEW `project.dataset.daily_sales_summary`
PARTITION BY sales_date
CLUSTER BY region, product_category
AS
SELECT
  transaction_date AS sales_date,  -- partition column must come directly from the base table's partition column
  region,
  product_category,
  COUNT(*) as transaction_count,
  SUM(amount) as total_revenue,
  AVG(amount) as avg_transaction_value,
  APPROX_COUNT_DISTINCT(customer_id) as unique_customers  -- exact COUNT(DISTINCT) is not supported in materialized views
FROM `project.dataset.optimized_sales_data`
WHERE transaction_date >= '2024-01-01'
GROUP BY
  sales_date,
  region,
  product_category;
-- Query the materialized view (much faster)
SELECT
region,
SUM(total_revenue) as monthly_revenue,
AVG(avg_transaction_value) as avg_order_value
FROM `project.dataset.daily_sales_summary`
WHERE sales_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY region
ORDER BY monthly_revenue DESC;
Slot Management and Cost Control
5. Reserved Slots vs On-Demand Optimization
# Python script to analyze slot usage and recommend reservations
import pandas as pd  # required by to_dataframe()
from google.cloud import bigquery

def analyze_slot_usage(project_id, days_back=30):
    """Analyze BigQuery slot usage to optimize reservations."""
    client = bigquery.Client(project=project_id)

    # Query job history for slot analysis
    query = f"""
    SELECT
      creation_time,
      job_id,
      total_slot_ms,
      total_bytes_processed,
      query,
      state
    FROM `{project_id}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days_back} DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
    ORDER BY creation_time DESC
    """
    jobs_df = client.query(query).to_dataframe()

    # Calculate slot usage patterns
    jobs_df['slots_used'] = jobs_df['total_slot_ms'] / 1000  # convert to slot-seconds
    jobs_df['hour'] = jobs_df['creation_time'].dt.hour
    jobs_df['day_of_week'] = jobs_df['creation_time'].dt.dayofweek

    # Hourly slot usage analysis
    hourly_usage = jobs_df.groupby('hour')['slots_used'].agg(['mean', 'max', 'count'])

    # Peak usage analysis
    peak_hours = hourly_usage.nlargest(3, 'mean')
    avg_slots_needed = hourly_usage['mean'].quantile(0.95)

    # Cost analysis -- illustrative rates only; check current BigQuery
    # pricing for your edition and region before acting on the numbers
    on_demand_cost_per_tb = 5.0    # example $ per TB scanned
    reserved_cost_per_slot = 1200  # example $ per year per 100 slots
    monthly_tb_processed = jobs_df['total_bytes_processed'].sum() / (1024**4) * (30 / days_back)
    monthly_on_demand_cost = monthly_tb_processed * on_demand_cost_per_tb

    recommended_slots = int(avg_slots_needed)
    annual_reservation_cost = (recommended_slots / 100) * reserved_cost_per_slot
    monthly_reservation_cost = annual_reservation_cost / 12

    print("=== BigQuery Slot Usage Analysis ===")
    print(f"Analysis period: {days_back} days")
    print(f"Total queries analyzed: {len(jobs_df)}")
    print(f"Average daily queries: {len(jobs_df) / days_back:.0f}")
    print()
    print("=== Slot Usage Patterns ===")
    print("Peak usage hours:")
    for hour, data in peak_hours.iterrows():
        print(f"  {hour:02d}:00 - Avg: {data['mean']:.0f} slots, Max: {data['max']:.0f} slots")
    print()
    print("=== Cost Analysis ===")
    print(f"Monthly TB processed: {monthly_tb_processed:.2f} TB")
    print(f"Current on-demand cost: ${monthly_on_demand_cost:.2f}/month")
    print(f"Recommended reservation: {recommended_slots} slots")
    print(f"Reservation cost: ${monthly_reservation_cost:.2f}/month")

    if monthly_reservation_cost < monthly_on_demand_cost:
        savings = monthly_on_demand_cost - monthly_reservation_cost
        print(f"💰 Potential savings: ${savings:.2f}/month (${savings * 12:.2f}/year)")
    else:
        print("💡 On-demand pricing is more cost-effective for your usage pattern")

    return {
        'recommended_slots': recommended_slots,
        'monthly_savings': max(0, monthly_on_demand_cost - monthly_reservation_cost),
        'usage_patterns': hourly_usage
    }
# Run the analysis
project_id = "your-project-id"
analysis = analyze_slot_usage(project_id)
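The reservation-versus-on-demand decision in the script above boils down to a single comparison. Isolated as a pure helper, using the same illustrative rates as the script (verify both against current BigQuery pricing):

```python
# Breakeven check: slot reservation vs on-demand billing.
# Rates mirror the illustrative values in the analysis script above:
# $5 per TB on-demand, $1200 per year per 100 reserved slots.
def reservation_saves_money(monthly_tb, slots,
                            price_per_tb=5.0,
                            annual_cost_per_100_slots=1200.0):
    """Return True when reserving `slots` undercuts on-demand spend."""
    on_demand_monthly = monthly_tb * price_per_tb
    reservation_monthly = (slots / 100) * annual_cost_per_100_slots / 12
    return reservation_monthly < on_demand_monthly

# 50 TB/month with 200 reserved slots: $250 on-demand vs $200 reservation,
# so the reservation wins; at 1 TB/month the same reservation does not pay off.
print(reservation_saves_money(50, 200))
print(reservation_saves_money(1, 200))
```

The takeaway: reservations pay off only when your scanned volume is steady and high; spiky, low-volume workloads usually stay cheaper on-demand.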
6. Query Performance Monitoring
-- Create performance monitoring dashboard queries
-- Query to identify slow queries
SELECT
job_id,
user_email,
query,
total_slot_ms / 1000 as slot_seconds,
total_bytes_processed / POW(10, 12) as tb_processed,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds,
creation_time
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
AND state = 'DONE'
AND total_slot_ms / 1000 > 1000 -- Queries using more than 1000 slot-seconds
ORDER BY total_slot_ms DESC
LIMIT 20;
-- Query to identify expensive queries
SELECT
DATE(creation_time) as query_date,
user_email,
COUNT(*) as query_count,
SUM(total_bytes_processed) / POW(10, 12) as total_tb_processed,
AVG(total_slot_ms / 1000) as avg_slot_seconds,
SUM(total_slot_ms / 1000) as total_slot_seconds
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
GROUP BY DATE(creation_time), user_email
HAVING total_tb_processed > 0.1 -- Users processing more than 100GB
ORDER BY total_tb_processed DESC;
7. Query Optimization Recommendations Engine
# Python function to analyze and recommend query optimizations
def analyze_query_performance(project_id, query_text, table_schema):
    """Analyze a query and provide optimization recommendations."""
    recommendations = []

    # Check for common anti-patterns
    query_lower = query_text.lower()

    # 1. Check for SELECT *
    if 'select *' in query_lower:
        recommendations.append({
            'priority': 'HIGH',
            'issue': 'SELECT * anti-pattern',
            'recommendation': 'Specify only needed columns to reduce data scanned',
            'potential_savings': '50-90% reduction in bytes processed'
        })

    # 2. Check for missing filters on partition columns
    for table in table_schema:
        partition_column = table.get('partition_column', '_partitiontime')
        if table.get('partitioned') and partition_column.lower() not in query_lower:
            recommendations.append({
                'priority': 'HIGH',
                'issue': 'Missing partition filter',
                'recommendation': f'Add a WHERE clause on {partition_column} for table {table["name"]}',
                'potential_savings': '80-99% reduction in bytes processed'
            })

    # 3. Check for inefficient JOINs
    if 'cross join' in query_lower:
        recommendations.append({
            'priority': 'CRITICAL',
            'issue': 'CROSS JOIN detected',
            'recommendation': 'Replace CROSS JOIN with proper JOIN conditions',
            'potential_savings': 'Prevents query timeout and excessive costs'
        })

    # 4. Check for string operations in WHERE clauses
    if any(func in query_lower for func in ['upper(', 'lower(', 'substr(']):
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': 'String functions in WHERE clause',
            'recommendation': 'Move string transformations to SELECT or use clustered columns',
            'potential_savings': '20-40% performance improvement'
        })

    # 5. Check for ORDER BY without LIMIT
    if 'order by' in query_lower and 'limit' not in query_lower:
        recommendations.append({
            'priority': 'MEDIUM',
            'issue': 'ORDER BY without LIMIT',
            'recommendation': 'Add LIMIT clause or use window functions for large result sets',
            'potential_savings': '30-60% reduction in processing time'
        })

    return recommendations
# Example usage
query_to_analyze = """
SELECT *
FROM sales_data
WHERE UPPER(customer_name) LIKE '%ACME%'
ORDER BY transaction_date
"""
table_schema = [
{'name': 'sales_data', 'partitioned': True, 'partition_column': 'transaction_date'}
]
recommendations = analyze_query_performance('project-id', query_to_analyze, table_schema)
for rec in recommendations:
    print(f"🚨 {rec['priority']}: {rec['issue']}")
    print(f"   💡 {rec['recommendation']}")
    print(f"   💰 {rec['potential_savings']}")
    print()
Advanced Performance Techniques
8. BI Engine Optimization
-- Create tables optimized for BI Engine
CREATE TABLE `project.dataset.bi_engine_optimized`
(
date_key DATE,
product_category STRING,
region STRING,
revenue DECIMAL(15,2),
units_sold INT64,
customer_count INT64
)
PARTITION BY date_key
CLUSTER BY product_category, region
OPTIONS(
description="Pre-aggregated data optimized for BI Engine",
max_staleness=INTERVAL 1 HOUR
);
-- Enable BI Engine reservation for dashboards
-- Note: Done through Console or API, not SQL
9. Streaming Insert Optimization
# Optimized streaming insert pattern
import uuid

from google.cloud import bigquery

def optimized_streaming_insert(project_id, dataset_id, table_id, rows_to_insert):
    """Efficiently stream data to BigQuery with error handling."""
    client = bigquery.Client(project=project_id)
    table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

    # Batch rows for efficiency (BigQuery recommends roughly 500 rows
    # per request, with a hard limit of 10,000)
    batch_size = 500
    batches = [rows_to_insert[i:i + batch_size]
               for i in range(0, len(rows_to_insert), batch_size)]

    errors = []
    successful_inserts = 0

    for batch in batches:
        # Unique insert IDs let BigQuery de-duplicate retried rows
        row_ids = [str(uuid.uuid4()) for _ in batch]
        result = client.insert_rows_json(table, batch, row_ids=row_ids)
        if result:
            errors.extend(result)
        else:
            successful_inserts += len(batch)

    print(f"Successfully inserted {successful_inserts} rows")
    if errors:
        print(f"Encountered {len(errors)} errors:")
        for error in errors[:5]:  # Show first 5 errors
            print(f"  {error}")
    return successful_inserts, errors
Why It Matters
- Query Performance: Optimised queries run 10-50x faster than poorly written ones
- Cost Control: Proper partitioning and clustering can reduce costs by 80-95%
- User Experience: Fast queries = happy data analysts and better decision making
- Resource Efficiency: Better slot utilisation = more concurrent queries possible
Try This Week
- Audit your slowest queries – Use the INFORMATION_SCHEMA queries above
- Add partitioning – Start with your largest, most-queried table
- Implement clustering – Add clustering keys to frequently filtered columns
- Create a materialized view – For your most common aggregation query
Quick BigQuery Performance Assessment
#!/bin/bash
# BigQuery performance assessment script
PROJECT_ID="your-project-id"
echo "=== BigQuery Performance Assessment ==="
echo
echo "📊 Recent query performance (last 24 hours):"
bq query --use_legacy_sql=false --format=table \
"SELECT
COUNT(*) as total_queries,
AVG(total_slot_ms / 1000) as avg_slot_seconds,
SUM(total_bytes_processed) / POW(10, 12) as total_tb_processed,
COUNTIF(total_slot_ms / 1000 > 1000) as slow_queries
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT\`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND job_type = 'QUERY'
AND state = 'DONE'"
echo
echo "🔍 Large tables without partitioning:"
bq query --use_legacy_sql=false --format=table \
"SELECT
  ts.table_schema,
  ts.table_name,
  ts.total_rows,
  ts.total_logical_bytes / POW(10, 9) as size_gb
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.TABLE_STORAGE\` ts
WHERE ts.total_rows > 1000000
AND NOT EXISTS (
  SELECT 1
  FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.COLUMNS\` c
  WHERE c.table_schema = ts.table_schema
  AND c.table_name = ts.table_name
  AND c.is_partitioning_column = 'YES')
ORDER BY ts.total_logical_bytes DESC
LIMIT 10"
echo
echo "💰 Most expensive queries (by bytes processed):"
bq query --use_legacy_sql=false --format=table \
"SELECT
job_id,
user_email,
total_bytes_processed / POW(10, 12) as tb_processed,
total_slot_ms / 1000 as slot_seconds,
SUBSTR(query, 1, 100) as query_preview
FROM \`${PROJECT_ID}.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT\`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 10"
echo
echo "🎯 Optimization recommendations:"
echo "1. Add partitioning to large tables (> 1M rows)"
echo "2. Use clustering on frequently filtered columns"
echo "3. Replace SELECT * with specific column lists"
echo "4. Add partition filters to queries on partitioned tables"
echo "5. Consider materialized views for frequent aggregations"
echo "6. Monitor slot usage patterns for reservation opportunities"
Common Performance Killers
- SELECT * queries: Scanning unnecessary columns wastes slots and money
- Missing partition filters: Querying entire tables instead of relevant partitions
- Poor JOIN order: Large table JOINs without proper filtering
- String operations in WHERE: Functions that prevent partition pruning
- No clustering: Missing opportunities for data locality optimisation
Advanced Optimization Tools
- Query validator: Check queries before running for cost estimation
- Slot monitoring: Track slot usage patterns for reservation planning
- BI Engine: In-memory analytics for sub-second dashboard queries
- Materialized views: Pre-computed aggregations for common queries
Pro Tip: Use BigQuery’s query validator and cost estimator before running expensive queries. The “DRY RUN” option shows exactly how much data will be processed without actually running the query, helping you optimise before you pay.
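As a companion to that tip, the sketch below turns a dry-run byte count into an estimated on-demand cost. The `dry_run` flag on `QueryJobConfig` is the Python client's equivalent of the console's dry-run option; the $5/TB rate (treating a TB as 10^12 bytes for simplicity) is illustrative only.

```python
# Estimate the on-demand cost of a query before running it for real.
# Pricing here is illustrative ($5 per 10^12 bytes); check current rates.
def estimate_cost_usd(bytes_processed, price_per_tb=5.0):
    """Convert a dry-run byte count into an approximate on-demand cost."""
    return bytes_processed / 1e12 * price_per_tb

def dry_run_cost(client, sql):
    """Validate SQL and return (bytes_scanned, estimated_cost) at zero cost."""
    # Deferred import so estimate_cost_usd works without the client installed
    from google.cloud import bigquery
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # dry run: nothing executes
    return job.total_bytes_processed, estimate_cost_usd(job.total_bytes_processed)

# Example (requires credentials and a real table):
#   client = bigquery.Client()
#   scanned, cost = dry_run_cost(client,
#       "SELECT customer_id FROM `project.dataset.optimized_sales_data` "
#       "WHERE transaction_date = CURRENT_DATE()")
#   print(f"{scanned / 1e9:.2f} GB scanned, ~${cost:.2f}")
```

Wiring this into a CI check or a pre-submit hook is a cheap way to catch accidental full-table scans before they hit your bill.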