BigQuery SQL Interview Questions & Answers

Biswanath Giri
4 min read · Oct 5, 2024


Q1:- How do you generate a report that combines data from multiple sources, including sales data, customer demographics, and product information stored in different tables? Write a SQL query that efficiently joins these tables and explain your approach to ensure optimal performance.

Answer:

SELECT
  s.SaleID,
  s.SaleDate,
  s.Quantity,
  s.CustomerID,
  c.CustomerName,
  c.City,
  c.State,
  p.ProductName,
  p.Category,
  p.Price
FROM
  Sales s -- Start with the main fact table you want to retrieve data from
JOIN
  Customers c ON s.CustomerID = c.CustomerID -- Join with Customers on the shared key
JOIN
  Products p ON s.ProductID = p.ProductID -- Join with Products on the shared key
WHERE
  s.SaleDate BETWEEN '2023-01-01' AND '2023-12-31'; -- Example filter for a specific date range

Approach: select only the columns the report needs, filter on the date range so BigQuery scans less data (ideally SaleDate is the partitioning column), and join on the shared keys (CustomerID, ProductID). Clustering the tables on those join keys further reduces the data shuffled between workers.

Q2:- How do you calculate the moving average of sales over the last three months for each product? Write a SQL query using window functions to achieve this and explain how you would ensure accuracy in your calculations.

Answer:

-- Assuming your table is named "sales" and has columns like "product_id", "sale_date", and "sales_amount"
SELECT
  product_id,
  sale_date,
  sales_amount,
  -- Average of the current row and the two preceding rows per product
  AVG(sales_amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_date ASC
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_average
FROM
  sales
ORDER BY
  product_id, sale_date;
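On accuracy: `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` only means "the last three months" if the table holds exactly one row per product per month. If sales are recorded per transaction, a sketch like the following (reusing the assumed `sales` schema above) first aggregates to monthly granularity so the frame reliably covers three calendar months:

```sql
-- Sketch: roll transactions up to one row per product per month first
WITH monthly_sales AS (
  SELECT
    product_id,
    DATE_TRUNC(sale_date, MONTH) AS sale_month,
    SUM(sales_amount) AS monthly_amount
  FROM sales
  GROUP BY product_id, sale_month
)
SELECT
  product_id,
  sale_month,
  monthly_amount,
  AVG(monthly_amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_average_3mo
FROM monthly_sales
ORDER BY product_id, sale_month;
```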

Q3:
You are working with a large dataset in BigQuery that contains billions of rows. The query performance has degraded, and you need to optimize it. What steps will you take to improve the performance?

Answer:

  • Partitioning & Clustering: Use table partitioning (e.g., time-based) and clustering to reduce the amount of data scanned during the query.
  • Query Pruning: Ensure the query is scanning only necessary columns by using SELECT with specific fields rather than SELECT *.
  • Caching: Utilize BigQuery’s query results cache for repeated queries to avoid unnecessary computations.
  • Materialized Views: Create materialized views for frequent aggregations or queries that don’t need real-time data.
  • Denormalization: Consider flattening highly normalized data by using repeated fields or nested records to reduce the need for multiple JOINs.
  • Query Scheduling: For batch data, run complex queries at off-peak hours to avoid performance hits during peak times.
  • Using Approximate Functions: Leverage approximate aggregations (APPROX_COUNT_DISTINCT, etc.) for faster results on large datasets when 100% accuracy isn't necessary.
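The first two points can be sketched with BigQuery DDL. This is a hypothetical table definition (names and column types are assumed, not from the original post): time-based partitioning prunes whole partitions when queries filter on the timestamp, and clustering sorts data within each partition by `product_id` so filters and joins on that key scan fewer blocks.

```sql
-- Hypothetical sales table: partitioned by day, clustered by product_id
CREATE TABLE my_dataset.sales (
  sale_timestamp TIMESTAMP,
  product_id     STRING,
  customer_id    STRING,
  sales_amount   NUMERIC
)
PARTITION BY DATE(sale_timestamp)  -- queries filtering on sale date scan only matching partitions
CLUSTER BY product_id;             -- co-locates rows with the same product_id within each partition
```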

Question:
You have a dataset in BigQuery where some fields contain NULL values. How would you handle these NULL values in your queries?

Answer:

  • Handling NULLs in Aggregations: Use functions like IFNULL() or COALESCE() to replace NULLs with a default value.

SELECT COALESCE(sales_amount, 0) AS total_sales
FROM orders;

  • Filtering NULLs: To exclude NULL values from query results, use a WHERE clause.

SELECT *
FROM users
WHERE email IS NOT NULL;

  • Counting NULL and Non-NULL Values: Use conditional expressions to count both NULL and non-NULL occurrences.

SELECT
  COUNTIF(email IS NULL) AS null_count,
  COUNTIF(email IS NOT NULL) AS not_null_count
FROM users;

Q4:-
You need to join two large tables in BigQuery, orders and customers, with billions of rows each. What strategies would you use to ensure the query performs well?

Answer:

  • Use Partitioned/Clustered Tables: Cluster both orders and customers on the join key (e.g., customer_id), and partition on a date column where applicable, so less data is scanned.
  • Reduce Scanned Data: Select only the necessary columns and filter each table before joining to minimize the volume of data being processed.
  • Avoid Cross Joins: Ensure the join condition is properly set up (e.g., ON orders.customer_id = customers.customer_id); a missing or incorrect condition produces a cross join, which can dramatically increase query time.
  • Use the WITH Clause (Common Table Expressions): If there are multiple joins or subqueries, break the query into smaller, more readable parts using the WITH clause.
  • Broadcast Joins: If one table is significantly smaller, BigQuery can broadcast it to every worker instead of shuffling both tables; pre-filtering one side so it stays small encourages this join strategy.
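A minimal sketch combining these strategies (the schema and filter values are assumed for illustration):

```sql
-- Assumed schema: orders(customer_id, order_date, amount), customers(customer_id, region, name)
WITH recent_orders AS (
  SELECT customer_id, order_date, amount
  FROM orders
  WHERE order_date >= DATE '2024-01-01'   -- filter before joining to cut scanned data
),
active_customers AS (
  SELECT customer_id, name
  FROM customers
  WHERE region = 'EMEA'                   -- keep one side small to encourage a broadcast join
)
SELECT
  o.customer_id,
  c.name,
  o.order_date,
  o.amount
FROM recent_orders o
JOIN active_customers c
  ON o.customer_id = c.customer_id;       -- explicit join key avoids a cross join
```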

Q5:-
You need to find the top 3 highest sales transactions per customer in the past month. How would you approach this in BigQuery?

Answer:

  • Use Window Functions: Use the ROW_NUMBER() function over a partitioned set to rank transactions per customer.
WITH ranked_transactions AS (
  SELECT
    customer_id,
    transaction_id,
    amount,
    -- Rank each customer's transactions by amount, highest first
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS sales_rank
  FROM transactions
  WHERE transaction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MONTH)
)
SELECT
  customer_id,
  transaction_id,
  amount
FROM ranked_transactions
WHERE sales_rank <= 3;
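BigQuery also supports the QUALIFY clause, which filters on a window function's result directly and expresses the same logic without an explicit CTE (same assumed `transactions` schema):

```sql
SELECT customer_id, transaction_id, amount
FROM transactions
WHERE transaction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MONTH)
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) <= 3;
```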

Q: Given an orders table with columns order_id, customer_id, and order_date, write a query to find customers who have placed orders in both 2023 and 2024.

SELECT customer_id
FROM orders
WHERE EXTRACT(YEAR FROM order_date) IN (2023, 2024)
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(YEAR FROM order_date)) = 2;

Q:- Given an orders table with columns order_id, customer_id, and order_amount, write a query to find the highest and lowest order amount for each customer.

WITH CustomerOrderSummary AS (
  SELECT
    customer_id,
    MAX(order_amount) AS max_order_amount,
    MIN(order_amount) AS min_order_amount
  FROM
    orders
  GROUP BY
    customer_id
)
SELECT * FROM CustomerOrderSummary;

Written by Biswanath Giri

Cloud & AI Architect | Empowering People in Cloud Computing, Google Cloud AI/ML, and Google Workspace | Enabling Businesses on Their Cloud Journey
