BigQuery SQL Interview Questions & Answers

Biswanath Giri
4 min read · Oct 5, 2024


Q1:- How do you generate a report that combines data from multiple sources, including sales data, customer demographics, and product information stored in different tables? Write a SQL query that efficiently joins these tables and explain your approach to ensure optimal performance.

Answer:

SELECT
  s.SaleID,
  s.SaleDate,
  s.Quantity,
  s.CustomerID,
  c.CustomerName,
  c.City,
  c.State,
  p.ProductName,
  p.Category,
  p.Price
FROM
  Sales s -- Start with the main fact table you want to retrieve data from
JOIN
  Customers c ON s.CustomerID = c.CustomerID -- Join with Customers on the shared key
JOIN
  Products p ON s.ProductID = p.ProductID -- Join with Products on the shared key
WHERE
  s.SaleDate BETWEEN '2023-01-01' AND '2023-12-31'; -- Example filter for a specific date range

Approach: select only the columns the report needs, filter on the date range so BigQuery scans less data (ideally SaleDate is the partitioning column), and join on the shared keys (CustomerID, ProductID). Clustering the tables on those join keys further reduces the data shuffled between workers.

Q2:- How do you calculate the moving average of sales over the last three months for each product? Write a SQL query using window functions to achieve this and explain how you would ensure accuracy in your calculations.

Answer:

-- Assuming your table is named "sales" and has columns like "product_id", "sale_date", and "sales_amount"
SELECT
  product_id,
  sale_date,
  sales_amount,
  -- Average of the current row and the two preceding rows per product
  AVG(sales_amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_date ASC
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_average
FROM
  sales
ORDER BY
  product_id, sale_date;
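On accuracy: `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` only means "the last three months" if the table holds exactly one row per product per month. If sales are recorded per transaction, a sketch like the following (reusing the assumed `sales` schema above) first aggregates to monthly granularity so the frame reliably covers three calendar months:

```sql
-- Sketch: roll transactions up to one row per product per month first
WITH monthly_sales AS (
  SELECT
    product_id,
    DATE_TRUNC(sale_date, MONTH) AS sale_month,
    SUM(sales_amount) AS monthly_amount
  FROM sales
  GROUP BY product_id, sale_month
)
SELECT
  product_id,
  sale_month,
  monthly_amount,
  AVG(monthly_amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_average_3mo
FROM monthly_sales
ORDER BY product_id, sale_month;
```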

Q3:
You are working with a large dataset in BigQuery that contains billions of rows. The query performance has degraded, and you need to optimize it. What steps will you take to improve the performance?

Answer:

  • Partitioning & Clustering: Use table partitioning (e.g., time-based) and clustering to reduce the amount of data scanned during the query.
  • Query Pruning: Ensure the query is scanning only necessary columns by using SELECT with specific fields rather than SELECT *.
  • Caching: Utilize BigQuery’s query results cache for repeated queries to avoid unnecessary computations.
  • Materialized Views: Create materialized views for frequent aggregations or queries that don’t need real-time data.
  • Denormalization: Consider flattening highly normalized data by using repeated fields or nested records to reduce the need for multiple JOINs.
  • Query Scheduling: For batch data, run complex queries at off-peak hours to avoid performance hits during peak times.
  • Using Approximate Functions: Leverage approximate aggregations (APPROX_COUNT_DISTINCT, etc.) for faster results on large datasets when 100% accuracy isn't necessary.
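The first two points can be sketched with BigQuery DDL. This is a hypothetical table definition (names and column types are assumed, not from the original post): time-based partitioning prunes whole partitions when queries filter on the timestamp, and clustering sorts data within each partition by `product_id` so filters and joins on that key scan fewer blocks.

```sql
-- Hypothetical sales table: partitioned by day, clustered by product_id
CREATE TABLE my_dataset.sales (
  sale_timestamp TIMESTAMP,
  product_id     STRING,
  customer_id    STRING,
  sales_amount   NUMERIC
)
PARTITION BY DATE(sale_timestamp)  -- queries filtering on sale date scan only matching partitions
CLUSTER BY product_id;             -- co-locates rows with the same product_id within each partition
```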

Question:
You have a dataset in BigQuery where some fields contain NULL values. How would you handle these NULL values in your queries?

Answer:

  • Handling NULLs in Aggregations: Use functions like IFNULL() or COALESCE() to replace NULLs with a default value.

SELECT COALESCE(sales_amount, 0) AS total_sales
FROM orders;

  • Filtering NULLs: To exclude NULL values from query results, use a WHERE clause.

SELECT *
FROM users
WHERE email IS NOT NULL;

  • Counting NULL and Non-NULL Values: Use conditional expressions to count both NULL and non-NULL occurrences.

SELECT
  COUNTIF(email IS NULL) AS null_count,
  COUNTIF(email IS NOT NULL) AS not_null_count
FROM users;

Q4:-
You need to join two large tables in BigQuery, orders and customers, with billions of rows each. What strategies would you use to ensure the query performs well?

Answer:

  • Use Partitioned/Clustered Tables: Cluster both orders and customers on the join key (e.g., customer_id), and partition on a date column where applicable, so less data is scanned.
  • Reduce Scanned Data: Select only the necessary columns and filter each table before joining to minimize the volume of data being processed.
  • Avoid Cross Joins: Ensure the join condition is properly set up (e.g., ON orders.customer_id = customers.customer_id); a missing or incorrect condition produces a cross join, which can dramatically increase query time.
  • Use the WITH Clause (Common Table Expressions): If there are multiple joins or subqueries, break the query into smaller, more readable parts using the WITH clause.
  • Broadcast Joins: If one table is significantly smaller, BigQuery can broadcast it to every worker instead of shuffling both tables; pre-filtering one side so it stays small encourages this join strategy.
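A minimal sketch combining these strategies (the schema and filter values are assumed for illustration):

```sql
-- Assumed schema: orders(customer_id, order_date, amount), customers(customer_id, region, name)
WITH recent_orders AS (
  SELECT customer_id, order_date, amount
  FROM orders
  WHERE order_date >= DATE '2024-01-01'   -- filter before joining to cut scanned data
),
active_customers AS (
  SELECT customer_id, name
  FROM customers
  WHERE region = 'EMEA'                   -- keep one side small to encourage a broadcast join
)
SELECT
  o.customer_id,
  c.name,
  o.order_date,
  o.amount
FROM recent_orders o
JOIN active_customers c
  ON o.customer_id = c.customer_id;       -- explicit join key avoids a cross join
```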

Q5:-
You need to find the top 3 highest sales transactions per customer in the past month. How would you approach this in BigQuery?

Answer:

  • Use Window Functions: Use the ROW_NUMBER() function over a partitioned set to rank transactions per customer.
WITH ranked_transactions AS (
  SELECT
    customer_id,
    transaction_id,
    amount,
    -- Rank each customer's transactions by amount, highest first
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS sales_rank
  FROM transactions
  WHERE transaction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MONTH)
)
SELECT
  customer_id,
  transaction_id,
  amount
FROM ranked_transactions
WHERE sales_rank <= 3;
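BigQuery also supports the QUALIFY clause, which filters on a window function's result directly and expresses the same logic without an explicit CTE (same assumed `transactions` schema):

```sql
SELECT customer_id, transaction_id, amount
FROM transactions
WHERE transaction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MONTH)
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) <= 3;
```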

Q: Given an orders table with columns order_id, customer_id, and order_date, write a query to find customers who have placed orders in both 2023 and 2024.

SELECT customer_id
FROM orders
WHERE EXTRACT(YEAR FROM order_date) IN (2023, 2024)
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(YEAR FROM order_date)) = 2;

Q:- Given an orders table with columns order_id, customer_id, and order_amount, write a query to find the highest and lowest order amount for each customer.

WITH CustomerOrderSummary AS (
  SELECT
    customer_id,
    MAX(order_amount) AS max_order_amount,
    MIN(order_amount) AS min_order_amount
  FROM
    orders
  GROUP BY
    customer_id
)
SELECT * FROM CustomerOrderSummary;

Written by Biswanath Giri

Cloud & AI Architect | Empowering People in Cloud Computing, Google Cloud AI/ML, and Google Workspace | Enabling Businesses on Their Cloud Journey
