14 Advanced Syntax¶
Window Functions¶
A faster and more readable alternative to self-joins:
```sql
SELECT
  function(col_a) OVER w AS output
FROM table
WINDOW
  w AS (PARTITION BY col_b ORDER BY col_c ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
QUALIFY output > 1
```
Note - use `ORDER BY NULL` if ordering does not matter.
Three types:

- Navigation functions
- Numbering functions
- Aggregate functions
Running totals
```sql
-- risky default: with an ORDER BY but no explicit frame, the default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which lumps peer rows
-- (ties on order_id) together
SUM(gmv_amount_lc) OVER (PARTITION BY account_id ORDER BY order_id)
SUM(gmv_amount_lc) OVER (PARTITION BY account_id ORDER BY order_id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- equivalent

-- better: an explicit ROWS frame gives a true row-by-row running total
SUM(gmv_amount_lc) OVER (PARTITION BY account_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```
Joins¶
- Semi join
- Anti join, expressed as any of:
    - `LEFT JOIN b ... WHERE b.account_id IS NULL`
    - `NOT EXISTS`
    - `NOT IN`
- Correlated subquery
- `EXCEPT ALL`
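As a sketch, the three anti-join forms side by side (the `orders` and `refunds` tables and the `account_id` key are hypothetical):

```sql
-- 1. LEFT JOIN ... IS NULL: keep rows with no match on the right
SELECT a.*
FROM orders a
LEFT JOIN refunds b USING (account_id)
WHERE b.account_id IS NULL;

-- 2. NOT EXISTS: correlated subquery, NULL-safe
SELECT a.*
FROM orders a
WHERE NOT EXISTS (SELECT 1 FROM refunds b WHERE b.account_id = a.account_id);

-- 3. NOT IN: beware - returns no rows at all if the subquery yields any NULL,
--    so filter NULLs explicitly
SELECT a.*
FROM orders a
WHERE a.account_id NOT IN (SELECT b.account_id FROM refunds b WHERE b.account_id IS NOT NULL);
```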
Approx Functions¶
`APPROX_COUNT_DISTINCT` (backed by HyperLogLog++), etc.
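A minimal sketch (BigQuery syntax; the `project.dataset.orders` table is hypothetical):

```sql
-- trades a small, statistical error for far lower memory use than COUNT(DISTINCT ...)
SELECT APPROX_COUNT_DISTINCT(account_id) AS approx_accounts
FROM project.dataset.orders;
```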
Validation¶
- Fan-out?
    - Ideally, all of these should be equal:
        - `COUNT(*)`
        - `COUNT(primary_key)` - use this one for best readability, and check it against the others
        - `COUNT(DISTINCT primary_key)`
        - `COUNT(1)`
- Numbers?
    - Decompose binary outcomes into components and verify that they add up
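For the fan-out check, a sketch against a hypothetical `project.dataset.orders` table with primary key `order_id`:

```sql
-- run on a join result: if the counts diverge, the join fanned out
-- (or the key is not unique, or it contains NULLs)
SELECT
  COUNT(*)                 AS n_rows,
  COUNT(order_id)          AS n_non_null_keys,
  COUNT(DISTINCT order_id) AS n_distinct_keys,
  COUNT(1)                 AS n_ones
FROM project.dataset.orders;
```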
Performance¶
| Topic | Advice | Wrong | Correct |
|---|---|---|---|
| Filters | Order the expressions in your WHERE clause with the most selective expression first | | |
| Filters | Better to specify that dates are dates | `'2025-01-01'` | `DATE '2025-01-01'` |
| Filters | Use partitioning and clustering | | |
| Filters | Do not apply functions to partition columns in filters | `DATE_TRUNC(order_date, MONTH) BETWEEN DATE '2025-01-01' AND DATE '2025-12-01'` | `order_date BETWEEN DATE '2025-01-01' AND LAST_DAY(DATE '2025-12-01', MONTH)` |
| Aggregation | Filter before aggregation | `SELECT * FROM (SELECT event_date, COUNT(*) AS total FROM project.dataset.events GROUP BY event_date) WHERE event_date >= '2025-11-01';` | `SELECT event_date, COUNT(*) AS total FROM project.dataset.events WHERE event_date >= '2025-11-01' GROUP BY event_date;` |
| Aggregation | Aggregate before joins, but only if the amount of data being joined is drastically reduced and the tables are aggregated to the same level (i.e., there is only one row for every join key value) | | |
| Aggregation | Aggregate before calculated fields | | |
| Joins | Prefer broadcast joins over hash joins; order the tables: 1. largest, 2. smallest, 3. 2nd largest, 4. 3rd largest, ... | | |
| Joins | Cluster on join keys: when you cluster a table on the key used to join, the data is already co-located, which makes it easier for workers to split the data into the necessary partitions within the in-memory shuffle | | |
| CTE/View | Materialize a CTE/view if it is used multiple times. In some engines, regular CTEs and views are materialized - for example, DuckDB's query optimizer automatically decides, based on heuristics, whether to materialize a CTE or inline it into the main query. In other engines, such as BigQuery, regular CTEs and views behave like macros (textual substitution/inline expansion at query time): they are not materialized even if referenced multiple times, so a non-recursive CTE referenced in multiple places is executed once for each reference. BigQuery only materializes the results of recursive CTEs. | `WITH cte AS (SELECT a.name, b.state, RAND() AS random FROM bigquery-public-data.usa_names.usa_1910_2013 a LEFT JOIN bigquery-public-data.usa_names.usa_1910_2013 b USING (number) LIMIT 10) SELECT * FROM cte UNION ALL SELECT * FROM cte UNION ALL SELECT * FROM cte UNION ALL SELECT * FROM cte UNION ALL SELECT * FROM cte` | `WITH RECURSIVE cte AS (SELECT a.name, b.state, RAND() AS random FROM bigquery-public-data.usa_names.usa_1910_2013 a LEFT JOIN bigquery-public-data.usa_names.usa_1910_2013 b USING (number) LIMIT 10), my_cte_materialized AS (SELECT * FROM dulcet-antler-481417-p8.test.view_name UNION ALL (SELECT * FROM my_cte_materialized WHERE FALSE)) SELECT * FROM my_cte_materialized UNION ALL SELECT * FROM my_cte_materialized UNION ALL SELECT * FROM my_cte_materialized UNION ALL SELECT * FROM my_cte_materialized UNION ALL SELECT * FROM my_cte_materialized` |
| First or last record only | Calculating the first or last record in a subset of your data with `ROW_NUMBER()` can fail with "Resources Exceeded" errors if there are too many elements to ORDER BY in a single partition. Instead, try `ARRAY_AGG()`/`MIN_BY()`, which run more efficiently because the ORDER BY is allowed to drop everything except the top record in each group. There is no join here, so GROUP BY is faster than a window function. | `SELECT a.country_code, a.account_id, a.order_id FROM base a QUALIFY ROW_NUMBER() OVER (PARTITION BY a.country_code, a.account_id ORDER BY a.order_time ASC) = 1` | `SELECT a.country_code, a.account_id, MIN_BY(a.order_id, a.order_time) /* single column */, MIN_BY(STRUCT(a.order_id, a.is_successful), a.order_time) /* multiple columns */, MIN_BY(a, a.order_time).* EXCEPT (country_code, account_id) /* all columns */ FROM base a GROUP BY ALL` |
| First or last record, same granularity | Prefer a window function over a join back to the source; this does not perform a sort operation | Join | `SELECT a.country_code, a.account_id, a.order_id, FIRST_VALUE(a.country_code) OVER w AS acq_country_code, FIRST_VALUE(a.order_id) OVER w AS acq_order_id /* or LAST_VALUE, or NTH_VALUE */ FROM base a WINDOW w AS (PARTITION BY a.country_code, a.account_id ORDER BY a.order_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)` |
| Nth value only | ARRAY_AGG keeps only the top records; however, for k > 10, ROW_NUMBER is faster, as the overhead of maintaining the array grows | `SELECT a.state, a.number FROM bigquery-public-data.usa_names.usa_1910_2013 a QUALIFY ROW_NUMBER() OVER (PARTITION BY a.name ORDER BY a.number ASC) = 1000;` | `SELECT ARRAY_AGG(a ORDER BY a.number DESC LIMIT 5)[OFFSET(4)] AS top_k FROM bigquery-public-data.usa_names.usa_1910_2013 a GROUP BY a.name HAVING COUNT(a.number) >= 5` |
| Top k only | As above: for k > 10, ROW_NUMBER is faster, as the overhead of maintaining the array grows | `SELECT a.state, a.number FROM bigquery-public-data.usa_names.usa_1910_2013 a QUALIFY ROW_NUMBER() OVER (PARTITION BY a.name ORDER BY a.number ASC) <= 1000;` | `WITH base AS (SELECT ARRAY_AGG(a ORDER BY a.number DESC LIMIT 5) AS top_k FROM bigquery-public-data.usa_names.usa_1910_2013 a GROUP BY a.name) SELECT top_k.* FROM base, UNNEST(top_k) AS top_k` |
| String comparison | When possible, use LIKE instead of REGEXP_CONTAINS | `REGEXP_CONTAINS` | `LIKE` |
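As a sketch of the clustering advice above (BigQuery DDL; all table and column names are hypothetical):

```sql
-- rebuild the fact table clustered on its join key; subsequent joins on
-- account_id start from co-located data and shuffle less
CREATE TABLE project.dataset.orders_by_account
CLUSTER BY account_id
AS SELECT * FROM project.dataset.orders;
```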
Materialized CTE/View¶

Trade-off

| | Materialization | Macro |
|---|---|---|
| Lower elapsed time | ✅ | |
| Lower slot-time | ✅ | |
| Lower data reading | ✅ | |
| Lower data shuffled | ✅ | |
De-Duplicating Data¶
In order of performance:

1. `QUALIFY`
    - `QUALIFY ROW_NUMBER() OVER (PARTITION BY foi.order_id ORDER BY NULL) = 1`
    - Fastest for small datasets; slowest for large datasets, due to the implicit and unavoidable sort
2. `GROUP BY ALL`
    - Lower "Compute" time than the ANY_VALUE version (hashing is cheaper than aggregate-function logic)
    - Lower "Memory" peak than the ANY_VALUE version (no ANY_VALUE state buffers to maintain)
    - The engine simply scans the columns and returns the unique rows
3. `GROUP BY` the primary key and `ANY_VALUE` of the other fields
    - ANY_VALUE requires an extra step: the engine must create groups for the key, then execute an aggregate function (ANY_VALUE) on the other columns for every group
    - The only time ANY_VALUE is faster is when the columns you are "picking" are extremely wide strings (e.g., a 10 KB JSON blob): hashing such a column is expensive, so ANY_VALUE may be cheaper
4. `DISTINCT`
    - Hard to read

Note that GROUP BY is distributed.
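The first three approaches, sketched against a hypothetical `orders` table with primary key `order_id`:

```sql
-- 1. QUALIFY: keep one arbitrary row per order_id
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY NULL) = 1;

-- 2. GROUP BY ALL: only collapses rows that are exact duplicates across all columns
SELECT order_id, account_id, gmv_amount_lc
FROM orders
GROUP BY ALL;

-- 3. GROUP BY the primary key + ANY_VALUE of the other fields
SELECT
  order_id,
  ANY_VALUE(account_id)    AS account_id,
  ANY_VALUE(gmv_amount_lc) AS gmv_amount_lc
FROM orders
GROUP BY order_id;
```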
