
14 Advanced

Syntax

Window Functions

Faster and more readable alternative to self-joins

SELECT
    function(col_a) over w AS output
FROM table
WINDOW
    w AS ( PARTITION BY col_b ORDER BY col_c ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )
QUALIFY output > 1

Note - Use ORDER BY NULL if ordering does not matter

3 types:

  • Navigation functions
  • Numbering functions
  • Aggregate functions
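A sketch with one function of each type over the same window (table and column names are illustrative):

```sql
SELECT
  order_id,
  LEAD(order_id)     OVER w AS next_order_id,  -- navigation function
  ROW_NUMBER()       OVER w AS order_seq,      -- numbering function
  SUM(gmv_amount_lc) OVER w AS running_gmv     -- aggregate function
FROM orders
WINDOW w AS (PARTITION BY account_id ORDER BY order_time)
```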

Running totals

-- risky default: with an ORDER BY, the implicit frame is RANGE, which includes all peer rows that tie on the ORDER BY value
SUM(gmv_amount_lc) OVER( PARTITION BY account_id ORDER BY order_id )
SUM(gmv_amount_lc) OVER( PARTITION BY account_id ORDER BY order_id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- equivalent

-- better
SUM(gmv_amount_lc) OVER( PARTITION BY account_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

Joins

  • Semi join
  • Anti join
    • LEFT JOIN b WHERE b.account_id IS NULL
    • NOT EXISTS
    • NOT IN (beware: if the subquery returns any NULL, NOT IN matches no rows)
  • Correlated subquery
  • Except all
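Sketches of the main patterns, assuming hypothetical accounts and orders tables keyed by account_id:

```sql
-- semi join: accounts that have at least one order
SELECT a.*
FROM accounts a
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.account_id = a.account_id);

-- anti join, LEFT JOIN form: accounts with no orders
SELECT a.*
FROM accounts a
LEFT JOIN orders b USING (account_id)
WHERE b.account_id IS NULL;

-- anti join, NOT EXISTS form
SELECT a.*
FROM accounts a
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.account_id = a.account_id);

-- anti join, NOT IN form: filter NULLs out of the subquery,
-- otherwise NOT IN matches no rows
SELECT a.*
FROM accounts a
WHERE a.account_id NOT IN (
  SELECT o.account_id FROM orders o WHERE o.account_id IS NOT NULL
);
```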

Approx Functions

  • APPROX_COUNT_DISTINCT
  • HYPERLOGLOG++, etc
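For example, in BigQuery (table name illustrative), APPROX_COUNT_DISTINCT trades exactness for far less memory and shuffle, and HLL_COUNT sketches can be pre-aggregated and merged:

```sql
-- exact: every distinct value must be shuffled and counted
SELECT COUNT(DISTINCT account_id) AS exact_uniques
FROM project.dataset.orders;

-- approximate: HyperLogLog++ sketch, small fixed memory per group
SELECT APPROX_COUNT_DISTINCT(account_id) AS approx_uniques
FROM project.dataset.orders;

-- sketches can be stored per day and merged later
SELECT HLL_COUNT.MERGE(sketch) AS approx_uniques
FROM (
  SELECT HLL_COUNT.INIT(account_id) AS sketch
  FROM project.dataset.orders
  GROUP BY order_date
);
```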

Validation

  • Fan-out?
    • Ideally, all these should be equal
      • COUNT(*)
      • COUNT(primary_key)
        • Use this for best readability, and check with the others
      • COUNT(DISTINCT primary_key)
      • COUNT(1)
  • Numbers?
    • Decompose binary outcomes into components and verify if they add up
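A sketch of the fan-out check (table name illustrative); run it after a join to confirm the primary key is still unique and non-NULL:

```sql
SELECT
  COUNT(*)                    AS cnt_star,     -- all rows
  COUNT(primary_key)          AS cnt_pk,       -- drops NULL keys
  COUNT(DISTINCT primary_key) AS cnt_distinct, -- drops duplicate keys
  COUNT(1)                    AS cnt_one       -- same as COUNT(*)
FROM project.dataset.joined_result
-- no fan-out if cnt_star = cnt_pk = cnt_distinct = cnt_one
```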

Performance

Filter ordering
  • Order the expressions in your WHERE clause with the most selective expression first

Date literals
  • Specify that dates are dates
  • Wrong: '2025-01-01'
  • Correct: DATE '2025-01-01'

Partitioning and clustering
  • Use partitioning and clustering

Functions on partition columns
  • Do not apply functions to partition columns in filters
  • Wrong: DATE_TRUNC(order_date, MONTH) BETWEEN DATE '2025-01-01' AND DATE '2025-12-01'
  • Correct: order_date BETWEEN DATE '2025-01-01' AND LAST_DAY(DATE '2025-12-01', MONTH)
Aggregation
  • Filter before aggregation

-- wrong: aggregates everything, then filters
SELECT *
FROM (
  SELECT event_date, COUNT(*) AS total
  FROM project.dataset.events
  GROUP BY event_date
)
WHERE event_date >= '2025-11-01';

-- correct: filters first, then aggregates
SELECT event_date, COUNT(*) AS total
FROM project.dataset.events
WHERE event_date >= '2025-11-01'
GROUP BY event_date;
Aggregate before joins
  • Only if the amount of data being joined is drastically reduced and the tables are aggregated to the same level (i.e., if there is only one row for every join key value)

Aggregate before calculated fields
Joins
  • Prefer broadcast joins over hash joins
  • Table ordering:
    1. Largest
    2. ↳ Smallest
    3. ↳ 2nd largest
    4. ↳ 3rd largest
    5. ...

Join on cluster keys / cluster on join keys
  • When you cluster a table on the key that is used to join, the data is already co-located, which makes it easier for workers to split the data into the necessary partitions within the in-memory shuffle.
CTE/View
  • Materialize a CTE/view if it is used multiple times

In some database engines, regular CTEs and views are materialized
  • e.g., DuckDB's query optimizer automatically decides whether to materialize a CTE or inline it into the main query, based on heuristics

In other engines (e.g., BigQuery), regular CTEs and views are like macros
  • textual substitution / inline expansion of code at query time
  • they are not materialized even if they are referenced multiple times
  • if a non-recursive CTE is referenced in multiple places in a query, the CTE is executed once for each reference
  • BigQuery only materializes the results of recursive CTEs
-- macro: cte is executed once per reference (5 times here)
WITH
cte AS (
  SELECT
    a.name,
    b.state,
    RAND() AS random,
  FROM `bigquery-public-data.usa_names.usa_1910_2013` a
  LEFT JOIN `bigquery-public-data.usa_names.usa_1910_2013` b
    USING (number)
  LIMIT 10
)

SELECT * FROM cte
UNION ALL SELECT * FROM cte
UNION ALL SELECT * FROM cte
UNION ALL SELECT * FROM cte
UNION ALL SELECT * FROM cte
-- materialized: the WHERE FALSE branch makes the CTE recursive,
-- so BigQuery materializes it and evaluates it only once
WITH RECURSIVE
cte AS (
  SELECT
    a.name,
    b.state,
    RAND() AS random,
  FROM `bigquery-public-data.usa_names.usa_1910_2013` a
  LEFT JOIN `bigquery-public-data.usa_names.usa_1910_2013` b
    USING (number)
  LIMIT 10
),

my_cte_materialized AS (
  SELECT * FROM `dulcet-antler-481417-p8.test.view_name`

  UNION ALL (SELECT * FROM my_cte_materialized WHERE FALSE)
)

SELECT * FROM my_cte_materialized
UNION ALL SELECT * FROM my_cte_materialized
UNION ALL SELECT * FROM my_cte_materialized
UNION ALL SELECT * FROM my_cte_materialized
UNION ALL SELECT * FROM my_cte_materialized
First or last record only

When calculating the first or last record in a subset of your data, ROW_NUMBER() can fail with Resources Exceeded errors if there are too many elements to ORDER BY in a single partition.

Instead, try ARRAY_AGG() - it runs more efficiently because the ORDER BY is allowed to drop everything except the top record in each GROUP BY.

Since no join is needed here, the GROUP BY approach is faster than the window function.

-- wrong: window function
SELECT
  a.country_code,
  a.account_id,
  a.order_id,
FROM base a
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY a.country_code, a.account_id ORDER BY a.order_time ASC) = 1

-- correct: aggregation
SELECT
  a.country_code,
  a.account_id,

  MIN_BY(a.order_id, a.order_time), -- single column
  MIN_BY(STRUCT(a.order_id, a.is_successful), a.order_time).*, -- multiple columns
  MIN_BY(a, a.order_time).* EXCEPT (country_code, account_id) -- all columns

FROM base a
GROUP BY ALL
First or last record, same granularity (keeping all rows)

Either join the first/last record back (a join does not perform a sort operation), or use a navigation function over a window:

SELECT
  a.country_code,
  a.account_id,
  a.order_id,

  FIRST_VALUE(a.country_code) OVER w AS acq_country_code,
  FIRST_VALUE(a.order_id) OVER w AS acq_order_id,
  -- or LAST_VALUE
  -- or NTH_VALUE

FROM base a
WINDOW w AS (PARTITION BY a.country_code, a.account_id ORDER BY a.order_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Nth value only

Note: for k > 10, ROW_NUMBER is faster, as the overhead of maintaining the array is higher.

-- window function version
SELECT
  a.state,
  a.number,
FROM `bigquery-public-data.usa_names.usa_1910_2013` a
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.name ORDER BY a.number ASC) = 1000;

-- array version (here: 5th value)
SELECT
  ARRAY_AGG(a ORDER BY a.number DESC LIMIT 5)[OFFSET(4)] AS nth_value
FROM `bigquery-public-data.usa_names.usa_1910_2013` a
GROUP BY a.name
HAVING COUNT(a.number) >= 5
Top k only

Note: for k > 10, ROW_NUMBER is faster, as the overhead of maintaining the array is higher.

-- window function version
SELECT
  a.state,
  a.number,
FROM `bigquery-public-data.usa_names.usa_1910_2013` a
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.name ORDER BY a.number ASC) <= 1000;

-- array version (here: top 5)
WITH base AS (
  SELECT
    ARRAY_AGG(
      a
      ORDER BY a.number DESC
      LIMIT 5
    ) AS top_k
  FROM `bigquery-public-data.usa_names.usa_1910_2013` a
  GROUP BY a.name
)
SELECT rec.*
FROM base, UNNEST(top_k) AS rec
String comparison
  • When possible, use LIKE instead of REGEXP_CONTAINS
  • Wrong: REGEXP_CONTAINS
  • Correct: LIKE
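For instance (table and column names illustrative), a fixed substring match does not need the regex engine:

```sql
-- slower: regex engine evaluated per row
SELECT COUNT(*)
FROM project.dataset.events
WHERE REGEXP_CONTAINS(page_url, r'checkout');

-- faster: simple pattern match
SELECT COUNT(*)
FROM project.dataset.events
WHERE page_url LIKE '%checkout%';
```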
## Materialized CTE/View

Trade-off (Materialization vs. Macro):
  • Lower elapsed time ✅
  • Lower slot-time ✅
  • Lower data reading ✅
  • Lower data shuffled ✅

De-Duplicating Data

In order of performance

  • QUALIFY
    • QUALIFY ROW_NUMBER() OVER (PARTITION BY foi.order_id ORDER BY NULL) = 1
    • Fastest for small datasets, slowest for large datasets due to implicit and unavoidable sort
  • GROUP BY ALL
    • Lower "Compute" time in the GROUP BY version (because hashing is cheaper than aggregate function logic).
    • Lower "Memory" peak in the GROUP BY version (because it's not maintaining ANY_VALUE state buffers).
    • The engine simply scans the index/columns and returns unique pairs.
  • GROUP BY primary key and ANY_VALUE of other fields
    • ANY_VALUE requires an extra step
      • the engine must create groups for the key
      • then execute an aggregate function (ANY_VALUE) on the second column for every group
    • The only time ANY_VALUE is faster is if the columns you are "picking" are extremely wide strings (e.g., a 10KB JSON blob)
      • In that case, hashing the column is expensive
      • ANY_VALUE might be cheaper
  • DISTINCT
    • Hard to read
    • GROUP BY is distributed
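The options above, sketched on a hypothetical orders table keyed by order_id (GROUP BY ALL support varies by engine):

```sql
-- QUALIFY: keep one arbitrary row per key
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY NULL) = 1;

-- GROUP BY ALL: dedup on the full row (exact duplicates only)
SELECT *
FROM orders
GROUP BY ALL;

-- GROUP BY key + ANY_VALUE for the other fields
SELECT
  order_id,
  ANY_VALUE(status) AS status,
  ANY_VALUE(gmv_amount_lc) AS gmv_amount_lc
FROM orders
GROUP BY order_id;

-- DISTINCT: dedup on the full row
SELECT DISTINCT *
FROM orders;
```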
2026-02-16
