Skip to main content
BigQuery Clustering Optimization

BigQuery Clustering Optimization

gunyoung.Park
Author
gunyoung.Park
Always curious, always exploring new tech
BigQuery - This article is part of a series.
Part 2: This Article

Clustering
#

Clustering Concept

SELECT *
FROM user_signups
WHERE
  country = 'Lebanon'
  AND registration_date = '2023-12-01'

Through clustering, BigQuery can perform less work in accessing data, thereby increasing query speed. However, before clustering, it’s good to consider the amount of data in the table and whether the cost of processing clustering might be worse.

For example, if a BigQuery column-based table has only 10 rows of data, you should recognize that the cost of clustering will be higher than the cost of doing a full scan.

Quoting a former Google engineer, if the data group for clustering is less than 100MB, doing a full scan might be better than clustering.

Reference: Google BigQuery clustered table not reducing query size

Important Note

Additionally, if you don’t filter on the clustered base column during query time, it provides no help to query performance whatsoever.


Example of Creating Clustered Tables
#

CREATE TABLE `myproject.mydataset.clustered_table` (
  registration_date DATE,
  country STRING,
  tier STRING,
  username STRING
) CLUSTER BY country;

Clustered Table Structure

Clustering Features

  • Can cluster up to 4 columns maximum
  • Unlike partitioning, not limited to INT64 and DATE types only
  • Can also use types like STRING and GEOGRAPHY

Combine Clustering with Partitioning
#

Clustering + Partitioning

Using partitioning and clustering together enables more efficient data access.

Combination Strategy

  • Partitioning: Date-based data division
  • Clustering: Additional sorting within partitions
  • Query performance and cost optimization

References
#

BigQuery - This article is part of a series.
Part 2: This Article

Related