
Increasing Query Performance Using the Snowflake Query History

Picture this: You are responsible for running a big data platform in Snowflake, and over the last few months you have noticed a steady increase in consumed credits. There are no new use cases that you know of. You have already talked to the data onboarding team about the amount of data that is ingested every day, and they confirmed that the volume of ingested data has been more or less constant over the last half year. So the existing use cases are not dealing with an increase in daily data volume. How do you find out what is causing the increase in consumed credits?

Beyond monetary cost, there are many other reasons to be concerned about the performance of your data pipelines. Faster query execution means users can retrieve the information they need more quickly. This is crucial in applications where real-time or near-real-time responses are required. Faster access to data also means quicker decision-making, which can be crucial in situations where timely information is of utmost importance (e.g., in emergency response systems). Finally, well-optimized query performance enables the system to handle more concurrent users or larger datasets without a significant drop in response times.

In this article, we take a look at Snowflake’s QUERY_HISTORY view. How can it be used to identify costly queries and what clues does it provide for fixing problematic queries?

Investigating Query Performance in the QUERY_HISTORY View

The ACCOUNT_USAGE and READER_ACCOUNT_USAGE schemas both contain a view called QUERY_HISTORY. There are some differences between the two views (detailed in the Snowflake documentation), but for our purposes either will work.

The QUERY_HISTORY view contains metadata for all queries that have been run within the Snowflake account over the last 365 days. As with all ACCOUNT_USAGE views, the information becomes available with some latency; in this case, it is about 45 minutes.

A note about credit usage: Since Snowflake does not bill you for computation time directly, estimating the credits consumed by a specific query is an inexact science. Snowflake bills you for warehouse uptime. So you might be tempted to multiply the time elapsed during compilation and execution of a specific query (in seconds) by the credits per second of the warehouse it ran on. There are, however, two problems with this approach. Warehouses can run several queries in parallel, so you might overestimate the cost of each individual query. On the other hand, running a query might wake the warehouse up from suspension and leave it in the active state for an unnecessarily long time, leading to an underestimation of the cost caused by that query. For these reasons, we will simply use the query run time (TOTAL_ELAPSED_TIME in the QUERY_HISTORY view) as an indicator of how costly a query is.

Monthly Query Run Time

There are several different dimensions along which we can investigate queries in order to find problematic queries as targets for performance improvements. First, we'll take a look at a monthly breakdown of the query run time.
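
A query along the following lines can produce such a breakdown; this is a sketch against the ACCOUNT_USAGE schema, and the column aliases are our own. Note that TOTAL_ELAPSED_TIME is reported in milliseconds, hence the division by 1000.

```sql
-- Monthly breakdown of query run times.
-- TOTAL_ELAPSED_TIME is given in milliseconds, so we convert to seconds.
SELECT
    DATE_TRUNC('month', start_time) AS month,
    COUNT(*)                        AS query_count,
    MIN(total_elapsed_time) / 1000  AS min_elapsed_time_s,
    MAX(total_elapsed_time) / 1000  AS max_elapsed_time_s,
    SUM(total_elapsed_time) / 1000  AS sum_elapsed_time_s
FROM snowflake.account_usage.query_history
GROUP BY 1
ORDER BY 1 DESC;
```

The result of this aggregation looks like this: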

| MONTH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S |
|---|---|---|---|---|
| 2023-09-01 | 973 | 0.021 | 18,939.449 | 1,167,132.897 |
| 2023-08-01 | 877 | 0.016 | 20,769.261 | 1,023,945.685 |
| 2023-07-01 | 541 | 0.017 | 13,981.067 | 983,629.880 |
| 2023-06-01 | 661 | 0.018 | 18,257.220 | 269,772.616 |
| 2023-05-01 | 695 | 0.025 | 12,104.698 | 297,825.463 |
| 2023-04-01 | 618 | 0.019 | 8,248.591 | 190,993.656 |
| 2023-03-01 | 798 | 0.021 | 9,178.900 | 394,948.741 |
| 2023-02-01 | 695 | 0.020 | 6,417.905 | 265,338.306 |
| 2023-01-01 | 618 | 0.021 | 6,716.117 | 195,266.476 |
| 2022-12-01 | 798 | 0.019 | 5,772.717 | 1,061,075.433 |
| 2022-11-01 | 878 | 0.018 | 8,700.304 | 160,056.014 |
| 2022-10-01 | 486 | 0.021 | 5,487.664 | 104,637.947 |

The TOTAL_ELAPSED_TIME combines the compilation time and the execution time. Query compilation encompasses, for example, parsing, fetching metadata, and query rewriting. We can see that the run time for queries has indeed increased over the last four months. Based on this information, we can investigate the last month in more detail.

Query Run Time by Query Type

We can break down all queries from the last month by query type.
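
A sketch of this breakdown; the date filter for "the last month" is an assumption and would be adjusted to the period of interest:

```sql
-- Break down last month's queries by query type
SELECT
    query_type,
    COUNT(*)                       AS query_count,
    MIN(total_elapsed_time) / 1000 AS min_elapsed_time_s,
    MAX(total_elapsed_time) / 1000 AS max_elapsed_time_s,
    SUM(total_elapsed_time) / 1000 AS sum_elapsed_time_s
FROM snowflake.account_usage.query_history
WHERE start_time >= '2023-09-01'   -- the month we want to investigate
  AND start_time <  '2023-10-01'
GROUP BY query_type
ORDER BY sum_elapsed_time_s DESC;
```

For our example month, this yields: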

| QUERY_TYPE | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S |
|---|---|---|---|---|
| SELECT | 302 | 0.023 | 18,939.449 | 339,173.783 |
| CREATE_TABLE_AS_SELECT | 231 | 0.331 | 17,698.261 | 253,989.821 |
| DELETE | 80 | 0.481 | 1,257.196 | 2,955.905 |
| UPDATE | 23 | 0.345 | 651.831 | 1,580.721 |
| SHOW | 32 | 0.021 | 7.328 | 28.765 |

The query type gives us a good indication of what kind of queries typically run for how long. Here, SELECT and CREATE_TABLE_AS_SELECT take by far the longest. This is typical since SELECT queries often need to scan large tables or perform complex aggregations.

Using the Query Hash

In order to really track down what kind of queries are the most costly, we can use one of Snowflake's newer features. With the 2023_06 behavior change bundle, the QUERY_HISTORY view gained some new columns that help us find non-obvious patterns in queries. Snowflake calculates a hash for each query, and with the right hash variant, even queries that are not textually identical receive the same value. In particular, the column QUERY_PARAMETERIZED_HASH is interesting for us, because this hash ignores differences in whitespace as well as literal arguments within the query. This allows us to group queries by their general structure and get a better picture of the kinds of queries that cause the most problems.

Also pay attention to the QUERY_PARAMETERIZED_HASH_VERSION column, since hashes produced by different versions of the hashing logic should not be used for grouping queries. To keep queries simple, we assume here that all hashes are produced by the same logic.

Let’s look at some queries that give us some insight into several performance indicators.

Query Performance Indicators

We first look at run times for different kinds of queries.
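
For example, we can group by the parameterized hash and sort by the longest single run; a sketch, using the same assumed date filter as above:

```sql
-- Group last month's queries by their parameterized hash,
-- sorted by the longest single run per hash
SELECT
    query_parameterized_hash,
    COUNT(*)                       AS query_count,
    MIN(total_elapsed_time) / 1000 AS min_elapsed_time_s,
    MAX(total_elapsed_time) / 1000 AS max_elapsed_time_s
FROM snowflake.account_usage.query_history
WHERE start_time >= '2023-09-01'
  AND start_time <  '2023-10-01'
  AND query_parameterized_hash IS NOT NULL
GROUP BY query_parameterized_hash
ORDER BY max_elapsed_time_s DESC;
```

The result for our example month: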

| QUERY_PARAMETERIZED_HASH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S |
|---|---|---|---|
| 00d8f496857f970b8187cb89008b9ffb | 30 | 15,230.136 | 18,939.449 |
| 01630b795a0b07452936076314b999ee | 24 | 0.026 | 1,508.126 |
| 01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 48.491 | 159.186 |
| 01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 123.426 | 123.426 |
| 02f54b0f2bfc98844a87cfabf32ba360 | 3 | 39.347 | 39.347 |

We sorted the results by the maximum run time for each query hash because it gives a good first indication of which queries likely cause the most cost. Instances where the minimum run time is near-instant indicate that the query has been answered from cached results. Run times of less than 10 seconds also warrant some investigation: if a query is answered very quickly on a large warehouse, it is likely that it can also be executed quickly on a smaller warehouse, which saves cost.

Data Spilling

There are several reasons why a query sometimes takes a long time to be answered. The first obvious reason is that it requires complicated operations and aggregations to produce the result. If a query takes disproportionately long, it is time to look for potential optimizations. Let’s look more closely at other performance indicators.
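
The spilling columns of the QUERY_HISTORY view can be aggregated per hash in the same way; a sketch (the values are in bytes, and we keep the same row ordering as in the run-time breakdown above):

```sql
-- Total local and remote spilling (in bytes) per query hash
SELECT
    query_parameterized_hash,
    COUNT(*)                             AS query_count,
    SUM(bytes_spilled_to_local_storage)  AS spilled_l_storage,
    SUM(bytes_spilled_to_remote_storage) AS spilled_r_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= '2023-09-01'
  AND start_time <  '2023-10-01'
GROUP BY query_parameterized_hash
ORDER BY MAX(total_elapsed_time) DESC;   -- same ordering as the run-time table
```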

| QUERY_PARAMETERIZED_HASH | QUERY_COUNT | SPILLED_L_STORAGE | SPILLED_R_STORAGE |
|---|---|---|---|
| 00d8f496857f970b8187cb89008b9ffb | 30 | 2,899,818,373 | 1,422,512,417 |
| 01630b795a0b07452936076314b999ee | 24 | 224,158,589 | 456,256,215 |
| 01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 22,451,456 | 0 |
| 01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 242,112,473 | 0 |
| 02f54b0f2bfc98844a87cfabf32ba360 | 3 | 48,245,168 | 0 |

The queries with the longest run times in our results also spill quite a lot of data to local and remote disks. This occurs when the warehouse executing the query runs out of memory. In these cases, the data required for answering the query is temporarily stored on disk, which, particularly in the case of remote disk spilling, means slow read and write processes that prolong the execution time. For queries that spill a lot of data to remote storage, it can be useful to experiment with larger warehouses. Virtual warehouses not only double the number of computing cores with each step between T-shirt sizes but also provide more RAM and local disk space. This avoids remote spilling and typically reduces the execution time considerably.
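
A minimal sketch of such an experiment; the warehouse name is a placeholder, and since resizing affects every workload running on that warehouse, it is best tried on a dedicated test warehouse:

```sql
-- Try one T-shirt size larger for the spilling workload
ALTER WAREHOUSE my_test_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ... re-run the problematic query and compare spilling and run time ...

-- Scale back down if the larger size does not pay off
ALTER WAREHOUSE my_test_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```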

Micro-partition Usage

Another performance indicator from the query history view is the ratio between the PARTITIONS_SCANNED and the PARTITIONS_TOTAL.
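
Again, the two columns can be summed per query hash; a sketch:

```sql
-- Scanned vs. total micro-partitions per query hash
SELECT
    query_parameterized_hash,
    COUNT(*)                AS query_count,
    SUM(partitions_scanned) AS partitions_scanned,
    SUM(partitions_total)   AS partitions_total
FROM snowflake.account_usage.query_history
WHERE start_time >= '2023-09-01'
  AND start_time <  '2023-10-01'
GROUP BY query_parameterized_hash
ORDER BY MAX(total_elapsed_time) DESC;   -- same ordering as the run-time table
```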

| QUERY_PARAMETERIZED_HASH | QUERY_COUNT | PARTITIONS_SCANNED | PARTITIONS_TOTAL |
|---|---|---|---|
| 00d8f496857f970b8187cb89008b9ffb | 30 | 140,927 | 251,727 |
| 01630b795a0b07452936076314b999ee | 24 | 240,933 | 251,597 |
| 01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 3,482 | 249,471 |
| 01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 22,951 | 247,104 |
| 02f54b0f2bfc98844a87cfabf32ba360 | 3 | 15,539 | 245,689 |

These two columns give you an idea of how many micro-partitions were scanned versus how many micro-partitions make up all the tables included in the query. If many micro-partitions that ultimately did not contribute to the result were scanned, it is worth checking the clustering of the tables used in that query. Make sure to understand what micro-partitions are and how Snowflake performs automatic clustering. For large tables (> 1 TB) or tables that are frequently queried with WHERE, JOIN, or ORDER BY clauses, it might be worthwhile to specify a clustering key so that micro-partitions can be pruned more effectively during query execution.
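
As an illustration, the table and timestamp column are taken from the example query later in this article; whether a clustering key actually pays off has to be evaluated per table:

```sql
-- Cluster the transaction table by its timestamp so that date filters can prune partitions
ALTER TABLE stg_transactions CLUSTER BY (transaction_timestamp);

-- Inspect how well the table is clustered on that expression
SELECT SYSTEM$CLUSTERING_INFORMATION('stg_transactions', '(transaction_timestamp)');
```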

Nailing Down the Root Cause

There are lots of different reasons why queries take longer than you would expect and equally many ways to optimize them. In our example analysis, we have seen that the amount of data spilled to remote storage is quite considerable, so we might investigate further in that direction. From this point on, you can start investigating individual queries. We have isolated similar queries using the QUERY_PARAMETERIZED_HASH; using this hash, we can get individual query IDs that we can look at more closely.
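
A sketch of that lookup, using the hash with the longest run times and the most remote spilling from the tables above:

```sql
-- Fetch individual runs of the suspicious query for closer inspection
SELECT
    query_id,
    query_text,
    total_elapsed_time / 1000 AS elapsed_time_s,
    bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE query_parameterized_hash = '00d8f496857f970b8187cb89008b9ffb'
ORDER BY total_elapsed_time DESC
LIMIT 10;
```

One of the rows looks like this: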

| QUERY_ID | QUERY_TEXT | QUERY_PARAMETERIZED_HASH |
|---|---|---|
| 01af43ee-0302-d906-0000-1bb5239e7346 | SELECT count(*) as count_transactions, max(st.value) as max_value, sum(st.value) as sum_value from stg_transactions st where st.transaction_timestamp >= '2023-06-01' | 00d8f496857f970b8187cb89008b9ffb |

Snowflake offers great tooling for analyzing queries, like the Query Profiler. It can give you insight into how the query is executed step by step.

One common problem the query profiler can point out to you is misplaced WHERE clauses. Filter steps should happen as early as possible within a query execution. Filtering late can cause larger result sets during aggregations and joins, which in the worst case cause data spill to remote storage.
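
As a hypothetical illustration of the idea (the customers table and the join are made up for this example, and Snowflake's optimizer often pushes filters down automatically, so the query profile is the place to verify where the filter actually runs):

```sql
-- Filter formulated "late": the join is written over all transactions
SELECT c.customer_id, SUM(t.value) AS sum_value
FROM stg_transactions t
JOIN customers c ON c.customer_id = t.customer_id
WHERE t.transaction_timestamp >= '2023-09-01'
GROUP BY c.customer_id;

-- Filter pushed into a subquery: only the relevant transactions enter the join
SELECT c.customer_id, SUM(t.value) AS sum_value
FROM (
    SELECT customer_id, value
    FROM stg_transactions
    WHERE transaction_timestamp >= '2023-09-01'
) t
JOIN customers c ON c.customer_id = t.customer_id
GROUP BY c.customer_id;
```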

Storing and Updating Costly Query Results

Looking at the queries with the most data spilling in our example, we can see that the filter on the time column is quite coarse and results in transaction data starting from 2023-06-01 being aggregated every day. While the amount of transaction data added every day has not increased, obviously the total amount of transactions increases every day. Here it would make sense to perform the aggregation less often (maybe once a week) and store the results in a new table. If the aggregation needs to incorporate the most current transactions every day, sums, counts, min, and max aggregations can easily be updated daily without reading all the transactions every time.
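
A minimal sketch of this pattern; the aggregate table name and the exact scheduling are assumptions:

```sql
-- One-off (or weekly): materialize the expensive aggregation
CREATE OR REPLACE TABLE transaction_aggregates AS
SELECT
    count(*)      AS count_transactions,
    max(st.value) AS max_value,
    sum(st.value) AS sum_value
FROM stg_transactions st
WHERE st.transaction_timestamp >= '2023-06-01';

-- Daily: fold in only the newest transactions instead of re-scanning everything
UPDATE transaction_aggregates a
SET count_transactions = a.count_transactions + COALESCE(d.cnt, 0),
    max_value          = GREATEST(a.max_value, COALESCE(d.mx, a.max_value)),
    sum_value          = a.sum_value + COALESCE(d.sm, 0)
FROM (
    SELECT COUNT(*) AS cnt, MAX(value) AS mx, SUM(value) AS sm
    FROM stg_transactions
    WHERE transaction_timestamp >= DATEADD('day', -1, CURRENT_DATE)
      AND transaction_timestamp <  CURRENT_DATE
) d;
```

This keeps the daily work proportional to one day of transactions rather than the full history.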

Conclusions

Performance optimization of queries in a big data setting is oftentimes a long and tedious process. With the QUERY_HISTORY view, Snowflake offers us a great resource for investigating the performance of our queries by giving us easy access to metadata like run times, partitions scanned, and data spilled.

The new query hash feature makes it easy to group queries and identify the kind of queries that warrant closer investigation. With tools like Snowflake’s Query Profiler and a good understanding of how Snowflake executes queries, we can come up with more efficient queries for our daily workloads. This not only reduces cost but also increases the productivity of the team and the satisfaction of our data consumers.
