Picture this: You are responsible for running a big data platform in Snowflake, and over the last few months you have noticed a steady increase in consumed credits. There are no new use cases that you know of. You have already talked to the data onboarding team about the amount of data that is ingested every day, and they confirmed that it has been more or less constant over the last half year. So the old use cases are not dealing with an increase in data volume per day. How do you find out what is causing the increase in consumed credits?
Other than monetary cost, there are many different reasons why you should be concerned about the performance of your data pipelines. Faster query execution means users can retrieve the information they need more quickly. This is crucial in applications where real-time or near-real-time responses are required. Faster access to data also means quicker decision-making, which can be crucial in situations where timely information is of utmost importance (e.g., in emergency response systems). Finally, well-optimized query performance enables the system to handle more concurrent users or larger datasets without a significant drop in response times.
In this article, we take a look at Snowflake’s QUERY_HISTORY view. How can it be used to identify costly queries and what clues does it provide for fixing problematic queries?
Investigating Query Performance in the QUERY_HISTORY View
The ACCOUNT_USAGE and READER_ACCOUNT_USAGE schemas both contain a view called QUERY_HISTORY. There are some differences between the two views (detailed here), but for our purposes either will work.
The QUERY_HISTORY view contains metadata for all queries that have been run within the Snowflake account over the last 365 days. As with all ACCOUNT_USAGE views, there is some latency with which the information is available. In this case, it is about 45 minutes.
A note about credit usage: Since Snowflake does not bill you for computation time directly, estimating the credits consumed by a specific query is an inexact science. Snowflake bills you for warehouse uptime. So you might be tempted to multiply the time elapsed during the compilation and execution of a specific query (in seconds) by the credits per second of the warehouse it ran on. There are, however, two problems with this approach. Warehouses can run several queries in parallel, so you might overestimate the cost of each specific query. On the other hand, running a query might wake the warehouse up from suspension and leave it in the active state for an unnecessarily long time, leading to an underestimation of the cost caused by that query. For these reasons, we will simply use the query run time (TOTAL_ELAPSED_TIME in the QUERY_HISTORY view) as an indicator of how costly a query is.
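For illustration, here is a sketch of such a naive estimate. The warehouse size labels and the credit rates in the CASE expression (1 credit per hour for an X-Small standard warehouse, doubling with every size step) are assumptions based on Snowflake’s standard pricing; check the rates and size labels that apply to your account before relying on numbers like these.

-- Naive per-query credit estimate: elapsed seconds times assumed credits per second
-- of the warehouse the query ran on. This ignores concurrency and warehouse idle time,
-- so treat the result only as a rough indicator.
SELECT
    QUERY_ID,
    WAREHOUSE_SIZE,
    TOTAL_ELAPSED_TIME/1000 as ELAPSED_S,
    (TOTAL_ELAPSED_TIME/1000) *
        (CASE WAREHOUSE_SIZE      -- assumed standard credit rates per hour
            WHEN 'X-Small' THEN 1
            WHEN 'Small'   THEN 2
            WHEN 'Medium'  THEN 4
            WHEN 'Large'   THEN 8
            WHEN 'X-Large' THEN 16
         END / 3600) as NAIVE_CREDIT_ESTIMATE
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE WAREHOUSE_SIZE IS NOT NULL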
Monthly Query Run Time
There are several dimensions along which we can investigate queries to find promising targets for performance improvements. First, we’ll take a look at a monthly breakdown of the query run time:
SELECT
    DATE_TRUNC('MONTH', "START_TIME") as MONTH,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
GROUP BY MONTH
ORDER BY MONTH DESC
MONTH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S
--- | --- | --- | --- | ---
2023-09-01 | 973 | 0.021 | 18,939.449 | 1,167,132.897
2023-08-01 | 877 | 0.016 | 20,769.261 | 1,023,945.685
2023-07-01 | 541 | 0.017 | 13,981.067 | 983,629.880
2023-06-01 | 661 | 0.018 | 18,257.220 | 269,772.616
2023-05-01 | 695 | 0.025 | 12,104.698 | 297,825.463
2023-04-01 | 618 | 0.019 | 8,248.591 | 190,993.656
2023-03-01 | 798 | 0.021 | 9,178.900 | 394,948.741
2023-02-01 | 695 | 0.020 | 6,417.905 | 265,338.306
2023-01-01 | 618 | 0.021 | 6,716.117 | 195,266.476
2022-12-01 | 798 | 0.019 | 5,772.717 | 106,1075.433
2022-11-01 | 878 | 0.018 | 8,700.304 | 160,056.014
2022-10-01 | 486 | 0.021 | 5,487.664 | 104,637.947
The TOTAL_ELAPSED_TIME combines the compilation time and the execution time. Query compilation encompasses, for example, parsing, fetching metadata, and query rewriting. We can see that the run time for queries has indeed increased over the last few months. Based on this information, we can investigate the last month in more detail.
Query Run Time by Query Type
We can break down all queries from the last month by query type.
SELECT
    QUERY_TYPE,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_TYPE
ORDER BY SUM_ELAPSED_TIME_S DESC
LIMIT 5
QUERY_TYPE | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S
--- | --- | --- | --- | ---
SELECT | 302 | 0.023 | 18,939.449 | 339,173.783
CREATE_TABLE_AS_SELECT | 231 | 0.331 | 17,698.261 | 253,989.821
DELETE | 80 | 0.481 | 1,257.196 | 2,955.905
UPDATE | 23 | 0.345 | 651.831 | 1,580.721
SHOW | 32 | 0.021 | 7.328 | 28.765
The query type gives us a good indication of what kind of queries typically run for how long. Here, SELECT and CREATE_TABLE_AS_SELECT queries take by far the longest. This is typical, since such queries often need to scan large tables or perform complex aggregations.
Using the Query Hash
In order to really track down what kinds of queries are the most costly, we can use one of Snowflake’s newer features. With the 2023_06 behavior change bundle, the QUERY_HISTORY view gained some new columns that help us find non-obvious patterns in queries. Snowflake calculates a hash for each query. The column QUERY_PARAMETERIZED_HASH is particularly interesting for us, because it ignores differences in whitespace as well as the literal values used in the query, so even queries that are not textually identical can receive the same hash. This allows us to group queries by their general structure and get a better picture of the kinds of queries that cause the most problems.
Also pay attention to the QUERY_PARAMETERIZED_HASH_VERSION column, since hashes produced by different versions of the hashing logic should not be used for grouping queries. To keep queries simple, we assume here that all hashes are produced by the same logic.
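If you cannot make that assumption, a minimal sketch of a version-aware grouping could look like this:

-- Group by the parameterized hash together with its version so that hashes
-- produced by different versions of the hashing logic are never mixed.
SELECT
    QUERY_PARAMETERIZED_HASH,
    QUERY_PARAMETERIZED_HASH_VERSION,
    COUNT(*) as QUERY_COUNT,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH, QUERY_PARAMETERIZED_HASH_VERSION
ORDER BY SUM_ELAPSED_TIME_S DESC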
Let’s look at some queries that give us some insight into several performance indicators.
Query Performance Indicators
We first look at run times for different kinds of queries.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
ORDER BY MAX_ELAPSED_TIME_S DESC
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 15,230.136 | 18,939.449
01630b795a0b07452936076314b999ee | 24 | 0.026 | 1,508.126
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 48.491 | 159.186
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 123.426 | 123.426
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 39.347 | 39.347
We sorted the results by the maximum run time for each query hash, because it gives a good first indication of which queries likely cause the most cost. Instances where the minimum run time is near instant indicate that the query has been answered from cached results. But run times of less than 10 seconds also warrant some investigation: if a query is answered very fast on a large warehouse, it is likely that it can also be executed quickly on a smaller warehouse, which saves cost.
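Here is a sketch of how one might look for such downsizing candidates. The warehouse sizes listed and the 10-second threshold are assumptions to be adapted to your own environment:

-- Query patterns that always finish quickly on big warehouses are candidates
-- for running on a smaller, cheaper warehouse.
SELECT
    QUERY_PARAMETERIZED_HASH,
    WAREHOUSE_NAME,
    WAREHOUSE_SIZE,
    COUNT(*) as QUERY_COUNT,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
  AND WAREHOUSE_SIZE IN ('Large', 'X-Large', '2X-Large')
GROUP BY QUERY_PARAMETERIZED_HASH, WAREHOUSE_NAME, WAREHOUSE_SIZE
HAVING MAX(TOTAL_ELAPSED_TIME/1000) < 10
ORDER BY QUERY_COUNT DESC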
Data Spilling
There are several reasons why a query sometimes takes a long time to be answered. The first obvious reason is that it requires complicated operations and aggregations to produce the result. If a query takes disproportionately long, it is time to look for potential optimizations. Let’s look more closely at other performance indicators.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MAX(BYTES_SPILLED_TO_LOCAL_STORAGE) as SPILLED_L_STORAGE,
    MAX(BYTES_SPILLED_TO_REMOTE_STORAGE) as SPILLED_R_STORAGE
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | SPILLED_L_STORAGE | SPILLED_R_STORAGE
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 2,899,818,373 | 1,422,512,417
01630b795a0b07452936076314b999ee | 24 | 224,158,589 | 456,256,215
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 22,451,456 | 0
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 242,112,473 | 0
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 48,245,168 | 0
The queries with the longest run times in our results also spill quite a lot of data to local and remote storage. Spilling occurs when the warehouse executing the query runs out of memory: the data required for answering the query is temporarily written to disk, which, particularly in the case of remote disk spilling, means slow read and write operations that prolong the execution time. For queries that spill a lot of data to remote storage, it can be useful to experiment with larger warehouses. Virtual warehouses not only double the number of compute cores with each step between T-shirt sizes but also provide more RAM and local disk space. This can avoid remote spilling and typically provides a big boost in execution time.
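Such an experiment can be as simple as resizing the warehouse one step at a time and comparing run time and spilling afterwards. The warehouse name REPORTING_WH below is hypothetical:

-- Try the next-larger size for the problematic workload ...
ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE';
-- ... re-run the query, compare TOTAL_ELAPSED_TIME and BYTES_SPILLED_TO_REMOTE_STORAGE
-- for the new run, and scale back down if the larger size does not pay off.
ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'MEDIUM';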
Micro-partition Usage
Another performance indicator from the query history view is the ratio between the PARTITIONS_SCANNED and the PARTITIONS_TOTAL.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MAX(PARTITIONS_SCANNED) as PARTITIONS_SCANNED,
    MAX(PARTITIONS_TOTAL) as PARTITIONS_TOTAL
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | PARTITIONS_SCANNED | PARTITIONS_TOTAL
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 140,927 | 251,727
01630b795a0b07452936076314b999ee | 24 | 240,933 | 251,597
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 3,482 | 249,471
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 22,951 | 247,104
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 15,539 | 245,689
These two columns give you an idea of how many micro-partitions were scanned versus how many micro-partitions make up all the tables involved in the query. If many micro-partitions were scanned that ultimately did not contribute to the result, it is worth checking the clustering of the tables used in that query. Make sure to understand what micro-partitions are and how Snowflake performs automatic clustering. For large tables (> 1 TB) or tables that are frequently queried with WHERE, JOIN, or ORDER BY clauses, it might be worthwhile to specify a clustering key so that micro-partitions can be pruned more effectively during query execution.
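As a sketch (the table and column names are hypothetical, and whether a timestamp column is a good clustering key depends on your actual query patterns): first inspect how well the table is clustered on the filter column, then define a clustering key if pruning is poor.

-- How well is the table currently clustered with respect to the column we filter on?
SELECT SYSTEM$CLUSTERING_INFORMATION('STG_TRANSACTIONS', '(TRANSACTION_TIMESTAMP)');

-- If clustering is poor and the table is large and frequently filtered on this column,
-- define a clustering key so Snowflake can maintain the clustering automatically.
ALTER TABLE STG_TRANSACTIONS CLUSTER BY (TRANSACTION_TIMESTAMP);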
Nailing Down the Root Cause
There are lots of different reasons why queries take longer than you would expect, and equally many ways to optimize them. In our example analysis we have seen that the amount of data spilled to remote storage is quite considerable, so we might investigate further in that direction. From this point on, you can start investigating individual queries. We have isolated similar queries using the QUERY_PARAMETERIZED_HASH; using this hash, we can retrieve individual query IDs that we can look at more closely.
SELECT
    QUERY_ID,
    QUERY_TEXT,
    QUERY_PARAMETERIZED_HASH
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_PARAMETERIZED_HASH = '00d8f496857f970b8187cb89008b9ffb'
LIMIT 1
QUERY_ID | QUERY_TEXT | QUERY_PARAMETERIZED_HASH
--- | --- | ---
01af43ee-0302-d906-0000-1bb5239e7346 | SELECT count(*) as count_transactions, max(st.value) as max_value, sum(st.value) as sum_value from stg_transactions st where st.transaction_timestamp >= '2023-06-01' | 00d8f496857f970b8187cb89008b9ffb
Snowflake offers great tooling for analyzing queries, like the Query Profiler. It can give you insight into how the query is executed step by step.
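If you prefer to inspect a profile programmatically, the GET_QUERY_OPERATOR_STATS table function exposes operator-level statistics similar to what the Query Profiler visualizes. A minimal sketch using the query ID from the result above:

-- Operator-level statistics for a single query execution
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS('01af43ee-0302-d906-0000-1bb5239e7346'));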
One common problem the Query Profiler can point out is misplaced WHERE clauses. Filter steps should happen as early as possible within the query execution. Filtering late can produce larger intermediate results during aggregations and joins, which in the worst case causes data to spill to remote storage.
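To make the idea concrete, here is an illustrative sketch with hypothetical tables. Note that Snowflake’s optimizer often pushes such filters down on its own; the Query Profiler shows where the filter actually ends up in the execution plan.

-- Filter written against the joined result: conceptually, the join processes
-- every row of the large table before the filter applies.
SELECT c.REGION, SUM(t.VALUE) as SUM_VALUE
FROM STG_TRANSACTIONS t
JOIN CUSTOMERS c ON c.CUSTOMER_ID = t.CUSTOMER_ID
WHERE t.TRANSACTION_TIMESTAMP >= '2023-09-01'
GROUP BY c.REGION;

-- Filter applied to the large table before it enters the join.
WITH RECENT_TRANSACTIONS as (
    SELECT CUSTOMER_ID, VALUE
    FROM STG_TRANSACTIONS
    WHERE TRANSACTION_TIMESTAMP >= '2023-09-01'
)
SELECT c.REGION, SUM(t.VALUE) as SUM_VALUE
FROM RECENT_TRANSACTIONS t
JOIN CUSTOMERS c ON c.CUSTOMER_ID = t.CUSTOMER_ID
GROUP BY c.REGION;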
Storing and Updating Costly Query Results
Looking at the queries with the most data spilling in our example, we can see that the filter on the time column is quite coarse and results in transaction data starting from 2023-06-01 being aggregated every day. While the amount of transaction data added every day has not increased, the total amount of transactions obviously grows every day. Here it would make sense to perform the aggregation less often (maybe once a week) and store the results in a new table. If the aggregation needs to incorporate the most recent transactions every day, sums, counts, minima, and maxima can be updated incrementally each day without re-reading all transactions every time.
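A minimal sketch of such an incremental update, assuming a hypothetical results table TRANSACTION_AGGREGATES that has already been initialized with one row, non-negative transaction values, and a stg_transactions table that only ever receives new rows (no updates or deletions of historic transactions); a minimum could be maintained analogously with LEAST:

-- Fold only the transactions added since the last run into the stored aggregates.
INSERT INTO TRANSACTION_AGGREGATES (AGGREGATED_UNTIL, COUNT_TRANSACTIONS, SUM_VALUE, MAX_VALUE)
SELECT
    CURRENT_TIMESTAMP()                                 as AGGREGATED_UNTIL,
    prev.COUNT_TRANSACTIONS + delta.COUNT_TRANSACTIONS  as COUNT_TRANSACTIONS,
    prev.SUM_VALUE + delta.SUM_VALUE                    as SUM_VALUE,
    GREATEST(prev.MAX_VALUE, delta.MAX_VALUE)           as MAX_VALUE
FROM (
    -- the most recent stored aggregate
    SELECT *
    FROM TRANSACTION_AGGREGATES
    ORDER BY AGGREGATED_UNTIL DESC
    LIMIT 1
) prev
CROSS JOIN (
    -- aggregates over the newly added transactions only
    -- note: late-arriving transactions with older timestamps would be missed
    -- by this simple watermark
    SELECT
        COUNT(*)                   as COUNT_TRANSACTIONS,
        COALESCE(SUM(st.VALUE), 0) as SUM_VALUE,
        COALESCE(MAX(st.VALUE), 0) as MAX_VALUE
    FROM STG_TRANSACTIONS st
    WHERE st.TRANSACTION_TIMESTAMP > (SELECT MAX(AGGREGATED_UNTIL) FROM TRANSACTION_AGGREGATES)
) delta;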
Conclusions
Performance optimization of queries in a big data setting is often a long and tedious process. With the QUERY_HISTORY view, Snowflake offers us a great resource for investigating the performance of our queries by giving us easy access to metadata like run times, partitions scanned, and data spilled.
The new query hash feature makes it easy to group queries and identify the kind of queries that warrant closer investigation. With tools like Snowflake’s Query Profiler and a good understanding of how Snowflake executes queries, we can come up with more efficient queries for our daily workloads. This not only reduces cost but also increases the productivity of the team and the satisfaction of our data consumers.