Picture this: You are responsible for running a big data platform in Snowflake, and over the last few months you have noticed a steady increase in consumed credits. There are no new use cases that you know of. You have already talked to the data onboarding team about the amount of data that is ingested every day, and they confirmed that it has been more or less constant over the last half year. So the old use cases are not dealing with an increase in data volume per day. How do you find out what is causing the increase in consumed credits?
Other than monetary cost, there are many different reasons why you should be concerned about the performance of your data pipelines. Faster query execution means users can retrieve the information they need more quickly. This is crucial in applications where real-time or near-real-time responses are required. Faster access to data also means quicker decision-making, which can be crucial in situations where timely information is of utmost importance (e.g., in emergency response systems). Finally, well-optimized query performance enables the system to handle more concurrent users or larger datasets without a significant drop in response times.
In this article, we take a look at Snowflake’s QUERY_HISTORY view. How can it be used to identify costly queries and what clues does it provide for fixing problematic queries?
Investigating Query Performance in the QUERY_HISTORY View
The ACCOUNT_USAGE and READER_ACCOUNT_USAGE schemas both contain a view called QUERY_HISTORY. There are some differences between the two views (detailed here), but for our purposes either will work.
The QUERY_HISTORY view contains metadata for all queries that have been run within the Snowflake account over the last 365 days. As with all ACCOUNT_USAGE views, there is some latency with which the information is available. In this case, it is about 45 minutes.
A note about credit usage: Since Snowflake does not bill you for computation time directly, estimating the credits consumed by a specific query is an inexact science. Snowflake bills you for warehouse uptime. So you might be tempted to multiply the time elapsed during the compilation and execution of a specific query (in seconds) by the credits per second of the warehouse it ran on. There are, however, two problems with this approach. Warehouses can run several queries in parallel, so you might overestimate the cost of each specific query. On the other hand, running a query might wake the warehouse up from suspension and leave it in the active state for an unnecessarily long time, leading to an underestimation of the cost caused by that query. For these reasons, we will simply use the query run time (TOTAL_ELAPSED_TIME in the QUERY_HISTORY view) as an indicator of how costly a query is.
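For illustration, here is a sketch of such a naive estimate. The warehouse size labels and the credit rates in the CASE expression (1 credit per hour for an X-Small standard warehouse, doubling with every size step) are assumptions based on Snowflake’s standard pricing; check the rates and size labels that apply to your account before relying on numbers like these.

-- Naive per-query credit estimate: elapsed seconds times assumed credits per second
-- of the warehouse the query ran on. This ignores concurrency and warehouse idle time,
-- so treat the result only as a rough indicator.
SELECT
    QUERY_ID,
    WAREHOUSE_SIZE,
    TOTAL_ELAPSED_TIME/1000 as ELAPSED_S,
    (TOTAL_ELAPSED_TIME/1000) *
        (CASE WAREHOUSE_SIZE      -- assumed standard credit rates per hour
            WHEN 'X-Small' THEN 1
            WHEN 'Small'   THEN 2
            WHEN 'Medium'  THEN 4
            WHEN 'Large'   THEN 8
            WHEN 'X-Large' THEN 16
         END / 3600) as NAIVE_CREDIT_ESTIMATE
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE WAREHOUSE_SIZE IS NOT NULL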
Monthly Query Run Time
There are several dimensions along which we can investigate queries to find promising targets for performance improvements. First, we’ll take a look at a monthly breakdown of the query run time:
SELECT
    DATE_TRUNC('MONTH', "START_TIME") as MONTH,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
GROUP BY MONTH
ORDER BY MONTH DESC
MONTH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S
--- | --- | --- | --- | ---
2023-09-01 | 973 | 0.021 | 18,939.449 | 1,167,132.897
2023-08-01 | 877 | 0.016 | 20,769.261 | 1,023,945.685
2023-07-01 | 541 | 0.017 | 13,981.067 | 983,629.880
2023-06-01 | 661 | 0.018 | 18,257.220 | 269,772.616
2023-05-01 | 695 | 0.025 | 12,104.698 | 297,825.463
2023-04-01 | 618 | 0.019 | 8,248.591 | 190,993.656
2023-03-01 | 798 | 0.021 | 9,178.900 | 394,948.741
2023-02-01 | 695 | 0.020 | 6,417.905 | 265,338.306
2023-01-01 | 618 | 0.021 | 6,716.117 | 195,266.476
2022-12-01 | 798 | 0.019 | 5,772.717 | 106,1075.433
2022-11-01 | 878 | 0.018 | 8,700.304 | 160,056.014
2022-10-01 | 486 | 0.021 | 5,487.664 | 104,637.947
The TOTAL_ELAPSED_TIME combines the compilation time and the execution time. Query compilation encompasses, for example, parsing, fetching metadata, and query rewriting. We can see that the run time for queries has indeed increased over the last few months. Based on this information, we can investigate the last month in more detail.
Query Run Time by Query Type
We can break down all queries from the last month by query type.
SELECT
    QUERY_TYPE,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_TYPE
ORDER BY SUM_ELAPSED_TIME_S DESC
LIMIT 5
QUERY_TYPE | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S | SUM_ELAPSED_TIME_S
--- | --- | --- | --- | ---
SELECT | 302 | 0.023 | 18,939.449 | 339,173.783
CREATE_TABLE_AS_SELECT | 231 | 0.331 | 17,698.261 | 253,989.821
DELETE | 80 | 0.481 | 1,257.196 | 2,955.905
UPDATE | 23 | 0.345 | 651.831 | 1,580.721
SHOW | 32 | 0.021 | 7.328 | 28.765
The query type gives us a good indication of what kind of queries typically run for how long. Here, SELECT and CREATE_TABLE_AS_SELECT queries take by far the longest. This is typical, since such queries often need to scan large tables or perform complex aggregations.
Using the Query Hash
In order to really track down what kinds of queries are the most costly, we can use one of Snowflake’s newer features. With the 2023_06 behavior change bundle, the QUERY_HISTORY view gained some new columns that help us find non-obvious patterns in queries. Snowflake calculates a hash for each query. The column QUERY_PARAMETERIZED_HASH is particularly interesting for us, because it ignores differences in whitespace as well as the literal values used in the query, so even queries that are not textually identical can receive the same hash. This allows us to group queries by their general structure and get a better picture of the kinds of queries that cause the most problems.
Also pay attention to the QUERY_PARAMETERIZED_HASH_VERSION column, since hashes produced by different versions of the hashing logic should not be used for grouping queries. To keep queries simple, we assume here that all hashes are produced by the same logic.
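If you cannot make that assumption, a minimal sketch of a version-aware grouping could look like this:

-- Group by the parameterized hash together with its version so that hashes
-- produced by different versions of the hashing logic are never mixed.
SELECT
    QUERY_PARAMETERIZED_HASH,
    QUERY_PARAMETERIZED_HASH_VERSION,
    COUNT(*) as QUERY_COUNT,
    SUM(TOTAL_ELAPSED_TIME/1000) as SUM_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH, QUERY_PARAMETERIZED_HASH_VERSION
ORDER BY SUM_ELAPSED_TIME_S DESC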
Let’s look at some queries that give us some insight into several performance indicators.
Query Performance Indicators
We first look at run times for different kinds of queries.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MIN(TOTAL_ELAPSED_TIME/1000) as MIN_ELAPSED_TIME_S,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
ORDER BY MAX_ELAPSED_TIME_S DESC
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | MIN_ELAPSED_TIME_S | MAX_ELAPSED_TIME_S
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 15,230.136 | 18,939.449
01630b795a0b07452936076314b999ee | 24 | 0.026 | 1,508.126
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 48.491 | 159.186
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 123.426 | 123.426
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 39.347 | 39.347
We sorted the results by the maximum run time for each query hash, because it gives a good first indication of which queries likely cause the most cost. Instances where the minimum run time is near instant indicate that the query has been answered from cached results. But run times of less than 10 seconds also warrant some investigation: if a query is answered very fast on a large warehouse, it is likely that it can also be executed quickly on a smaller warehouse, which saves cost.
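Here is a sketch of how one might look for such downsizing candidates. The warehouse sizes listed and the 10-second threshold are assumptions to be adapted to your own environment:

-- Query patterns that always finish quickly on big warehouses are candidates
-- for running on a smaller, cheaper warehouse.
SELECT
    QUERY_PARAMETERIZED_HASH,
    WAREHOUSE_NAME,
    WAREHOUSE_SIZE,
    COUNT(*) as QUERY_COUNT,
    MAX(TOTAL_ELAPSED_TIME/1000) as MAX_ELAPSED_TIME_S
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
  AND WAREHOUSE_SIZE IN ('Large', 'X-Large', '2X-Large')
GROUP BY QUERY_PARAMETERIZED_HASH, WAREHOUSE_NAME, WAREHOUSE_SIZE
HAVING MAX(TOTAL_ELAPSED_TIME/1000) < 10
ORDER BY QUERY_COUNT DESC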
Data Spilling
There are several reasons why a query sometimes takes a long time to be answered. The first obvious reason is that it requires complicated operations and aggregations to produce the result. If a query takes disproportionately long, it is time to look for potential optimizations. Let’s look more closely at other performance indicators.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MAX(BYTES_SPILLED_TO_LOCAL_STORAGE) as SPILLED_L_STORAGE,
    MAX(BYTES_SPILLED_TO_REMOTE_STORAGE) as SPILLED_R_STORAGE
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | SPILLED_L_STORAGE | SPILLED_R_STORAGE
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 2,899,818,373 | 1,422,512,417
01630b795a0b07452936076314b999ee | 24 | 224,158,589 | 456,256,215
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 22,451,456 | 0
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 242,112,473 | 0
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 48,245,168 | 0
The queries with the longest run times in our results also spill quite a lot of data to local and remote storage. Spilling occurs when the warehouse executing the query runs out of memory: the data required for answering the query is temporarily written to disk, which, particularly in the case of remote disk spilling, means slow read and write operations that prolong the execution time. For queries that spill a lot of data to remote storage, it can be useful to experiment with larger warehouses. Virtual warehouses not only double the number of compute cores with each step between T-shirt sizes but also provide more RAM and local disk space. This can avoid remote spilling and typically provides a big boost in execution time.
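Such an experiment can be as simple as resizing the warehouse one step at a time and comparing run time and spilling afterwards. The warehouse name REPORTING_WH below is hypothetical:

-- Try the next-larger size for the problematic workload ...
ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE';
-- ... re-run the query, compare TOTAL_ELAPSED_TIME and BYTES_SPILLED_TO_REMOTE_STORAGE
-- for the new run, and scale back down if the larger size does not pay off.
ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'MEDIUM';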
Micro-partition Usage
Another performance indicator from the query history view is the ratio between the PARTITIONS_SCANNED and the PARTITIONS_TOTAL.
SELECT
    QUERY_PARAMETERIZED_HASH,
    COUNT(*) as QUERY_COUNT,
    MAX(PARTITIONS_SCANNED) as PARTITIONS_SCANNED,
    MAX(PARTITIONS_TOTAL) as PARTITIONS_TOTAL
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE DATE_TRUNC('MONTH', "START_TIME") = '2023-09-01'
GROUP BY QUERY_PARAMETERIZED_HASH
LIMIT 5
QUERY_PARAMETERIZED_HASH | QUERY_COUNT | PARTITIONS_SCANNED | PARTITIONS_TOTAL
--- | --- | --- | ---
00d8f496857f970b8187cb89008b9ffb | 30 | 140,927 | 251,727
01630b795a0b07452936076314b999ee | 24 | 240,933 | 251,597
01e81ce2a48c0ac6d12390fd25e482c6 | 12 | 3,482 | 249,471
01ff4d44bd11bfb5add87d88e2ce8019 | 1 | 22,951 | 247,104
02f54b0f2bfc98844a87cfabf32ba360 | 3 | 15,539 | 245,689
These two columns give you an idea of how many micro-partitions were scanned versus how many micro-partitions make up all the tables involved in the query. If many micro-partitions were scanned that ultimately did not contribute to the result, it is worth checking the clustering of the tables used in that query. Make sure to understand what micro-partitions are and how Snowflake performs automatic clustering. For large tables (> 1 TB) or tables that are frequently queried with WHERE, JOIN, or ORDER BY clauses, it might be worthwhile to specify a clustering key so that micro-partitions can be pruned more effectively during query execution.
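As a sketch (the table and column names are hypothetical, and whether a timestamp column is a good clustering key depends on your actual query patterns): first inspect how well the table is clustered on the filter column, then define a clustering key if pruning is poor.

-- How well is the table currently clustered with respect to the column we filter on?
SELECT SYSTEM$CLUSTERING_INFORMATION('STG_TRANSACTIONS', '(TRANSACTION_TIMESTAMP)');

-- If clustering is poor and the table is large and frequently filtered on this column,
-- define a clustering key so Snowflake can maintain the clustering automatically.
ALTER TABLE STG_TRANSACTIONS CLUSTER BY (TRANSACTION_TIMESTAMP);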
Nailing Down the Root Cause
There are lots of different reasons why queries take longer than you would expect, and equally many ways to optimize them. In our example analysis we have seen that the amount of data spilled to remote storage is quite considerable, so we might investigate further in that direction. From this point on, you can start investigating individual queries. We have isolated similar queries using the QUERY_PARAMETERIZED_HASH; using this hash, we can retrieve individual query IDs that we can look at more closely.
SELECT
    QUERY_ID,
    QUERY_TEXT,
    QUERY_PARAMETERIZED_HASH
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_PARAMETERIZED_HASH = '00d8f496857f970b8187cb89008b9ffb'
LIMIT 1
QUERY_ID | QUERY_TEXT | QUERY_PARAMETERIZED_HASH
--- | --- | ---
01af43ee-0302-d906-0000-1bb5239e7346 | SELECT count(*) as count_transactions, max(st.value) as max_value, sum(st.value) as sum_value from stg_transactions st where st.transaction_timestamp >= '2023-06-01' | 00d8f496857f970b8187cb89008b9ffb
Snowflake offers great tooling for analyzing queries, like the Query Profiler. It can give you insight into how the query is executed step by step.
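If you prefer to inspect a profile programmatically, the GET_QUERY_OPERATOR_STATS table function exposes operator-level statistics similar to what the Query Profiler visualizes. A minimal sketch using the query ID from the result above:

-- Operator-level statistics for a single query execution
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS('01af43ee-0302-d906-0000-1bb5239e7346'));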
One common problem the Query Profiler can point out is misplaced WHERE clauses. Filter steps should happen as early as possible within the query execution. Filtering late can produce larger intermediate results during aggregations and joins, which in the worst case causes data to spill to remote storage.
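To make the idea concrete, here is an illustrative sketch with hypothetical tables. Note that Snowflake’s optimizer often pushes such filters down on its own; the Query Profiler shows where the filter actually ends up in the execution plan.

-- Filter written against the joined result: conceptually, the join processes
-- every row of the large table before the filter applies.
SELECT c.REGION, SUM(t.VALUE) as SUM_VALUE
FROM STG_TRANSACTIONS t
JOIN CUSTOMERS c ON c.CUSTOMER_ID = t.CUSTOMER_ID
WHERE t.TRANSACTION_TIMESTAMP >= '2023-09-01'
GROUP BY c.REGION;

-- Filter applied to the large table before it enters the join.
WITH RECENT_TRANSACTIONS as (
    SELECT CUSTOMER_ID, VALUE
    FROM STG_TRANSACTIONS
    WHERE TRANSACTION_TIMESTAMP >= '2023-09-01'
)
SELECT c.REGION, SUM(t.VALUE) as SUM_VALUE
FROM RECENT_TRANSACTIONS t
JOIN CUSTOMERS c ON c.CUSTOMER_ID = t.CUSTOMER_ID
GROUP BY c.REGION;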
Storing and Updating Costly Query Results
Looking at the queries with the most data spilling in our example, we can see that the filter on the time column is quite coarse and results in transaction data starting from 2023-06-01 being aggregated every day. While the amount of transaction data added every day has not increased, the total amount of transactions obviously grows every day. Here it would make sense to perform the aggregation less often (maybe once a week) and store the results in a new table. If the aggregation needs to incorporate the most recent transactions every day, sums, counts, minima, and maxima can be updated incrementally each day without re-reading all transactions every time.
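A minimal sketch of such an incremental update, assuming a hypothetical results table TRANSACTION_AGGREGATES that has already been initialized with one row, non-negative transaction values, and a stg_transactions table that only ever receives new rows (no updates or deletions of historic transactions); a minimum could be maintained analogously with LEAST:

-- Fold only the transactions added since the last run into the stored aggregates.
INSERT INTO TRANSACTION_AGGREGATES (AGGREGATED_UNTIL, COUNT_TRANSACTIONS, SUM_VALUE, MAX_VALUE)
SELECT
    CURRENT_TIMESTAMP()                                 as AGGREGATED_UNTIL,
    prev.COUNT_TRANSACTIONS + delta.COUNT_TRANSACTIONS  as COUNT_TRANSACTIONS,
    prev.SUM_VALUE + delta.SUM_VALUE                    as SUM_VALUE,
    GREATEST(prev.MAX_VALUE, delta.MAX_VALUE)           as MAX_VALUE
FROM (
    -- the most recent stored aggregate
    SELECT *
    FROM TRANSACTION_AGGREGATES
    ORDER BY AGGREGATED_UNTIL DESC
    LIMIT 1
) prev
CROSS JOIN (
    -- aggregates over the newly added transactions only
    -- note: late-arriving transactions with older timestamps would be missed
    -- by this simple watermark
    SELECT
        COUNT(*)                   as COUNT_TRANSACTIONS,
        COALESCE(SUM(st.VALUE), 0) as SUM_VALUE,
        COALESCE(MAX(st.VALUE), 0) as MAX_VALUE
    FROM STG_TRANSACTIONS st
    WHERE st.TRANSACTION_TIMESTAMP > (SELECT MAX(AGGREGATED_UNTIL) FROM TRANSACTION_AGGREGATES)
) delta;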
Conclusions
Performance optimization of queries in a big data setting is often a long and tedious process. With the QUERY_HISTORY view, Snowflake offers us a great resource for investigating the performance of our queries by giving us easy access to metadata like run times, partitions scanned, and data spilled.
The new query hash feature makes it easy to group queries and identify the kind of queries that warrant closer investigation. With tools like Snowflake’s Query Profiler and a good understanding of how Snowflake executes queries, we can come up with more efficient queries for our daily workloads. This not only reduces cost but also increases the productivity of the team and the satisfaction of our data consumers.