[Header image: a graphical representation of a data platform (Snowflake) connected to a monitor (Java)]

Effective Snowflake Data Ingestion with Java


Snowflake is becoming increasingly popular as a data platform, and thus the need for ways to integrate Snowflake into production systems is also growing. As type-safe programming languages like Java lead to more stable production environments and are often preferred over dynamically typed languages (e.g. Python), this raises the challenge of connecting to Snowflake databases from those type-safe languages. While Snowflake comes with dedicated methods for loading data from cloud providers (e.g. Snowpipe), we have to rely on other implementation approaches when ingesting data directly from our type-safe Java apps.

In this blog post, we will cover the following native data ingestion methods for Java: batch and merge insertion using the JDBC template, and stream insertion via the Snowflake Streaming API.

We will also compare their performance in terms of ingestion time and Snowflake compute usage. This provides you with a practical guide to choosing the right implementation for your application, and it applies to all JVM-based languages that can use the Snowflake JDBC driver (e.g. Kotlin or Scala).

Preliminary remarks

Throughout this post, we will use a Snowflake database with a table named INGEST_TABLE, which is created in SnowSQL with the following command:
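A minimal sketch of such a table definition, assuming an ID column for identifying entries next to the VARIANT payload column (the exact schema of the original setup may differ):

```sql
-- Hypothetical schema: an ID for identifying entries and a VARIANT column for the JSON payload
CREATE OR REPLACE TABLE INGEST_TABLE (
    ID NUMBER,
    PAYLOAD VARIANT
);
```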

The VARIANT column, which holds JSON-formatted data, is used to create a realistic load on the database. We implement a corresponding Java class named SnowflakeTableEntity:
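A minimal sketch of such an entity, with the field names being assumptions; the JSON payload is kept as a plain String and parsed to VARIANT on the Snowflake side:

```java
// Minimal sketch of the entity; field names are assumptions made for the examples in this post.
public class SnowflakeTableEntity {

    private final long id;
    // JSON payload kept as a raw String; Snowflake converts it to VARIANT during ingestion
    private final String payload;

    public SnowflakeTableEntity(long id, String payload) {
        this.id = id;
        this.payload = payload;
    }

    public long getId() {
        return id;
    }

    public String getPayload() {
        return payload;
    }
}
```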

Snowflake Connection in Java

Like most other database systems, Snowflake provides us with a JDBC driver for Java, which we will use throughout this blog post. To connect our JDBC driver to Snowflake, we first have to authenticate it to Snowflake.
For that, we have to create an RSA key pair (e.g. with OpenSSL) and assign the public key to a Snowflake user with the following SnowSQL command:
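The command follows Snowflake's key-pair authentication syntax; the key value shown here is a truncated placeholder for your own public key:

```sql
-- Assign the public key (base64 body only, without the BEGIN/END delimiters) to the user
ALTER USER sf_user SET RSA_PUBLIC_KEY='MIIBIjANBgkqh...';
```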

where sf_user is the Snowflake user in our example. With the private key of our generated key pair, stored in a private key file such as 'rsa_key.p8', we can configure the data source that we use to instantiate the JDBC driver.
To do so, we have to convert the private key, which is stored as base64-encoded text, into the Java PrivateKey interface:
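A sketch of this conversion, assuming an unencrypted PKCS#8 key file and a hypothetical helper class named PrivateKeyReader:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;

public class PrivateKeyReader {

    // Reads an unencrypted PKCS#8 private key file (e.g. rsa_key.p8) into a PrivateKey.
    // Assumes the key is not passphrase-protected; encrypted keys would need an extra decryption step.
    public static PrivateKey readPrivateKey(Path privateKeyFile) throws Exception {
        String pem = Files.readString(privateKeyFile);
        // Strip the PEM delimiters and whitespace so that only the base64 body remains
        String base64Key = pem
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        byte[] keyBytes = Base64.getDecoder().decode(base64Key);
        PKCS8EncodedKeySpec keySpec = new PKCS8EncodedKeySpec(keyBytes);
        return KeyFactory.getInstance("RSA").generatePrivate(keySpec);
    }
}
```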

Note how the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- delimiters have to be removed from the private key.
With that, we can already create our JDBC template that uses the corresponding DataSource:
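A sketch using the driver's SnowflakeBasicDataSource together with Spring's JdbcTemplate; database, schema, warehouse, and user names are placeholders for your own Snowflake objects, and the private key conversion reuses the helper from above:

```java
import java.nio.file.Path;
import net.snowflake.client.jdbc.SnowflakeBasicDataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class SnowflakeJdbcConfig {

    // Builds a JdbcTemplate backed by the Snowflake JDBC driver using key pair authentication.
    // Database, schema, warehouse, and user names are placeholders for your Snowflake objects.
    public static JdbcTemplate snowflakeJdbcTemplate(Path privateKeyFile) throws Exception {
        String sfUrl = "jdbc:snowflake://<account_identifier>.snowflakecomputing.com";

        SnowflakeBasicDataSource dataSource = new SnowflakeBasicDataSource();
        dataSource.setUrl(sfUrl);
        dataSource.setUser("sf_user");
        // PrivateKeyReader is the helper from the sketch above
        dataSource.setPrivateKey(PrivateKeyReader.readPrivateKey(privateKeyFile));
        dataSource.setDatabaseName("MY_DATABASE");
        dataSource.setSchema("MY_SCHEMA");
        dataSource.setWarehouse("MY_WAREHOUSE");

        return new JdbcTemplate(dataSource);
    }
}
```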

where the parameters of the DataSource refer to the corresponding Snowflake objects and the file input is your private key file. The sfUrl should be a JDBC connection string of the form
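```
jdbc:snowflake://<account_identifier>.snowflakecomputing.com
```

where <account_identifier> is your Snowflake account identifier.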

Having set up the Snowflake connection correctly, we should now be able to communicate with our Snowflake database. Here is a test selection query:
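For example, a simple count query against the ingest table, assuming the jdbcTemplate from above:

```java
// Simple smoke test: count the rows currently stored in INGEST_TABLE
Integer rowCount = jdbcTemplate.queryForObject("SELECT COUNT(*) FROM INGEST_TABLE", Integer.class);
System.out.println("Rows in INGEST_TABLE: " + rowCount);
```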

Data Ingestion with Batch Insert

This method should be rather familiar to users who have already worked with JDBC. It relies on traditional OLTP insert and update methods and uses the JDBC template batchUpdate functionality:
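A sketch of such a batch insert under the table and entity assumptions from above; the INSERT ... SELECT form is used so that PARSE_JSON can be applied to the bound payload parameter:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class BatchIngester {

    // INSERT ... SELECT allows PARSE_JSON to be applied to the payload bind parameter,
    // since Snowflake does not accept function calls inside a plain VALUES clause.
    private static final String INSERT_SQL =
            "INSERT INTO INGEST_TABLE (ID, PAYLOAD) SELECT ?, PARSE_JSON(?)";

    private final JdbcTemplate jdbcTemplate;

    public BatchIngester(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Inserts all entities in a single JDBC batch
    public void batchInsert(List<SnowflakeTableEntity> entities) {
        jdbcTemplate.batchUpdate(INSERT_SQL, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setLong(1, entities.get(i).getId());
                ps.setString(2, entities.get(i).getPayload());
            }

            @Override
            public int getBatchSize() {
                return entities.size();
            }
        });
    }
}
```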

Here, we use Snowflake's PARSE_JSON function to convert our JSON string into a VARIANT object.
If we wanted to upsert entries instead of just inserting them, we would have to determine the existing entities and update them, while inserting the non-existing ones.
Although the batch insert works in principle, we will see in the Performance Comparison section that the other methods should be preferred for effective data ingestion.

Data Ingestion with Merge Insert

This solution is based on native Snowflake ingestion methods rather than traditional JDBC methods. Here, we use Snowflake stages for intermediate data storage and then merge these stages into our table. In general, Snowflake stages are intended as gateways to a variety of data sources, such as cloud storage and uploaded data files. Thus, we first create an internal Snowflake stage (alternatively one could use the table stage of our ingest table which Snowflake creates by default, see the Snowflake documentation):
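A minimal example of such a stage, with the stage name being an assumption that the following snippets reuse; a JSON file format is set so the staged files can be queried directly:

```sql
-- Named internal stage with a JSON file format, used as intermediate storage for uploads
CREATE OR REPLACE STAGE INGEST_STAGE FILE_FORMAT = (TYPE = JSON);
```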

The Snowflake JDBC driver provides us with an API extension to directly load data into this internal stage from a Java stream:
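A sketch of this upload, assuming the entities are serialized to newline-delimited JSON and uploaded into the stage from above; the uploadStream method of the unwrapped SnowflakeConnection does the actual transfer:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.util.List;
import java.util.stream.Collectors;
import net.snowflake.client.jdbc.SnowflakeConnection;
import org.springframework.jdbc.core.JdbcTemplate;

public class StageUploader {

    private final JdbcTemplate jdbcTemplate;

    public StageUploader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Serializes the entities to newline-delimited JSON and uploads them to the internal stage.
    // The payload is assumed to already be valid JSON; stage, prefix, and file names are examples.
    public void uploadToStage(List<SnowflakeTableEntity> entities, String fileName) throws Exception {
        String ndjson = entities.stream()
                .map(e -> "{\"id\": " + e.getId() + ", \"payload\": " + e.getPayload() + "}")
                .collect(Collectors.joining("\n"));
        InputStream inputStream = new ByteArrayInputStream(ndjson.getBytes(StandardCharsets.UTF_8));

        try (Connection connection = jdbcTemplate.getDataSource().getConnection()) {
            // Last parameter enables compression of the uploaded file
            connection.unwrap(SnowflakeConnection.class)
                    .uploadStream("INGEST_STAGE", "data", inputStream, fileName, true);
        }
    }
}
```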

Next, we can insert the uploaded data into our table by merging the stage:
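A sketch of this merge, executed through the JDBC template; stage, table, and column names follow the assumptions above, and matching on the ID column is what enables the upsert behaviour described below:

```java
import org.springframework.jdbc.core.JdbcTemplate;

public class StageMerger {

    // Reads the staged JSON files, matches on ID, and upserts into the target table.
    // Stage, table, and column names follow the hypothetical examples above.
    private static final String MERGE_SQL = """
            MERGE INTO INGEST_TABLE AS target
            USING (SELECT $1:id::NUMBER AS ID, $1:payload AS PAYLOAD FROM @INGEST_STAGE) AS source
            ON target.ID = source.ID
            WHEN MATCHED THEN UPDATE SET target.PAYLOAD = source.PAYLOAD
            WHEN NOT MATCHED THEN INSERT (ID, PAYLOAD) VALUES (source.ID, source.PAYLOAD)
            """;

    private final JdbcTemplate jdbcTemplate;

    public StageMerger(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void mergeStageIntoTable() {
        jdbcTemplate.update(MERGE_SQL);
    }
}
```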

Note that there is an integrated upsert mechanism via the MATCHED and NOT MATCHED clauses. This feature distinguishes the merge insert from the other methods and makes it the preferred one if the data model requires upserts.
One thing to consider, however, is the file size of the resulting uploaded data file. As the Snowflake documentation suggests, this file should be around 100-250 MB. We will experiment with different upload sizes in the Performance Comparison section.

Data Ingestion using Snowflake Streaming

This method promises low-latency data ingestion and is suitable for real-time data streams according to the Snowflake documentation. We can ingest data using the Snowflake streaming API. For this, we first have to create a dedicated Snowflake Channel:
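A sketch of this setup, assuming the Snowflake Ingest SDK (Maven artifact snowflake-ingest-sdk); the client and channel names, database, schema, role, and account URL are placeholders:

```java
import java.util.Properties;
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

public class StreamingChannelFactory {

    // Opens a streaming ingest channel against INGEST_TABLE.
    // Client/channel names, database, schema, role, and URL are placeholders for your Snowflake objects;
    // see the Ingest SDK documentation for the full list of connection properties.
    public static SnowflakeStreamingIngestChannel openChannel(String privateKeyBase64) {
        Properties props = new Properties();
        props.put("user", "sf_user");
        props.put("url", "https://<account_identifier>.snowflakecomputing.com:443");
        // base64 body of the PKCS#8 private key, i.e. without the BEGIN/END delimiters
        props.put("private_key", privateKeyBase64);
        props.put("role", "INGEST_ROLE");

        SnowflakeStreamingIngestClient client =
                SnowflakeStreamingIngestClientFactory.builder("INGEST_CLIENT")
                        .setProperties(props)
                        .build();

        OpenChannelRequest request = OpenChannelRequest.builder("INGEST_CHANNEL")
                .setDBName("MY_DATABASE")
                .setSchemaName("MY_SCHEMA")
                .setTableName("INGEST_TABLE")
                .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                .build();

        return client.openChannel(request);
    }
}
```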

Using this channel, we can now ingest data by transforming entities to map objects:
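A sketch of this ingestion loop under the same assumptions, mapping each entity to a column-name-to-value map and using the ID as a simple offset token:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.snowflake.ingest.streaming.InsertValidationResponse;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;

public class StreamIngester {

    private final SnowflakeStreamingIngestChannel channel;

    public StreamIngester(SnowflakeStreamingIngestChannel channel) {
        this.channel = channel;
    }

    // Transforms each entity into a column-name-to-value map and appends it to the channel.
    // Column names follow the hypothetical table definition above; the offset token is derived from the ID.
    public void streamInsert(List<SnowflakeTableEntity> entities) {
        for (SnowflakeTableEntity entity : entities) {
            Map<String, Object> row = new HashMap<>();
            row.put("ID", entity.getId());
            // A JSON string is accepted for the VARIANT column
            row.put("PAYLOAD", entity.getPayload());

            InsertValidationResponse response =
                    channel.insertRow(row, String.valueOf(entity.getId()));
            if (response.hasErrors()) {
                throw new IllegalStateException(
                        "Stream insert failed: " + response.getInsertErrors());
            }
        }
    }
}
```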

This method does not provide any way to update existing data rows, so it is limited to insertion only, which is its major drawback compared to the other methods when considering Snowflake in an OLTP context.

Performance Comparison

To evaluate which method is the most suitable for ingesting data into Snowflake with Java, we tested each method by ingesting several thousand data entries into an empty table with various batch sizes and two different warehouse sizes (XS and S). In the resulting graph, we omit the run times of the batch insert, as it was slower than the other methods by almost a factor of 100!

[Figure: execution time by insertion method in milliseconds]

The other two methods performed comparably well, with the stream insert being slightly faster than one might expect. Also note how the insertion time decreases with larger batch sizes and increases again once the optimal batch size is exceeded. This optimal batch size corresponds to a compressed file size of ~100 MB, which is consistent with the Snowflake documentation. In our case, the warehouse size did not play a significant role; however, this might change for different data schemas or volumes.

Recommendations

Based on the results of the previous section, we recommend using stream insertion wherever possible. Since stream insertion is append-only, use merge insertion for cases where you need to upsert entries with unique keys. In either case, you should experiment with different batch sizes to determine the optimal one for your use case. Also, try out different warehouse sizes and use the smallest one if you do not notice any performance difference.

