[Header image: a graphical representation of a data platform (Snowflake) connected to a monitor (Java)]

Effective Snowflake Data Ingestion with Java


Snowflake is becoming increasingly popular as a data platform, and thus the need for ways to integrate Snowflake into production systems is also growing. As type-safe programming languages like Java lead to more stable production environments and are often preferred over dynamically typed languages (e.g. Python), this raises the challenge of connecting to Snowflake databases from those type-safe languages. While Snowflake comes with dedicated methods for loading data from cloud providers (e.g. Snowpipe), we have to rely on other implementation approaches when ingesting data directly from our type-safe Java apps.

In this blog post, we will cover the following native data ingestion methods for Java: batch and merge insertion using the JDBC template, and stream insertion via the Snowflake Streaming API.

We will also compare their performance in terms of ingestion time and Snowflake compute usage. This provides you with a practical guide to choosing the right implementation for your application, and it applies to all JVM-based languages that can use the Snowflake JDBC driver (e.g. Kotlin or Scala).

Preliminary remarks

Throughout this post, we will use a Snowflake database with a table named INGEST_TABLE, which is created in SnowSQL with the following command:
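A minimal sketch of such a table definition, assuming an ID column for identifying entries next to the VARIANT payload column (the exact schema of the original setup may differ):

```sql
-- Hypothetical schema: an ID for identifying entries and a VARIANT column for the JSON payload
CREATE OR REPLACE TABLE INGEST_TABLE (
    ID NUMBER,
    PAYLOAD VARIANT
);
```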

The VARIANT column, which holds JSON-formatted data, is used to create a realistic load on the database. We implement a corresponding Java class named SnowflakeTableEntity:
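A minimal sketch of such an entity, with the field names being assumptions; the JSON payload is kept as a plain String and parsed to VARIANT on the Snowflake side:

```java
// Minimal sketch of the entity; field names are assumptions made for the examples in this post.
public class SnowflakeTableEntity {

    private final long id;
    // JSON payload kept as a raw String; Snowflake converts it to VARIANT during ingestion
    private final String payload;

    public SnowflakeTableEntity(long id, String payload) {
        this.id = id;
        this.payload = payload;
    }

    public long getId() {
        return id;
    }

    public String getPayload() {
        return payload;
    }
}
```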

Snowflake Connection in Java

Like most other database systems, Snowflake provides us with a JDBC driver for Java, which we will use throughout this blog post. To connect our JDBC driver to Snowflake, we first have to authenticate it to Snowflake.
For that, we have to create an RSA key pair (e.g. with OpenSSL) and assign the public key to a Snowflake user with the following SnowSQL command:
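The command follows Snowflake's key-pair authentication syntax; the key value shown here is a truncated placeholder for your own public key:

```sql
-- Assign the public key (base64 body only, without the BEGIN/END delimiters) to the user
ALTER USER sf_user SET RSA_PUBLIC_KEY='MIIBIjANBgkqh...';
```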

where sf_user is the Snowflake user in our example. With the private key of our generated key pair, stored in a private key file such as 'rsa_key.p8', we can configure the data source that we use to instantiate the JDBC driver.
To do so, we have to convert the private key, which is stored as base64-encoded text, into the Java PrivateKey interface:
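A sketch of this conversion, assuming an unencrypted PKCS#8 key file and a hypothetical helper class named PrivateKeyReader:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;

public class PrivateKeyReader {

    // Reads an unencrypted PKCS#8 private key file (e.g. rsa_key.p8) into a PrivateKey.
    // Assumes the key is not passphrase-protected; encrypted keys would need an extra decryption step.
    public static PrivateKey readPrivateKey(Path privateKeyFile) throws Exception {
        String pem = Files.readString(privateKeyFile);
        // Strip the PEM delimiters and whitespace so that only the base64 body remains
        String base64Key = pem
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        byte[] keyBytes = Base64.getDecoder().decode(base64Key);
        PKCS8EncodedKeySpec keySpec = new PKCS8EncodedKeySpec(keyBytes);
        return KeyFactory.getInstance("RSA").generatePrivate(keySpec);
    }
}
```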

Note how the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- delimiters have to be removed from the private key.
With that, we can already create our JDBC template that uses the corresponding DataSource:
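A sketch using the driver's SnowflakeBasicDataSource together with Spring's JdbcTemplate; database, schema, warehouse, and user names are placeholders for your own Snowflake objects, and the private key conversion reuses the helper from above:

```java
import java.nio.file.Path;
import net.snowflake.client.jdbc.SnowflakeBasicDataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class SnowflakeJdbcConfig {

    // Builds a JdbcTemplate backed by the Snowflake JDBC driver using key pair authentication.
    // Database, schema, warehouse, and user names are placeholders for your Snowflake objects.
    public static JdbcTemplate snowflakeJdbcTemplate(Path privateKeyFile) throws Exception {
        String sfUrl = "jdbc:snowflake://<account_identifier>.snowflakecomputing.com";

        SnowflakeBasicDataSource dataSource = new SnowflakeBasicDataSource();
        dataSource.setUrl(sfUrl);
        dataSource.setUser("sf_user");
        // PrivateKeyReader is the helper from the sketch above
        dataSource.setPrivateKey(PrivateKeyReader.readPrivateKey(privateKeyFile));
        dataSource.setDatabaseName("MY_DATABASE");
        dataSource.setSchema("MY_SCHEMA");
        dataSource.setWarehouse("MY_WAREHOUSE");

        return new JdbcTemplate(dataSource);
    }
}
```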

where the parameters of the DataSource refer to the corresponding Snowflake objects and the file input is your private key file. The sfUrl should be a JDBC connection string of the form
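```
jdbc:snowflake://<account_identifier>.snowflakecomputing.com
```

where <account_identifier> is your Snowflake account identifier.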

Having set up the Snowflake connection correctly, we should now be able to communicate with our Snowflake database. Here is a test selection query:
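For example, a simple count query against the ingest table, assuming the jdbcTemplate from above:

```java
// Simple smoke test: count the rows currently stored in INGEST_TABLE
Integer rowCount = jdbcTemplate.queryForObject("SELECT COUNT(*) FROM INGEST_TABLE", Integer.class);
System.out.println("Rows in INGEST_TABLE: " + rowCount);
```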

Data Ingestion with Batch Insert

This method should be rather familiar to users who have already worked with JDBC. It relies on traditional OLTP insert and update methods and uses the JDBC template batchUpdate functionality:
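A sketch of such a batch insert under the table and entity assumptions from above; the INSERT ... SELECT form is used so that PARSE_JSON can be applied to the bound payload parameter:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class BatchIngester {

    // INSERT ... SELECT allows PARSE_JSON to be applied to the payload bind parameter,
    // since Snowflake does not accept function calls inside a plain VALUES clause.
    private static final String INSERT_SQL =
            "INSERT INTO INGEST_TABLE (ID, PAYLOAD) SELECT ?, PARSE_JSON(?)";

    private final JdbcTemplate jdbcTemplate;

    public BatchIngester(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Inserts all entities in a single JDBC batch
    public void batchInsert(List<SnowflakeTableEntity> entities) {
        jdbcTemplate.batchUpdate(INSERT_SQL, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setLong(1, entities.get(i).getId());
                ps.setString(2, entities.get(i).getPayload());
            }

            @Override
            public int getBatchSize() {
                return entities.size();
            }
        });
    }
}
```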

Here, we use Snowflake's PARSE_JSON function to convert our JSON string into a VARIANT object.
If we wanted to upsert entries instead of just inserting them, we would have to determine the existing entities and update them, while inserting the non-existing ones.
Although the batch insert works in principle, we will see in the Performance Comparison section that the other methods should be preferred for effective data ingestion.

Data Ingestion with Merge Insert

This solution is based on native Snowflake ingestion methods rather than traditional JDBC methods. Here, we use Snowflake stages for intermediate data storage and then merge these stages into our table. In general, Snowflake stages are intended as gateways to a variety of data sources, such as cloud storage and uploaded data files. Thus, we first create an internal Snowflake stage (alternatively one could use the table stage of our ingest table which Snowflake creates by default, see the Snowflake documentation):
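A minimal example of such a stage, with the stage name being an assumption that the following snippets reuse; a JSON file format is set so the staged files can be queried directly:

```sql
-- Named internal stage with a JSON file format, used as intermediate storage for uploads
CREATE OR REPLACE STAGE INGEST_STAGE FILE_FORMAT = (TYPE = JSON);
```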

The Snowflake JDBC driver provides us with an API extension to directly load data into this internal stage from a Java stream:
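A sketch of this upload, assuming the entities are serialized to newline-delimited JSON and uploaded into the stage from above; the uploadStream method of the unwrapped SnowflakeConnection does the actual transfer:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.util.List;
import java.util.stream.Collectors;
import net.snowflake.client.jdbc.SnowflakeConnection;
import org.springframework.jdbc.core.JdbcTemplate;

public class StageUploader {

    private final JdbcTemplate jdbcTemplate;

    public StageUploader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Serializes the entities to newline-delimited JSON and uploads them to the internal stage.
    // The payload is assumed to already be valid JSON; stage, prefix, and file names are examples.
    public void uploadToStage(List<SnowflakeTableEntity> entities, String fileName) throws Exception {
        String ndjson = entities.stream()
                .map(e -> "{\"id\": " + e.getId() + ", \"payload\": " + e.getPayload() + "}")
                .collect(Collectors.joining("\n"));
        InputStream inputStream = new ByteArrayInputStream(ndjson.getBytes(StandardCharsets.UTF_8));

        try (Connection connection = jdbcTemplate.getDataSource().getConnection()) {
            // Last parameter enables compression of the uploaded file
            connection.unwrap(SnowflakeConnection.class)
                    .uploadStream("INGEST_STAGE", "data", inputStream, fileName, true);
        }
    }
}
```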

Next, we can insert the uploaded data into our table by merging the stage:
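A sketch of this merge, executed through the JDBC template; stage, table, and column names follow the assumptions above, and matching on the ID column is what enables the upsert behaviour described below:

```java
import org.springframework.jdbc.core.JdbcTemplate;

public class StageMerger {

    // Reads the staged JSON files, matches on ID, and upserts into the target table.
    // Stage, table, and column names follow the hypothetical examples above.
    private static final String MERGE_SQL = """
            MERGE INTO INGEST_TABLE AS target
            USING (SELECT $1:id::NUMBER AS ID, $1:payload AS PAYLOAD FROM @INGEST_STAGE) AS source
            ON target.ID = source.ID
            WHEN MATCHED THEN UPDATE SET target.PAYLOAD = source.PAYLOAD
            WHEN NOT MATCHED THEN INSERT (ID, PAYLOAD) VALUES (source.ID, source.PAYLOAD)
            """;

    private final JdbcTemplate jdbcTemplate;

    public StageMerger(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void mergeStageIntoTable() {
        jdbcTemplate.update(MERGE_SQL);
    }
}
```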

Note that there is an integrated upsert mechanism via the MATCHED and NOT MATCHED clauses. This feature distinguishes the merge insert from the other methods and makes it the preferred one if the data model requires upserts.
One thing to consider, however, is the file size of the resulting uploaded data file. As the Snowflake documentation suggests, this file should be around 100-250 MB. We will experiment with different upload sizes in the Performance Comparison section.

Data Ingestion using Snowflake Streaming

This method promises low-latency data ingestion and is suitable for real-time data streams according to the Snowflake documentation. We can ingest data using the Snowflake streaming API. For this, we first have to create a dedicated Snowflake Channel:
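A sketch of this setup, assuming the Snowflake Ingest SDK (Maven artifact snowflake-ingest-sdk); the client and channel names, database, schema, role, and account URL are placeholders:

```java
import java.util.Properties;
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

public class StreamingChannelFactory {

    // Opens a streaming ingest channel against INGEST_TABLE.
    // Client/channel names, database, schema, role, and URL are placeholders for your Snowflake objects;
    // see the Ingest SDK documentation for the full list of connection properties.
    public static SnowflakeStreamingIngestChannel openChannel(String privateKeyBase64) {
        Properties props = new Properties();
        props.put("user", "sf_user");
        props.put("url", "https://<account_identifier>.snowflakecomputing.com:443");
        // base64 body of the PKCS#8 private key, i.e. without the BEGIN/END delimiters
        props.put("private_key", privateKeyBase64);
        props.put("role", "INGEST_ROLE");

        SnowflakeStreamingIngestClient client =
                SnowflakeStreamingIngestClientFactory.builder("INGEST_CLIENT")
                        .setProperties(props)
                        .build();

        OpenChannelRequest request = OpenChannelRequest.builder("INGEST_CHANNEL")
                .setDBName("MY_DATABASE")
                .setSchemaName("MY_SCHEMA")
                .setTableName("INGEST_TABLE")
                .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                .build();

        return client.openChannel(request);
    }
}
```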

Using this channel, we can now ingest data by transforming entities to map objects:
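A sketch of this ingestion loop under the same assumptions, mapping each entity to a column-name-to-value map and using the ID as a simple offset token:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.snowflake.ingest.streaming.InsertValidationResponse;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;

public class StreamIngester {

    private final SnowflakeStreamingIngestChannel channel;

    public StreamIngester(SnowflakeStreamingIngestChannel channel) {
        this.channel = channel;
    }

    // Transforms each entity into a column-name-to-value map and appends it to the channel.
    // Column names follow the hypothetical table definition above; the offset token is derived from the ID.
    public void streamInsert(List<SnowflakeTableEntity> entities) {
        for (SnowflakeTableEntity entity : entities) {
            Map<String, Object> row = new HashMap<>();
            row.put("ID", entity.getId());
            // A JSON string is accepted for the VARIANT column
            row.put("PAYLOAD", entity.getPayload());

            InsertValidationResponse response =
                    channel.insertRow(row, String.valueOf(entity.getId()));
            if (response.hasErrors()) {
                throw new IllegalStateException(
                        "Stream insert failed: " + response.getInsertErrors());
            }
        }
    }
}
```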

This method does not provide any way to update existing data rows, so it is limited to insertion only, which is its major drawback compared to the other methods when considering Snowflake in an OLTP context.

Performance Comparison

To evaluate which method is the most suitable for ingesting data into Snowflake with Java, we tested each method by ingesting several thousand data entries into an empty table with various batch sizes and two different warehouse sizes (XS and S). In the resulting graph, we omit the run times of the batch insert, as it was slower than the other methods by almost a factor of 100!

[Figure: execution time by insertion method in milliseconds]

The other two methods performed comparably well, with the stream insert being slightly faster than one might expect. Also note how the insertion time decreases with larger batch sizes and increases again once the optimal batch size is exceeded. This optimal batch size corresponds to a compressed file size of ~100 MB, which is consistent with the Snowflake documentation. In our case, the warehouse size did not play a significant role; however, this might change for different data schemas or volumes.

Recommendations

Based on the results of the previous section, we recommend using stream insertion wherever possible. Since stream insertion is append-only, use merge insertion for cases where you need to upsert entries with unique keys. In either case, you should experiment with different batch sizes to determine the optimal one for your use case. Also, try out different warehouse sizes and use the smallest one if you do not notice any performance difference.

