
How to Use Mimesis and dbt to Test Data Pipelines


In this blog post, we introduce a framework for data pipeline testing using dbt (data build tool) and Mimesis, a fake data generator for Python.

Data quality is of utmost importance for the success of data products, and ensuring the robustness and accuracy of data pipelines is key to achieving this quality. This is where data pipeline testing becomes essential. However, testing data pipelines effectively comes with several challenges, including the availability of test data, automation, a proper definition of test cases, and the ability to run end-to-end data tests seamlessly during local development.

This blog post introduces a framework for testing data pipelines built with dbt (data build tool) using Mimesis, a Python library for fake data generation. Specifically, we will

  • use Pydantic to parse and validate dbt’s schema.yml files
  • implement a generic approach to auto-generate realistic test data based on dbt schemas
  • use the generated data to test a dbt data pipeline.

Combining dbt’s built-in testing capabilities with Mimesis’s data generation abilities allows us to validate data pipelines effectively, ensuring data quality and accuracy.

All files discussed in this article are available in an accompanying GitHub repository. If you want to follow along, make sure to clone the repository.

Prerequisites

We assume the reader is familiar with the following topics:

  • Python Basics, incl. basic object-oriented programming
  • dbt Basics, incl. model definitions, seeds, basic project structures, and data tests
  • Pydantic Basics, incl. model definitions and parsing YAML files into Pydantic models

If you want to follow along, you must have Python 3.10 installed on your machine. Alternatively, our GitHub repository contains a Development Container Specification that allows you to set up a working environment quickly (requires Docker to be installed on your machine).

Motivation

As discussed in detail in this blog post, data quality is paramount for successful data products. High-quality data is data that meets domain-specific assumptions to a high degree. This includes:

  • Semantic correctness: e.g., an email address matches a particular pattern, or the measured value of a sensor is always within a specific interval.
  • Syntactic correctness: e.g., fields and data types are correct, and constraints are met.
  • Completeness.

Meeting the requirements above heavily depends on the correctness and robustness of data pipelines, which makes testing them so important. Testing data pipelines ensures accurate transformations and consistent schemas. It helps to catch errors early and to prevent downstream impacts. However, effectively running data tests comes with some challenges:

  • Realistic Test Data: Using production data can raise privacy concerns, while manually creating (realistic) test datasets is time-consuming and often incomplete.
  • Dynamic Environments: Adapting to changing schemas or new sources can introduce errors.
  • Automation in Testing: Data pipeline runs are often time-consuming and costly. Development teams need the means to run data tests automatically, many times throughout the development lifecycle, to detect problems early on.
  • Local Test Execution: Data pipeline tests should not only be automated but also accessible to developers on their local machines during development. This requires that developers can easily generate test data and execute data pipeline tests end-to-end.

Tools like dbt simplify data pipeline testing by providing a framework for modular, testable SQL-based workflows. But to truly test a pipeline effectively, we also need realistic datasets to simulate real-world scenarios – this is where libraries like Mimesis, a powerful fake data generator, come into play. The framework we introduce in this blog post aims to tackle the challenges above: it allows developers to quickly auto-generate test data based on a schema definition and to run data pipeline tests both locally and as part of CI/CD.

Mimesis – Fake Data Generation

Mimesis is a Python library that generates realistic data for testing purposes. It supports a wide range of data types, making it ideal for testing data pipelines without relying on production datasets. It also offers built-in providers for generating data related to various areas, including food, people, transportation, and addresses across multiple locales.

Let’s look at how easy it is to generate fake data with Mimesis.
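
For example, the Person provider generates realistic personal data with just a few lines (a minimal sketch; the generated values will differ on every run):

```python
from mimesis import Person
from mimesis.locales import Locale

person = Person(Locale.EN)

print(person.full_name())  # e.g., "Jane Doe"
print(person.email())      # e.g., "jane_doe@example.org"
```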

Furthermore, we can use the Fieldset class to generate multiple values at once when larger amounts of data are needed.
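
A minimal sketch, assuming a recent Mimesis version that ships the Fieldset class:

```python
from mimesis import Fieldset
from mimesis.locales import Locale

# i defines how many values each call returns
fs = Fieldset(locale=Locale.EN, i=5)

names = fs("person.full_name")  # list of 5 full names
emails = fs("email")            # list of 5 email addresses
```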

Now that we have a basic understanding of how Mimesis operates, we’re all set to generate fake data and use it to test our dbt pipelines.

Combine dbt and Mimesis for Robust Data Pipeline Testing

Combining dbt’s transformation and testing capabilities with Mimesis’s ability to generate realistic test data allows us to create a strong framework for building reliable, scalable data pipelines. In the following sections, we’ll make our way up to testing our dbt pipelines with auto-generated fake data step-by-step.

First, we’ll set up our environment and install the necessary dependencies. Next, we’ll look into parsing dbt schemas using Pydantic. Finally, we’ll explore using the Pydantic models to automatically generate fake data for an arbitrary dbt seed or model.

Setting Up Your Environment

First, clone the GitHub repository and navigate to the root directory:
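
For example (the placeholders stand in for the actual repository URL and directory name):

```bash
git clone <repository-url>
cd <repository-directory>
```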

There are two options to set up your environment:

Option 1: Use the Development Container (recommended)

The repository includes a Development Container Specification for quick setup. All necessary dependencies are already installed if you run the code inside the Development Container.

Option 2: Manual Setup

Install the required dependencies using Poetry. This assumes you have Python 3.10 installed on your machine.
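
With Poetry available, installing the dependencies is a single command:

```bash
poetry install
```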

This project uses DuckDB. To create a DuckDB database, you must install the DuckDB CLI on your machine. You can follow this guide to install it on your OS. Next, run the following command to create a database file inside the dbt_mimesis_example directory:
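
A sketch of the command; the database file name (dev.duckdb here) is an assumption and must match the path configured in the project's profiles.yml:

```bash
# Opening the CLI with a file path creates the database file if it
# does not exist; type .exit to leave the DuckDB shell afterwards.
duckdb dbt_mimesis_example/dev.duckdb
```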

Testing dbt Pipelines Using Mimesis

With a working environment in place, we are ready to look at the code we will use to generate realistic data for testing our dbt pipelines.

[Figure: Lineage graph of our dbt project]

The lineage graph of the dbt_mimesis_example dbt project shows two seeds – raw_airplanes and raw_flights – and three downstream models that depend on them: airplanes, cities, and flights. In dbt, seeds are CSV files, typically located in the seeds directory, that can be loaded into a data warehouse using the dbt seed command. Hence, to properly test our dbt pipeline, we need two CSV files: raw_airplanes.csv and raw_flights.csv. To this end, we will use Pydantic to parse the schema.yml file inside the seeds directory. Subsequently, we'll use the parsed schema definition to auto-generate fake data.

Step 1: Parsing dbt Schemas With Pydantic

Pydantic is a Python library that validates data using type annotations. It allows you to define models as Python classes, validate data against those models, and parse various input formats (e.g., JSON, YAML) into structured Python objects. This makes it an ideal tool for working with dbt’s schema.yml files, as it ensures that the schema definitions are valid and compatible with downstream processes like our test data generation.

Inside data_generator/models.py, we define a few Pydantic models to parse our dbt schema.yml files into structured Python objects:

The DBTSchema model represents a list of DBTTable objects. A DBTTable, in turn, consists of a name and a list of columns represented by the DBTColumn model. Finally, a DBTColumn has a name, a data type, an optional list of data tests, and an optional meta dictionary containing metadata about the column.
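
A sketch of what these models might look like; the field names (e.g., seeds) follow the structure of a typical seeds/schema.yml, and the exact definitions live in data_generator/models.py:

```python
from typing import Any

from pydantic import BaseModel


class DBTColumn(BaseModel):
    name: str
    data_type: str
    data_tests: list[str] | None = None
    meta: dict[str, Any] | None = None


class DBTTable(BaseModel):
    name: str
    columns: list[DBTColumn]


class DBTSchema(BaseModel):
    seeds: list[DBTTable]
```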

We can now use these models to parse our YAML-based dbt schema definition into structured Python objects:
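
For example, assuming Pydantic v2 and PyYAML:

```python
import yaml

with open("dbt_mimesis_example/seeds/schema.yml") as f:
    schema = DBTSchema.model_validate(yaml.safe_load(f))

for table in schema.seeds:
    print(table.name, [column.name for column in table.columns])
```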

Step 2: Auto-Generating Test Data with Mimesis

The next step is to use the structured Python objects representing the dbt schema to generate fake data. To this end, we create a TestDataGenerator class inside data_generator/generator.py that implements functionality to generate the data. Below, we’ll break it down step by step.

Initialize the Class

Let's define a constructor for our class. We need some key attributes, such as the schema and the locale, as well as a mapping of the data types in our schema to the corresponding Mimesis data types. Furthermore, we want to leverage the powerful providers Mimesis offers, so we add a field_aliases attribute that maps column names to providers. The class has two more attributes, initialized with None and an empty dictionary, respectively: the iterations attribute lets us decide later how many rows to generate for a given table, and the reproducible_id_store will help store primary keys for cross-referencing.
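
A sketch of the constructor along the lines described above; the concrete type mapping and defaults are assumptions, and the actual implementation lives in data_generator/generator.py:

```python
from mimesis.locales import Locale


class TestDataGenerator:
    def __init__(
        self,
        schema: DBTSchema,
        locale: Locale = Locale.EN,
        field_aliases: dict[str, str] | None = None,
    ) -> None:
        self.schema = schema
        self.locale = locale
        # Map schema data types to generic Mimesis fields (assumed mapping).
        self.data_type_mapping = {
            "varchar": "text.word",
            "integer": "numeric.integer_number",
            "double": "numeric.float_number",
        }
        # Map column names to specific providers, e.g. {"city": "address.city"}.
        self.field_aliases = field_aliases or {}
        # Number of rows to generate per table; populated later.
        self.iterations: dict[str, int] | None = None
        # Caches generated primary keys for cross-referencing.
        self.reproducible_id_store: dict[str, set] = {}
```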

Generate Random Row Counts

It’s not very realistic if each generated table has the same number of rows. Therefore, we’ll add a method _generate_random_iterations that assigns a random number of rows to each table within a specified range (i.e., between min_rows and max_rows). We store the resulting dictionary in the iterations class attribute mentioned earlier.
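
A minimal sketch of the method:

```python
import random


# Method of TestDataGenerator (shown standalone for brevity)
def _generate_random_iterations(self, min_rows: int, max_rows: int) -> None:
    # Assign each table a random row count within [min_rows, max_rows].
    self.iterations = {
        table.name: random.randint(min_rows, max_rows)
        for table in self.schema.seeds
    }
```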

Generate Unique Values

In some cases, our schema contains a uniqueness constraint in the form of a unique data test, or a column is defined as a primary key column. In these cases, we need a way to generate unique values. The _generate_unique_values method takes as inputs a DBTTable object and the DBTColumn for which values should be generated, and returns a set of values generated using Mimesis.

Certain Mimesis providers may not generate enough unique values. For instance, the Airplane provider can only produce ~300 unique airplane models. This becomes problematic if the number of iterations specified for the given table is larger than the maximum number of unique values available. Therefore, we must handle this edge case and adapt the maximum number of rows generated for the particular table to the maximum number of unique values available.
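
A sketch of the method, including the edge-case handling; the retry bound and the fallback field are assumptions:

```python
from mimesis import Field


# Method of TestDataGenerator (shown standalone for brevity)
def _generate_unique_values(self, table: DBTTable, column: DBTColumn) -> set:
    # Resolve the Mimesis field: column-specific alias first, then type mapping.
    field_name = self.field_aliases.get(
        column.name, self.data_type_mapping.get(column.data_type, "text.word")
    )
    field = Field(locale=self.locale)

    target = self.iterations[table.name]
    values: set = set()
    attempts, max_attempts = 0, target * 10  # guard against exhausted providers
    while len(values) < target and attempts < max_attempts:
        values.add(field(field_name))
        attempts += 1

    # If the provider cannot yield enough unique values, shrink the row count.
    self.iterations[table.name] = min(target, len(values))
    return values
```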

Generate Data for a Table

Next, we need a method that creates synthetic data for a single table by iterating over its columns. The _generate_test_data_for_table method takes a DBTTable object as input and returns a Pandas data frame with the generated fake data for the corresponding table.

In the context of relational databases, there are usually relationships between tables. Whenever a table references a primary key from another table, that reference is called a foreign key. Accordingly, a value in a foreign key column must either be null or present as a primary key in the referenced table – a property referred to as referential integrity. Mimesis does not natively support referential integrity when generating data, so we apply some logic to consider primary and foreign keys during data generation. To this end, we use the meta field, an optional part of dbt's schema definitions. Specifically, we added metadata describing whether a column is a primary or foreign key to the dbt_mimesis_example/seeds/schema.yml file.

Within the _generate_test_data_for_table method, we also check whether a column is a primary key, a foreign key, or a regular column and handle it accordingly.
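
A sketch of the method; the meta keys primary_key and foreign_key, as well as the helper name _handle_regular_column, are assumptions about how the actual implementation encodes this:

```python
import pandas as pd


# Method of TestDataGenerator (shown standalone for brevity)
def _generate_test_data_for_table(self, table: DBTTable) -> pd.DataFrame:
    data: dict[str, list] = {}
    for column in table.columns:
        meta = column.meta or {}
        if "primary_key" in meta or "foreign_key" in meta:
            # Key columns need special handling for referential integrity.
            data[column.name] = self._handle_key_column(table, column)
        else:
            data[column.name] = self._handle_regular_column(table, column)
    return pd.DataFrame(data)
```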

Handle Foreign and Primary Keys

In case the column is defined as a primary or foreign key column, the _handle_key_column method is called.

First, it checks whether a set of values for the (referenced) primary key is already available in the reproducible_id_store class attribute (as you might remember, this is an initially empty dictionary). If not, it generates a unique set of values for the primary key and adds it to the dictionary. Finally, it returns either a random sample of the set of values available from the referenced primary key column or the set of values itself – depending on whether it’s a foreign or a primary key.
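
A sketch of this logic; the exact meta layout for referencing another table's primary key is an assumption:

```python
import random


# Method of TestDataGenerator (shown standalone for brevity)
def _handle_key_column(self, table: DBTTable, column: DBTColumn) -> list:
    meta = column.meta or {}
    # For a foreign key, reference the primary key column it points to;
    # for a primary key, reference the column itself.
    reference = meta.get("foreign_key", f"{table.name}.{column.name}")

    if reference not in self.reproducible_id_store:
        # Generate and cache a unique value set for the (referenced) primary key.
        self.reproducible_id_store[reference] = self._generate_unique_values(
            table, column
        )

    values = list(self.reproducible_id_store[reference])
    if "foreign_key" in meta:
        # Foreign key: draw a random sample from the referenced primary keys.
        return random.choices(values, k=self.iterations[table.name])
    # Primary key: return the unique values themselves.
    return values
```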

Handle Regular Columns

Otherwise, the column is handled as a regular, non-key column. In that case, we return a set of unique values if the unique data test is set for the column; otherwise, we return a list of generated values without ensuring uniqueness. The list might also include null values, depending on whether or not the not_null data test is part of the column specification.
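
A sketch of this path, shown here as a hypothetical _handle_regular_column helper; the 10 % null probability is an arbitrary assumption:

```python
import random

from mimesis import Fieldset


# Method of TestDataGenerator (shown standalone for brevity)
def _handle_regular_column(self, table: DBTTable, column: DBTColumn) -> list:
    data_tests = column.data_tests or []
    if "unique" in data_tests:
        return list(self._generate_unique_values(table, column))

    field_name = self.field_aliases.get(
        column.name, self.data_type_mapping.get(column.data_type, "text.word")
    )
    fieldset = Fieldset(locale=self.locale, i=self.iterations[table.name])
    values = fieldset(field_name)

    if "not_null" not in data_tests:
        # Occasionally inject nulls when the column allows them.
        values = [None if random.random() < 0.1 else value for value in values]
    return values
```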

Generate Data for the Entire Schema

Finally, we want to implement logic to generate data for an entire schema. The generate_data method orchestrates the whole process by calling the helper methods we implemented to create data for all tables in a schema.
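
A sketch of the orchestration:

```python
# Method of TestDataGenerator (shown standalone for brevity)
def generate_data(
    self, min_rows: int = 100, max_rows: int = 1000
) -> dict[str, pd.DataFrame]:
    # Decide how many rows each table gets, then generate all tables.
    self._generate_random_iterations(min_rows, max_rows)
    return {
        table.name: self._generate_test_data_for_table(table)
        for table in self.schema.seeds
    }
```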

Full Picture

We’re all set to use the TestDataGenerator to generate some actual data based on the schema.yml file. Let’s add some field aliases to use Mimesis’s city and airplane providers.
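
Putting it together might look like this; the column names used in field_aliases are assumptions about the seed schemas:

```python
field_aliases = {
    "city": "address.city",                  # Mimesis Address provider
    "airplane_model": "transport.airplane",  # Mimesis Transport provider
}

generator = TestDataGenerator(schema=schema, field_aliases=field_aliases)

# Write one CSV per seed into the dbt seeds directory.
for table_name, df in generator.generate_data().items():
    df.to_csv(f"dbt_mimesis_example/seeds/{table_name}.csv", index=False)
```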

The repository also contains a data_generator/main.py file with a simple CLI implemented using click. It allows you to generate test data for an arbitrary dbt schema using the following command:
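
An invocation might look like this; the entry point may take additional options (such as a schema path), so check the CLI's --help output:

```bash
# Defaults shown; adjust the row counts via --min-rows / --max-rows
python data_generator/main.py
```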

In this case, we generate a random number of rows within the range 100–1000 for each of the seeds in our dbt_mimesis_example/seeds/schema.yml. You can adjust the number of rows by setting the --min-rows and --max-rows flags. Note: with larger row counts (e.g., a minimum of more than 1,000,000 rows), generating the data might take some time.

Using Mimesis to Test dbt Pipelines

We successfully generated test data based on our dbt seeds schema using the described method. Now that we have some CSV files in our dbt seeds directory, let's run the dbt seed command to load the data into our DuckDB database and then run our dbt pipeline and tests:
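
For example, from within the dbt project directory:

```bash
cd dbt_mimesis_example
dbt seed  # load the generated CSVs into DuckDB
dbt run   # build the downstream models
dbt test  # run the data tests
```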

If everything went as expected, all tests should have passed, and you should now have five tables in your DuckDB database. Let’s inspect some values:
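
For example, querying the flights table from the DuckDB CLI connected to the project database:

```sql
SELECT * FROM flights LIMIT 10;
```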

In our case, the output of the SQL command looks like this. Yours should look similar but with different values, as Mimesis generates them randomly:

[Figure: Sample values for the flights table]

Conclusion

This blog post explored how to use Mimesis to test data pipelines. We implemented a data generator that automatically creates fake data based on Pydantic models derived from parsed dbt schemas.

If you haven't done so already, check out our GitHub repository. It includes a GitHub Actions pipeline that automates dbt data testing using the approach we discussed. Have you tried applying this method to your dbt pipelines? We would love to hear your feedback! Happy testing! 🥳

If you want to dive deeper into the topics covered in this blog post, we recommend checking out the following resources:
