Note: This is not really an article, but rather a set of notes on how our team uses dbt. We focus only on the data analysis part, and some tools and services are not mentioned here.

Our technical stack

 
 
  • BigQuery: data storage + development platform (on SQL workspace)
  • Airflow on Google Cloud Composer: code execution platform, job scheduling, monitoring
  • dbt client: SQL script execution dependencies, documentation, and testing
  • GitHub: code versioning, CI/CD (deployment, code quality, tests)
  • Google Data Studio: dashboards platform
  • Visual Studio Code (or similar): local code development and test execution

 

Google Cloud Composer and Google BigQuery are Google Cloud Platform (GCP) services. We manage permissions and access with Identity and Access Management (IAM). We have two projects, one for staging and one for production.

 

Our architecture for analysis

Data analysts mainly use SQL scripts to build their analyses. We run several hundred SQL scripts every day. Each script creates a table in BigQuery (with a drop/create or an incremental strategy).

 

The analysis is split into 3 parts:

  • cleaning raw data

  • combining data and building some common metrics

  • building visualization tables that we can plug into the dashboards (here we use Data Studio)

The development process

What is the compile step?

In dbt, you use Jinja templates inside your SQL code. To evaluate those templates and get plain SQL that you can run in the BigQuery SQL workspace, you need to compile your code.
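
As a rough illustration of what "compiling" means, the sketch below replaces `{{ ref('...') }}` calls with fully qualified BigQuery table names using a regular expression. This is not how dbt does it (the real command is `dbt compile`, which evaluates the whole Jinja template and resolves references from the project); the `compile_refs` helper and the project and dataset names are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} written with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([A-Za-z0-9_]+)['\"]\s*\)\s*\}\}")


def compile_refs(sql: str, project: str, dataset: str) -> str:
    """Replace each ref() call with a backtick-quoted BigQuery table name."""

    def replace(match: re.Match) -> str:
        return f"`{project}.{dataset}.{match.group(1)}`"

    return REF_PATTERN.sub(replace, sql)


# Hypothetical model source containing a Jinja reference to another model.
templated = "select name, age from {{ ref('user') }} where age >= 18"
compiled = compile_refs(templated, project="my-staging-project", dataset="analytics")
print(compiled)
```

The printed statement contains no Jinja anymore, so it can be pasted as-is into the BigQuery SQL workspace.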

 

Run and test models

Inside a dbt project, each SQL script represents a model. Once you have a lot of models, you need to be careful about running the whole workflow (dbt run without filters), because it can quickly consume a lot of resources.

 

To avoid this, we created custom commands in a makefile. Data analysts do not run dbt commands directly; they only use the makefile commands. This is also easier and less technical for them, since they don't have to know the dbt commands and all their possible parameters.

 

We implemented a “magic method” that runs all models that have not been tested since they were last modified. It also formats the code and performs a few other tasks. To do this, we store the md5 checksum of each model.
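
The checksum idea can be sketched as follows. This is a minimal illustration, not our actual tooling: it assumes models are `.sql` files under one directory and that the checksums from the last successful test run are kept in a JSON file. All function names and the state file are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def model_checksums(models_dir: Path) -> dict:
    """Return {relative model path: md5 hex digest} for every .sql file."""
    return {
        str(path.relative_to(models_dir)): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(models_dir.rglob("*.sql"))
    }


def changed_models(models_dir: Path, state_file: Path) -> list:
    """List models whose checksum differs from the one stored at the last test run."""
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = model_checksums(models_dir)
    return [name for name, digest in current.items() if previous.get(name) != digest]


def save_checksums(models_dir: Path, state_file: Path) -> None:
    """Record the current checksums; call this once the tests have passed."""
    state_file.write_text(json.dumps(model_checksums(models_dir), indent=2))
```

A makefile command could call `changed_models`, run and test only the returned models, and then call `save_checksums`. The same comparison can be repeated in CI to reject a PR that contains untested models.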

 

By doing this, we are sure that data analysts have tested their code before creating a PR. We also added a check in our GitHub CI/CD (GitHub Actions) to make sure all models are tested; otherwise the PR is rejected.

 

Documentation format

For each model, we create a .yml file that contains the documentation of the model (plus, in some cases, tests).

 

This is an example of the documentation file:

 

models:
  - name: user
    description: User of the database (client)
    columns:
      - name: name
        description: First name of the user
        meta:
          sensitive: True
      - name: age
        description: Age of the user
        meta:
          sensitive: True

 

The problem here is that it is quite hard to be sure that the documentation is up to date with the current model. We developed some utility functions in Python that check the model, compare it with the documentation, and print the differences.
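
A comparison of this kind could look like the sketch below. It assumes the model's actual column names (for example read from the BigQuery table schema) and the documented column names (read from the .yml file) are already available as lists. The `diff_columns` helper and the sample columns are hypothetical.

```python
def diff_columns(model_columns: list, documented_columns: list) -> dict:
    """Compare the columns a model produces with the columns its documentation lists."""
    actual = set(model_columns)
    documented = set(documented_columns)
    return {
        "missing_in_docs": sorted(actual - documented),
        "missing_in_model": sorted(documented - actual),
    }


# Hypothetical inputs: the model gained an "email" column that is not documented yet.
model_columns = ["name", "age", "email"]
documented_columns = ["name", "age"]

diff = diff_columns(model_columns, documented_columns)
for kind, names in diff.items():
    for column in names:
        print(f"{kind}: {column}")
```

A CI step can fail the build whenever either list in the result is non-empty.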

 

These utilities also automatically generate the documentation file with pre-filled columns. We integrated a check in our GitHub CI/CD (GitHub Actions) to make sure the documentation is up to date; otherwise the PR can't be merged.
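
Generating a pre-filled documentation file can be as simple as the following sketch, which emits the same layout as the example file above with empty descriptions to fill in. The `documentation_skeleton` helper is hypothetical, and the `meta` block is left out for brevity.

```python
def documentation_skeleton(model_name: str, columns: list) -> str:
    """Build the text of a documentation .yml file with empty descriptions."""
    lines = [
        "models:",
        f"  - name: {model_name}",
        '    description: ""',
        "    columns:",
    ]
    for column in columns:
        lines.append(f"      - name: {column}")
        lines.append('        description: ""')
    return "\n".join(lines) + "\n"


# Hypothetical usage: column names would normally come from the built table.
skeleton = documentation_skeleton("user", ["name", "age"])
print(skeleton)
```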

 

Our current feeling about dbt

We started using the dbt client for two reasons:

  • Handle SQL script dependencies. Before that, we were executing scripts one by one directly on Airflow
  • Have documentation inside the code, and have a nice web UI to explore this documentation. Before that, we documented the code in Google Sheets spreadsheets (similar to Excel files)

 

Now we are very happy because we solved those two problems without implementing a “homemade solution”. dbt is becoming a standard for this kind of data analysis workflow, and it is always better for the lifecycle of a project to use a standard solution.

However, it is important to mention the less positive points too. dbt is a young product, and we noticed this when we tried to install it on Google Cloud Composer. It was very problematic: the dbt Python library has a lot of dependencies, so we ran into conflicts with other libraries that Composer installs by default. This hurts the library's compatibility and can quickly become a constraint. We fixed it by creating a virtual environment inside the Airflow task before running dbt.
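
The virtual environment workaround can be sketched as a helper that builds the shell command an Airflow task would run (for instance through a BashOperator). This is an illustration under assumptions, not our exact setup: the helper name, the paths, and the unpinned `dbt-core` and `dbt-bigquery` packages are placeholders, and in practice you would pin versions.

```python
import shlex


def dbt_in_venv_command(venv_dir: str, project_dir: str, dbt_args: list) -> str:
    """Build one shell command that creates a venv, installs dbt, and runs it."""
    venv = shlex.quote(venv_dir)
    steps = [
        # An isolated environment avoids conflicts with Composer's own libraries.
        f"python3 -m venv {venv}",
        f"{venv}/bin/pip install --quiet dbt-core dbt-bigquery",
        f"cd {shlex.quote(project_dir)}",
        f"{venv}/bin/dbt " + " ".join(shlex.quote(arg) for arg in dbt_args),
    ]
    return " && ".join(steps)


# Hypothetical paths; the project directory depends on where the dbt project is deployed.
command = dbt_in_venv_command(
    "/tmp/dbt_venv", "/home/airflow/gcs/dags/dbt", ["run", "--select", "user"]
)
print(command)
```

Because the steps are chained with `&&`, the task fails as soon as the environment creation or the installation fails, instead of running dbt in a half-built environment.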

 

We also realized that most of the development effort is focused on dbt Cloud, which is understandable because it is a paid solution. But when you use the dbt client, this can be frustrating. For example, we had some difficulties deploying the documentation. It was supposed to be a static website, but because of the way it is implemented, you need a web server.

To conclude, migrating to dbt is not so easy if you are not starting a project from scratch. There are many things to consider. Feel free to contact me if you have any technical questions.
