Note: Here is not really an article, but more some notes about how we use dbt in our team. We only focus on the data analysis part. Some tools/services are not mentioned here.
Our technical stack
- BigQuery: data storage + development platform (on SQL workspace)
- Airflow on Google Cloud Composer: code execution platform, job scheduling, monitoring
- dbt client: SQL scripts execution dependencies, documentation, and testing
- Github: code versioning, CICD (deployment, code quality, tests)
- Google Data Studio: dashboards platform
- Visual Studio Code (or similar): develop code in local, execute tests
Google Cloud Composer and Google BigQuery are part of Google Cloud Platform (GCP) services. We handle permissions/access by using Identity and Access Management (IAM). We have 2 projects, one for staging and one for production.
Our architecture for analysis
Data analysts mainly use SQL scripts to build their analysis. We run every day several hundred SQL scripts. Each script will create a table into BigQuery (with drop/create or incremental strategy).
The analysis is split into 3 parts:
-
cleaning raw data
-
cross data and building some common metrics
-
build visualization tables that we can plug into the dashboards (here we use Data Studio)
The development process
What is compile step?
In debt, you will use the Jinja template inside your SQL code. To evaluate that Jinja template and have only SQL code that you can run on the BigQuery SQL workspace, you need to compile your code
Run and test models
Inside debt project, each SQL script represents a model. When you start to have a lot of models, you need to be careful when you run the whole workflow (dbt run
without filters). You can quickly consume a lot of resources.
To avoid this, we create custom commands in a makefile. Data analysts are not able to run directly dbt commands, but only makefile commands. It's also easier and less technical for them, they don't have to know dbt commands with all possible parameters.
We implemented a “magic method” that runs all models that weren’t tested since the last test execution. That also formats the code and some other stuff. To do it, we store the md5 checksum of each model.
By doing this, we are sure that data analysts have tested the code before creating a PR. We also add control on GitHub CICD (Github actions), to be sure all models are well tested, otherwise the PR is rejected.
Documentation format
For each model, we create a .yml
file that contains the documentation of the model (plus in some cases, tests).
This is an example of the documentation file:
models:
- name: user
description: User of the database (client)
columns:
- name: name
description: First name of the user
meta:
sensitive: True
- name: age
description: Age of the user
meta:
sensitive: True
The problem here is that it’s quite hard to be sure that the documentation is up to date with the current model. We develop some utils functions in python that will check the model and compare it with the documentation, it will print differences.
It will also automatically generate the documentation file with pre-filled columns. We integrate a check in our GitHub CICD (Github actions) to be sure that the documentation is up to date, otherwise the PR can’t be merged.
Our current feeling about debt
We start to use debt client for 2 reasons:
- Handle SQL script dependencies. Before that, we were executing scripts one by one directly on Airflow
- Have documentation inside the code, and have a nice web UI to explore this documentation. Before that, we were using google spreadsheet (similar to Excel files) to document the code
Now we are very happy because we solve those 2 problems without implementing a “homemade solution”. dbt starts to be a standard for this kind of data analysis workflow, and it is always better for the lifecycle of the project to use a standard solution.
However, it is important to talk about the less positive points. dbt is a young product, and we saw it when we try to install it on the Google Composer platform. It was very problematic, in fact, the python dbt library has a lot of dependencies, so we had some conflicts with other default Composer libraries. This is not a positive point for the compatibility of the library, and it can very quickly become a constraint. We fix it by creating a virtual env inside the Airflow task, before running dbt.
We also realized that most of the development is focused on dbt Cloud, this is normal because it’s a paid solution. But when you use dbt client, it can be frustrating. As an example, we had some difficulties deploying the documentation. Normally, it was supposed to be a static website. But the way how they implement it, you need a web server.
To conclude, migration to dbt is not so easy if you don’t start a project from scratch. There are many things to consider. Feel free to if you have any technical questions to contact me.
#bigdata #ai #7wdata #artificialintelligence #cloud #fact #engineering #didyouknow #technology #physics #nasa #space #facts #universe #knowledge #dailyfacts #biology #factz #chemistry #astronomy #education #earth #memes #cosmos #amazing #nature #allfacts #tech #innovation #astrophysics