BI Future Blog

vrijdag 16 februari 2024

Certified DBT Developer certification

Wow, I finally managed to pass the exam of DBT Developer certification. I failed two times first because I didn't had much of handson experience with DBT. That will change in the future. So, I had to learn everything from the documentation and from a lot of Udemy tests. Although the tests available on the internet doesn't completely cover the questions on the real exam of DBT, it helped a lot to understand DBT. I did that by researching the answers, googling and reading the documentation of DBT. Last but not least I practiced a bit with DBT on a small mock-up project that I've created.

I also created a summary of all the things I learned in PowerPoint slides. I do this now for a couple of years now and I have an extensive list of PowerPoints created on subjects like DevOps, Snowflake, Microsoft, Azure, Data modelling, Datavault, Git, Scrum, etc and now DBT. I helps me to understand and organize the information.

Here is my list of powerpoints of DBT.

Here are interesting courses and tests on Udemy

And I really enjoyed following this course on Udemy (also because it's for Snowflake).

Qanalabs also has a good list of questions (and explanantions) that helps.

Hennie

zaterdag 30 december 2023

Ephemeral tables in DBT (Part XI)

Introduction

In this blogpost we're going to take a look at a specific type of a materialization, known as ephemeral. According to DBT : materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt. They are:

table
view
incremental
ephemeral
materialized view

In this blogpost we'll focus on the ephemeral materialization. Now one thing that pops up out out of my head right away is the continuing discussing between a dimensional model as Kimball has defined and the Snowflake (not the data platform) datamodel. Sometimes there are in multiple dimensions the same aggregations used. In the example of AdventureWorksLT think about ProductCategory. Suppose that a ProductCategory is used in multiple dimensions and you want adopt the DRY principle. Then, you can decide to ProductCategory as Ephemeral materialization. Lets try this out!

This blogpost is a blogpost in a serie of blogposts :

Ephemeral example

Suppose that you have a dimension ProductCategory that is cleaned up and has quite some transformations. You don't want to apply the same logic again and again for different (other) dimensions. Ok, now lets go the Dim_ProductCategory.sql file and make it an ephemeral materialization:

Next, we need to reference this ephemeral materialization in the Product dimension.

The next step is to run the model

Now what has DBT made from the code? Lets check this in the target folder :

An extra CTE is added to the Dim_Product and is around the CTE_DIM_PRODUCTCATEGORY. The ephemeral model is incorporated in another model.

Final thoughts

In this bogpost I've investigated the ephemeral materialization of a model. Interesting to see that it is possible to reuse logic of a model in other models.

Hennie

vrijdag 29 december 2023

dbt Analytics Engineering Certification Exam cheatsheet

Introduction

I'm preparing for the DBT exam and I used the "dbt Analytics Engineering Certification Exam Study Guide" for searching for more information about the topics covered in the exam. I stored the links that I used in this blogpost, just helping you too. I hope you enjoy it.

I'll frequently update this blogpost. I just started with a couple of topics.

Topic 1: Developing dbt models

Identifying and verifying any raw object dependencies

Pro-tips for workflows

Understanding core dbt materializations

Materializations

Conceptualizing modularity and how to incorporate DRY principles

Converting business logic into performant SQL queries

Using commands such as run, test, docs and seed

dbt Command reference

Creating a logical flow of models and building clean DAGs

Defining configurations in dbt_project.yml

Configuring sources in dbt

Using dbt Packages

Topic 2: Debugging data modeling errors

Understanding logged error messages

Troubleshooting using compiled code

Troubleshooting .yml compilation errors

Distinguishing between a pure SQL and a dbt issue that presents itself as a SQL issue
Developing and implementing a fix and testing it prior to merging

Topic 3: Monitoring data pipelines

Understanding and testing the warehouse-level implications of a model run failing at different points in the DAG

Understanding the general landscape of tooling

Topic 4: Implementing dbt tests

Using generic, singular and custom tests on a wide variety of models and sources

Understanding assumptions specific to the datasets being generated in models and to the raw data in the warehouse

Implementing various testing steps in the workflow.

Ensuring data is being piped into the warehouse and validating accuracy against baselines

Topic 5: Deploying dbt jobs

Understanding the differences between deployment and development environments
Configuring development and deployment environments
Configuring the appropriate tasks, settings and triggers for the job
Understanding how a dbt job utilizes an environment in order to build database objects and artifacts
Using dbt commands to execute specific models

Topic 6: Creating and Maintaining dbt documentation

Updating dbt docs
Implementing source, table, and column descriptions in .yml files
Using dbt commands to generate a documentation site
Using macros to show model and data lineage on the DAG

Topic 7: Promoting code through version control

Understanding concepts and working with Git branches and functionalities
Creating clean commits and pull requests
Merging code to the main branch

Topic 8: Establishing environments in data warehouse for dbt

Understanding environment’s connections
Understanding the differences between production data, development data, and raw data

Working with hooks in DBT (Part X)

Introduction

In one of my projects I've built a logging mechanism with stored procedures in Snowflake. It will log the start of a run, the start of load procedure, the end of the load procedure and the end of the run. You can do this also with hooks in DBT.

From DBT docs : Hooks are snippets of SQL that are executed at different times:

pre-hook: executed before a model, seed or snapshot is built.
post-hook: executed after a model, seed or snapshot is built.
on-run-start: executed at the start of dbt build, dbt compile, dbt docs generate, dbt run, dbt seed, dbt snapshot, or dbt test.
on-run-end: executed at the end of dbt build, dbt compile, dbt docs generate, dbt run, dbt seed, dbt snapshot, or dbt test.

This blogpost is a blogpost in a serie of blogposts :

Hooks

First, lets create a RUN table and a MESSAGE table in Snowflake. I've used the table that I'm currently using in my project. The first table RUN is used for storing the complete Run (in my case the complete load for Staging, Datavault and Datamart). Next, for every load procedure I've created a MESSAGE table that will store the start and the end of the load and, if the procedure runs into an error the error is logged.

--Auditing
CREATE TABLE IF NOT EXISTS  RUN (
    RUN_ID              INT AUTOINCREMENT START 1 INCREMENT 1,
    RUN_STARTDATE       TIMESTAMP_NTZ,
    RUN_ENDDATE         TIMESTAMP_NTZ
);

CREATE TABLE IF NOT EXISTS MESSAGE (
    MESSAGE_ID                  INT autoincrement start 1 increment 1,
    RUN_ID                      INT,                -- Runnumber of the complete run    
    MESSAGE_STATUS              STRING,             -- START, END, ERROR, INTERMEDIATE    
    MESSAGE_DESCRIPTION         STRING,             -- Logging message or error message
    MESSAGE_LOCATION            STRING,             -- Individual Stored procedurename
    MESSAGE_LAYER               STRING,             -- Database layer : DATAVAULT, DATAMART
    MESSAGE_AREA                STRING,             -- Subject Area : LEGACY, FINANCE, etc
    MESSAGE_LEVEL               NUMBER,             -- Stored procedure level : 1 = MASTER, 50 = DATAVAULT or DATAMART, 100 = Low level stored procedure  
    MESSAGE_DATE                TIMESTAMP_NTZ,      -- Date of logging
    MESSAGE_ROWCNT              NUMBER              -- Affected rows of SQL executed 
);

Now, I'm not rebuilding the complete logic into DBT, but I will take some key elements of the logging and build that into DBT.

on-run-start: "INSERT INTO RUN (RUN_STARTDATE) SELECT CURRENT_TIMESTAMP();"
on-run-end: "INSERT INTO RUN (RUN_STARTDATE) SELECT CURRENT_TIMESTAMP();"

And then run the following command in the project folder :

dbt run -m Dim_Customer

Resulting in the following :

How does it look in Snowflake when the command has been executed?

It's not completely the way it should work because I normaly use one line for the start and the end of the run. But for the sake of this demo I just wanted to show the hook functionality. Maybe, in a later blogpost, I'll elaborate this into a real working example.

Now the lets check the other hooks in DBT : pre-hook and post-hook. Here I added some code in the dbt_project.yml file : "INSERT INTO MESSAGE table".

and then execute the dbt run command :

dbt run -m Dim_Customer

Resulting in :

In contrast with the run hooks it doesn't return a confirmation that the model hooks were executed, but in the log table it shows that there are rows inserted.

Using Macros

Now it seems that is not really possible to use {{ this }} in the dbt_project file. I think it's because of that the dbt_project is not from a model perspective. If you use the macro for the logging, you can use this {{ this }}. First, create a macro in the project :

I copied the code from the hook in the macro and I added some macro magic around the code. I added also {{ this }} for the reference to the model that I'm using. You will see this later in the snippet of the table in Snowflake.

The hook is also changed now :

And if I run the DBT RUN command, the following nothing special happens :

And now lets check the table in Snowflake for values :

And here you see that the {{ this }} is the name of the model in your project.

Final thoughts

Now there are a lot of improvements possible with this logging in DBT, but I hope that this blogpost gives you some insight on the hooks.

Hennie

donderdag 28 december 2023

Adding documentation to your DBT project (Part IX)

Introduction

Sometimes you get the question from a collegue about that he or she wants to know more about how a specific measure is generated. DBT has a documentation option you can use for this. Lets try this and I'll show this in this blogpost.

This blogpost is a blogpost in a serie of blogposts :

Documentation

There are two important commands for documentation : generate and serve. Lets find out more about them. For using the documentation, you can add a description tag to your YAML files.