Monday, June 27, 2022

DBT and Snowflake (part I)

Introduction

Hi everyone! It has been a while since I posted a blog post, so it's time for something new. Today I would like to discuss dbt with Snowflake. What is it, and how do you set up dbt with Snowflake? For this blog post I'll follow the step-by-step guide in the dbt tutorial.

As mentioned in the tutorial, the basic steps are:
  • Set up a warehouse with sample data.
  • Connect the warehouse to dbt Cloud.
  • Add a Git repository to dbt Cloud (not covered in this blog post).
  • Execute a dbt transformation using dbt run.
  • Schedule a job or transformation.

Let's go...

Set up a warehouse with sample data

The first step is to set up dbt Cloud with one of the most common data platforms: BigQuery, Databricks, or Snowflake. I chose Snowflake.


According to the tutorial I have to create a Snowflake account, but I already have one, so I skip this section and go straight to the part about loading the data.

CREATE WAREHOUSE IF NOT EXISTS transforming;
CREATE DATABASE IF NOT EXISTS raw;
CREATE DATABASE IF NOT EXISTS analytics;
CREATE SCHEMA IF NOT EXISTS raw.jaffle_shop;
CREATE SCHEMA IF NOT EXISTS raw.stripe;

create table raw.jaffle_shop.customers 
( id integer,
  first_name varchar,
  last_name varchar
);

copy into raw.jaffle_shop.customers (id, first_name, last_name)
from 's3://dbt-tutorial-public/jaffle_shop_customers.csv'
file_format = (
  type = 'CSV'
  field_delimiter = ','
  skip_header = 1
  );
  
create table raw.jaffle_shop.orders
( id integer,
  user_id integer,
  order_date date,
  status varchar,
  _etl_loaded_at timestamp default current_timestamp
);

copy into raw.jaffle_shop.orders (id, user_id, order_date, status)
from 's3://dbt-tutorial-public/jaffle_shop_orders.csv'
file_format = (
  type = 'CSV'
  field_delimiter = ','
  skip_header = 1
  );

create table raw.stripe.payment 
( id integer,
  orderid integer,
  paymentmethod varchar,
  status varchar,
  amount integer,
  created date,
  _batched_at timestamp default current_timestamp
);

copy into raw.stripe.payment (id, orderid, paymentmethod, status, amount, created)
from 's3://dbt-tutorial-public/stripe_payments.csv'
file_format = (
  type = 'CSV'
  field_delimiter = ','
  skip_header = 1
  );


The tables are created and loaded from an S3 bucket that dbt has set up to help you load some test data into the tables.

Let's see if we have some data.
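A couple of simple queries will do; this is just my own sanity check, not part of the tutorial:

-- Row counts per loaded table
select 'customers' as table_name, count(*) as row_count from raw.jaffle_shop.customers
union all
select 'orders', count(*) from raw.jaffle_shop.orders
union all
select 'payments', count(*) from raw.stripe.payment;

-- And a peek at the actual rows
select * from raw.jaffle_shop.customers limit 10;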



We are now finished with setting up Snowflake and loading the tutorial data.

Connect the warehouse to dbt Cloud

It seems there are two ways to connect Snowflake with dbt: Partner Connect, or setting up the Snowflake connection manually. I chose option 1. The tutorial is written for the classic interface, so let's see if we can find Partner Connect in the new Snowflake interface.

It's available under the Admin option.



Enter the following information in the screen and press Connect:


Now the partner account has been created:


Enter the credentials for dbt:



And now I've created and activated the free trial account:


It's possible to turn the free trial account into another type of account.



Go to Billing and select a plan:



Here is an overview of the options; I chose the Developer plan and confirmed the choice.


It seems a small tweak is needed in dbt: change the names of the warehouse and database to transforming and analytics, respectively, matching the setup script above.
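As an aside: dbt Cloud stores these names in the connection settings, but if you were running dbt Core locally, the same values would live in ~/.dbt/profiles.yml. A minimal sketch, where the profile name, account, user, password, role, and schema are placeholders:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account>    # placeholder
      user: <your_user>          # placeholder
      password: <your_password>  # placeholder
      role: <your_role>          # placeholder
      warehouse: transforming    # the warehouse created in the setup script
      database: analytics        # the database dbt builds into
      schema: dbt_hennie         # placeholder development schema
      threads: 4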


And now we can initialize the project and start developing. After pressing Start developing, the following window appears.



Execute a dbt transformation using dbt run

Now let's build a model. In dbt, a model is a select statement (!). Strange, but OK; we'll see why dbt calls a select statement a model. Let's press Create a new branch.



And now we have to create a file in the models folder, called models/customers.sql.



With the following content:


with customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from raw.jaffle_shop.customers

),

orders as (

    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status

    from raw.jaffle_shop.orders

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final

And with dbt run we can execute the model (query).
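For the record, these are the commands to type in the command line at the bottom of the dbt Cloud IDE; the --select flag is optional and limits the run to a single model:

dbt run
dbt run --select customers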


Here is an example screen where I've created three models: stg_customers, stg_orders, and customers. The customers model references stg_customers and stg_orders with the ref function.
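Roughly, the staging models each wrap a raw table, and customers then points at them with ref instead of the raw schema; dbt resolves ref() to the right database and schema at run time. A sketch, where the remaining CTEs in customers.sql stay as shown earlier:

-- models/stg_customers.sql
select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers

-- models/stg_orders.sql
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from raw.jaffle_shop.orders

-- models/customers.sql (first CTEs only; the rest is unchanged)
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
)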




Now we can add a schema file to the project. It describes the structure of the models and adds constraints such as 'unique'. I wonder whether dbt actually checks the uniqueness of the column. Run the tests with dbt test.




The code (plus documentation):
version: 2

models:
  - name: customers
    description: One record per customer
    columns:
      - name: customer_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: first_order_date
        description: NULL when a customer has not yet placed an order.

  - name: stg_customers
    description: This model cleans up customer data
    columns:
      - name: customer_id
        description: Primary key
        tests:
          - unique
          - not_null

  - name: stg_orders
    description: This model cleans up order data
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']


And it seems that dbt does check the uniqueness of the key.
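Under the hood, the unique test compiles to an ordinary query, roughly like the one below (a sketch, not the exact SQL dbt generates); the test fails as soon as the query returns any rows:

select
    customer_id,
    count(*) as n_records
from analytics.dbt_hennie.customers  -- assumption: the schema dbt built the model in
where customer_id is not null
group by customer_id
having count(*) > 1;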



Schedule a job or transformation

The next step is to run a job. First we need to commit the code and create a new environment.



And now we can create a new job.
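A job in dbt Cloud boils down to a list of dbt commands plus a trigger. As a sketch (my assumption of a sensible default; the cron line is just an example):

# commands the job runs, in order
dbt run
dbt test

# trigger: cron schedule, e.g. every day at 07:00
0 7 * * *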




Final thoughts

I've only scratched the surface of dbt with this blog post, but the tool already shows a lot of maturity in developing, deploying, and even testing the structures.

Hennie