BI Future Blog: augustus 2023

donderdag 31 augustus 2023

Identifier in Snowflake

Introduction

Identifier is also a nice feature of Snowflake. I forgot how many times I wanted to SELECT * FROM <dynamic table> IN SQL SERVER scripting. This is not possible without dynamic scripting dynamic SQL) in SQL Server. In Snowflake you can use the keyword IDENTIFIER.

IDENTIFIER

Let me give you an example with a variable.

DECLARE 
    table_name STRING;
    N INTEGER DEFAULT -4;
BEGIN

    SET table_name := 'Calendar';
    SELECT COUNT(*) INTO :N FROM IDENTIFIER(:table_name);
    RETURN N;
END

With this you can parameterize the query and dynamically reference a different table.

Final thoughts

A small example on how to work with IDENTIFIER.

Hennie

woensdag 30 augustus 2023

Building a Date dimension in Snowflake

Introduction

In the past, I was used to use a Date dimension (or I call it sometimes a Calendar dimension) in SQL Server and I've built it many times in SQL Server, but now for an exercise I decided to build it in Snowflake. Disclaimer is that it is not very performant at the moment because it's a row by row insertion and that is one thing where Snowflake doesn't perform not very well. But so be it. I don't expect that this stored procedure is executed very often. There, it could be an option to use it in a project. But on the other hand, I will look for faster solutions and I've already found some interesting options. But that is something for a future blogpost.

DISCLAIMER : You can use this procedure, but there are better ways to do this in Snowflake. I'll come back with a better solution. It was an exercise for me to rebuild the SQL Server stored procedure in Snowflake.

Look for a better way here

The stored procedure

Here is the stored procedure :

CREATE OR REPLACE PROCEDURE SP_CreateCalenderDimension(t_startdate TIMESTAMP_NTZ(9), t_enddate TIMESTAMP_NTZ(9))
RETURNS VARCHAR NOT NULL
LANGUAGE SQL
AS
-- DECLARE
--     t_startdate TIMESTAMP_NTZ(9) DEFAULT '2023-01-01';
--     t_enddate TIMESTAMP_NTZ(9) DEFAULT '2023-12-31';
BEGIN
    LET t_loopdate TIMESTAMP_NTZ(9) := t_startdate;
    LET i_calendarid INT;
    LET d_calendershortdate DATE;
    LET i_dayweek INT;                -- Monday is the first day of the week
    LET i_daymonth INT;
    LET i_dayyear INT;
    LET s_shortdayname STRING;
    LET s_longdayname STRING;
    LET i_isoweekyear INT;
    LET i_monthnumber INT;
    LET i_year INT;
    LET i_isoyear INT := TO_NUMBER(DATE_PART(year, :t_startdate));

    LET s_yearname STRING;
    LET i_quarter INT;
    LET s_yyyymm STRING(6);
    LET s_mmyyyy STRING(6);
    LET i_weekday INT;
    LET i_firstdayofmonth DATE;
    LET i_lastdayofmonth DATE;
    LET i_firstdayofquarter DATE;
    LET i_lastdayofquarter DATE;
    LET i_firstdayofyear DATE;
    LET i_lastdayofyear DATE;
    LET i_isomonthnumber INT;
    LET i_isoquarter INT;
    LET i_isoweekofyear INT;
    LET s_wwyyyy STRING(6);
    LET s_yyyyww STRING(6);
    LET b_changeisoweekyear CHAR(1) := 'N';
    
    CREATE OR REPLACE TABLE SP_DB.PUBLIC.CALENDAR (
    	CalendarID INT NOT NULL,
    	CalendarDate TIMESTAMP_NTZ(9) NOT NULL,
    	CalendarShortDate DATE NOT NULL,
    	CalendarDayOfWeek INT  NOT NULL,
    	CalendarDayOfMonth INT NOT NULL,
--        CalendarDayOfQuarter INT  NULL,	   
    	CalendarDayOfYear INT NOT NULL,
    	CalendarShortDayName STRING NOT  NULL,
    	CalendarLongDayName STRING NOT NULL,
        --  CalendarIsWorkDayOfMonth
        --  DaySuffix
        --  DayOfQuarter
        CalendarFirstDayOfMonth DATE NOT NULL,
        CalendarLastDayOfMonth DATE NOT NULL,
        CalendarFirstDayOfQuarter DATE NULL,
        CalendarLastDayOfQuarter DATE NULL,
        CalendarFirstDayOfYear DATE NULL,
        CalendarLastDayOfYear DATE NULL,
        CalendarIsWeekDay INT NULL,
        --  IsHoliday
        --  HolidayDescription

    -- Week
        CalendarISOWeekOfYear NUMBER(38,0) NULL, 
        CalendarWWYYYY STRING(6) NULL, 
        CalendarYYYYWW STRING(6) NULL, 
    
    -- Month
        CalendarMonthNumber NUMBER(38,0) NULL,
        CalendarISOMonthNumber NUMBER(38,0) NULL,    -- 445 pattern
        --  MonthName
        --  MonthOfQuarter
        CalendarMMYYYY STRING(6) NULL,         
        CalendarYYYYMM STRING(6) NULL,
        --  MonthYear
        --  YearMonth
        --  ISOMonth

    -- Quarter
        CalendarQuarter NUMBER(38,0)  NULL,
        CalendarISOQuarter NUMBER(38,0) NULL, 
        --  QuarterName

    -- Year
        CalendarYear NUMBER(38,0)  NULL,
        CalendarISOYear NUMBER NULL,
        CalendarYearName STRING NULL
    );
  
    WHILE (t_loopdate <= t_enddate) DO    
        -- day
        SET i_calendarid := TO_NUMBER(TO_CHAR(:t_loopdate,'yyyymmdd'));
        SET d_calendershortdate := TO_DATE(:t_loopdate);
        SET i_dayweek := DATE_PART(dayofweekiso, :t_loopdate);      
        SET i_daymonth := DATE_PART(day, :t_loopdate);
        SET i_dayyear := DATE_PART(dayofyear, :t_loopdate);
        SET s_shortdayname := DAYNAME(:t_loopdate);
        SET s_longdayname := CASE 
            WHEN :s_shortdayname = 'Sun' THEN 'Sunday' 
            WHEN :s_shortdayname = 'Mon' THEN 'Monday' 
            WHEN :s_shortdayname = 'Tue' THEN 'Tuesday' 
            WHEN :s_shortdayname = 'Wed' THEN 'Wednesday' 
            WHEN :s_shortdayname = 'Thu' THEN 'Thursday' 
            WHEN :s_shortdayname = 'Fri' THEN 'Friday' 
            WHEN :s_shortdayname = 'Sat' THEN 'Saturday' 
        END;
        SET i_weekday := CASE WHEN DATE_PART(dayofweekiso, :t_loopdate) NOT IN (6,7) THEN 1 ELSE 0 END;
        SET i_firstdayofmonth := TO_DATE(DATEADD(day, - (DATE_PART(day, :t_loopdate) - 1), :t_loopdate));
        SET i_lastdayofmonth := LAST_DAY(:t_loopdate, 'month');
        SET i_firstdayofquarter := TO_DATE(DATEADD(quarter, DATEDIFF(quarter, '2000-01-01'::TIMESTAMP, :t_loopdate), '2000-01-01'::TIMESTAMP));
        SET i_lastdayofquarter :=  LAST_DAY(:t_loopdate, 'quarter');
        SET i_firstdayofyear := TO_DATE('01/01/' || TO_VARCHAR(DATE_PART(year, :t_loopdate)));
        SET i_lastdayofyear := LAST_DAY(:t_loopdate, 'year');

        -- week
        SET i_isoweekyear := DATE_PART(weekiso, :t_loopdate); 
 
    	-- Determine the month based on the weeknumber (based on a 4-4-5 period)
    	SET i_isomonthnumber  := CASE	WHEN :i_isoweekyear >= 1 AND :i_isoweekyear <= 4 Then 1		--4
    									WHEN :i_isoweekyear >= 5 AND :i_isoweekyear <= 8 Then 2		--4
    									WHEN :i_isoweekyear >= 9 AND :i_isoweekyear <= 13 Then 3	--5
    									WHEN :i_isoweekyear >= 14 AND :i_isoweekyear <= 17 Then 4	--4	
    									WHEN :i_isoweekyear >= 18 AND :i_isoweekyear <= 21 Then 5	--4
    									WHEN :i_isoweekyear >= 22 AND :i_isoweekyear <= 27 Then 6	--5
    									WHEN :i_isoweekyear >= 28 AND :i_isoweekyear <= 31 Then 7	--4
    									WHEN :i_isoweekyear >= 32 AND :i_isoweekyear <= 35 Then 8	--4
    									WHEN :i_isoweekyear >= 36 AND :i_isoweekyear <= 40 Then 9	--5
    									WHEN :i_isoweekyear >= 41 AND :i_isoweekyear <= 44 Then 10	--4
    									WHEN :i_isoweekyear >= 45 AND :i_isoweekyear <= 48 Then 11	--4	
    									WHEN :i_isoweekyear >= 49 AND :i_isoweekyear <= 53 Then 12	--5
    							END;
    
        -- Month
        SET i_monthnumber := DATE_PART(month, :t_loopdate);

        SET i_quarter := DATE_PART(quarter, :t_loopdate);
        SET s_yyyymm := TO_VARCHAR(DATE_PART(year, :t_LoopDate)) || RIGHT('0' || TO_VARCHAR(DATE_PART(month, :t_LoopDate)),2);
        SET s_mmyyyy := RIGHT('0' || TO_VARCHAR(DATE_PART(month, :t_LoopDate)),2) || TO_VARCHAR(DATE_PART(year, :t_LoopDate));

        -- Quarter
    	SET i_isoquarter  := CASE WHEN :i_isoweekyear >= 1 AND  :i_isoweekyear <= 13 Then 1	
    	 						  WHEN :i_isoweekyear >= 14 AND :i_isoweekyear <= 27 Then 2	
    	 						  WHEN :i_isoweekyear >= 28 AND :i_isoweekyear <= 40 Then 3	
    	 						  WHEN :i_isoweekyear >= 41 AND :i_isoweekyear <= 53 Then 4		
                            END;

        -- Year
        SET i_year := DATE_PART(year, :t_loopdate);
        SET s_yearname := 'CY ' || TO_VARCHAR(:i_year);


        -- Post processing (dependent on isoyear)
        SET s_wwyyyy := RIGHT('0' || TO_VARCHAR(:i_isoweekyear),2) || TO_VARCHAR (:i_isoweekyear);
    	SET s_yyyyww := TO_VARCHAR (:i_isoweekyear) || RIGHT('0' || TO_VARCHAR(:i_isoweekyear),2);

        IF (:i_isoweekyear = 1 AND :b_changeisoweekyear = 'Y') THEN
            SET i_isoyear := :i_isoyear + 1; 
		    SET b_changeisoweekyear := 'N';
        END IF;
        IF (:i_isoweekyear = 2) THEN 
            SET b_changeisoweekyear := 'Y';
        END IF;
        
        INSERT INTO Calendar (
            CalendarID,                 
            CalendarDate, 
            CalendarShortDate, 
            CalendarDayOfWeek,
            CalendarDayOfMonth, 
            CalendarDayOfYear,
            CalendarShortDayName,
            CalendarLongDayName,
            CalendarFirstDayOfMonth,
            CalendarLastDayOfMonth,
            CalendarFirstDayOfQuarter,
            CalendarLastDayOfQuarter,
            CalendarFirstDayOfYear,
            CalendarLastDayOfYear,

            
            CalendarISOWeekOfYear,
            CalendarYYYYWW,
            CalendarWWYYYY,
            
            CalendarMonthNumber,
            CalendarISOMonthNumber,
            CalendarYear,
            CalendarISOYear,
            CalendarYearName,
            CalendarQuarter,
            CalendarISOQuarter,
            CalendarYYYYMM,
            CalendarMMYYYY,
            CalendarIsWeekDay
            ) 
        SELECT     
            :i_calendarid, 
            :t_loopdate, 
            :d_calendershortdate, 
            :i_dayweek,               
            :i_daymonth, 
            :i_dayyear,
            :s_shortdayname, 
            :s_longdayname,
            :i_firstdayofmonth,
            :i_lastdayofmonth, 
            :i_firstdayofquarter,
            :i_lastdayofquarter, 
            :i_firstdayofyear, 
            :i_lastdayofyear, 
            
            :i_isoweekyear,
            :s_yyyyww,
            :s_wwyyyy,
            :i_monthnumber,
            :i_isomonthnumber,
            :i_year,
            :i_isoyear,
            :s_yearname,
            :i_quarter,
            :i_isoquarter,
            :s_yyyymm,
            :s_mmyyyy,
            :i_weekday;

        SET t_loopdate := DATEADD(DAY, 1, t_loopdate);
    END WHILE;
    
   RETURN t_loopdate;
END;

I haven't found a way to define a default value for the parameters, so I will fill some dates in the callee parameters:

CALL SP_CreateCalenderDimension('2023-01-01', '2023-12-31');

Inspection of the calendar table :

SELECT * FROM Calendar;

Final Thoughts

In SQL Server it's performing well, but in Snowflake it's performing less. On the one side it's a stored procedure I do expect that it's not necessary to execute often.

But future improvements could be another smarter way of generating the calendar dimension. I will look into that. I already found some examples with GENERATOR. Next improvement could be including (dutch) holidays.

Hennie

Difference between SQL Server and Snowflake

A small blogpost about a nice feature of Snowflake. I stumbled upon an interesting difference between SQL Server and Snowflake. It is about using an alias in the same query. I've never seen this before, but i'm biased to Microsoft of course.

So here is the thing. Suppose you have the following query :


SELECT 1 AS TEST,
    TEST + 1 AS TEST2

In SQL Server it will generate an error

But in Snowflake it will run!

Now why could this be important? Well, when I write queries in SQL, I sometimes have to reuse functionality multiple times in the same query. Here it can help to reuse coding and adjust the code at one place instead of multiple places in the same query.

Hennie

zondag 20 augustus 2023

Medallion architecture

Introduction

If you're working in the field of Data engineering, Analytics and Data Warehousing, you probably heared the phrase "Medallion Architecture". Now, me as a data guy I was/am curious about what is executed in every layer and so I did a little investigation about this subject. I searched the Internet and for every layer I copied and pasted the phrases that I found important. So the credits goes to the people/companies that are mentioned in the resources section. So what are the layers, what is the goal of the layer, what is done and what is not done in each layer? I wrote them all down below.

Landingzone Layer

Key Points :

The landing zone is an optional Layer. It's not necessary to integrate this in your data architecture.
This layer is often needed in situations in which it’s hard to extract data from the target source system.

For example, when working with external customers or SaaS vendors.

In such scenarios, you have a dependency or sometimes receive data in an unpreferred (file) format or structure. File formats could be CSV, JSON, XML, Parquet, Delta, and so on.

Bronze Layer

Key points :

Goal :

Long Term Storage and Recovery. Organizations often need a place to dump huge amounts of structured and unstructured data for processing and Azure Data Lake Storage is often the cheapest and most efficient way to do say. The Raw layer is designed in a way to effectively act as a sink for further processing as well as a store for data for however long it is required. This can act as a backup for other systems as well.
Auditability. Since we don’t do much in terms of additional processing on the data we bring into the Raw layer, the data is unfiltered and unpurified, which is a good thing in terms of auditability. This raw data is always immutable and can act as a reference of what the source system looked like at a given point in time. We usually keep the data in its native format, but this holds true even if we decide to store it in a more modern data format.

The data is stored in a Raw format.
The layer is used for Ingestion of data.
Is appended incrementally and grows over time. For transactional data: can be appended incrementally and grow over time.
Provides the ability to recreate any state of a given data system. Maintains the raw state of the data source in the structure “as-is”. Retains the full (unprocessed) history of each dataset in an efficient storage format, for example, Parquet or Delta. Retaining the full, unprocessed history of each dataset in an efficient storage format provides the ability to recreate any state of a given data system.
Can be any combination of streaming and batch transactions.
Files in columnar format (should) (Parquet, Delta, Avro).
Additional metadata (such as source file names or recording the time data was processed) may be added to data on ingest for enhanced discoverability, description of the state of the source dataset, and optimized performance in downstream applications. * May include extra metadata, such as schema information, source file names or recording the time data was processed.
Additional fields (eg. finding dupplicates).
Data is immutable (read-only).
Managed using interval partitioned tables, for example, using a YYYYMMDD or datetime folder structure.
Datatypes always string. Comment : This seems the same advice in a standard data warehouse solution.
Organized by Source.
Data Landed in Native Format :

In an effort to load data as-is, we often keep the data format in the Raw Layer the same as it is in the source. This means our Raw Layer often has CSV and TXTs, depending on the type of source.
That being said, if we manage to bring in the data in Parquet format instead, that would be an added bonus.

Additional Checks :

Schema Validated.
Lake Hierarchy Applied.
Timeliness Assessed.
Elements Tagged.
Completeness and Accuracy Accepted.

Silver Layer

Key points :

Goal :

Enrichment. The Silver layer is often where most of the effort is spent on wrangling the data into a meaningful state. Multiple sources are brought together, the data is refined, and we make sure that it is stored in the most efficient way possible so that the following layers can access data effectively.
Discovery. Since the data on the Silver layer is cleaner and more organized, this is generally where most of the data discovery is done. This involves data engineers, data scientists, data analytics and even sometimes business users digging into the data directly to try to gain more context into the data available in the higher layers. The Silver layer provides a balance between the perk of having long term data retention (Bronze) and having cleaner and organized data (Gold).
Reusability. The Silver layer is also where reusable data sets are created and stored so that different teams creating different higher layers (data marts or warehouses) can reuse the same data sets without having to go through the trouble of creating it again themselves.

Apply Delta Format to the Silver Layer.
Perform consolidation and standardization.
Filter the data.
Clean the data.
Augment the data. The goal of data augmentation is to increase the diversity and variability of the training data, which can lead to improved model performance and generalization on unseen data.
Standardize the data.
The data is validated.
The data is normalized. Data modellering 3NF-a-Like, Data Vault-a-Like.
This layer is write-Optimized.
Organized in domains. In the Silver layer the data is organized by Domain, such as Customer or Product. Comment : This seems to me entities (not domains).
Files in columnar format (should) (Parquet, Delta).

Gold Layer

Keypoints :

The data is structured for read optimization. Consumer Specific Data Models -> One of the ways of getting the data ready for consumption is molding it into a data model that allows for quick and efficient access while being scalable as well.
Business level aggregates are applied
Files in columnar format (should) (Parquet, Delta). Data is stored in an efficient storage format, preferably Delta.
Gold tables represent data that has been transformed for consumption or use cases.
Gold uses versioning for rolling back processing errors.
Historization is applied only for the set of use cases or consumers. So, Gold can be a selection or aggregation of data that’s found in Silver.
In Gold you apply complex business rules. So, it uses many post-processing activities, calculations, enrichments, use-case specific optimizations, etc.
Data is highly governed and well-documented.
Enforce Quality. Although, some standardizations is applied on the Silver Layer, but it’s important to remember that the Gold layer will be the version of truth that is exposed to customer dashboards and other downstream reporting systems. So, we want to make sure that the data is consumable and clean, not just in terms of the data quality of specific fields but in terms of the entire data presented. We can do things like adding filters to reduce the amount of data being pushed forward if it does not benefit that consumer’s use cases and we can aggregate data where applicable too.
Security. Since the Gold Layer will be the go-to place for most of the organization to get their data, it’s a good idea to focus on the security of this data too. Things like Row Level Security and Data Masking are often used here in addition to the normal Role Based Access Control mechanisms.

Resources

For this blogpost I've used the following resources. I took the keypoints from these resources, deduplicate the items or combined them into a new keypoint:

https://www.mssqltips.com/sqlservertip/7689/data-lake-medallion-architecture-to-maintain-data-integrity/
https://www.jamesserra.com/archive/2019/10/databricks-delta-lake/
https://www.advancinganalytics.co.uk/blog/medallion-architecture
https://www.databricks.com/glossary/medallion-architecture
https://microsoft.github.io/WhatTheHack/060-SynapseAndDatabricks/
ChatGPT

Final thoughts

After years where people claimed that the "Data Warehouse is dead", it seems to me that a lot is coming back to the Data Warehouse principles according to the defintion of Inmon. Even Datavault is mentioned as a proposed integration solution in the silver layer.

Some people define a Data Warehouse as a technical platform/solution (for instance, in Microsoft Fabric the SQL engine is called Data Warehouse) but a Data Warehouse is not a technical solution but a solution that you build on top of platforms like Data bricks, Snowflake or Microsoft Fabric.

Hennie