Sunday 2 February 2020

Azure Series : Passing a parameter to an Azure Data Factory pipeline from SQLDW

Introduction

I have the following scenario for you: there are all kinds of deliveries of information in an information factory, and when these arrive, we want to trigger an event and load the tables that depend on this information. So I have a process and I have a delivery (number). This is the basis for loading certain tables for some of the customers.

So I have a SQL DWH (Synapse now), a Blob Storage and an Azure Data Factory configured. I've created a solution in the following manner:
  1. Create a file in a blob container with the format '<process>_<deliverynumber>.trg' from within a stored procedure in Azure SQL Data Warehouse.
  2. When the file is detected in the blob container, an event trigger in Azure Data Factory fires.
  3. The trigger runs an Azure Data Factory pipeline.
  4. The Azure Data Factory pipeline picks up the filename and splits it into the process and delivery information.
  5. This information is passed to a stored procedure and written to a table (for demo purposes).
These steps are described in this blogpost.

1. Create a trigger file

The first step is creating a file in the Blob Storage. This is done with the following statement.

IF EXISTS (SELECT * FROM sys.external_tables 
  WHERE object_id = OBJECT_ID('TestHennieCallPipeline'))  
    DROP EXTERNAL TABLE TestHennieCallPipeline;

-- CETAS: writes a file to the blob container, which acts as the trigger file
CREATE EXTERNAL TABLE TestHennieCallPipeline WITH (
    LOCATION = 'TestHennieCallPipeline_123456790.trg',
    DATA_SOURCE = [BLOB_TRIGGER_CONTAINER],
    FILE_FORMAT = [DoublePipeDelimited_FR2_Outbound]
   ) AS SELECT 1 AS i FROM [someschema].[sometable] WHERE 1 = 2;

Executing this command creates a file in the blob container (and also a directory).
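
The process name and delivery number are hard-coded here for the demo. Below is a minimal sketch of a parameterized version, assuming a hypothetical procedure name dbo.CreateTriggerFile and the same external data source and file format as above; it builds the CETAS statement dynamically and executes it with sp_executesql.

-- Hypothetical wrapper: builds the trigger file name '<process>_<deliverynumber>.trg'
-- and creates the external table (and thus the blob) dynamically.
CREATE PROCEDURE dbo.CreateTriggerFile
    @process        nvarchar(50),
    @deliverynumber nvarchar(50)
AS
BEGIN
    -- Remove the external table from a previous run, if it exists
    IF EXISTS (SELECT * FROM sys.external_tables
      WHERE object_id = OBJECT_ID('TestHennieCallPipeline'))
        EXEC ('DROP EXTERNAL TABLE TestHennieCallPipeline');

    -- Build and execute the CETAS statement with the dynamic file name
    DECLARE @sql nvarchar(max) =
        N'CREATE EXTERNAL TABLE TestHennieCallPipeline WITH (
              LOCATION = ''' + @process + N'_' + @deliverynumber + N'.trg'',
              DATA_SOURCE = [BLOB_TRIGGER_CONTAINER],
              FILE_FORMAT = [DoublePipeDelimited_FR2_Outbound]
          ) AS SELECT 1 AS i FROM [someschema].[sometable] WHERE 1 = 2;';

    EXEC sp_executesql @sql;
END

Calling EXEC dbo.CreateTriggerFile 'TestHennieCallPipeline', '123456790' would then produce the same trigger file as the statement above.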


2. Create an event trigger

Now, when the file is created, an event trigger in Azure Data Factory is executed. Here is an example of the event trigger setup.




This trigger has a parameter, but it is not possible to create that parameter in the trigger section itself. This has to be done from the pipeline pane, where an option to create and edit triggers is available.


So, choose Edit/New Trigger and the following window is shown. The pipeline will respond to the trigger event 'TestHennie'. Parameters equals 1 because I've already created a parameter.


Click on the name, in this case 'TestHennie', and click Continue twice until the following screen appears. Here you can enter a value for the parameter, and I've used a system variable.



The system variable is needed for retrieving the name of the file.

@triggerBody().fileName

3. Create an ADF pipeline

One of the next steps is creating an Azure Data Factory pipeline for this experiment. I have created two activities: a Set Variable activity and a Stored Procedure activity.

The first one is for setting the value of a variable and the other one is for debugging purposes: the stored procedure will insert the value into a table.

4. Create a parameter in the pipeline

Creating a parameter in the pipeline is the next step in this experiment. I've created a parameter named SourceFile. This parameter is used by the trigger.



5. Create a user variable in the pipeline

Now, in order to do some manipulation of the filename that is retrieved and stored in the parameter, I've added a user variable to the pipeline. This variable will be used in the Set Variable activity.


6. Set the variable with an expression

Then, I've created an expression in the Set Variable activity. In this activity it's possible to use different kinds of functions, and these can be combined in order to get the desired result.




Next, I've built an expression; it would be nicer if I could have used fewer functions, and perhaps that's possible, but for now it will do. For example, for the trigger file 'TestHennieCallPipeline_123456790.trg' the expression takes the characters between the '_' and the '.' and returns the delivery number '123456790'.

@substring(pipeline().parameters.SourceFile, 
    add(indexof(pipeline().parameters.SourceFile,'_'),1),
    sub(indexof(pipeline().parameters.SourceFile,'.'), 
     add(indexof(pipeline().parameters.SourceFile,'_'),1)
    )
    )

7. Pass the variable as a parameter to a stored procedure

I've created a stored procedure that adds a record to a table. This way it's possible to investigate the values.



Here are the simple table, the stored procedure and some scripting.

CREATE TABLE _Hennie (
  Deliverynumber  nvarchar(50)
  )

SELECT * FROM  _Hennie

CREATE PROCEDURE TestHennie (@deliverynumber varchar(50))
AS
BEGIN
 INSERT INTO _Hennie (Deliverynumber) VALUES (@deliverynumber)
END


TRUNCATE TABLE  _Hennie
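
To test the stored procedure manually, outside of the pipeline, you could run something like the following (the delivery number is the one from the example trigger file):

EXEC TestHennie @deliverynumber = '123456790';

After that, the SELECT statement above shows the inserted row.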

And the result is that the delivery number is derived from the filename by the expression and written to the table.



That's it!

Saturday 11 January 2020

The books I've read in 2019

Introduction

This blogpost is about the books I've read in 2019. I started the year with a lot of reading about data modeling: I followed some courses on data modeling and studied the principles of data modeling and of time in databases. Later I switched to reading about DevOps and its principles.

Fully Communication Oriented Information Modeling 

This book is written by Guido Bakema, Jan Pieter Zwart and Harm van der Lek. They describe a method of data modelling based on the work of Nijssen: a formalized process that starts with analysing the area of interest and derives a data model from it. The one thing I learned from the book (and the course) is that we can phrase data and their relations in sentences (which is called verbalization), then create a conceptual model from them, derive a data model and test whether the data model is correct by regenerating the sentences from the model. I like the way the method is applied and how the model is derived, and I think it is a pity that this method is not used more broadly.


Datamodeling made simple

This book is a hands-on book on Powerdesigner in combination with data modelling, and it helped me a lot with understanding Powerdesigner. Although I have worked with Powerdesigner in a couple of projects, there are always things to learn about the subject; it is such an extensive system. The book is well written and helps you get started, but it is also helpful for the more experienced data modeller. It is one of the few books on Powerdesigner, a tool that is widely used in the data modelling world.







DevOps handbook

I bought this book for a certification in DevOps (EXIN). It is an informative book and an introduction to DevOps: it covers some of the highlights, tells you what DevOps is and what its fundamentals are. Oleg writes that DevOps is the logical next step after Lean and Scrum development. It explained to me that building IT solutions is like adding value in a stream (just like in a factory), and this is also true for data warehousing, BI, analytics, data science and building other data platforms. Although the book is informative, it is written from an operations viewpoint rather than a developer's viewpoint. But I learned a lot.





The Phoenix project

This book reads like a novel. It tells the story of a project in a company that struggles with all kinds of problems delivering software to customers. The book is written by Gene Kim, Kevin Behr and George Spafford, and they did a tremendous job capturing the three principles of DevOps in a novel like this. It all starts with delivering software in flow: release software in smaller steps and try to reach a consistent flow. The next step is about feedback on the code that is deployed (as early as possible), and the last step is about learning and experimentation: try different things to deliver faster or with better quality. This book is a must-have for anyone interested in DevOps.






The Unicorn project

This is the next book by Gene Kim, about the Unicorn project. It is the successor of the Phoenix project, but you can read it separately. Again it is a novel about a fictitious person who gets involved in a project with all kinds of problems: not being able to build a complex project, having to gather all kinds of information from different people and managers, and running into recognizable issues in an organization. It's a very educational and entertaining book and it helps to set the focus on delivering software in an agile way, reducing waste, solving bottlenecks (Goldratt) and gradually delivering software faster and with higher quality. Again a must-read book about DevOps!





The DevOps handbook

What a comprehensive book about DevOps! This book is truly a handbook of DevOps and covers all of the basic and advanced topics. It is based on the Three Ways principle of The Phoenix project: the principle of flow, the principle of feedback and the principle of continual learning and experimentation. These are the true principles of DevOps: establish a continuous flow of releases, gather feedback, and improve the value chain through experimentation and learning. This book, together with the Phoenix project, explains the true nature of DevOps.






Continuous delivery

This book is a must-read, just like the "DevOps Handbook". It's written by Jez Humble and David Farley and I think it is one of the first well-written books about DevOps. It is a bit dated because it was written in 2009, but still, a lot of very interesting topics are described. For instance, the book taught me that release branching is the preferred way to do branching. It is advisable to have one trunk/master, check every piece of (working!) code into it, and make sure that the trunk/master branch is always deployable. A lot is written about unit testing and why unit testing is not the only thing that is needed; you need other tests too. Although dated and written for 'normal' IT projects (I mean: not written for data platforms), there are a huge number of aha moments in there.




SQL Server 2019 Revealed

The book is written by Bob Ward and it is, of course, about SQL Server 2019, the latest on-premises version of SQL Server. This book is all about the new features of SQL Server 2019, and they are well explained and clearly written. One of the important topics is the Big Data Cluster, which is a huge leap for big data with SQL Server: SQL Server is no longer just a DBMS but a data platform. There are chapters about Linux (something that a couple of years ago would have sounded like an April 1st joke), Kubernetes, Polybase and lots and lots more. A must-have book when you want to know more about SQL Server 2019.







Agile Data warehouse design

I wanted to sharpen my requirements skills with this book, and although I bought it a couple of years ago, I never read it in detail. This year I decided to read it. The book is about gathering requirements for dimensional modelling (data warehousing) by using what Lawrence Corr calls Modelstorming. It is built around the 7Ws (who, what, when, where, how many, why and how). This book goes one step further in the requirements phase than Kimball: in the Data Warehouse Toolkit the requirements phase is not described very broadly and was probably not the focus of that book.






Final Thoughts

2019 was a year in which my attention shifted to Azure and DevOps. The focus of projects is more and more on delivering data platforms in the cloud (Azure) in an agile way. Methods like Data Vault, together with DevOps, enable us to deliver data to the users faster and with higher quality.


Hennie


Sunday 15 December 2019

Azure series : Comparing Azure SQLDB vs SQLDW vs AAS

Introduction

There are two Azure SQL database solutions available at this moment: Azure SQL Database and Azure SQL Data Warehouse (which is now part of the Synapse Analytics solution), and I was wondering what the differences between the two are. For the sake of completeness I've also included Azure Analysis Services.

So my question is: "What is the best Azure SQL data platform for implementing a data warehouse solution?" In the schema below I've gathered the pros and cons of the three solutions (SQLDB, SQLDW and AAS), but focused on SQLDB and SQLDW.

Comparing the three data platforms

In the diagram below, I've compared the three solutions based on the characteristics of these technologies.


Final thoughts

All the data platforms have advantages and disadvantages. Depending on your requirements, a single data platform is not always a good fit on its own. Sometimes you can choose a combination of tools to make it an acceptable fit for purpose.

Love to hear your opinion...

Hennie

Sunday 3 November 2019

Azure series : Elastic queries

Introduction

Not everyone knows that it is possible to run queries against other Azure SQL databases. With on-premises databases, we are used to using the following cross-database query:


SELECT * FROM SourceDB.[SalesLT].[Customer]


But it will run into an error in Azure :

Msg 40515, Level 15, State 1, Line 16
Reference to database and/or server name in 'SourceDB.SalesLT.Customer' is not supported in this version of SQL Server.

From Microsoft : "The solution is using elastic queries. The elastic query feature enables you to run a Transact-SQL query that spans multiple databases in Azure SQL Database. It allows you to perform cross-database queries to access remote tables, and to connect Microsoft and third-party tools (Excel, Power BI, Tableau, etc.) to query across data tiers with multiple databases. Using this feature, you can scale out queries to large data tiers in SQL Database and visualize the results in business intelligence (BI) reports"

For this blogpost I've used the following link by David Postlethwaite, and there are other great resources on this topic too.

This is the setup of the databases:

I've gathered the steps to create elastic queries in this blogpost.

1. Create the login

First start with creating a login in the master database for the user we are going to use.


-- Connect to the master database (in Azure SQL Database, USE cannot switch
-- databases, so make sure your query window is connected to master)
USE Master
GO

CREATE LOGIN ElasticUser WITH PASSWORD = '6rJbb3Gh@Bq):ngE'
GO


2. Create the user in the source database

Create the user in the SourceDB database and assign it to the db_owner role.

-- Connect to the SourceDB database
USE SourceDB
GO

CREATE USER ElasticUser FOR LOGIN ElasticUser
GO

ALTER ROLE [db_owner] ADD MEMBER ElasticUser
GO

3. Create the user in the destination database

Then create the user in the DestinationDB database and again assign it to the db_owner role.


-- Connect to the DestinationDB database
USE DestinationDB
GO

CREATE USER ElasticUser FOR LOGIN ElasticUser
GO

ALTER ROLE [db_owner] ADD MEMBER ElasticUser
GO


4. Create the master key

Create the master key in the DestinationDB database with a strong password. This creates a symmetric key that is used to protect the private keys in the database.


/*
DROP MASTER KEY 
GO
*/

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '6rJbb3Gh@Bq):ngE';


5. Create the database scoped credential

Then create the database scoped credential with the CREATE DATABASE SCOPED CREDENTIAL statement. The credential is used by the database to access the external location any time the database performs an operation that requires access.


/*
DROP DATABASE SCOPED CREDENTIAL ElasticCredential
GO
*/

CREATE DATABASE SCOPED CREDENTIAL ElasticCredential WITH IDENTITY = 'ElasticUser',
SECRET = '6rJbb3Gh@Bq):ngE';


6. Create the external data source

The CREATE EXTERNAL DATA SOURCE statement defines the connection that is used by the elastic queries. The script for creating the external data source is as follows:

/*
DROP EXTERNAL DATA SOURCE [sourceDB]
GO
*/

CREATE EXTERNAL DATA SOURCE sourceDB WITH
(TYPE = RDBMS,
LOCATION = 'server-280742145.database.windows.net', 
DATABASE_NAME = 'SourceDB',
CREDENTIAL = ElasticCredential
) ;
GO

7. Create the schema

I'm using the Customer table from the AdventureWorksLT database. That table lives in the SalesLT schema, and therefore we need to create a schema with the same name in the DestinationDB.


/*
DROP SCHEMA IF EXISTS SalesLT
GO
*/
CREATE SCHEMA SalesLT
GO 


8. Create the external table

The following statement creates the external table for the SalesLT.Customer table.


/*
DROP EXTERNAL TABLE [SalesLT].[Customer]
GO
*/

CREATE EXTERNAL TABLE SalesLT.[Customer](
 [CustomerID] [int] NOT NULL,
 [NameStyle] bit NOT NULL,
 [Title] [nvarchar](8) NULL,
 [FirstName] nvarchar(50) NOT NULL,
 [MiddleName] nvarchar(50) NULL,
 [LastName] nvarchar(50) NOT NULL,
 [Suffix] [nvarchar](10) NULL,
 [CompanyName] [nvarchar](128) NULL,
 [SalesPerson] [nvarchar](256) NULL,
 [EmailAddress] [nvarchar](50) NULL,
 [Phone] nvarchar(25) NULL,
 [PasswordHash] [varchar](128) NOT NULL,
 [PasswordSalt] [varchar](10) NOT NULL,
 [rowguid] [uniqueidentifier] NOT NULL,
 [ModifiedDate] [datetime] NOT NULL
 )
WITH
(
DATA_SOURCE = sourceDB
);


After running the DDL statements, you can access the remote table “Customer” as though it were a local table. Azure SQL Database automatically opens a connection to the remote database, processes your request on the remote database, and returns the results.
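
Because the external table behaves like a local table, you can also join it to local tables in the DestinationDB. Below is a small sketch, assuming a hypothetical local table dbo.Orders with a CustomerID column:

SELECT c.CustomerID,
       c.FirstName,
       c.LastName,
       o.OrderID
FROM SalesLT.[Customer] c   -- external (remote) table
JOIN dbo.Orders o           -- hypothetical local table in DestinationDB
  ON o.CustomerID = c.CustomerID;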

Test the External Table

Test the external table with a select query and see if we can get some results back.


SELECT * FROM  SalesLT.[Customer]


And this gives the following result!


Now, if there is even a tiny difference in data type or length between the external table and the underlying table, you will receive the following error:

The data type of the column 'Phone' in the external table is different than the column's data type in the underlying standalone or sharded table present on the external source.
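
A quick way to prevent this is to compare the column definitions on both sides. For instance, you could run something like the following query on the SourceDB and compare the output with the external table definition above:

-- Check the data types of the underlying table in SourceDB
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'SalesLT'
  AND TABLE_NAME = 'Customer'
ORDER BY ORDINAL_POSITION;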

Final thoughts

This blogpost is about creating external tables for use in elastic queries.

Hennie

Monday 21 October 2019

Azure series : The Mapping Data Flow activity in Azure Data Factory

Introduction

The Mapping Data Flow activity is an activity that was added recently (2019) and it has a lot of similarities with the SSIS data flow task, so for an SSIS developer the learning curve of the Azure Data Factory Mapping Data Flow activity is not that steep.

In this blogpost I'll explore the basics of the Mapping Data Flow activity (every time I want to type 'task' instead of 'activity'), the operations available in the data flow, and more.

If you want to follow along, the prerequisite for this exercise is that the Azure Data Factory is already created and that two SQL databases are present: SourceDB with AdventureWorksLT installed and an empty database with a Customer table.

This is my starting situation.


Let's explore Azure Data Factory and start creating some items.

First steps

Now, the first step is to create a pipeline in ADF and give it a proper name. A pipeline is like a control flow in SSIS (it controls the order and direction of activities). There are two options to create a pipeline in Azure Data Factory.



If you have chosen to create a pipeline in Azure Data Factory, the following screen is shown. It is a screen that consists of different parts. I've indicated the different parts with a red box and a number.



The following parts are shown :
  1. Top menu
  2. The factory resources
  3. The connections (linked services) and trigger
  4. The components you can choose from. 
  5. The template menu
  6. Graph (seems to me a bit of an odd name)
  7. The configuration panel

Let's start this exercise by creating and renaming a pipeline; I'll name it "MyMapingDataFlowPieline".


The next step is to add the datasets and the connections (linked services): one dataset for the source table and one for the destination table. I'm using the naming convention 'ds' for a dataset and 'ls' for a linked service. I'll blog about the naming convention of the components in the future, as I'm currently determining a best practice for it. There are some blogposts about naming conventions, but they seem not very comprehensive. As a linked service is comparable to a connection, it is possible to have multiple datasets based on one linked service, and therefore the naming convention should reflect the source type (e.g. MS SQL Database) and not the table (for instance).

The linked services I've created for this blogpost.


The datasets that have been created so far.

Here is the script that I've used to create the table in the DestinationDB. In a future blogpost I'll elaborate further on this example and I'll use this table for SCD Type I and SCD Type II in Azure Data Factory.


DROP TABLE IF EXISTS [DimCustomer]
GO

CREATE TABLE [DimCustomer](
 [CustomerID] int NOT NULL,
 [NameStyle] varchar(50)  NULL,
 [Title] [nvarchar](8) NULL,
 [FirstName] varchar(50)  NULL,
 [MiddleName] varchar(50) NULL,
 [LastName] varchar(50)  NULL,
 [Suffix] [nvarchar](10) NULL,
 [CompanyName] [nvarchar](128) NULL,
 [SalesPerson] [nvarchar](256) NULL,
 [EmailAddress] [nvarchar](50) NULL,
 [Phone] varchar(50) NULL,
 [PasswordHash] [varchar](128)  NULL,
 [PasswordSalt] [varchar](10)  NULL,
 [rowguid] [uniqueidentifier]  NULL
) ON [PRIMARY]
GO


So, we have created the Azure Data Factory, the two Azure SQL databases (one with AdventureWorksLT and one with the DimCustomer table), and the pipeline with two linked services and two datasets. We are ready to create the Mapping Data Flow.

Drag the Mapping Data Flow activity onto the graph canvas and drag a source and a sink onto the data flow canvas. In the end it will look like the following screenshot:


Now, in order to test this, we need to turn on Data Flow Debug. Turning this option on will take some time.


After a while you can run the mapping data flow task with the Debug option.


And the status of the run is shown in the output tab in the configuration panel.


And if we check the results in SSMS, we can see there is data in the table.
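
For example, a quick check in SSMS on the DestinationDB could look like this (a minimal sketch, using the dbo.DimCustomer table from the script above):

-- Count the copied rows and peek at a few of them
SELECT COUNT(*) AS NumberOfCustomers FROM dbo.DimCustomer;

SELECT TOP (10) CustomerID, FirstName, LastName, CompanyName
FROM dbo.DimCustomer
ORDER BY CustomerID;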


Final Thoughts

This is a blogpost about a simple copy process. In the future I'll blog more about Azure Data Factory and work towards more complicated examples.

Hennie