Thursday 29 October 2020

DevOps series : Working with teams in a data warehouse : ownership (part II)

Introduction

This blogpost is a successor to the blogpost "DevOps series : Working with teams in a data warehouse (part I)". In order to control changes to the objects in a large data warehouse environment with multiple teams, one of the first steps you have to take is to define ownership. Who is the owner of an object? Because if all teams are responsible for an object, no one is responsible for it. Sometimes there is implicit knowledge about who is responsible, and perhaps consultation happens between teams, but this quickly grows out of hand as the number of teams grows. Therefore, to get control over the objects, teams need a clear understanding of ownership and must take responsibility for the objects that are theirs.


Vertical and horizontal teams 

Teams can be organized in different ways: a team can be responsible for a vertical solution, for instance from source to dashboard, or for a horizontal layer, e.g. the data warehouse layer. A mix is also possible: sometimes a team builds a data warehouse for its own information products (e.g. dashboards) and later on other teams build dashboards on top of these data objects.

I've made a distinction between the different parts (I call them modules) of a data warehouse: sources (delivery), the data warehouse itself, and information products. Examples of a source (delivery) are extraction code, a data lake, file copy code and staging objects. For the data warehouse, think about the raw vault, the business vault and the datamarts. There is a lot of discussion (in my current role) about the raw vault, because hubs and links are more business driven while the satellites from the different sources are more source driven. We discussed some scenarios, like a source vault solution combined with an integration layer for integrating the business keys and the links. There is a lot of resistance against source vaults, from Dan Linstedt and others, but with this approach you can make your deployments independent from other teams, and an integration layer can compensate for the drawbacks of the source vault approach. Other options are possible as well. Finally, the information products are more end-user delivery oriented: think about components like SQL views for a cube or report, cubes (Tabular models) and dashboards (Power BI). With this approach, teams are more independent and can deliver their code much faster to production.


Now, the preferred team pattern in my opinion is team pattern number 1. This team is responsible for processing the data from source to dashboard and is able to deliver data from start to end. It is not (or less) dependent on the work of other teams. This is also in line with the idea of a DevOps team: the team is able to deliver stand-alone products to the business.

Although the other patterns are less preferable, they occur too. For instance, team pattern 3 sometimes happens when a management dashboard is needed that retrieves data from all kinds of different areas of a company, e.g. finance, HR, sales, etc. In that case it's not feasible to create a single team that is responsible from source to end, so a team emerges that is dependent on the work of other teams.

Team pattern 2 is an example of a horizontal team, and this pattern occurs a lot in companies. It's very common to organize teams in a cost-efficient way in order to utilize resources as efficiently as possible. Although it seems logical, this team pattern is not organized around delivering products to the customer but around people. This approach leads to waiting time: no added value is delivered to the end user until information products are built on top of the layer (by other teams), and this causes handover moments and inefficiencies.

Final thoughts

This is the second blogpost about working in large data warehouse environments with multiple teams. 


Hennie

Sunday 25 October 2020

DevOps series : Working with teams in a data warehouse (part I)

Introduction

Perhaps you have been in this situation: you have a data warehouse that is shared between a large number of teams, you have implemented version control and a branching strategy like Gitflow or Microsoft Flow, and you think you have it all, but somehow code is not flowing in cadence into production. Symptoms are developers complaining that code breaks because other teams deployed their code too, or that it's unknown which team is responsible for which data objects.

This blogpost is a collection of my experiences at several customers over the last couple of years. I've been working in companies with multiple teams, mostly with Microsoft technology in data environments, and in DevOps environments where we delivered information products to our customers in the business, together with other teams. This blogpost is a mix of knowledge and experience of data, DevOps and Lean principles!

Continuous delivery and a data warehouse

We, as data professionals, work in data environments (e.g. data warehouses) where we want a common data platform in which entities are well designed and structured. We think about the business and we try to model the data according to the processes and the way of working of (a part of) the company. This results in a common data model implemented in a database.

I've seen multiple teams working on separate parts of the database, but sometimes teams depend on the work of other teams because they use those teams' data entities. Some entities are hotspots of usage between different teams: think about entities like Customer, Employee or Organization, which are commonly used entities in a data warehouse. But there are also entities that are not used by other teams in their part of the data architecture.


In the example above, Team 3 has dependencies on objects that are maintained by Team 1 and Team 2. This could be an ETL dependency. Now, if one of those teams changes the referenced object, Team 3 has a problem: the ETL logic will result in a failure. This is a very common situation when teams work in a common data architecture, and it is an annoying problem that is hard to fix.
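To make the dependency a bit more concrete, here is a hypothetical sketch (all object and column names are made up for illustration): Team 3 owns a datamart view that references objects maintained by Team 1 and Team 2.


-- Hypothetical example: Team 3 owns this datamart view,
-- but it references objects maintained by Team 1 and Team 2.
CREATE OR REPLACE VIEW datamart.vw_CustomerSales AS
SELECT c.CustomerKey,
       c.CustomerName,        -- column maintained by Team 1
       s.SalesAmount          -- column maintained by Team 2
FROM   rawvault.Customer c    -- object of Team 1
JOIN   rawvault.Sales    s    -- object of Team 2
  ON   s.CustomerKey = c.CustomerKey;

-- If Team 1 renames or drops CustomerName, this view and the ETL
-- that loads from it fail the next time they run.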

Final thoughts

What if we could find a way for teams to deliver their code to production as independently from the other teams as possible, and, where dependencies do exist, manage those dependencies? Are there ideas on how to solve the ambiguity about who is responsible for what, and can we use methods like branching by abstraction to manage dependencies between teams?

How can we achieve a complete data platform (data warehouse) with advantages like subject-oriented, integrated, time-variant and non-volatile (Inmon), and still deliver products as fast as possible to the business? Data warehouses had a bad name in the past, and data vault and other flexible techniques helped to deliver faster to the business. An extra challenge is the number of teams that work on a data warehouse. How should they balance their priorities between speed of delivery and the integrity of the data warehouse? If they focus only on speed of delivery, less attention is paid to integrity; if they focus only on integrity, less and less is delivered to the business. Of course the integrity of the data warehouse is important, but delivering code to production is just as important.

This is the first blogpost about DevOps and lean principles and working in data environments like data warehouses. I'm planning to write more about this topic in the future. 

Hennie

Thursday 15 October 2020

Generating an SSH key for accessing Azure DevOps with DBeaver

Introduction

For access from DBeaver to Git I used a PAT before, but I heard that SSH is a more secure way to access Azure DevOps/Git, so I investigated the usage and implementation of SSH access to Azure DevOps. This blogpost is a walkthrough of the process.


Generating the SSH key

First, generate a key with the tool 'ssh-keygen' by opening a CMD window and entering the following command.


D:\>ssh-keygen -C "hennie.denooijer@xxxx.nl"
Generating public/private rsa key pair.
Enter file in which to save the key (C:\Users\hdeno/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in C:\Users\hdeno/.ssh/id_rsa.
Your public key has been saved in C:\Users\hdeno/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:LtBYadsfaSDFSDFSFDAY90qnea5MXLOJKs5giLlmFHXs hennie.denooijer@xxxx.nl
The key's randomart image is:
+---[RSA 2048]----+
| ...     .o++o.  |
|.  ...    o=*.   |
| . ooo.  . o..   |
|. o.++E   o      |
| o oo.. S+ . .   |
|  + ..... + o o  |
| +   ...+. . + . |
|o . .  = o. = .  |
| o.. .o o. . o   |
+----[SHA256]-----+

This command produces the two keys needed for SSH authentication: your private key (id_rsa) and the public key (id_rsa.pub). Never share the private key! Here is the content of the id_rsa.pub file, which by default is stored in the folder C:\Users\{user}\.ssh (I changed some characters):

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDGhi6uphRGsgfWvlnckcpqEMGevUuS1iDoUx645965m
621qoCheJbaPtzVP4ldwVKU8dRtileWkMGjwYjRej0TWXSk4NouooctXGmZIrBy95ZDK6fkmXJEyIWvxO
581+C/qhPxTD+Nmsac2+yXB123D+s2Ky7wsKNFPg2RZ1VrW0RRgoegGCJqfT1EmL0D44R2yWggbTlDcL
E11K/trlGbxCcoAc/TP8YDGB+PLeV4qEqciJasasaaaLhWD72Mnj8M2ar5213wwdItSQlC4Di0sO792
J4RbtLlyBbFCrqNt9nnLIxj8BzWLZBsD7qISa1FYqh5jwpRlRpmuDLDdLRd hennie.denooijer@xxxx.nl
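

If you don't want to open the file in an editor, a quick way to get the public key onto your clipboard is the following (a sketch; the path assumes the default location mentioned above):

REM Show the public key in the CMD window
type C:\Users\%USERNAME%\.ssh\id_rsa.pub

REM Or copy it directly to the clipboard
type C:\Users\%USERNAME%\.ssh\id_rsa.pub | clip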


Add the SSH key to Azure Devops

The next step is to add the key to Azure DevOps. Go to Settings and click on SSH public keys.




Then press + New Key to add a new key to Azure DevOps.




Initially I thought that I could copy the fingerprint string generated by the ssh-keygen tool. I copied it into the Public Key Data box, but that resulted in an error.


Here is the error :

An error occurred while adding the xxx key: Invalid key: Key must be Base64 encoded with OpenSSH format and RSA type. Valid keys will start with "ssh-rsa".


After some reading of the Microsoft documentation I understood that I have to use the key that is stored in id_rsa.pub. Copy that into the Public Key Data box.


It is also possible to test whether SSH is working by issuing the command ssh -T.

D:\Git\SF_PERFTEST_TPC_H_MODULEA>ssh -T git@ssh.dev.azure.com
Warning: Permanently added the RSA host key for IP address 'xx.74.xx.1' to the list of known hosts.
Enter passphrase for key 'C:\Users\hdeno/.ssh/id_rsa':
remote: Shell access is not supported.
shell request failed on channel 0

The Microsoft documentation states the following: "Test the connection by running the following command: ssh -T git@ssh.dev.azure.com. If everything is working correctly, you'll receive a response which says: remote: Shell access is not supported." Be careful: NOT supported. If this is the message you receive, then it is ok.


Using the SSH key in DBeaver

The next step is trying to use the SSH key in DBeaver, and here I made a mistake by trying to use the HTTPS URI.



After trying a couple of times, reading the documentation and getting some helpful information from colleagues, I found I had to set the Git remote URL with the git remote set-url command, and I used the same URL as I used with HTTPS.


git remote set-url origin git@ssh.dev.azure.com:v3/xxxx/xxx/_git/SF_PERFTEST_TPC_H_MODULEA


But, that resulted in an error.

Warning: Permanently added the RSA host key for IP address 'xx.xx.xx.103' to the list of known hosts.
remote: Expected _full or _optimized, not '_git'.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


I changed the URI a bit (I removed _git from it). It is better to copy the SSH clone string from Azure DevOps.


git remote set-url origin git@ssh.dev.azure.com:v3/xxx/xxxx/SF_PERFTEST_TPC_H_MODULEA
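

To double-check that the remote now points to the SSH URL, you can list the configured remotes; the output should show something like the following fetch and push URLs (anonymized here, like the rest of this post):

git remote -v
origin  git@ssh.dev.azure.com:v3/xxx/xxxx/SF_PERFTEST_TPC_H_MODULEA (fetch)
origin  git@ssh.dev.azure.com:v3/xxx/xxxx/SF_PERFTEST_TPC_H_MODULEA (push)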


Then, when I tried to check in with DBeaver, the following window appeared and here I had to enter the passphrase that I entered before, during the creation of the SSH key.




And here it is: Azure DevOps/Git integration with DBeaver, based on an SSH key.



Final thoughts

A small how-to blogpost on creating an SSH key for accessing Azure DevOps from a tool like DBeaver.


Saturday 10 October 2020

Azure DevOps and Git integration with DBeaver

Introduction

One of the nice things about working with software is that there is always something new to explore. Right now, I'm looking into the possibilities of DBeaver and its integration with Git. The reason is that I want to explore connecting Snowflake and DBeaver/Git for my (local) development environment. This way I have version-controlled code locally and I can deploy the code to Snowflake, and I have a richer GUI instead of the web-based editor of Snowflake itself. But first, it's an experiment with DBeaver and Git to learn how it works and whether it is usable for projects I'm involved in.


Git plug-in

It seems that Git is not fully integrated into DBeaver; you have to install a plugin into DBeaver to get Git integration. It was a bit of a puzzle, but I managed to install the plugin. Here is the walkthrough.


There seem to be multiple Git plugins, one from DBeaver itself and one from Eclipse, and it seems that the Eclipse plugin breaks the DBeaver code. But, as you may see later in this blogpost, when I select the DBeaver Git support, some Eclipse code is installed as well. I'm not sure whether this is the right one.


Configure

Let's walk through the steps I took in order to make it work. A small disclaimer: I'm not claiming that these are the most efficient steps to configure the setup with Git, DBeaver and DevOps. Maybe there is a better sequence of steps to perform the configuration. Try it out yourself.


1. First start DBeaver, select "Help" and then select "Install New Software".


This will open the install new software window. 


2. In the next step a window opens and here you can choose the repository. I chose the DBeaver Git integration. 




3. When you are done, press the Next button and an overview window is presented. Here you can see that some Git software from Eclipse is installed as well.


4. Click Next and the next screen is shown. Here you have to review and accept the licenses.


5. Press Finish when the licenses are reviewed, and a warning appears that authenticity cannot be established.


6. DBeaver is restarted. When DBeaver returns, nothing seems to have really changed, but when you look closer there are some elements added to the File menu.


And at the bottom of the screen an extra pane appeared.


7. Press "Create a new local Git repository" and enter a directory for the Git repository. In my case it is D:\tmp\git.


Git

8. Now go to your Git folder, start CMD.exe and enter the following statements.


I saved the script "Script.sql" that I created in this folder, and with git status I'm notified that the file is untracked.

9. The next step is to add the file to the local git repository with "Git add". 



10. After the "Git add" execute a "git commit" as you can see in the following screenshot. 


11. Next, set up the remote repository URL with the git remote command.
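
Putting steps 8 to 11 together, the commands in the screenshots roughly come down to the following sequence (a sketch; the commit message and the remote URL are placeholders, use the HTTPS clone URL of your own Azure DevOps repository):

git status
git add Script.sql
git commit -m "Add Script.sql"
git remote add origin https://dev.azure.com/{organization}/{project}/_git/{repository}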



Next, when I tried to push the code to the remote repository, I received the following error message, because the remote repository contains work that I haven't included in my local repository. It seems that Git thinks that I'm mixing up repositories.


"Git Pull" gave me a "Fatal" because of unrelated histories. 


Now, I have created a local repository and a remote repository, I committed some code to both, and Git doesn't seem to understand how the two projects are related. The error is resolved with the --allow-unrelated-histories switch: add it to the git pull or git merge command, for example: git pull origin master --allow-unrelated-histories.

12. Enter "git pull origin master --allow-unrelated-histories".



13. Change something in the code and check whether the Git configuration is done correctly.


14. Commit and push the code to the remote repository. 


Okay, that seems to be working well.


DBeaver

But why did I need to install the plugin in DBeaver? I configured Git, but I didn't use any functionality of the Git plugin in DBeaver yet. Let's take a closer look at DBeaver and the Git plugin.


15. Now in the projects pane there is a little icon, and if you look closely it says "Git". Click on the icon.

16. Select "Existing local repository" and press Next.


17. Select the folder where the Git repository resides. In my case D:\tmp\Git.



18. Choose "Import using the New Project Wizard" and press Finish.



19. And now a new project is created in DBeaver with the script.sql file in the project.



20. The next step is to check whether the code can be committed from DBeaver. I changed something in the code and committed it to the remote repository.



21. In the File menu, choose Git and then "Commit changes to Git".





22. And in the next window press the button "commit and push" to push the changed code to the remote repository. 



23. For the login, create a PAT in Azure DevOps and use that as the password with your user account. Press Log in.



24. And the code is pushed to the remote repository. Press Close to close the window.


25. And voila! The change is committed into Git and Azure DevOps shows this perfectly.



Final Thoughts

This blogpost is about building a development environment with Git, Azure DevOps and DBeaver. I learned something new and hopefully it can help you too.


Hennie

Sunday 4 October 2020

Case (in)sensitive string comparison in Snowflake

Introduction

In my previous post, I already talked about case-sensitive and case-insensitive object identifiers in Snowflake. In this blogpost I'll go one step further and research how Snowflake compares strings in queries. I'll answer the question of how Snowflake handles string comparisons when the setting DEFAULT_DDL_COLLATION is set and when it is not set. We start with the situation where a database is created in the standard manner.


Setup

For this experiment I've set up a database that is created in a standard way. An example is shown here.


CREATE DATABASE IF NOT EXISTS "DB_CASING_TEST";


Here is an example of the table for the experiments and some values that are used in them.


CREATE TABLE IF NOT EXISTS dbo."MiXed tAbLe"(
	ID integer,  
    Name varchar(50)
); 

INSERT INTO dbo."MiXed tAbLe" VALUES (1, 'TeST'), (2, 'TEST'), (3, 'test'), (4, 'Other value');


Experiments

When I query the table with an equality expression in the WHERE clause (= 'TEST'), it returns a case-sensitive answer: only the record with 'TEST' is returned.
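
The screenshot shows the query; it was roughly the following (a sketch):

SELECT * FROM dbo."MiXed tAbLe" WHERE Name = 'TEST';   -- returns 1 row: 'TEST'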



Now, an experiment with the UPPER function in the WHERE clause! This will return 3 values.
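
Something like the following query (a sketch of what is in the screenshot):

SELECT * FROM dbo."MiXed tAbLe" WHERE UPPER(Name) = 'TEST';   -- returns 3 rows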



The next experiment is with the LIKE operator. This returns one value and is thus case sensitive.
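
Roughly this query:

SELECT * FROM dbo."MiXed tAbLe" WHERE Name LIKE 'TEST';   -- returns 1 row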




There is also a case-insensitive comparison operator: ILIKE. What does this operator return? Three values, and therefore the result is case insensitive.
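
Again, roughly this query:

SELECT * FROM dbo."MiXed tAbLe" WHERE Name ILIKE 'TEST';   -- returns 3 rows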




It is also possible to use the COLLATE function in the WHERE clause (and in other places). This is another way to do a case-insensitive comparison; of course you have to use a collation with the ci (case-insensitive) option.
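
A sketch of such a query, using the COLLATE function with the 'en-ci' collation:

SELECT * FROM dbo."MiXed tAbLe" WHERE COLLATE(Name, 'en-ci') = 'TEST';   -- returns 3 rows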



The next experiment is about creating a table with columns that are case insensitive. You can specify the collation per column. Here is an example of a CREATE TABLE statement with a collated column:

CREATE OR REPLACE TABLE dbo."MiXed_tAbLe_Collate"(
 	ID integer,  
    Name varchar(50) COLLATE 'en-ci'
);


I inserted the same values again and repeated the experiments, and you can see what happened. I've included one of the screenshots below. The result is now 3 values instead of one, as was the case in the previous experiments.
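
The screenshot corresponds to a plain equality comparison on the collated column, something like:

SELECT * FROM dbo."MiXed_tAbLe_Collate" WHERE Name = 'TEST';   -- returns 3 rows because of the 'en-ci' collation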



I could show you all of the queries with LIKE, ILIKE, UPPER and the COLLATE function, but you'll have to believe me: they all returned 3 values.


The next experiment investigates how joining works with the case-sensitive setting, and as you might expect, all results are based on case-sensitive joining. First, set up the base for the experiments.


CREATE OR REPLACE TABLE dbo."MiXed_tAbLe_Join"(
 	ID integer,  
    Name varchar(50) 
);

INSERT INTO dbo."MiXed_tAbLe_Join" VALUES (1, 'TeST'), (2, 'TEST'), (3, 'TESt'), (4, 'Other valuessss');


Here is one experiment with plain and simple joining. I changed the data a bit, and you can see that two rows are returned: case sensitive.
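
The join in the screenshot looked roughly like this (a sketch):

SELECT a.ID, a.Name, b.ID, b.Name
FROM   dbo."MiXed tAbLe"      a
JOIN   dbo."MiXed_tAbLe_Join" b ON a.Name = b.Name;   -- 2 rows: 'TeST' and 'TEST'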



It is also possible to use the COLLATE function here, resulting in all values being joined case insensitively with the other values.
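
A sketch of the same join with the COLLATE function on both sides:

SELECT a.ID, a.Name, b.ID, b.Name
FROM   dbo."MiXed tAbLe"      a
JOIN   dbo."MiXed_tAbLe_Join" b
  ON   COLLATE(a.Name, 'en-ci') = COLLATE(b.Name, 'en-ci');   -- all 'TEST' variants now match each other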


DEFAULT_DDL_COLLATION

There is also another option: DEFAULT_DDL_COLLATION. This setting can be specified during the creation of the database, and the option is also available at account level.


CREATE DATABASE IF NOT EXISTS "DB_DEFAULT_DDL_COLLATION_EN_CI" DEFAULT_DDL_COLLATION = 'en-ci';


And now, if we query the table with the same query we used before, it returns 3 values.
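
Assuming the same dbo schema, table and data are created inside this new database, the query looks the same but now behaves case insensitively (a sketch):

USE DATABASE "DB_DEFAULT_DDL_COLLATION_EN_CI";
SELECT * FROM dbo."MiXed tAbLe" WHERE Name = 'TEST';   -- now returns 3 rows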



With this setting the comparisons, joining and other operations are now case insensitive.

Final thoughts

With DEFAULT_DDL_COLLATION it is possible to give the database another collation, and one of the options is case insensitivity. I urge you to look into the options available.

Hennie