Friday, June 19, 2020

Creating AdventureWorks tables in Snowflake

Introduction
I was wondering how easy it would be to recreate the AdventureWorks tables in Snowflake and load the accompanying csv files into them. So I grabbed the CREATE TABLE scripts from the AdventureWorks example (GitHub) and tried to execute them in Snowflake. In the beginning I ran into a couple of errors. The next step was uploading the csv files that come with the AdventureWorks database to Blob storage in Azure. I created a SAS token and started loading the data into the tables. I encountered some errors, but I managed to solve almost all of them. I had one remaining issue with DimProduct.csv; it seems there is something wrong with that file. 
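
For context, loading from Azure Blob storage goes through an external stage in Snowflake. Below is a minimal sketch of how such a stage can be created with a SAS token; the storage account, container and token are placeholders, not the values I actually used.

CREATE OR REPLACE STAGE my_azure_stage
  URL = 'azure://myaccount.blob.core.windows.net/adventureworks'   -- hypothetical storage account and container
  CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=...');                     -- the SAS token generated in the Azure portal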

My first table
I started with a simple table, the DatabaseLog table, because I was curious what would and what would not convert into Snowflake SQL. Below is the SQL Server definition of the table. 

CREATE TABLE dbo.DatabaseLog (
    DatabaseLogID int IDENTITY (1, 1) NOT NULL,
    PostTime datetime NOT NULL, 
    DatabaseUser sysname NOT NULL, 
    Event sysname NOT NULL, 
    Schema sysname NULL, 
    Object sysname NULL, 
    TSQL nvarchar(max) NOT NULL, 
    XmlEvent xml NOT NULL
)

This resulted in the following Snowflake table:

--DROP TABLE IF EXISTS dbo.DatabaseLog;

CREATE TABLE IF NOT EXISTS dbo.DatabaseLog (
    DatabaseLogID int IDENTITY (1, 1) NOT NULL,
    PostTime datetime NOT NULL, 
    DatabaseUser varchar(128) NOT NULL, 
    Event varchar(128) NOT NULL, 
    Schema varchar(128) NULL, 
    Object varchar(128) NULL, 
    TSQL varchar NOT NULL, 
    XmlEvent variant NOT NULL
);

I then converted the other CREATE TABLE scripts into Snowflake SQL. The following issues had to be solved in order to create the tables in Snowflake:
  • The [ and ] brackets are not recognized by Snowflake.
  • Sysname is not supported in Snowflake; I replaced it with varchar(128).
  • Nvarchar is not recognized as a valid datatype; I replaced it with varchar.
  • Varchar(max) is not supported. I used a plain varchar instead, which creates the largest varchar possible.
  • The XML datatype is not supported ("Unsupported data type 'XML'"). I used the variant datatype for storing the XML.
  • I had to remove the ON [PRIMARY] filegroup clause.
  • Bit is not supported. I replaced it with the boolean datatype.
  • I removed all of the GO statements, but that was not really necessary.
  • I replaced nchar with char.
  • Money is not a supported datatype and I haven't found a similar datatype in Snowflake, so I replaced it with a float datatype.
  • Varbinary(max) is not recognized; I replaced it with varbinary without a maximum length.
After I solved these issues it was very easy to create the tables in Snowflake; the short before-and-after sketch below shows several of these substitutions side by side.
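
As a hedged illustration (a made-up fragment in the style of the AdventureWorks product tables, not a verbatim excerpt), a set of SQL Server columns like:

    StandardCost money NULL,
    FinishedGoodsFlag bit NOT NULL,
    ProductLine nchar(2) NULL,
    LargePhoto varbinary(max) NULL

becomes, after applying the substitutions above:

    StandardCost float NULL,
    FinishedGoodsFlag boolean NOT NULL,
    ProductLine char(2) NULL,
    LargePhoto varbinary NULL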

Loading the data into Snowflake
The next step was loading the data into the tables in Snowflake. The first error I ran into suggested there was something wrong with the file. I executed the following statement:

Copy into dbo.DimDate
  from @my_azure_stage
  pattern='DimDate.csv'
  file_format = (type = csv field_delimiter = '|');

This resulted in the following error message:

Numeric value '0xFF0xFE20x0000x0000x0050x0000x0010x0000x0010x00' is not recognized File 'DimDate.csv', line 1, character 1 Row 1, column "DIMDATE"["DATEKEY":1] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.

I found out that the encoding of the file was UCS-2 LE BOM and Snowflake didn't recognize this format.
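
As an aside: a convenient way to inspect such errors without actually loading anything is Snowflake's VALIDATION_MODE option on COPY INTO. This is a hedged sketch of that approach, not a statement from the original post.

copy into dbo.DimDate
  from @my_azure_stage
  pattern='DimDate.csv'
  file_format = (type = csv field_delimiter = '|')
  validation_mode = RETURN_ERRORS;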



The next step was changing the file definition in the COPY INTO statement. I added ENCODING='UTF-16LE' to the file format, as you can see below.

copy into dbo.DimDate
  from @my_azure_stage
  pattern='DimDate.csv'
  file_format = (type = csv field_delimiter = '|' ENCODING='UTF-16LE');

When I added this extra parameter to the COPY INTO statements for the other files as well, importing the csv files was easy. It worked fine for every file except one, DimProduct.csv. The specific error that occurred while loading that file is the following:

NULL result in a non-nullable column File 'DimProduct.csv', line 1, character 2 Row 1, column "DIMPRODUCT"["SPANISHPRODUCTNAME":7] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.

Examining the file in Excel with Text to Columns showed a corrupted file with mixed-up columns. I didn't want to waste too much time digging into a file issue. The origin of the error seems to be the image column in the file, and Google didn't lead me to a solution either, so it's a pity I couldn't solve this one. Because Excel also mixed up the file, I decided to let it go and concluded that it was not a Snowflake issue.
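
For anyone who just wants the salvageable rows, the error message itself points at the ON_ERROR option. A minimal sketch (I have not verified this against the corrupted file; rows that fail to parse are simply skipped):

copy into dbo.DimProduct
  from @my_azure_stage
  pattern='DimProduct.csv'
  file_format = (type = csv field_delimiter = '|' ENCODING='UTF-16LE')
  on_error = CONTINUE;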

Final Thoughts
Transferring the code from SQL Server to Snowflake is fairly easy. There are some issues, but these are easy to overcome. It took me a couple of hours to convert the table scripts into Snowflake SQL, create the load scripts and do a bit of debugging. 

Hennie

2 comments:

  1. Can you please help us? How did you come to the conclusion of a mixed corrupted file? We are getting the same error, with the same COPY INTO statement without ENCODING='UTF-16LE'.

    Replies
    1. It is a while ago, but I remember that I imported the csv into Excel and all the columns were mixed up. Columns and lines did not match up into a clear tabular form.
