Friday, June 30, 2023

Fabric : Unidentified folder in Lakehouse

I came across an interesting thing in Microsoft Fabric: an Unidentified folder:



According to Microsoft

"The Unidentified Area is a part of the managed area of your lake which displays any folders or files in the managed area with no associated tables in SyMS. For example, if a user were to drop an unsupported folder/file in the managed area, eg: a directory full of pictures or audio, this would not get picked up by our auto-detection process and hence not have an associated table. In this case, this folder would be found in this unidentified area. The main purpose of this new section is to promote either deleting these files from the managed area or moving them to the file section for further processing."


Strangely, I was just creating tables in the Lakehouse with some PySpark scripts and somehow a table ended up in an Unidentified folder. But when I looked a couple of seconds later, it was gone.
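
For reference, the kind of statement I was running looks roughly like this (a minimal sketch with illustrative names, not my exact script):

# Minimal sketch: read parquet files and register them as a managed delta table (names are illustrative)
df = spark.read.load("Files/SalesLT/Product", format="parquet")
df.write.format("delta").saveAsTable("product")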


Theo Lachev also wrote a blogpost about the Unidentified folder:

"Going quickly through the list, the managed area (Tables) can’t be organized in subfolders. If you try to paint outside the “canvas”, the delta tables will end up in an Unidentified folder and they won’t be automatically registered. Hopefully, this is a preview limitation since I don’t see you can implement a medallion file organization if that’s your thing."


It's still not quite clear to me what the "Unidentified" folder means. In my example it seems to have been a temporary issue, because it disappeared after a few moments. If someone has more information about this, I'm curious. Let me know.

Fabric : Installing and using OneLake Explorer

Introduction

It's also possible to explore OneLake with Windows Explorer. It looks just like OneDrive and it's easy to install and use. This is a small blogpost about the installation of OneLake Explorer.


Installation

You can download OneLake Explorer from the Microsoft site:



The msi file is downloaded to the local file system.



The next step is to click on the Install button to install OneLake Explorer.




Log in with your Fabric account and the OneLake Explorer experience is just like OneDrive in Windows Explorer. Here is a small impression of OneLake Explorer.



Final thoughts

OneLake Explorer is easy to install and easy to use. 

Thursday, June 29, 2023

Fabric : Connect SSMS with Fabric


I was getting tired of the web browser experience of Fabric and I remembered from one of the MSBuild videos that you can connect SSMS to Fabric. Thanks to this tip I managed to connect, and now I have an SSMS experience on Fabric.


Go to the settings.





Then you can copy the connection string and paste it into the Server name box of the connection properties of SSMS.





And voilà, I can connect to warehouses and the SQL endpoints of lakehouses.


Great, now I have an SSMS experience with Microsoft Fabric.

Hennie

Monday, June 26, 2023

Fabric : Shortcuts and referencing SQL Endpoints in Fabric

Introduction

In this blogpost I'm investigating how the Warehouse and the Lakehouse can reference each other. You can reference Lakehouse tables (and views) from the Warehouse and vice versa, but it is not the same experience; it happens in different ways. In the Warehouse you can reference a table in the Lakehouse with SELECT * FROM <Lakehouse>.<schema>.<tablename> (or <viewname>), and in the Lakehouse you can link a table from the Warehouse with a shortcut.




Using data of a Lakehouse object in Warehouse

The first experiment is using Lakehouse data in the Warehouse. It's a bit confusing because the option is + Warehouses, but (in this example) I'm referencing a SQL endpoint of the Lakehouse in the Warehouse; Microsoft treats the SQL endpoint of a Lakehouse the same as a warehouse in Fabric. Click on + Warehouses.


Then select the SQL endpoint of the Lakehouse.


Now the selected SQL endpoint is available in the Warehouse.


And now I can query the data from the Lakehouse in the Warehouse. This can be seen in the following screenshot.



Can I change the data of the Lakehouse from the Warehouse? Let's see if I can do that.

No, you cannot. The error message is "Data Manipulation Language (DML) statements are not supported for this table type in this version of SQL Server." SQL Server? Anyway, it's not possible.


What happens when I change the name of the table in the Lakehouse? I change the name with the Rename option in the Lakehouse.

Let's rename it to "product_test".



And now check the query in the Warehouse. I had to refresh the Lakehouse in the Warehouse; if I don't do that, the query surprisingly still executes successfully. It keeps working 'correctly' until I refresh the Lakehouse, which is a bit confusing. Maybe it's a timing issue? It's also a bit awkward that whenever I leave the Warehouse item, the Lakehouse disappears and I have to add it every time I enter the Warehouse again. In the end it gives this error.


Yet another thing is the option to remove the item from the dataset. It seems that Fabric is saying that the product table has been added to the standard dataset? Let's try to remove it.




It gives an error:




The error message : "Unable to update BI model with these changes. Please try again later or contact support."

I'm not sure whether the error is legitimate, because I was renaming the table back and forth. Maybe Fabric is just confused now ;-). Is it intended that, when a Lakehouse is 'connected', you can add tables of the Lakehouse to the Warehouse's standard dataset? If so, it seems to me that this is not completely correctly implemented yet.


Using data of Warehouse in Lakehouse

Next is how we can use data from the Warehouse in the Lakehouse.


And then I have to tell Microsoft Fabric at which location it can find the object.


Select the Warehouse.



Select the tables (or views). I select the complete schema as a shortcut.



And now the tables are available in the Lakehouse 



Can I query these tables in the notebook and use them there? No, I can't seem to get this working properly. I'm not sure whether I'm doing something wrong or whether this is not properly implemented yet by Microsoft.



I think I should create a shortcut directly to a table instead of at a higher level of the Warehouse. When I reference the table 'sales' directly, it works.



So the table is shortcutted from the Warehouse into the Lakehouse and I can use it in my notebook. Interesting!
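
A minimal sketch of how the shortcutted table can then be read in the notebook (assuming the lakehouse that holds the 'sales' shortcut is the notebook's default lakehouse):

# Read the 'sales' table that is shortcutted from the Warehouse into the Lakehouse
salesdf = spark.read.table("sales")
display(salesdf)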


Final Thoughts

The first thing that comes to mind is: why "shortcuts" in the Lakehouse and why "reference a SQL endpoint" in the Warehouse? Why not use shortcuts everywhere (and use the SQL endpoint under the hood)? In my opinion it would have been a more consistent user experience if Microsoft had used shortcuts everywhere. It's a bit confusing. But the idea of linking data is great! Now you could say that the data warehouse guys (or girls) develop the truth (for instance a customer table) and you can use it in the Lakehouse for data engineering or a machine learning model.

Hennie



Sunday, June 25, 2023

Fabric : Naming conventions for Microsoft Fabric

Introduction 

This is a first draft of a naming convention for Microsoft Fabric. I think it's good to define a naming convention for Fabric because the list of items in a workspace can grow enormously. After a couple of experiments I already have quite a few items, and that is nothing compared with a real-life situation.



Why should you use a naming convention anyway? Working in an unambiguous way brings various advantages:
  • People are less dependent on the implicit knowledge of internal and external employees
  • Interchangeability of employees in project building, debugging and troubleshooting
  • Better overview (making adjustments easier)
  • Faster insight into (possible) problems
  • Faster troubleshooting
  • Auditability
  • Simpler impact analyses
  • Higher availability


You can look at naming conventions in two ways: prefix the items or postfix the items. Prefixing (with the item type) means that items of the same type are sorted together; postfixing means that items that belong together functionally (and therefore start with the same name) are sorted together. So: technically organized or functionally organized!


Microsoft Fabric Components

These are the components of Microsoft Fabric (so far):

  • Capacity : CP_<CapacityName>
  • Workspace : WS_<WorkspaceName>
  • Data pipeline : DP_<DataPipelineName>
  • Dataflow Gen2 : DF_<DataflowName>
  • Eventstream : ES_<EventstreamName>
  • Experiment : EX_<ExperimentName>
  • KQL Database : KD_<KQLDatabaseName>
  • KQL Queryset : KQ_<KQLQuerysetName>
  • Lakehouse : LH_<LakehouseName>
  • Model : MD_<ModelName>
  • Notebook : NB_<NotebookName>
  • Report : RP_<ReportName>
  • Spark Job Definition : SD_<SparkJobName>
  • Warehouse : WH_<WarehouseName>


As said before, you can also postfix the items, like <WarehouseName>_WH.


Naming convention Warehouse

The Warehouse is the place I'm most familiar with. In the Warehouse you have (for now) 4 kinds of objects:
  • Procedures : sp<ProcedureName>
  • Functions : fn<FunctionName>
  • Tables : tbl<TableName>
  • Views : vw<ViewName>
Now, I know that in SQL Server there were issues with naming stored procedures sp<ProcedureName>, because SQL Server would first look among the system stored procedures before searching the user-created stored procedures. This could lead to a performance loss. I'm not sure if this is the case in Microsoft Fabric.

Yet another discussion point is prefixing views with vw and tables with tbl, because if you decide to materialize a view into a table, you have to rename the object and that could break the data pipeline.

There are also data types in the Warehouse, and when you use variables in procedures and functions you can apply this naming convention in order to quickly see the data type of a variable. This can be handy:
  • Bigint : big<VariableName>, example bigPatientId
  • Binary : bin<VariableName>, example binMessage
  • Bit : bit<VariableName>, example bitIsOk
  • Char : chr<VariableName>, example chrPatientName
  • Uniqueidentifier : guid<VariableName>, example guidKey
  • Varbinary : vab<VariableName>, example vabMessage
  • Varchar : chv<VariableName>, example chvPatientName
  • Date : dt<VariableName>
  • Time : tm<VariableName>
  • Datetime2 : dtm<VariableName>, example dtmAppointmentDate
  • Float : flt<VariableName>, example fltMeasurement
  • Integer : int<VariableName>, example intPatientID
  • Numeric or decimal : dec<VariableName>, example decAmount
  • Smallint : sin<VariableName>, example sinSubcategoryID
  • Real : rea<VariableName>, example reaAmount

Naming convention Data Engineering


To be continued


Naming convention Data Pipeline


To be continued


Naming convention PowerBI


To be continued

Final thoughts

This is an ongoing blogpost to which I'll add naming conventions for the other Microsoft Fabric experiences.


Saturday, June 24, 2023

Fabric : Loading data in the Bronze Layer

Introduction

The Bronze layer is one of those Data Lakehouse buzzwords and it comes from the medallion architecture of Databricks. The Bronze layer is the first layer in the medallion architecture; the other layers are the Silver and Gold layers.


From the Databricks site:

 "The Bronze layer is where we land all the data from external source systems. The table structures in this layer correspond to the source system table structures "as-is," along with any additional metadata columns that capture the load date/time, process ID, etc. The focus in this layer is quick Change Data Capture and the ability to provide an historical archive of source (cold storage), data lineage, auditability, reprocessing if needed without rereading the data from the source system."


OK, it seems to me that this is comparable with the landing zone or staging area that we know from data warehousing. The main difference here is that in data warehousing you have to define the tables with their respective columns and data types (schema-on-write), while in the Data Lakehouse the data is stored as files.


I'm not saying that this is THE new data architecture you have to build, that it's set in concrete and you have to do it this way. For me it's a learning curve: I'm trying to understand the medallion architecture and see how it compares with a Data Warehouse solution. In my opinion a Data Warehouse is a solution and a Data Lakehouse is more technology driven. You can build Data Warehouses with different tools/platforms. You already see an adoption of Data Warehouse best practices in Data Lakehouse solutions, for instance star schemas. Sessions at Databricks also talk about building a Data Warehouse in a Data Lakehouse.


By the way, this blogpost is inspired by the YouTube videos of Stijn. If you want to know more about this, you can watch the video.


Let's build a Bronze layer

OK, let's go to Microsoft Fabric! I've created a new Lakehouse called "LH_Bronze"; I'm experimenting a bit with naming conventions ;-). When you define a new Lakehouse, it creates three items: a lakehouse, a standard dataset and a SQL endpoint.



Next is to choose a solution for moving the data from the landing zone to the Bronze layer (I have a bit of mixed feelings about having both a landing zone and a Bronze layer; every copy of data is a kind of waste according to the Lean methodology).


For this reason I choose the notebook way in Microsoft Fabric. Let's create a notebook.



And the notebook is created right away. I don't have to enter a name for the Notebook.




First, let's add a lakehouse to the Lakehouse explorer.



And then you can choose to add a new lakehouse or use an existing lakehouse. I choose to use an existing lakehouse.


In the next window I can select a lakehouse. I choose the "LH_Bronze" lakehouse.




Let's save the Notebook and give it a proper name:




Next, in the workspace list of items you can see that I've created a notebook.


Now let's go back to the notebook. The problem is that we don't see the files of the landing zone; they are in the other lakehouse, the Landingzone Lakehouse. So first we have to create a shortcut to the Landingzone Lakehouse.


Create a shortcut to the Landingzone Lakehouse

First go to the LH_Bronze Lakehouse.



Click on the three dots (...) next to Files and the context menu opens:



Choose New shortcut in order to create a shortcut, and then choose the internal Microsoft Fabric OneLake.



Next, you have to select a lakehouse or a warehouse. I choose a lakehouse. There are two options here: the Landingzone Lakehouse and the Bronze Lakehouse (you can create a shortcut to the same lakehouse? Interesting). Click on Next.



And then you have to further specify the location in the Lakehouse where you want to point the shortcut to. 



And click on Create.


And now the shortcut to the Landingzone Lakehouse is created. The two lakehouses are kind of connected (in one direction). Next, I choose to create a new file location for the files and I name it BronzeOneLake.




So now we have two "data lakes", or perhaps we should say two locations in the OneLake data lake (!).

Now let's go back to the notebook.


Build a load script in Notebook

Now that we have created a shortcut, let's see how it looks in the notebook. Don't forget to refresh.
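
If you want to verify the shortcut from code instead of the Lakehouse explorer, something like this should work (a sketch; it assumes the LH_Bronze lakehouse with the LandingZoneOneLake shortcut is attached as the notebook's default lakehouse, and that mssparkutils is available as a built-in, as it is in Fabric notebooks):

# List the contents of the shortcut folder from the notebook
for f in mssparkutils.fs.ls("Files/LandingZoneOneLake"):
    print(f.name, f.isDir, f.size)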



The next step is to create a script that reads the Product files into a data frame.


LandingZoneLocation = "Files/LandingZoneOneLake/SalesLT/Product"
Product = spark.read.load(f'{LandingZoneLocation}', format='parquet')
display(Product)

I used the display function here to debug the code and see if it returned some values.





Next, we add some metadata columns to the data frame: the user that loaded the data, a load timestamp and a SHA hash of the row.



from pyspark.sql.functions import lit, current_timestamp, sha2, concat_ws, col

# LoadUser is assumed to be defined earlier in the notebook
Productdf = (
    Product
    .withColumn('LoadUser', lit(f'{LoadUser}'))
    .withColumn('LoadTimeStamp', current_timestamp())
    # RowHash: SHA-256 over the concatenated source columns, used later to detect changed rows
    .withColumn('RowHash',
        sha2(
            concat_ws('.',
                col('ProductID'),
                col('Name'),
                col('ProductNumber'),
                col('Color'),
                col('StandardCost'),
                col('ListPrice'),
                col('Size'),
                col('Weight'),
                col('RowGuid'),
                col('ModifiedDate')
            ), 256
        )
    )
)
display(Productdf)

Execute the cell.



Next is a script pattern that distinguishes between an initial load and an incremental load. You can set this variable at the beginning of the script, and maybe it's possible to parameterize it, although there may be better solutions, in my opinion.
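
For example, the variables used below could be defined at the top of the notebook like this (a minimal sketch; the values are illustrative assumptions, not taken from the original script):

# Illustrative parameter cell (values are assumptions)
IsInitial = 1                               # 1 = initial (full) load, 0 = incremental load
DatabaseName = 'LH_Bronze'                  # target lakehouse (database) name
TableName = 'product'                       # target table name
DeltaTableLocation = f'Tables/{TableName}'  # path to the managed delta table, used by the merge
LoadUser = 'Hennie'                         # value recorded in the LoadUser metadata column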


import delta

# Initial load: drop and recreate the delta table from the data frame
if IsInitial == 1:
    spark.sql(f'DROP TABLE IF EXISTS {DatabaseName}.{TableName};')
    Productdf.write.format('delta').saveAsTable(f'{DatabaseName}.{TableName}')

# Incremental load: merge on ProductID, insert new rows and only update rows whose RowHash changed
if IsInitial == 0:
    ProductdfDelta = delta.DeltaTable.forPath(spark, f'{DeltaTableLocation}')
    (
        ProductdfDelta.alias('delta')
        .merge(Productdf.alias('df'), 'delta.ProductID = df.ProductID')
        .whenNotMatchedInsertAll()
        .whenMatchedUpdateAll(condition='delta.RowHash <> df.RowHash')
        .execute()
    )

With the %%sql magic it's possible to query the data with SQL queries.

%%sql

SELECT * FROM LH_Bronze.product
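
If you prefer to stay in PySpark, the same query can be run with spark.sql (a small sketch, using the table name from the example above):

# PySpark equivalent of the %%sql cell above
productdf = spark.sql("SELECT * FROM LH_Bronze.product")
display(productdf)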



Using the SQL Endpoint for querying the data 

The Lakehouse also offers a SQL endpoint to the data in the Lakehouse. There you have SQL functionality as we are used to in the Microsoft Fabric Warehouse and SQL Server. I'm not sure whether you get the full T-SQL functionality of the Microsoft Fabric Warehouse.


It is the same GUI as the Warehouse experience in Microsoft Fabric.

Final thoughts

An interesting YouTube video by Stijn helped me on my way to build this example script for loading data into a Bronze layer in Microsoft Fabric.


Hennie

Thursday, June 22, 2023

Fabric : Building a Landing zone in Microsoft Fabric

Introduction

Hi everyone, in this blogpost I'll experiment with Data Factory in Microsoft Fabric. I'll create a data lake in the Data Lakehouse experience and load some data from an Azure SQL Database, which I've prepared for this blogpost with the demo database AdventureWorksLT, into the data lake.

Let's go and see how it works.


Setting up a demo database in Azure SQL Database

First, I created a SQL database in Azure for this blogpost. Go to Microsoft Azure, search for Azure SQL Database and click on Create.


These are the standard settings for a SQL Database and I choose to change nothing here.




For the simplicity of the demo I choose SQL authentication.




I take the smallest database size in order to save on costs.



Click through the tabs until you see the Additional settings. Here I select the AdventureWorksLT sample database.


And we are finished in Microsoft Azure.


Step 1 : Create the Data Lakehouse

The next step is to create a lakehouse in Microsoft Fabric. Click on Lakehouse (Preview).



Enter a name for the lakehouse and press Create.



There are "Files" and there are "Tables" folders in Lake House. Now the tables are managed by Fabric and the files are unmanaged. In the files section we are going to build a Data lake.




Create a subfolder as the location where the files will land in the folder structure.




Create the pipeline in Data Factory

Choose Data Factory in Fabric. Remember to select the right workspace, the one in which you created a (free trial) capacity. Another workspace will instruct you to create a new Fabric capacity.



And then click on Data pipeline. This will open Data Factory.



The next step is to enter a name for the pipeline. I'm already experimenting with some naming conventions; later I will come up with some proposals.



Choose "Add Pipeline activity"




When you choose to add a pipeline activity, a context menu opens and you can choose the "Lookup" activity.




And now the activity is created on the canvas of Data Factory.


And here we select "Azure SQL Database"




And click on "Continue" when you have selected the "Azure SQL Database" option


A new window opens where the credentials and connection string must be entered in order to connect to the SQL Database.




Now let's get some information from Azure SQL Database settings in order to connect the data pipeline with Azure SQL Database.




An error occurs because Fabric is not allowed to connect to the SQL Database server.



I add a firewall rule in Azure : 


The rule above is not correct; I haven't figured out yet what the best ranges are for the firewall settings. I'll look into this later.

Next, an error occurred when I tried to connect, and it kept appearing.


I decided to recreate the connection and then it worked fine. I think that is one of those things with a public preview version: you have to accept it and try to work around it.

Now the metadata Lookup succeeds. It will load the metadata of the tables into the pipeline. Enter the connection to Azure SQL Database and the connection type, change the option "Use query" to Query, enter the metadata query against INFORMATION_SCHEMA (it should at least return the TABLE_SCHEMA and TABLE_NAME columns that are used later in the ForEach expressions), and don't forget to uncheck First row only.



The next step is to add a ForEach activity to the pipeline, give it a proper name and click on the plus in order to add a Copy data activity inside the ForEach activity.


Then, select the Copy Data Activity.





Don't forget the settings on the ForEach Loop! 




Enter the expression @activity('LKP_SQL_AdventureWorksLT').output.value in the Items box


Now, in the Copy data activity, go to the Source tab and set the right settings in the entry boxes.






And change the Destination tab of the Copy activity as shown in the following screenshot:




Change the File path to the following expression (for the Product table, for example, this resolves to LandingZoneOneLake/SalesLT/Product/):

@concat('LandingZoneOneLake/', item().TABLE_SCHEMA, '/', item().TABLE_NAME, '/')


Run the Data Factory pipeline and check if it worked. I made some errors in the expression by referencing the wrong fields from the metadata query, but that was fixed fairly quickly.




And in the lakehouse the following folders and files are created:





Final Thoughts

Overall it was easy to build this solution in Microsoft Fabric. There are still some bugs and you need some workarounds to make it work, but I'm quite positive about the way Microsoft Fabric works.

For more inspiration I would like to point you to this YouTube video by Stijn, which I used as inspiration for this blogpost.

Hennie