Friday, July 31, 2015

DAX: The RELATED() Function

Introduction

Today, a blog post about the RELATED() function of DAX (Data Analysis Expressions) in PowerPivot. The RELATED() function brings a field from another table into the table where you enter the expression. In this blogpost I'll use the Contoso retail sample data, which you can download from the Microsoft site.

Data model

Below is the data model, with 1:n relationships in the chain DimProduct -> DimProductSubCategory -> ProductCategory. So one product belongs to exactly one ProductSubCategory, and one ProductSubCategory can contain multiple products. These relationships are called chained relationships, because they form a chain that you can follow from beginning to end to relate many tables. In the multidimensional (and tabular) world we recognize this as a hierarchy construction.


In the screenshot above you can see the fields ProductCategory and ProductSubCategory in the DimProduct table. These are not native fields but related columns, pulled in from the DimProductSubCategory and ProductCategory tables.

Even though the original Contoso Sales data model has two distinct tables for category and subcategory (which is correct for a standard database design), the data model is much easier to query if we integrate certain fields from DimProductSubCategory and ProductCategory into the DimProduct dimension and hide the DimProductSubCategory and ProductCategory tables.

RELATED() function

The RELATED() function pulls a column from a related table into the table where you use it. Below you can see the expression that is used in the DimProduct table.



And this is the expression:

=RELATED(ProductCategory[ProductCategoryName])

You can only use the RELATED() function when a relationship between the tables is defined.

Conclusion

This is a great function for combining data from multiple tables.

Greetz,

Hennie


Installing PowerBI Desktop V2

Introduction

This week Microsoft released PowerBI Desktop (formerly known as Power BI Designer), in case you haven't noticed by now ;-). This desktop version is part of the self-service BI stack and is the counterpart of the more IT-oriented stack of SSIS, SSRS and SSAS. This new BI suite (PowerBI Desktop) is a combination of PowerQuery, PowerPivot, PowerMap and PowerView. One of the reasons that PowerBI Desktop is a separate product is the slow release cycle of the Power tooling in Microsoft Office: because of the huge install base of Office, extensive testing has to be done before a new version can be released. With PowerBI Desktop, faster releases of the Power tooling are possible.

In this blogpost I'll describe the installation process of PowerBI Desktop. In later blogposts I'll dive deeper into the functionality of PowerBI Desktop and DAX.

Installation

1. First download PowerBI Desktop to your system. Then start the installer, PBIDesktop_x64.msi in my case, and click on Run to start the installation process.



Wait until the initialisation process has ended.



2. Click on Next to begin the installation process.


3. Accept the license terms and click on Next.


4. Choose the installation path and click on Next.


5. Click on the Install button.



The installation process continues with installing the files.


6. The installation process has ended; click on the Finish button.


PowerBI Desktop is starting up now.


The start window is opened and ready for use.





Conclusion

My first impression of Power BI Desktop is great. Looking forward to working with PowerBI Desktop in projects.

Greetz,
Hennie

Wednesday, June 3, 2015

Microsoft Azure ML

Introduction

Now that Machine Learning (ML) is becoming more and more important, it's a good idea to get a grasp of its capabilities. In this blogpost I'll investigate machine learning in more detail: what is machine learning, what is the process of creating an ML model, and what are common algorithms (in Microsoft Azure ML)?

Data mining techniques were already available in SSAS, but hardly used by customers or BI consultants. A while ago Microsoft released Azure Machine Learning, a tool that lets you apply historical data to a problem by creating a model and using it to predict future behavior or trends.

For this blogpost I'll use Microsoft Azure ML Studio as an example. Below is a screenshot of a try-out of Microsoft Azure ML. On the right are the data flow, the clean-up and a projection, and on the left is the Linear Regression algorithm that is applied to the data, so that the model is trained.




Machine learning process

Everybody knows the kind of quiz where someone has to guess an object from the characteristics another person describes, like the game Pictionary. For example: Me: “It’s round, green, and edible.” You: “It’s an apple!” So, based on what you already know (training the model), you can guess that the answer should be an apple (scoring and testing the model).


And you have to keep training the model by adding new data, like green apples, red apples and yellow apples, to improve it. So, there are a couple of steps:
  1. Get the data.
  2. Preprocess the data.
  3. Define features.
  4. Choose and apply an algorithm.
  5. Predict new incoming data.

1. Get the data

The first thing to do is to get the data from a source. There are multiple options for loading data into ML Studio.

File Formats
The following file formats are supported in ML:
  • CSV file.
  • TSV file.
  • Plain text.
  • Svmlight file (Support Vector Machine).
  • Attribute-Relation File Format (ARFF).
  • Zip file.
  • R object or workspace.

Reader options
There are also some other input options available, listed below (a small Python sketch of loading data follows this list):
  • Web URL via HTTP.
  • Hive Query.
  • Azure SQL Database.
  • Azure Table.
  • Azure Blob storage.
  • Data Feed Provider.
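
ML Studio handles these readers for you, but as a rough Python analogy of this first step, loading a CSV file over HTTP could look like the sketch below. This is only an illustration; the URL is a placeholder, not a real dataset.

# A minimal sketch (not Azure ML itself): load a CSV file from a web URL into a pandas DataFrame.
import pandas as pd

df = pd.read_csv("https://example.com/automobile-prices.csv")  # also works with a local path
print(df.head())    # quick look at the first rows
print(df.dtypes)    # and at the column types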

2. Preprocess the data

A dataset usually requires some preprocessing before it can be analyzed. You may notice missing values in some columns, and these missing values need to be cleaned up so that the model can analyze the data properly.

In Microsoft Azure ML all kinds of manipulations (transformations) are possible, for example (see the pandas sketch after this list):
  • Filtering.
  • Manipulations such as adding columns, adding rows, cleaning missing data, grouping categorical values or projecting columns.
  • Creating samples and splitting the data (into a training set and a test set).
  • Etc.
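
As a rough Python (pandas) analogy of the clean-up that the Clean Missing Data module performs, outside ML Studio it could look like this; the toy data and column names are made up.

import pandas as pd

# Invented toy data with a few missing values.
df = pd.DataFrame({
    "age":   [34, None, 51, 28],
    "price": [13950.0, 16500.0, None, 21105.0],
})

df["age"] = df["age"].fillna(df["age"].mean())  # replace missing ages with the column mean
df = df.dropna(subset=["price"])                # drop rows where the value to predict is missing
print(df)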

3. Define features

In machine learning we don't talk about dimensions (or attributes) but about features: individual measurable properties of the thing you're interested in. Each column in the dataset is a feature, and finding a proper set of features is a tedious but important task when creating a predictive model. For instance, two columns can be strongly correlated, in which case keeping both adds little new information to the model.

We select the features (columns) with the Project Columns module. For training the model, the dependent variable, the variable that we are going to predict, must also be in the data set.
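
As a rough pandas analogy of the Project Columns idea (the column names below are invented for illustration): keep only the columns that add information, plus the dependent variable.

import pandas as pd

# Invented example data set for a car-price model.
df = pd.DataFrame({
    "make":       ["audi", "bmw", "audi", "volvo"],
    "horsepower": [102, 115, 140, 90],
    "doors":      [4, 2, 4, 4],
    "price":      [13950.0, 16500.0, 17450.0, 11850.0],
})

features = ["horsepower", "doors"]   # project only the useful columns
X = df[features]                     # the features
y = df["price"]                      # the dependent variable we want to predict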



4. Choose and apply an algorithm

Constructing a predictive model consists of selecting an algorithm and then training and testing it in order to get the best result. These are the algorithms that are currently available in Microsoft Azure ML:
  • Anomaly Detection
    • One-Class Support Vector Machine
    • PCA-Based Anomaly Detection
  • Classification
    • Multiclass Decision Forest
    • Multiclass Decision Jungle
    • Multiclass Logistic Regression
    • Multiclass Neural Network
    • One-vs-All Multiclass
    • Two-Class Averaged Perceptron
    • Two-Class Bayes Point Machine
    • Two-Class Boosted Decision Tree
    • Two-Class Decision Forest
    • Two-Class Decision Jungle
    • Two-Class Locally-Deep Support Vector Machine
    • Two-Class Logistic Regression
    • Two-Class Neural Network
    • Two-Class Support Vector Machine
  • Clustering
    • K-Means Clustering
  • Regression
    • Bayesian Linear Regression
    • Boosted Decision Tree Regression
    • Decision Forest Regression
    • Fast Forest Quantile Regression
    • Linear Regression
    • Neural Network Regression
    • Ordinal Regression
    • Poisson Regression
So, most of the algorithms focus on classification and regression. Both are so-called supervised learning algorithms. Classification algorithms are used for predicting responses that can have only a few known values (such as married, single, or divorced) based on the other columns in the dataset. Regression algorithms are used for predicting values of continuous variables, like age or price.
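
To make the difference concrete, here is a small scikit-learn sketch (an analogy, not Azure ML itself; the data and column meanings are made up): a classifier predicts one of a few known values, a regressor predicts a number.

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[25, 1], [47, 3], [38, 2], [52, 3], [29, 1]]   # e.g. age and household size (invented)

y_class = ["single", "married", "married", "divorced", "single"]   # a few known values -> classification
y_value = [21000.0, 48000.0, 39500.0, 55000.0, 27000.0]            # continuous values  -> regression

classifier = LogisticRegression(max_iter=1000).fit(X, y_class)
regressor = LinearRegression().fit(X, y_value)

print(classifier.predict([[33, 2]]))   # predicts one of the known classes
print(regressor.predict([[33, 2]]))    # predicts a number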

An example of an algorithm:


5. Predict new incoming data.

Now that we have trained the model, we have to score it. In the split step we created a training set with 75% of the data; to score the model we apply it to the remaining 25% test set and compare the predictions with the actual values to see how well the model performs.
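
Just to make the idea concrete outside ML Studio, here is a small scikit-learn sketch of the same train/score flow (an analogy, not the Azure ML modules themselves; the numbers are synthetic).

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: [horsepower, doors] -> price.
X = [[102, 4], [115, 2], [140, 4], [90, 4], [160, 2], [110, 4], [200, 2], [95, 4]]
y = [13950, 16500, 17450, 11850, 24565, 15250, 36880, 12940]

# 75% training set, 25% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # "Train Model"
predictions = model.predict(X_test)                # "Score Model"
print(list(zip(y_test, predictions)))              # actual price versus scored label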



In this example I've dragged a Score Model component into the diagram.

Below is an example of the scoring experiment.


Here you can see the price and the predicted price based on the features of the data set.

When you compare the price with the scored labels in the diagram below, you'll see that the low end is quite accurate, but the upper right is not (perhaps because of a lack of sufficient data there?).



Finally, to test the quality of the results, select and drag the Evaluate Model module onto the experiment canvas and connect its left input port to the output of the Score Model module. With this component it's also possible to compare two different algorithms and pick the best fit.
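
As a rough Python analogy of what Evaluate Model does, you could compute an error metric for two candidate algorithms on the same test set (again synthetic data, and scikit-learn instead of the Azure ML module).

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = [[102, 4], [115, 2], [140, 4], [90, 4], [160, 2], [110, 4], [200, 2], [95, 4]]
y = [13950, 16500, 17450, 11850, 24565, 15250, 36880, 12940]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

# Fit two different algorithms and compare their mean absolute error on the test set.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mae, 2))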

Conclusion

I've covered a small part of Azure Machine Learning to get an impression of its possibilities. The free version already offers great possibilities and is very useful for machine learning.


Greetz,

Hennie

Friday, May 29, 2015

Power Query

Introduction

PowerQuery is an Excel add-in that can be used for examining data, transforming data and combining data from multiple sources. PowerQuery is the ETL tool for Excel: you can extract data from sources like Oracle, DB2, Hadoop, online locations and many others, and you can transform the data with conversions, splitting columns, adding columns, refining and merging. PowerQuery helps you connect to sources across your organization as well as to public data sources.


PowerQuery is part of the Excel add-in family; you have to download and install it and use it together with Excel 2013. I'm using version 2.22.4007.242.

In this blogpost I'll investigate PowerQuery in more detail. I've borrowed the example from the blogpost "Tutorial Introduction to Power Query".

The Toolbar

First, let's look at the toolbar (or ribbon). Below is the PowerQuery toolbar:



There are a couple of options available on the ribbon:
  • Get External Data.
  • Excel Data.
  • Combine.
  • WorkBook Queries.
  • Settings.
  • PowerBI.
  • Help.
If we investigate the Get External Data in more detail, we have the following options:
  • From the Web
  • From File
    • From Excel
    • From CSV
    • From XML
    • From Text
    • From Folder
  • From Database
    • From SQL Server Database
    • From Access Database
    • From SQL Server Analysis Services Database
    • From Oracle Database
    • From IBM DB2 Database
    • From MySQL Database
    • From PostgreSQL Database
    • From Sybase Database
    • From Teradata Database
  • From Azure
    • From SQL Azure Database
    • From Azure Marketplace
    • From Microsoft Azure HDInsight.
    • From Microsoft Azure Blob Storage.
    • From Microsoft Azure Table Storage.
  • From Other Sources
    • From SharePoint List
    • From OData Feed
    • From Hadoop File (HDFS)
    • From Active Directory
    • From Microsoft Exchange
    • From Dynamics CRM Online
    • From Facebook
    • From SAP BusinessObjects BI Universe
    • From SalesForce Objects
    • From ODBC
    • Blank Query
  • Recent Sources

This is quite an extensive list.

For this blogpost I'll use the example from the blogpost "Tutorial Introduction to Power Query". In this tutorial the Top 250 movies from the IMDB.com site are combined with the revenue of the top 50 grossing movies on Wikipedia.

Load the Movie data from IMDB Top 250 


1. Open an empty workbook in Excel, go to the PowerQuery tab and click on From Web.




2. Enter the URL and press OK.




3. Click on Table 0, examine the data and press Load.




4. The data is loaded into Excel; now we have to delete some columns and clean up some stuff.


5. Click on the PowerQuery tab and click on Launch Editor.


6. Remove some columns (Column0, rating and the last column). Right-click on the column "Rank & Title" and choose Split Column, By Delimiter.
 
The result of this transformation looks like this screenshot:
 



7. Right-click again on the column Rank & Title.2 and split it again, this time to separate the year, by using the "(" delimiter and the option to split at the right-most delimiter. Then replace the ")" with nothing. The result should look like this.


8. Rename the columns to Rank, Title, Year and IMDB Rating, and trim the Title column. The resulting window should look like the screenshot shown below.



On the right you see the steps I've taken to come to this result. This is very handy: if you take a step that you regret, just delete the step and the action is rolled back.
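
For comparison, here is roughly the same split/replace/trim logic expressed in Python (pandas). This is only an illustration of the transformation logic, not the M code that PowerQuery generates, and the sample rows are made up.

import pandas as pd

df = pd.DataFrame({"Rank & Title": ["1. The Shawshank Redemption (1994)",
                                    "2. The Godfather (1972)"],
                   "IMDB Rating": [9.2, 9.2]})

df[["Rank", "Title"]] = df["Rank & Title"].str.split(".", n=1, expand=True, regex=False)
df[["Title", "Year"]] = df["Title"].str.rsplit("(", n=1, expand=True)   # split at the right-most "("
df["Year"] = df["Year"].str.replace(")", "", regex=False)               # replace ")" by nothing
df["Title"] = df["Title"].str.strip()                                   # trim
df = df[["Rank", "Title", "Year", "IMDB Rating"]]
print(df)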

9. Press Close & Load.

Load the Revenue data from Wikipedia

10. The next thing is to load the revenue-per-movie data from the Wikipedia page. Press From Web, press OK and press Load.


11. The data is loaded into Excel and it should look like the screenshot below.
 

12. Press Launch Editor and let's edit some stuff. I've renamed the two queries to IMDB Top250 and Revenue per Movie. I've also deleted the last column.


Merge the two datasets into one

13. Now press Merge on the toolbar and configure the merge operation as follows: select the IMDB Top250 and the Revenue per Movie queries, click on the columns on which to merge the two queries and press OK.




14. Click on the button on the column "NewColumn", select "WorldWide gross" and press OK.


The final result should look like below.


The two datasets are combined and now you can visualise this in multiple ways.
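
For comparison, the same kind of merge expressed in pandas (only an illustration; the sample rows and exact column names are made up).

import pandas as pd

imdb_top250 = pd.DataFrame({"Title": ["The Godfather", "Titanic"],
                            "Rank": [2, 125]})
revenue_per_movie = pd.DataFrame({"Title": ["Titanic", "Avatar"],
                                  "WorldWide gross": [2187463944, 2787965087]})

# Join the two query results on Title and bring over only the gross revenue column.
merged = imdb_top250.merge(revenue_per_movie[["Title", "WorldWide gross"]],
                           on="Title", how="left")
print(merged)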

Conclusion

Pretty impressive self-service BI tooling from Microsoft. I've investigated PowerQuery for only a limited time and I've just scratched the surface, but the ease of exploring and combining data is really great. There are far more options of PowerQuery I have to investigate ;-) It seems that the PowerQuery language M will also be integrated in SQL Server 2016 SSIS; worthwhile looking into...

Greetz,
Hennie


Saturday, May 2, 2015

SSDT: Using an external reference in a project.

Introduction

In SSDT, a project can reference another database. There are two options here: 
  1. The database is in the solution.
  2. The database is not in the solution.
In this blogpost I'll discuss option 2: the database is not in the solution. In this case you have to create a .dacpac file from the external database and save it in a central location.

If you create a stored procedure in a database that references a table in another database, an "unresolved reference to object" warning/error is displayed.


SSDT can't determine whether this code is right (or wrong); it wants to know whether the table exists. This can be resolved by creating a project in the solution for that specific database or by referencing a .dacpac file. Why not link directly to a database? Well, I think that would interfere with the DTAP environment approach (the distinction between development and production) and with offline development.

Add Database Reference

Now, you can extract a .dacpac from an existing database and save it as a file in the file system. Below I've described how you create a database reference to a database that is not included in the current solution.

1. Right-click on a database, choose Tasks and click on Extract Data-tier Application.



2. Press Next a couple of times and the .dacpac file is created.


3. The next step is to add a database reference in the project in SSDT.



4. Fill in the location of the .dacpac file and the other fields and press OK.


The database reference is created and ready for use.


5. The next step is to replace the static database name in the stored procedure with the database reference.


The database reference is now also listed in the project properties (SQLCMD Variables).


6. Press Publish and the following window appears.


7. The stored procedure is now adjusted with the new name.


Conclusion

Pretty cool option in SSDT. You can develop your application offline and publish the solution to a server when you're ready.

Greetz,

Hennie