Wednesday, June 3, 2015

Microsoft Azure ML


Now that Machine Learning (ML) is becoming more and more important, it's a good idea to get a grasp of its capabilities. In this blogpost I'll investigate machine learning in more detail: what is machine learning, what is the process of creating an ML model, and what are common algorithms (in Microsoft Azure ML)?

Data mining techniques were already available in SSAS, but hardly used by customers or BI consultants. A while ago Microsoft released Azure Machine Learning, a tool that applies historical data to a problem by creating a model and using it to predict future behavior or trends.

For this blogpost I'll use Microsoft Azure ML Studio as an example. Below is a screenshot of a try-out of Microsoft Azure ML. On the right you see the data flow, the clean-up and a projection; on the left the Linear Regression algorithm that is applied to the data so that the model is trained.

Machine learning process

Everybody knows the kind of guessing game where one person describes the characteristics of an object and another has to guess what it is, a bit like the game Pictionary. For example: Me: "It's round, green, and edible." You: "It's an apple!" So, based on what you know (training the model) you can guess that the answer should be an apple (scoring and testing the model).

And you have to keep training the model by adding new data, like green, red and yellow apples, to improve it. So there are a couple of steps:
  1. Get the data.
  2. Preprocess the data.
  3. Define features.
  4. Choose and apply an algorithm.
  5. Predict new incoming data.

1. Get the data

The first thing to do is to get the data from a source. There are multiple options for loading data into ML Studio.

File Formats
The following file formats are supported in ML Studio:
  • CSV file.
  • TSV file.
  • Plain text.
  • SVMLight file (Support Vector Machine).
  • Attribute-Relation File Format (ARFF).
  • Zip file.
  • R object or workspace.

Reader options
There are also some other input options available:
  • Web URL via HTTP.
  • Hive Query.
  • Azure SQL Database.
  • Azure Table.
  • Azure Blob storage.
  • Data Feed Provider.

2. Preprocess the data

A dataset usually requires some preprocessing before it can be analyzed. You may notice missing values in columns on different rows, and these missing values need cleaning before the model can analyze the data properly.

In Microsoft Azure ML all kinds of manipulations (transformations) are possible:
  • Filtering.
  • Manipulations like adding columns, adding rows, cleaning missing data, grouping categorical values or projecting columns.
  • Creating samples and splitting the data (into a training set and a test set).
  • And more.
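The cleaning of missing data mentioned above can be sketched in plain Python outside ML Studio. This is just an illustration: the function and the car data are my own invention, and mean imputation is only one of the strategies the Clean Missing Data module offers.

```python
# Replace missing values in a column by the mean of the known values,
# a minimal stand-in for Azure ML's "Clean Missing Data" module.

def impute_mean(rows, column):
    """Replace None values in `column` with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[column] is None:
            r[column] = mean
    return rows

cars = [
    {"make": "audi", "price": 13950.0},
    {"make": "bmw",  "price": None},      # missing value to be cleaned
    {"make": "audi", "price": 17450.0},
]
impute_mean(cars, "price")
```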

3. Define features

In machine learning we don't talk about dimensions (or attributes) but about features: individual measurable properties of something you're interested in. Each column in the dataset is a feature, and finding a proper set of features is a tedious but important task when creating a predictive model. For instance, two columns can have a strong correlation, in which case the second one will not add much new information to the model.
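The "strong correlation" remark can be made concrete with Pearson's r between two candidate feature columns: values close to +1 or -1 suggest that one of the two columns adds little new information. A pure-Python sketch with invented toy columns:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy columns: engine size and curb weight tend to move together,
# so one of them may be a redundant feature.
engine_size = [1.6, 2.0, 2.5, 3.0, 4.4]
curb_weight = [1100, 1250, 1400, 1550, 1900]
r = pearson_r(engine_size, curb_weight)
```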

We'll select the features (columns) with the Project Columns module. For training the model, the dependent variable (the variable that we are going to predict) must be in the dataset.

4. Choose and apply an algorithm

Constructing a predictive model consists of selecting an algorithm and then training and testing it to get the best result. These are the algorithms currently available in Microsoft Azure ML:
  • Anomaly Detection
    • One-Class Support Vector Machine
    • PCA-Based anomaly Detection
  • Classification
    • Multiclass Decision Forest
    • Multiclass Decision Jungle
    • Multiclass Logistic Regression
    • Multiclass Neural Network
    • One-vs-All Multiclass
    • Two-Class Averaged Perceptron
    • Two-Class Bayes Point Machine
    • Two-Class Boosted Decision Tree
    • Two-Class Decision Forest
    • Two-Class Decision Jungle
    • Two-Class Locally-Deep Support Vector Machine
    • Two-Class Logistic Regression
    • Two-Class Neural Network
    • Two-Class Support Vector Machine
  • Clustering
    • K-Means Clustering
  • Regression
    • Bayesian Linear Regression
    • Boosted Decision Tree Regression
    • Decision Forest Regression
    • Fast Forest Quantile Regression
    • Linear Regression
    • Neural Network Regression
    • Ordinal Regression
    • Poisson Regression

So, most of the algorithms focus on classification and regression. Both are so-called supervised learning algorithms. Classification algorithms are used for predicting responses that can have just a few known values (such as married, single, or divorced) based on the other columns in the dataset. Regression algorithms are used for predicting values based on continuous variables, like age.

An example of an algorithm:
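Since this post uses Linear Regression, here is a minimal sketch of what training such a model amounts to for a single feature: fitting y = a + b*x with ordinary least squares. The horsepower/price numbers are invented for illustration and happen to lie exactly on a line; Azure ML's module of course handles many features and noisy data.

```python
def fit_linear(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Toy training set: horsepower vs. price.
hp    = [60, 80, 100, 120, 140]
price = [7000, 9000, 11000, 13000, 15000]
a, b = fit_linear(hp, price)

predicted = a + b * 110   # "score" a new, unseen row with 110 hp
```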

5. Predict new incoming data.

Now that we have trained the model, we have to score it. The Split module created a training set with 75% of the data; to score the model, we compare its predictions on the remaining 25% test set to see how well the model performs.
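The 75/25 split can be sketched in plain Python: shuffle the rows once, then cut at 75%. In ML Studio the Split module does this for you; the rows below are just stand-ins.

```python
import random

rows = list(range(100))   # stand-in for 100 data rows
random.seed(42)           # fixed seed so the split is reproducible
random.shuffle(rows)

cut = int(len(rows) * 0.75)
train, test = rows[:cut], rows[cut:]   # 75% training set, 25% test set
```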

In this example I've dragged a Score Model component into the diagram.

Below is an example of the scoring experiment.

Here you can see the price and the predicted price based on the features of the dataset.

When you compare the price with the scored labels in the diagram below, you'll see that the low end is quite accurate, but the upper right is not (possibly because of a lack of sufficient data).

Finally, to test the quality of the results, drag the Evaluate Model module onto the experiment canvas and connect its left input port to the output of the Score Model module. With this component it's also possible to compare two different algorithms to find the best fit.
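What Evaluate Model reports for a regression experiment can be approximated by two standard metrics: mean absolute error and root mean squared error between the actual prices and the scored labels. A sketch with invented numbers:

```python
import math

# Actual prices from the test set vs. the scored labels (invented numbers).
actual = [13950.0, 17450.0, 16500.0, 23875.0]
scored = [14200.0, 16900.0, 16800.0, 21500.0]

errors = [a - s for a, s in zip(actual, scored)]
mae  = sum(abs(e) for e in errors) / len(errors)        # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
```

RMSE penalizes large errors more heavily than MAE, which is why the single big miss on the most expensive row dominates it.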


I've covered a small part of Azure Machine Learning to give an impression of the possibilities. Even the free version offers a lot and is very useful for getting started with machine learning.



Friday, May 29, 2015

Power Query


Power Query is an Excel add-in that can be used for examining data, transforming data and combining data from multiple sources. Power Query is the ETL tool for Excel. You can extract data from multiple sources like Oracle, DB2, Hadoop, online locations and many others. You can transform the data with conversions, splitting columns, adding columns, refining and merging. Power Query helps you connect to sources across your organization or to public data sources.

Power Query is part of the Excel add-ins family; you have to download and install it and use it together with Excel 2013. I'm using version 2.22.4007.242.

In this blogpost I'll investigate Power Query in more detail. I've borrowed the example from the blogpost "Tutorial Introduction to Power Query".

The Toolbar

First, let's look at the toolbar (or ribbon). Below is the toolbar of Power Query:

There are a couple of options available on the ribbon:
  • Get External Data.
  • Excel Data.
  • Combine.
  • Workbook Queries.
  • Settings.
  • Power BI.
  • Help.
If we investigate Get External Data in more detail, we have the following options:
  • From the Web
  • From File
    • From Excel
    • From CSV
    • From XML
    • From Text
    • From Folder
  • From Database
    • From SQL Server Database
    • From Access Database
    • From SQL Server Analysis Services
    • From Oracle Database
    • From IBM DB2 Database
    • From MySQL Database
    • From PostgreSQL Database
    • From Sybase Database
    • From Teradata Database
  • From Azure
    • From SQL Azure Database
    • From Azure Marketplace
    • From Microsoft Azure HDInsight
    • From Microsoft Azure Blob Storage
    • From Microsoft Azure Table Storage
  • From Other Sources
    • From SharePoint List
    • From OData Feed
    • From Hadoop File (HDFS)
    • From Active Directory
    • From Microsoft Exchange
    • From Dynamics CRM Online
    • From Facebook
    • From SAP BusinessObjects BI Universe
    • From Salesforce Objects
    • From ODBC
    • Blank Query
  • Recent Sources

This is quite an extensive list.

For this blogpost I'll follow the tutorial "Tutorial Introduction to Power Query", in which the Top 250 movies of the IMDB.com site are combined with the revenue of the top 50 grossing movies on Wikipedia.

Load the Movie data from IMDB Top 250 

1. Open an empty workbook in Excel, go to the Power Query tab and click From Web.

2. Enter the URL and press OK.

3. Click on Table 0, examine the data and press Load.

4. The data is loaded in Excel and now we have to delete some columns and clean some stuff.

5. Click on the Power Query tab and click Launch Editor.

6. Remove some columns (Column0, rating and the last column). Right-click on the column "Rank & Title" and choose Split Column, By Delimiter.
The result of this transformation looks like this screenshot:

7. Right-click again on the column "Rank & Title.2" and split the column again on the year, using the delimiter and the option to split at the right-most delimiter. Then replace the ")" by nothing. The result should look like this.

8. Rename the columns to Rank, Title, Year and IMDB Rating, and trim the Title column. The resulting window should look like the screenshot shown below.

On the right you see the steps I've taken to come to this result. This is very handy: if you take a step that you regret, just delete the step and the action is rolled back.

9. Press Close & Load.
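The split-and-trim steps above can be approximated in plain Python for one raw IMDB cell. The input shape ("1. The Shawshank Redemption (1994)") is the one from the tutorial; the parsing function itself is my own sketch, not Power Query's M code.

```python
def parse_rank_title(cell):
    """Mimic the two Split Column steps, the Replace and the Trim."""
    rank, rest = cell.split(".", 1)          # split on the first "." delimiter
    title, year = rest.rsplit("(", 1)        # split at the right-most "("
    return {
        "Rank": int(rank),
        "Title": title.strip(),              # the Trim step
        "Year": int(year.replace(")", "")),  # replace ")" by nothing
    }

row = parse_rank_title("1. The Shawshank Redemption (1994)")
```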

Load the Revenue data from Wikipedia

10. The next thing is to load the revenue-per-movie data from the Wiki page. Press From Web, press OK and press Load.

11. The data is loaded into Excel and it should look like the screenshot below.

12. Press Launch Editor and let's edit some stuff. I've renamed the two queries to IMDB Top 250 and Revenue per Movie. I've also deleted the last column.

Merge the two datasets into one

13. Now press Merge on the toolbar and configure the Merge operation as follows: select the IMDB Top 250 and the Revenue per Movie queries, click on the columns to merge on and press OK.

14. Click on the expand button in the column "NewColumn", select "Worldwide gross" and press OK.

The final result should look like the screenshot below.

The two datasets are combined and now you can visualise this in multiple ways.
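The Merge step is essentially a left join on the movie title. A plain-Python approximation (the revenue figure below is made up for illustration; missing matches become empty, just as in Power Query):

```python
# Left join of the ranked movies with a revenue lookup table on Title,
# mimicking Power Query's Merge + expand of the "Worldwide gross" column.
top250 = [
    {"Rank": 1, "Title": "The Shawshank Redemption"},
    {"Rank": 2, "Title": "The Godfather"},
]
revenue = {"The Godfather": 245_066_411}   # invented revenue figure

merged = [
    {**movie, "Worldwide gross": revenue.get(movie["Title"])}
    for movie in top250
]
```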


Pretty impressive Self-Service BI tooling from Microsoft. I've investigated Power Query for a limited time and have just scratched the surface; the ease of investigating and combining data is really great, and there are far more options of Power Query left to investigate ;-) It seems that M, the Power Query language, is also integrated in SQL Server 2016 SSIS. Worthwhile looking into...


Saturday, May 2, 2015

SSDT: Using an external reference in a project.


In SSDT, a project can reference another database. There are two options here: 
  1. The database is in the solution.
  2. The database is not in the solution.
In this blogpost I'll discuss option 2: the database is not in the solution. In this case you have to create a .dacpac file from the external database and save it in a central place.

If you create a stored procedure in a database and it references a table in another database, then an "unresolved reference to object" warning/error is displayed.

SSDT can't determine whether this code is right (or wrong), so SSDT wants to know whether this table exists. This can be done by creating a project in the solution for this specific database or by referencing a .dacpac file. Why not link directly to a database? Well, I think that would interfere with the DTAP environment approach (the distinction between development and production) and with offline development.

Add Database Reference

Now, you can extract a .dacpac from an existing database and save it as a file in the filesystem. Below I've described how to create a database reference to a database that is not included in the current solution.
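Besides the SSMS wizard shown below, the same extract can be scripted with SqlPackage.exe (/Action:Extract), the DAC command-line tool. A hedged Python sketch that only assembles the command, so you can inspect it before running it with subprocess.run(cmd); the server, database and path are placeholders, and SqlPackage.exe is assumed to be on the PATH.

```python
def extract_dacpac_cmd(server, database, target_file):
    """Build a SqlPackage.exe command line to extract a .dacpac from a database."""
    return [
        "SqlPackage.exe",
        "/Action:Extract",
        f"/SourceServerName:{server}",
        f"/SourceDatabaseName:{database}",
        f"/TargetFile:{target_file}",
    ]

# Placeholder names; run with subprocess.run(cmd) on a machine with SqlPackage.
cmd = extract_dacpac_cmd("localhost", "TestDB", r"C:\dacpacs\TestDB.dacpac")
```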

1. Right-click on a database, press Tasks and click on Extract Data-tier Application.

2. Press Next a couple of times and the .dacpac file is created.

3. The next step is to add a database reference in the project in SSDT.

4. Fill in the location of the .dacpac file and the other fields and press OK.

The database reference is created and ready for use.

5. The next step is to replace the static database name with the Database Reference.

The Database Reference is now also listed in the project properties (SQLCMD Variables).

6. Press Publish and the following window appears.

7. The stored procedure is now adjusted with the new name.


Pretty cool option of SSDT. You can develop your application offline and publish the solution to a server when you're ready.



Tuesday, April 14, 2015

Some features of SQL Server Data Tools


Today a post about SSDT (SQL Server Data Tools), the replacement of BIDS and Database Developer (or "Data Dude"). There are two flavors of SSDT: SSDT-BI and SSDT-DB. Both have to be installed separately, which is confusing at first, but once you know what's going on there are no problems anymore. SSDT-DB is the evolution of "Visual Studio Team System Edition for Database Professionals", the project of Gert E. Drapers that gave it its more common name, "Data Dude".

There seem to be multiple versions of SSDT around; I've found these versions from the last years:
  • 2014 January update.
  • 2014 July update.
  • 2015 February update.

It is available for two versions of Visual Studio:
  • Visual Studio 2012.
  • Visual Studio 2013.
The ultimate goal of SSDT is to create a more integrated development environment. Traditionally, database development has often been done outside of Visual Studio, for instance in SQL Server Management Studio. With SSDT, database development is instead done inside Visual Studio. Databases are included as projects in the Visual Studio solutions.

What is SSDT?

This tool is the single environment for writing database code as well as for BI development. There is no need to switch between SSMS and SSDT; you can write all the code in Visual Studio 2012 or 2013. There is now a clear distinction between a developer environment and a DBA environment: SSMS is more about managing SQL Server and databases. Two modes are possible:
  • Disconnected mode.
  • Connected mode
    • On premise.
    • Off premise.
There is a local database for developers, which comes in handy because developers don't need a connection to a live database. So you write code in a declarative manner and you don't need to worry(?!) about implementing these changes on a database.

Normally, before SSDT, you would write code with SSMS in a development environment. When you needed to deploy the code to another environment you had to remember or compare the versions with TFS and then create change scripts for the other environment. With SSDT you don't need that anymore; SSDT does it for you.


There are multiple ways to start a project with SSDT:
  • Start from scratch with a project in SSDT.
  • Import the code from an existing database into an SSDT project.
  • Import from a .dacpac file.
  • Import from script files.
In the following tutorial I'll show some simple steps for importing the code from an existing database into a project in SSDT. This is the DB part of SSDT.

Import the code from a current database into SSDT project

1. First start SSDT and start a new project and you'll see the following window. Here you can see the distinction between BI and DB.

2. Export the Data-tier Application from the database by right clicking on the specific database, press Tasks and Export Data-tier Application.

3. Press Next.

4. Save the file to a local disk by clicking Browse.

5. Choose the location and press Save.

6. Press Next.

7. Choose the database objects and press Next.

8. Now we have the summary window. Press Finish.

9. Errors appear. Why is that?

There seem to be a lot of errors in the export of the database to the .bacpac file, unfortunately. The errors are static and you can't do anything about them. Below I've copied the most common errors that appeared in the error list:

The element Extended Property: [dbo].[ufnGetCustomerInformation].[@CustomerID].[MS_Description]

The element Extended Property: [SalesLT].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID].[MS_Description] is not supported

The element XML Schema Collection: [SalesLT].[ProductDescriptionSchemaCollection] is not supported

Element Column: [SalesLT].[Address].[AddressID] has an unsupported property IdentityIsNotForReplication

 Element Column: [SalesLT].[Address].[rowguid] has an unsupported property IsRowGuidColumn set 

Table Table: [dbo].[BuildVersion] does not have a clustered index.  Clustered indexes are required for inserting data in this version of SQL Server.

Element User: [NT AUTHORITY\NETWORK SERVICE] has an unsupported property AuthenticationType set and is not supported when used as part of a data package.   

When searching the internet I found a helpful article on the SQL Server Data Tools Team Blog. It seems there is another way of importing the database into SSDT-DB: with SSDT-DB itself.

10. Select the specific database, right-click and choose Create New Project.

11. Set the Target name and click on Start

12. The export is running

13. And the project is created based on the AdventureWorksLT2012 database.

Refactor the database

Refactoring the database is one of the things that needs to be done: objects need to be changed when the world around us changes, and the database needs to evolve with it. Let's try to refactor a field in a table of the AdventureWorksLT2012 database.

14. Let's try to change CompanyName in the Customer script to Company and publish this to the database. We do this by right-clicking on CompanyName and pressing Refactor, then changing the name to Company.

15. Here you can see the changes; press Apply.

16. The next step is to publish the changes to the database. You do this by right-clicking on the project and pressing Publish.

17. After some processing time the changes are applied and the database is in the new state.

Snapshot the project

It's also possible to snapshot your project into .dacpac files. These files contain your scripts saved at a certain point in time, which helps you preserve the project at important moments.

18. Right click on the project and click on Snapshot Project.

19. And the snapshot is saved under Snapshots.

Using References

If your database references other databases, you'll need a dacpac file for each external database referenced in your code. References can be resolved for objects in one of the following locations:

  • In the referencing database.
  • In a database other than the referencing database, but on the same server.
  • In a database other than the referencing database, on a different server.

In this blogpost I've chosen to use "In a database other than the referencing database, but on the same server".

Suppose you have a stored procedure in a project that references a table in another database (e.g. TestDB). Then the following error is shown in SSDT: you need to create a database reference in the project.

22. Add a Database Reference to the project by right clicking References and Add Database Reference.

23. First extract a .dacpac from the database in SSMS, store it in a central place and press Next.

24. Select that dacpac file from the central place in your project.

25. Replace the database name by the Database Reference.

26. Now if you want to publish this to the database, the values of the Database Reference can be set.

Here you can set the value for the Database reference.


SSDT is a logical evolution in database, SSIS, SSRS and SSAS design. I'm very happy with this integrated development environment for SQL Server, with its clear distinction between DBA and developer work.
