BI Future Blog: Microsoft Azure ML

Introduction

Now that machine learning (ML) is becoming more and more important, it's a good idea to get a grasp of its capabilities. In this blog post I'll investigate machine learning in more detail: what is machine learning, what is the process of creating an ML model, and what are the common algorithms (in Microsoft Azure ML)?

Data mining techniques were already available in SSAS, but they were hardly used by customers or BI consultants. A while ago Microsoft released Azure Machine Learning, a tool that applies historical data to a problem by creating a model and using it to predict future behavior or trends.

For this blog post I'll use Microsoft Azure ML Studio as an example. Below is a screenshot of a tryout in Microsoft Azure ML. On the right are the data flow, the cleanup and a projection; on the left is the Linear Regression algorithm that is applied to the data to train the model.

Machine learning process

Everybody knows the kind of quiz where someone has to guess an object from the characteristics that another person describes, as in a game like Pictionary. For example: Me: “It’s round, green, and edible.” You: “It’s an apple!” So, based on what you know (training the model), you can guess that the answer should be an apple (scoring and testing the model).

You also have to keep training the model by adding new data, such as green apples, red apples and yellow apples, to improve it. So, there are a couple of steps:

Get the data.
Preprocess the data.
Define features.
Choose and apply an algorithm.
Predict new incoming data.
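To make the steps above a bit more tangible, here is a minimal sketch of the fruit-guessing game in plain Python. It is not how Azure ML works internally; it is a toy nearest-neighbour guesser, and the function names (`train`, `predict`) and the example data are made up for this illustration.

```python
# Toy illustration of the train / predict loop using the fruit-guessing game.
# All names and data are hypothetical and only serve the example.

def train(examples):
    """'Training' this simple model just means remembering the examples."""
    return list(examples)

def predict(model, features):
    """Return the label of the stored example that shares the most feature values."""
    def overlap(example):
        known, _label = example
        return sum(1 for key, value in features.items() if known.get(key) == value)
    _best_features, best_label = max(model, key=overlap)
    return best_label

training_data = [
    ({"shape": "round", "color": "green", "edible": True}, "apple"),
    ({"shape": "long", "color": "yellow", "edible": True}, "banana"),
    ({"shape": "round", "color": "white", "edible": False}, "golf ball"),
]

model = train(training_data)
guess = predict(model, {"shape": "round", "color": "green", "edible": True})
print(guess)

# Keep learning: add a red apple and train again.
training_data.append(({"shape": "round", "color": "red", "edible": True}, "apple"))
model = train(training_data)
red_guess = predict(model, {"shape": "round", "color": "red", "edible": True})
print(red_guess)
```

The more varied examples the model has seen, the better the chance that a new description matches something it already knows.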

1. Get the data

The first thing to do is to get the data from a source. There are multiple options for loading data into ML Studio.

File Formats

The following file formats are supported in Azure ML:

CSV file.
TSV file.
Plain text.
SVMLight file (Support Vector Machine).
Attribute-Relation File Format (ARFF).
Zip file.
R object or workspace.
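Outside ML Studio, the two simplest formats in this list (CSV and TSV) can be read with a few lines of Python. The sketch below uses only the standard `csv` module; the helper name `read_delimited` and the sample rows are assumptions for illustration.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dicts (all values stay strings)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

# Made-up sample data: the same two rows, once comma-separated and once tab-separated.
csv_text = "make,horsepower,price\nalpha,100,10000\nbeta,120,12000\n"
tsv_text = csv_text.replace(",", "\t")

csv_rows = read_delimited(csv_text)
tsv_rows = read_delimited(tsv_text, delimiter="\t")
print(csv_rows[0])
```

For a real file you would pass an open file object to `csv.DictReader` instead of wrapping a string in `io.StringIO`.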

Reader options

Some other input options are available as well:

Web URL via HTTP.
Hive Query.
Azure SQL Database.
Azure Table.
Azure Blob storage.
Data Feed Provider.

2. Preprocess the data

A dataset usually requires some preprocessing before it can be analyzed. You may notice missing values in some columns on different rows, and these missing values need to be cleaned so that the model can analyze the data properly.

In Microsoft Azure ML all kinds of manipulations (transformations) are possible:

Filtering.
Manipulation, such as adding columns, adding rows, cleaning missing data, grouping categorical values or projecting columns.
Creating samples and splitting the data (into a training set and a test set).
Etc.
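As a rough sketch of what cleaning missing data can look like, the Python below fills a missing value with the column mean and then drops rows that are still incomplete. This is only one possible strategy, and it is not the implementation of the Azure ML module; the function names, the `?` marker for missing values and the sample rows are assumptions.

```python
def to_number(value):
    """Convert a raw string to float; treat empty strings and '?' as missing (None)."""
    if value is None or value.strip() in ("", "?"):
        return None
    return float(value)

def fill_with_mean(rows, column):
    """Replace missing values in one column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)  # assumes at least one known value
    return [dict(row, **{column: mean if row[column] is None else row[column]})
            for row in rows]

def drop_incomplete_rows(rows):
    """Remove every row that still contains a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Made-up raw data with two gaps.
raw = [
    {"horsepower": "100", "price": "10000"},
    {"horsepower": "", "price": "12000"},
    {"horsepower": "140", "price": "?"},
    {"horsepower": "120", "price": "14000"},
]

rows = [{key: to_number(value) for key, value in row.items()} for row in raw]
filled = fill_with_mean(rows, "horsepower")
cleaned = drop_incomplete_rows(filled)
print(len(cleaned), "rows left after cleaning")
```

Here the missing horsepower is filled in, while the row without a price is dropped, because the value we want to predict should not be invented.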

3. Define features

In machine learning, we don't talk about dimensions (or attributes) but about features. These are individual measurable properties of something you’re interested in. Each column in the dataset is a feature, and finding a proper set of features is a tedious but important task when creating a predictive model. For instance, two columns can be strongly correlated, in which case the second one will not add much new information to the model.

We'll select the features (columns) with the Project Columns module. To train the model, the dependent variable (the variable that we are going to predict) must be in the dataset.
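The point about strongly correlated columns can be checked with a Pearson correlation coefficient. The sketch below computes it by hand and then keeps only the chosen columns, loosely mimicking what a column projection does. The column names, the 0.95 threshold and the data are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def project_columns(rows, columns):
    """Keep only the named columns of every row."""
    return [{name: row[name] for name in columns} for row in rows]

# Made-up data: engine size is stored twice, in litres and in cc.
rows = [
    {"engine_l": 1.6, "engine_cc": 1600.0, "horsepower": 90.0, "price": 9000.0},
    {"engine_l": 2.0, "engine_cc": 2000.0, "horsepower": 150.0, "price": 15500.0},
    {"engine_l": 3.0, "engine_cc": 3000.0, "horsepower": 110.0, "price": 12000.0},
    {"engine_l": 1.8, "engine_cc": 1800.0, "horsepower": 100.0, "price": 10500.0},
]

r = pearson([row["engine_l"] for row in rows], [row["engine_cc"] for row in rows])

# If two features are (almost) perfectly correlated, keep only one of them.
columns = ["engine_l", "horsepower", "price"]
if abs(r) <= 0.95:
    columns.insert(1, "engine_cc")
selected = project_columns(rows, columns)
print(sorted(selected[0]))
```

Note that the dependent variable (`price` in this sketch) stays in the projected dataset, because it is needed for training.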

4. Choose and apply an algorithm

Constructing a predictive model consists of selecting an algorithm and then training and testing it in order to get the best result. These are the algorithms that are currently available in Microsoft Azure ML:

Anomaly Detection

One-Class Support Vector Machine
PCA-Based Anomaly Detection

Classification

Multiclass Decision Forest
Multiclass Decision Jungle
Multiclass Logistic Regression
Multiclass Neural Network
One-vs-All Multiclass
Two-Class Averaged Perceptron
Two-Class Bayes Point Machine
Two-Class Boosted Decision Tree
Two-Class Decision Forest
Two-Class Decision Jungle
Two-Class Locally-Deep Support Vector Machine
Two-Class Logistic Regression
Two-Class Neural Network
Two-Class Support Vector Machine

Clustering

K-Means Clustering

Regression

Bayesian Linear Regression
Boosted Decision Tree Regression
Decision Forest Regression
Fast Forest Quantile Regression
Linear Regression
Neural Network Regression
Ordinal Regression
Poisson Regression

So, most of the algorithms focus on classification and regression. Both are so-called supervised learning algorithms. Classification algorithms are used for predicting responses that can have just a few known values (such as married, single, or divorced) based on the other columns in the dataset. Regression algorithms are used for predicting continuous values, such as an age or a price.

An example of an algorithm:
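As an illustration of the regression family, here is a minimal sketch of simple linear regression (one feature, ordinary least squares) in plain Python. It is a hand-written example, not the Azure ML Linear Regression module, and the data and function names are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training data in which price is exactly 100 times the horsepower.
horsepower = [80.0, 100.0, 120.0, 140.0]
price = [8000.0, 10000.0, 12000.0, 14000.0]

model = fit_line(horsepower, price)
predicted_price = predict(model, 160.0)
print(model, predicted_price)
```

A classification algorithm would follow the same train-then-predict pattern, but return one of a few labels instead of a number.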

5. Predict new incoming data

Now that we have trained the model, we have to score it. The split put 75% of the data in a training set. To score the model, we let it predict the remaining 25% (the test set) and compare the predictions with the known values to see how well the model performs.
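The 75/25 split and the scoring step can be sketched in Python as follows. This is an assumption-laden illustration: the helper names, the fixed random seed, the `scored_label` key and the data are all made up, and the model is a simple least-squares line rather than the Azure ML module.

```python
import random

def split_rows(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Least-squares line: price = slope * horsepower + intercept."""
    xs = [row["horsepower"] for row in rows]
    ys = [row["price"] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def score(model, rows):
    """Add the predicted price to every row, next to the known price."""
    slope, intercept = model
    return [dict(row, scored_label=slope * row["horsepower"] + intercept)
            for row in rows]

# Made-up data: price is roughly 100 times the horsepower.
pairs = [(70.0, 7200.0), (80.0, 7900.0), (90.0, 9100.0), (100.0, 10050.0),
         (110.0, 10900.0), (120.0, 12100.0), (130.0, 12950.0), (140.0, 14100.0)]
rows = [{"horsepower": hp, "price": p} for hp, p in pairs]

train_set, test_set = split_rows(rows)
model = fit_line(train_set)
scored = score(model, test_set)
for row in scored:
    print(row["price"], round(row["scored_label"]))
```

Only the training set is used to fit the line; the test rows are kept aside so that the comparison says something about data the model has not seen.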

In this example I've dragged a Score Model module onto the diagram.

Below is an example of the scoring experiment.

Here you can see the price and the predicted price based on the features of the data set.

When you compare the price with the scored labels in the diagram below, you'll see that the low end is quite accurate but the upper right is not (possibly because of a lack of sufficient data).

Finally, to test the quality of the results, select and drag the Evaluate Model module to the experiment canvas, and connect its left input port to the output of the Score Model module. With this module it's also possible to compare two different algorithms to find the best fit.
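To give an idea of what evaluating and comparing two models involves, the sketch below computes three common regression metrics (mean absolute error, root mean squared error and the coefficient of determination) for two sets of predictions. The numbers and names are made up, and this is not a claim about exactly which metrics the Evaluate Model module reports.

```python
import math

def evaluate(actual, predicted):
    """Return MAE, RMSE and R-squared for a list of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up test prices and the predictions of two hypothetical models.
actual = [10000.0, 12000.0, 14000.0, 16000.0]
predictions = {
    "model_a": [10500.0, 11500.0, 14500.0, 15500.0],
    "model_b": [9000.0, 13000.0, 13000.0, 17000.0],
}

metrics = {name: evaluate(actual, predicted) for name, predicted in predictions.items()}
best = min(metrics, key=lambda name: metrics[name]["rmse"])
print(best, metrics[best])
```

A lower error and a coefficient of determination closer to 1 indicate a better fit on the test set.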

Conclusion

I've covered a small part of Azure Machine Learning to give an impression of the possibilities. The free version already offers a lot and is very useful for getting started with machine learning.

Greetz,

Hennie

Wednesday, 3 June 2015
