BI Future Blog: data driven

Posts tonen met het label data driven. Alle posts tonen

dinsdag 2 juli 2019

DAT208x : Introduction to Python for Data Science

Introduction

I'm participating in the Microsoft Professional Program AI by Micrsosoft. I've already done the programs Data science and Big data. These are also part of the Microsoft Professional Programs. I've experienced them as an easy to follow instruction based courses. There are more of these programs available that are interesting for data enthusiasts. Think about Internet of Things and Data analysis.

The great thing about these programs is that these programs consists of high quality instructor led courses, broken in easy to digest videos, exercises, labs and quizzes on the Edx site. So every minute spare time you have you can follow a couple of videos.

The program is broken in the following courses :

Introduction to Artificial Intelligence (AI)
Introduction to Python for Data Science
Essential Mathematics for Artificial Intelligence
Ethics and Law in Data and Analytics
Data Science Essentials
Build Machine Learning Models
Build Deep Learning Models
Build Reinforcement Learning Models
Develop Applied AI Solutions
Microsoft Professional Capstone : Artificial Intelligence

DAT208x : Introduction to Python for Data Science

This course is an introduction of Python in combination with data science. There are other Python courses available but they do not always focus on data science. This course it is. The course is a collaboration between Edx and DataCamp and I have to say that the interaction between the two sites works great.

The course is divided in the following sections:

Python Basics
List - A Data Structure
Functions and Packages
Numpy
Plotting with Matplotlib
Control Flow and Pandas
Final Lab
Final Exam

Final thoughts

The whole scripting is executing in a controlled environment of DataCamp. They did a great job building an integrated learning environment. Every section has one or more labs and they are graded in Edx.

The Final Lab is lot of work and covers more than the material in the sections and in the videos. This took me quite some time finding out how and what. Google is your friend here. The Final Exam contains 50 questions and must be finished within 4 hours. You have limited time per question.

Hennie

donderdag 9 mei 2019

Microsoft: DAT263x Introduction to Artificial Intelligence (AI)

Introduction

The great thing about these programs is that these programs consists of high quality instructor led courses, broken in easy to digest videos, exercises, labs and quizzes on the Edx site. So every minute spare time you have you can follow a couple of videos.

The program is broken in the following courses :

Introduction to Artificial Intelligence (AI)
Introduction to Python for Data Science
Essential Mathematics for Artificial Intelligence
Ethics and Law in Data and Analytics
Data Science Essentials
Build Machine Learning Models
Build Deep Learning Models
Build Reinforcement Learning Models
Develop Applied AI Solutions
Microsoft Professional Capstone : Artificial Intelligence

This blogpost describes the experiences I had with the first course : Introduction to Artificial Intelligence (AI).

DAT263x Introduction to Artificial Intelligence (AI)

This course is about an introduction of AI and exists of the following parts:

Machine learning
Language and communication
Computer vision
Conversation as a platform

Machine learning is a very lightweight introduction of machine learning and not a very comprehensive overview of the different terminology like AI, machine learning and deep learning. Very quickly the course presents Azure ML Studio with regresssion, classification and clustering.

Language and communication is about textprocessing, an introduction to NLP and using the application in Azure LUIS (Language Understanding Intelligent Service) with intents and how to use language processing in an example.

Computer vision is an introduction to get you starting with image processing and working with Images and Videos.

Conversation as a platform is about the bots: an introduction and how to build an intelligent bots

Final thoughts

I haven't followed the complete program yet (disclaimer alert!). The course is mostly about the products of Microsoft and is very hands-on. For a very theoretical, or an overview of AI I would rather look into another course like that of Andrew Ng on Coursera. Although I didn't participated in that specialization track yet, I think that kind of a course is more about the theory of AI.

But, if you want to know more about the products of Microsoft and how these are related to AI, I would recommend this program of Microsoft. I've learned about products like LUIS and didn't knew before.

I'll let you know my progress in the program!

Hennie

zondag 14 mei 2017

Getting filtered data from a tabular with Excel

Introduction

Building a PowerPivot model on filtered data of a table from a SQL Database is fairly simple. In the table properties you can change the table view to query view and change the query. I haven't found a similar way when using a tabular with Excel (realtime connection).

The only option I have found so far is using the CommandText in the connection properties, but that changes my look and feel of browsing the cube. The retrieved data from the tabular is just like one plain table. This blogpost will describe the way of adjusting the .odc file in Excel and getting the data as a plain table.

The goal of this blogpost is to show how to filter the factdata based on a dimension value (DimCustomer).

The Datamodel

I've created a simple model in SQL Server and imported it into the Tabular project. It is modeled as follows.

One Facttable with a key to the DimCustomer table and a measure Amount. I added a few records.

Some simple EVALUATES

In SSMS, I experimented with some EVALUATES to get some data from the Tabular. Below the result of this exercise.

--Getting the data from a table
EVALUATE Fact

--A simple  query (and join between tables)
EVALUATE
SUMMARIZE (Fact
,DimCustomer[Name]
,"TotalAmount", SUM(Fact[Amount])
)

--A simple query and a calculated measure is used
EVALUATE
SUMMARIZE (Fact
,DimCustomer[Name]
,"TotalAmount", Fact[TotalAmount]
)

--A calculated measure and a Filter
EVALUATE
FILTER(
 SUMMARIZE (Fact
 ,DimCustomer[Name]
 ,"TotalAmount", Fact[TotalAmount]
 ),
 DimCustomer[Name] = "Hennie"
)

Adjusting the .odc file

From the experiments in SSMS i took one of the DAX Evaluates and inserted that in the CommandText tag in the odc file of Excel.

This is the result in the Connection properties in the Excel file. The CommandType combobox is changed to Default and in the CommandText box a DAX EVALUATE expression is presented.

The result in Excel

And below you can see the fields in the FieldList and as you can see the Look and Feel has changed of the FieldList.

Conclusion

This blogpost is about how to filter the data in Excel from a Tabular Model.

Best regards,

Hennie

woensdag 22 juni 2016

PowerPivot : Discovering prediction of sales with simulation

Introduction

I was wondering whether I could add predictive analytics into my PowerPivot Workbooks. If I could predict the sales based on the sales in the former month or based on the same period last year or perhaps both, it would be a neat addition to my workbooks.

Now, the problem lies in the fact that statistical support in PowerPivot 2013 is very limited. In Excel you have statistical functions like NORM() and NORM.INV() that can help you building predictions (with simulation) in workbooks. But in PowerPivot 2010 and 2013 practically no support for statistics is available. Indeed we can calculate the Average and the standard deviation, but if you want to go a step further and use for instance a normal distribution function to calculate changes, you are out of luck.

So this blogpost is a kind of a discovery based blogpost where I try to find some solutions that would help predicting sales with aid of Excel 2013 and PowerPivot 2013. I found out that there are also (custom) functions in SQL Server that can help you building simulations, but that is for a later blogpost.

Simulation explained

Simulation is the ability to understand the system and implications of its changes over time including forecasting. Simulation is the mimicking of the operation of a real system, such as day to day operations in an organization. By using simulation you can quickly find out how the department or an organization might behave in the future. Simulation in a probabilistic model means that the outcome of the model changes every time you run it. It will change because a certain randomness in the formulas. This is in contrast with deterministic models when every time you run the model the output is the same.

Simulation doesn't show you what will occur; instead, it will show you many scenarios that might occur!

The data sets

I’ve created a couple of data sets in this model. The first one is the fact of sales for a certain organization. For simplicity, I’ve only included a date and a sales amount for the current month. Now we can extent this with another period If we like, for instance same period last year. I’ll not do explain this in more detail. This blogpost is just about getting the idea.

So this is the calendar dimension with a date as key and a daynumber as an attribute.

This is the sales data and it could be in a fact

Now for using the data in the simulation we need to calculate the average and the standard deviation. Luckily PowerPivot 2013 can calculate this!

       
AvgSales:=CALCULATE(AVERAGE(Sales[Sales]); ALL(Sales))

StanDevSales:=CALCULATE(STDEV.P(Sales[Sales]); ALL(Sales))

Now because PowerPivot does not have statistical functions I can use Excel to calculate the predictions. So I pull the averages and standard deviation into Excel and calculate there the simulation per day.

With this code for Average Sales and for the standard deviation:

       
=CUBEVALUE("ThisWorkbookDataModel"; "[Measures].[AvgSales]")
       
=CUBEVALUE("ThisWorkbookDataModel";"[Measures].[StanDevSales]")

The following piece of information I use in the table to calculate the predicted sales based on the average and standard deviation:

       
=NORM.INV(RAND();$E$4;$E$5)

Now I pull this information back into PowerPivot in order to integrate this later in a PivotChart.

And now I can build this data model

And now I can draw this graph with the information :

And if we refresh the data you can see that is a probabilistic model and the prediction is simulated differently when the simulation executes multiple times. This is illustrated with the grey line.

Review of the solution

Now there are some serious drawbacks about this solution. If I want to reload the data with new calculations PowerPivot reports an error. The column with the predicted sales is showing "GETTING DATA".

If the button "Refresh Selected" is pressed the error disappears and the data is refreshed again. Now, this is not very feasible in an automated environment, is it?

Another disadvantage is passing the average and the standard deviation to Excel with the CUBEVALUE. I haven't found another solution yet but I need to reference this cells with an average and standarddeviation in the NORM.INV() function in Excel. The problem is that this is nice on high level sales analysis but what if we want to include products in the simulation. Referencing every product with CUBEVALUE is time consuming and prone to errors when products are introduced and others decay.

Yet, another problem is when the data is not normally distributed. In the diagram below the predicted sales is even negative!

Conclusion

This was a interesting exercise to understand simulation a bit further. It seems that support for this in PowerPivot 2013 is minimum. Perhaps, better tooling are PowerPivot 2016, PowerBI v2 (Desktop). I've read that there is still no support for a Excellish NORM.INV() and NORM() in PowerBI v2. SQL Server 2016 has an R engine included. But, I think it is possible to automate this with SQL Server T-SQL functions as well. But that is for a future blogpost.

Regarding the simulation, this example assumes that the data is normally distributed, but for instance there are products that are just introduced and has another pattern (exponential?). Now, if you want to predict trustful figures a good model is needed. Now the problem is, what is a good model. As someone said :"All models are wrong but some are less wrong", you have to strive to a maximum of effort to bring the model to perfection. Perhaps you have to apply the 80/20 rule: 20 % of effort will give you a 80% correctness.

In my opinion you need to test first whether what kind of distribution the data has. Is it normally distributed, is it skewed? Or perhaps there are seasonal aspects in the data? Or can I give a reliability number to a prediction? So, a nice discovery of the subject but not ready for a real situation.

Greetz,
Hennie