zaterdag 31 januari 2015

Datamining: Reasoning (Part I)

Introduction

According to Wikipedia, data mining is: "Searching for statistical patterns in (large) datasets for purposes like scientific, commercial usage, etc". So data mining supports discovery-based strategies (in contrast with business-driven discovery) to find relations, correlations or perhaps categorizations. Data mining is about induction. Induction means generalizing from a certain number of observations. Induction is the opposite of deduction, in which a generic rule is applied to specific observations.

Deductive reasoning

Deductive reasoning is concerned with conclusions that follow with certainty from their premises, whereas inductive reasoning refers to situations where conclusions follow only probabilistically from their premises. There are a couple of rules of inference, according to a book from my good old study days at university: "Cognitive Psychology and Its Implications" by John R. Anderson. I can recommend this book by the way (if it's still in print).

Conditional statement (modus ponens)
A conditional statement is an assertion such as "if you read this blogpost, you will be wiser". The if part is called the antecedent and the then part is the consequent. This form of reasoning is called modus ponens: given "if A, then B" and the proposition A, we can infer B. Suppose the following:

1. If you read this blogpost, you will be wiser.
2. You read this post.

From premises 1 and 2 we can infer 3 by modus ponens.

3. You are wiser!

Modus tollens
Another rule of inference is modus tollens. This rule states that if we are given the proposition "A implies B" and the fact that B is false, then we can infer that A is false. An example:

1. If you understand this blogpost, you understand reasoning with data mining.
2. You don't understand reasoning with data mining.

It follows from modus tollens that:

3. You don't understand this blogpost.

Inductive reasoning

Induction is a way of reasoning, although it does not act as a strict proof. With inductive reasoning a general rule is derived from a certain number of observations; we call this generalization. Inductive reasoning is the term used to describe the process by which one comes to conclusions that are probable rather than certain. This seems much more useful, because very little is certain and most things are, at best, very likely.

1. If you read this blogpost, you will be wiser.
2. You are wiser.

Then it follows with inductive reasoning:

3. You've read this blogpost! Great!

Bayes' Theorem provides a way of assessing the plausibility of this reasoning. What is the probability that you are wiser because you read this blogpost? There is also a probability that you have become wiser by reading something else, and that is important to know too.
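As a rough illustration, below is a minimal sketch of how Bayes' Theorem could be applied to this example. All of the probabilities are made-up numbers, purely to show the calculation of P(read | wiser) = P(wiser | read) * P(read) / P(wiser), written here as a few lines of Python.

# Hypothetical numbers, only to illustrate Bayes' Theorem
p_read = 0.10                  # prior probability that someone has read this blogpost
p_wiser_given_read = 0.80      # probability of being wiser after reading it
p_wiser_given_not_read = 0.30  # probability of being wiser from some other source

# Total probability of being wiser (law of total probability)
p_wiser = p_wiser_given_read * p_read + p_wiser_given_not_read * (1 - p_read)

# Bayes' Theorem: probability that you read the blogpost, given that you are wiser
p_read_given_wiser = p_wiser_given_read * p_read / p_wiser
print(round(p_read_given_wiser, 2))  # 0.23 with these made-up numbers

So even if reading the blogpost makes you wiser in 80% of the cases, the conclusion "you've read this blogpost" is still far from certain, because there are other ways to become wiser.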

Conclusion

This was an introduction to data mining and to reasoning with data. This blogpost discussed some issues with inductive reasoning and data mining.

maandag 8 december 2014

Let's Disco!

Introduction

Disco is a process mining tool that lets you discover processes by analyzing event logs. It's suitable for analyzing the transactions that happen in a process; a sales process or an order process, for instance, can be analyzed. You can examine bottlenecks in your process, loops, times, durations, averages, fast/slow lanes, conformance issues, resource performance, etc.

So process mining tools like ProM and Disco (and I've even seen a demo of SAS Visual Analytics with a Sankey diagram) are very well suited for process model analysis. Different tools are enhancing process analysis, and process analysis can be a great addition to Business Intelligence. Where Business Intelligence is more like fishing with a fishing rod in an organisation for KPIs and business metrics, process mining is much more 'on' the process. Business Intelligence and process mining can work together to optimize processes.

A simple tour through Disco

In this blogpost I've included some screenshots of Disco. I've used a small file that I borrowed from the Coursera course "Process Mining: Data Science in Action", lectured by Wil van der Aalst of TU/e. Below is the opening screen of Disco.


The first thing you have to do is load the data, in this case a CSV file.


Then you have to set the case ID, the events (activities), the timestamps and the resources. This is needed for the analysis of the process.
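To give an idea of what such an input file contains, a minimal event log could look something like the fragment below. The rows and column names are made up for illustration; the essential point is that each row records a case, an activity and a timestamp (and optionally a resource).

case_id;activity;timestamp;resource
1001;Register order;2014-11-03 09:12;Sarah
1001;Check stock;2014-11-03 10:40;Mike
1001;Ship order;2014-11-04 14:05;Mike
1002;Register order;2014-11-03 11:30;Sarah
1002;Cancel order;2014-11-05 08:55;Sarah

In the import screen of Disco you then map these columns to the case ID, activity, timestamp and resource roles.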




The next step is importing the data and generating the process map. Below is an example of the duration of the process steps.


Here is an example of the mean durations:


And below is an example of replaying the data on the model. The yellow/red dots are markers that flow through the model.



And below is some statistical analysis with Disco:


Some more information about the cases.


Conclusion

Disco is a great tool for the analysis of processes. Process mining can be a great addition to Business Intelligence and is very helpful for analyzing processes. Both analyze processes, but on different levels.

Greetz,
Hennie

donderdag 4 december 2014

Process mining

Introduction

I'm currently following the process mining course on Coursera, and this course is a combination of data mining and process models. Somewhat like Business Intelligence, process mining analyzes data about processes in an organisation. I'm quite enthusiastic about this approach, because it analyzes processes in a scientific manner with data mining. Data mining analyzes data, but not with the (direct) aim of looking at processes in an organisation; process mining does. There is also a relation with BPM (Business Process Modelling): in the course BPM models are sometimes used alongside Petri nets.

Although I've just started with the course, I want to share some interesting things I've come across so far. In this blogpost I'll describe them.

Defining process mining

The definition of process mining according to Wikipedia:

"Process mining is a process management technique that allows for the analysis of business processes based on event logs.".

But in my opinion 'event logs' is a bit of a narrow keyword; a broader definition could be applicable. Think about facts in a star schema, or satellite information in a Data Vault model, which are very often used in business intelligence and data warehousing. These are transactions that happen in the operations of an organisation, and these are also events: events that happened. Think about an order entry system with statuses. Sometimes customers told me how the order entry process worked, and when I studied the different statuses an order could have, I sometimes found out that other sequences of the order process were possible. With process mining you can identify undiscovered routes of your business process in an automated way. This is truly an addition to the field of Business Intelligence, Lean Six Sigma, data mining and BPM.

Just a simple example: suppose you hear from the customer that the model is this (orders with order statuses):



But when we study the transactions of the order entry system, the following is noticed (records are identified by a case ID, which groups the records, and the activity at a certain moment):
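As an illustration, such a set of transactions could look like the made-up fragment below (the activity names are hypothetical; the reopened order 4568 is the case discussed next):

case ID   activity
4567      Created
4567      Approved
4567      Shipped
4567      Closed
4568      Created
4568      Approved
4568      Shipped
4568      Closed
4568      Reopened
4568      Closed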

Here we see that order 4568 is reopened, and this was not supposed to happen according to the designed model. After analyzing the events in a log, or perhaps a transactionally modelled star schema, the model appears like this (corrected):


It could mean that in the operational process, order entry personnel have reopened the order for some reason. If you want to optimize the process in order to reduce waste (Lean Six Sigma), then this is very interesting information. Process mining can do this for you.

Conclusion

Although I've just started studying process mining, this seems a very interesting approach for analyzing processes with data mining. And it is also applicable to huge log files, and analyzing log files is one of the applications of big data analytics.

I hope you have read this blogpost with pleasure.

Greetz,

Hennie

maandag 1 december 2014

SSAS SSRS: Bad performance of SSRS parameters and SSAS

Introduction

SSRS report parameters and SSAS cubes... There is a lot to read on the internet about the bad performance of parameters that are generated by Reporting Services based upon an SSAS cube. Today I ran into this problem too. I have a parameter with a dataset that queries a customer table of about 40,000 records. Peanuts for a simple SQL query, but not for an SSAS cube (pfffff). And I'm not the only one who has encountered this problem: you can read hundreds of blogposts, forum posts and more. This is one more blogpost about this bad cooperation between SSRS and SSAS.

The lab situation is this: we have a main report with a customer parameter and a subreport. I want to pass a customer list via the parameter to the subreport in order to show detail records. In my real-world example a map is shown, and clicking on an area on the map shows the detail records that belong to that region.

Analyzing the parameter performance problem

The initial problem of a parameter with 40,000 records is simple: it doesn't perform! I will show you what kind of issues arise when you create a parameter with 40,000 records. First, create a dataset with a parameter "Customer" by checking the parameter option in the query window.




A parameter dataset is created (you can't see it, because it's hidden); choose "Show hidden datasets" to make it visible. If you run the report, you'll notice that the performance is bad... very bad. When you open Profiler to take a look at what is happening during cube querying, you'll see a lot of Serialize Results events; they occur often (per 1000) and they are sloooowwww. The Serialize Results Current event is a very slow process. So processing the query is very fast, but serializing the result for the report is slow.



If you take a look at this diagram, you can see that the serialization is done from the 'compute cell data' part, and the result is transferred to the client application.


Now, I haven't found a silver bullet, unfortunately, but Chris Webb helped me a bit further. In his blogpost, Chris Webb explains this problem in more detail and gives a solution. Well, sort of, but I'll explain that later. Suppose you add a parameter to a dataset according to the following screenshot.



Then a parameter is automatically created with a dataset:



And, the query in this dataset looks like this:

WITH
  MEMBER [Measures].[ParameterCaption] AS [Customer].[Customer].CURRENTMEMBER.MEMBER_CAPTION
  MEMBER [Measures].[ParameterValue] AS [Customer].[Customer].CURRENTMEMBER.UNIQUENAME
  MEMBER [Measures].[ParameterLevel] AS [Customer].[Customer].CURRENTMEMBER.LEVEL.ORDINAL
SELECT
  {[Measures].[ParameterCaption],
   [Measures].[ParameterValue],
   [Measures].[ParameterLevel]} ON COLUMNS,
  [Customer].[Customer].ALLMEMBERS ON ROWS
FROM [Adventure Works]

According to Chris Webb, you can replace this query with an OLE DB query. This query performs much faster than the query generated by SSRS.

WITH
 MEMBER MEASURES.DUMMY AS NULL
SELECT
{MEASURES.DUMMY}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
DIMENSION PROPERTIES UNIQUE_NAME, MEMBER_CAPTION, LEVEL_NUMBER
ON ROWS
FROM [Adventure Works]
CELL PROPERTIES VALUE

This speeds up loading the data into the parameter.
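Note that this query is meant to be run through an OLE DB data source instead of the regular Analysis Services data source. As a rough sketch, assuming the MSOLAP OLE DB provider for Analysis Services is installed, such a data source could use a connection string along these lines (the server and database names are placeholders for your own environment):

Provider=MSOLAP;Data Source=localhost;Initial Catalog=Adventure Works DW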

Invalid URI: The URI string is too long

The next problem I had was that I got an error when I called a subreport with the same parameter. It seems that there is a limitation of 65,520 characters that you can pass with a parameter. This is the error:


So there is a limitation on the number of characters that can be passed to a subreport. Teo Lachev explains why. There are a couple of options; the best option is to limit the number of characters of the parameter that are passed to the subreport. This is not perfect, because when the amount of data grows somewhere in the future, you'll run into problems again and you won't be able to solve them anymore.

Okay, how do we limit the number of characters that are passed to the subreport? Well, Erika Bakse showed in her blogpost a way to limit the number of members. In this example I've created a main report and a subreport. The main report has a Customer parameter that lists customers.

For this blogpost I've created a simple report with a Tablix. On the Textbox Properties I've created an Action with an expression.



This is the expression that is passed to the subreport:

=Join(Parameters!CustomerCustomerName.Label, "],")





This is the output of the expression:


This output is without the full member names and is therefore compact. In the parameter of the subreport this is one large string:


With the following expression you can change this comma-separated string back into full member names:

="[Customer].[Customer Name].&[" + REPLACE(Parameters!Customer.Label(0) + "]", "],", "],[Customer].[Customer Name].&[")

Shown below:

The output of the expression is this:


And now we can pass this to a dataset and filter the records with this list.
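To make the round trip concrete, here is a small made-up example with two hypothetical customer labels, "Aaron Adams" and "Aaron Alexander". The Join expression in the main report then produces:

Aaron Adams],Aaron Alexander

and the Replace expression in the subreport turns that back into:

[Customer].[Customer Name].&[Aaron Adams],[Customer].[Customer Name].&[Aaron Alexander]

which can then be used to filter the dataset of the subreport.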

Conclusion

This URI problem happens quite often. I've encountered it in numerous environments with SQL Server 2008, SQL Server 2008 R2 and SQL Server 2012, in different situations. I've also encountered different behaviour between SSRS, Report Manager and Report Builder.

Greetz

Hennie




zondag 16 november 2014

BI and the Anscombe quartet

The Anscombe quartet shows us something that is also relevant to Business Intelligence. In Business Intelligence it is common to show KPIs or business metrics. A KPI as one single number sometimes hides important information, as the Anscombe quartet shows us.


And they all have the same mean, variance, standard deviation, correlation and linear regression line!


What does this tell us? The statistician Francis Anscombe constructed this quartet to demonstrate the importance of graphical data analysis and the effect of outliers, before one decides on a statistical analysis on the basis of summary characteristics alone. The data sets show that simple statistical characteristics are not always sufficient to describe the data.
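If you want to check this yourself, below is a quick sketch in Python. It assumes the pandas and seaborn packages are available; seaborn ships the quartet as an example dataset called "anscombe" (downloaded on first use).

# Compute the summary statistics of the four Anscombe datasets
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset (I-IV), x, y

for name, group in df.groupby("dataset"):
    print(name,
          round(group["x"].mean(), 2), round(group["y"].mean(), 2),
          round(group["x"].var(), 2), round(group["y"].var(), 2),
          round(group["x"].corr(group["y"]), 3))

All four datasets give (almost) exactly the same numbers, yet they look completely different when you plot them.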

Conclusion

Be very cautious showing numbers in General Business Metrics and KPI's in your report and your dashboards. Tell the whole story behind a KPI or Business metric and do not only show a number.