Sunday 15 March 2015

Big Data : Data Obese (Part I)


We are buried under the amount of information that we collect, especially the data we refer to as big data. This information is mostly unstructured or semi-structured, like JSON or XML. We need to go on an information diet. How do we handle this?

In addition, some companies (Cloudera) have already announced that the end of the data warehouse will come once Hadoop is fully grown. I'm still not convinced, especially for the short and medium term. Hadoop is an ecosystem that is still under development, and it has lots of potential.

Why a (structured) data warehouse?

Why do we have a (structured) data warehouse? You would almost forget it amid the big data noise that we hear and read daily. Let's go back to the basics. A data warehouse is the place where we can store all information of an organization (and beyond) in a structured way. Structure is the important word in this statement, and it has many advantages. The data has been optimized for complex queries. It has been cleaned (ETL) and labelled (metadata), for example: this is a varchar and it is always 50 characters wide.

Relieving a source system can be an important advantage too. If there is no data warehouse, each data mart or quick report must access the source system itself. Retrieving the data once into an isolated environment relieves the production systems.

But the main advantage is that a data warehouse can integrate data from different sources. Based on the business keys used by the organization, you can integrate, for example, sales, purchasing and production (the vertical columns) and make them horizontal again. You collect and integrate data from all departments.

And if you want to do it completely right, you have to make sure that you save the history of the data from the source systems properly. People often think this means the old data, but it does not. The point is that you save the changes to the data in the source systems. The advantage is that you have one central place, and your data warehouse is the truth for all corporate data. After all, you have saved the state of a source system at any point in time. So now you can time travel: "How did my report look on April 1, 2013?" The report shows the position of the data at that specific time; subsequent changes have no effect. No more guessing that changes in the data caused different figures in the reports (which is what happens if you do not save changes). There are other advantages too: you are 'auditable', you can generate new insights from old data, and you can analyze processes, for example when you have registered status changes (process mining).
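As a minimal sketch of the time travel idea, assume each change from the source system is stored as a row with a validity period. The customer, the cities and the column names below are hypothetical and only serve as illustration.

```python
from datetime import date

# Hypothetical change history for one customer. Each row is valid from
# valid_from up to (but not including) valid_to; None means "still current".
history = [
    {"customer": "C1", "city": "Utrecht",
     "valid_from": date(2012, 1, 1), "valid_to": date(2013, 6, 1)},
    {"customer": "C1", "city": "Amsterdam",
     "valid_from": date(2013, 6, 1), "valid_to": None},
]

def as_of(rows, when):
    """Return the rows that were valid on the given date."""
    return [
        r for r in rows
        if r["valid_from"] <= when
        and (r["valid_to"] is None or when < r["valid_to"])
    ]

# How did the data look on April 1, 2013?
snapshot = as_of(history, date(2013, 4, 1))
print(snapshot[0]["city"])
```

Because the later change to Amsterdam is stored as a new row instead of an overwrite, a report for April 1, 2013 keeps showing the situation of that date.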

How does Big data fit into this?

But now the big question: how does big data fit into this story? The question is what we mean by big data, and that is difficult. It has become such a broad container concept that it is not entirely clear what exactly big data is. Looking at the 3 Vs, we see that it is a lot of data (volume), that it is very diverse (variety), and that it arrives fast (velocity). Suppose we assume an (enterprise) data warehouse with complex business rules and integration issues; real-time processing cannot be expected there. This is simply not easy to achieve. Assuming that speed (real time) is less important, two aspects remain that may matter in a big data data warehouse: the amount of data and the variety.

For the amount of data, we could build a Hadoop solution that stores the unstructured data. Because there is no schema, you can store data quickly in a Hadoop cluster, and you can store all information without knowing the schema of the data. Imagine that your data structure has changed: in an RDBMS you have to change the table (for example, drop and recreate it), and that is not needed with Hadoop.

There is no schema on the data (we do not have data types such as varchar or integer). In short, when we read the data from the Hadoop cluster, we have to define a schema. The schema can be defined very broadly. The script (mapper) determines what the schema is. For example, you can run a text ETL script that looks for patterns. Or, with Hive, which is part of the Hadoop ecosystem, you can define a schema on top of the data files in HDFS. This is also called "schema-on-read".
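The idea of defining the schema at read time can be sketched in a few lines of plain Python. This is not Hive or a Hadoop mapper; the file content, the column names and the types are assumptions made up for the illustration.

```python
import csv
import io

# Raw file content as it might sit in a data lake: just text, no types.
raw = "4568,2015-03-01,19.95\n4569,2015-03-02,5.00\n"

# The schema is supplied by the reader, not by the storage layer.
# Column names and types are hypothetical.
schema = [("order_id", int), ("order_date", str), ("amount", float)]

def read_with_schema(text, schema):
    """Parse untyped text rows and cast each field according to the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, fields)}
        )
    return rows

orders = read_with_schema(raw, schema)
print(orders[0])
```

If the structure of the files changes, only the `schema` list used by the reader changes; the stored files stay as they are.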

But let's just focus on unstructured information such as email, reports and Word documents, or files that have a bit more structure, such as error logs or web logs. So basically data without metadata. How can we analyze this? We can use text ETL. You can search for patterns using, for example, taxonomies or ontologies. Below is an example of a taxonomy:

  • Auto
    • Manufacturer
      • Honda
      • Fiat
      • Porsche
    • Type
      • SUV
      • Sedan
      • Station
  • Aircraft
    • Manufacturer
      • Airbus
      • ....

Suppose you want to analyze this sentence:

"We drove and we passed a Porsche and a Volkswagen on the highway"

If we process this with a taxonomy, the following may come out:

"We drove and we passed a Porsche / Manufacturer and a Volkswagen / Manufacturer on the highway / Road"

Then we could save this information as a Name - Value pair:

Manufacturer: Porsche
Manufacturer: Volkswagen
Road: Highway

This way, it looks a bit more structured, and this would be the moment to load it into the data warehouse and combine it with other structured data. This is not the only way; there are many more, e.g. name-value processing, homographic resolution, or list processing. These are all techniques you can use to filter patterns out of the data (to schematize it).
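The taxonomy step above can be sketched as a simple lookup. This is only a toy illustration, not a real text ETL tool: the flattened taxonomy is hypothetical, and the entries for Volkswagen and highway are assumed because they are not part of the taxonomy excerpt shown earlier.

```python
import re

# Hypothetical flattened taxonomy: lowercase term -> (category, display value).
TAXONOMY = {
    "porsche": ("Manufacturer", "Porsche"),
    "volkswagen": ("Manufacturer", "Volkswagen"),  # assumed entry
    "honda": ("Manufacturer", "Honda"),
    "fiat": ("Manufacturer", "Fiat"),
    "suv": ("Type", "SUV"),
    "sedan": ("Type", "Sedan"),
    "highway": ("Road", "Highway"),  # assumed entry
}

def tag(sentence):
    """Return (name, value) pairs for every word found in the taxonomy."""
    pairs = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        entry = TAXONOMY.get(word.lower())
        if entry is not None:
            pairs.append(entry)
    return pairs

pairs = tag("We drove and we passed a Porsche and a Volkswagen on the highway")
for name, value in pairs:
    print(f"{name}: {value}")
```

The resulting name-value pairs are the kind of rows that could then be loaded into a structured table.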

An architecture that can handle this could look like this:

The data from structured sources lands in a structured staging area, and the unstructured data lands in a data lake (Hadoop). The structured information is then loaded into an Enterprise Data Warehouse (EDW), processed, and reported on with BI tooling. The unstructured data is stored in the data lake and analyzed in an analytical platform with all kinds of tooling, such as R or the Power series from Microsoft. This is a more explorative process (discovery). The results are written back into the data lake, after which this data can be read into the EDW again. The EDW also provides data to the analytical platform in order to enrich the unstructured data. So there are multiple loops in this architecture, and that is generally the case with big data architectures. It is a heterogeneous environment with various tools that each do what they are good at.


So this is a way to reduce the amount of data and make the data less 'data obese'. It is important that we find ways to extract information from the enormous piles of data, so that we can use it in analytical platforms and add value to structured (enterprise) data warehouses.



Saturday 14 March 2015

Open Virtualization Format Archive

I've downloaded a Hadoop distribution from Hortonworks, and it comes as an Open Virtualization Format archive. Open Virtualization Format (OVF) is an open standard for packaging and distributing virtual appliances that run in virtual machines.

This is how I loaded the OVF file in VirtualBox.

1. Start VirtualBox Manager.

2. Click File, then Import Appliance.

3. Select the OVF file.

4. Adjust the settings.

5. Press Import and the import starts.

6. And the result.


Saturday 31 January 2015

Datamining : Reasoning (Part I)


According to Wikipedia, datamining is: "Searching for statistical patterns in (large) datasets for purposes like scientific, commercial usage, etc". So datamining is for discovery based strategies (in contrast with business based discovery) to discover relations, correlations or perhaps categorizations. Data mining is about induction. Induction is generalizing from a certain number of observations. Induction is the opposite of deduction. Deduction is applying a generic rule to a certain set of observations.

Deductive reasoning

In contrast to inductive reasoning, deductive reasoning is concerned with conclusions that follow with certainty from their premises, whereas inductive reasoning refers to situations where conclusions follow only probabilistically from their premises. There are a couple of rules of inference, according to the book from my good old studies at the university: "Cognitive Psychology and Its Implications" by John R. Anderson. I can recommend this book, by the way (if it's still sold).

Conditional statement (modus ponens)
A conditional statement is an assertion such as "if you read this blogpost, you will be wiser". The if part is called the antecedent and the then part is the consequent. This kind of reasoning is called modus ponens (given "if A, then B" and the proposition A, we can infer B). Suppose the following:

1. If you read this blogpost, you will be wiser.
2. You read this post.

From premises 1 and 2 we can infer 3 by modus ponens.

3. You are wiser!

Modus tollens
Another rule of inference is modus tollens. This rule states that if we are given the proposition "A implies B" and the fact that B is false, then we can infer that A is false. An example:

1. If you understand this blogpost, you understand reasoning with datamining
2. You don't understand reasoning with datamining.

It follows from modus tollens that:

3. You didn't understand this blogpost.
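Both rules of inference above can be checked with a small truth table. The sketch below is only an illustration; the helper names `implies` and `is_valid` are made up for it. It also checks the reverse direction (from "if A, then B" and B to A), which is not deductively valid.

```python
from itertools import product

def implies(a, b):
    """Truth value of 'if A, then B'."""
    return (not a) or b

def is_valid(premises, conclusion):
    """An argument form is valid if the conclusion holds in every
    truth assignment in which all premises hold."""
    return all(
        conclusion(a, b)
        for a, b in product([True, False], repeat=2)
        if all(p(a, b) for p in premises)
    )

# If A then B, A, therefore B.
modus_ponens = is_valid([implies, lambda a, b: a], lambda a, b: b)
# If A then B, not B, therefore not A.
modus_tollens = is_valid([implies, lambda a, b: not b], lambda a, b: not a)
# If A then B, B, therefore A: does not follow with certainty.
reverse_direction = is_valid([implies, lambda a, b: b], lambda a, b: a)

print(modus_ponens, modus_tollens, reverse_direction)
```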

Inductive reasoning

Induction refers to a way of reasoning that serves as a form of evidence. With inductive reasoning, a general rule is derived from a certain number of observations. We call this generalization. Inductive reasoning is the term used to describe the process by which one comes to conclusions that are probable rather than certain. This seems much more useful, because very little is certain and most things are, at the very best, very likely.

1. If you read this blogpost, you will be wiser.
2. You are wiser.

Then it follows with inductive reasoning:

3. You've read this blogpost! Great!

Bayes's Theorem provides a way for assessing the plausibility of this reasoning. What is the probability that you are wiser because of reading this blogpost? There is also a probability that you have become wiser because of reading something else. That is important to know too.
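A small sketch of how Bayes's theorem weighs this. All probabilities below are invented numbers, chosen purely to illustrate the calculation.

```python
def posterior(p_read, p_wiser_given_read, p_wiser_given_not_read):
    """P(read | wiser) by Bayes's theorem.

    The denominator is the total probability of being wiser, whether
    through this blogpost or through reading something else.
    """
    p_wiser = (
        p_wiser_given_read * p_read
        + p_wiser_given_not_read * (1 - p_read)
    )
    return p_wiser_given_read * p_read / p_wiser

# Hypothetical numbers: 10% of people read the post, 90% of readers get
# wiser, and 30% of non-readers get wiser from something else.
p = posterior(p_read=0.1, p_wiser_given_read=0.9, p_wiser_given_not_read=0.3)
print(round(p, 2))
```

With these made-up numbers, being wiser makes it only 25% likely that you read the blogpost, because most wiser people got there through something else.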


This is an introduction to datamining and inductive reasoning with data. This blogpost discusses some issues with inductive reasoning and datamining.

Monday 8 December 2014

Let's Disco!


Disco is a process mining tool that lets you discover processes by analyzing event logs. It's suitable for analyzing the transactions that happen in a process, for instance a sales process or an order process. You can examine bottlenecks in your process, loops, time, durations, averages, fast/slow lanes, conformance issues, resource performance, etc.

So process mining tools like ProM and Disco (and I've even seen a demo of SAS Visual Analytics with a Sankey diagram) are very well suited for process model analysis. Different tools enhance process analysis in different ways. Process analysis can be a great addition to Business Intelligence. Where Business Intelligence is more like fishing with a fishing rod in an organisation for KPIs and business metrics, process mining is much more 'on' the process. Business Intelligence and process mining can work together to optimize processes.

A simple tour through Disco

In this blogpost, I've included some screenshots of Disco. I've used a small file that I borrowed from the Coursera course "Process Mining: Data Science in Action", taught by Wil van der Aalst of TU/e. Below is the opening screen of Disco.

The first thing that you have to do is load the data. In this case a csv file.

Then you have to set the case ID, the events, the timestamps and the resources. This is needed for the analysis of the process.
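To show why the case ID, activity and timestamp columns matter, here is a minimal sketch that counts which activity directly follows which within a case. This is not Disco's actual algorithm, and the log content and column names are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log; column names and values are made up.
LOG = """case_id,activity,timestamp,resource
1,register,2014-12-01 09:00,Ann
1,check,2014-12-01 10:30,Bob
1,pay,2014-12-02 09:00,Ann
2,register,2014-12-01 11:00,Ann
2,pay,2014-12-01 12:00,Bob
"""

def directly_follows(log_text):
    """Count how often activity X is directly followed by Y within a case."""
    traces = defaultdict(list)
    for row in csv.DictReader(io.StringIO(log_text)):
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M")
        traces[row["case_id"]].append((ts, row["activity"]))
    counts = Counter()
    for events in traces.values():
        # Order the events of one case by timestamp, then pair neighbours.
        activities = [activity for _, activity in sorted(events)]
        counts.update(zip(activities, activities[1:]))
    return counts

edges = directly_follows(LOG)
for (source, target), n in edges.items():
    print(f"{source} -> {target}: {n}")
```

The case ID groups the rows, the timestamp orders them, and the activity names become the nodes of the map.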

The next step is importing the data and generating the process map. Below is an example of the duration of the process steps.

Here is an example of the mean of the durations.

And below, an example of replaying the data on the model. The yellow/red dots are markers that flow through the model.

And below, some statistical analysis with Disco:

Some more information about the cases.


Disco is a great tool for the analysis of processes. Process mining can be a great addition to Business Intelligence and very helpful for the analysis of processes. Both analyze processes, but at different levels.


Thursday 4 December 2014

Process mining


I'm currently following the process mining course on Coursera, which combines data mining and process models. Much like Business Intelligence (sort of), process mining analyzes data about the processes in an organisation. I'm quite enthusiastic about this approach because it analyzes processes in a scientific manner with data mining. Data mining analyzes data, but not with the (direct) aim of looking at the processes in an organisation. Process mining does. There is also a relation with BPM (Business Process Modelling). In the course, BPM models are sometimes used alongside Petri nets.

Although I've just started with the course, I want to share some interesting things I've come across during the course. In this blog post I'll describe them.

Defining process mining

The definition of Processmining according to Wikipedia :

"Process mining is a process management technique that allows for the analysis of business processes based on event logs.".

But in my opinion, "event logs" is a bit of a narrow term. A broader definition could be applicable. Think about the facts in a star schema or the satellite information in a Data Vault model, which are very often used in business intelligence and data warehousing. These are transactions that happen in the operations of an organisation. These are also events: events that happened. Think about an order entry system with statuses. Sometimes customers told me how the order entry process worked, and when I studied the different statuses an order could have, I found out that other sequences of the order process were possible. With process mining you can identify undiscovered routes in your business process in an automated way. This is truly an addition to the fields of Business Intelligence, Lean Six Sigma, data mining and BPM.

Just a simple example. Suppose the customer tells you that the model is this (orders with order statuses):

But when we study the transactions of the order entry system the following is noticed (records are identified by a case ID (the grouping of the records) and the activity at a certain moment):

Here we see that order 4568 is reopened, and according to the designed model this should not have happened. After analyzing the events in a log, or perhaps in a transactionally modelled star schema, the model appears like this (corrected):

It could mean that in the operational process, order entry personnel reopened the order for some reason. If you want to optimize the process in order to reduce waste (Lean Six Sigma), then this is very interesting information. Process mining can do this for you.
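The check described above can be sketched in a few lines. The statuses, the allowed transitions and order 4567 are hypothetical; only the reopened order 4568 comes from the example.

```python
# Hypothetical designed model: the status transitions the customer described.
ALLOWED = {("New", "Open"), ("Open", "Closed")}

# Hypothetical transactions (case ID, status), already in time order.
EVENTS = [
    (4567, "New"), (4567, "Open"), (4567, "Closed"),
    (4568, "New"), (4568, "Open"), (4568, "Closed"),
    (4568, "Open"), (4568, "Closed"),  # order 4568 is reopened
]

def deviations(events, allowed):
    """Return (case ID, from status, to status) for every transition
    that the designed model does not allow."""
    last_status = {}
    found = []
    for case_id, status in events:
        previous = last_status.get(case_id)
        if previous is not None and (previous, status) not in allowed:
            found.append((case_id, previous, status))
        last_status[case_id] = status
    return found

print(deviations(EVENTS, ALLOWED))
```

Every transition that shows up here is a route that exists in the data but not in the model the customer described.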


Although I've just started studying process mining, this seems a very interesting approach for analyzing processes with data mining. It is also applicable to huge log files, and analyzing log files is one of the applications of big data analytics.

I hope you have read this blogpost with pleasure.