Sunday, March 15, 2015

Big Data: Data Obese (Part I)

Introduction

We are buried under the amount of information we collect, especially the data we refer to as big data. This information is mostly unstructured or semi-structured, such as JSON or XML. We need to go on an information diet. How do we handle this?

In addition, some companies (Cloudera, for example) have already announced that the end of the data warehouse will come once Hadoop is fully grown. I'm still not convinced, especially in the short and medium term. Hadoop is an ecosystem that is still under development, and it has lots of potential.



Why a (structured) data warehouse?

Why do we have a (structured) data warehouse? You would almost forget, amid the Big Data barrage we hear and read about daily. Let's go back to basics. A data warehouse is the place where we can store all the information of an organization (and beyond) in a structured way. Structure is the key word in this statement, and it brings many advantages: the data has been optimized for complex queries, it has been cleaned (ETL), and it has been labelled with metadata, such as "this field is a varchar and it is always 50 characters wide".

Relieving a source system can be an important advantage too. If there is no data warehouse, every data mart or ad hoc report must access the source system directly. Retrieving the data once into an isolated environment thus relieves the production systems.


But the main advantage is that a data warehouse can integrate data from different sources. Based on the business keys used by the organization, you can integrate, for example, sales, purchasing and production: the vertical (departmental) columns are made horizontal again. You collect and integrate data across all departments.
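As an illustration, below is a minimal sketch (not the author's implementation) of integrating two departmental data sets on a shared business key, using pandas and a hypothetical product code; in a real data warehouse this would of course happen in the ETL process and the database itself.

import pandas as pd

# Two departmental extracts that share the organization's business key (product_code).
sales = pd.DataFrame({
    "product_code": ["P-100", "P-200"],
    "units_sold": [120, 75],
})
purchasing = pd.DataFrame({
    "product_code": ["P-100", "P-200"],
    "units_bought": [150, 80],
})

# Join the vertical (departmental) silos back together on the business key.
integrated = sales.merge(purchasing, on="product_code", how="outer")
print(integrated)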

And if you want to do it completely right, you also have to make sure that you properly save the history of the data in the source systems. People often think this means the old data, but it does not: the point is that you save the changes to the data in the source systems. The advantage is that you then have one central place, and your data warehouse is the truth for all corporate data. After all, you have saved the state of a source system at any point in time, so you can time travel: "How did my report look on April 1, 2013?" The report will show the state of the data at that specific moment; subsequent changes have no effect. No more guessing whether changes in the data explain different figures in the reports (which happens if you do not save changes). It also has other advantages: you are auditable, you can generate new insights on old data, and you can analyze processes, for example when you have registered status changes (process mining).
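One common way to save those changes (a minimal sketch, not the author's design, with hypothetical sample rows) is to give every record a validity interval and filter on a reference date when reporting:

from datetime import date

# (business key, city attribute, valid_from, valid_to)
customer_history = [
    ("C-1", "Amsterdam", date(2012, 1, 1), date(2013, 6, 1)),
    ("C-1", "Utrecht",   date(2013, 6, 1), date(9999, 12, 31)),
]

def as_of(history, report_date):
    # Return the state of the data as it was on report_date (time travel).
    return [row for row in history if row[2] <= report_date < row[3]]

# "How did my report look on April 1, 2013?" -> the Amsterdam version of C-1
print(as_of(customer_history, date(2013, 4, 1)))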


How does Big Data fit into this?

But now the big question: how does big data fit into this story? That depends on what we mean by big data, and that is difficult. It has become such a huge container concept that it is not entirely clear what exactly big data is. Looking at the 3 Vs, we see it is a lot of data (volume), which is very diverse (variety) and which also arrives fast (velocity). If we assume an (enterprise) data warehouse with complex business rules and integration issues, real-time processing is not to be expected; it is simply not easy to achieve. Assuming that situation, where speed (real time) is less important, we are left with two aspects that may be important in a big data data warehouse: the amount of data and the variety.

For the amount of data, we could build a Hadoop solution that stores the unstructured data. Because there is no schema, you can store data quickly in a Hadoop cluster, and you can store all information without knowing the schema of the data. Imagine that your data structure changes: in an RDBMS you would have to drop and recreate the table, and that is not needed with Hadoop.

There is no schema on the data (we do not have data types such as varchar or integer). In short, when reading the data from the Hadoop cluster, we have to define a schema, and "defining a schema" can be interpreted very broadly. The script (mapper) determines what the schema is. For example, you can run a text ETL script that looks for patterns. Or, with Hive, which is part of the Hadoop ecosystem, you can put a schema on the data files in HDFS. This is also called "schema-on-read".
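The sketch below illustrates schema-on-read in plain Python (so not Hive itself, and the web-log content is made up): the stored data is just text without data types, and the reading script decides the schema. With Hive you would achieve something similar by defining an external table over the files in HDFS.

import csv
import io

# Raw, schema-less log lines as they would sit in the cluster.
raw = "2015-03-15;/index.html;200\n2015-03-15;/about.html;404\n"

# The schema only exists here, in the reader, not in the stored data.
schema = ["date", "page", "http_status"]

reader = csv.reader(io.StringIO(raw), delimiter=";")
records = [dict(zip(schema, row)) for row in reader]

# Data types are also applied at read time.
for rec in records:
    rec["http_status"] = int(rec["http_status"])

print(records)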


But let's focus on unstructured information such as email, reports and Word documents, or files that have a bit more structure such as error logs or web logs: basically, data without metadata. How can we analyze this? We can use text ETL, searching for patterns with the help of, for example, taxonomies or ontologies. Below is an example of a taxonomy:

Transport
  • Auto
    • Manufacturer
      • Honda
      • Fiat
      • Porsche
    • Type
      • SUV
      • Sedan
      • Station wagon
  • Aircraft
    • Manufacturer
      • Airbus
      • ....

Suppose you want to analyze this sentence:

"We drove and we passed a Porsche and a Volkswagen on the highway"

If we process this with the taxonomy, the following may come out:

"We drove and pased the Porsche / Manufacturer past a Volkswagen / Manufacturer on the highway / road"

Then we could save this information as name-value pairs:

Manufacturer: Porsche
Manufacturer: Volkswagen
Road: Highway

And in this way the data looks a bit more structured. This would be the moment to load it into the data warehouse and combine it with other structured data. Taxonomy matching is not the only way; there are many more techniques, such as name-value processing, homographic resolution or list processing. These are all techniques you can use to filter patterns out of the data (to give it a schema), as the small sketch below illustrates.
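A minimal sketch of the taxonomy matching described above, assuming a tiny hand-made taxonomy; real text ETL tooling is of course far richer.

# Hypothetical, hand-made taxonomy: term -> node in the taxonomy.
taxonomy = {
    "porsche": "Manufacturer",
    "volkswagen": "Manufacturer",
    "honda": "Manufacturer",
    "highway": "Road",
}

sentence = "We drove and we passed a Porsche and a Volkswagen on the highway"

# Emit a name-value pair for every word that matches a node in the taxonomy.
pairs = [(taxonomy[word.lower()], word) for word in sentence.split() if word.lower() in taxonomy]

for name, value in pairs:
    print(name + ": " + value)
# Manufacturer: Porsche
# Manufacturer: Volkswagen
# Road: highway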

An architecture that can handle this could look like the following:


The data from structured sources lands in a staging area that is structured, and the unstructured data lands in a data lake (Hadoop). The structured information is then loaded into an Enterprise Data Warehouse (EDW), processed and reported on with BI tooling. The unstructured data in the data lake is analyzed on an analytical platform with tooling such as R or the Power series from Microsoft; this is a more explorative process (discovery). The results are written back into the data lake, after which this data can be read into the EDW. The EDW in turn provides data to the analytical platform in order to enrich the unstructured data. So there are multiple loops in this architecture, and that is generally the case with big data architectures: it is a heterogeneous environment with various tools, each doing what it is good at.


Conclusions

So this is one way to reduce the amount of data and make it less "data obese". It is important that we find ways to extract information from the enormous piles of data, information that we can use on analytical platforms and that can add value to structured (enterprise) data warehouses.

Greetz,

Hennie 

Saturday, March 14, 2015

Open Virtualization Format Archive

I've downloaded a Hadoop distribution from Hortonworks, and it comes as an Open Virtualization Format archive. Open Virtualization Format (OVF) is an open standard for packaging and distributing virtual appliances to run in virtual machines.

This is how I loaded the OVF file into VirtualBox.

1. Start VirtualBox Manager

2. Click on File and Import Appliance.



3. Select the OVF file.



4. Adjust the settings.



5. Press Import and the import starts.


6. And the results.
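As an aside, the same import can also be scripted with VirtualBox's VBoxManage command line tool; a minimal sketch (assuming VBoxManage is on the PATH and using a hypothetical file name):

import subprocess

ovf_file = "Hortonworks_Sandbox.ovf"   # hypothetical name of the downloaded appliance

# "VBoxManage import <file>" does the same as File > Import Appliance in the GUI;
# add "--dry-run" to only show the settings it would apply.
subprocess.run(["VBoxManage", "import", ovf_file], check=True)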



Greetz,
Hennie