Posts tonen met het label Inmon. Alle posts tonen
Posts tonen met het label Inmon. Alle posts tonen

zondag 15 maart 2015

Big Data : Data Obese (Part I)

Introduction

We are buried under the amount of information that we collect. Especially the amount of data we refer to as big data information. This information is mostly unstructured or semi structured like JSON or XML. We need on an information diet. How do we handle this?

In addition, some companies (Cloudera)  has already announced that the end of the data warehouse will happen when Hadoop is full grown. I'm still not convinced, especially in the short and medium term. Hadoop is an eco system that is under development and it has lots of potential.



Why a (structured) data warehouse?

Why do we have a (structured) data warehouse? You'd almost forgotten in the Big Data violence that we hear and read daily. Let's go back to the basics. A data warehouse is the place where we can store all information of an organization (and beyond) in a structured way. Structure is important in this statement. This fact has many advantages. The data has been optimized for complex queries. It has been cleaned (ETL) and a labelled (metadata) such as: It is a varchar and it is always 50 characters wide.

Relieving a sourcesystem can be an important advantage too. For example, if there is no data warehouse then, each data mart or a brief report must have access to the source system. The one-time retrieval of the data in an isolated environment can thus relieve the production systems.


But the main advantage is the fact that a data warehouse can integrate data from different sources. Based on business keys used by the organization so you can integrate for example, sales, purchasing and production (vertical columns) again make horizontal. You collect and integrate all departments.

And, if you want you want do it completely right,  then you have to make sure that you save the history data from the source systems properly. Often people will think you mean the old data, but it is not. The point is that you save the changes of the data in the source systems. This has advantages, namely that you have the central place and your data warehouse is the truth for all corporate data. After all, you have saved the state of a source system at any point in time. So now  you can time travel, "How my report looked at the time April 1, 2013?". The report will show the position of the data at that specific time. Subsequent changes have no effect. No more guesswork by stating that changes in the data provide other figures in the reports (if you do not save changes). It also has other advantages namely that you are 'auditable', you can re-generate new insights on old data and you can analyze processes by example you have registered status changes (process mining).


How does Big data fit into this?

But now the big question : "how does big data fit in this story?". The question is what we mean by big data and that is difficult. It has become such a huge container concept that it is not entirely clear what exactly is big data. Looking at the 3 Vs will see it is a lot of data, which are very diverse and that also goes fast. Suppose we assume a (enterprise) data warehouse,  there are complex business rules and integration issues it is not expected that real-time processing is possible. This is simply not easy to achieve. Assuming this situation : speed (real time) is less important, we still hold about two aspects that may be important in a big data data warehouse: The amount of data and the variety.

For the amount of data, we could realize a Hadoop solution that stores the unstructured data. Because there is no schema, you can store data quickly in a Hadoop cluster. And, you can store all information without you know the schema of the data. Imagine that your datastructure  has changed, You have to drop and recreate the table in a RDBMS and that is not needed with Hadoop. 

There is no schema data (we do not have data types such as varchar or integer). In short, when reading the data from the Hadoop cluster, then we have to define a schema. Defining the schema can be defined very broadly. The script (mapper) determines what is the schema. For example, you can run the script Text ETL where you are looking for patterns. But for example with Hive, that is part of the ecosystem Hadoop make a schema on the data files in HDFS. This is also called "Schema-On-Read '.


But let's just focus on unstructured information such as email, reports, Word documents or files that have a bit more structure such as error logs or web logs. So basically data without metadata. How can we analyze this? We can use text ETL. You can search for patterns, for example, taxonomies or ontologies. Below is an example of a taxonomy:

Transport
  • Auto
    • Manufacture
      • Honda
      • Fiat
      • Porsche
    • Type
      • SUV
      • Sedan
      • Station
  • Aircraft
    • Manufactur
      • Airbus
      • ....

Suppose you want to analyze this sentence:

"We drove and we passed a Porsche and a Volkswagen on the highway"

If we handle this with a taxonomy then the following may come from:

"We drove and pased the Porsche / Manufacturer past a Volkswagen / Manufacturer on the highway / road"

Then we could save this information as a Name - Value pair:

Manufacturer: Porsche
Manufacturer: Volkswagen
Road: Highway

And in this way, it looks a bit more structured. And this would be the time to load it in the data warehouse and combine with other structured data. This is not the only way but there are many more. Eg Name-Value processing, or Homografic resolution, or example List processing. These are all techniques you can use to filter out patterns in the data (to schematizing).

An architecture that can handle this could be this :


The data lands from structured sources in stagingarea that is structured and the unstructured data lands in a data lake (Hadoop). Then the structured information is loaded into an Enterprise Data Warehouse (EDW), processed and reported by BI tooling. The unstructured data is stored in a data lake and analyzed in an analytical platform with all kinds of tooling such as R or with the power series of Microsoft. This is a more explorative process (discovery). The results are written back into the data lake, after which this data can be read again into the EDW. The EDW also provides display data to the analytical platform in order to enrich the unstructured data. So there are multiple loops in this architecture. And that is generally so with big data architectures. It is heterogeneous environment with various tooling that do what they are good at.


Conclusions

So this is a way to reduce the amount of data that make data less 'data obese'. It is important that we find ways to extract information from the enormous information piles that we can use in analytical platforms and can add value to structured (enterprise) data warehouses.

Greetz,

Hennie 

zondag 4 maart 2012

Four different datamodeling methods

Introduction
Most of the time, 3NF are datamodels that are used in transactional systems like an orderentry application. For datawarehousing we had until a couple of years ago the (historized) 3NF and the starschema. Since a couple of years there are some new kids in the datawarehouse town: Datavault modeling and Anchor modeling. These datamodels try to deal with the disadvantages of 3NF and starschemas.

We are starting to learn that the starschema is more an end user datamodel that is easily understood by end users. It's a high performance model. Datavault modeling and Anchor modeling are more focussed on storing the right information in such a way that you achieve agility, speed of development and auditability. There are more reasons but i don't want to elaborate too much.

At first i wanted to write down a successor of my first blogpost about Anchormodeling but soon the blogpost became more and more an overview of the four types of data modeling techniques. I hope to write more about anchormodeling in next blogposts.

In this exercise i'll use the 3NF datamodel of the AdventureWorks2008LT Database . This is a simplified database of the large SQL Server AdventureWorks 2008 database. You can find this LT database on the site Microsoft codeplex. There isn't a SQL Server 2012 version available yet, as far as i  know. I'll transform this 3NF datamodel into a Datavault model (As i've blogged earlier), a anchor model and a starschema model.

Adventureworks2008LT  (3NF)
Below a representation of the 3NF datamodel as i've analyzed from the Adventureworks2008LT database. There are a couple of transaction tables (SalesOrderHeader and SalesOrderdetail) present and some master data tables (Customer, Productdescription, productcategory, etc). At first sight there seems no reference data in the diagram.




Adventureworks2008LT (3NF) in more detail
Below the database in more detail. I've decided to leave the errorlog and buildversion out of the models.




The Datavault model
As mentioned in my earlier blogpost i've used Quipu as a tool to generate the datavault model from the Adventureworks2008LT database. This tool can be downloaded, for free. The datavault model that is generated by Quipu is a typically source datavault model. You can find more information about this discussion at the blog of Ronald Damhof.

Now, the model that is generated is shown in the diagram below. The blue tables are the links, the orange tables are hubs and the yellow ones are the satellites. I've organized the diagram in a way that it looks like as much as possible as the Adventureworks2008LT 3NF model.



At first sight, you'll see that the number of tables are more than the 3NF model. That  is because the relations have become a table (link), the busineskeys have become a seperate table (hub) and the descriptive fields are stored in another table (sat). In my opinion the main advantage of this model is that you can increase agility and speed of development by separating the relations, business keys and the descriptive fields. Auditability seems to me a derived advantage of the agility argument: a historized 3NF datawarehouse datamodel is also auditable. But because the datavault model is more flexible that the 3NF model you can achieve the auditabilty much easier and faster than the historized 3NF datamodel.

The Anchor model
Below you can see a first draft of the Anchor datamodel. In my former post i've briefly explained a piece of the anchor modeling technique. I'm currently investigating this modeling technique for usage in my projects. It seems an interesting technique. The agility is pushed to the extreme because every field in a source table is a table in the anchor model. You can imagine that this enhances maximum flexibility.


The Adventureworks2008LT 3NF model is exploded into a huge number of tables in the anchor model. A good naming convention is very very appropriate. Even better, a tool should be available to manage the tables. In a future post i'll show the tables that are created and i can tell you now that the 9 tables in the 3NF model are about 100 tables in the anchor model (depending of the number of attributes in a table in the source model).

The diagram is created on the site http://www.anchormodeling.com/ and although it's a great solution and it looks great there are some disadvantages. It took me quite some time to create the datamodel.

The starschema
Below an example is shown of a starschema based on the Adventureworks2008LT 3NF datamodel. There are a couple of directions possible depending on the business process you're supporting. I've chosen to model it like this but you could also combine the SalesOrderHeader and the SalesOrderDetail in one fact. Or you could remove the snowflake table Category and move it to the fact. So depending on your (future?) business question you determine the starschema model.

In this example i've decided to split the SalesOrderHeader and SalesOrderDetail table in two facts. The SalesOrderHeaderDD is the link between the two facts. There are a couple of  difficult relations in the AdventureWorks2008LT 3NF model:
  • The relation product-Category.
  • The n:m relation between the productmodel and the Productdescription entity.
  • The relation between customer and address (CustomerAddress).


I've decided not to model the Customeraddress in the model. I've also decided to model the other difficult relations in the same way as the 3NF datamodel, resulting in a snowflake relation and a bridge table solution. Off course there are other decisions possible.

Conclusion
First, this blogpost started in my mind with a successor of the anchormodeling blogpost but later on i've decided to rewrite the blogpost to a overview blogpost of the four datamodels i'm interested in. I hope it gives an overview of the models.

So in my opinion we have now the following options to model a datawarehouse:

  • Historized 3NF model.
  • Datavault model.
  • Anchormodel.
  • Starschema.

From these options it seems that for a data storage foundation the best solution is datavault model or the anchor model. For enduser reporting and cube design the starmodel seems the most optimal solution. The main problem of the historized 3NF model is that the cascading changes that can happen and therefore this model is not the best option (in my opinion).

Greetz,

Hennie

donderdag 21 oktober 2010

Kimball vs Inmon (Part II) : it's now scientific.

In an earlier post i've written about Inmon vs Kimball, according to Inmon himself. The conclusions in my post are underwritten by a scientific research by Watson & Ariyachandra (2005).  In their research document "Types of DW architectures and factors influencing their selection" they are trying to answer the following questions:
  1. What factors lead companies to select a particular architecture and
  2. How successful are the various architectures? 
In this well described and easy reading document there are 5 architectures identified after interviews with experts: independent data marts, bus architecture (kimball), hub and spoke (inmon), centralized and federated. Datavault is an variation of the hub and spoke model of Inmon and this isn't discussed in the document. I refer to the document for more detailed explanation about these various datawarehouse architectures. After the researchers have identified the architectured about 500 managers, users, etc of datawarehouse were surveyed so this is quite a thorough research project.

So below i've pointed the most interesting parts of this document:
  • 39%  of the respondents uses the hub and spoke model (Inmon).
  • 26 % of the respondents have built and maintains the busarchitecture (Kimball).
  • The hub and spokearchitecture requires the most time to roll out with an average of 11 months. 
  • The bus architecture takes about 9 months to develop and deploy.
  • The hub and spoke and the busarchitecture are equally succesful.
  • One third of the company's switched  the architecture. 
  • 30% procent of the switchers switched from the hub and spoke to the bus architecture (AND back!). This is awkward! The supporters of the different architectures talked about the failures of the other(!). Perhaps the made initially the wrong choice for a architecture, based on the wrong assumptions?
  • The hub and spoke model has higher technical issues and needed more expert influence.
  • There is not a big difference between the hub and spoke model and the centralized datawarehouse.

    Conclusion
       When to choose the busarchitecture (Kimball):
      • When the need for a datawarehouse is high (and you want quick results).
      • High need of information flow between organizational units. I don't understand this. Perhaps it 's meant as the unstructured version of the dataintegration of the hub and spoke architecture?
      • You want to build a silod system/ departmental datawarehouse (from my earlier post).
      • When the information requests don't change that much (from my earlier post).
      • The view of the ware house prior to implementation is more limited in scope.

         When to choose the hub and spoke architecture (Inmon):
        • When there is a high need of integration between organizational units.
        • The datawarehouse is viewed as strategic.
        • Percieved ability of the in-house staff is high.
        • When the information needs are not quite clear (from my earlier post).
        • When the information request are varied (from my earlier post).
        • when information requests are very ad hoc (from my earlier post).
        • The 'system of record' principle is important (from my earlier post).
        • When there is (more) time to build the datawarehouse.

          Thats it for now....

          Hennie

            vrijdag 13 augustus 2010

            Inmon vs Kimball according to....Inmon himself

            What the... a couple of weeks ago Inmon produced an article on his blog. A smile appearred on my face. Inmon talking about the comparison between Kimball and Inmon. Interresting. Let's see what he has to say. He starts a bit grumpy and i had a deja vu remembering him presenting at the BI event where i already had the impression that he was a bit grumpy. But ok....Let's see what he has to say...

            So first he explains the two different systems, first Kimball and then Inmon. In the part about Kimball he explains that Kimball "in order to resolve differences of granularity between fact tables, conformed dimensions are used.". Hmm i never thought about that, in that manner. In my view conformed dimensions are used for getting an integrated view between fact tables and so you can use the conformed dimensions for 'hopping' between business processes and combine the measures from this different facts. So this way you can consolidate your datawarehouse. It's not quite clear what he meant by this statement. What are we trying to resolve with conformed dimensions, concerning granularity? When building an integrated view of facts with different granularities still needs some additional logic. When one fact is connected to one dimensionrecord, another fact record can connect to two dimension records. From my viewpoint, conformed dimensions are not a solution for resolving granularities between facts. They can aid building integrated views, like dashboards, etc.

            Inmon explains some stuff about his metholodogy, which i agree most of it. The only point i want to make is that Datavault by Linstedt is DW2.0 proof and could be a better alternative than the relational approach of Inmon, but that's my opinion.

            Okay, the bulletpoints (of Inmon):
            • The long-term or short-term nature of the solution. If what you want is speed of development and good short term results, Kimball is the way to go. If you want a long-term architectural solution, then the Inmon approach is the way to go (Comment: Agree)
            • If you want a tactical solution, then Kimball is the way to go. If you want a strategic solution, then Inmon is the way to go (Comment: Agree)
            • The Kimball approach is very fragile. Whenever requirements undergo significant change, the Kimball solution is “brittle.” (Comment: Agree).
            • In the Kimball approach, there is no single version of the truth, no system of record. In the Inmon approach, at the integrated granular level, there is a carefully crafted single version of the truth (Comment: Interresting. In the projects 've done building kimball datawarehouses we always try to build a clean ETL and what i mean by this, at first i let the garbage in the data and showed the data quality in the cubes (building a cube is easy when you have a star created already). After that i started to squeze the bad data by building error marts. So there is not really a base where the data is collected in a consolidated system. So, he is right about this).
            • The Kimball approach produces results in a short amount of time. The Inmon approach produces results over a longer period of time.(Comment: Yup, Kimball is quickstarter for enterprise datawarehouses but Inmon will take the lead later in the project. on the other hand for implementing a Inmon like datawarehouse it's more likely to have more support from the higher management. With Kimball it's easier to implement the 'Think big, start small' approach)
            • The Kimball architecture does not produce a foundation that can be built on over time as requirements change (Comment: agree with that)
            • The Kimball architecture focuses on a narrow set of infrastructure components such as tables and indexes. The Inmon architecture is far more robust than the Kimball architecture including such things as unstructured, textual data, near line storage, archival storage and processing the tight integration of metadata into the infrastructure (Comment: Kimball approach can be built upon unstructured data, but he hasn't or doesn't want to incorporate that in his methodology. This is a less important argument)
            So, what can we learn about this? So, here's my advice:
            1. Choose Inmon when the information needs are not quite clear, when the information request are varied, when information requests are very ad hoc, the 'system of record' principle is important, when the datawarehouse is very strategic to a organisation or when there is time to build a consolidation layer.
            2. Choose Kimball when you built a datawarehouse as a silod system, when the information requests won't change that very much, when the datawarehouse is not very strategic but more tactical (departmental) or when you want to show quick results.
            Greetz!
             
            Hennie

            donderdag 31 december 2009

            How to ETL a dimension in a datawarehouse (part I)

            Today I want to discuss some different methods of extract, transform and loading (ETL) a dimension into a datawarehouse. With ETLing a dimension into a datawarehouse i mean loading (ETL) the information from a source system (OLTP) into dimensiontable which is part of a star model. There are several approaches of doing this. I will discuss the different ways which you can use to load a dimension.

            I think there are two methods of ETLing a datawarehouse. Some guys from SQLBI wrote some articles on how they perform their loading method (SQLBI articles). They pointed me on a (obvious) loading method for small sized systems. So as I said earlier there are two general loading methods for a dimension:

            • ‘ Loading the datawarehouse on the fly’. You can discuss whether this fits into the definition of datawarehouse (non volatile? blabla). But this approach can be used when the data amount is small or at least loadable during the night. I heard that some Inmon minded datawarehouses generate star models on the fly from a (relational) datawarehouse. But, I haven’t seen this working (meaning it could be there but haven’t seen it yet).

            • Keeping record of earlier loaded information in your datawarehouse. Meaning that a process of loading the information into a dimensiontable, keeps track of whether the information hasn’t loaded earlier. If the record is loaded earlier, don’t process further. If not loaded before load (new, changed) the record into your dimensiontable. And, if deleted mark the specific record deleted.

            What do want to store in your dimension?

            • a new record in the source system (table, view, etc).

            • a changed record in the source system (table, view, etc).

            • a deleted record in the source system (table, view, etc).

            The first 2 options of changes (new and changed) in the source are quite non discussable, but detecting a deleted record in the source is more difficult, because it requires an another approach than the first two options. The first two options is simply comparing the incoming records in the datawarehouse against the already loaded information, and then you have to decide what you want to do with this (Update (Type I) or Insert (TYPEII)). But, detection of deleted record is something different.

            How do detect a deleted record in the source system. There are two options I can think of:

            • Using trigger (CDC or alike) which fires an event when the record is deleted in the source.

            • Or detect the deleted records by offering the complete dataset to the datawarehouse for every ETL.

            The first method could be named ‘Pushing the deleted record in the datawarehouse’ and the other one is ‘pulling the deleted records into a datawarehouse’. Both has their advantages and disadvantages. I won’t discuss this here.

            So this is it for now. A blog shouldn’t be too lengthly… Next part laters.