Posts tonen met het label Linstedt. Alle posts tonen
Posts tonen met het label Linstedt. Alle posts tonen

zondag 15 maart 2015

Big Data : Data Obese (Part I)

Introduction

We are buried under the amount of information that we collect. Especially the amount of data we refer to as big data information. This information is mostly unstructured or semi structured like JSON or XML. We need on an information diet. How do we handle this?

In addition, some companies (Cloudera)  has already announced that the end of the data warehouse will happen when Hadoop is full grown. I'm still not convinced, especially in the short and medium term. Hadoop is an eco system that is under development and it has lots of potential.



Why a (structured) data warehouse?

Why do we have a (structured) data warehouse? You'd almost forgotten in the Big Data violence that we hear and read daily. Let's go back to the basics. A data warehouse is the place where we can store all information of an organization (and beyond) in a structured way. Structure is important in this statement. This fact has many advantages. The data has been optimized for complex queries. It has been cleaned (ETL) and a labelled (metadata) such as: It is a varchar and it is always 50 characters wide.

Relieving a sourcesystem can be an important advantage too. For example, if there is no data warehouse then, each data mart or a brief report must have access to the source system. The one-time retrieval of the data in an isolated environment can thus relieve the production systems.


But the main advantage is the fact that a data warehouse can integrate data from different sources. Based on business keys used by the organization so you can integrate for example, sales, purchasing and production (vertical columns) again make horizontal. You collect and integrate all departments.

And, if you want you want do it completely right,  then you have to make sure that you save the history data from the source systems properly. Often people will think you mean the old data, but it is not. The point is that you save the changes of the data in the source systems. This has advantages, namely that you have the central place and your data warehouse is the truth for all corporate data. After all, you have saved the state of a source system at any point in time. So now  you can time travel, "How my report looked at the time April 1, 2013?". The report will show the position of the data at that specific time. Subsequent changes have no effect. No more guesswork by stating that changes in the data provide other figures in the reports (if you do not save changes). It also has other advantages namely that you are 'auditable', you can re-generate new insights on old data and you can analyze processes by example you have registered status changes (process mining).


How does Big data fit into this?

But now the big question : "how does big data fit in this story?". The question is what we mean by big data and that is difficult. It has become such a huge container concept that it is not entirely clear what exactly is big data. Looking at the 3 Vs will see it is a lot of data, which are very diverse and that also goes fast. Suppose we assume a (enterprise) data warehouse,  there are complex business rules and integration issues it is not expected that real-time processing is possible. This is simply not easy to achieve. Assuming this situation : speed (real time) is less important, we still hold about two aspects that may be important in a big data data warehouse: The amount of data and the variety.

For the amount of data, we could realize a Hadoop solution that stores the unstructured data. Because there is no schema, you can store data quickly in a Hadoop cluster. And, you can store all information without you know the schema of the data. Imagine that your datastructure  has changed, You have to drop and recreate the table in a RDBMS and that is not needed with Hadoop. 

There is no schema data (we do not have data types such as varchar or integer). In short, when reading the data from the Hadoop cluster, then we have to define a schema. Defining the schema can be defined very broadly. The script (mapper) determines what is the schema. For example, you can run the script Text ETL where you are looking for patterns. But for example with Hive, that is part of the ecosystem Hadoop make a schema on the data files in HDFS. This is also called "Schema-On-Read '.


But let's just focus on unstructured information such as email, reports, Word documents or files that have a bit more structure such as error logs or web logs. So basically data without metadata. How can we analyze this? We can use text ETL. You can search for patterns, for example, taxonomies or ontologies. Below is an example of a taxonomy:

Transport
  • Auto
    • Manufacture
      • Honda
      • Fiat
      • Porsche
    • Type
      • SUV
      • Sedan
      • Station
  • Aircraft
    • Manufactur
      • Airbus
      • ....

Suppose you want to analyze this sentence:

"We drove and we passed a Porsche and a Volkswagen on the highway"

If we handle this with a taxonomy then the following may come from:

"We drove and pased the Porsche / Manufacturer past a Volkswagen / Manufacturer on the highway / road"

Then we could save this information as a Name - Value pair:

Manufacturer: Porsche
Manufacturer: Volkswagen
Road: Highway

And in this way, it looks a bit more structured. And this would be the time to load it in the data warehouse and combine with other structured data. This is not the only way but there are many more. Eg Name-Value processing, or Homografic resolution, or example List processing. These are all techniques you can use to filter out patterns in the data (to schematizing).

An architecture that can handle this could be this :


The data lands from structured sources in stagingarea that is structured and the unstructured data lands in a data lake (Hadoop). Then the structured information is loaded into an Enterprise Data Warehouse (EDW), processed and reported by BI tooling. The unstructured data is stored in a data lake and analyzed in an analytical platform with all kinds of tooling such as R or with the power series of Microsoft. This is a more explorative process (discovery). The results are written back into the data lake, after which this data can be read again into the EDW. The EDW also provides display data to the analytical platform in order to enrich the unstructured data. So there are multiple loops in this architecture. And that is generally so with big data architectures. It is heterogeneous environment with various tooling that do what they are good at.


Conclusions

So this is a way to reduce the amount of data that make data less 'data obese'. It is important that we find ways to extract information from the enormous information piles that we can use in analytical platforms and can add value to structured (enterprise) data warehouses.

Greetz,

Hennie 

zondag 25 mei 2014

Datavault : Kanban in a Datavault project (part III)

Introduction

I've finished my Lean Six Sigma Orange Belt certification a couple of months ago. For this certification I needed a case study that would justify a certification. I decided to use my project as a base for my certification. In my current project we are developing a analytic platform for analysts with Datavault. We already used a kanbanboard but I thought that some improvements was possible.

We encountered some throughput problems of the implmentations and I decided to investigate our development proces and thought about what could aid speeding up the process. The teammembers discussed our method and we tried to adopt the various techniques of Lean Six Sigma.

One of the subject areas of Lean Six Sigma is Kanban and Kanban can be very useful for making your processes visible and identifying possible bottlenecks in the process. Read more about Kanban on Wikipedia and this free book about Kanban and Scrum. These are very learnful.

This is the approach I have taken (with the development team) in order to develop a (new) Kanban board:
  • Identifying the steps we take in the development process?
  • What do we do when we build Datavault implementations (the objects) ?
  • Who are the customers and who are the suppliers in the process?
    • Supplier, Input, Process, output and Customer  (SIPOC).
  • What are the criteria for moving a card from one phase to another?

This blogpost is one in a series of blogposts about datavault:


Task analysis

At first We decided to take a look at our activities in the project. What are the activities in the project? Well, We came up with this list:
  • Analysis (gaining knowledge)
    • Deskresearch
    • Data-analysis
    • interviews
  • Design
    • Datamodeling (Conceptual, logical, SourceVault and  BusinessVault)
    • ETL
  • Development (DDL + ETL Packages)
    • StagingIn
      • Tables
    • SourceVault
      • Hubs
      • Links
      • Sats
      • EndSats (Effectivity)
      • Refs
      • Other DV Tables
  • Build (DDL + ETL packages)
    • BusinessVault
      • Hubs
      • Links
      • Sats
      • EndSats
      • Refs
      • Other DV Tables
  • BusinessAccessLayer
    • Development (ETL packages)
    • Entities
  • ProcessFlows
    • MasterProcess flows (PCF and MPCF)
    • Input file processing (IFP)
    • StagingIn
    • SourceVault
    • BusinessVault
  • Initial loads (This is for going back in time and resample (historize) the data).
  • ProcessManagement
    • Configuration
  • Test
    • Systemtest
    • Acceptancetest
  • Documentation
    • (Conceptual Datamodel)
    • SourceVault Datamodel
    • BusinessVault Datamodel
    • Functional and Technical descriptions
    • SSIS packages
  • Additional
    • Supply contracts with the datasuppliers

SIPOC

The next step was investigating our working process. What are our phases we actually executing during development of a deliverable? And what were the criteria for moving from one phase to another?


The next step was studying the SIPOC (supplier, Input, Process, Output and Customer). With SIPOC you have a tool that help you to improve a process. This is mostly used during the Define phase int he DMAIC process.


I noticed that there are two kinds of output in a process: Work and Criteria. The criteria says something about go/no go decisions and work is about delivering value to the valuestream, for instance a Datavault model.

The Kanban board

The next step we took was designing a new Kanban board, based on the taskanalysis and the SIPOC schema. I made some designs of the board and We had quite some discussions about the board. The discussion was about finding a balance between not too simple but also not too complicated. If it was too simple it will not help us understanding the process and if it was too difficult people wouldn't accept the board. We finally made this board:


Phases
We have defined the columns as identified (more or less) in the SIPOC schema during the Define process : Analysis/design, Develop, build, test, QA, Ready for production and Production.

Lanes
We introduced two kinds of lanes: Projectlanes and fastlanes. Fast lanes were intended for high priority issues that had a higher priority than the projects that we were running.

Definition of Done (DoD)
And the last part that we added to the board was a Definition of Done row. This DoD were the criteria in the SIPOC schema. These criteria helped us describing the DoD. This helped us agreeing about when a card should be moved from one column to the other, as a group.

The result

Finally, we created a new kanban board. This is shown below.


Conclusion

Kanban can be very useful in a project with a number of teammembers. Developers have to tell something about what they are doing, the impediments they have and what they are planning to do. A Kanbanboard can be very effective when all members of team are working with one goal: deliver value within a certain period of time. I think it's less effective when all members are working on their own deliverable and with their own deadline. When teammembers are working on a collaborative goal more co-operation  and more synergy effects will happen

Greetz,
Hennie

zondag 4 maart 2012

Four different datamodeling methods

Introduction
Most of the time, 3NF are datamodels that are used in transactional systems like an orderentry application. For datawarehousing we had until a couple of years ago the (historized) 3NF and the starschema. Since a couple of years there are some new kids in the datawarehouse town: Datavault modeling and Anchor modeling. These datamodels try to deal with the disadvantages of 3NF and starschemas.

We are starting to learn that the starschema is more an end user datamodel that is easily understood by end users. It's a high performance model. Datavault modeling and Anchor modeling are more focussed on storing the right information in such a way that you achieve agility, speed of development and auditability. There are more reasons but i don't want to elaborate too much.

At first i wanted to write down a successor of my first blogpost about Anchormodeling but soon the blogpost became more and more an overview of the four types of data modeling techniques. I hope to write more about anchormodeling in next blogposts.

In this exercise i'll use the 3NF datamodel of the AdventureWorks2008LT Database . This is a simplified database of the large SQL Server AdventureWorks 2008 database. You can find this LT database on the site Microsoft codeplex. There isn't a SQL Server 2012 version available yet, as far as i  know. I'll transform this 3NF datamodel into a Datavault model (As i've blogged earlier), a anchor model and a starschema model.

Adventureworks2008LT  (3NF)
Below a representation of the 3NF datamodel as i've analyzed from the Adventureworks2008LT database. There are a couple of transaction tables (SalesOrderHeader and SalesOrderdetail) present and some master data tables (Customer, Productdescription, productcategory, etc). At first sight there seems no reference data in the diagram.




Adventureworks2008LT (3NF) in more detail
Below the database in more detail. I've decided to leave the errorlog and buildversion out of the models.




The Datavault model
As mentioned in my earlier blogpost i've used Quipu as a tool to generate the datavault model from the Adventureworks2008LT database. This tool can be downloaded, for free. The datavault model that is generated by Quipu is a typically source datavault model. You can find more information about this discussion at the blog of Ronald Damhof.

Now, the model that is generated is shown in the diagram below. The blue tables are the links, the orange tables are hubs and the yellow ones are the satellites. I've organized the diagram in a way that it looks like as much as possible as the Adventureworks2008LT 3NF model.



At first sight, you'll see that the number of tables are more than the 3NF model. That  is because the relations have become a table (link), the busineskeys have become a seperate table (hub) and the descriptive fields are stored in another table (sat). In my opinion the main advantage of this model is that you can increase agility and speed of development by separating the relations, business keys and the descriptive fields. Auditability seems to me a derived advantage of the agility argument: a historized 3NF datawarehouse datamodel is also auditable. But because the datavault model is more flexible that the 3NF model you can achieve the auditabilty much easier and faster than the historized 3NF datamodel.

The Anchor model
Below you can see a first draft of the Anchor datamodel. In my former post i've briefly explained a piece of the anchor modeling technique. I'm currently investigating this modeling technique for usage in my projects. It seems an interesting technique. The agility is pushed to the extreme because every field in a source table is a table in the anchor model. You can imagine that this enhances maximum flexibility.


The Adventureworks2008LT 3NF model is exploded into a huge number of tables in the anchor model. A good naming convention is very very appropriate. Even better, a tool should be available to manage the tables. In a future post i'll show the tables that are created and i can tell you now that the 9 tables in the 3NF model are about 100 tables in the anchor model (depending of the number of attributes in a table in the source model).

The diagram is created on the site http://www.anchormodeling.com/ and although it's a great solution and it looks great there are some disadvantages. It took me quite some time to create the datamodel.

The starschema
Below an example is shown of a starschema based on the Adventureworks2008LT 3NF datamodel. There are a couple of directions possible depending on the business process you're supporting. I've chosen to model it like this but you could also combine the SalesOrderHeader and the SalesOrderDetail in one fact. Or you could remove the snowflake table Category and move it to the fact. So depending on your (future?) business question you determine the starschema model.

In this example i've decided to split the SalesOrderHeader and SalesOrderDetail table in two facts. The SalesOrderHeaderDD is the link between the two facts. There are a couple of  difficult relations in the AdventureWorks2008LT 3NF model:
  • The relation product-Category.
  • The n:m relation between the productmodel and the Productdescription entity.
  • The relation between customer and address (CustomerAddress).


I've decided not to model the Customeraddress in the model. I've also decided to model the other difficult relations in the same way as the 3NF datamodel, resulting in a snowflake relation and a bridge table solution. Off course there are other decisions possible.

Conclusion
First, this blogpost started in my mind with a successor of the anchormodeling blogpost but later on i've decided to rewrite the blogpost to a overview blogpost of the four datamodels i'm interested in. I hope it gives an overview of the models.

So in my opinion we have now the following options to model a datawarehouse:

  • Historized 3NF model.
  • Datavault model.
  • Anchormodel.
  • Starschema.

From these options it seems that for a data storage foundation the best solution is datavault model or the anchor model. For enduser reporting and cube design the starmodel seems the most optimal solution. The main problem of the historized 3NF model is that the cascading changes that can happen and therefore this model is not the best option (in my opinion).

Greetz,

Hennie

dinsdag 12 april 2011

Why datavault will diminish the Kimball architectures.

Hi, in 2010 i started to collect more in-depth knowledge of the datavault modeling. Most of the Datavault folks know the TDAN articles of Dan Linstedt,  Dan blogs, the linkedin datavault discussions and Dan wrote a technical book about Datavault ("Supercharge your datawarehouse"). This year i succeeded for the Datavault certification of the Genessee academy. Still, i have to unreveal many areas of the datavault modeling (and anchor modeling) for a thorough understanding of the datavault.

After years and years building Kimball datawarehouses i'm convinced that the materialized star models will be more and more obsolete in the future due to hardware improvements like SSD's and Fusion techniques. In my opinion the only business case for (materialized) star modeling will be simplicity of the model (and perhaps the technical business case: it's easier to build cubes on pure star models (with dummy records)).

Because of the Self Service BI movement users will be enabled by the user friendiness tooling building easier and faster reports based on products like clickview and the new tool of Microsoft codename "Cresent" (and offcourse other SSBI Tooling). Building (easy) reports will be less and less important for the BI consultant. The easy- and medior reports will be build by the end (power) users. There are three reasons when the BI Consultant comes in to play and that is that reports needs scalability (enterprise wide), needs standardization or/and needs some certain amount of skills (the three S-esses). As the Kimball star models are positioned at the user side, less and less usage of the kimball datawarehouses will be inevitable. The end user BI tooling will be very powerful in such a way that sometimes a Kimball warehouses is not needed.

The main information delivery of a enterprise (and beyond) will be focussed on integrating data in a flexible way. (3NF) datamodels and starmodels were inflexible when changes were needed, rapidly. Changing the models caused sometimes errors, problems at the front end application, other databases that depended it on the datamodel, etc. With the datavault model more flexibility and agility is achieved. The datavault model is more error proof and will be the base or storing your enterprise data. Because of the principle "store facts not truth (one version...)" ALL data is loaded in the datavault resulting a auditable architecture because whenever the source changes KPI's can be recalculated in history(Type II principle).

The next step is building the starmodels, not physically but virtually. Starting point should be building them virtually unless the performance is bad. This way you are very flexible in building your starmodels. Perhaps a new manager wants another KPI or wants another calculation it's possible to recalculate this in the past (because you have kept all the information and used a Type II approach).

What do you think? Let me know.

Hennie

dinsdag 18 januari 2011

Linstedts videos and his first bookchapters about Datavault

On Learndatavault.com it's possible to download some chapters about Dan Linstedts new book. It's also possible to see some videos about the datavault method and modeling. In this post I'll outline the most intersting subjects that he introduced or are new (in my opinion).

Ontology
One interesting issue is that Linstedts introduces ontologies. Wikipedia defines Ontology "as the philosophical study of the nature of being, existence or reality as such, as well as the basic categories of being and their relations. [...]Ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences". Initially i didn't quite understand why Linstedt uses this terminology, but when i visited Wikipedia and saw this picture, a resemblance with DV is there.
Linstedts says that learning warehousing, applying and using ontologies is a critical succes factor for handling, managing and applying unstructured data to a structured datawarehouse. I asked him some questions about ontologies:
  1. Is a ontology model a logical model? And DV a physical model?
  2. What does the ontology model add more than the existing model techniques?
  3. Why did you mention that ontology is a critical succes factor for building a datawarehouse in your first chapter?
Here are the answers from Dan:
1)  "In a way, yes, I see different ontologies as different logical models, the DV is “most often” a Physical model. The ontology model adds understanding of the terminology. It provides IT with a way to communicate the model to business users, and to get “order of importance” from business users. You see, business users fight over the definitions of their terms, not just because they USE them differently, but also because they each have a “different ontology” in their heads to represent the data. If you can show a common consistent reference model (ontology), then the level of understanding by business users greatly increases, and there is less arguing about how to define master data. Just remember: there are many different ontologies that can be applied to the same physical model."

2) "I relate ontologies/taxonomies to the Terminology or metadata we use to build data models. If you understand your business terms, or concepts, you can use those to create the hierarchies based on Importance and Classification. These metadata (business terms) are often what we call a logical model when we build a data model. On the other hand, since the Data Vault is based on Business Keys for a start, you can more effectively get the business keys for each conceptual layer in the hierarchy that you build."

3) "I say critical success factor because of the impact it has on communication between IT and business stakeholders. If the stakeholders understand WHAT your building, they can more easily buy-in. They will feel more comfortable, and will usually no longer care HOW it’s built (ie: DV model) – as long as you’re flexible in the near future. Using an ontological representation of the model helps IT with transparency of the DV project. it also can serve as a modeling guide for what data is available to reports and so on. I’ll write up an example in the near future, as that would make another great video.
I did not mean to state that a DW project will fail without ontologies, maybe I used the wrong term to describe them. What I really meant is what I stated above, an increase in the chance of success if your money holders understand what your building, and you give them a way to think about it."
As i already said there is a resemblance between ontologies/taxanomies and datavault, but what i would like to see some real world examples how these two models are related.

The steps to be taken to build succesful datavault models
In Linstedts videos he explains the following steps that need to be taken when you are developing a Datavault model:
  1. Identify the hubs. The hubs are used for tracking and identifying key information
  2. Identify the master data setup. Linstedt says that you have to discover the hierarchy in your business keys with an ontology in order to learn the dependiencies of the dataset. There are a lot of views of ontologies of the business possible.
  3. Identify transactions and relationships to model link structures. Links are associations or transactions.
  4. Identify and model your satellites based on frequency of data change.
This is in contrast with the TDAN articles of Dan Linstedt, published earlier :
  • Model the hubs
  • Model the links
  • Model satellites
  • Remodel the satellites (Monster/mini)
It seems there is shuffle between the steps. The interesting step that catches my eyes is step 2 in the new program (identify the master setup). This issue has a relation with the first point in this post: ontology. Step 2 is about ontologies (Master data and their relations to each other).

MetaFields
From what i've seen from the videos there are a couple of metafields available in every table. As it seems there are some little changes in the way Datavault deals with metafields. The standard as 'defined' (where?) until the videos of Dan Linstedt was that there were a couple of metafields available:
  • Load date source (LDTS).
  • Record source (RSRC).
As it seems now Dan added a new metafield : Load Enddate Datasource (LEDTS). This fields ends the validity of a record. It means that a new record is added to the table and the old record is obsolete. In a discussion i've read there are also more meta fields discussed in the courses of Datavault (hope to join one very soon) and these are:
  • Last seen datetime. This would indicate whether a source are 'drying up'. Not sure what this means exactly. Hope to find out soon.
  • Event datime. This is the data of the actual change in the source (different from LDTS).
In the datawarehouses i've developed so far, i've add the following metafields (and some not):
  • Creationdate (same as LDTS).
  • Expiration date (same as LEDTS).
  • Deletiondate (only possible with CDC/triggers or full load).
  • Current Yes/No.
  • Versionnumber (this indicates whther how much times a satellite changed).
Just some thoughts...

Greetz,
Hennie

zondag 3 oktober 2010

Informing you about the Dan linstedt session in the Netherlands

Hi,

This week Dan Linstedt visited the Netherlands for a session about Datavault. The title of the seminar was about best practices  in datavault. There were sessions about implementation about Tele2 and Vektis for showing how Logica implemented the datavault at some customers.

Below i bulleted the most significant subjects of the evening (to my viewpoint):
  • There's 'no version of the truth' but a 'single version of the fact'. When you think about it, it makes sense to me. What he meant is that once there is a agreement about some KPI's and the system is implemented, some other managers, which didn't participate in the requirementsgroup, could have another viewpoint on a calculation of a KPI. Or perhaps a new manager want to measure KPI's in another way. Could be a point. But i see also some difficulties here:
    • When the calculation of KPI changes to much, it's difficult to compare figures over a longer time. How do you handle this?
    • I blogged about Self Service BI and i indicated that there will companywide KPI's but there will be also specific manager KPI's. How do you deal with this? Do you change the companywide KPI when a new manager enters the town? No, as in my former post said, you have to put a organisational unit in place: Governance committee. This group needs to decide about this stuff.
  • Dan showed the following graph of a classic datawarehouse system. The problem with this kind of architecture that it's not auditable. The data is changed when the data enters the datawarehouse. I've done projects where the ETL developer got the blame because he didn't implemented the KPI in a proper manner. Who is the owner of the data? The business! Why should a ETL developer change the data? Hmm interesting viewpoint, but you could also say that the business didn't test the solution rightly. I know that testing is always a problem because of timeconstraints or less involvement of the business, etc.
  •  Dan showed another picture, as you can see below. In this picture the rules are propagated later in the architecture. This way the system is more auditable, you can change the KPI and recalculate it over you history. A next step is to implement a business vault in this model (not shown) which is a view over the datavault. This could be any datamodel that fits the business (e.g. Datavault modelling, starmodelling, etc).

  • The physical model will be less important in the future. This point is not quite clear to me, yet. I think it has to do with cloudcomputing. You define a model and you don't care about the physical implementation of it. 
  • He also phrased in the same part that columnar MPP will be more important in the future. Agree with that.
  • Data as a Service is growing and will take firm ground. Where the data is coming from will be less important. It could be that the master data will come from datasuppliers and that the datawarehouse in a company will be factgenerators. The dimensions are master data.
  • Operational datawarehousing is also taking firm ground.
  • Dan spoke also about dynamic datawarehouse. This can also be seen in the light of the agility developments. The datawarehouse will be more organic, it will change according to the developments in the business or communities. How? Good question. But i think that datavault is a organic like modelling technique and anchor modelling is even more flexible. This will enhance the dynamic datawarehousing! 

Regards,
Hennie