Tuesday 22 April 2014

Datavault: Adaptability of the Datavault model (Part II)

Introduction

This is the second blogpost in a series about Datavault. The first blogpost covered the basics of Datavault; in this one I would like to discuss the adaptability of the Datavault model. Why is Datavault an adaptive method for designing your data warehouse? I think there are three reasons:
  1. A fundamental reason: separating the keys, the relationships and the descriptive data into separate tables enhances the flexibility and agility of the model.
  2. You can easily add Satellites (within a Concept) to the model without breaking the existing model.
  3. You can easily add Links (a new Concept or a new Concept Constellation) to the model without altering the existing model.

In this blogpost I'll discuss the adaptive characteristics of the Datavault model in more depth.

Adding Satellites

A satellite doesn't contain business keys. It stores (all) descriptive information about the business key (hub) and historizes that information. It inherits the surrogate key of the hub (or link). In order to historize, the primary key is extended with a date/time stamp. In this way the satellite acts like a Type II dimension (SCD2): it stores every change that happens in the source.
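The structure described above can be sketched with an in-memory SQLite database. This is only a minimal illustration: the table and column names (hub_customer, sat_customer, customer_sk, load_dts) and the sample rows are my own assumptions, not a prescribed Datavault naming standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per business key, plus a surrogate key (names are hypothetical).
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_bk TEXT NOT NULL UNIQUE
);
-- Satellite: no business key, only the inherited surrogate key.
-- The primary key is extended with a load date/time stamp to keep history.
-- REFERENCES documents the relationship; SQLite only enforces it
-- when PRAGMA foreign_keys = ON is set.
CREATE TABLE sat_customer (
    customer_sk INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
    load_dts    TEXT NOT NULL,
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (customer_sk, load_dts)
);
""")

conn.execute("INSERT INTO hub_customer VALUES (1, 'C-1001')")
# Two versions of the same customer: a change in the source becomes a new row.
conn.execute(
    "INSERT INTO sat_customer VALUES (1, '2014-04-01T00:00:00', 'Jansen', 'Utrecht')"
)
conn.execute(
    "INSERT INTO sat_customer VALUES (1, '2014-04-15T00:00:00', 'Jansen', 'Amsterdam')"
)

# The full history of the customer, oldest version first.
history = conn.execute(
    "SELECT load_dts, city FROM sat_customer WHERE customer_sk = 1 ORDER BY load_dts"
).fetchall()
```

Both versions stay in the satellite, so the latest row gives the current state and the earlier rows give the history, much like an SCD2 dimension.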


Now let's see what happens when structural changes happen in an operational system that acts as a source for a Datavault data warehouse.

1) Initial situation. 
In this initial situation there is a hub and one satellite. This is the typical situation when a data warehouse has just gone into production: the hubs and satellites are designed and ready for use.




2) Extra attributes added to a table in the sourcesystem.
Now something happens in the operational system: some tables are extended with extra fields, and these fields should be propagated to the data warehouse. This is where the strength of the Datavault model shows up, because nothing changes in the existing model. The extra fields are not integrated into the existing satellite; instead, you create a new satellite and attach it to the hub. Integrating the fields into the existing satellite causes difficulties, such as having to drop and recreate the table.
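A minimal sketch of this step, again using an in-memory SQLite database. The names (sat_customer, sat_customer_contact) and the new fields (email, phone) are assumptions chosen for illustration. The point is that the definition of the existing satellite is identical before and after the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Initial situation: one hub and one satellite (hypothetical names).
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_bk TEXT NOT NULL UNIQUE
);
CREATE TABLE sat_customer (
    customer_sk INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
    load_dts    TEXT NOT NULL,
    name        TEXT,
    PRIMARY KEY (customer_sk, load_dts)
);
INSERT INTO hub_customer VALUES (1, 'C-1001');
INSERT INTO sat_customer VALUES (1, '2014-04-01T00:00:00', 'Jansen');
""")


def table_ddl(name):
    """Return the CREATE TABLE statement that SQLite stored for a table."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row[0]


ddl_before = table_ddl("sat_customer")

# The source table gets extra fields (assumed here: email and phone).
# They go into a NEW satellite on the same hub; sat_customer is not touched.
conn.executescript("""
CREATE TABLE sat_customer_contact (
    customer_sk INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
    load_dts    TEXT NOT NULL,
    email       TEXT,
    phone       TEXT,
    PRIMARY KEY (customer_sk, load_dts)
);
INSERT INTO sat_customer_contact
VALUES (1, '2014-04-22T00:00:00', 'jansen@example.com', '030-1234567');
""")

ddl_after = table_ddl("sat_customer")

# A query that existed before the change still returns the same result.
existing_query = conn.execute(
    "SELECT name FROM sat_customer WHERE customer_sk = 1"
).fetchall()
```

Because the old satellite is never altered, anything that reads from it keeps working, and the new attributes only become visible to queries that explicitly join the new satellite.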




On LinkedIn I had a discussion about changing source systems. I'll blog about this in the future.

Adding Links

The same is true for links. The hubs are the anchors of the model. These anchors give the model its stability, and they provide satellites and links with surrogate keys. Below is an example of a link table and a hub table. The hub provides the link with a surrogate key, and this principle helps to connect other concepts and other concept constellations (see part I).


Suppose you have the following situation: there is an existing part of the data warehouse, and a new part (new concepts) is being built.


Now another advantage of Datavault comes into play. The new concept is connected to the old part of the data warehouse through link tables, and nothing changes in the current data warehouse. Star models, reports and analyses built on top of the Datavault model continue to function properly.
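As a rough sketch of this idea: the example below starts with an existing hub, then adds a new hub and a link table that connects the two through their surrogate keys. The concepts (customer, contract) and all table and column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The old part of the data warehouse: an existing hub (hypothetical names).
conn.executescript("""
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_bk TEXT NOT NULL UNIQUE
);
INSERT INTO hub_customer VALUES (1, 'C-1001');
""")
old_tables = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
}

# The new part: a new hub plus a link that holds only surrogate keys.
# Only CREATE and INSERT statements are needed; nothing existing is altered.
conn.executescript("""
CREATE TABLE hub_contract (
    contract_sk INTEGER PRIMARY KEY,
    contract_bk TEXT NOT NULL UNIQUE
);
CREATE TABLE link_customer_contract (
    link_sk     INTEGER PRIMARY KEY,
    customer_sk INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
    contract_sk INTEGER NOT NULL REFERENCES hub_contract (contract_sk),
    UNIQUE (customer_sk, contract_sk)
);
INSERT INTO hub_contract VALUES (10, 'K-77');
INSERT INTO link_customer_contract VALUES (100, 1, 10);
""")

# Navigate from the old concept to the new one through the link.
rows = conn.execute("""
    SELECT c.customer_bk, k.contract_bk
    FROM link_customer_contract AS l
    JOIN hub_customer AS c ON c.customer_sk = l.customer_sk
    JOIN hub_contract AS k ON k.contract_sk = l.contract_sk
""").fetchall()
```

The link is the only object that knows about both concepts, so the existing hub does not need an extra column or any other change to take part in the new relationship.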



Conclusion

The Datavault model is adaptive because of the separation of the business keys, the relations and the descriptive information. If you condense the data, as the star model does, the model becomes less adaptive and harder to change. By separating the different kinds of data, the model becomes more resilient to changes. By adding satellites and links to the model and not changing the current satellites and links (and hubs), the current reports and (user) queries keep running. This way the Datavault model lends itself to being extended without breaking the current model.


The star schema is less flexible for storing data warehouse information because it has only two flavours: facts and dimensions. Therefore Datavault is more suitable for storing data warehouse information than star schemas.

Greetz,
Hennie

Friday 18 April 2014

Hadoop summit 2014 review

Introduction

A couple of weeks ago I attended the Hadoop Summit 2014 in Amsterdam. I enjoyed most of the sessions, learned new things and gained more insight into the usage of Hadoop. The world of data is changing and, according to some speakers at the Hadoop Summit, it will change fast. The data paradigm is shifting from Schema-on-Write (RDBMS) to Schema-on-Read (Hadoop), and this shift is happening right now. For certain types of data, Schema-on-Read will become more popular than Schema-on-Write solutions. Most of these types of data are unstructured or semi-structured.
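To make the Schema-on-Read idea a bit more concrete, here is a small sketch in plain Python (not Hadoop code). The raw records are stored exactly as they arrive, and each reader applies its own structure at read time. The field names and sample records are made up for illustration.

```python
import json

# Raw, semi-structured records stored as they arrive. No schema is enforced
# on write, so fields may be missing or typed inconsistently.
raw_lines = [
    '{"user": "anna", "clicks": 3}',
    '{"user": "bob", "clicks": "7", "country": "NL"}',
    '{"user": "carl"}',
]


def read_clicks(line):
    """One reader's schema: (user, clicks as int, defaulting to 0)."""
    record = json.loads(line)
    return record["user"], int(record.get("clicks", 0))


def read_country(line):
    """A different schema applied later to the very same raw data."""
    record = json.loads(line)
    return record["user"], record.get("country", "unknown")


clicks = [read_clicks(line) for line in raw_lines]
countries = [read_country(line) for line in raw_lines]
```

With Schema-on-Write, the second and third record would have to be cleaned or rejected before loading. Here the interpretation is postponed until someone actually reads the data, and different readers can interpret it differently.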

It's not only about new technology replacing old technology; it's also about creating new businesses with new technology. Hadoop opens up new business opportunities and new ways of developing business models, based on Hadoop together with RDBMS solutions (for now).



DataOS

With the launch of YARN (Yet Another Resource Negotiator) the name DataOS appeared: YARN is referred to as the DataOS. In 2012, YARN became a sub-project of the Apache Hadoop project. YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component. This enables Hadoop to support more processing approaches and a broader array of applications.





RDBMS living together with HADOOP

How about RDBMS and Hadoop? YARN takes over the resource management that used to be part of MapReduce, so Hadoop is no longer only a batch-oriented system but also a real-time solution. Storm is an example of this. I think that future developments of Hadoop will lead to a change in the design of the Enterprise Data Warehouse. But for now, in the reference architecture of Hortonworks, RDBMS and Hadoop live together and they both borrow data from each other. The photo below shows you this (sorry for the bad photo).



This is also the case with the PDW v2 solution of Microsoft. In this appliance Microsoft already sells Hadoop as an integrated part. PolyBase is the layer that abstracts the Hadoop and the MS nodes. So, in one solution you get an RDBMS and a Hadoop solution behind a uniform layer. That's nice.



Conclusions

The open source community of Hadoop has become huge and practically every vendor embraces this new technology. Microsoft gave up its Hadoop look-alike software (Bing), has embraced Hadoop and has integrated it in PDW v2 (the modern data warehouse). There are new Hadoop developments like YARN, Stinger, Tez and Storm.

As for integrating Hadoop into your Enterprise Data Warehouse, I can imagine that Hadoop will become a base for capturing raw data in a (historical) staging area. Hadoop is about storing raw data, and it's cheap. On top of that you can build a business data warehouse and a discovery and analytical platform for analyzing structured and unstructured data.

Greetz,

Hennie