zondag 12 januari 2014

R : An introduction (Part I)

Introduction

Today a blogpost about R. R is a Statisitcal Programming Language and can be downloaded from the CRAN site. From Wikipedia : R is a free software programming language for statistical computing and graphics and it seems that R is widely used among statisticians and Data miners. R is an implementation of the S programming language. I was curieus about the name R and it seems that it's named after first names of the first authors of R.

Installation of R and RStudio

In this blogpost I'll write about the installation of R studio and some small examples. This is a very popular IDE environment and provides a powerful and productive user interface for R. It's free and open source , available for Windows, Max and Linux. But before you can use R studio you need install R. If you don't, you'll recieve the following message:


For this blogpost I've installed the following software:
  • R (version 3.0.2). This is the console application and has some basic functionality and it seems to be differ between the operating systems.
  • R Studio (version 0.98). R Studio is cross platform application that can be used for analysis. 


R studio

When you have started R studio the following window appears.


Here is a screenshot of some expressions I've borrowed from Wikipedia.


There are some pre installed datasets in R and one of them is "cars". This is a set of data containing the braking distances that were recorded for cars travelling at different speeds. I've tried some graphics.


Conclusion

R Studio is a great tool for analyzing datasets with R. There is a lot of information available on the internet that will help you learn R. Here is a good resource : Wikibooks.

Greetz,
Hennie

woensdag 1 januari 2014

Datavault : The basics (Part I)

Introduction

I've blogged about Datavault in my earlier blogposts (e.g. Four different datamodeling methods). I've looked at different tooling that are available on the market like Quipu and I've blogged about that too. Now, I've had the change to gain some hands-on experiences during a long-term project where I've implemented several Datavault datamodels. I want to share my experiences about Datavault modeling with you. 

This the first blogpost of the Year 2014 and I wish everyone a pleasant and an information wise year!

In this blogpost I'll define Datavault, the Hubs, the Links, the Satellites, Concepts and Concept Constellations. All this information is gathered from seminars (Data modelling Zone Hannover 2013 (BIPodium), Datavault Automation), courses (DV Certification(Centennium) and Advanced Datavault modeling (BIPodium)) and books ("Supercharge your data warehouse" By Dan Linstedt, "Modeling the agile data warehouse with Datavault" by Hans Hultgren and "The Integrated Data Hub" by Dario Mangano). These resources gives you a thorough understanding of Datavault modelling.

In this blogpost I'll build up a simple diagram of a Datavault model. In a few steps W'll enhance a datavault model to a high level diagram that will help you understand the concepts around DataVault. And, it will also help you understand how to build a datavault model.

Datavault

What is Datavault? Datavault is a modeling method used to design a datawarehouse and more specifically an enterprise datawarehouse. The modeling approach is very well suited for building and maintaining a flexible and agile datawarehouse. 

Hans Hultgren talks in his book "Modeling the agile data warehouse with Datavault" about the concept unified decomposition and that seems a true observation of him. An entity in a logical model is split into multiple tables in the physical model but it acts as a whole, an unification of a concept (the entity in the logical model).

The datavault model consists of three basic entity types: Hubs, Links and Satellites. Hubs contains the business keys, Links connects the hubs and the satellites contains the descriptive data about the businesskeys. Together, they build up the Datavault model.

Hubs

As said earlier, Hubs are the business keys, an unique list as Dan Linstedt states in his book "Supercharge your Data warehouse". The Hub is a single table with one or multiple columns with businesskeys. It records the business keys and stores them uniquely. Examples are Social Security Number, LicenseNumber, VehicleIdentification, EmployeeNumber, etc.

And in the process building a Datavault model it's the first step to determine the businesskeys and the correpsonding Hubs. During the analysis of the businessproblem you go and try to find a proper businesskey. 

Here is the first diagram with the Hubs. I've also adapted the color scheme of Datavault.



Links

Links defines the relation between the businesskeys. They give the datavault model the flexibility to extend with other concepts. This type of Datavault entity gives the ability to extend the model in the future when new data should be integrated in the model. This really enhances the maintenance of the model. Examples of this type of Datavault entity are Customer/Sales or CustomerCategory/Category. 

The second step designing the Datavault model is determining the Link entities. These tables are n:m associative entities that are in fact n:m relationtables. This has a little drawback because you can'see on the outside of Link entity whether the underlying logical model is in fact a 1:n relation. This information is implicit. But, it gives you also an advantage. When the relation changes from 1:n to n:1 the Datavault model still remains the same.

Here is the Datavault model with links added


Satellites

Satellites addse descriptive information to the model. These tables contains the descriptive information and the history of the datavault model. These tables gives your model the datawarehouse capability.  Examples are Customer information like name, address, telephonenumber or Sales information like Orderdate, Amount, etc.



Concepts

Dan talked about this in his book and is more in detail described in the book of Hans: Concepts. Concepts are comparable with entities in the logical model. For instance, Customer, Sales, Employee, EmployeeCategory. In the physical model they do don't have any meaning but they help you understand the datavault model. 



Concept Constellations

A new idea is Concept Constellation by Hans Hultgren described. This groups the different concepts together into a higher level meaning. For instance, suppose we have the concepts Employee, EmployeeCategory, EmployeeWhatever, these concepts tell you something about the Concept Constellation Employee. Employee is a higher level definition of all the Hubs, Links, Satellites stores all the information of the different aspects of  an Employee.


Conclusion

This blogpost describes the basics of the Datavault modeling. There is a lot more information available about other types of tables but they all are based on these three types of tables. The strength in the model is by separating the business keys, relations and descriptive information flexibility is introduced. As I'll show you in future blogposts you can easiliy add new descriptive information (satellite) or new concepts to the Datavaultmodel (link).

Greetz
Hennie