BI Future Blog: datamodeling

Posts tonen met het label datamodeling. Alle posts tonen

zondag 15 december 2019

Azure series : Comparing Azure SQLDB vs SQLDW vs AAS

Introduction

There are two SQL Database solutions, available at this moment : Azure SQL Database and Azure SQL Datawarehouse (which is now part of the Synapse analytics solution) and I was wondering what are the differences between the two solutions. For the sake of completeness I've also included Azure Analysis Services.

So my question is : "What is the best Azure SQL data platform solution for implementing a data warehouse solution?" In the schema below I've gathered the pros and the cons of the three solutions (SQLDB, SQLDW and AAS) but focussed on the SQLDB and SQLDW.

I've used the folowing resources for this blogpost :

Compare Azure SQL Database vs. Azure SQL Data Warehouse: Definitions, Differences and When to Use by Barry Luijbregts
Data Compression by Microsoft
IS AZURE SQL DATA WAREHOUSE A GOOD FIT? (UPDATED) by Melissa Coates.
What is Azure Synapse Analytics? by TimExtender

Comparing the three data platforms

In the diagram below, I've compared the three solutions based on the characteristics of these technologies.

Final thoughts

All the data platforms have advantages and disadvantages. Depending on your requirements it's probably not always a good fit to choose one of the data platforms. Sometimes you can choose a combination of tooling to make it an acceptable fit for purpose.

Love to hear your opinion...

Hennie

woensdag 6 maart 2019

Fact Oriented Modeling constraints (part II)

Introduction

In my previous blogpost about FOM, I've discussed the basics of FOM. In this blogpost I would like to elaborate on constraints in FOM. Constraints are important in data modelling because they limit the degrees of freedom (of the data) in the model. In this blogpost I'll describe the different constraints that can be applied to the FOM data models.

Constraints

As said before, there are different constraints possible in FOM :

Value Constraints.
Uniqueness Constraints.
Totality Constraints.
Subset Constraints.
Equality Constraints.
Mutually Exclusion Constraints.
Number constraints.

Value Constraints

Valueconstraints are limitations of certain labeltypes, for instance a sequencenumber can only exist of 1, 2 or 3. You can add this in CaseTalk.

This result in the following change in the diagram. It is now visible that sequencenumber can only exist of 1, 2 or 3.

Uniqueness Constraints

Yet another constraint is uniqueness, meaning that values are unique in a population. You can do this with the menu option Set Unique Constraint in CaseTalk.

This results in the following constraint on Student, saying that there can be only one student with a firstname and a lastname.

There are also other combinations possible, for instance :

Here you say that Peter Jansen is unique and this means that Peter cannot live in two places at the same time. The other way around: In Nijmegen it is possible that two students live here. So, there is not an unique constraint needed.

Another example is presented below. Here you can see that there are two unique constraints on separate columns, meaning that a student is unique and internship is unique. This means that a student can only do one internship and an internship can only be done by one student.

Here is a more complicated uniqueness constraint. In this diagram there 3 roles and two uniqueness constraints:

Student and internship.
Student and sequencenumber.

The combination of student and internship is unique. That makes sense. You can not apply twice for the same internship. The other uniqueness constraint is student and sequencenumber is also logical. A student can not apply for an internship with the same sequencenumber. The student should make an order of internships.

After determining the uniqueness constraints a couple of tests are necessary and they are organized as follows:

Elementary test

n-1 rule test
n rule test
Projection/Join test

Nominalisation test

Totality Constraints.

Totality constraints are constraints that says that every tuple from a role of a (nominalised) facttype should be present in a involved role. For instance: "Every student should have a place to live" meaning that when a student is known, the place is also known. If a place is not known, this is not possible in the semantics. This is not the fact. The fact is that a student has a place. So, you identify is with a totality constraint. How to set this up in CaseTalk?

I had to figure this one out but I managed to make a totality constraint. First you have to press the <CTRL> button and then click on the role you want to put a Totality constraint on.

Now when you are done, The following diagram has Totality Constraints.

Setting Totality Constraints should be handled with care, because when generating a data model with a totalityconstraints can result in a NOT NULL column.

Subset Constraints.

Subset constraints are constraints that says something about subsetting of particular roles. This states that the set of roles is a subset of another set of roles.The values of oneset of roles should be present in the other set of roles.

Let's experiment this with CASE Talk. First press on the subset constraint Icon in the top left in diagram pane.

This will show the following window and here you can enter the subset constraint with the From and To part.

And here is the result of the subset constraint.

The plotted size is a bit huge in contrast with the diagram. You can change that in the diagram with Style and Options.

And here you set the font size.

Equality Constraints.

An equality constraint is needed when two set of roles are equal. This is entered as two subset constraint and is only changed in the From and To.

Mutually exclusion constraints
Mutually exclusive constraints are constraints that are exclusive between two roles. These roles cannot have a common population. For instance, You're married or divorced. You can not be both at the same time.

Number constraints

Number constraints are constraints that limit the values in a role. For instance a student can only enter three preferences for a internship.

Final thoughts

Setting the constraints on the FOM datamodel with CaseTalk will limit the degrees of freedom in your information model and help you structure the information and ultimately a physical model.

Regards,

Hennie de Nooijer

vrijdag 8 februari 2019

Fact Oriented Modeling Introduction (Part I)

Introduction

Today, I want to write something about Fact based Oriented Modelling (FOM). FOM is not about modeling objects in the real world but it is focused on modeling the communication about the objects in the world around us. This is another focus than methods like Codd or Chen. During projects, I gathered information about the area of interest and one of the next steps was trying to imagine the objects and model the data. For instance, you have Patient data and you define the Patient entity. This approach is different than FOM. With FOM you gather the information from communication and verbalize the information in so called fact expressions.

Semantics

I've noticed that during discussions I've had so far, semantics is a much used keyword. Although I was aware of a kind of meaning of this word, I decided to google it, and here is what I found:

"The branch of linguistics and logic concerned with meaning. The two main areas are logical semantics, concerned with matters such as sense and reference and presupposition and implication, and lexical semantics, concerned with the analysis of word meanings and relations between them."

In my opinion, this is saying that semantics is the area of understanding the meaning of communication between (business)people. As a data modeler it is important to understand the wording, the meaning and the relations between the (certain) words.

Verbalization

Fact expressions are important in FOM. FOM expressions are based on predicate logic. They are true or not. For instance you can say something like: "There is a student called Peter Janssen" or "Order 12345 is ordered on February 15th, 2019". The first one is a so called postulated existent expression (I hope I translated that from Dutch correctly) and the latter one is elementary fact expression, meaning it is the minimal information to identify the fact. In other words, there is no redundant information and there is not to less information.

For this blogpost, I've used the following examples (translated from Dutch) from the book Fully Communication Oriented Information Modeling (FCO-IM) by G.Bakema, J. Zwart and H. van der Lek. This is a very readable book about FOM. English version here.

There is a student Peter Jansen.
There is a student Jan Hendriks.
Student Peter Jansen lives in Nijmegen.
Student Jan Hendriks lives in Nijmegen.
Internship S101 is available.
Internship S102 is available.
Student Peter Jansen prefers nr 1 stage S101.
Student Peter Jansen prefers nr 2 stage S203.
Stage S101 takes place in Nijmegen.
Stage S102 takes place in Eindhoven.
Stage S101 is developing a time registration program.
Stage S102 is researching CASE tooling.
Student Peter Jansen is assigned to internship S101.
Student Jan Hendriks is assigned to internship S203.

As you can see, these sentences are easier to verify by business users than a Bachman- or a Chen diagram. Users can say: "No that is not correct, it should be this or that". So this is the first, but very important step in data modeling with FOM. I've not seen this kind of approach earlier. You can say that is the conceptual level of modelling! It models the facts in the communication.

Qualification and Classification

When you are satisfied with the verbalizing the facts, the next step starts. That is grouping the fact expressions into categories and giving the group a name. This is called qualification and classification. For instance, sentences like "There is a student Peter jansen" and "There is a student Jan Hendriks" are grouped together and named "Student".

[Student]
There is a student Peter Jansen.
There is a student Jan Hendriks.

[Residence]
Student Peter Jansen lives in Nijmegen.
Student Jan Hendriks lives in Nijmegen.

[Internship]
Internship S101 is available.
Internship S102 is available.

[Internshippreference]
Student Peter Jansen prefers nr 1 stage S101.
Student Peter Jansen prefers nr 2 stage S203.

[Internshiplocation]
Stage S101 takes place in Nijmegen.
Stage S102 takes place in Eindhoven.

[Internshipdescription]
Stage S101 is developing a time registration program.
Stage S102 is researching CASE tooling.

[internshipassignment]
Student Peter Jansen is assigned to internship S101.
Student Jan Hendriks is assigned to internship S203.

Creating an Information Grammatical Diagram (IGD)

When verbalization is done the next step is executed: designing an IGD. This model is not used for communication with end users because the diagram can be overwhelming and difficult to understand by novice users. IT people prefer abstract diagrams to understand the area of interest better.

Now, you can do it manually or you can use a tool like CaseTalk. Let's take the first fact expression and try to identify the labels and objects here. Objects are things that we want to know more about it. Objects should have an unique identification. Labels are descriptive information.

Below, I have used CaseTalk to identify the labels firstname and lastname. On the right an impression of the diagram is shown. FactType Student has two roles with two labels firstname and lastname.

This results in the following part of an IGD. A fact expression with placeholders 1 and 2 that can instantiated by "Peter Jansen" and "Jan Hendriks". I've entered the second sentence into CaseTalk too.

The next step is to enter all of the sentences aka fact expressions into CaseTalk. For instance, when the next fact expression is entered, the diagram is changed into the following :

Now we can derive two fact expressions from this model : There is a Student Jan Hendriks and Student Peter Jansen lives in Nijmegen.

When all of the Fact expressions are entered in CaseTalk the diagram appears as follows:

In this diagram all of the Fact types are added and the Factexpressions can be derived from the model. For instance,

F2 : <3> lives in <4>.
Role 3 is played by the nominalized objecttype Student which is Student <1><2>.
Role 1 is played by labeltype first name and role 2 is played by labeltype last name.
Role 4 is played by Nominalized objecttype Place <5>.
Role 5 is played by the labeltype placename.

Now this result in the following substitution:

"Student Jan Hendriks lives in Nijmegen"

Final thoughts

This is a short description about Fact Oriented Modelling. I've explained verbalization, classification and qualification, deriving an IGD from fact expressions. In the next blogpost, I'll focus on the constraints of a model. Although there is a structure in the model, there are more limitations/constraints possible, for instance there can be only one student with the same name. This will be subject for the following blogpost.

Hennie