dinsdag 14 april 2015

Some features of SQL Server Data Tools

Introduction

Today a post about SSDT (SQL Server Data Tools), the replacement for BIDS and Database Developer (also known as "Data Dude"). There are two flavors of SSDT: SSDT-BI and SSDT-DB. Both have to be installed separately, which is confusing at first, but once you know what's going on there are no problems anymore. SSDT-DB is the evolution of "Visual Studio Team System Edition for Database Professionals", a project led by Gert E. Drapers, which gave it its more common nickname, "Data Dude".

There seem to be multiple versions of SSDT around; these are the releases I found from the last couple of years:
  • 2014 January update.
  • 2014 July update.
  • 2015 February update.

It is available for two versions of Visual Studio:
  • Visual Studio 2012.
  • Visual Studio 2013.
The ultimate goal of SSDT is to create a more integrated development environment. Traditionally, database development has often been done outside of Visual Studio, for instance in SQL Server Management Studio. With SSDT, database development is instead done inside Visual Studio, with databases included as projects in Visual Studio solutions.


What is SSDT?

This tool is the single environment for writing DB code as well as for BI development. There is no need to switch back and forth between SSMS and Visual Studio: you can write all the code in Visual Studio 2012 or 2013. There is now a clear distinction between a developer environment and a DBA environment; SSMS is more about managing SQL Server and databases. Two modes are possible:
  • Disconnected mode.
  • Connected mode.
    • On premise.
    • Off premise.
There is a local database for developers, which comes in handy because developers don't need a connection to a live database. You write the code in a declarative manner and you don't need to worry (?!) about how these changes are implemented on a database.
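For example, a table in an SSDT project is nothing more than a declarative CREATE TABLE script. A minimal sketch (the table and column names are purely illustrative):

-- Customer.sql in the SSDT project: you only declare the desired end state.
-- SSDT figures out later whether this requires a CREATE or an ALTER on the target.
CREATE TABLE [dbo].[Customer]
(
    [CustomerID]   INT           NOT NULL PRIMARY KEY,
    [CompanyName]  NVARCHAR(128) NOT NULL,
    [ModifiedDate] DATETIME      NOT NULL DEFAULT (GETDATE())
);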


Normally, before you had SSDT, you would write code in SSMS against a development environment. When you needed to deploy that code to another environment, you had to remember what changed or compare the versions in TFS and then create change scripts for the other environment by hand. With SSDT you don't need to do that anymore; SSDT does it for you.
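As a sketch of what that means: if you add a column to the declarative table script above, the publish step compares the project against the target database and generates the change script itself. It could contain something like this (illustrative, not the literal script SSDT emits):

-- Generated-style change script: the target database is brought in line
-- with the declared model by adding the missing column.
ALTER TABLE [dbo].[Customer]
    ADD [Phone] NVARCHAR(25) NULL;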

Options

There are multiple ways to start a project with SSDT:
  • Start from scratch with a new project in SSDT.
  • Import the code from an existing database into an SSDT project.
  • Import from a .dacpac file.
  • Import from script files.
In the following tutorial I will show, in some simple steps, how to import the code from an existing database into a project in SSDT. This is the DB part of SSDT.

Import the code from an existing database into an SSDT project

1. First start SSDT and create a new project; you'll see the following window. Here you can see the distinction between the BI and DB project types.


2. Export the Data-tier Application from the database by right-clicking the specific database in SSMS, choosing Tasks and then Export Data-tier Application.

3. Press Next.


4. Save the file to a local disk by clicking on Browse.



5. Choose the location and press Save.


6. Press Next.



7. Choose the database objects and press Next.


8. Now we have the summary window. Press Finish.



9. Errors appear. Why is that?


Unfortunately, there seem to be a lot of errors when exporting the database to the bacpac file. The errors are static and you can't do anything about them in the wizard itself. Below, I copied the most common errors that appeared in the error list:

The element Extended Property: [dbo].[ufnGetCustomerInformation].[@CustomerID].[MS_Description]

The element Extended Property: [SalesLT].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID].[MS_Description] is not supported

The element XML Schema Collection: [SalesLT].[ProductDescriptionSchemaCollection] is not supported

Element Column: [SalesLT].[Address].[AddressID] has an unsupported property IdentityIsNotForReplication

 Element Column: [SalesLT].[Address].[rowguid] has an unsupported property IsRowGuidColumn set 

Table Table: [dbo].[BuildVersion] does not have a clustered index.  Clustered indexes are required for inserting data in this version of SQL Server.

Element User: [NT AUTHORITY\NETWORK SERVICE] has an unsupported property AuthenticationType set and is not supported when used as part of a data package.   
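Most of these you simply have to accept, but the clustered index error, for instance, points at something you could fix in the source database before exporting. A hedged sketch (the key column is my assumption, so check the actual table definition first):

-- The bacpac export requires a clustered index on every table that contains data.
-- Adding one to the offending table removes this particular error.
CREATE CLUSTERED INDEX [CIX_BuildVersion]
    ON [dbo].[BuildVersion] ([SystemInformationID]);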


When searching the internet I found a helpful article on the SQL Server Data Tools Team Blog. It seems there is another way of importing the database into an SSDT-DB project, and that is with SSDT-DB itself.

10. Select the specific database, right-click it and choose Create New Project.


11. Set the Target name and click on Start.


12. The export is running.


13. And the project is created, based on the AdventureWorksLT2012 database.


Refactor the database

Refactoring the database is one of the things that needs to be done. Objects need to change when the world around us changes, and the database needs to evolve along with it. Let's try to refactor a field in a table of the AdventureWorksLT2012 database.

14. Let's change the CompanyName column in the Customer script to Company and publish this to the database. We do this by right-clicking CompanyName and pressing Refactor. Change the name to Company.
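Under the covers this boils down to a column rename. If you were to do it by hand in SSMS instead of through the Refactor menu, it would look something like this (a sketch; SSDT additionally records the rename in a refactor log so the publish renames the column instead of dropping and recreating it):

-- Manual equivalent of the refactor: rename the column on the live database.
EXEC sp_rename 'SalesLT.Customer.CompanyName', 'Company', 'COLUMN';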



15. Here you can see the changes; press Apply.


16. The next thing is to publish the changes to the database. You do this by right-clicking the project and pressing Publish.


17. After some processing time the changes are applied and the database is in its new state.


Snapshot the project

It's also possible to snapshot your project into dacpac files. These files contain your scripts as they were at a certain point in time, which helps you preserve the project state at important moments.


18. Right click on the project and click on Snapshot Project.


19. And the snapshot is saved under Snapshots.



Using References

If your database references other databases, you'll need a dacpac file for each external database referenced in your code. References can be resolved for objects in one of the following locations:

  • In the referencing database.
  • In a database other than the referencing database, but on the same server.
  • In a database other than the referencing database, on a different server.

In this blogpost, I've chosen to use "In a database other than the referencing database, but on the same server".

Suppose you have a stored procedure in a project that references a table in another database (e.g. TestDB). Then the following error is shown in SSDT, and you need to add a database reference to the project.


22. Add a database reference to the project by right-clicking References and choosing Add Database Reference.


23. First extract a dacpac from the database in SSMS, store it in a central place and press Next.


24. Select that dacpac file from the central place in your project.


25. Replace the database name with the database reference, as shown in the sketch below.
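To give an idea of what that replacement looks like in the procedure body: the database reference exposes a SQLCMD variable that you use instead of the hard-coded database name. The procedure and table names below are made up for the illustration:

-- Before: hard-coded three-part name, which SSDT cannot resolve.
--   SELECT * FROM [TestDB].[dbo].[SomeTable];

-- After: the database reference variable is substituted at publish time.
CREATE PROCEDURE [dbo].[usp_GetSomeData]
AS
BEGIN
    SELECT *
    FROM [$(TestDB)].[dbo].[SomeTable];
END;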




26. Now, if you want to publish this to the database, the values of the database reference can be set.


Here you can set the value for the Database reference.

Conclusion

SSDT is a logical evolution in database, SSIS, SSRS and SSAS development. I'm very happy with a development environment for SQL Server that offers this integrated approach, with a clear distinction between DBA and developer work.


zondag 15 maart 2015

Big Data : Data Obese (Part I)

Introduction

We are buried under the amount of information that we collect, especially the data we refer to as big data. This information is mostly unstructured or semi-structured, like JSON or XML. We need to go on an information diet. How do we handle this?

In addition, some companies (Cloudera, for example) have already announced that the end of the data warehouse will come when Hadoop is fully grown. I'm still not convinced, especially for the short and medium term. Hadoop is an ecosystem that is still under development, and it has lots of potential.



Why a (structured) data warehouse?

Why do we have a (structured) data warehouse? You'd almost forget, given the Big Data onslaught we hear and read about daily. Let's go back to basics. A data warehouse is the place where we can store all information of an organization (and beyond) in a structured way. Structure is the key word here, and it has many advantages. The data has been optimized for complex queries. It has been cleaned (ETL) and labelled with metadata, such as: this field is a varchar and it is always 50 characters wide.
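In concrete terms, that label is nothing more than the typed column definition in the warehouse. A made-up example, just to illustrate the point:

-- The metadata lives in the schema itself: type, length and nullability
-- are guaranteed for every row that makes it into the warehouse.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT         NOT NULL PRIMARY KEY,
    CustomerName VARCHAR(50) NOT NULL,  -- always a varchar, always at most 50 characters
    Country      VARCHAR(50) NULL
);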

Relieving a source system can be an important advantage too. For example, if there is no data warehouse, every data mart or ad-hoc report must access the source system directly. Retrieving the data once into an isolated environment thus relieves the production systems.


But the main advantage is that a data warehouse can integrate data from different sources. Based on the business keys used by the organization, you can integrate, for example, sales, purchasing and production (the vertical silos) back into a horizontal view. You collect and integrate data across all departments.

And if you want to do it completely right, then you have to make sure that you properly save the history of the data from the source systems. People often think this means the old data, but it does not: the point is that you save the changes of the data in the source systems. This has the advantage that you have one central place and your data warehouse is the truth for all corporate data. After all, you have saved the state of a source system at any point in time, so you can time travel: "How did my report look on April 1, 2013?". The report will show the state of the data at that specific time; subsequent changes have no effect. No more guesswork about whether changes in the data produce different figures in the reports (which happens if you do not save changes). It also has other advantages: you are 'auditable', you can generate new insights on old data, and you can analyze processes when, for example, you have registered status changes (process mining).
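If the changes are stored with validity dates (the table and column names below are an assumption, showing one common way of modelling history), the time-travel question becomes an ordinary query:

-- "How did my report look on April 1, 2013?"
-- Pick the version of each customer row that was valid on that date.
SELECT CustomerKey, CustomerName, Country
FROM   dbo.DimCustomer_History
WHERE  ValidFrom <= '2013-04-01'
  AND  ValidTo   >  '2013-04-01';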


How does Big data fit into this?

But now the big question: how does big data fit into this story? That depends on what we mean by big data, and that is difficult. It has become such a huge container concept that it is not entirely clear what exactly big data is. Looking at the 3 Vs, we see it is a lot of data (volume), which is very diverse (variety) and which also arrives fast (velocity). If we assume an (enterprise) data warehouse with complex business rules and integration issues, real-time processing is not to be expected; that is simply not easy to achieve. Assuming that speed (real time) is less important, we are left with two aspects that may matter in a big data data warehouse: the amount of data and the variety.

For the amount of data, we could build a Hadoop solution that stores the unstructured data. Because there is no schema, you can store data quickly in a Hadoop cluster, and you can store all information without knowing the schema of the data. Imagine that your data structure has changed: in an RDBMS you would have to drop and recreate the table, and that is not needed with Hadoop.

There is no schema on the data (we do not have data types such as varchar or integer). In short, when reading the data from the Hadoop cluster, we have to define a schema. 'Defining a schema' can be interpreted very broadly: the script (mapper) determines what the schema is. For example, you can run a text ETL script that looks for patterns. Or, with Hive, which is part of the Hadoop ecosystem, you can put a schema on top of the data files in HDFS, as shown in the sketch below. This is also called "Schema-On-Read".
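A minimal sketch of schema-on-read with Hive (the file location and column layout are assumptions, just to show the idea): the files already sit in HDFS, and the table definition only describes how to interpret them when they are read.

-- HiveQL: project a schema onto existing files in HDFS at read time.
CREATE EXTERNAL TABLE weblogs (
    log_time   STRING,
    ip_address STRING,
    url        STRING,
    status     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Only now, at query time, are the columns interpreted.
SELECT status, COUNT(*) FROM weblogs GROUP BY status;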


But let's focus on unstructured information such as email, reports and Word documents, or on files that have a bit more structure, such as error logs or web logs. So basically data without metadata. How can we analyze this? We can use text ETL: you search for patterns, for example with taxonomies or ontologies. Below is an example of a taxonomy:

Transport
  • Auto
    • Manufacturer
      • Honda
      • Fiat
      • Porsche
    • Type
      • SUV
      • Sedan
      • Station
  • Aircraft
    • Manufacturer
      • Airbus
      • ....

Suppose you want to analyze this sentence:

"We drove and we passed a Porsche and a Volkswagen on the highway"

If we process this with the taxonomy, then something like the following may come out:

"We drove and pased the Porsche / Manufacturer past a Volkswagen / Manufacturer on the highway / road"

Then we could save this information as name-value pairs:

Manufacturer: Porsche
Manufacturer: Volkswagen
Road: Highway

And in this way it looks a bit more structured. This would be the moment to load it into the data warehouse and combine it with other structured data, as sketched below. This is not the only technique; there are many more, e.g. name-value processing, homographic resolution or list processing. These are all techniques you can use to distill patterns out of the data (to 'schematize' it).
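As a sketch of that loading step (table and column names are made up): once the text ETL has produced name-value pairs, they can be stored in an ordinary table and queried or joined like any other structured data.

-- Name-value pairs extracted from the text, keyed by the source document.
CREATE TABLE dbo.ExtractedFacts
(
    DocumentID INT          NOT NULL,
    [Name]     VARCHAR(50)  NOT NULL,  -- e.g. 'Manufacturer', 'Road'
    [Value]    VARCHAR(100) NOT NULL   -- e.g. 'Porsche', 'Highway'
);

-- For example: count the mentions per manufacturer across all documents.
SELECT [Value] AS Manufacturer, COUNT(*) AS Mentions
FROM   dbo.ExtractedFacts
WHERE  [Name] = 'Manufacturer'
GROUP BY [Value];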

An architecture that can handle this could look like this:


Data from structured sources lands in a structured staging area, and the unstructured data lands in a data lake (Hadoop). The structured information is then loaded into an Enterprise Data Warehouse (EDW), processed, and reported on with BI tooling. The unstructured data in the data lake is analyzed on an analytical platform with all kinds of tooling, such as R or Microsoft's Power suite. This is a more explorative process (discovery). The results are written back into the data lake, after which this data can be read into the EDW. The EDW in turn also provides data to the analytical platform in order to enrich the unstructured data. So there are multiple loops in this architecture, and that is generally the case with big data architectures: it is a heterogeneous environment with various tools that each do what they are good at.


Conclusions

So this is one way to reduce the amount of data and make it less 'data obese'. It is important that we find ways to extract information from the enormous piles of data, information that we can use on analytical platforms and that can add value to structured (enterprise) data warehouses.

Greetz,

Hennie 

zaterdag 14 maart 2015

Open Virtualization Format Archive

I've downloaded a Hadoop distribution from Hortonworks, and it comes as an Open Virtualization Format archive. Open Virtualization Format (OVF) is an open standard for packaging and distributing virtual appliances to run in virtual machines.

This is how I loaded the OVF file in Virtual Box.

1. Start VirtualBox Manager

2. Click on File and Import Appliance.



3. Select the OVF file.



4. Adjust the settings.



5. Press Import and the import starts.


6. And the results.



Greetz,
Hennie

zaterdag 31 januari 2015

Datamining : Reasoning (Part I)

Introduction

According to Wikipedia, data mining is "searching for statistical patterns in (large) datasets for purposes like scientific or commercial usage". So data mining supports discovery-based strategies (in contrast with business-driven discovery) to find relations, correlations or perhaps categorizations. Data mining is about induction. Induction is generalizing from a certain number of observations. Induction is the opposite of deduction, where a generic rule is applied to specific observations.

Deductive reasoning

In deductive reasoning, conclusions follow with certainty from their premises, whereas inductive reasoning refers to situations where conclusions follow only probabilistically from their premises. There are a couple of rules of inference, according to a book from my good old university days: "Cognitive Psychology and Its Implications" by John R. Anderson. I can recommend this book, by the way (if it's still sold).

Conditional statement (modus ponens)
A conditional statement is an assertion such as "if you read this blogpost, you will be wiser". The if-part is called the antecedent and the then-part is the consequent. This form of reasoning is called modus ponens: given "if A, then B" and the proposition A, we can infer B. Suppose the following:

1. If you read this blogpost, you will be wiser.
2. You read this post.

From premises 1 and 2 we can infer 3 by modus ponens.

3. You are wiser!

Modus tollens
Another rule of inference is modus tollens. This rule states that if we are given the proposition "A implies B" and the fact that B is false, then we can infer that A is false. An example:

1. If you understand this blogpost, you understand reasoning with datamining
2. You don't understand reasoning with datamining.

It follows from modus tollens that:

3. You didn't understand this blogpost.
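Written schematically in standard logic notation (the turnstile $\vdash$ reads as 'therefore'):

Modus ponens:  $A \to B,\ A \vdash B$
Modus tollens: $A \to B,\ \neg B \vdash \neg A$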

Inductive reasoning

Induction is a way of reasoning that acts as a kind of proof. With inductive reasoning, a general rule is derived from a certain number of observations; we call this generalization. Inductive reasoning is the term used to describe the process by which one comes to conclusions that are probable rather than certain. This is often more useful, because very little is certain and most things are, at very best, very likely.

1. If you read this blogpost, you will be wiser.
2. You are wiser.

Then it follows with inductive reasoning:

3. You've read this blogpost! Great!

Bayes' theorem provides a way of assessing the plausibility of this reasoning: what is the probability that you are wiser because you read this blogpost? There is also a probability that you have become wiser from reading something else, and that is important to take into account too.
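In formula form, with R standing for "you read this blogpost" and W for "you are wiser" (a sketch of how the theorem maps onto this example):

$P(R \mid W) = \dfrac{P(W \mid R)\,P(R)}{P(W)}$, where $P(W) = P(W \mid R)\,P(R) + P(W \mid \neg R)\,P(\neg R)$.

The second term in the denominator is exactly the probability that you became wiser from reading something else, which is why that possibility matters.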

Conclusion

This was an introduction to data mining and inductive reasoning with data. This blogpost discussed some of the issues that come with inductive reasoning and data mining.

maandag 8 december 2014

Let's Disco!

Introduction

Disco is a process mining tool that lets you discover processes by analyzing event logs. It's suitable for analyzing the transactions that happen in a process; a sales process or an order process, for instance. You can examine bottlenecks in your process, loops, times, durations, averages, fast/slow lanes, conformance issues, resource performance, etc.

So, process mining with a tool like ProM or Disco (and I've even seen a demo of SAS Visual Analytics with a Sankey diagram) is very well suited for process model analysis. Different tools are enhancing process analysis, and process analysis can be a great addition to Business Intelligence. Where Business Intelligence is more like fishing with a fishing rod in an organisation for KPIs and business metrics, process mining is much more 'on' the process. Business Intelligence and process mining can work together to optimize processes.

A simple tour through Disco

In this blogpost I've included some screenshots of Disco. I've used a small file borrowed from the Coursera course "Process Mining: Data Science in Action", lectured by Wil van der Aalst of TU Eindhoven. Below is the opening screen of Disco.


The first thing you have to do is load the data, in this case a csv file.


Then you have to set the CaseID, the events (activities), the timestamps and the resources. This is needed for the analysis of the process.
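For reference, a minimal event log contains exactly those ingredients; a sketch in table form (the column names and examples are invented):

-- One row per event: which case, what happened, when, and by whom.
CREATE TABLE dbo.EventLog
(
    CaseID     INT          NOT NULL,  -- e.g. the request or order number
    Activity   VARCHAR(50)  NOT NULL,  -- e.g. 'Register request', 'Check ticket'
    EventTime  DATETIME     NOT NULL,  -- the timestamp of the event
    [Resource] VARCHAR(50)  NOT NULL   -- e.g. the employee performing the step
);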




The next step is importing the data and generating the process map. Below is an example of the duration of the process steps.


Here is an example of the mean durations.


And below, an example of replaying the data on the model. The yellow/red dots are markers that flow through the model.



 And below some statistical analysis with Disco:


Some more information about the cases.


Conclusion

Disco is a great tool for analyzing processes. Process mining can be a great addition to Business Intelligence and is very helpful for process analysis; both analyze processes, but on different levels.

Greetz,
Hennie