Thursday 30 September 2010

SSAS : IgnoreUnrelatedDimensions

I'm currently working with Analysis Services and MDX and I ran into the property IgnoreUnrelatedDimensions. This property controls whether the All member is substituted for members of dimensions that are unrelated to the measure group.
The MDX query is something like this:
SELECT [Measures].[Extended Amount] ON 0,
[Dim Reseller].[Reseller Name].MEMBERS ON 1
FROM [Adventure Works DW]
 
 

When I executed this, at first it appeared to me that something was wrong:


But when I changed the IgnoreUnrelatedDimensions property to False, the values changed to null, and that's more intuitive because there's no relation between the measure and the dimension.



And the results are:

A small post about a small but handy property ;-)

Greetz,
Hennie

Tuesday 28 September 2010

CustomRollup

In this article about SSAS I want to show you an implementation of a custom rollup, in this case a custom rollup on a parent-child hierarchy. Most of the time, aggregation of the data is done with sum, max, min or one of the other standard aggregations.

An example is depreciation. I borrowed this from the book “Microsoft SQL Server Analysis Services 2008 with MDX”. Depreciation is about “mapping an asset's expense over time to benefits gained through use of those assets”, meaning that the value of assets decreases over time. You can depreciate in different ways: exponential, linear, etc.

In the following example we'll use the account dimension from the AdventureWorks database, which indicates the types of accounts of your company, such as asset, liability, income and expenditure. Together with an amount we can roll up the information. For a parent-child hierarchy, Analysis Services allows you to perform a custom rollup using a feature called unary operators. Unary operators let you specify a basic aggregation behaviour per member: add, subtract, etc.

If you look at the account dimension, available in the AdventureWorks database, and incorporate it into an Analysis Services project, you'll see something like this:


In the properties of an attribute you'll see a property called CustomRollupColumn. This property needs to be set to the relational column that holds the MDX expression for the custom rollup formula. The MDX expression gets evaluated when the cell values of the corresponding members of the hierarchy are retrieved in a query.

The sample AdventureWorks database provides a column with custom rollup operators for the account dimension: Operator.
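To get a feel for what these unary operators look like in the relational source, you can query the DimAccount table directly. A minimal sketch, assuming the standard AdventureWorksDW column names Operator and CustomMembers:

-- Inspect the unary operators and custom rollup formulas of the account dimension
SELECT AccountKey,
       ParentAccountKey,
       AccountDescription,
       Operator,       -- unary operator: +, -, ~, etc.
       CustomMembers   -- optional MDX expression for a custom rollup formula
FROM dbo.DimAccount
ORDER BY ParentAccountKey, AccountKey;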

There you can see how the hierarchy of account attributes needs to be rolled up by the Operator column. Right-click on the ParentAccountKey attribute and specify the level names for the parent-child hierarchy like below:


Set IsAggregatable to false and click on the '...' button of the CustomRollupColumn property in the properties window. Below you can see the result of the custom rollup in the Finance fact:


So this is an example of the custom rollup property.

Greetz,
Hennie

Sunday 26 September 2010

Join elimination


While reading an article about anchor modeling, I was triggered by the sentence: “An anchor model requires join elimination techniques in your RDBMS”. Join elimination? The writer of this article pointed me to a blog from the Oracle optimizer development team. Interesting!


In that blog they talked about 'table elimination' (also called 'join elimination'), which removes redundant tables from a query. A table is redundant if its columns are only referenced in join predicates, and it is guaranteed that those joins neither filter nor expand the resulting rows. On that blog you can find some examples.

First, the following script is shown on the Oracle blog:

create table jobs
( job_id NUMBER PRIMARY KEY,
    job_title VARCHAR2(35) NOT NULL,
    min_salary NUMBER,
    max_salary NUMBER );



create table departments
( department_id NUMBER PRIMARY KEY,
  department_name VARCHAR2(50) );



create table employees
( employee_id NUMBER PRIMARY KEY,
  employee_name VARCHAR2(50),
  department_id NUMBER REFERENCES departments(department_id),
  job_id NUMBER REFERENCES jobs(job_id)
);


These create statements show that there are REFERENCES constraints, in other words foreign keys.

A simple SQL query example, which I copied from the Oracle site:


select e.employee_id, e.employee_name 
from employees e 
where not exists 
(select 1 from jobs j where j.job_id = e.job_id);


Resulting in this execution plan (I'm not sure whether Oracle calls this an execution plan):


-------------------------------------------
 Id   Operation             Name      
-------------------------------------------
   0  SELECT STATEMENT                
*  1   TABLE ACCESS FULL    EMPLOYEES 
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("E"."JOB_ID" IS NULL)

A full table scan is done on the employees table. The jobs table isn't scanned, and it doesn't need to be: the foreign key guarantees that every non-null job_id has a matching job, so the NOT EXISTS can be reduced to the IS NULL filter you see in the plan. The second example is an inner join between employees and departments:

select e.employee_name
from employees e, departments d
where e.department_id = d.department_id; 

The Oracle optimizer will generate this plan for the query:


-------------------------------------------
 Id   Operation             Name      
-------------------------------------------
   0  SELECT STATEMENT                
*  1   TABLE ACCESS FULL    EMPLOYEES 
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 1 - filter("E"."DEPARTMENT_ID" IS NOT NULL)


So this is a query plan from Oracle, and you can see here that the departments table isn't scanned. Strange? How can Oracle skip the join? Because of the REFERENCES relationship, which you can see in the CREATE TABLE script: the foreign key guarantees that every non-null department_id in employees matches exactly one department, so the join can neither filter nor expand the rows.

I also tried something similar on SQL Server (AdventureWorks). I created two views:
  • vwINNERJOIN
  • vwLEFTOUTERJOIN
Here are the scripts:

DROP VIEW vwINNERJOIN
GO

CREATE VIEW vwINNERJOIN AS
SELECT P.Color, PS.Name as ProductSubcategory, PC.Name as Productcategory
FROM Production.Product P
INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID
INNER JOIN Production.ProductCategory PC ON PS.ProductCategoryID = PC.ProductCategoryID
GO

---
DROP VIEW vwLEFTOUTERJOIN
GO

CREATE VIEW vwLEFTOUTERJOIN AS
SELECT P.Color, PS.Name as ProductSubcategory, PC.Name as Productcategory
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID
LEFT OUTER JOIN Production.ProductCategory PC ON PS.ProductCategoryID = PC.ProductCategoryID


And I executed the two queries:

SELECT Color
FROM vwINNERJOIN

SELECT Color
FROM vwLEFTOUTERJOIN


and this resulted in the following execution plans:


So the example is about selecting one field from a view with multiple joins. And the execution plans show that there is only one table scan in the case of the left outer join, and more scans in the case of the inner join.
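To see which tables are actually touched by each query, you can compare the I/O per table. A minimal sketch, using SET STATISTICS IO:

SET STATISTICS IO ON;

SELECT Color FROM vwINNERJOIN;      -- compare which tables show up in the STATISTICS IO output
SELECT Color FROM vwLEFTOUTERJOIN;  -- here only the Product table should be reported

SET STATISTICS IO OFF;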

So the conclusion is that SQL Server doesn't apply join elimination in the case of this inner join, but does apply it in the case of the left join. This is because of the cardinality of the query: Product.ProductSubcategoryID is nullable, so the inner join can exclude products without a subcategory from the result set, and therefore a scan of the joined tables is still needed. With the left outer join no product rows can disappear, and the foreign keys guarantee at most one match per row, so the joins can be removed. So SQL Server does apply join elimination, but only when the join is guaranteed to neither filter nor expand the rows.
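A quick way to verify why the inner join cannot be eliminated is to check the nullability of the foreign key column. A sketch against the standard AdventureWorks schema:

-- A nullable foreign key column means the inner join can filter rows,
-- so the optimizer cannot eliminate the joined table.
SELECT c.name, c.is_nullable
FROM sys.columns c
WHERE c.object_id = OBJECT_ID(N'Production.Product')
  AND c.name = N'ProductSubcategoryID';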

Regards,
Hennie

Thursday 23 September 2010

SQLPass meeting 21 september 2010


I joined a session at the Dutch chapter of SQLPass and it was about game changers. Due to recent developments in hardware (SSD, Fusion-io) and database technology (NoSQL, columnar databases), innovation is going pretty fast. So what does this mean for the work of a DBA, SQL developer, data warehouse architect or anyone else who is working with databases? The session had two main parts: a video presentation by Dr. David DeWitt (Microsoft Data and Storage) from PASS Summit 2009, followed by a discussion about what these recent developments mean for the current and future job activities of the (SQL Server) database specialist.

The first part of the evening was a video presentation by Dr. David DeWitt at PASS Summit 2009 about the developments in hardware over the past 30 years. He showed that current disks are 10,000x larger than 30 years ago, memory is 1,000x larger, CPUs are 2,000x faster and CPU batches are 1,000x faster. As he explained further, CPUs have become 1,000x faster but I/O has only become 10x faster, and SSD seems to be the only real hope for getting things done faster. He also compared the data transfer rate of 30 years ago with that of today, and the outcome was that disks have become relatively 150x slower than 30 years ago!! So improvements in CPU and memory have outpaced the improvements in disks.

After a lot of technical stuff about L1, L2 and memory caching, he explained that a column store is, in theory, 7 times faster than a row-store-based database. But there's a trade-off: updating is slower than in row-based databases. The conclusion was that column-based databases reduce the amount of disk I/O required to execute a query and improve CPU performance by reducing memory stalls. Best practice is not to use a column store database for operational purposes. He believed that due to hardware trends more specialization will happen in the future.

After this a discussion started about SSDs, Fusion-io, IOPS, virtual file stats, using SSD for tempdb, and listing pending I/Os. There was also talk about the integration of memory and storage. This is an interesting development, because I learned from my first books about computers that the Von Neumann architecture is always based on a CPU, internal memory and external memory. This paradigm may change in the future. No more hard disks? What would that mean? Are things going so fast that there will be less need for work done by ICT professionals, like architecture development, DBA work, performance tuning, etc.? Is that the future? Or will volumes of data grow so fast that there will always be a need for ICT professionals? Time will tell…

Another viewpoint is that 10 years ago the world was much simpler. You had a database and a VB6 application, and that was it. But nowadays we have all kinds of new features in products, complex environments, more data, and more (different) user needs. As Dr. David DeWitt states, more specialized products will come to the market in the future, and therefore more specialization of ICT professionals is inevitable.
So the title of the evening was 'Game changers'. What is a game changer? I think you should first define what a game changer is. Looking at the past, typical game changers were the IBM desktop, the internet and mobile devices. You could also define a game changer as a revolution, a turnaround. In this discussion I didn't hear a game changer. The things I heard were more evolutions: the column-based database is a specialized database (evolution) and the SSD is an evolution of external memory. The things talked about at this SQL Pass evening don't change that much; they only get faster or more specialized.
Greetz,

Hennie

Sunday 19 September 2010

A duplicate attribute key has been found when processing....

Hi folks,

I was experimenting a bit with hierarchies in Analysis Services and I ran into an error that I want to show you. I've installed the AdventureWorksDW database and built a dimension on the DimDate table of that database. I've created a small hierarchy:


A simple hierarchy, as you can see. When I process this, the following error occurs:


This occurs because the month name appears multiple times across multiple years, and attribute keys have to be unique. Okay, well, there's a KeyColumns property on the month attribute. Simple: just make the key unique and the problem is solved! So I added the DateKey to the key columns and filled in the name column:


Process succeeded!!! Okay, let's continue with something more interesting, this is too easy. Hey, wait a minute... this is my hierarchy:



This isn't what I want. This hierarchy has a lot of members with the same value. Choosing the dimension key to make the values unique is not the proper thing to do! Let's change the DateKey in the key columns to the calendar year.
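The reason is easy to see in the relational table: every month name occurs in more than one calendar year, so the year (together with the month) is the natural key. A quick check, as a sketch assuming the standard AdventureWorksDW DimDate column names:

-- Show that each month name occurs in several calendar years,
-- which is why (CalendarYear, month) is needed as the attribute key
SELECT EnglishMonthName,
       COUNT(DISTINCT CalendarYear) AS NumberOfYears
FROM dbo.DimDate
GROUP BY EnglishMonthName
HAVING COUNT(DISTINCT CalendarYear) > 1;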


Process Succeeded! Let's see the result:



A simple post about a simple problem, but it took me a while to get it solved (at first I had a more complex hierarchy). Making attributes unique by just adding a unique field can give you undesired results.

Greetz,
Hennie

Friday 17 September 2010

Lookup join with connection manager and transform

This blog is about the cache connection manager (CCM) and the cache transform. Together these allow you to load the lookup cache from any source. The cache connection manager holds a reference to the internal memory cache and can both read and write the cache to a disk-based file. The cache transform is a new feature in SQL Server 2008.

The lookup transform can use this cached data to perform the lookup operation. The cache connection manager can persist the cache to a local file, allowing the cache to be shared between multiple SSIS packages and between steps within a package. A perfect example of where this is useful is in extract, transform and load (ETL) packages where, while processing the fact table, we have to look up the surrogate key that belongs to an application key. In my data warehouse projects I call this the lookup table (LU). So if we create a lookup cache per dimension, the load time of the fact table decreases when we use this feature. It speeds up even more when a dimension cache is used by multiple facts.
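Such a per-dimension lookup cache would typically be loaded from a query like the one below. A minimal sketch, assuming the AdventureWorksDW DimCustomer column names:

-- Typical dimension lookup (LU) query to load into a cache:
-- it maps the application (business) key to the surrogate key.
SELECT CustomerKey,          -- surrogate key used in the fact table
       CustomerAlternateKey  -- application/business key coming from the source
FROM dbo.DimCustomer;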

To demonstrate how to use the new lookup and cache transforms, we will create a simple SSIS package as shown below:

The DFT_Cache_Customer data flow will cache the information from the Customer table into a .caw file. For this example the customer query shown in step 5 below is used.

1. The first thing you have to do is create a cache connection manager. Right-click in the connection manager area of the control flow design surface and select New Connection. Select CACHE in the Add SSIS Connection Manager dialog and then click the Add button.


2. Next, some properties need to be configured. On the General tab, check the 'Use file cache' checkbox and enter a file name for this cache; this will persist the cache to a local file, allowing it to be reused by other tasks in the package or by other packages.


3. Click on the columns tab to configure the data columns to be stored in the cache.
 

4. Build the customer cache (DFT_CacheCustomer)  

5. The OLE DB source selects the customer data: SELECT CustomerID, AccountNumber FROM Sales.Customer.
 

6. Then set up the cache transform component; this one loads the cache. The cache transform requires a cache connection manager, which defines the cache. You could create a new one, but we already created one in the connection managers pane.
 

7. Click on mappings to map the input columns in the dataflow to the destination columns in the cache.

 
8. From a former post you may recognize the following SSIS snippet:
 
9. The lookup transform needs two adjustments. The first one is to set the connection type to 'Cache connection manager'. This allows the lookup to preload the cache from the file we just created.


 
10. The next thing you have to select is the Cache connection:

 
Conclusion: as already said in the intro, this feature can speed up the lookups during fact loading, especially the lookups of dimension surrogate keys, and even more so when the cache is used multiple times.
 
Bye,
Hennie

Wednesday 15 September 2010

Change Data Capture (CDC)

Change data capture is a way of capturing changed data from a source system within a specified window of time. These changes include inserts, updates and deletes, and the required window of time may vary from minutes to hours to weeks. It is a new feature since SQL Server 2008; it's based on the replication infrastructure and was designed together with the SSIS team.

CDC is available on certain editions of SQL Server, namely SQL Server 2008 Enterprise, Developer and Evaluation editions. As an ETL developer you could be tempted to turn CDC on for every table, but that is a bad idea. Although CDC has low overhead, it does not have zero overhead.

Below you can see the steps for turning CDC on.

Enable CDC at the database level. This lets SQL Server set up the required infrastructure that we'll need later. It creates a schema called cdc, as well as security objects, functions and procedures.

USE AdventureWorks2008R2

GO
--enable CDC on the database
EXEC sys.sp_cdc_enable_db
GO

--Check CDC is enabled on the database
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE database_id = DB_ID()

Make sure that the SQL Server Agent is running on the server where CDC is turned on. Start SQL Server 2008 R2 Configuration Manager and check whether the agent is running.
 
Now it’s possible to enable CDC functionality on a table. For instance:
 
USE AdventureWorks2008R2

GO

--enable CDC on a specific table
EXECUTE sys.sp_cdc_enable_table
@source_schema = N'HumanResources',
@source_name = N'Employee',
@role_name = N'cdc_Admin',
@capture_instance = N'HumanResources_Employee',
@supports_net_Changes = 1;

And this results in:


Job 'cdc.AdventureWorks2008R2_capture' started successfully.
Job 'cdc.AdventureWorks2008R2_cleanup' started successfully.

You could check whether CDC is enabled for any particular table with the following query:

--Check CDC is enabled on the table

SELECT [name],
is_tracked_by_cdc
FROM sys.tables
WHERE [OBJECT_ID] = OBJECT_ID(N'HumanResources.Employee')

--Or use the CDC help procedure
EXECUTE sys.sp_cdc_help_change_data_capture
@source_schema = N'HumanResources',
@source_name = N'Employee'

So now CDC is turned on for the Employee table. When this is enabled, SQL Server uses a default naming convention to create a shadow table in the same database, called cdc.HumanResources_Employee_CT. This table has the same columns as the source table, but it has extra metadata columns that CDC needs to do its magic.

SELECT * FROM cdc.HumanResources_Employee_CT

__$start_lsn = identifier of the original transaction
__$seqval = order of the changes within a transaction
__$update_mask = bit mask indicating which columns were changed (for updates)
__$operation = type of operation (1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image), 5 = merge)

Okay, let's make an update to this table and see what happens. Hey, for every update there are two records: one with the before image and one with the after image.
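For example, an update like the one below. This is a sketch; the VacationHours column and BusinessEntityID value are just assumptions, any captured column will do:

-- Hypothetical update to generate a change record in the CDC shadow table
UPDATE HumanResources.Employee
SET VacationHours = VacationHours + 1
WHERE BusinessEntityID = 1;

-- Inspect the captured before/after images
SELECT __$operation, *
FROM cdc.HumanResources_Employee_CT;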


Okay, now let's play with some of the API stored procedures and functions. With the following code you can retrieve the net changes of the last day, i.e. the latest version of each changed record.

USE AdventureWorks2008R2

GO

DECLARE @dBeginTime AS Datetime = GETDATE() -1
DECLARE @dEndTime AS Datetime = GETDATE()
DECLARE @bFromLsn AS Binary(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @dBeginTime)
DECLARE @bToLsn AS Binary(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @dEndTime)


DECLARE @bMinLsn AS Binary(10)= sys.fn_cdc_get_Min_lsn('HumanResources_Employee')

IF @bFromLsn < @bMinLsn SET @bFromLsn = @bMinLsn

SELECT * FROM cdc.fn_cdc_get_net_changes_HumanResources_Employee(@bFromLsn, @bToLsn, N'all with mask');



Okay, with this information we could build a robust and efficient ETL solution in SSIS.
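A minimal sketch of how that could look: wrap the net-changes query from above in a (hypothetical) stored procedure that an SSIS source can call with a begin and end time per load window:

-- Hypothetical wrapper procedure for an SSIS source; it reuses the net-changes query shown above
CREATE PROCEDURE dbo.usp_GetEmployeeChanges
    @dBeginTime DATETIME,
    @dEndTime   DATETIME
AS
BEGIN
    DECLARE @bFromLsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @dBeginTime);
    DECLARE @bToLsn   BINARY(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @dEndTime);
    DECLARE @bMinLsn  BINARY(10) = sys.fn_cdc_get_min_lsn('HumanResources_Employee');

    IF @bFromLsn < @bMinLsn SET @bFromLsn = @bMinLsn;

    SELECT *
    FROM cdc.fn_cdc_get_net_changes_HumanResources_Employee(@bFromLsn, @bToLsn, N'all with mask');
END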
 
 
That's it for now!
 
Gr,
Hennie

Tuesday 14 September 2010

Using the lookup join - Partial cache

This blog is about the partial-cache mode. In another post of mine I explained a basic example of the lookup transform; in this post I will go a bit deeper into it. The partial-cache mode gives you a middle ground between the no-cache and full-cache options. In this mode the component only caches the most recently used data, within the memory boundaries specified. As soon as the cache grows too big, the least-used cache data is thrown away.
The partial-cache mode provides another cache feature called the miss cache; if you use these two together you can gain further performance. So first I reused the package from my first post about this subject:


All the other tabs I showed in my former post (1) are almost the same: in the General tab the cache mode is changed from full cache to partial cache, and I corrected the error in the Connection tab.

The next screenshot shows the Advanced tab:


I just changed the cache mode from full cache to partial cache and made some adjustments to the settings on the Advanced tab.

Greetz,
Hennie

Friday 10 September 2010

Failure adding assembly to the cache

Hi,

A small post about GACUTIL. Currently I'm building a custom source adapter for SSIS, reading a custom source. For this purpose I'm using Visual C# 2010 Express. So I built the assembly and wanted to register it in the Global Assembly Cache (GAC) with GACUTIL. So I executed this:

"C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\gacutil" /if "C:\ssis\SSISSourceAdapter\SSISSourceAdapter\bin\Release\SSISSourceAdapter.dll"

and this gave me the following error:

Failure adding assembly to the cache: This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.

So I thought I should download the .NET Framework 4.0 SDK, released as v7.1, from here. I now have three folders in C:\Program Files\Microsoft SDKs\Windows: v6.0A, v7.0A and v7.1. I executed GACUTIL from the v7.1 folder and it still didn't work. After some searching on the internet I found out there's a subfolder in the v7.0A folder: C:\Program Files\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools. I executed this GACUTIL command line from that subfolder:

"C:\Program Files\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools\gacutil" /if "C:\ssis\SSISSourceAdapter\SSISSourceAdapter\bin\Release\SSISSourceAdapter.dll"

When I executed GACUTIL from this folder, the assembly was added successfully. I'm not sure why the GACUTIL in the v7.1 folder didn't work.

Greetz,
Hennie

Lookup join - Partial cache - Package Validation Error

In a former post I blogged about the full cache, and I was figuring out how the partial cache works. So I changed the cache mode from full cache to partial cache and immediately ran into an error.



and more specifically:

TITLE: Package Validation Error

------------------------------
Package Validation Error
------------------------------
ADDITIONAL INFORMATION:
Error at DFT_AdventureWorks [Lookup [411]]: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005.
An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "Syntax error, permission violation, or other nonspecific error".
Error at DFT_AdventureWorks [Lookup [411]]: OLE DB error occurred while loading column metadata. Check SQLCommand and SqlCommandParam properties.
Error at DFT_AdventureWorks [SSIS.Pipeline]: "component "Lookup" (411)" failed validation and returned validation status "VS_ISBROKEN".
Error at DFT_AdventureWorks [SSIS.Pipeline]: One or more component failed validation.
Error at DFT_AdventureWorks: There were errors during task validation.
(Microsoft.DataTransformationServices.VsIntegration)

I was trying to figure out what was going on and found out it had something to do with this query in the lookup transform:

--16989 out of 19820
SELECT CustomerID, AccountNumber
FROM Sales.Customer
WHERE CustomerID % 7 <>0;

I found out there are a couple of things wrong with this query (the corrected version is shown after this list):
  • the comment --16989 out of 19820 is not allowed in the query window;
  • the trailing ; is also not allowed in this window.
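With the comment and the semicolon removed, the lookup query simply becomes:

SELECT CustomerID, AccountNumber
FROM Sales.Customer
WHERE CustomerID % 7 <> 0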
What I also found is that when you press OK immediately after correcting the error, the error indicator stays there! But when you first go to the Columns tab and then press OK, the error disappears (??). A bit awkward, and it cost me a lot of frustration before I found out about this.

Greetz,
Hennie

Monday 6 September 2010

Using the lookup join - Full cache

The lookup component solves join issues by caching one of the datasets in memory and then comparing each row arriving from the other dataset in its input pipeline against that cache. In the example below you can see the full-cache mode, whereby the lookup component stores all the rows resulting from a specified query in memory. The benefit of this mode is that lookups against the in-memory cache are very fast.

So first I started to build a package according to the diagram below:


First I built a query, something like this:

SELECT
OH.SalesOrderID,
OH.OrderDate,
OH.CustomerID
FROM Sales.SalesOrderHeader OH

This is the main stream of data that will flow through the pipeline. It's also a best practice to use the largest table as the data flow input that passes through the lookup transform. After this, open the Lookup transformation editor and fill in the General tab like below:


Don't forget to specify how rows with no matching entries are handled (redirect them to the no match output).




After this, open the Connection tab. This tab shows the query that is used for combining the data with the main data flow stream:


--16989 out of 19820
SELECT CustomerID, AccountNumber
FROM Sales.Customer
WHERE CustomerID % 7 <>0;

I added the WHERE clause to show the non-matching records going in another direction. Next, select the joining column, which is CustomerID. After selecting these fields everything should be ready to run, and you can see the result below:



The conclusion is that this transform can be used for looking up dimension keys when loading a star schema. You can use the match output for the matching records, and the no-match records can be processed separately. These non-matching records become, in fact ;-), dummy records. No records should disappear when you build a star schema, because calculations would then show wrong results. When a CustomerID is not found in the customer dimension, it should point to a dummy record in the customer dimension, usually the -1 record.
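In SQL terms the idea comes down to something like the sketch below; StagingSales is a hypothetical staging table and the DimCustomer column names are assumed to follow AdventureWorksDW:

-- Sketch: during fact loading, a missing dimension member maps to the -1 dummy record
SELECT S.SalesOrderID,
       S.OrderDate,
       ISNULL(DC.CustomerKey, -1) AS CustomerKey   -- -1 = unknown/dummy customer
FROM dbo.StagingSales S
LEFT JOIN dbo.DimCustomer DC
       ON DC.CustomerAlternateKey = S.CustomerID;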

So that's it for now,

Hennie

Saturday 4 September 2010

Merge join vs Lookup transform vs relational join

In this post I'll compare the two SSIS join components, the merge join and the lookup transform, with the relational join, in order to determine which solution is the best way to solve your problem of joining data from different data sources. This post is based on SQL Server 2008 R2.

Merge Join
The merge join is used for joining data from two data sources. The more common ETL scenarios require you to access two or more disparate data sources simultaneously and merge their results together into a single dataset. An example could be a normalized source system where you want to merge the normalized tables into a denormalized table.
The merge join in SSIS allows you to perform an inner or outer join in a streaming fashion. This component behaves in a synchronous way. It accepts two sorted input streams and outputs a single stream that combines the chosen columns into a single structure. It's not possible to configure a separate non-matched output.

Lookup
The lookup component in SQL Server Integration Services allows you to perform the equivalent of relational inner and outer hash joins, but without using the algorithms of the database engine. You would use this component within the context of an integration process, such as an ETL layer that populates a data warehouse from multiple heterogeneous source systems.
The transform is written to behave in a synchronous manner, in that it mostly does not block the pipeline while it's doing its work. In certain cache modes, however, the component will initially block the package's execution for a period of time while the cache is loaded.
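For reference, the relational join that both components emulate is just a plain join executed by the database engine, for example on the AdventureWorks sales tables used in the earlier lookup posts (a sketch):

-- Relational equivalent of the merge join / lookup scenario:
-- the database engine picks the join algorithm (hash, merge or nested loops) itself
SELECT OH.SalesOrderID,
       OH.OrderDate,
       C.AccountNumber
FROM Sales.SalesOrderHeader OH
LEFT JOIN Sales.Customer C
       ON C.CustomerID = OH.CustomerID;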



Conclusion: there's only one reason for using the merge join, and that is a lack of memory; otherwise always use the lookup transform. A lot of improvements were made to the lookup transform in SQL Server 2008.

Greetz,

Hennie