zondag 27 november 2011

Handling NULL in dimensions and measures

A small post about 'Unknown' and 'Not Available' values in datawarehouse solutions. I've a requirement of my customer about this issue. The customer wants to see whether a value of a field in the sourcetable is empty (= NULL) or whether the value is not present in the dimension. There is a distinction between the two types of mismatching. The same problem is there with measures. They also want to see whether a measure value is NULL or whether it has a valid value. This way it's possible to monitor the dataquality of the source in a more sophisticated manner.

Normally you create one 'empty' row in your dimension called 'Unknown' (for handling NULLs and false dimension key lookups) and this a good solution under normal circumstances.But as said in my introduction, a more refined solution is needed. In my opinion the following distinction can be made:
  • Dimension.
    • -2 , "Not available" like in NULL. There is no value present in the source. In case of date dimension use a value like 1898, 1899 or 2098,2099 as a default value for unknown and/or not avaliable dates. In some sourcesystems i've seen that a '' (two single quotes) is implemented. You could include this check in the same logic for the "Not available" code too.
    • -1 , "Unknown" when the application key is not present in the dimension application lookup field. In this case there is a value present but it is not (yet) present in the dimension.
  • Measure.
    • When not filled in, use 0 because this won't bother aggregations like summing.
    • Use an extra indicatorfield <field>_AvailableYN to indicate whether the measure is available or not (NULL). Create a DD dimension and drop this field in here. 
Below a diagram is shown about a dimension lookup scenario. One NOT AVAILABLE and one UNKNOWN scenario.


When the column in the sourcetable is NULL it will insert a -2 into the fact table. When a value is present in the column in the source but the dimension lookup failed an unknown value (-1) is inserted in the fact table.


Powerusers should be aware of these choices when they build reports.If they are used to 'UNKNOWN' values, now they have include some logic for the NOT AVAILABLE values in the reports too.

An interesting article about dummy values can be downloaded here

Greetz,
Hennie

3 opmerkingen:

  1. I did not get the idea about the UNKNOWN value. If you don't have this customercode in your dimension table (it is new code), it should be loaded into this dimension in the next load? You probably want load this dimensions and the fact tables at the same rhythm?

    BeantwoordenVerwijderen
  2. This problem seems more present in a disparate environment with multiple source systems (like i'm involved with).

    It also depends on the strategy on how to handle 'late arriving dimensions'. In my case i'll update the key of the fact with the new customerid.

    Hope that this answers your questions.

    Hennie

    BeantwoordenVerwijderen
  3. Hi Hennie,

    This works fine for a DWH with SQL on it so you can manage the different situations (show or not show the unknown ones) but doesn't work well in an olap cube. I have a fact with a dimension field that's sometimes NULL, that's also correct in my case. In the cube, I don't want to show the measures on an unkown member when I select a dimension attribute.

    BeantwoordenVerwijderen