I follow discussions about data now and then. Some people say: “Well, data is like water; it flows through your organization and people will use it as they like.” It’s a nice analogy, but it’s too simple, because data comes in all kinds of forms: simple, complex, float, text, etc. So data flows through your organization in small, big, round or rough pieces: small pieces for the simple business processes and complex data for complex analytical questions. In short, data can take all kinds of forms and have all kinds of characteristics. Let’s describe some more characteristics of data:
• Data is a representation of the world around us at some moment in time. Some data will be accurate (temperature readings) and some won’t (human entry systems). So, above all, don’t expect data to be an error-free representation of the world around us. It will always have flaws. This is what we call data quality: how accurately does the data describe the world around us?
• Data has a life cycle. Data is created, updated, expired, renewed, destroyed, backed up, etc. In a data warehouse we mostly keep all the data for years at a detailed level. Why? Well, because we can, because we never know whether we are going to need it, because we don’t know how we are going to use it (so we keep the lowest grain), because disk space is cheap, etc. All of these are reasons to keep the data in your systems.
Data about yesterday’s fashion sales in a fashion shop can be very important. Fashion sales for a particular day a couple of months ago will be less important. Maybe data should be aggregated more as it gets older. Below you can see a couple of data life cycle curves plotted against importance.
• Data can be very valuable at a moment in time. Its value depends on the success of the actions that an organization, department or person takes based on it. Data is only valuable when it’s used. If it’s not used, it doesn’t represent any value.
• Data can be aggregated to a higher level or it can be kept at a low level (grain).
• Data describes something: it can describe things (master data) or it can describe events (transactional data).
• Data will be less valuable when its source is unclear or difficult to determine. Also, when the data is ‘scrambled’ during ETL processes, users tend to question it and may feel unsure whether the data is correct.
• Data can be objective or subjective. It is objective when it’s trivial, like the address of a customer, but it becomes less objective when it is processed for business decisions. Data quality is mostly mentioned in connection with operational data, like a customer’s address, but when data is aggregated to a higher level, the ‘data quality’ of the resulting indicator is also important. In that case the business rules should be judged for correctness and adapted where needed.
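To make the idea of aggregating data more as it gets older a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the function name `aggregate_by_age`, the 30-day cutoff, and the made-up fashion sales rows. It keeps recent sales at the lowest grain (one row per sale) and rolls older sales up to one total per month and product.

```python
from collections import defaultdict
from datetime import date


def aggregate_by_age(sales, today, detail_days=30):
    """Keep recent sales at day grain and roll older sales up to month grain.

    `sales` is a list of (sale_date, product, amount) tuples.
    `detail_days` is an assumed cutoff for illustration, not a recommendation.
    """
    recent = []                    # detailed rows, lowest grain
    monthly = defaultdict(float)   # (year, month, product) -> total amount
    for sale_date, product, amount in sales:
        age_in_days = (today - sale_date).days
        if age_in_days <= detail_days:
            recent.append((sale_date, product, amount))
        else:
            monthly[(sale_date.year, sale_date.month, product)] += amount
    return recent, dict(monthly)


# Hypothetical sales of a fashion shop.
sales = [
    (date(2024, 6, 14), "jeans", 80.0),
    (date(2024, 6, 14), "shirt", 25.0),
    (date(2024, 2, 3), "jeans", 80.0),
    (date(2024, 2, 17), "jeans", 75.0),
    (date(2024, 2, 17), "coat", 150.0),
]

recent, monthly = aggregate_by_age(sales, today=date(2024, 6, 15))
print(recent)   # the two sales of 14 June stay at day grain
print(monthly)  # the February sales are summed per month and product
```

Note that the roll-up is lossy: once the February rows are summed, you can no longer ask what was sold on a particular day. That is exactly the trade-off between keeping the lowest grain and aggregating.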
Regards,
Hennie