Sunday, February 21, 2010

Gestalt psychology

Today I want to discuss a principle of psychology that was the subject of a cognitive psychology course during my studies at university: ‘Gestalt’ psychology, or Gestaltism (German: Gestalt, the essence or shape of an entity’s complete form). It is a theory of mind and brain positing that the operational principle is holistic, parallel and analog, with self-organizing tendencies. The Gestalt effect refers to the form-forming capability of our senses. You can apply the Gestalt principles in the design of your reports, dashboards or whatever aid you are using when presenting your information.

Below you can see some examples of Gestalt principles:

Reification
Reification is the constructive or generative aspect of perception, by which the experienced percept contains more explicit spatial information than the sensory stimulus on which it is based.

For instance, a triangle will be perceived in picture A, although no triangle has actually been drawn. In pictures B and D the eye will recognize disparate shapes as "belonging" to a single shape, in C a complete three-dimensional shape is seen, where in actuality no such thing is drawn.


Emergence
Emergence is the process of complex pattern formation from simpler rules. It is demonstrated by the perception of the Dog Picture, which depicts a Dalmatian dog sniffing the ground in the shade of overhanging trees. The dog is not recognized by first identifying its parts (feet, ears, nose, tail, etc.) and then inferring the dog from those component parts. Instead, the dog is perceived as a whole, all at once.


Pragnanz
The fundamental principle of gestalt perception is the law of prägnanz (German for pithiness) which says that we tend to order our experience in a manner that is regular, orderly, symmetric, and simple.


On the internet you can find all kinds of these examples, and they are quite interesting. Some examples are usable in your reports and some are not. For instance, the example above is quite useful: by using colors, boldness and italics you can implicitly suggest a relation between pieces of information.

By grouping information that belongs together and separating information that does not, you can increase readability.
 
Greetz,
Hennie
 
 

Monday, February 15, 2010

Synchronize connection strings

In SSIS you have the possibility to create data sources in your project. These are design-time, project-level connection strings for connecting to your data sources. So this seems like a good idea: define once and reuse multiple times in your packages. So when I start a project I create data sources, and in the connection manager box (below) I select “New Connection from Data Source”. Great! Perfect! This is what I want: I create the data sources at project level and use them in the packages whenever I need them. Perfect!

But the trouble begins when you start using package configurations. Mostly I create package configurations (dtsConfig files) and use an environment variable for indirect access. When you start changing the package configuration files by hand, you will receive an error when opening the specific package that uses this configuration: “Synchronize connection strings” (Grrr). The old connection string is the one changed in the file, and the new one is my old connection string. Awkward! The other way around seems more logical to me.
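For context: a dtsConfig file is plain XML, and the part you typically edit by hand is the ConfiguredValue element. A minimal sketch (the connection manager name, server and database here are invented for illustration):

```xml
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Connections[SourceDB].Properties[ConnectionString]"
                 ValueType="String">
    <ConfiguredValue>Data Source=DEVSERVER;Initial Catalog=Staging;Provider=SQLNCLI10.1;Integrated Security=SSPI;</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```

Editing that ConfiguredValue by hand is exactly the change that triggers the synchronization warning when the package also references a project data source.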

Some info from BOL:

“When you add a data source to a package by using SSIS Designer, SQL Server Integration Services adds a connection manager that can be resolved as an OLE DB connection at run time, sets the connection manager properties, and adds the connection manager to Connections collection of the package. To associate the data source and the connection manager, Integration Services sets a property, DataSourceID. Because a data source is not created within the context of a package, multiple packages can use the same data source”
I searched the internet for a solution for this, but I couldn’t find anything better than: “Don’t use data sources”. Great! This is not the answer I wanted, but I didn’t find any better solution, so here is my best practice:

1. The capability of using data sources throughout your project is very handy. So at first you could create and use them until your project reaches a certain maturity level. After this, delete the data sources.

2. Create different sets of configurations for Development, Test, Acceptance and Production.

3. Use batch files for creating the environment variables (SETX XXXX_Config D:\XXXX\Configurations\XXXXXX.dtsConfig /M) for Development, Test, Acceptance and Production (DTAP).
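The batch files in step 3 are one-liners; a sketch for the development environment (the variable name and paths are the placeholders from above, and the per-environment file name suffix is my own assumption):

```bat
REM set_dev_config.bat - run elevated; /M writes a machine-level (system) variable
SETX XXXX_Config "D:\XXXX\Configurations\XXXXXX_DEV.dtsConfig" /M
```

One copy per environment (DEV, TST, ACC, PRD) with the matching dtsConfig path is all that differs between the four files.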

Greetz,
Hennie

Saturday, February 13, 2010

Performance improvement options

For a customer of mine I’ve developed a star model with a fact table in the middle and a couple of dimensions around it. Some of the dimensions are small, some are not. The small dimensions read data from CSV files and the large dimension reads from XML files. I created ETL processes with SSIS that read from the XML and transform the information into the fact table and a couple of dimensions.

One of the dimensions is a customer table. The customer cannot be uniquely identified by any means whatsoever, unfortunately. This will place a burden on performance. This dimension table will grow at the same pace as the fact table (a very rapidly changing dimension ;-)). The number of transactions will be in the order of 20 to 30 million a year.

I already took the following measures (or proposed to):
• Creating integer surrogate keys.
• Creating (clustered) indexes on the surrogate keys of the dimension tables.
• Creating indexes on columns used to join other tables (foreign keys).
• Creating (covering) indexes on highly selective fields (returning 0-20% of the total rows).
• Creating indexes on fields used by ORDER BY clauses.
• Creating multiple filegroups for the indexes, logs and data for spreading them over RAID configurations.
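In T-SQL, the index measures above boil down to statements like the following (table and column names are invented for illustration):

```sql
-- Clustered index on the dimension's integer surrogate key
CREATE CLUSTERED INDEX IX_DimCustomer_CustomerKey
    ON dbo.DimCustomer (CustomerKey);

-- Index on the fact table's foreign key, used when joining to the dimension
CREATE INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey);

-- Covering index on a highly selective field, INCLUDE-ing the columns the query needs
CREATE INDEX IX_DimCustomer_PostalCode
    ON dbo.DimCustomer (PostalCode)
    INCLUDE (CustomerName, City);
```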

But there will be a need for future improvements in performance, and this blog is about the different options available for further performance improvement. The basis for this is SQL Server. Hardware and network improvements are not taken into account. Also, maintenance scripts, the Profiler and the Index Tuning Wizard are left for future blogs. So what are the design options for further improvement?

Further improvement in design could be found in the following options:

1. I’m thinking about using the CHECKSUM function (yes, I know it’s not perfect) to aid searching for specific values in some large dimensions. Creating an index on such a column could be fast. So during the ETL process I will create multiple checksum columns for highly used columns in the customer table. When the front end queries the table, it will convert the search values into checksum values too. Using these in a covering index could be fast.
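A sketch of this idea in T-SQL (names are made up; because CHECKSUM can produce collisions, the query must still compare the original column):

```sql
-- Persisted computed checksum column, maintained automatically by SQL Server
-- instead of in the ETL process
ALTER TABLE dbo.DimCustomer
    ADD CustomerNameChecksum AS CHECKSUM(CustomerName) PERSISTED;

CREATE INDEX IX_DimCustomer_NameChecksum
    ON dbo.DimCustomer (CustomerNameChecksum);

-- Lookup: seek on the small integer checksum, then verify the real value
SELECT CustomerKey, CustomerName
FROM dbo.DimCustomer
WHERE CustomerNameChecksum = CHECKSUM('John Doe')
  AND CustomerName = 'John Doe';
```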

2. Table partitioning. Table partitioning means nothing but splitting a large table into smaller tables, so that queries have to scan less data while retrieving rows (divide and conquer).
  • Horizontal partitioning. We expect 20 to 30 million rows per year for my customer, so my advice could be to split the fact table and some dimensions according to a yearly partition scheme. The basis for this could be a date column or an identity column (the surrogate key).
  • Vertical partitioning. Suppose we have seen that some columns in a table (fact or dimension) are frequently used and some are infrequently used. We can split this table into a frequently used table and a less used table, each with a subset of the former table’s columns. This should be done carefully, because when fields are needed from both tables, serious performance degradation has to be taken into account. Putting them on separate disks will enhance performance.
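For horizontal partitioning by year, a sketch in SQL Server 2005+ syntax (boundary dates, filegroup and table names are placeholders):

```sql
-- One partition per year, split on a date column
CREATE PARTITION FUNCTION pfYearly (datetime)
    AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');

-- Map partitions to filegroups (one more filegroup than boundary values)
CREATE PARTITION SCHEME psYearly
    AS PARTITION pfYearly TO (FG2007, FG2008, FG2009, FG2010);

-- Create the fact table on the partition scheme
CREATE TABLE dbo.FactSales (
    SalesKey    int IDENTITY,
    CustomerKey int,
    OrderDate   datetime,
    Amount      money
) ON psYearly (OrderDate);
```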

3. Partitioned views. A partitioned view joins horizontally partitioned data from a set of member tables across one or more servers, making the data appear as if it comes from one table. A view that joins member tables on the same instance of SQL Server is a local partitioned view (BOL). This could be an option when one server is not enough anymore. If you have multiple servers, it’s possible to approach this in a federated way: the view is used as if it comes from one server, but underneath, the view is based on tables from different servers or virtual environments.
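A sketch of a local partitioned view (invented names; note that the partitioning column must be part of the primary key, and the CHECK constraints let the optimizer skip member tables it does not need):

```sql
-- Member tables, each holding one year of data
CREATE TABLE dbo.FactSales2009 (
    SalesKey  int NOT NULL,
    OrderDate datetime NOT NULL
        CHECK (OrderDate >= '2009-01-01' AND OrderDate < '2010-01-01'),
    Amount    money,
    PRIMARY KEY (SalesKey, OrderDate)
);

CREATE TABLE dbo.FactSales2010 (
    SalesKey  int NOT NULL,
    OrderDate datetime NOT NULL
        CHECK (OrderDate >= '2010-01-01' AND OrderDate < '2011-01-01'),
    Amount    money,
    PRIMARY KEY (SalesKey, OrderDate)
);

-- The partitioned view; querying it looks like querying one table
CREATE VIEW dbo.vFactSales AS
    SELECT SalesKey, OrderDate, Amount FROM dbo.FactSales2009
    UNION ALL
    SELECT SalesKey, OrderDate, Amount FROM dbo.FactSales2010;
```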

This is it for now.