For a customer of mine I’ve developed a star model with a fact table in the middle and a couple of dimensions. Some of the dimensions are small, some are not. The small dimensions read data from CSV files and the large dimension reads from XML files. I created ETL processes with SSIS which read from the XML and transform the information into the fact table and a couple of dimensions.
One of the dimensions is a customer table. Unfortunately, the customer cannot be uniquely identified by any means whatsoever. This places a burden on performance: the dimension table will grow at the same pace as the fact table (a very rapidly changing dimension ;-)). The number of transactions will be in the order of 20 to 30 million a year.
I have already taken (or proposed) the following measures:
• Creating integer surrogate keys.
• Creating (clustered) indexes on the surrogate keys of the dimension tables.
• Creating indexes on columns used to join other tables (foreign keys).
• Creating (covering) indexes on highly selective fields (returning 0–20% of the total rows) — see the sketch after this list.
• Creating indexes on fields used by ORDER BY clauses.
• Creating multiple filegroups for the indexes, logs and data for spreading them over RAID configurations.
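As a minimal sketch of these index measures, assuming a simplified DimCustomer dimension and FactSales fact table (these names and columns are illustrative, not the real model):

```sql
-- Assumed simplified schema: DimCustomer(CustomerKey, LastName, PostalCode, ...)
-- and FactSales(SalesKey, CustomerKey, OrderDate, Amount, ...).

-- Clustered index on the dimension's integer surrogate key.
CREATE CLUSTERED INDEX IX_DimCustomer_CustomerKey
    ON dbo.DimCustomer (CustomerKey);

-- Index on the fact table's foreign key, used when joining to the dimension.
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey);

-- Covering index on a highly selective field; the INCLUDE columns let the
-- query be answered from the index alone, without touching the base table.
CREATE NONCLUSTERED INDEX IX_DimCustomer_PostalCode
    ON dbo.DimCustomer (PostalCode)
    INCLUDE (CustomerKey, LastName);
```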
But there will be a need for future improvements in performance, and this blog post is about the different options available for further performance improvement. The basis for this is SQL Server. Hardware and network improvements are not taken into account. Maintenance scripts, the Profiler and the Index Tuning Wizard are also left for future blog posts. So what are the design options for further improvement?
Further improvement in design could be found in the following options:
1. I’m thinking about using the CHECKSUM function (yes, I know it’s not perfect) to aid searching for specific values in some large dimensions, because an index on a small integer checksum column can be searched quickly. During the ETL process I will add a checksum column for each heavily used search column in the customer table. When the front end queries the table it converts the search values into checksum values too, and a covering index on the checksum column can then satisfy the lookup. A sketch is shown below.
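A minimal sketch of this idea, assuming hypothetical LastName/FirstName columns on DimCustomer (the real columns and names will differ):

```sql
-- Persisted computed column holding the checksum of the searched fields.
ALTER TABLE dbo.DimCustomer
    ADD NameChecksum AS CHECKSUM(LastName, FirstName) PERSISTED;

-- Small covering index on the checksum column.
CREATE NONCLUSTERED INDEX IX_DimCustomer_NameChecksum
    ON dbo.DimCustomer (NameChecksum)
    INCLUDE (CustomerKey);

-- The front end (or an ETL lookup) computes the same checksum and still
-- re-checks the real columns, because CHECKSUM values can collide.
SELECT CustomerKey
FROM dbo.DimCustomer
WHERE NameChecksum = CHECKSUM(N'Jansen', N'Piet')
  AND LastName = N'Jansen'
  AND FirstName = N'Piet';
```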
2. Table partitioning. Table partitioning means nothing more than splitting a large table into smaller parts so that queries have to scan less data while retrieving (divide and conquer).
- Horizontal partitioning. As mentioned, 20 to 30 million rows per year are expected for my customer, so my advice would be to split the fact table and some dimensions into a yearly partition scheme. The basis for this could be a date column or an identity column (the surrogate key). A sketch follows after this list.
- Vertical partitioning. Suppose we have seen that some columns in a table (fact or dimension) are frequently used and some are infrequently used. We can split this table into a frequently used table and a less used table, each holding a subset of the former table's columns. This should be done carefully, because when fields are needed from both tables serious performance degradation has to be taken into account.
Putting these partitions on separate disks (filegroups) will further enhance performance.
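A sketch of yearly horizontal partitioning of the fact table on a date column, assuming filegroups FG2011 through FG2014 already exist (the names, years and columns are illustrative only):

```sql
-- Partition function: one partition per year, boundaries on 1 January.
CREATE PARTITION FUNCTION pfYearly (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

-- Partition scheme: map each yearly partition to its own filegroup.
CREATE PARTITION SCHEME psYearly
    AS PARTITION pfYearly
    TO (FG2011, FG2012, FG2013, FG2014);

-- The fact table is created (or rebuilt) on the partition scheme.
CREATE TABLE dbo.FactSales
(
    SalesKey    bigint NOT NULL,
    CustomerKey int    NOT NULL,
    OrderDate   date   NOT NULL,
    Amount      money  NOT NULL
) ON psYearly (OrderDate);
```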
3. Partitioned views. A partitioned view joins horizontally partitioned data from a set of member tables across one or more servers, making the data appear as if it comes from one table. A view that joins member tables on the same instance of SQL Server is a local partitioned view (BOL). This could be an option when one server is not enough anymore. With multiple servers it’s possible to approach this in a federated way: the view is used as if it comes from one server, but underneath it is based on tables from different servers or virtual environments. A sketch of a local partitioned view is shown below.
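A minimal sketch of a local partitioned view with two yearly member tables (names and years are hypothetical); the CHECK constraints on the date column let the optimizer skip member tables that cannot contain the requested rows. For a distributed partitioned view the SELECT statements would reference the member tables through linked servers instead.

```sql
-- One member table per year, each constrained to its own date range.
CREATE TABLE dbo.FactSales2013
(
    SalesKey  bigint NOT NULL PRIMARY KEY,
    OrderDate date   NOT NULL
        CHECK (OrderDate >= '2013-01-01' AND OrderDate < '2014-01-01'),
    Amount    money  NOT NULL
);

CREATE TABLE dbo.FactSales2014
(
    SalesKey  bigint NOT NULL PRIMARY KEY,
    OrderDate date   NOT NULL
        CHECK (OrderDate >= '2014-01-01' AND OrderDate < '2015-01-01'),
    Amount    money  NOT NULL
);
GO

-- The view unions the member tables so they appear as one table.
CREATE VIEW dbo.FactSalesAll
AS
    SELECT SalesKey, OrderDate, Amount FROM dbo.FactSales2013
    UNION ALL
    SELECT SalesKey, OrderDate, Amount FROM dbo.FactSales2014;
```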
This is it for now.