I’ve recently completed a data modeling initiative on a major project. After doing this I’ve come to two major conclusions:
- The coverage area in Insurance is probably the most devious and twisted area of data that I have ever modeled.
- Dimensional Modeling should be done on every model to ensure you can simply model the domain.
Why Dimensional Model?
For the model I worked on, there are over 261 tables in the Relational Model. In the Dimensional Model, there are only 25. The process to distill a Relational Model to a Dimensional one is not easy, but the process highlighted problems in my Relational Model that I was unaware of. These issues would have been exposed later during testing or post-production, but being able to identify them in development was extremely valuable.
So what is Dimensional Modeling? although most people have been exposed to the rules of Data Normalization, not as many people have been exposed to Dimensional Modeling. I will try to explain my opinion of Dimensional Modeling, but if you are interested I would highly recommend ‘The Star Schema Reference’ by Chris Adamson. Simply put, it is the most complete and concise book ever written on Dimensional Modeling. Dimensional Modeling is usually done to create a model for a Data Warehouse. This Data Warehouse can then be used for Operational and Analytic reporting. But the process of Dimensional Modeling does not need to be limited to a Data Warehouse. The process itself has value to validate the Relational Model.
The term that is commonly used to refer to a Dimensional Model is a ‘Star Schema’ model. This is due to the fact that most Dimensional Models look like a star. (With the Facts in the centre and the Dimensions around the Fact)
Dimensional Modeling is the process of taking a Relational Model in some normal form and being able to distill the many tables down to the business objects that the business works with. In most cases this would result in single tables that correspond to objects like:
- Sales Rep
- Sales Transaction
These business objects fall into two categories:
Facts – These are Facts about the business. They typically correspond to events or transactions and have metrics or measures that the business is concerned about. In many cases these fact tables are what the business wants to report on.
Dimensions – These are Dimensions to the Facts. They typically correspond to business objects that interact with Facts. These Dimensions represent how the business describes these objects and how they like to slice, dice, filter, sort, group, and order the Facts. A standard Dimension that occurs in all Dimensional Models that does not correspond to a business object is Time. Time is a dimension for almost all business events and transactions.
Initially, the act of creating these Facts and Dimensions is a large task of de-normalization. Although this sounds like a simply task, it is anything but. A Relational Model with hundreds of tables must be distilled down to a handful of tables. Many to Many relationships must be transformed into simple relationships that are easy to understand and report against. A natural key must also be defined that uniquely identifies a row in the Fact and Dimension. Although this sounds easy, you may find that some Dimensions don’t actually have a natural key. This is a clue that the Dimensional Model needs more attention.
A Dimensional Model also places the modeling of history at the forefront. A lot of attention is spent ensuring that data can accurately represent the changes that occur over time. Most of the time, a Relational Model is primarily concerned with current state.
Note: This is a gross over-simplification, if this is interesting to you I would recommend the book by Chris Adamson on all of the theory and complexity behind Dimensional Modeling. Or you can just start Googling!
The act of creating a Dimensional Model and creating the 25 Facts and Dimensions highlighted inconsistencies and duplication in my model. It also challenged my understanding of the data domain. It raised questions that I realized I didn’t have the answer for. It is easy for inconsistency and errors to hide in data model with 200+ tables. A data model that has 10-30 tables has nowhere to hide. It has the brutal transparency so valuable in Agile.