Dimensional Data Modeling
Related Blogs:
Data are usually captured and stored in tables in a database.
In most if not all scenarios, a very crucial goal is to be able to use, analyze and report using the data stored in these databases to help make good decisions.
Traditionally, transactional data are stored in normalized table structure to avoid data redundancy and improve data integrity by removing insertion anomaly, update anomaly and deletion anomaly. Normalized table structure usually means lots of table joins when we are trying to query (read) the data from these tables. These table joins usually add to latency of query results.
This means that querying a single table is usually faster than querying five tables that need to be joined. In order to improve the performance of table queries, we usually denormalize the tables. Denormalization is the opposite of normalization. In denomalization, we try to improve the query (read) performance by adding redundant copies of data or by grouping data into a single or smaller number of tables to reduce table joins.
Another approach to further improve query (read) performance is to design the structure of the tables in a database to allow for faster queries (reads). This structure is what we can call a dimensional structure and modeling the data this way is called dimensional data modeling or dimensional modeling.
The dimensional data structure is used mostly in data warehouses and it is in that context that we will discuss the dimensional modeling.
Dimensional Modeling Process
The most important aspect of dimensional modeling that are often neglected and not considered by data architects or data modelers is to have a good and practical understanding of the business or the domain.
I would like to mention this because without this understanding of the business or the domain, the dimensional data model may not fully address the requirements or will not be able to answer the important questions and decision points that the dimensional model was supposed to help answer.
As a data architect or data modeler, we should give ourselves time to study the business or the domain by talking to the business users and by research.
With that said, here are the basic steps in dimensional data modeling.
Step 1: Understand the business process or the domain
This is the important first step to go out and talk to the business stakeholders and business users. This involves discussing with the leadership on what questions they want answered and what information and insights they need to make the best decisions for their organizations.
We may not get the complete answers from the leadership but it will give us an idea what is important to the organization and to the leadership. This can be our minimum target and from this, we can think of other value-adding targets we can get from our dimensional data model (or our data warehouse).
We also need to talk to the business users and staff to understand how they use, consolidate and generate the current analysis and reports using the data. We will get a lot of practical insights on the actual challenges with the data from the people who actually use it.
Step 2: Define the subject area
Bill Inmon defines a data warehouse as a "subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
I would like to focus on the word "subject-oriented". For me, this means that in designing a dimensional model, we focus on a certain subject . Examples of subject areas in enterprise business domain is sales, orders, shipments, invoices, etc.
By explicitly defining a subject area, we can focus on the subject and be able to design the right dimensional model to answer the questions and address the requirements for the subject area from the business stakeholders (leadership and users\staff).
Step 3: Define the grain
We need to define the right level or granularity of the transactions we will track and store in the dimensional data model.
For example, we can define the grain to be the line item in the sales order line or the grain can be the number of items sold in a day or in a month.
The grain should be stated in one sentence.
Defining the grain at the lowest level or most granular means that a lot more transactions will be stored. It also means more flexibility in aggregating the data. For example, if we say that the grain will be the number of items sold in a day, we can get the monthly aggregation from the data tracked daily. On the other hand if we defined the grain to be the number of items sold monthly, then we can only get the monthly count and then probably the quarterly and annual but not able to drill down to daily or weekly count.
It will depend on your requirements and resources on how you define your grain but it is important to be aware of the implications of the choice of a grain.
Step 4: Identify the dimensions
The dimensions are the different entities we will aggregate on. Dimensions are typically nouns like product, customer, time, store, location and so on.
For example, if we have the product, customer, location and time dimension, we can aggregate or come up with the count of items sold to which customer by product category and what month.
It is usually a best practice to keep the dimensions "wide" i.e. dimension tables should have all the necessary attributes or fields to allow for several groupings or slice and dice.
We can also define hierarchies in dimensions. This allows for a more insights.
A very important consideration in dimensions is whether to keep history of previous data values or not. This is what we call slowly changing dimensions (SCDs) and there are several approach on how to handles these in dimensional modeling.
Here are some good references on SCDs:
Step 5: Identify the facts (or measures)
The facts or sometimes called measures are what is\are being aggregated. These are usually numeric but does not need to be.
For example, the sales order line item price amount is a measure and it is aggregated in the fact tables.
The fact tables are usually "thin" i.e. contains only the foreign key columns to the dimensions' primary key columns and the facts or measures attributes\column. The fact tables have the most records (which depends on the grain defined).
Step 6: Test the dimensional data model
Once the dimensional model has been completed and the actual tables (Facts, Dimensions) are generated off the model, it is always prudent to load some actual data into the dimension tables and to the fact tables and then try to answer the questions from business leadership and business users using the dimensional model.
Thoughts on future of dimensional data model
As further reading, here (link) is a good article from Ralph Kimball on the future of ETL, database technology and dimensional modeling.
Disclaimer
Any external linked site(s) referenced in this blog are owned by their respective owners and are not part of this blog. In addition, the author may at times will not be able to verify fully the information in these external sites.