For all of you joining the world of analytic’s, I would like to share a small tutorial on star schema that you can use to optimize your analytic queries and data storage in databases. Although its one of the most used dimensional modelling techniques I would like to reenforce that this data organization architecture is not just for data warehousing as you can use it as a reference model to create for example powerpivot models based in excel files, that will be this way faster and more optimized.
For this post I have decided to follow a problem/solution approach.
The current scenario is based in a tabular architecture that is composed by tables that store all the data needed for a specific report. This will cause a massive redundancy of data which is critical in terms of storage occupation and not sustainable at mid-long term.
Although BI Data warehouses don’t use a 3rd form normalized format because it is optimized for transaction storage and this way is not oriented for data analysis (The amount of joins and calculations needed to retrieve data, makes report executions take too long), there are mid term models that resolve both the data redundancy issue and the data retrieval one.
Before looking at the solution is important to have a sense on what the current architecture causes in our database environment. If we look at the following table which simulates a simple sales table we can image what the current scenario causes in the space occupied by the records:
- Lets change our tabular model to a dimensional model like the star schema suggested by Ralph Kimbal.
- This will allows us the have a more scalable solution, oriented for data analysis which will reduce the amount of space occupied by our data warehouse while we increase the performance of the our ETL process which can now be separated in two stages: dimensions (perspectives of analysis) which some can be updated daily and facts (business events like transactions) which can be updated hourly.
If you look at the previous sales table example when applying the star schema model, we should have the following result:
Illustration 2: Example Star Schema
As you can see in the previous figure, although we can joins, they are “controlled” because they are optimized for just one join to retrieve data from fact to dimension tables.
More even if you want to add new fields to a specific perspective of analysis you just need to add it to the dimension and update only that dimension and not all records of the fact table.
Read more about the star schema: