Star Schema 101

For all of you joining the world of analytic’s, I would like to share a small tutorial on star schema that you can use to optimize your analytic queries and data storage in databases. Although its one of the most used dimensional modelling techniques I would like to reenforce that this data organization architecture is not just for data warehousing as you can use it as a reference model to create for example powerpivot models based in excel files, that will be this way faster and more optimized.

For this post I have decided to follow a problem/solution approach.

Problem:

The current scenario is based in a tabular architecture that is composed by tables that store all the data needed for a specific report. This will cause a massive redundancy of data which is critical in terms of storage occupation and not sustainable at mid-long term.

Although BI Data warehouses don’t use a 3rd form normalized format because it is optimized for transaction storage and this way is not oriented for data analysis (The amount of joins and calculations needed to retrieve data, makes report executions take too long), there are mid term models that resolve both the data redundancy issue and the data retrieval one.

Before looking at the solution is important to have a sense on what the current architecture causes in our database environment. If we look at the following table which simulates a simple sales table we can image what the current scenario causes in the space occupied by the records:

 

Seller

Product Store Country Date Qt. Sold

Anthony Smith

Bike Megastore Nigeria 10-08-2014 1

Anthony Smith

Helmet Megastore Nigeria 10-08-2014 2
Anna Lee Bike Megastore Nigeria 12-08-2014

2

Anna Lee Wheel Megastore Nigeria 12-08-2014

1

Anna Lee Helmet Megastore Nigeria 12-08-2014

2

 

Solution:

  1. Lets change our tabular model to a dimensional model like the star schema suggested by Ralph Kimbal.
  • This will allows us the have a more scalable solution, oriented for data analysis which will reduce the amount of space occupied by our data warehouse while we increase the performance of the our ETL process which can now be separated in two stages: dimensions (perspectives of analysis) which some can be updated daily and facts (business events like transactions) which can be updated hourly.

If you look at the previous sales table example when applying the star schema model, we should have the following result:

ss

Illustration 2: Example Star Schema

As you can see in the previous figure, although we can joins, they are “controlled” because they are optimized for just one join to retrieve data from fact to dimension tables.

More even if you want to add new fields to a specific perspective of analysis you just need to add it to the dimension and update only that dimension and not all records of the fact table.

Read more about the star schema:

 

Cheers

 

 

 

Anúncios

Deixe uma Resposta

Preencha os seus detalhes abaixo ou clique num ícone para iniciar sessão:

Logótipo da WordPress.com

Está a comentar usando a sua conta WordPress.com Terminar Sessão / Alterar )

Imagem do Twitter

Está a comentar usando a sua conta Twitter Terminar Sessão / Alterar )

Facebook photo

Está a comentar usando a sua conta Facebook Terminar Sessão / Alterar )

Google+ photo

Está a comentar usando a sua conta Google+ Terminar Sessão / Alterar )

Connecting to %s