This post will explain the following:
- Slow changing dimensions (SCD0,SCD1,SCD2 and SCD3)
- Inferred Members
- How to configure a SCD component in a data flow using SSIS2012
Slow-Changing Dimensions are a special data warehouse modeling technique, that allow you to manage how your ETL process will resolve updates and inserts into a dimension, so that you can tell your packages what to do if an attribute value changes.
There are currently three accepted SCD types (SSIS SCD component doesn’t support SCD3) :
- SCD1 in which you will overwrite an atribute value if the record has changed (I don’t need historical data );
- SCD2 in which you will create a new record whenever an atribute value changes (I want to keep all historical changes);
- SCD3 in which you create N columns in a dimension to keep N changes in it. (I just need the last N changes recorded).
Some authors refer another type, when no change (fixed atribute) is applied to an atribute, the SCD type is 0. I will assume a forth type as being SCD0.
For some time, the Integration Software available in the market didn’t support any king of wizard or automation while developing SCD ETL packages which means that if you wanted to handle a Slow-Changing dimension inside your packages, you had to create the whole data flow logics to handle it, which its not anything bad at all for me, because after you make it a few times, you will see that its always the same thing.
However SSIS gives you the SCD Component inside data flows so that you can use a wizard to help you creating the flow for handling SCD, however there are two important notes for you:
- SCD Component doesn’t support SCD3, so, if you want to use it you need to develop the flow yourself.
- If you want to use SCD2, you have to create two new columns in your dimension which are the begin date of the record life (SCD_DT_BEGIN, for example) and the end date of the record life, which represents the time in which a new record with different atribute values was identified in your ETL process, (SCD_DT_END, for example)
Another important thing for you to know, if you aren’t used to use SCD dimensions is that in the same Dimension you can have SCD1 atributes, SCD2 atributes and even SCD3 atributes, because when a change occurs in a specific attribute, you might not need to record it (SCD2) but you might want to update its value. These requirements will depend from businesses to business.
This way, lets start making an example to show you how easy it is to use the wizard inside SSIS2012. I will show you all available options inside it, using all three SCD types in a single dimension. My scenario is based on a football referee operacional system which will lead to and SCD dimension called DimReferee. Both are very simple dimensions with few atributes.
The source table is as follows:
Our DimReferee is structured as follows: (As you can see I have already added the start and end date atributes which will be used to flag if a SCD2 record is active or old. As I have already told you SSIS SCD component allows you to use a Status column as well)
Now that both columns are created, lets go to Integration Services. The first task is adding the SCD component to the data flow of your Dimension. The following figure shows you how it should look like. Our example package will be very simple, just a Source component to read the referee data from the source, the flow generated with the SCD component and a destination component to send our data do the Dimension.
To start configuring the SCD component double click it to open a Wizard that will help you configuring you Slow Changing Dimension. Once you do it, SSIS will show you an initial screen as the following. Hit the next button on it.
Once you do it the configuration wizard will show you one the most important screens. This is where you define the mappings between source and destination atributes, and the Business Key which will be used to look up the record in the source table. As you can also see you need to define the connection manager for the Dimension and the table you want to transform in a SCD dimension.
For our example lets map all atributes as follows and choose the id_referee (Primary key of the Referee table) as our business key. The start and end date will stay blank as they will be configured in a later screen.
After hitting the Next button we enter the most important screen in the SCD point of view. It is in this screen that you define the type of Slow Changing Dimension for each atribute mapped in the first screen. You have three available options:
- Fixed Attribute (SCD0)
- Changing Attribute (SCD1)
- Historical Attribute (SCD2)
Has I have told you before SCD3 is not supported in the SSIS component. In this example I will use all of the types, so when a referee name changes I don´t want to do anything because for my business I will assume that his name won´t change (SCD0). However when a referee marriage status changes I want to overwrite its value to the new one (SCD1) and when his address changes I want to save the old value so its need to be an Historical attribute (SCD2).
Once you click in the Next button SSIS will prompt you a new screen, so that you can choose what should it do, when an attribute value defined as Fixed changes and if you want to update even the SCD2 historical records if a type SCD1 value changes. In the first case you can tell SSIS to raise an error if a defined SCD0 atribute changes its value and the change is detected. In the second case if an SCD1 attributes changes its value, you need to define if you just want to overwrite its value in the most recent record or in all of them (Historical ones).
The last case only makes sense if you have an SCD2 attribute in the dimension. In our example lets raise an error if the referee name changes, has if it happens we might have misunderstood the business and it might need adjustments. In the second check box lets leave it blank as we just want to update the active record only.
The Next screen is where you define if you will use a Status column to indicate the current (most recent/active/alive) record or if you prefer to use a time interval. As I have already told you, I prefer to use the dates, as they allow me to now in which period did that record live. This way lets use our start_date and end_date attributes as a flag to the current record. We also need to indicate which type of variable are we going to use to populate the dates. To do so you have three options:
- ContainerStartTime – The start time of the container in which the package is in.
- CreationDate – Date of creation of the package
- StartDate – The start date of execution of the package*
I prefer the last option as it is the most near of the SCD flow in the package, reflecting the most recent date. However for this example I left the CreationDate. After that click in the Next button.
The next screen is delicate. An interesting concept related to data warehouses using SSIS and SQL Server is an inferred member which is a record from your dimension, referenced by fact tables, however the record is not yet loaded to the dimension. This means that a particular record can the referenced without existing. Might occur if for some reason you need to load facts without having the dimension record data yet, so you create a Blank/Null record in a dimension, that won’t have any data, just its primary key. The next screen allow you to tell SSIS what to do when the data for that inferred member arrives, either update the record or creating a new one. If you activate the inferred member support you have two options:
- All columns with a change type are null – This means that for every column in a inferred member record that is configured with and SCD type, SSIS will set its values to null.
- Use a boolean column to indicate whether the current record is an inferred member – In this case SSIS will use a column created by you, to flag if the current record is a inferred member.
Once you click next you are done. SSIS will now automatically create the flow for you, which as you can see in the following figure is a lot of time saved with transformations and components configuration. However if there are further transformations such as data types conversions, you still need to do them. For instance in our example we still need to convert some data types before running our package.
In other to have a complete usable data flow , you have to add some conversion components to respect data types, but i am pretty sure that for that you won’t need my help. However I leave you a final advice, to avoid this situation make sure your data is converted before using the SCD component. something like the following:
Adding that conversion component to ensure data types consistency, your data flow should now look like the following:
Now, run the package and test yourself.