The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.
“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”
Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.
In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.
After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.
My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load the Data Warehouse. The database technology that was selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200+ main tables that the ETL process was loading.
One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple frameworks that existed, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set up and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to created automated tests for ETL applications.
In my time as a Data Modeler/DBA I became very good at writing queries to read the Data Dictionary tables of the database itself to generate SQL statements to then be executed. Then was the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?
The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.
Our solution contained the following elements:
- tSQLt source framework for the automated testing framework
- Data Mapping Spreadsheet that defined the Column to Column mapping
- 4 custom tables that contain information from the Data Mapping Spreadsheet
- Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests
4 Custom Tables
ETL Automated Tests
With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:
Table to Table – Every Table
- TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables.
- TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
- TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.
Column – Every Column
- TstColumnDataMapping: Compares columns directly assigned from a source column on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
- TstColumnConstantMapping: Compares columns assigned a constant on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
- TstColumnNullMapping: Compares columns assigned a Null value on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
- TstColumnTransformedMapping: Compares transformed columns on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
By creating these 4 custom tables and stored procedures we are now able to generated 10’s of thousands tests nightly. More importantly, we are able to have these tests be flexible to schema changes as the tests are generated by reading the INFORMATION_SCHEMA and 4 custom tables. A large part of generating our tests is now data driven.