Agile, Data Modeling, Data Warehouse, Database, testing

How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.

“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”

Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.

In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.

After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.

The Opportunity

My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load the Data Warehouse. The database technology that was selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200+ main tables that the ETL process was loading.

One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple frameworks that existed, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set up and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to created automated tests for ETL applications.

Serendipity

In my time as a Data Modeler/DBA I became very good at writing queries to read the Data Dictionary tables of the database itself to generate SQL statements to then be executed. Then was the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?

The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.

The Solution

Our solution contained the following elements:

  • tSQLt  source framework for the automated testing framework
  • Data Mapping Spreadsheet that defined the Column to Column mapping
  • 4 custom tables that contain information from the Data Mapping Spreadsheet
  • Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests

4 Custom Tables

schema

ETL Automated Tests

With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:

Table to Table – Every Table

  • TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables.
  • TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.

Column – Every Column

  •  TstColumn­DataMapping: Compares columns directly assigned from a source column on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­ConstantMapping: Compares columns assigned a constant on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­NullMapping: Compares columns assigned a Null value on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnTransformedMapping: Compares transformed columns on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

The Results

By creating these 4 custom tables and stored procedures we are now able to generated 10’s of thousands tests nightly. More importantly, we are able to have these tests be flexible to schema changes as the tests are generated by reading the INFORMATION_SCHEMA  and 4 custom tables. A large part of generating our tests is now data driven.

 

About Terry Bunio

Terry Bunio is passionate about his work as the Manager of the Project Management Office at the University of Manitoba. Terry oversees the governance on Information Technology projects to make sure the most important projects are being worked on in a consistent and effective way. Terry also provides leadership on the customized Project Methodology that is followed. The Project Methodology is a equal mix of Prince2, Agile, Traditional, and Business Value. Terry strives to bring Brutal Visibility, Eliminating Information islands, Right Sizing Documentation, Promoting Collaboration and Role-Based Non-Consensus, and short Feedback Loops to Minimize Inventory to the Agile Project Management Office. As a fan of pragmatic Agile, Terry always tries to determine if we can deliver value as soon as possible through iterations. As a practical Project Manager, Terry is known to challenge assumptions and strive to strike the balance between the theoretical and real world approaches for both Traditional and Agile approaches. Terry is a fan of AWE (Agile With Estimates), the Green Bay Packers, Winnipeg Jets, and asking why?

Discussion

3 thoughts on “How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

  1. Awesome, nicely done! I’d love to see a sample of one of the tests, and how you integrated the tests into your CI cycle.

    Posted by eselje | January 6, 2014, 2:56 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: