A Quick Checklist for Testing a Data Warehouse
When the Left Hand Doesn’t Know What the Right Hand is Doing.
As organizations grow, they realize that the disparate systems in which they store content and data are not an effective basis for business decisions, and they often turn to an integrated approach.
This is where a data warehouse comes in. Created to meet growing demands for business intelligence, data analysis, and management, it is a system designed to incorporate information from various heterogeneous sources such as relational databases, spreadsheets, or flat files and provide decision makers with access to this data via user-friendly, high-performance reporting tools.
All that sounds good, but as with a graceful swan frantically paddling underwater, what occurs behind the scenes is somewhat less glamorous.
Setting Up the Basics
The first step in tackling a data warehouse is to familiarize yourself with the data sources that will feed the warehouse. It's important to make sure that you have access to their environments and applications; since you will be dealing with several, it is unlikely that you'll be completely familiar with every detail of each one. It helps to develop a rapport with the people who regularly use them or, preferably, test them. Do this early, and it will come in handy further down the line.
ETL stands for Extract, Transform, Load: the process through which records are moved in bulk from a source to a target.
Testing this differs from application testing. It is data-centric, and its objective is to validate that the records have been transformed and loaded into the target as expected, according to the conversion and mapping rules defined for each source system.
You will be comparing large volumes of information, possibly varying wildly in format. For this reason, as stated before, the testing will depend heavily on the availability and quality of test data at the source level.
Lastly, any transformation of the data will require complex SQL queries in the comparisons to ensure the ETL and reporting processes work according to the requirements.
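As a sketch of what such a comparison query can look like, the example below assumes a hypothetical mapping rule (a target `full_name` column built from the source's `first` and `last` columns) and recomputes the rule in SQL to flag target rows that disagree. All table and column names are illustrative; SQLite stands in for the real source and target databases.

```python
import sqlite3

# Hypothetical mapping rule: tgt_person.full_name = src_person.first || ' ' || src_person.last
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_person (id INTEGER, first TEXT, last TEXT);
    CREATE TABLE tgt_person (id INTEGER, full_name TEXT);
    INSERT INTO src_person VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO tgt_person VALUES (1, 'Ada Lovelace'), (2, 'Alan T.');  -- bad row
""")

# Recompute the transformation rule and report any target rows that disagree
bad_rows = conn.execute("""
    SELECT t.id
    FROM tgt_person t
    JOIN src_person s ON s.id = t.id
    WHERE t.full_name <> s.first || ' ' || s.last
""").fetchall()
print(bad_rows)  # → [(2,)]
```

An empty result means every target row matches the mapping rule; each returned id pinpoints a record the transformation got wrong.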
What To Look For
Once the initial setup is complete and before testing the ETL process itself, there are a few checks that can be run on the warehouse tables. At this stage you should perform structure comparisons between the source and target tables:
- Verify the data types and length of equivalent columns.
- Identify and match keys and constraints.
- Match columns that can have null values.
Depending on your particular data warehouse initiative, these can match between the source and target, or they can differ. It all depends on how the data needs to be stored in the data warehouse.
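The structure checks above can be automated against the database's own metadata. The sketch below uses SQLite's `PRAGMA table_info` to compare column names, types, nullability, and primary keys between a source table and its target counterpart; the table names are illustrative, and a real warehouse would query its platform's equivalent catalog views instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, hired TEXT);
    CREATE TABLE tgt_employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, hired TEXT);
""")

def table_structure(conn, table):
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, ctype, notnull, pk) for _, name, ctype, notnull, _, pk in rows]

src = table_structure(conn, "src_employee")
tgt = table_structure(conn, "tgt_employee")
mismatches = [pair for pair in zip(src, tgt) if pair[0] != pair[1]]
print(mismatches)  # an empty list means the structures match
```

When the design deliberately differs between source and target, the comparison would be driven by the mapping document rather than a strict one-to-one match.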
Once those initial checks pass, it is finally time to run the ETL and see the target tables populated with data. Once the process has executed successfully, you can check the results:
- Perform a row count on the source and target records to verify that they match.
- Since the large volume of data makes a full comparison impractical, compare the max, min, avg, max length, and min length of comparable values. The more you can verify, the better.
- Query the tables for null values that do not exist at the source level.
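The three checks above can be expressed as a handful of aggregate queries. This is a minimal sketch, with hypothetical table names and sample data, that profiles a comparable column on both sides and then looks for nulls the load may have introduced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (amount REAL);
    CREATE TABLE tgt_sales (amount REAL);
    INSERT INTO src_sales VALUES (10.0), (25.5), (7.25);
    INSERT INTO tgt_sales VALUES (10.0), (25.5), (7.25);
""")

def profile(conn, table):
    # One pass per table: row count plus min/max/avg of a comparable column
    return conn.execute(
        f"SELECT COUNT(*), MIN(amount), MAX(amount), AVG(amount) FROM {table}"
    ).fetchone()

# Row counts and aggregates should match after a clean load
assert profile(conn, "src_sales") == profile(conn, "tgt_sales")

# Nulls in the target that do not exist at the source level
null_count = conn.execute(
    "SELECT COUNT(*) FROM tgt_sales WHERE amount IS NULL"
).fetchone()[0]
print(null_count)  # 0 if the load introduced no unexpected nulls
```

The same pattern scales to real warehouses because the aggregates run where the data lives; only the small summary tuples travel back for comparison.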
After finishing those tests, it will be time to dive into a more granular level and perform checks based on the source and target’s existing rules.
- Look for gaps in the data: instances in which the system rules are satisfied but what is displayed is blatantly wrong, like an employee's date of hire set two hundred years in the past.
- Spot-check a certain number of records to ensure the data was actually Extracted, Transformed, and Loaded correctly.
Dimensions and Facts
A commonly used schema in data warehouses is the star schema, which consists of a fact table at the center, recording the facts of a particular business process (e.g., sales revenue by month by product), surrounded by a number of dimension tables that contain the textual descriptors of the business.
It is named this way because it resembles a star. Variations of it are the Snowflake Schema, an extension of the Star Schema with additional dimensions that resembles the shape of a snowflake, and the Galaxy Schema, which contains two fact tables sharing dimensions; it is also called the Fact Constellation Schema. (Guess the shape of that one.)
A fact table record captures a measurement or a metric. Dimensions are designed for query constraining/grouping/filtering and report labeling.
Armed with that knowledge, a good test at this point is to:
- Look for null or invalid foreign keys at the Fact table.
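This check amounts to finding fact rows whose foreign key points at no dimension row (orphans), plus any fact rows with a null key. A hypothetical star schema fragment, with illustrative names, shows the query shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
                             product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_sales VALUES (100, 1, 9.99),
                                  (101, 2, 4.50),
                                  (102, 3, 1.00);  -- no product 3: an orphan
""")

# LEFT JOIN keeps every fact row; a null on the dimension side marks an orphan
orphans = conn.execute("""
    SELECT f.sale_id
    FROM fact_sales f
    LEFT JOIN dim_product d ON f.product_id = d.product_id
    WHERE f.product_id IS NULL OR d.product_id IS NULL
""").fetchall()
print(orphans)  # → [(102,)]
```

Orphaned fact rows typically mean the dimension load ran late or a lookup in the ETL silently failed, so this query is worth running after every load.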
Speed of Change
Fact tables by design are constantly receiving an influx of new rows. In contrast, changes in Dimensions occur at a much slower pace, so much so that the changes in them are labelled Slowly Changing Dimension (SCD).
There are commonly three types of SCD changes implemented in a warehouse:
- Type 1: a simple record update. The value in the targeted column is overwritten with that of the change.
- Type 2: the affected row is preserved but marked as historical, while a new row, identical to the first except for the updated value, is inserted into the table and marked as current.
- Type 3: a pair of columns is created specifically to track the current and previous values. Whenever a change occurs, both attributes are updated, and no new row is added. This preserves only the most recent change, not the entire history as a Type 2 does.
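A Type 2 change is the one most worth exercising in tests, because it touches two rows at once. The sketch below assumes a hypothetical `dim_customer` dimension with an `is_current` flag (real implementations often use effective-date columns instead) and verifies that after a change the old row is expired and exactly one new current row exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, city TEXT,
                               is_current INTEGER);
    INSERT INTO dim_customer VALUES (7, 'Boston', 1);
""")

def apply_type2_change(conn, customer_id, new_city):
    # Expire the current row, then insert the updated one as current
    conn.execute("UPDATE dim_customer SET is_current = 0 "
                 "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    conn.execute("INSERT INTO dim_customer VALUES (?, ?, 1)",
                 (customer_id, new_city))

apply_type2_change(conn, 7, 'Denver')

# Verify: history preserved as an expired row, plus one current row
rows = conn.execute("SELECT city, is_current FROM dim_customer "
                    "WHERE customer_id = 7 ORDER BY is_current").fetchall()
print(rows)  # → [('Boston', 0), ('Denver', 1)]
```

A good negative test is to apply the same change twice and confirm the table still has exactly one current row per customer.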
These changes will only be encountered once the initial load of the warehouse has already taken place, and the ETL is then set to work on incremental changes. At this point, it will only bring in the changes that took place in the source from the moment of the previous load.
Tests for this instance of the process involve:
- Identifying the columns in the dimensions where the different types of changes occur, triggering each one, and verifying that the table behaves correctly afterwards.
- Testing the reports to ensure they contain exactly what is requested.
The Final Stretch
The SCD change testing can be integrated with the end-to-end tests of the data transformation. These can take two different approaches:
- A white-box approach, in which you review the source-to-target mapping design and insert records into the source system directly using SQL, simulating the way data is normally inserted. Then run the ETL and verify that the data has been transformed and loaded according to the requirements.
- Alternatively, you can follow a black-box approach and insert or update data in the source system as an end user would. Then run the ETL and compare the results.
Regardless of the path you choose, this is the moment in which reaching out to the people already testing the sources pays off, since they will have a much better understanding of all the possible scenarios where records are added or altered in these systems.
What About Performance?
The ETL process can behave differently with different volumes of data, and Dev/QA environments often lack the space for performance testing. Still, it's advisable to request the estimated necessary space and to set up test data for performance testing, either by generating samples or by scrubbing and using source production records.
Make sure to:
- Verify the run time of Historical and Incremental loads.
- Check for bottlenecks in the process as the amount of data to be processed increases.
- Check for memory leaks that could cause problems for the ETL processes.
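A lightweight way to track the first of those checks is to time each load against an agreed budget and flag runs that exceed it. The sketch below is purely illustrative: the load functions are stand-in placeholders, and the budgets are hypothetical numbers you would agree on with the team.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, budget_seconds, results):
    # Record elapsed time and whether the run stayed within its budget
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    results[label] = (elapsed, elapsed <= budget_seconds)

results = {}
with timed("historical_load", budget_seconds=5.0, results=results):
    sum(range(100_000))  # stand-in for the real historical load
with timed("incremental_load", budget_seconds=1.0, results=results):
    sum(range(1_000))    # stand-in for the real incremental load

print({label: ok for label, (_, ok) in results.items()})
```

Logging the elapsed times run over run, rather than just pass/fail, is what makes the growth-related bottlenecks in the second bullet visible before they become failures.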