ETL, Data Conversion and Migration: All About the Data
Now that we have the definitions squared away with ETL, Data Conversion and Data Migration, the first step is: It’s All About the Data, Data, Data!
The main element ETL, Data Conversion, and Data Migration have in common is “Data.” Whether the data is from old databases on Mainframe or Client Server applications, Access databases, Excel files, or anywhere else data hides these days, it is all about extracting, converting and/or migrating the data. Without the data, there is nothing for you or your applications to do. The quality of the data is a huge factor in the perceived quality of the applications using this data.
When ETL, Data Conversion, or Data Migration is first being planned, the first question that needs to be answered is “What Data will be included?” – this includes the master records and transactional records, along with the date/time frame to be included.
For example: for an ETL process going from an old legacy system to a new system, the business may want to include up to 7 years’ worth of data out of the 20 years the legacy system contains. Or, take a daily ETL process running to feed a data warehouse. The cutoff may be 11:00 PM ET and the 24 hours prior, ensuring transactions that fall right around the cutoff time is included in today’s run – or yesterday’s run – but not both. This is the old boundary rule test case. There is more criteria for selecting the data and all need to be determined and tested.
The next question to be answered is “What is the quality of the Data being included from the source system(s)?” For a daily ETL process that will be running and extracting data from a currently running Production system to a data warehouse, the quality should be extremely good and there should be no work needed to improve the quality of this data. If the data quality is not good in this situation, you have bigger problems that need to be worked first.
If your ETL process is dealing with old Legacy data, the quality of the data may not be optimal and will need to be improved. This is referred to as data cleansing or data scrubbing. This is really a business problem to solve, as their department owns this data and the responsibility to know and understand their data. IT helps the business as much as possible with things like running queries to help determine where the problem areas are, and then if part of the solution is to do a mass update, IT can do this from the back end instead of having the business do it manually.
At times the best solution may be business resources spending time to improve the quality of the data; there is no way around this. This is usually caused by lack of enforcement of data maintenance over the years, and the quality of the data declining as time goes by. Or as time went on, more applications were added to the data workflow, and it became more complicated and inadequate testing was performed.
So, in any project that have an ETL, Data Migration, or Data Conversion tasks included, these two questions should be a top priority at the beginning of the project to answer. If the answer comes back as the data is low quality, then a plan and timeline need to be determined and worked on at the beginning of the project.
One of the benefits of getting the quality of the data improved and the ETL processes working at the beginning of a project is, this gives you “real” test data for the target system’s functionality to be tested with. If quality of the data was not improved and the ETL processes work was done until later in the project, you must test the target system again with the “real” data from the fully tested ETL processes. It would be costly to test the target system twice in this scenario.
Again, it is all about the data, data, data!