Test Data Management: Can I Borrow A Customer?
In our last article we spoke of the ‘discipline’ of Test Data Management: a set of principles, accepted best practices, and a market of tools and technologies designed to provision and manage test data sets.
We also had to acknowledge that many software projects we find ourselves working on do not practice this ‘discipline’, nor do we have access to any of these nifty tools and systems. Too often, test data is still treated as an afterthought, if it’s given any thought at all.
Typically the first – and often, unfortunately, the last – step in provisioning a set of testing data is to copy a production database into the testing environment and then leave us to our own devices to figure out how to make use of it. Often this production copy is incomplete, lacks referential integrity, is full of sensitive Personally Identifiable Information (PII), and is vulnerable to sudden resets and ‘refreshes’ that wipe away weeks or months of testing work in one fell stroke.
This is not to say that there is no role for data originating in production. Not only are almost all modern software development and testing projects data centric, but many also start with a pre-existing database of some sort. Even if the application itself is brand new, it may be replacing an older system, or repurposing data compiled by other systems for other purposes.
Online shopping through e-commerce webstores has become ubiquitous and has largely supplanted the old models of mail and telephone order. We find, however, that many systems originally implemented back in the 1970s and 1980s to automate mail order operations are still in use, and in fact serve as the back end to many of the slick customer-facing e-commerce sites through which we now do our shopping. When we click ‘Buy’, the website may be calling systems running on old AS/400-style mini-computers, or perhaps even a mainframe (hey, if it works!) to process the order, create a work order to route product from warehouse to shipping, and update the retail accounting database.
The point is that data sometimes may represent a continuity going back 30 or 40 years, or perhaps even further. The online retailer’s catalog database is full of product, of all the needed variety and with all known edge cases accounted for. The inventory database may be similarly useful.
So, yes, production data is useful, and we will make use of it. In fact, it is a very large treasure trove…albeit one that sometimes resembles a haystack…made exclusively of needles…and lacking the hay. Finding what we need can be difficult, and it can be painful. But it can be done, especially if we are privileged to have access to the database itself and can write queries. We may even be able to use the application we are testing itself as a useful tool for mining the data nuggets we need for a particular test case.
There is some data, however, that we are inevitably going to find too difficult to locate or which simply does not exist in the precise form that we need. Generally, this data tends to involve people: customers, employees, and organizations. If we need a 25-year-old married female who lives in Anchorage, Alaska who is not a college graduate and not a student, drives a 2007 Honda, has a Golden Retriever and a credit score in the mid-600s, we are probably going to find it far easier and quicker to create her than to find her in the production database.
The term used for such ‘created data’ in Test Data Management circles is ‘synthetic data’. Don’t be put off by the word ‘synthetic’. A synthetic diamond is a diamond, arrived at via different means, unlike an artificial diamond which is likely to be glass or quartz. A synthetic customer in an e-commerce site’s test database is a customer. Their credit rating, on the other hand, which in production may come from an API call to a 3rd-party credit bureau, may, in testing, be an arbitrary or even random number returned by a mock of that API, and may be thought of as ‘artificial’, which is fine for our purposes because what counts is the customer.
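The ‘artificial’ credit rating described above might be sketched as follows. This is only an illustration under assumptions: the class name MockCreditBureau, the get_score method, and the customer ids are all hypothetical, and a real bureau API will look different. The idea is simply that the mock returns a pinned score when a test needs a specific one (our mid-600s customer, say) and an arbitrary score otherwise.

```python
import random


class MockCreditBureau:
    """Stand-in for a 3rd-party credit bureau API (hypothetical interface)."""

    def __init__(self, fixed_scores=None, seed=None):
        # fixed_scores maps a customer id to the score we always want returned.
        self.fixed_scores = dict(fixed_scores or {})
        # A seeded generator keeps the 'random' scores repeatable between runs.
        self._rng = random.Random(seed)

    def get_score(self, customer_id):
        # Return the pinned score if one was set for this customer;
        # otherwise return an arbitrary number in the 300 to 850 range.
        if customer_id in self.fixed_scores:
            return self.fixed_scores[customer_id]
        return self._rng.randint(300, 850)


# Pin a mid-600s score for one synthetic customer; let the rest be arbitrary.
bureau = MockCreditBureau(fixed_scores={"anchorage-001": 650}, seed=42)
pinned = bureau.get_score("anchorage-001")
arbitrary = bureau.get_score("someone-else")
```

Pinning scores for named customers keeps the test cases that depend on a credit tier repeatable, while everyone else gets a number that is good enough to let the order go through.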
Most serious TDM tools on the market advertise their prowess at synthesizing large amounts of data, but recall that we assume that such tools are not available to us. We’re thinking of data we can create ourselves with the tools at hand, the most useful of which is often the Application Under Test (AUT) itself.
All is not rosy, however. Synthesizing data can be time consuming and sometimes tedious nearly beyond tolerance (have you ever been asked to place 500 e-commerce orders so that overnight batch processing can be stress-tested?). The results are also vulnerable to that dreaded ‘database refresh’, done periodically to ‘clean up’ the data, which wipes out all our synthetic users.
What is the best balance between real and synthetic data? TDM practitioners argue the question, and ultimately it will vary project to project, or even tester by tester. When I need a customer or an employee, I almost always find it is better to create one. When I need something for a customer to buy – an insurance policy, say, or a product on an e-commerce site – I’ll rely on the database to have suitable production-like data for the purpose.
But it is never hard and fast. If I need a catalog item with particular characteristics – one that cannot be shipped by air, say, and which therefore can’t qualify for overnight shipping, or one that cannot be delivered to a military APO – and I have the ability to update the catalog database, it may be easier to create a suitable product than to try to find one.
Aside from the time and effort required to synthesize data, I find that the most common problem with synthetic data is that it lacks historical context. A new, created-for-the-purpose e-commerce customer is not going to have a purchase history. During testing, however, we will be executing a lot of ordering test cases, and we will build that history as we go along. Another pitfall is that the synthetic customer is not going to generate a credit rating or insurance claim history via calls to external services. Since such services are often not available in a testing environment anyway, this is an unfortunate test coverage hole that applies to production-derived data as well.
We are likely to need a lot of synthetic data, especially if we have a lot of ‘use-once’ test cases. What is a ‘use-once’ test case? Well, a new customer is a new customer only once, unless there is a mechanism for removing or ‘resetting’ a record (spoiler alert: there usually will not be a mechanism for removing a customer record, or ‘resetting’ its status). Every time we need to repeat that test case, we will need to register a new customer. In other cases, creating data may be surprisingly complicated. To add an employee to a system, we may need to assign that employee a supervisor. So, we first must create that supervisor, and that supervisor also has a supervisor…
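The supervisor problem above can be sketched as creating the chain from the top down, so that each employee’s supervisor already exists when that employee is added. This is a minimal sketch with hypothetical names (Employee, create_chain), and it assumes the system allows one employee at the top with no supervisor. In practice the ‘create’ step would be whatever the AUT provides, a screen or an API call, rather than an in-memory object.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Employee:
    name: str
    supervisor: Optional["Employee"] = None


def create_chain(names: List[str]) -> List[Employee]:
    """Create employees in order, from the top of the chain down to the new hire.

    Each employee is assigned the previously created one as supervisor,
    so no employee is ever created before their supervisor exists.
    """
    created: List[Employee] = []
    boss: Optional[Employee] = None
    for name in names:
        employee = Employee(name=name, supervisor=boss)
        created.append(employee)
        boss = employee
    return created


# The first name is the top of the chain; the last is the employee we need.
chain = create_chain(["Hamilton Berger", "Harry Laiggs", "Carol Danvers"])
new_hire = chain[-1]
```

Keeping the whole chain in one list also gives us a record of every supporting employee we had to invent along the way, which is worth noting down with the rest of our data.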
Remember also that often the most convenient (if not the only) tool available to us for creating our own test data is the AUT. The AUT will require that our data meet certain standards. An address, for example, may need to be a real valid address, findable via a call to Google Maps, and the zip code a real valid zip for that address, verifiable via a call to a USPS page. If these mechanisms are already implemented, then we have a ready supply of negative test cases. If they are not, we need to be careful to ensure that our data will be valid once these validations are in place.
Once we have data defined, let’s say it’s an e-commerce customer, it is worth hanging on to for further use. Though we may not be able to re-use that customer to test new customer registration, we have a lot of test cases that we can now use that customer for. Other test cases may also be ‘use once’: that customer’s first purchase (they get a gift card, perhaps), their achieving ‘Gold Customer’ status when they have spent a total of $1,000 on the site, and ‘Platinum Status’ when they have spent $5,000. We will be able to use this customer just once for each of these test cases, but clearly there are plenty of test cases for which we can use it repeatedly. If the most expensive thing in the catalog costs only $10, we’re obviously going to need to have this customer place a lot of orders if it’s ever going to get to that magical ‘Platinum’ status.
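To put a number on ‘a lot of orders’, here is a small worked calculation. It is a sketch only: the function name orders_needed is made up for illustration, and it assumes a customer reaches a status as soon as cumulative spend equals the threshold and that every order is for the same amount.

```python
import math


def orders_needed(threshold, spent_so_far, order_value):
    """Number of further orders of order_value needed to reach threshold."""
    remaining = max(0, threshold - spent_so_far)
    # Round up: a partial order does not exist, so 95 dollars short
    # at 10 dollars per order still means 10 more orders.
    return math.ceil(remaining / order_value)


# With a $10 ceiling on order value, starting from a brand new customer:
gold_orders = orders_needed(1000, 0, 10)        # orders to reach Gold
platinum_orders = orders_needed(5000, 0, 10)    # orders to reach Platinum
```

At $10 an order that works out to 100 orders for Gold and 500 for Platinum, which is a good argument for keeping careful track of any customer who is already part of the way there.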
So, here’s what I’m going to do with our synthetic customer:
While testing new customer registration I create a new customer in the AUT. I record that customer’s details in a spreadsheet. When I need to complete the test case wherein that newly registered customer places their first order and earns a $50 gift card, I update that row in the spreadsheet indicating that this user has been ‘consumed’ for that purpose. As I continue testing, following different paths through the registration and/or first purchase processes, I invent new customers and add them to the spreadsheet.
These customers all have names like ‘John Smith’, ‘Michael Jordan’, ‘Carol Danvers’, or the ever-popular ‘ASDF ASDF’ (I don’t recommend twiddle-fingering data that you want to re-use however). They all live at ‘123 4th Street’, but in a real town with a real zip code, because the app requires that these items are valid per a query to a USPS API.
I then begin using these customers to place orders, request rainchecks, and see which nearby retail locations have a product in stock that I’d rather pick up than wait to have delivered. As I proceed through the test plans I realize there are additional markers I want to keep track of for that customer, so I add columns to the spreadsheet. One column, for example, keeps track of the number of orders that customer places, so that when it comes time to test a new customer loyalty feature for customers who have placed 100 orders, I have a customer I know qualifies. I also add a column recording the amount of each purchase, so I’ll know when a customer is approaching that cumulative $1,000 threshold to earn ‘Gold’ status, plus a ‘Gold’ column set to TRUE when they achieve that goal.
Ultimately, I have a spreadsheet in which I’m tracking a dozen or more customers that I’m carefully nursing along towards one milestone and then another. As I use them for a particular ‘use once’ test case, I add new ones and begin nursing them along. As testing proceeds, I learn of more customer characteristics that are worth keeping track of. Perhaps it is the year, make and model of their car, because when their insurance comes up for renewal the rates will be recalculated based on the car’s age. This year we are also testing a new feature under which the car may be eligible for a discount if it has electronic traction control (add an ‘ETC’ column to the sheet).
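For anyone who would rather keep this ledger in a script than in a spreadsheet, the same bookkeeping might look something like the sketch below. Everything here is an assumption for illustration: the TrackedCustomer fields, the threshold constants, and the to_csv helper are hypothetical, and the columns you actually need will emerge from your own test plans just as they did in the spreadsheet.

```python
import csv
import io
from dataclasses import asdict, dataclass

GOLD_THRESHOLD = 1000      # cumulative dollars spent to earn 'Gold'
PLATINUM_THRESHOLD = 5000  # cumulative dollars spent to earn 'Platinum'


@dataclass
class TrackedCustomer:
    name: str
    city: str
    zip_code: str
    order_count: int = 0
    total_spent: float = 0.0
    first_order_used: bool = False  # 'consumed' for the first-purchase test
    gold: bool = False
    platinum: bool = False

    def record_order(self, amount):
        # Update every marker we track each time the customer places an order.
        self.order_count += 1
        self.total_spent += amount
        self.first_order_used = True
        self.gold = self.total_spent >= GOLD_THRESHOLD
        self.platinum = self.total_spent >= PLATINUM_THRESHOLD


def to_csv(customers):
    """Render the ledger as CSV text, one row per customer."""
    buffer = io.StringIO()
    fieldnames = list(asdict(customers[0]).keys())
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for customer in customers:
        writer.writerow(asdict(customer))
    return buffer.getvalue()


carol = TrackedCustomer("Carol Danvers", "Anchorage", "99501")
carol.record_order(600)
carol.record_order(450)
sheet = to_csv([carol])
```

After those two orders Carol has spent $1,050, so her ‘gold’ marker is set and she is no longer available for the first-purchase test case. The CSV text can be saved and opened in the same spreadsheet the rest of the team uses.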
So, here’s today’s lesson in TDM: record everything you do.
Treat the data like children. Track their progression through the cycle of life within the system: when they are a new customer, when they are an existing customer. When they become a loyalty program member after being a customer for one year (for testing purposes, maybe this is reduced to one month), when they become a Gold and then a Platinum Customer.
Nurture them. Give them names that amuse you: characters from The Simpsons, or The Sopranos, or that are just goofy, like Hamilton Berger or Harry Laiggs. Treat them like they’re your kids. But don’t be afraid to loan them out (or even give them away, if you can spare them) to your co-worker who needs a Platinum customer so that they can thoroughly test that all the discounts and perks that such a customer is due are applied correctly. When they become known and you hear other project members, not just fellow testers but even developers and business analysts, referring to them by name, believe me, you’ll feel gratified.
You’ll know that you can put that name in a defect, a screenshot, an email, a ‘how-to’ document and not have to worry about dropping PII into the world, or have to use the smudge brush in your screencap tool to obscure their name or address (besides, you don’t want to blur the name! You want the developer to know precisely what name they need to use to replicate your defect!).
If this sounds like you’re recording a lot of data, well yes, you are. But it’s your data…and your team’s data. You know what’s there, where it has been, what it’s for. Also, you have a model with which to recreate it when it inevitably gets blown away by a careless ‘data refresh’. That task is made easier if you have the ability to automate at least some of your data synthesizing practices, a subject we’ll touch on in a subsequent post when we look at an example or two of our use of some of these techniques. Until then, Happy Data Management!