Why you really don’t need production data for Test and Dev

By Aengus Rooney - February 14, 2018

What valuable data for Test and Dev environments looks like.

With synthetic data often falling short of test requirements, there remains a persistent notion that only raw, sensitive data will do for fast and accurate Test and Dev. We make the case: you don’t need production data at all.

Using raw production data for Test and Dev is a bad idea.

Test and Dev environments, by necessity, don’t have the same security controls as production environments. When you move your sensitive data into one, you open it up to a much wider group of users – who may or may not be employed by your business. The result is a massively increased risk of a breach, whether malicious or inadvertent, and of a hefty fine from a regulator.

And yet, it’s been common practice for a long time. The argument is that there’s no viable alternative.

There’s a misconception that raw production data is the only data fit for Test and Dev. The argument goes as follows…

  1. Synthetic data can’t cut it.

You could use a synthetic data generator to create completely artificial datasets, using the same schema as your production data – but it won’t replicate your production data’s nuanced structure, complexity and referential integrity.

Use that synthetic data for Test and Dev, and you’ll find:

  • It’s hard to reliably test for issues and bugs, since you don’t have an accurate replica model
  • You can’t be confident an application you’re developing won’t fall over the moment you deploy it into your production environment. Synthetic data isn’t deep and rich enough to cover edge cases realistically.

Given that issue resolution and new development are the two reasons most companies turn to their Test and Dev environment, the case against synthetic data is clear.

  2. Masking data traditionally takes forever – and rarely eliminates risk

Another approach is to take a snapshot of your production data, and mask it – anonymising the information that could be used to instantly identify an individual within the overall dataset.

But for most organisations, masking data remains unsophisticated, lengthy, manual work – mostly due to the one-off permission processes your team has to go through. What’s more, it rarely goes far enough.

Secondary identifiers – which, when taken together, still enable an individual to be identified – are almost always left untouched because masking these requires much more sophisticated techniques, from introducing noise by perturbation to grouping values by generalisation.
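To make the two techniques named above concrete, here is a minimal Python sketch of generalisation and perturbation applied to secondary identifiers. The field names, band width, noise range and UK-style postcode format are assumptions chosen for illustration, not a description of any particular product.

```python
import random

def generalise_age(age: int, band: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

def generalise_postcode(postcode: str) -> str:
    """Keep only the part of a UK-style postcode before the space."""
    return postcode.split(" ")[0]

def perturb(value: float, spread: float, rng: random.Random) -> float:
    """Add uniform noise in [-spread, +spread] to a numeric value."""
    return value + rng.uniform(-spread, spread)

rng = random.Random(42)  # seeded so repeated runs give the same noise

# Hypothetical record: none of these fields is a direct identifier,
# but together they could single out one person.
record = {"age": 37, "postcode": "SW1A 1AA", "salary": 52000.0}

masked = {
    "age": generalise_age(record["age"]),
    "postcode": generalise_postcode(record["postcode"]),
    "salary": round(perturb(record["salary"], 1000.0, rng), 2),
}
print(masked)
```

Wider bands and larger noise ranges give stronger protection but less precise test data, so the right settings depend on the dataset and the risk you are willing to accept.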

The lack of fast, secure alternatives to raw production data has made special dispensations for Test and Dev the norm.

Businesses have simply crossed their fingers and authorised the use of their most sensitive data. Today, however, thanks to growing regulatory pressure – and a better understanding of privacy risks – these special dispensations are coming under ever closer scrutiny.

Simply put, there has to be another option.

The problem is, organisations have relied on raw data for so long that they’ve lost sight of what makes data valuable for Test and Dev.

It’s true that synthetic and manually masked data are no good – but that doesn’t mean you need to reach for raw data straight away.

It’s not the rawness of production data that makes it fit for Test and Dev purposes. It’s two characteristics that synthetic and manually masked data can’t match:


The first is richness: the structures and linkages within data sets, and the referential integrity you need if you’re going to replicate bugs and issues effectively and have confidence in your test results.

To get more granular and to be truly valuable, Test and Dev data must:

  • Preserve the original data types and formats
  • Preserve the consistency of primary key values
  • Preserve referential integrity, considering up and down stream data flows
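One way to meet all three requirements at once is deterministic pseudonymisation: the same input key always maps to the same masked key, so foreign keys still line up after masking. The sketch below is illustrative only. The `C000123` ID format, the table layouts and the secret key are assumptions, and truncating a hash to six digits can in principle produce collisions that a real tool would need to handle.

```python
import hashlib
import hmac

# Hypothetical key for illustration; real keys belong in a secrets manager.
SECRET_KEY = b"example-key-rotate-me"

def pseudonymise_id(customer_id: str, key: bytes = SECRET_KEY) -> str:
    """Map an ID like 'C000123' to another ID of the same shape, deterministically."""
    prefix, digits = customer_id[0], customer_id[1:]
    digest = hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()
    number = int(digest, 16) % (10 ** len(digits))
    return f"{prefix}{number:0{len(digits)}d}"

customers = [{"id": "C000123", "name": "Alice"}, {"id": "C000456", "name": "Bob"}]
orders = [
    {"order_id": 1, "customer_id": "C000123"},
    {"order_id": 2, "customer_id": "C000456"},
    {"order_id": 3, "customer_id": "C000123"},
]

# Mask the primary key and every foreign key with the same function.
masked_customers = [{"id": pseudonymise_id(c["id"]), "name": "REDACTED"} for c in customers]
masked_orders = [{**o, "customer_id": pseudonymise_id(o["customer_id"])} for o in orders]

# Referential integrity check: every order must still point at a known customer.
known_ids = {c["id"] for c in masked_customers}
orphans = [o for o in masked_orders if o["customer_id"] not in known_ids]
print(orphans)
```

Because the mapping depends only on the key and the input value, the same function can be applied to upstream and downstream systems independently and the masked keys will still join.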

(And that’s just for starters. See our checklist for a detailed rundown of what makes Test and Dev data useful and safe.)


The second is ready availability. When your production systems go down, you need to be able to act fast. Valuable Test and Dev data is data that’s readily available – allowing teams to work on patches and fixes the moment it’s clear something’s wrong.

Here’s the good news. A mature approach to data protection can help you rapidly provision rich data in a Test and Dev environment, without relying on raw production data or opening your business up to a world of risk.

A more sophisticated method of anonymisation can preserve what matters in your production data – its richness and ready availability – while effectively protecting the sensitive information it contains, and even increasing its usefulness for Test and Dev. How? By…

  • Extending data anonymisation – so it acts on secondary identifiers, and genuinely protects privacy
  • Automating data anonymisation according to centralised rules and policies – so the process of moving data from production to Test and Dev takes hours rather than months, and produces the consistent, standardised masking library needed for repeatable testing
  • Enriching data through anonymisation – tweaking rules to create the huge datasets (complete with edge cases and boundary values, and all their original linkages and referential integrity) that are necessary to really push the boundaries of software systems.
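As a rough illustration of what centralised rules and policies could look like in practice, the Python sketch below keeps every masking rule in one policy table and applies it to each row automatically. The column names and rule functions are hypothetical, and this is a simplified sketch under those assumptions rather than how any specific anonymisation product works.

```python
from typing import Callable, Dict

def redact(_value: object) -> str:
    """Replace a direct identifier with a fixed placeholder."""
    return "REDACTED"

def year_only(date_text: str) -> str:
    """Generalise an ISO date 'YYYY-MM-DD' to its year."""
    return date_text[:4]

def keep(value: object) -> object:
    """Pass a non-sensitive value through unchanged."""
    return value

# One central policy: column name -> masking rule (hypothetical columns).
POLICY: Dict[str, Callable] = {
    "name": redact,
    "date_of_birth": year_only,
    "order_total": keep,
}

def apply_policy(row: dict, policy: Dict[str, Callable]) -> dict:
    """Mask a row; columns without a rule are rejected, not passed through."""
    unknown = set(row) - set(policy)
    if unknown:
        raise KeyError(f"No masking rule for columns: {sorted(unknown)}")
    return {column: policy[column](value) for column, value in row.items()}

row = {"name": "Alice Smith", "date_of_birth": "1984-03-09", "order_total": 42.5}
masked_row = apply_policy(row, POLICY)
print(masked_row)
```

Failing on columns that have no rule is a deliberate choice in this sketch: a new sensitive column added in production stops the job instead of slipping into Test and Dev unmasked.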

The result isn’t just safer data for Test and Dev, it’s even more valuable Test and Dev data than you started with.

In the end, this isn’t just about Test and Dev. It’s about any transfer of production data to a secondary environment.

We’re living in the age of big data, but without a way to securely provision to Analytics, Machine Learning and Test and Dev environments, many organisations still aren’t feeling a big difference.

A faster, smarter approach to data anonymisation can help open up all of these data flows, while keeping access to sensitive data genuinely locked down. We know, because we’ve pioneered it, and turned it into a packaged solution – and because our customers are proving its value every day.

If you’re working to quickly and safely provision data to your own Test and Dev environment, we’ve a checklist to help. It lays out six principles to follow to ensure your data is both safe and useful for Test and Dev purposes – you can download your copy here.