Obfuscating data from one database to create another.

Levent · March 23, 2022

Hi,

I have a database that I pulled from production. It contains sensitive information, and I blanked out most of the sensitive data before pulling it off the server. The issue I am facing is that I need some of the data to be obfuscated or mixed rather than blanked out, so that it can still be used in the test environment.

Let's say I have two columns, Make and Model. I would be fine if the Make in one row were replaced with a random Model from another row.

Any ideas?

Sauron · March 23, 2022

You could query each column separately and in random order, keeping track only of which indexes have already been queried, not of which entries belong to them. Then insert the results next to each other.
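A minimal sketch of that idea in Python, assuming an SQLite database whose tables have the default rowid. The function name shuffle_columns and the table and column names are made up for illustration. Each column is read on its own, shuffled, and written back, so the values survive but no longer line up with their original rows.

```python
import random
import sqlite3


def shuffle_columns(conn, table, columns, seed=None):
    """Shuffle each listed column independently across the rows of `table`.

    Hypothetical helper. Table and column names are interpolated into the
    SQL text, so they must come from trusted configuration, not user input.
    """
    rng = random.Random(seed)
    # Fixed row order, so every column is written back to the same set of rows.
    rowids = [r[0] for r in conn.execute(f"SELECT rowid FROM {table} ORDER BY rowid")]
    for col in columns:
        values = [
            r[0] for r in conn.execute(f"SELECT {col} FROM {table} ORDER BY rowid")
        ]
        rng.shuffle(values)  # breaks the link between a value and its row
        conn.executemany(
            f"UPDATE {table} SET {col} = ? WHERE rowid = ?",
            zip(values, rowids),
        )
    conn.commit()
```

Calling shuffle_columns(conn, "cars", ["make", "model"]) on a copy of the dump would mix both columns. Note that this only breaks the pairing between columns: every original value is still present somewhere in the table, so it is not enough for values that are sensitive on their own.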

Eigenvektor · March 23, 2022

We had something like that at my previous employer, which we used when we asked customers for a database dump. You could specify tables, columns and a cycle count. It would then randomly select two rows/columns from the table/column set and swap their contents. With a high enough cycle count the data was sufficiently anonymized for developer use as test data.

So basically:

Specify a set of tables and associated columns you want to mix
Select two random tables from tables as table1 and table2
Select random columns from table1.columns and table2.columns
Select a random index for each table (based on total count)
Swap their contents
Repeat n times
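
The steps above can be sketched roughly as follows. This is not the original tool, just an illustration assuming SQLite, non-empty tables with the default rowid, and a made-up function name swap_random_cells. Since values can move between columns, the column set should only contain columns that hold comparable data.

```python
import random
import sqlite3


def swap_random_cells(conn, targets, cycles, seed=None):
    """Swap the contents of two randomly chosen cells, `cycles` times.

    `targets` maps a table name to the list of its columns to mix.
    Hypothetical helper: names are interpolated into the SQL text, so they
    must come from trusted configuration. Tables are assumed to be non-empty.
    """
    rng = random.Random(seed)
    tables = list(targets)
    # Cache the row ids once; a "random index" is a random pick from these.
    rowids = {
        t: [r[0] for r in conn.execute(f"SELECT rowid FROM {t}")] for t in tables
    }
    for _ in range(cycles):
        # Pick two tables (possibly the same one), then a column and row in each.
        t1, t2 = rng.choice(tables), rng.choice(tables)
        c1, c2 = rng.choice(targets[t1]), rng.choice(targets[t2])
        r1, r2 = rng.choice(rowids[t1]), rng.choice(rowids[t2])
        v1 = conn.execute(f"SELECT {c1} FROM {t1} WHERE rowid = ?", (r1,)).fetchone()[0]
        v2 = conn.execute(f"SELECT {c2} FROM {t2} WHERE rowid = ?", (r2,)).fetchone()[0]
        # Write each value into the other cell. No record of past swaps is kept.
        conn.execute(f"UPDATE {t1} SET {c1} = ? WHERE rowid = ?", (v2, r1))
        conn.execute(f"UPDATE {t2} SET {c2} = ? WHERE rowid = ?", (v1, r2))
    conn.commit()
```

For example, swap_random_cells(conn, {"cars": ["make", "model"]}, cycles=10000) would run ten thousand swaps on one table. The cycle count here is an arbitrary value picked for the example.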

It's quite possible n was selected automatically based on the number of database entries. As far as I remember, the software didn't keep track of the things it had already swapped. I think the argument was that this actually improves randomness, because otherwise the amount of data to choose from shrinks over time.