Anonymize sensitive data with rake

Posted by Jon
on Wednesday, April 08

When troubleshooting a nasty bug, it’s often useful to take a look at actual production or staging data, or even pull it down into your development database. But this is a huge potential privacy and security concern. Your local environment likely isn’t as secure as your production environment, and you might not want to access this sensitive data (or give it to another team member).

Similarly, you might want to replicate your production data on a staging or QA environment to see how new code will interact with real data. That raises the same privacy concern.

Simple solution: anonymize the data!

In my current project, I put together an anonymize.rake task to deal with this. The most sensitive data in our app is name and phone number. Without that, private information can’t really be linked back to someone. So I pulled the 200 most common first names and 1000 most common last names (in the United States) and put them into an Anonymizer class. Call Anonymizer.random_name for a random, but realistic, name. The class also includes a simple phone number and email anonymizer.

class Anonymizer
  def self.random_name
    "#{random_first_name} #{random_last_name}"
  end
  
  def self.random_first_name
    FIRSTNAMES[rand(FIRSTNAMES.size)]
  end
  
  def self.random_last_name
    LASTNAMES[rand(LASTNAMES.size)]
  end
  
  def self.random_phone
    "612-555-#{rand(8000) + 1000}"
  end
  
  FIRSTNAMES = %w(James
  John
  Robert
  Michael)
  # etc. The rest of FIRSTNAMES and the LASTNAMES list are omitted here.
end
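
The email anonymizer mentioned above is not shown in the excerpt. A minimal sketch of what such a method might look like follows; the module name `EmailSketch`, the numeric suffix, and the `example.com` domain are assumptions made for illustration, not the original implementation.

```ruby
# Hypothetical sketch: the post's actual random_email is not shown.
module EmailSketch
  # Turns a display name into a plausible but fake address.
  # Runs of non-alphanumeric characters become a single dot, and a
  # small random number is appended to reduce collisions.
  def self.random_email(name)
    local = name.downcase.gsub(/[^a-z0-9]+/, ".").gsub(/\A\.+|\.+\z/, "")
    "#{local}#{rand(100)}@example.com" # example.com is reserved for documentation
  end
end

puts EmailSketch.random_email("James Smith")
```

Using a reserved domain means that a stray message sent from a development machine cannot reach a real mailbox.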

The rake task is simple:

namespace :db do
  namespace :data do
    desc "Anonymize sensitive information"
    task :anonymize => :environment do
      if Rails.env.production?
        puts "Refusing to anonymize production data. You don't really want to do that."
      else
        puts "Anonymizing all name and email records in the #{Rails.env} database."

        # User.all.each do |user|
        #   user.name = Anonymizer.random_name
        #   user.email = Anonymizer.random_email(user.name)
        #   puts "Saving #{user.name} (#{user.email})"
        #   user.save!
        # end
      end
    end
  end
end
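
To try the loop without a Rails app, here is a rough plain-Ruby sketch of the same idea. `SketchUser`, `SketchAnonymizer`, `anonymize_all`, and the short name lists are hypothetical stand-ins, assumed only so the example runs on its own; in a real app the loop would iterate over your own models and call `save!`.

```ruby
# Hypothetical stand-in for an ActiveRecord model.
SketchUser = Struct.new(:name, :email)

# Tiny placeholder version of the Anonymizer class, with short name lists.
module SketchAnonymizer
  FIRSTNAMES = %w(James John Robert Michael)
  LASTNAMES  = %w(Smith Johnson Williams Brown)

  def self.random_name
    "#{FIRSTNAMES.sample} #{LASTNAMES.sample}"
  end

  def self.random_email(name)
    "#{name.downcase.tr(' ', '.')}@example.com"
  end
end

# Overwrites name and email on every record, mirroring the commented-out loop.
def anonymize_all(users)
  users.each do |user|
    user.name  = SketchAnonymizer.random_name
    user.email = SketchAnonymizer.random_email(user.name)
  end
end

users = [SketchUser.new("Real Person", "real.person@corp.test")]
anonymize_all(users)
puts "#{users.first.name} (#{users.first.email})"
```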

You’ll need to do the actual implementation yourself (see the sample User.all.each {} block). It would be easy enough to extend this to work with social security numbers, addresses, etc. Run with:

rake db:data:anonymize
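
As a sketch of the extension to social security numbers and addresses, something along these lines could work. The module name `SensitiveFieldSketch`, both method names, and the street list are assumptions for illustration. The SSN generator relies on the fact that area number 000 is never issued, so the output cannot collide with a real number.

```ruby
# Hypothetical helpers; not part of the original Anonymizer class.
module SensitiveFieldSketch
  STREETS = %w(Main Oak Maple Cedar)

  # Area number 000 is never assigned, so the result is
  # correctly formatted but cannot belong to a real person.
  def self.random_ssn
    format("000-%02d-%04d", rand(1..99), rand(1..9999))
  end

  # A made-up street address with a three- or four-digit house number.
  def self.random_street_address
    "#{rand(100..9999)} #{STREETS.sample} St"
  end
end

puts SensitiveFieldSketch.random_ssn
puts SensitiveFieldSketch.random_street_address
```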

Code: anonymize.rake

Comments


  1. atmos, April 08, 2009 @ 06:09 PM

    Neat trick. You might be able to remove some of the manual name definition and take advantage of randexp (http://github.com/benburkert/randexp/tree/master). I’ve been using it for a while now, but mainly just for generating random data in tests. You could totally use it here though.

  2. Adam Hunter, April 08, 2009 @ 08:48 PM

    This is a great idea. I frequently pull down my production data on one of my projects and have had the concern of my dev machine not being secure. Thanks for the great info!

  3. Peter Edstrom, April 08, 2009 @ 09:06 PM

    I recently used populator and faker to build up a good set of data to play with in development. Seemed to work well. http://railscasts.com/episodes/126-populating-a-database

  4. Marcos Wright Kuhns, April 09, 2009 @ 06:45 AM

    Looks like you guys have talked about random data generation before. Personally, I’ve been using the forgery gem recently to generate random names, addresses, etc. It’s quite flexible.

  5. Steve Agalloco, April 10, 2009 @ 11:00 PM

    One thing we’re doing to make it easy to avoid accidentally sending an email out to a user from our development and staging environments is to update the email address of our users using the email address configured in our local gitconfig. Like so:

    # strip removes the trailing newline from the command output
    git_email = `git config user.email`.strip
    email_parts = git_email.split('@')

    User.all.each do |u|
      u.email = "#{email_parts[0]}+#{u.username}@#{email_parts[1]}"
      u.save!
    end

    This allows us to send all test email as we would normally without worrying about something slipping out that shouldn’t have. It has the added benefit of not relying on additional configurations and all outgoing emails get sent to our email addresses.

  6. Eric, April 16, 2009 @ 01:19 PM

    Reminds me of “filter_parameter_logging”, which can be used to omit sensitive parameters (e.g. passwords) when logging.

    http://api.rubyonrails.org/classes/ActionController/Base.html#M000622