Loading seed data

Posted by Luke Francl
on Thursday, January 31

At acts_as_conference next week (there’s still room to register) I’m going to be talking about challenges facing Rails teams. Today, I’d like to talk about loading your application’s seed data.

Seed data?

Seed data is anything that must be loaded for an application to work properly. An application needs its seed data loaded in order to run in development, test, and production.

Examples include everything from an initial administrator account to small enumerations to huge amounts of data (one developer on the Ruby Users of Minnesota list gave the example of every airport in the world).

Seed data is mostly unchanging. It typically won’t be edited in your application. But requirements can and do change, so seed data may need to be reloaded on deployed applications.

The ideal solution would be automatic: you shouldn’t have to think about it. When you check out the code and start up your app, it should be ready. It should provide data integrity: the created records should pass your validations. And it should be easy to update your seed data.

Migrations

Since migrations are just Ruby code, they can be used to initialize data in the up method. This is demonstrated in the Rails documentation:

class AddSystemSettings < ActiveRecord::Migration
  def self.up
    create_table :system_settings do |t|
      t.string  :name
      t.string  :label
      t.text  :value
      t.string  :type
      t.integer  :position
    end

    SystemSetting.create :name => "notice", :label => "Use notice?", :value => 1
  end

  def self.down
    drop_table :system_settings
  end
end

Using migrations is attractive because they get run automatically.

However, they have some downsides. Changing the data later is awkward. Writing a new migration for every data change is tedious, but editing an old migration won’t work either, because migrations that have already run are not run again.

The biggest problem with using migrations to load seed data is “migration decay.” The more migrations you have, the less likely the older ones are to work. If your migrations load data, they are more likely to break as your models change.

Furthermore, the Rails community is moving toward treating schema.rb as the authoritative source of your DB schema and creating new databases from it:

Note that this schema.rb definition is the authoritative source for your database schema. If you need to create the application database on another system, you should be using db:schema:load, not running all the migrations from scratch. The latter is a flawed and unsustainable approach (the more migrations you’ll amass, the slower it’ll run and the greater likelihood for issues).

That means a database built with db:schema:load never runs your migrations, so any seed data they load is never created.

Fixtures

At first glance, fixtures seem well suited for loading data. And because of that, a lot of projects go down the primrose path of using them—usually with poor results.

There are two ways to use fixtures to load seed data.

First, simply use test fixtures with rake db:fixtures:load. This is almost certainly a mistake. Your test fixtures will contain data not necessary for your application.

Second, create a separate set of fixtures, unrelated to your tests, and load those. Jeffery Allan Hardy has a good post about how to use fixtures to load seed data. This is better, but I don’t like fixtures because they don’t validate data. It’s way too easy to end up with broken models.

One caveat about seed data, fixtures, and tests: If you use fixtures for tests, the existing rows in those tables are deleted before the fixtures are loaded. So any seed data your tests depend on needs to be duplicated in the test fixtures.

Fixture scenario builder

I haven’t used this one myself, but a number of people on the RUM list recommended using Chris Wanstrath’s Fixture Scenario Builder as a way to use fixtures without sucking (see above).

The Fixture Scenario Builder, uh, builds on Fixture Scenarios, letting you define them in Ruby (so they’re valid) and then generating fixture files for loading. Most people use this for test cases, but it can be used to load your initial data as well.

ActiveRecord::Base loader

If only there were some way to create records that were valid. Oh wait, ActiveRecord does this. Why not write a task that loads the seed data with ActiveRecord?

You’d have to make sure this gets run whenever you set up a new application. Josh Knowles has created db-populate to facilitate this approach. It provides a db:populate rake task that will run Ruby files in the db/fixtures directory.
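As a rough sketch of the idea (this is not db-populate’s actual code, and load_seed_files is a hypothetical name used only for illustration), a loader like this could run every Ruby file in a seed directory in alphabetical order:

```ruby
# Hypothetical helper, NOT db-populate's real implementation.
# Runs every .rb file in the given directory, in sorted (alphabetical) order,
# and returns the list of files that were loaded.
def load_seed_files(dir)
  pattern = File.join(File.expand_path(dir), "*.rb")
  files = Dir.glob(pattern).sort
  files.each { |file| load file }
  files
end
```

In a Rails app you would call something like this from a rake task that depends on the :environment task, so your models are available to the seed files. Sorting the file names means you can control load order with prefixes such as 01_ and 02_.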

Here’s a helper method that makes it easy to create or update records, so the seed files can be rerun safely whether or not the records already exist.

class ActiveRecord::Base
  # given a hash of attributes including the ID, look up the record by ID. 
  # If it does not exist, it is created with the rest of the options. 
  # If it exists, it is updated with the given options. 
  #
  # Raises an exception if the record is invalid to ensure seed data is loaded correctly.
  # 
  # Returns the record.
  def self.create_or_update(options = {})
    id = options.delete(:id)
    record = find_by_id(id) || new
    record.id = id
    record.attributes = options
    record.save!
    
    record
  end
end

You can use it like this (in db/fixtures/venues.rb):

Venue.create_or_update(:id => 1, :name => "Coffman Union")
Venue.create_or_update(:id => 2, :name => "Alumni Center")

If you need to change the data, just edit the file:

Venue.create_or_update(:id => 1, :name => "Coffman Union")
Venue.create_or_update(:id => 2, :name => "McNamara Alumni Center")
Venue.create_or_update(:id => 3, :name => "Lind Hall")

I like this approach. The data is validated by ActiveRecord. It’s easy to update, and you can add it to your deploy recipe to make it automatic.

Loading lots and lots of data

I’ve read that both fixtures and ActiveRecord data loaders are too slow if you have lots of data (see Tonkatsufan’s comment here). In that case, the best thing to do is use your database’s preferred method of batch loading SQL inserts.
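One common form of batching is the multi-row INSERT statement, which MySQL and PostgreSQL both accept. Here is a minimal sketch of generating such statements in plain Ruby. The names quote_value and bulk_insert_sql are hypothetical, the quoting is deliberately naive, and in a real application you should rely on your database adapter’s quoting instead.

```ruby
# Naive SQL literal quoting, for illustration only (an assumption of this
# sketch). It handles nil, numbers, and strings with single quotes.
def quote_value(value)
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

# Builds one multi-row INSERT statement per batch of rows.
# table and columns are assumed to be trusted identifiers.
def bulk_insert_sql(table, columns, rows, batch_size = 1000)
  rows.each_slice(batch_size).map do |batch|
    tuples = batch.map do |row|
      "(" + row.map { |value| quote_value(value) }.join(", ") + ")"
    end
    "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')};"
  end
end
```

Each returned string inserts up to batch_size rows in a single statement, which avoids the per-record overhead of instantiating and saving a model. It also skips your validations, so the data has to be trustworthy before it goes in. For really large data sets, native loaders such as MySQL’s LOAD DATA INFILE or PostgreSQL’s COPY are worth a look.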

Your method here?

So that’s my survey of the available methods of loading seed data. I’m interested to hear what other people out there are doing. How do you do it?

Comments

  1. Seth, February 01, 2008 @ 09:50 AM

    It seems like most of our seed data comes from business analyst types that don’t understand YAML/Fixtures. We usually have them create a CSV file (using Excel) and then parse the file and load our local development database using a simple Ruby script.

    We then dump our local development database and load our staging and prod systems using these awesome rake tasks: http://blog.leetsoft.com/2006/5/29/easy-migration-between-databases

    You can do great things like:

    rake db:backup:write
    rake db:backup:read

    The db:backup:write saves files right in RAILS_ROOT/db/backup. We then check into SVN and deploy on servers using Capistrano (which has a task that calls db:backup:read).

  2. Tamer Salama, February 01, 2008 @ 10:44 AM

    I came across another strategy to speed up the loading: the ar-extensions gem.

  3. Pedro, February 07, 2008 @ 07:57 PM

    The last alternative (using a rake task to load fixtures from the db folder) is implemented in a plugin called yaml_db

  4. Michael, February 20, 2008 @ 09:50 AM

    When I try this, I get an error message saying “undefined method `save'”. Have you encountered this before?