ActiveRecord referential integrity is broken. Let's fix it!

Posted by Jon
on Tuesday, August 18

ActiveRecord supports cascading deletes to preserve referential integrity:

class User
  has_many :posts, :dependent => :destroy
end

But you really only want cascading deletes about half the time. The other half, you want to actually restrict deletion of a record with dependencies. ActiveRecord doesn’t support this.

Think of an e-commerce system where a user has many orders. Once an order has gone through, you shouldn’t be able to delete the user who placed the order. You need a record of the order and the user who placed it.

Or, even more obviously, think of a lookup table. An Order might have several of these dependencies: OrderStatus, Currency, DiscountLevel, etc. In all of these cases, you want ON DELETE restrict, not ON DELETE cascade. But Rails doesn’t support this. That’s dumb.

If you agree, head on over to the Rails UserVoice site and make your opinion known! There is a ticket for this already. Vote it up if you think Rails should implement this.

The solution to the problem is really pretty simple. ActiveRecord just needs something like this:

class User
  has_many :posts, :dependent => :restrict
end

In this case, if you try to destroy a user that has one or more posts, Rails should complain. You’ve told the app: “Don’t let me delete users who have posts!” The easiest way to do this is to have Rails throw an exception, and have your controller capture the exception and print a flash message. Other approaches could work too.
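To make the idea concrete, here is a rough sketch in plain Ruby rather than ActiveRecord, so it runs on its own. Everything in it is an assumption for illustration: the DeleteRestrictionError class, the stand-in User class, and the destroy_user helper that plays the role of a controller action. It is not how Rails implements (or would implement) the option.

```ruby
# Hypothetical error class for this sketch; not an ActiveRecord constant.
class DeleteRestrictionError < StandardError; end

# Plain-Ruby stand-in for a model that would declare has_many :posts.
class User
  attr_reader :posts

  def initialize
    @posts = []
    @destroyed = false
  end

  def destroyed?
    @destroyed
  end

  # Refuse to destroy the record while dependent posts still exist.
  def destroy
    unless posts.empty?
      raise DeleteRestrictionError, "Cannot delete a user who has #{posts.size} post(s)"
    end
    @destroyed = true
  end
end

# Stand-in for a controller action: rescue the error and return the
# text that would go into the flash message.
def destroy_user(user)
  user.destroy
  "User deleted."
rescue DeleteRestrictionError => e
  e.message
end
```

The point of raising a dedicated error class is that the controller can rescue exactly that case and leave every other failure alone.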

So why is this important?

1. It’s common. Every project should maintain referential integrity in some way, and :dependent => :destroy isn’t always appropriate. Who wants to do a cascading delete from roles to users, or manufacturers to products, or order_statuses to orders? I don’t think I’ve ever worked on a project where cascading deletes were always appropriate. Any lookup table, at minimum, needs this feature. (I personally prefer to maintain referential integrity with foreign keys, but even still, I’d love to have an application-level check first, which would be easier to rescue. And some projects don’t use foreign keys.)

2. It fits with the Rails philosophy. Rails says “Let your application handle referential integrity, not the database”. But without :dependent => :restrict, one of the most important pieces of referential integrity is missing.

3. It’s easy. 9 lines of code to add this to has_many. Check out this gist: http://gist.github.com/170059.

Someone wrote a plugin for this, but it has the distinct disadvantage of not working anymore. This should really be a core feature anyway, at least as long as :dependent => :destroy is a core feature.

The UserVoice suggestion for this is at http://rails.uservoice.com/pages/10012-rails/suggestions/103508-support-dependent-restrict-and-dependent-nullify.

Rails 2.3.3 upgrade notes: rack, mocha, and _ids

Posted by Jon
on Wednesday, July 29

I upgraded two apps to Rails 2.3.3 today. It’s a minor release, and there’s not much to report. But I did run into three minor problems.

Mocha

Mocha 0.9.5 started throwing an exception:

NameError: uninitialized constant Mocha::Mockery::ImpersonatingAnyInstanceName

A quick update to Mocha 0.9.7 cleared this up.

Array parameters in tests

In functional tests with Test::Unit, passing an array to a parameter stopped working. Previously, I had something like this:


post :create, :user => {:role_ids => [1,2,3]}

This would post the following parameters:


"role_ids"=>["1", "2", "3"]

But after the 2.3.3 update, I started seeing an error:

NoMethodError: undefined method `each' for 1:Fixnum

I’m not sure why this stopped working. (Anyone know?) Changing the integers to strings clears up the error:


post :create, :user => {:role_ids => ["1","2","3"]}

Or


post :create, :user => {:role_ids => [1.to_s,2.to_s,3.to_s]}
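If you have a lot of tests like this, converting every ID by hand gets tedious. Here is a small sketch of a helper that stringifies every value in a nested params hash. The name stringify_values is my own invention, not part of Rails or Test::Unit, and note that it also turns nil into an empty string, which may or may not be what you want.

```ruby
# Hypothetical test helper: recursively convert every leaf value in a
# params structure to a String, leaving the keys untouched.
def stringify_values(value)
  case value
  when Hash
    value.each_with_object({}) { |(key, inner), out| out[key] = stringify_values(inner) }
  when Array
    value.map { |inner| stringify_values(inner) }
  else
    value.to_s
  end
end
```

In a functional test it would wrap the params, along the lines of post :create, stringify_values(:user => {:role_ids => [1,2,3]}).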

Rack

Rack apparently no longer comes bundled with Rails. Or at least deployment failed on cap deploy: RubyGem version error: rack(0.4.0 not ~> 1.0.0).

The solution was simple: install (or vendor) Rack 1.0.0.


config.gem 'rack', :version => '>= 1.0.0'

Estimating software: a rule of thumb

Posted by Jon
on Tuesday, June 02

Estimating software is hard, but most of us have to do it – whether we’re estimating an entire project for a client, or a new feature for a boss, or a change to one of our own projects.

I’ve found the following rule helpful when estimating software. This comes from about four years of estimating Rails projects to consulting clients, and moving from bad – dramatically underestimating fixed-bid projects – to pretty good – usually overestimating time & materials projects slightly. (And more importantly, knowing when I can’t estimate, because the scope is too vague or too large.)

Jon’s Law of Estimates

Software difficulty is primarily determined by volume, logic, and integration.

Jon’s Law of Estimates, explained

1. Volume is easy to understand. If you’re building software that does more, it will require more work. So if you’re estimating a project that stores recipes, and you’re estimating another project that stores recipes AND shopping lists, you can expect that the second one will take more work (if everything else is equal).

2. Logic refers to the rules or business logic behind a feature. The more rules there are, the more work there is. Imagine that our recipe system requires that recipes from some users are manually approved by an administrator, and checks to see that each ingredient in the recipe is present in the step-by-step instructions, and only allows a user to post 3 recipes per hour, and lets users propose alternative versions of a recipe, and lets an alternative version replace the regular version if it achieves a certain rating, etc. That’s more work than a recipe system that just lets users create and rate recipes, even though the volume of features may not be any larger.

Interestingly, a technology can make some logic trivial and some logic hard. Nested forms are a great example of this. Before Rails 2.3, Rails made it trivial to do CRUD on a single table at a time, but difficult to handle multiple tables. Now it is (almost) trivial to do CRUD on multiple tables at a time.

3. Integration points are usually deserving of special consideration in an estimate. This includes talking to a web services API, another local software system, a data feed, a complex library, etc. Not only do integration points often take time to get right, but they can become sinkholes of time when the documentation is inadequate or incorrect, the other system doesn’t play nice, or you can’t easily test the integration. And your estimate depends on something out of your control: the other system.

External factors

These rules only apply to the difficulty of the software. Several external factors are important as well. These include, most notably, the client and the team. The client can make a project easy, or they can make a project difficult. Similarly, the right team might be able to blaze through a project quickly, while the wrong team may never finish at all.

The other side of estimating

Here’s the thing about these rules: they’re relative, not absolute. There is no rule that says “Features take 5 days, and integration points take 10”. So estimating requires comparisons. This means that if you’ve never built a Rails app before, you’ll have trouble estimating a Rails project. But once you’ve built a few, you can compare the volume, logic, and integration points of a new project to volume, logic, and integration points of the previous ones.

So estimating requires intuition and experience as well as analysis (e.g. Jon’s Law of Estimates). The key to estimating is to combine analysis and intuition, and to let each side refine the other.

Benchmarking your Rails tests (updated)

Posted by Jon
on Friday, April 03

Update: stubbing a single integration point shaved 22 seconds off of my unit tests, reducing test time from 35 seconds to 13. See below.

The first step to faster tests is knowing what is slow. Fortunately, this is dead simple with the test_benchmark plugin by Tim Connor, originally built by Geoffrey Grosenbach. Install the plugin, and when you run your tests via Rake, you’ll see handy output showing you the slowest tests and the slowest test classes.

Step 1: Install the plugin.

script/plugin install git://github.com/timocratic/test_benchmark.git

Step 2: Run your tests

rake test

Here is a bit of output when I run the unit tests for FanChatter:

Finished in 34.838173 seconds.

Test Benchmark Times: Suite Totals:
25.393 MailReceiverTest
4.520 PhotoTest
1.429 REXMLTest
0.961 TeamTest
0.846 MessageTest

Pretty useful information. Almost 75% of our unit testing time is taken up in the MailReceiverTest. So if we want to speed up our tests, we need to make our MMS testing faster. Looking at that code, I see this line over and over:


MailReceiver.receive(fixture_mms(:fixture_name))

This method reads a test email message from the filesystem, and runs it through our mail parsing method. This is basically an integration test, hitting at least two integration points. So if we can remove these bottlenecks, we can reasonably expect a fairly large improvement in our unit test speed.

I think we could realistically reduce our unit testing time from 34 seconds to <15 seconds just by refactoring this one test method.

Other options

The test_benchmark plugin fires whenever you run your tests with rake. Tim recently patched the plugin to not fire when run with autotest, which is great. Personally, though, I don’t want to see this benchmark information every time I run my tests. So I added the following line to my test.rb environment file:

ENV['BENCHMARK'] ||= 'none'

Now, the benchmarks don’t run by default. If I want to see them, I call:

rake test BENCHMARK=true

And to see full results, showing the time it takes to run every test in the system, just call:

rake test BENCHMARK=full
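The reason the command-line values still win is the ||= operator: it only assigns when nothing is set yet. A tiny sketch, using a plain Hash in place of ENV so it can be run anywhere (the benchmark_mode method name is just for illustration):

```ruby
# env is any Hash-like object; in test.rb you would use ENV directly.
# Returns the effective benchmark setting after applying the default.
def benchmark_mode(env)
  env['BENCHMARK'] ||= 'none'
end
```

Called with an empty hash this returns 'none'; called with a hash that already has 'BENCHMARK' set to 'full', it leaves the value alone.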

That’s it. You still have to speed up your tests, and there are many ways to do that (from mocking to simply reducing the number of calls to expensive methods), but knowing what’s slow is half the battle.

The stirring conclusion (update)

I spent a few minutes optimizing these slow tests today. First, I tried rearranging the tests to reduce unnecessary calls to the slow method (MailReceiver.receive(message)). I was able to speed MailReceiverTest from about 25 seconds to 17. Not bad, but still slow.

The real problem is that this method saves a photo. It creates a Photo record that includes a file, treated sort of like an upload, like this:

photo.uploaded_data = mms.file

This is what was slow. But my unit tests don’t actually deal with the file being saved to the filesystem; they test other things, like the right records being created, confirmation emails being sent, etc.

So I decided to try bypassing this file save/upload by stubbing the uploaded_data= method. I put the following at the top of my test class:

def setup
  Photo.any_instance.stubs(:uploaded_data=)
end

And voila! MailReceiverTest went from 25 seconds to 17 seconds to 3 seconds.
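If you aren't using Mocha, the same idea can be approximated in plain Ruby by swapping the method out for the duration of a block. This is only a sketch: the Photo class here is a stand-in I made up (the sleep imitates the slow file write), and with_stubbed_upload is a hypothetical helper, not part of any library.

```ruby
# Stand-in for the real model; the sleep imitates a slow file write.
class Photo
  attr_reader :writes

  def initialize
    @writes = 0
  end

  def uploaded_data=(file)
    sleep 0.05
    @writes += 1
  end
end

# Replace Photo#uploaded_data= with a no-op while the block runs,
# then put the original method back, even if the block raises.
def with_stubbed_upload
  original = Photo.instance_method(:uploaded_data=)
  Photo.send(:define_method, :uploaded_data=) { |_file| nil }
  yield
ensure
  Photo.send(:define_method, :uploaded_data=, original)
end
```

Inside the block, assigning uploaded_data does nothing; afterwards the real method is back. Mocha does the cleanup for you at the end of each test, which is one reason to prefer it.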

Slow tests are a bug

Posted by Jon
on Tuesday, March 10

I’ve been doing TDD for about three years now. Once I figured out how to do it right, it became a natural part of how I program, and I can’t really imagine doing development without it. This isn’t to say that TDD is the only approach to writing quality software or that unit testing is the only kind of testing that matters. But it sure is useful.

The Ruby world talks a lot about TDD, more so than many other developer communities. We have not one, not two, but at least half a dozen testing libraries that are actively being used and developed. For most Ruby developers, the question isn’t “Do you test?” but “BDD or TDD?” or even “RSpec, Shoulda, or Bacon?” We often use at least 2-3 layers of automated testing, and sometimes use different tools for each layer. Most Ruby conferences devote at least a few talks each day to testing-related topics. We’re test fanboys and -girls, for better or for worse.

But in spite of this, we rarely talk about test speed. Sure, there are purists who believe that unit tests shouldn’t touch the database because anything that touches the DB is actually an integration test. But few Ruby testers actually take this long and lonely road, and I personally prefer tests that talk to a database, at least some of the time.

And it’s true that others have written libraries to distribute their tests across multiple machines. But that’s the exception that proves the rule – the only reason to distribute your tests is that they’re too slow to begin with.

Most Rails projects I’ve worked on have ended up at around 3,000-15,000 lines of code, with roughly as many lines of test code, and most have test suites that take a minute or more to run. Our test suite for Tumblon, for instance, churns along for 2.5 minutes. This is too slow. And slow tests are a problem for at least two reasons: they slow down your development and decrease code quality.

1. Slow tests slow down development. If you’re practicing TDD, you want to see a test fail before you make it succeed. Two minutes is far too long for this feedback loop to be effective. Of course, you can (and should) just run the test classes that correspond to your code as you program – no need to run your entire test suite every time you write your failing tests. But even still, the test time bar should ideally be set quite low. Frequent 5-10 second delays are enough to break my concentration, and I find myself cmd-tabbing over to other programs if I have to wait more than a few seconds for a test to run. I don’t know of any hard-and-fast rules, but I know that as soon as my test suite runs longer than 30-45 seconds, and individual test classes take longer than 2-3 seconds, I’m less happy and less productive.

2. Slow tests decrease code quality. There are two simple reasons for this. First, if slow tests break your flow, you’re not only going to write code more slowly: you’re also going to write worse code. Second, if your tests are too slow, you’re not going to wait for them to finish before you move on to the next task. Or worse, you’re not going to run them at all.

So, how can I speed up my tests?

Fortunately, this problem can be addressed. There are plenty of ways to speed up tests. On a current project, we’ve managed to cut our test time substantially – a recent test refactoring cut test time from 129.45 seconds to 31.04 seconds, without removing any tests. That’s a 76% reduction in run time. But we still have room for improvement.

Really quickly, here are at least five ways to speed up your test suite. I hope to post more on each of these over the next month or two.

1. Use a test database instead of fixtures/factories/etc.

2. Only touch the database when necessary

3. Organize your tests to avoid duplicate execution

4. Separate slow tests out into a lazier testing layer

5. Run a Rails test server

I’d love to see the Rails community devote more of its enthusiasm for testing to the question of test speed. There’s nothing wrong with improving our test frameworks, and let’s keep doing that. But let’s also make these frameworks fast.

Pivotal Tracker > bug trackers

Posted by Jon
on Monday, February 02

I’ve used just about every defect tracking system there is. That includes Trac, Fogbugz, Lighthouse, a Rails-based Trac clone that I forget the name of, spreadsheets, note cards, and even Bugzilla. I haven’t tried Mingle, and I only used Jira for a few days, but I’ve got most of the other bases covered. Last year at Tumblon, we settled on Fogbugz as less painful than the other options. It’s a pretty good defect tracking system, with most of the right features, and it’s reasonably easy to use.

A month ago, I tried Pivotal Tracker for a new project. It blew everything else away.

I think there are two reasons for this.

First, it’s a story tracker, not a defect tracker. And when I’m building software, I need to track stories more than defects. Of course, Pivotal Tracker handles both, and so do most bug trackers. But bug trackers usually seem natural when I’m using them to track bugs, and unnatural when I’m trying to map out new development. Pivotal Tracker feels natural for both.

Second, it prevents sabotage. This is the real key. Software development projects are hard to get right. It’s really easy to unintentionally sabotage a software development project, and everyone on the team can do it. That’s why there’s a huge publishing and training industry around project management, and why most of us are interested in the software development process. (Things like Agile, XP, and Scrum are probably interesting to you and me, while CMM and CMMI are interesting to other people.)

Basically, Pivotal Tracker enforces a lightweight agile process, and makes this painless. If you use Tracker properly, you’ll write relatively atomic stories, estimate their difficulty, prioritize them, step each one through a simple workflow from new to accept/reject, track weekly or bi-weekly iterations, and see where you’ve come from and where you’re going. You can do this with most bug trackers, but most of them are missing one important thing.

Constrained velocity.

Pivotal Tracker requires that every feature get an estimate on a simple scale – 0-3 is the default. These are relative velocity points, not hours, and they give you an idea of how much you can accomplish in an iteration. After your first iteration, Tracker uses the average velocity of the previous few weeks as the predictor of your velocity for the current iteration. So if you did 17, 15, and 20 points over the last three weeks, Tracker thinks you can do 17 points next week. Chances are, this is a decent guess.
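Here is the arithmetic from that example as a quick sketch. I'm assuming a plain average rounded down, which matches the numbers above; I don't know exactly how Tracker rounds or how many iterations it looks back, and the predicted_velocity name is mine.

```ruby
# Average the points completed in recent iterations, rounding down.
# Integer division does the rounding; an empty history predicts 0.
def predicted_velocity(recent_points)
  return 0 if recent_points.empty?
  recent_points.inject(:+) / recent_points.size
end
```

With [17, 15, 20] the total is 52, and 52 / 3 rounds down to 17.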

The Current Iteration and the Backlog are a continuous list in Tracker, and the line between them is based on velocity. If you only have 12 points of work on an iteration, you can’t add stories to the next iteration – you still have capacity on the current iteration, so the stories will show up there. If you have 17 points scheduled for this week, and you try to add 2 more, something has to give. And this is the key.

By managing velocity in this way:

  • The product owner can’t shove an extra 5 points of work on an iteration just because he wants it to get done, at least not without recognizing that it will take more resources (i.e. more programmers).
  • The product owner can’t make every new feature the top priority just because it is new, at least without clearly seeing that other features will be delayed.
  • Developers are forced to estimate everything. You can’t start/finish/deliver a story until it has been estimated first. As long as estimates are optional, they won’t be done consistently.
  • Everyone has a decent idea of how much can be done over time. The velocity estimates aren’t perfect, of course, and that’s just fine. They don’t have to be perfect – they just have to be Good Enough. (For what it’s worth, our velocity over the last 4 weeks was 20, 15, 16, 17 – good enough that we can expect to do around 15-20 points/week.)

There are also some smart touches. For example, you can set the team strength for a given iteration – if half your team is going to be pulled off onto another project next week, or out of town at a conference, set the iteration strength to 50%. Tracker will halve the expected velocity. Or if you have another developer you can pull onto the project, increase the strength to 125%.
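As a sketch of what the strength setting does to the numbers (again assuming Tracker rounds down, which I haven't verified, and using a made-up adjusted_velocity method):

```ruby
# Scale a velocity by team strength, given as a percentage.
# Rounding down is an assumption, not documented Tracker behavior.
def adjusted_velocity(velocity, strength_percent)
  (velocity * strength_percent / 100.0).floor
end
```

So a 17-point velocity at 50% strength works out to 8.5, which this sketch rounds down to 8, and at 125% it works out to 21.25, rounded down to 21.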

Of course, Tracker isn’t perfect. It isn’t particularly client (Product Owner) friendly; I’d love to see a dumbed-down interface that is less intimidating to non-technical clients, but still has all of the features that they need (prioritization, Accept/Reject). And I’m not exactly sure where the QA role fits into the process – there is no “Verified by QA” state, so QA either needs to usurp the Accept/Reject role, or needs to be responsible for Delivery (deployment).

I’m also not sure how well it will work on large/long projects. After 4 weeks of work, we’ve added a total of 250 stories to the system, and 100 are still active (unfinished). It’s working fine now. But if we had 6 months of work mapped out in the Icebox, it might be a little hard to find things.

But Pivotal is actively working on improving Tracker, and even if they weren’t, it’s already better than most bug trackers.

Is your Rails application safe?

Posted by Eric Chapweske
on Monday, September 22

Rails provides many great security features. Its design can also create significant security holes. In the case of ActiveRecord’s mass assignment vulnerability, the security issues are more severe and widespread than many of us recognize.

Nearly every open source Rails application I’ve seen is vulnerable, and most closed source ones as well. There are some great solutions for protecting your application from attack, but first, the problem:

The Problem

By default ActiveRecord allows visitors access to any writer method, that is, any method ending with an equal sign. This comes courtesy of the ActiveRecord::Base#attributes= method, which is used internally by the main methods that handle creating and updating records, including new(), create(), and update_attributes().

The way most applications are designed means that whatever data a visitor sends to the server will likely find its way through the attributes=() method, and if not protected, ActiveRecord will happily update the records based on what was sent. In less technical terms: ActiveRecord is insecure by default.

As an example, let’s look at a request against vulnerable code:

# The request
$ curl -X PUT -d "order[price_in_cents]=0" example.com/orders/225

app/models/order.rb

class Order < ActiveRecord::Base
  # Table name: orders
  #  id          :integer(11)     not null, primary key 
  #  price_in_cents     :integer(11)
  #  user_id     :integer(11)     
  #  state       :string(255)               

  has_many :line_items
  
  acts_as_state_machine :initial => :pending

  state :pending
  state :paid
    
  def name
    ... 
  end
  
  def shipped_on=(shipping_date)
    ...
  end

end

app/controllers/orders_controller.rb

class OrdersController < ApplicationController
  ...

  def update
    @order.update_attributes(params[:order])
  end

end

Pop quiz: which Order instance methods are exposed to the world?
  • Attributes generated from its table: price_in_cents=, user_id=, state=
  • Attributes generated by association macros: line_item_ids=
  • Other defined writer methods: shipped_on=

Ruby’s dynamic nature and ActiveRecord’s changing API make this exercise more of a guess than anything else. Does Rails 2.1 dynamically generate different writer methods? Will Rails 2.2? How about the plugins and libraries the application relies on?

Theoretically, this isn’t a problem since ActiveRecord provides a solution out of the box: “Sensitive attributes can be protected from this form of mass-assignment by using the attr_protected macro. Or you can alternatively specify which attributes can be accessed with the attr_accessible macro.”
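To see why every writer method is exposed, here is a stripped-down sketch of the mechanism in plain Ruby. It is a simplification I wrote for illustration, not ActiveRecord's actual implementation, and VulnerableOrder is a made-up class. The real thing does type casting and much more, but the core move is the same: take each key the visitor sent and call the matching writer.

```ruby
# Made-up stand-in for a model with no mass-assignment protection.
class VulnerableOrder
  attr_accessor :price_in_cents, :state

  def initialize
    @price_in_cents = 4999
    @state = "pending"
  end

  # Calls whichever writer method each key names, with no filtering.
  def attributes=(attrs)
    attrs.each { |name, value| send("#{name}=", value) }
  end
end
```

Feed it a hash like {"price_in_cents" => "0", "state" => "paid"} and both values are overwritten, because nothing checks whether the visitor should be allowed to set them.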

The Reality

Naturally, professional developers experienced with the framework use attr_accessible/attr_protected and don’t suffer from these problems. As a quick poll, here are a few of the more popular open source code bases:

  1. Insoshi is second only to Rails as a top forked project on github. It’s a social networking app developed by a seed funded startup whose team includes the author of a very well-reviewed book on developing social applications in Rails.
  2. Mephisto, the Rails-based blogging application, which Railspikes runs on.
  3. Anonymous App is a large Rails project with seasoned developers. I’m withholding its details since it has security issues that are still being addressed.
  4. Rubyflow, the codebase of Peter Cooper’s very useful Ruby news aggregation site.
  5. Spree is a rapidly-maturing ecommerce project and powers RailsEnvy’s new screencast store.

Good projects. Professional developers. Every project except Mephisto is vulnerable. Any forum thread in Insoshi will raise exceptions and be unusable after the user_id is changed to a non-existent user. A similar approach worked on Anonymous and Rubyflow. Since these projects lacked any strategy for handling this kind of problem, it’s highly probable that much more damaging attacks exist. One example: Spree’s public exposure of the 'state' attribute allowed me to make my order appear as though it was paid for when I hadn’t even entered my payment information. While these projects vary in terms of risk, in each case the cost of solving this issue is cheap when compared to the cost of cleaning up after an attack.

I’m singling out these applications because they’re popular and open source, but every project I’ve developed has experienced the same security issues. The only thing that seems to change is how much data is vulnerable and how important it is. It’s a difficult problem to manage. Retrofitting security on existing code is a very unpleasant experience. It’s easy to forget when developing new applications. Educating other developers on the problem has proved unreliable.

As an aside, I’m impressed by the response of the developers on these projects. Insoshi, Rubyflow, and Spree addressed the issue almost instantly after being informed. It was a reminder to me of how lucky I am to be involved in such a passionate, professional community. Michael Hartl of Insoshi went so far as to write a mass assignment auditing plugin and offers some great advice on how he ended up tackling the problem.

A solution

  1. Don’t use attr_protected. I haven’t seen a compelling use case for it. Its functionality is confusing. It should probably be removed from ActiveRecord.
  2. Do use attr_accessible. Its white list approach forces an explicit decision on the mass assignability of attributes. A rule of thumb: if an attribute shouldn’t be in a user submittable form, it shouldn’t be accessible.
  3. Review and audit. Even with attr_accessible, a developer can still shoot themselves in the foot without code audits and reviews. Even if the application is secure today, holes will eventually be introduced into the code. In addition to peer review, automated auditing tools are a great, inexpensive way to find such security problems.
  4. Make it automatic. Disable mass assignment by default, requiring attr_accessible to be specified for each attribute. I’ve taken this approach on maybe 5 projects now. Here’s how to do it:

config/initializers/disable_mass_assignment.rb

ActiveRecord::Base.send(:attr_accessible, nil)

It’s worked quite well, with the exception of two cases where I had to retrofit it on larger applications. That was a nightmare. I’ve been tinkering with a plugin that aims to reduce some of the problems caused by attr_accessible, and make retrofitting a more pleasant experience. It’s not production ready, but I think there are some small improvements in it worth stealing.

The downside: it’s pretty much a guarantee that you’ll run into confusing bugs during development. This is a major problem for developers new to the framework, and is annoying for the more experienced. ActiveRecord used to raise exceptions in development when mass assignment was attempted with an inaccessible attribute. This was great, but there were a few complaints, and conflicts with ActiveResource, so the change was pulled.
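To make the whitelist idea from the list above concrete, here is a rough plain-Ruby sketch of what an attr_accessible-style filter boils down to. GuardedOrder, its ACCESSIBLE constant, and the shipping_note attribute are all invented for the example; this is not ActiveRecord's code. Note that it silently skips anything not on the list, which is exactly the behavior that causes the confusing bugs mentioned above.

```ruby
# Made-up stand-in for a model that whitelists mass-assignable attributes.
class GuardedOrder
  # Only these names may be set through attributes=.
  ACCESSIBLE = %w[shipping_note].freeze

  attr_accessor :price_in_cents, :state, :shipping_note

  def initialize
    @price_in_cents = 4999
    @state = "pending"
  end

  # Assign whitelisted attributes and silently skip everything else.
  def attributes=(attrs)
    attrs.each do |name, value|
      next unless ACCESSIBLE.include?(name.to_s)
      send("#{name}=", value)
    end
  end
end
```

Sensitive attributes can still be set one at a time in your own code (order.state = "paid"); they just can't arrive through a params hash.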

A better solution?

An alternative approach worth exploring is the route taken by Merb, which decided this is the controller’s problem, and has a plugin providing params_accessible functionality. There’s a similar plugin for Rails. This approach may be especially appreciated by developers who want to add some level of protection to an existing application, since less code needs to change.

I’ve hesitated to use this on applications that use ActiveRecord, which has a bad habit of making methods part of the public API when they should be privately scoped (those ending in _id, _ids, _count, most enumerables, etc.). Because of this, attr_accessible serves double duty by discouraging public use of writer methods that should be private. Not really the best excuse, and I’d like to give the params_protected approach a try on my next Rails project.

Regardless of the solution, the cost of designing applications to handle potential mass assignment abuse from the beginning is so much cheaper than attempting to retroactively address the issue. Rails should step up and encourage such design decisions. Whether it’s something as extreme as disabling mass assignment from the start, or an unobtrusive change like adding a commented out attr_accessible line in generated models, the risk shouldn’t be ignored.

Security Tools

There are a few other related tools that look promising for developing more secure code:
  • Tarantula: A fuzzing plugin that spiders your application looking for problems. Via Stuart Halloway’s post on Relevance’s blog: “It crawls your rails app, fuzzing inputs and analyzing what comes back. We have pointed Tarantula at about 20 Rails applications, both commercial and open source, and have never failed to uncover flaws.” Aaron Bedra’s Rails Security Audit PDF on Peepcode devotes significant space to getting this up and running. It also covers a few of the common mistakes developers can make when using a framework like Rails, and that alone may make it a worthwhile read.
  • ratproxy: Happened upon this on Google’s excellent security blog. From their announcement post: “[ratproxy] is designed to transparently analyze legitimate, browser-driven interactions with a tested web property and automatically pinpoint, annotate, and prioritize potential flaws or areas of concern.”
  • Audit Mass Assignment: Scans ActiveRecord models looking for potential mass assignment mistakes.
  • Find Mass Assignment: Searches controller actions for likely mass assignment, and then finds the corresponding models that don’t have attr_accessible defined.

Lambda, etc. in Ruby 1.9

Posted by Jon
on Monday, September 08

Ruby 1.9 introduces some improvements to Ruby’s lambdas. This is great, in my opinion; much of the power and beauty of Ruby comes from its combination of object oriented programming (the dominant paradigm) with functional programming (baked in quite deeply). I’m glad to see Matz et al committed to improving Ruby’s functional programming support.

Here are a few of the changes in Ruby 1.9.

1. Block arguments are local. This is an obvious choice and a nice improvement. In Ruby 1.8, block arguments clashed with local variables. For example:

i = "hello"
3.times { |i| puts i }
puts i
# 0
# 1
# 2
# 2

We see that the i variable was overwritten by the i block argument. You almost never want this to happen. And Ruby 1.9 fixes this.

i = "hello"
3.times { |i| puts i }
puts i
# 0
# 1
# 2
# hello

2. proc is now an alias of Proc.new. In 1.8, proc was an alias for lambda, and Proc.new was slightly different. An improvement, definitely – but why do we need both proc and Proc.new (plus blocks, plus method(:foo), plus the new stabby lambda)?
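The "slightly different" part is mostly about how strict each one is with arguments (return behaves differently too, but that's harder to show in a few lines). A quick sketch you can paste into irb on 1.9 or later; the variable names are just for illustration:

```ruby
loose  = proc   { |a, b| [a, b] }   # same as Proc.new in 1.9
strict = lambda { |a, b| [a, b] }

# A proc fills in missing arguments with nil.
loose_result = loose.call(1)

# A lambda checks its argument count, like a method does.
strict_error =
  begin
    strict.call(1)
    nil
  rescue ArgumentError => e
    e
  end
```

Here loose_result ends up as [1, nil], while the lambda call raises an ArgumentError. You can also ask either object directly with lambda?.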

3. New lambda syntax. Before I show the “stabby lambda” (as described by David A. Black at RailsConf Europe), why do we need another way to define a closure? Doesn’t Ruby already have enough, from blocks to Proc.new to proc to lambda to method()? (See Paul Cantrell’s classic closures in Ruby for more on this.)

The answer is that Ruby’s block/lambda syntax has a significant limitation. Arguments are defined between pipes like this


{|a,b| #code }

or

do |a,b|
# code
end

But block arguments have limitations that regular method arguments don’t. In Ruby 1.8, you couldn’t do the following, though it looks like you can in 1.9.


{|&b| #code }

But default parameter values are still out. This code doesn’t work in either 1.8 or 1.9:


{|a = 0| #code }

And that’s too bad. For blocks to be true anonymous methods, they should work like regular methods, which includes allowing default values for optional arguments.

Apparently, it would be really tough to get this to work with Ruby’s parser, because this would introduce the possibility of |a, b=c|d| (the middle pipe being an “or”), and the parser would get confused. I’m not a language designer, so I don’t know just how tough this problem is, but I would love to see it solved. Because the alternative is the “stabby lambda”.


a_function = ->(a, b=0) { # code }

If there’s one thing that Ruby doesn’t need, it’s a new lambda syntax. Again, see Paul’s article on the subject. And while I don’t necessarily mind the syntax of the stabby lambda – it’s among the uglier things in Ruby, but I could live with it – I really wish we didn’t need another lambda syntax to fix a shortcoming of the Ruby parser.
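For reference, here is a small runnable version of the stabby lambda with a default argument, assuming Ruby 1.9 or later. The name add is only for illustration.

```ruby
# Stabby lambda with an optional second argument.
add = ->(a, b = 0) { a + b }

one_arg  = add.call(5)    # b falls back to its default of 0
two_args = add.call(5, 2) # b is supplied explicitly
```

So one_arg is 5 and two_args is 7, which is the behavior you would expect from a regular method defined as def add(a, b = 0).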

There may be a second reason for this -> operator – syntactic sugar. Some (including Dave Thomas) consider the stabby lambda syntax to be more clear when passing two anonymous functions to a method. Ruby methods only allow a single argument to be received as a block, like this:

def some_method(&b)
  b.call
end

some_method do 
  puts "hello world"
end

You can’t, for instance, pass two such blocks to a method, like:

def some_method(&a, &b)
  a.call
  b.call
end

some_method { puts "first block" } do 
  puts "hello second block"
end

Instead, if you want to pass two such functions to a method, you have to explicitly pass one as a lambda or proc, like so:

def some_method(a, &b)
  a.call
  b.call
end

some_method lambda { puts "first block" } do 
  puts "hello second block"
end

Notice the lambda keyword? It’s reasonably clear what is going on here, but the “method_name lambda {}” syntax is a little funny. So as a possible bonus, the stabby lambda enables a shorter, and (possibly?) more clear syntax:

some_method -> { puts "first block" } do 
  puts "hello second block"
end

I have mixed feelings about whether this is an improvement or not. Personally, I embrace the lambda keyword, because doing so has helped me to better understand the language and connect Ruby to my explorations in functional programming (like SICP). So if I had to pass two anonymous functions to a method, I’d probably stick with the lambda keyword in Dave’s first example, but parenthesized to make it clear that it’s just an argument:


some_method(lambda { puts "first block" }) { puts "second block" }

There are other minor changes to procs, blocks, and lambdas in Ruby 1.9. See this Ruby 1.9 changelog at Eigenclass for more details, including dozens of other non-lambda-related changes to Ruby 1.9, like fibers, enumerators, new methods, and syntax changes.

Disabling ActiveRecord query caching when needed

Posted by Luke Francl
on Monday, August 18

In Rails 2.0 and later, all requests are wrapped in a block that enables query caching.

What this means is that if you execute the exact same query more than once in a single request, the previous results of the query will be returned instead of fetching them from the database again.

Controller actions are wrapped with this automatically, but you can also enable it elsewhere like this:

User.cache do
  # do stuff with caching turned on.
end

However, sometimes you do not want this to happen. For example, if you want to fetch random records from the database, having this cached will cause you to get the same record each time you query.

Fortunately, the cache is easy to disable for parts of your code, with the uncached method (see also):

class User < ActiveRecord::Base
  def self.random
    # query for example purposes only -- 
    # ordering by rand() is slow, see here: 
    # http://jan.kneschke.de/projects/mysql/order-by-rand
    uncached do 
      find(:first, :order => "rand()") 
    end
  end
end

Disabling the cache only affects the code within the block, so unlike clearing the cache (which would also work) the rest of your code will still get the benefit of the query cache.

MapReduce, with inspiration from functional programming

Posted by Jon
on Thursday, August 14

MapReduce is the architecture that Google uses to do things like index the web and calculate PageRank. It’s a somewhat popular topic for developers and bloggers, for two reasons. First, because Google uses it to such dramatic effect, and it’s easy to think that it must be the greatest and most powerful way to handle distributed processing. Second, it is a little hard to understand at first, which means that there is always a market for “intro to MapReduce” blog posts.

The thing is, there is nothing magical about MapReduce. It is fairly simple on the surface, once you understand a few basic concepts (like map and reduce, though MapReduce != map + reduce, as we’ll see soon). It also isn’t the “best” approach to distributed processing, because there are so many types of problems that need distributed processing, and MapReduce is only appropriate for a small subset.

So this post isn’t about DIY MapReduce. Instead, it’s about understanding MapReduce. Specifically, understanding it from a languages standpoint, with reference to map and reduce, rather than understanding it from a systems standpoint.

Iterators and MapReduce

If you understand map and reduce, you’re about a third of the way to understanding MapReduce. But it is also important to note that MapReduce doesn’t even strictly need map or reduce functions, and is implemented at Google in C++ (not exactly a functional language). So map and reduce are more the conceptual foundation of MapReduce, rather than the underlying code.

But still, the MapReduce framework gets its name from these higher-order functions, and the basic pattern is simple:

  1. Split a large problem into smaller chunks, and perform a function on each chunk (map)
  2. Aggregate the output (reduce)

Because map just applies a function to an element in an array, with no side effects, order doesn’t matter. You could run map backwards or forwards, and the result would be the same. Therefore, map operations can easily be parallelized over different CPU cores, or across multiple machines. (This is one of the advantages we get from greater abstraction. We could replace map with an each iterator, reduce, or even a for or while loop, but by using map, we know immediately that parallelization is possible.)
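A rough sketch of why this independence matters, using plain Ruby threads. The data is hypothetical, and on MRI the threads will not truly run Ruby code in parallel; the point is only that each chunk can be handed off separately and the results are the same regardless of order.

```ruby
# Hypothetical data: four chunks of numbers standing in for pieces of a large problem.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Sequential map: one chunk after another.
sequential = chunks.map { |chunk| chunk.sum }

# Distributed map: one thread per chunk, then collect each thread's result in order.
threads  = chunks.map { |chunk| Thread.new { chunk.sum } }
parallel = threads.map(&:value)

# Working through the chunks backwards gives the same per-chunk results.
backwards = chunks.reverse.map { |chunk| chunk.sum }.reverse
```

All three arrays come out as [6, 15, 24, 33]. The same could not be said of a loop that updates shared state as it goes.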

MapReduce also parallelizes its reduce stage, even though reduce is not inherently parallelizable. It does this not by parallelizing a single reduce, but by distributing each reduction to a different machine. So while the map stage is a single map, distributed over several computers, the reduce stage is multiple reduces, each operating on a single machine, and each independent of the reduction happening on the next machine.

An example: from local reduce to MapReduce

What kind of problem can be solved with MapReduce? Counting words is a basic example. Let’s say you want to count the number of occurrences of each word in War and Peace. You could do this simply in terms of inject, returning a hash of key-value pairs that list a word and its number of instances, like {"war" => 225, "peace" => 341}.

words = File.read("/path/to/war_and_peace.txt").split
word_counts = words.reduce(Hash.new(0)) do |results, word|
  results[word] += 1 # string keys, as in {"war" => 225, "peace" => 341}
  results
end

This code is not parallelizable. Fortunately, it only takes Ruby about a second to count the instances of each word in War and Peace, which means that distributed MapReduce is only needed for much larger problems. But what if we wanted word counts of every book in Project Gutenberg? Or of every page on the entire internet? Or what if our calculation function took longer? There are 574,780 total words in the English translation of War and Peace that I’m using; if each word took a second to process, due to a network call or a complex calculation, it would take about 6.7 days to process the book. Proust would take three weeks. Yikes!

That’s where MapReduce comes in. Instead of processing the entire list with a single reduce, imagine splitting the text of War and Peace into 200 even chunks. These chunks would then be mapped to 200 different servers, with each server doing its own parallel word counting, like this:

word_chunks.map do |chunk|
  assign_to_server(count_words(chunk)) #[{"the" => 1}, {"cat" => 1}, {"the" => 1}, {"dog" => 1}]
end

In reality, MapReduce works slightly differently; each chunk is represented as the value in a key-value pair, with the key being the identifier for that chunk (like 1..200 when using 200 even chunks, or the ID of a Google File System cluster, or a filename when each server gets a different file). This key is useful for managing a MapReduce operation; if server XYZ goes down, the master program knows that XYZ was handling chunk 19, and we can process chunk 19 again. So what we have is more like this:

word_chunks.each do |chunk_key, words|
  assign_to_server(count_words(chunk_key, words))
end

You might be asking yourself: if the map phase always creates values of “1” attached to a particular word, so a mapper might end up with [{"the" => 1}, {"the" => 1}, {"the" => 1}], why even use a hash? Why not just create a big nested array of words ([["the","the","the"]]), group by word, and count the elements in each array?

Well, first, MapReduce isn’t always used for word counts, so in another use, the values returned by map might be more significant. Second, this lets us introduce a “combiner” stage between map and reduce to optimize the process. This stage creates a local count for a particular server, which reduces bandwidth and makes life easier for the reduce stage. Without this stage, a mapper might return [{"the" => 1}, {"cat" => 1}, {"the" => 1}, {"dog" => 1}, {"the" => 1}], leaving it up to whichever reducer handles “the” to sum these numbers (along with other instances of “the”). But the combiner creates local sums, meaning the mapper will actually return [{"the" => 3}, {"cat" => 1}, {"dog" => 1}].

When all the distributed word counts finish, the results are grouped by key. So all the “cat” key/value pairs are grouped into one list, and the “dog” key/value pairs are grouped into another list. These are then reduced for a final word count. Our reduce pseudocode might look something like this.

grouped_results.each do |key, values|
  #key: "cat"
  #values: [1,3,12,9,1,2]
  total[key] = values.sum
end

These reductions can be distributed, so each pass through grouped_results can be handled by a different server, because the summing of instances of “war” is completely independent of the summing of instances of “peace”.
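To tie the stages together, here is a single-process sketch of the whole pipeline: chunk, map with a local combiner, group by key, and reduce. Nothing here is distributed, the input text is made up, and count_words is a stand-in for whatever a real mapper would do; it is only meant to show how the data changes shape at each stage.

```ruby
# Hypothetical input: a short text split into chunks of four words, keyed by chunk id.
text = "the cat saw the dog and the dog saw the cat"
word_chunks = {}
text.split.each_slice(4).with_index(1) { |words, id| word_chunks[id] = words }

# Map stage with a local combiner: each "server" counts the words in its own chunk.
def count_words(words)
  words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

mapped = word_chunks.map { |_chunk_key, words| count_words(words) }

# Group stage: collect the partial counts for each word into one list.
grouped_results = Hash.new { |hash, key| hash[key] = [] }
mapped.each do |partial|
  partial.each { |word, count| grouped_results[word] << count }
end

# Reduce stage: each word's list is summed independently of every other word.
total = grouped_results.map { |word, counts| [word, counts.sum] }.to_h
```

With this input, grouped_results["the"] is [2, 1, 1] and the final count for "the" is 4. Because each key's list is summed on its own, the last step is the part that could be farmed out to separate reducers.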

What’s next?

There is more to MapReduce than this; other pieces handle the fault tolerance, the grouping of keys between the map and reduce stages, etc. And the actual distribution of processing introduces a lot more complexity. If you actually want to use MapReduce, take a look at Hadoop, or a few Ruby distributed processing systems inspired by MapReduce (Skynet, Starfish).

Hopefully this post will give you a conceptual understanding of MapReduce; it’s an interesting and powerful architecture. Just remember that it is not the end-all of distributed processing, and just because it’s appropriate for Google doesn’t mean it is appropriate for you. In fact, MapReduce can only handle a certain array of problems; if you want to distribute video transcoding across multiple machines, for example, MapReduce can’t really help you. Keep in mind too that you shouldn’t reinvent the wheel – if Hadoop can help you, it’s already built and built well. But it never hurts to understand something new.

More Resources

Cluster Computing and MapReduce (video series)

MapReduce whitepaper

Hadoop

Using Ruby with Hadoop

Understanding map and reduce

Posted by Jon
on Monday, August 11

Map and reduce are two of the most important internal iterators in functional programming. But in my experience as a Ruby developer, while map is frequently used, it should be used a bit more; and reduce (== inject) is underused and often misunderstood.

So how do you know when to use map or reduce on a collection? Simple. When iterating through an array, if you don’t want a return value from the operations, use each; and if you’re looking for a return value, use the iterator method that delivers the type of value you want returned. So if you want to take a collection and return a subset of that collection based on some criteria, use select. (See an earlier article for more.) If you want to return a transformed version of each element, use map. And if you want to return any value whatsoever, or a value that doesn’t match another iterator method, use reduce.

As an aside, do reduce and map have anything to do with the MapReduce architecture for distributed processing? Not surprisingly, the answer is “yes,” and I’ll talk more about that later this week.

inject, reduce, fold

One function, three names. If you’re a Ruby user and have access to Ruby 1.8.7, I suggest you forget the name inject altogether; I find it confusing, personally, and moving forward, inject has another name: reduce. This is much better, and I’ll discuss terminology in a minute (along with a third common name for this function: fold).

reduce takes in an array and reduces it to a single value. It does this by iterating through a list, keeping and transforming a running total along the way. This running total can be a single value (0, 3.7, “abcdefg”), a collection ([], {}), or anything else, really. Each iteration starts with the return value of the previous iteration and does something with it.

Formally, reduce takes three arguments: a collection, an initial value (which is used on the first pass), and a function to apply at each pass through the collection. Here is a Ruby example that uses reduce to sum a series of numbers:

(5..10).reduce(0) do |sum, value|
  sum + value
end

Let’s walk through this example in more detail. Here are the three arguments passed to reduce in this example:

  • Collection: 5, 6, 7, 8, 9, 10
  • Initial Value: 0
  • Function: add current value to running total

Pass # | Collection Value | Running Total         | Return Value
1      | 5                | 0 (initializer value) | 5
2      | 6                | 5                     | 11
3      | 7                | 11                    | 18
4      | 8                | 18                    | 26
5      | 9                | 26                    | 35
6      | 10               | 35                    | 45

The return value from this function will be 45. At each pass, the function takes two values: the current element in the array, and the return value from the previous pass (or the initializer value for the first pass). (This is the |sum, value| part of the Ruby example.)

What would this example look like using each instead of reduce?


sum = 0
(5..10).each do |value|
  sum += value
end
sum

Any time you see this (anti)pattern – initializing a variable, looping to change the variable, and returning the variable – you know you need a new collection function. In this case, reduce does the trick.

Digression on blocks

Personally, while Ruby’s block syntax makes code beautifully readable, I sometimes have trouble keeping track of how this syntax relates to a straightforward functional syntax. After all, I described reduce as taking three arguments: a collection, a starter value, and a function. But in the Ruby example above, I’m only passing one argument (0) to reduce. So if it helps, here is another way to think about reduce, in pseudo-scheme.


(reduce + 0 (range 5 10))

Here we’re explicitly passing three arguments to reduce: + (the addition operator), 0 (the seed value), and the range of numbers from 5 to 10 (our collection). Remember that (5..10).reduce(0) {|sum, value| sum + value } does exactly the same thing, just rearranged a bit.
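Ruby can get closer to the Scheme form than the block version suggests. A brief sketch, assuming Ruby 1.8.7 or later for the symbol form and 1.9 or later for the lambda; the name add is just for illustration:

```ruby
# Block form, as in the earlier example.
with_block = (5..10).reduce(0) { |sum, value| sum + value }

# Operator form: the seed (0) and the function (+) are both explicit arguments.
with_symbol = (5..10).reduce(0, :+)

# The function can also be a first-class object, passed in place of the block.
add = ->(sum, value) { sum + value }
with_lambda = (5..10).reduce(0, &add)
```

All three return 45. The collection is the receiver rather than an argument, but the seed and the function are now visible in the call itself.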

Back on track

Let’s look at a slightly more complicated case. reduce can be used to implement just about any other collection function, from map to sort to select. Here is a way to emulate select using reduce.

(1..10).reduce([]) do |result, value| 
  result << value if value > 5
  result
end
# [6, 7, 8, 9, 10]

You can also emulate map with reduce, like this:

(1..10).reduce([]) do |result, value| 
  result << value * value
  result
end
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Of course, you wouldn’t want to do this. Whenever possible, you’re generally better off using a more specific function, like map in this case. If you want to sum numbers, use a sum function instead of reduce. If you want a hash, try build_hash. (I say “generally”, because there are also diminishing returns – creating a new reduce-style iterator for every possible use of reduce is overkill. Use your judgment.)

But this shows you the power of reduce; reduce can be used to implement any other internal iterator. Any time you want to take a collection and return something else – a value, another collection, etc. – reduce is capable.

Why 3+ names?

This function has three names: “inject”, “reduce”, and “fold”. All make sense from one perspective.

  • fold is used by Haskell, Scheme, and OCaml. This name highlights the fact that this function “folds” the return value of one pass into the next pass. Actually, this function is really divided into fold-left and fold-right, referring to the direction of the reduction. Do you start at the left of the list, moving right, or go from right to left? For operations that are both associative and commutative (like addition), it doesn’t make a difference: 1 + 2 + 3 == 3 + 2 + 1. But for operations where order matters, like division, exponentiation, and string concatenation, the direction matters: 1^2 != 2^1, and “a” + “b” != “b” + “a”.
  • reduce, used by Common Lisp, Python, JavaScript, and now Ruby, describes the ultimate goal of the function: reduce a collection to a single return value. But keep in mind that the single return value can be a collection. So reduction has nothing to do with size – a reduce function called on a 10 element array could return a 100 element array, or it could return a single integer, or a hash, or something else.
  • inject, the Smalltalk name for this function (and the dominant Ruby name until recently), is my least favorite. I think it refers to “injecting” the return value of the previous function call into the next function call, but I could be wrong.
  • If it helps, you can even think of this function as accumulate, which is what C++ calls it. This name is generally appropriate; accumulate some return value by iterating through a collection. Just remember that there isn’t actually a “global” accumulated variable that is carried over through each function call, and returned at the end; each pass just folds its return value into the next pass. That’s it.
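The fold-left versus fold-right distinction is easy to see with a quick experiment. Ruby only ships a left fold, so fold_right below is a hypothetical helper written for this sketch, not a built-in.

```ruby
# Left fold: Ruby's reduce works from the first element to the last.
def fold_left(list, &fn)
  list.reduce { |acc, x| fn.call(acc, x) }
end

# Right fold (hypothetical helper): work from the last element back to the first,
# keeping each element on the left-hand side of the operator.
def fold_right(list, &fn)
  list.reverse.reduce { |acc, x| fn.call(x, acc) }
end

numbers = [8, 4, 2]

sum_left   = fold_left(numbers)  { |a, b| a + b } # (8 + 4) + 2
sum_right  = fold_right(numbers) { |a, b| a + b } # 8 + (4 + 2)
quot_left  = fold_left(numbers)  { |a, b| a / b } # (8 / 4) / 2
quot_right = fold_right(numbers) { |a, b| a / b } # 8 / (4 / 2)
```

Addition gives 14 in both directions, while division gives 1 from the left and 4 from the right.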

So that’s reduce. If you’re having trouble getting your mind around it, I recommend reading up a bit more, because it is an important concept. It is also important to understanding MapReduce.

map

map takes an array, applies a function to each element, and returns a new array with the results. Here is its equivalent using each.

email_addresses = []
users.each do |user|
  email_addresses << user.email
end
email_addresses

We can improve upon this using map.

users.map do |user|
  user.email
end

This is quite a bit simpler than reduce, and I’m not going to spend much time on it. If you’re an experienced Ruby programmer, you’ve probably used map hundreds of times. If it’s new to you, just remember that map takes an array and returns an array of exactly the same size. And think of some practical uses of map:

  • Convert [1,2,3,4,5] to ["one", "two", "three", "four", "five"]
  • Convert ["Jon Dahl", "Luke Francl", "Eric Chapweske"] to ["Jon", "Luke", "Eric"]
  • Convert ["72%", "1%", "50%"] to [0.72, 0.01, 0.5]
  • Convert Tag.find(:all) to ["<span class='small-tag'>Ruby</span>", "<span class='large-tag'>Merb</span>", "<span class='small-tag'>Perl</span>"]
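The first three conversions in this list might look something like the following. The number_names lookup table is an assumption made for the sketch; any number-to-word helper would do.

```ruby
# Hypothetical lookup table for the first conversion.
number_names = { 1 => "one", 2 => "two", 3 => "three", 4 => "four", 5 => "five" }

names_of_numbers = [1, 2, 3, 4, 5].map { |n| number_names[n] }
first_names = ["Jon Dahl", "Luke Francl", "Eric Chapweske"].map { |name| name.split.first }
fractions   = ["72%", "1%", "50%"].map { |percent| percent.to_f / 100 }
```

In each case the output array has exactly as many elements as the input, which is the telltale sign that map is the right tool.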

Other functions

These aren’t the only important iterator functions, by any means. But map, reduce, and select are among the most important. Get them solidly under your belt, and you’ll write better code. They’ll also help you from a conceptual standpoint; MapReduce isn’t exactly map + reduce; it can even be implemented in languages that don’t have map or reduce capabilities. But they form the conceptual foundation of MapReduce, and MapReduce works because of specific properties of map and reduce. More on that later this week.

Functional programming and looping

Posted by Jon
on Tuesday, July 29

If you’re a programmer, you’ve probably worked through one or more books teaching you the syntax of a new language. I’ve had this experience with half a dozen languages, like C, JavaScript, and Perl. These books typically introduce loops midway through the syntax discussion, after datatypes and control flow, but before I/O and advanced features.

Loops are almost always presented according to this formula.

  • Inane intro text: “what if you want to do an operation more than once”?
  • Introduce while loop, with difference between do while and while do.
  • Introduce for loop, the while loop’s crazy cousin.
  • (Bonus) Introduce foreach loop if language is sufficiently high-level.

And that’s it – you know how to loop through code; time to move on.

Not so fast. If you’re lucky enough to use a language that draws from functional programming, you shouldn’t loop like this.

The point

From now on, I’m going to use Ruby for examples, but this article isn’t about Ruby. It is about transitioning from primitive loops to iterating through collections, and from generic collection functions (like each) to more specific functions (like map).

From loops to array traversal

For the last several months, I’ve been working on Tumblon, a medium-sized Rails application. I’ve worked on 15-20 Ruby applications over the last three years, probably totaling 50,000 lines of Ruby code.

I’ve only used a primitive loop once.

That primitive loop was a loop {} loop, forever polling a task list looking for jobs. In other words, a loop with no exit condition beyond ^C or a server crash. As far as I know, Ruby doesn’t have a C-style for loop at all, which would explain why I haven’t used it. It has a foreach loop (for item in arr), but that’s essentially syntactic sugar for arr.each {}.

So the first reason why I’ve only used a simple loop in one case: the each concept is usually a better option. Its Ruby implementation will be familiar to anyone who’s seen Ruby code before:

["horse", "pig", "cow"].each do |animal|
  puts "Old MacDonald has a #{animal}"
end

(Yes, I have a small child.)

This is far cleaner than its for or while loop alternatives. And it is a better abstract representation of what we’re doing: we aren’t looping with an exit condition, we are iterating through an array. But what if you want to do something a fixed number of times? Even that can be understood as traversing a list, like [1,2,3,4,5,6,7,8,9,10].each {}. Of course, Ruby provides a cleaner version: 10.times {}.

So if your loop is working through a list of some sort, each is a better abstraction of the problem. And in my experience building Ruby applications, every loop but one has been traversing a list. Parsing XML? Traversing a collection. Summing numbers? Traversing a collection. Reading in a textfile? Listening to STDIN? Working with rows in a database? Traversing a collection. That’s what each loops do well.

Beyond arr.each

But each isn’t the final word. It is a step up from a primitive for or while loop when working with a collection of values, but many each loops should be replaced with other array methods, like map, inject, and select.

When is each useful? Simple: when you want to create side-effects, like saving to the database, printing a result, or sending a web service call. In these cases, you’re not concerned with the return value; you want to change state on the screen, the disk, the database, or something else. Take a look at this code.

User.find(:all).each do |user|
  Notification.deliver_email_newsletter(user)
end

You don’t need a return value from this – you need emails to be delivered.

But don’t use each if you want to extract some new value from an array. That’s not what it’s for. Instead, take a look at three other powerful functions: map, inject, and select. To see why, let’s take a look at select. Here is code that takes in an array, and creates a new array from elements that match a certain condition, using each.

active_users = []
users.each do |user|
  active_users << user if user.active?
end
active_users

Man, the first and last lines are ugly. Why do you have to initialize and return active_users? Answer: because this is a misuse of each. You are much better off using select (or its equivalent, find_all):

users.select do |user|
  user.active?
end

Using select is shorter, easier to understand, and less bug-prone. And more importantly, it clearly encapsulates one common use of each (and looping in general).

Two other key functions – map and inject (or reduce) – complement select and follow a similar pattern. And not surprisingly, they form the foundation of the MapReduce approach to distributed processing. I’ve written more about map and reduce in another article, and here is shorthand for knowing which of these functions to use:

Desired Return Value                             | Function
New array with same number of values             | map
New array composed of part of the old array      | select
Single value (though this value can be an array) | inject
none                                             | each
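Here is the table again as code, one line per row. The user_class struct and its fields are invented for this sketch and stand in for a real model.

```ruby
# Hypothetical user records standing in for a real model.
user_class = Struct.new(:email, :active, :logins)
users = [
  user_class.new("jon@example.com",  true,  3),
  user_class.new("luke@example.com", false, 5),
  user_class.new("eric@example.com", true,  1)
]

emails       = users.map    { |user| user.email }                # same number of values
active_users = users.select { |user| user.active }               # part of the old array
total_logins = users.inject(0) { |sum, user| sum + user.logins } # single value

# each is for side effects only; here the "effect" is appending to a log.
delivered = []
users.each { |user| delivered << "newsletter to #{user.email}" }
```

None of the first three lines needs a variable initialized beforehand or returned afterwards. Only the each version does, because its whole purpose is the side effect.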

The point, redux

Use each for changing state. Otherwise, avoid side-effects and use “functional” array methods that return a value. Simple. Your code will be cleaner and less bug prone.

And remember the dead giveaway:

  1. Initialize an empty value, or array, or whatever (new_arr = [])
  2. arr.each, changing the initialized value
  3. Return the value (return new_arr)

Whenever you see this pattern, you know you’ve got an each loop that needs swapping out.

(Edit: I’ve posted a follow-up article with more about map and reduce.)

Why programmers should play Go

Posted by Jon
on Monday, July 14

Go is an ancient strategy game with simple rules and a profound degree of complexity.

Software development is the art of managing complexity using a limited number of rules, structures, and patterns.

Programmers should play Go.

Go in 28 words or less.

The beauty of Go is its combination of simplicity and complexity. On the one hand, Go has only a handful of rules. Place stones, don’t get completely surrounded, control territory. Like chess, the mechanics can be picked up in a few minutes, though Go only has a single type of “move”, and only one edge case (the ko rule). And like chess, one can spend a lifetime discovering the strategic and tactical layers of the game.

While chess is quite complex and rich, such that it took a 30-node supercomputer to defeat the reigning chess champion, no computer comes close to defeating even a skilled amateur Go player. There are 361 positions on a Go board, and with two players, there are 2.08168199382×10^170 valid positions. That’s quite a bit bigger than a googol (yes, that is the correct spelling). Realistically, there are on the order of 10^400 possible ways that a typical game could play out. And the number of possible move sequences roughly follows 361!, which means that only about 40 moves in, there are already more than a googol possible ways that the game could shake down. (As a fun exercise, try plugging 361! into an online factorial calculator.)
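No online calculator is needed, since Ruby handles arbitrarily large integers. The sketch below uses the same rough model as the text (361 choices for the first move, 360 for the second, and so on, ignoring captures and illegal moves) to check how quickly the number of move sequences grows; sequences_after is a made-up helper name.

```ruby
googol = 10**100

# Rough model: the first move has 361 choices, the next 360, and so on.
def sequences_after(moves)
  (361 - moves + 1..361).reduce(1, :*)
end

after_20_moves = sequences_after(20)

# Smallest number of moves at which the count of move sequences exceeds a googol.
moves_to_pass_googol = (1..361).find { |moves| sequences_after(moves) > googol }

digits_in_361_factorial = sequences_after(361).to_s.length
```

Under this model the count is still below a googol after 20 moves and passes it at move 40, and 361! itself runs to hundreds of digits.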

Managing complexity

So how does one play Go, given this near-infinite complexity? On a tactical level, a player approaches Go like chess, thinking several moves ahead. But this only works in small spaces, like a tight battle in a small sector of the board. Beyond there, there are just too many possibilities. So on a strategic level, a player must think in shapes or patterns. These shapes provide shorthand ways of managing the complexity of Go. As a non-master, I may have no idea how things will proceed in one sector of the board, but I may be able to recognize strong and weak patterns of stones, vulnerable shapes and effective formations.

But there’s more: Go has several sorts of patterns. Beyond shapes, there are Go proverbs. These can be general: “Your opponent’s good move is your good move”; specific: “Don’t try to cut the one-point jump”; funny: “Even a moron connects against a peep”; and meta: “Don’t follow proverbs blindly.” These proverbs are principles which help a player make good decisions. They are less specific than shapes, and so they provide guidelines for whatever situations may arise on the Go board. Proverbs often conflict, and a player must determine when and how to apply them.

Finally, there are joseki. Joseki are patterns of play that are considered even for both sides. They typically happen in the corners of the board, and typically at the beginning of the game. Interestingly, there is a Go proverb that says “Learning joseki costs two stones,” meaning that memorizing these patterns isn’t helpful. Instead, a player should learn from joseki by understanding what is going on in each move.

Patterns in Go, patterns in software design

Each of these Go patterns has a rough programming analogue.

Shapes in Go aren’t unlike software design patterns. While there is nothing preventing you from placing logic in your views, this shape is recognized to be a weak one. Think of Gang-of-Four design patterns: the MVC, Adapter, and Factory patterns are recognized to be helpful in some circumstances (and not appropriate in others). On a lower level, iteration and recursion have commonly recognized shapes, as do database normalization vs. denormalization. Even if you can’t hold an entire program or algorithm in your head at once, recognizing common shapes helps you to understand what is going on.

Go proverbs are like another type of pattern in software: CapitalizedPrinciples (for lack of a better term) made popular by Extreme Programming. Think DontRepeatYourself, YouArentGonnaNeedIt, CollectiveCodeOwnership, DailyBuild, TestFirst. These aren’t specific code “shapes”, like a singleton class – they are general principles that guide the practice of programming.

Because joseki is about exchange between competing parties, its programming parallel is a little less clear. The closest comparison, in my mind, is programming exercises. This article, for instance, suggests 9 exercises to help you become a better OO programmer, like:

  • Use only one dot per line
  • Use only one level of indentation per method
  • Don’t use setters, getters, or properties

In a real-world program, you’re unlikely to stick to these principles 100% of the time. But forcing yourself to write code in this way can be an eye-opening experience and can make you a better developer.

So what can Go really do for you?

Obviously, these parallels are structural. Specific Go proverbs (“Your opponent’s good move is your good move”) may not have direct relevance to software development. So can Go really make you a better developer?

I think it can, and I’ll go one further. I think Go can make you smarter. There is a lot of anecdotal evidence to this effect [1] [2] [3], for example [4]:

“In fact, all of our minds can benefit from playing Go, which officially has the capacity to make you smarter. Research has shown that that children who play Go have the potential for greater intelligence, since it motivates both the right and left sides of the brain.”

The research mentioned isn’t footnoted, unfortunately, so take statements like this with a grain of salt.

But it makes sense: like chess, Go requires pattern recognition, a mix of strategic and tactical thinking, and comprehension of complex structures, though in Go the patterns are larger and the complexity is greater. A mind trained to think in these ways is going to have an easier time attacking similar problems in other spheres.

Like software development.

Image by andres_colmen: http://flickr.com/photos/andres-colmen/2539473895/

Testing is overrated

Posted by Luke Francl
on Friday, July 11

Next week at RubyFringe, I’ll be taking on one of the programming world’s favorite topics: testing.

Hear me out. Like everyone who’s had their bacon saved by a unit test, I think testing is great. In a dynamic language like Ruby, tests are especially important to give us the confidence our code works. And once written, unit tests provide a regression framework that helps catch future errors.

However, testing is over-emphasized. If our goal is high-quality software, developer testing is not enough.

This is important because of what Steve McConnell calls the General Principle of Software Quality: improving quality reduces development costs. Most development time is spent debugging. “Therefore, the most obvious method of shortening a development schedule is to improve the quality of the product.” (Code Complete 2, p. 474.)

Problems with developer testing

Developer testing has some limitations. Here are a few that I’ve noticed.

Testing is hard...and most developers aren’t very good at it!

Programmers tend to write “clean” tests that verify the code works, not “dirty” tests that test error conditions. Steve McConnell reports, “Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it’s done by creating 25 times as many dirty tests.” (Code Complete 2, p. 504)
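As a rough illustration of the difference, here is a small Minitest sketch with one clean test and five dirty ones. The `parse_quantity` helper is hypothetical, invented for this example, and the choice of dirty cases is an assumption about what could go wrong with an order form.

```ruby
require "minitest/autorun"

# Hypothetical helper, invented for this illustration: parses a quantity
# typed into an order form and rejects anything that is not a positive integer.
def parse_quantity(input)
  # Integer(..., 10) raises ArgumentError for empty or non-numeric text.
  quantity = Integer(input.to_s, 10)
  raise ArgumentError, "quantity must be positive" unless quantity.positive?
  quantity
end

class ParseQuantityTest < Minitest::Test
  # Clean test: confirms that the expected input works.
  def test_parses_a_plain_number
    assert_equal 3, parse_quantity("3")
  end

  # Dirty tests: probe the inputs a hurried user or an upstream bug might supply.
  def test_rejects_nil
    assert_raises(ArgumentError) { parse_quantity(nil) }
  end

  def test_rejects_empty_input
    assert_raises(ArgumentError) { parse_quantity("") }
  end

  def test_rejects_non_numeric_text
    assert_raises(ArgumentError) { parse_quantity("three") }
  end

  def test_rejects_zero_and_negative_numbers
    assert_raises(ArgumentError) { parse_quantity("0") }
    assert_raises(ArgumentError) { parse_quantity("-2") }
  end

  def test_rejects_fractional_numbers
    assert_raises(ArgumentError) { parse_quantity("2.5") }
  end
end
```

The clean test took no thought to write. Each dirty test forced a decision about what the code should do with bad input.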

You can’t test code that isn’t there

Robert L. Glass discusses this several times in his book Facts and Fallacies of Software Engineering. Missing requirements are the hardest errors to correct, because often only the customer can detect them. Unit tests with total code coverage (and even code inspections) can easily fail to detect missing code. Therefore, these errors can slip into production (or your iteration release).

Tests alone won’t solve this problem, but I have found that writing tests is often a good way to suss out missing requirements.
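A tiny sketch of how full coverage can coexist with missing code. The `order_total` method and the free-shipping rule are both hypothetical, made up for this illustration.

```ruby
# Hypothetical pricing code, invented for this illustration.
# Assumed (unwritten) requirement: orders over 100.00 ship free.
def order_total(subtotal, shipping)
  subtotal + shipping
end

# This check executes every line of order_total, so line coverage is total...
raise "unexpected total" unless order_total(20.0, 5.0) == 25.0

# ...yet nothing fails for the case the customer cares about, because the
# free-shipping branch was never written and no test asks for it.
puts order_total(150.0, 5.0) # still charges shipping on a large order
```

A coverage tool has nothing to flag here: there is no untested line, only an unwritten one.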

Tests are just as likely to contain bugs

Numerous studies have found that test cases are as likely to have errors as the code they’re testing (see Code Complete 2, p. 522).

So who tests the tests? Only review of the tests can find deficiencies in the tests themselves.

Developer testing isn’t very effective at finding defects

To cap it all off, developer testing isn’t all that effective at finding defects.

Defect-Detection Rates of Selected Techniques (Code Complete 2, p. 470)

Removal Step                 Lowest Rate   Modal Rate   Highest Rate
Informal design reviews      25%           35%          40%
Formal design inspections    45%           55%          65%
Informal code reviews        20%           25%          35%
Modeling or prototyping      35%           65%          80%
Formal code inspections      45%           60%          70%
Unit test                    15%           30%          50%
System test                  25%           40%          55%

Don’t put all your eggs in one basket

The most interesting thing about these defect detection techniques is that they tend to find different errors. Unit testing finds certain errors; manual testing others; usability testing and code reviews still others.
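As a back-of-the-envelope illustration of why combining techniques pays off, the sketch below combines the modal rates from the table above. It assumes each technique misses defects independently of the others. That is a simplifying assumption of this sketch, not a claim from Code Complete, so treat the result as a rough indication only.

```ruby
# Modal detection rates taken from the table above (Code Complete 2, p. 470).
MODAL_RATES = {
  "Unit test" => 0.30,
  "Formal code inspections" => 0.60,
  "System test" => 0.40
}.freeze

# Assumption of this sketch: techniques miss defects independently, so the
# chance a defect slips past all of them is the product of the miss rates.
def combined_detection_rate(rates)
  1.0 - rates.map { |rate| 1.0 - rate }.reduce(1.0, :*)
end

unit_only = combined_detection_rate([MODAL_RATES["Unit test"]])
all_three = combined_detection_rate(MODAL_RATES.values)

puts "Unit tests alone: #{(unit_only * 100).round}%"
puts "All three steps:  #{(all_three * 100).round}%"
```

Under that assumption, unit testing alone catches about 30% of defects, while the three steps together catch about 83%.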

Manual testing

As mentioned above, programmers tend to test the “clean” path through their code. A human tester can quickly make mincemeat of the developer’s fairy world.

Good QA testers are worth their weight in gold. I once worked with a guy who was incredibly skilled at finding the most obscure bugs. He could describe exactly how to replicate the problem, and he would dig into the log files to produce a better error report and to narrow down the location of the defect.

Joel Spolsky wrote a great article on the Top Five (Wrong) Reasons You Don’t Have Testers—and why you shouldn’t put developers on this task. We’re just not that good at it.

Code reviews

Code reviews and formal code inspections are incredibly effective at finding defects (studies show they are more effective at finding defects than developer testing, and cheaper too), and the peer pressure of knowing your code will be scrutinized helps ensure higher quality right off the bat.

I still remember my first code review. I was doing the ArsDigita Boot Camp, a 2-week course on building web applications. At the end of the first week, we had to walk through our code in front of the group and face questions from the instructor. It was incredibly nerve-wracking! But I worked hard to make the code as good as I could.

This stresses the importance of what Robert L. Glass calls the “sociological aspects” of peer review. Reviewing code is a delicate activity. Remember to review the code…not the author.

Usability tests

Another huge problem with developer tests is that they won’t tell you if your software sucks. You can have 100% test coverage and no known defects and your software can still be an unusable mess.

Jeff Atwood calls this the ultimate unit test failure:

I often get frustrated with the depth of our obsession over things like code coverage. Unit testing and code coverage are good things. But perfectly executed code coverage doesn’t mean users will use your program. Or that it’s even worth using in the first place. When users can’t figure out how to use your app, when users pass over your app in favor of something easier or simpler to use, that’s the ultimate unit test failure. That’s the problem you should be trying to solve.

Fortunately, usability tests are easy and cheap to run. Don’t Make Me Think is your Bible here (the chapters about usability testing are available online). For Tumblon, we’ve been conducting usability tests with screen recording software that costs $20. The problems we’ve found through usability tests have been amazing. Watching a user struggle punctures your ego, while at the same time giving you the motivation to fix the problems.

Why testing works

Unit testing forces us to think about our code. Michael Feathers gets at this in his post The Flawed Theory Behind Unit Testing:

One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense….It’s a nice theory, but it’s wrong….

In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.

All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.

That’s the magic, and it’s why unit testing works also. When you write unit tests, TDD-style or after your development, you scrutinize, you think, and often you prevent problems without even encountering a test failure.

So: adopt practices that make you think about your code, and supplement them with other defect detection techniques.

Testing testing testing

Why do we developers read, hear, and write so much about (developer) testing?

I think it’s because it’s something that we can control. Most programmers can’t hire a QA person or conduct even a $50 usability test. And perhaps most places don’t have a culture of code reviews. But they can write tests. Unit tests! Specs! Mocks! Stubs! Integration tests! Fuzz tests!

But the truth is, no single technique is effective at detecting all defects. We need manual testing, peer reviews, usability testing and developer testing (and that’s just the start) if we want to produce high-quality software.

Resources

MapReduce at RailsConf Europe

Posted by Jon
on Thursday, July 03

This September, I’ll be presenting at RailsConf Europe on EC2, MapReduce, and Distributed Processing. The talk will explain the MapReduce approach to distributed processing, will show a few example implementations, and will discuss MapReduce vs. other distributed processing techniques.
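For readers new to the idea, here is a minimal single-process sketch of the MapReduce shape, using the classic word-count example. The method names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this illustration, and nothing here is distributed: a real system would run the map and reduce steps on many machines, and this sketch only shows how the data flows.

```ruby
# Map step: turn one document into a list of [word, 1] pairs.
def map_words(document)
  document.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle step: group the emitted pairs by key, so each word ends up
# with the list of all values emitted for it.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |group| group.map(&:last) }
end

# Reduce step: collapse all the values for one key into a single result.
def reduce_counts(word, counts)
  [word, counts.sum]
end

# Driver: in a real cluster each map_words call could run on a different
# machine; here everything runs in order in one process.
def word_count(documents)
  pairs = documents.flat_map { |document| map_words(document) }
  shuffle(pairs).map { |word, counts| reduce_counts(word, counts) }.to_h
end

documents = ["the quick brown fox", "the lazy dog", "The fox"]
p word_count(documents)
```

Because each `map_words` call looks at only one document and each `reduce_counts` call looks at only one word, neither step needs to know about the others. That independence is what makes the work easy to spread across machines.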

Whether you’ll be there or not, if you’re interested in learning more about MapReduce, here are some resources. I’ll write a few more posts on the subject before the conference, so watch this space as well.

Cluster Computing and MapReduce is a great series of video lectures given to Google interns in 2007. The first two are the most appropriate: the first introduces distributed processing concepts, while the second covers MapReduce itself.

MapReduce: Simplified Data Processing on Large Clusters is the paper by Jeffrey Dean and Sanjay Ghemawat of Google that got things going in the first place.

MapReduce for Ruby: Ridiculously Easy Distributed Programming discusses MapReduce and introduces Starfish, a Ruby library for distributed processing. Starfish is not a MapReduce implementation, however – it takes a somewhat different approach to distributed processing.

Skynet (a few writeups: InfoQ, Dion Almaer) is another Ruby-based distributed processing system inspired by MapReduce.

Writing Ruby Map-Reduce programs for Hadoop discusses using Ruby to wrap Hadoop, a MapReduce-like system built in Java.

Introduction to Parallel Programming and MapReduce at Google Code University, a good overview of distributed processing and the MapReduce approach.

And finally, one article that you should avoid:

MapReduce: A major step backwards compares MapReduce to relational databases, and says that MapReduce loses out because it doesn’t support database indices, database views, Crystal Reports, etc. Basically, the complaint is that MapReduce isn’t SQL compliant. WTF? Clearly, the author(s) didn’t understand what MapReduce is. The problem, as explained elsewhere, is that the authors thought that MapReduce == CouchDB/SimpleDB. Which is obviously not true. %s/MapReduce/SimpleDB the original article and it makes some sense. But long story short, this article will teach you nothing about MapReduce, and will likely confuse you further. So stay away.