Testing is overrated

Posted by Luke Francl
on Friday, July 11

Next week at RubyFringe, I’ll be taking on one of the programming world’s favorite topics: testing.

Hear me out. Like everyone who’s had their bacon saved by a unit test, I think testing is great. In a dynamic language like Ruby, tests are especially important to give us the confidence our code works. And once written, unit tests provide a regression framework that helps catch future errors.

However, testing is over-emphasized. If our goal is high-quality software, developer testing is not enough.

This is important because of what Steve McConnell calls The General Principle of Software Quality. Most development time is spent debugging. “Therefore, the most obvious method of shortening a development schedule is to improve the quality of the product.” (Code Complete 2, p. 474.)

Problems with developer testing

Developer testing has some limitations. Here are a few that I’ve noticed.

Testing is hard...and most developers aren’t very good at it!

Programmers tend to write “clean” tests that verify the code works, not “dirty” tests that test error conditions. Steve McConnell reports, “Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it’s done by creating 25 times as many dirty tests.” (Code Complete 2, p. 504)
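To make the clean/dirty distinction concrete, here is a minimal sketch. The parse_age function and its rules are hypothetical, invented only for illustration; the point is the ratio of one clean check to five dirty ones.

```ruby
# Hypothetical function under test: parses an age typed into a form field.
def parse_age(input)
  raise ArgumentError, "age is required" if input.nil? || input.strip.empty?
  raise ArgumentError, "age must be a whole number" unless input.strip.match?(/\A\d+\z/)

  age = input.strip.to_i
  raise ArgumentError, "age out of range" if age > 150

  age
end

# Helper: true when the block raises ArgumentError, false otherwise.
def rejects?
  yield
  false
rescue ArgumentError
  true
end

# One "clean" test: valid input produces the expected value.
clean_results = [parse_age("42") == 42]

# Five "dirty" tests: each feeds the function something it should refuse.
dirty_inputs = [nil, "", "forty-two", "-5", "999"]
dirty_results = dirty_inputs.map { |bad| rejects? { parse_age(bad) } }

puts "clean passed: #{clean_results.all?}, dirty passed: #{dirty_results.all?}"
```

The clean test is the one most of us write first. The dirty list is where the missing-input and out-of-range bugs tend to hide.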

You can’t test code that isn’t there

Robert L. Glass discusses this several times in his book Facts and Fallacies of Software Engineering. Missing requirements are the hardest errors to correct, because often only the customer can detect them. Unit tests with total code coverage (and even code inspections) can easily fail to detect missing code. Therefore, these errors can slip into production (or your iteration release).

Tests alone won’t solve this problem, but I have found that writing tests is often a good way to suss out missing requirements.

Tests are just as likely to contain bugs

Numerous studies have found that test cases are as likely to have errors as the code they’re testing (see Code Complete 2, p. 522).

So who tests the tests? Only review of the tests can find deficiencies in the tests themselves.

Developer testing isn’t very effective at finding defects

To cap it all off, developer testing isn’t all that effective at finding defects.

Defect-Detection Rates of Selected Techniques (Code Complete 2, p. 470)
Removal Step                 Lowest Rate   Modal Rate   Highest Rate
Informal design reviews      25%           35%          40%
Formal design inspections    45%           55%          65%
Informal code reviews        20%           25%          35%
Modeling or prototyping      35%           65%          80%
Formal code inspections      45%           60%          70%
Unit test                    15%           30%          50%
System test                  25%           40%          55%

Don’t put all your eggs in one basket

The most interesting thing about these defect detection techniques is that they tend to find different errors. Unit testing finds certain errors; manual testing others; usability testing and code reviews still others.
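As a rough back-of-the-envelope illustration, here is what stacking techniques could buy you. This sketch takes three modal rates from the table and assumes, purely for the sake of arithmetic, that each technique catches defects independently of the others. That independence is my assumption, not a claim from Code Complete, and real techniques overlap in ways this ignores.

```ruby
# Modal detection rates copied from the defect-detection table.
MODAL_RATES = {
  "Informal code reviews" => 0.25,
  "Unit test"             => 0.30,
  "System test"           => 0.40
}.freeze

# ASSUMPTION: techniques are statistically independent, so the chance a
# defect escapes all of them is the product of the individual miss rates.
def combined_detection_rate(rates)
  missed_by_all = rates.reduce(1.0) { |missed, rate| missed * (1.0 - rate) }
  1.0 - missed_by_all
end

combined = combined_detection_rate(MODAL_RATES.values)
puts format("combined detection rate: %.1f%%", combined * 100)
```

Under that assumption the three together catch roughly two thirds of defects, which is well above any one of them alone.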

Manual testing

As mentioned above, programmers tend to test the “clean” path through their code. A human tester can quickly make mincemeat of the developer’s fairy world.

Good QA testers are worth their weight in gold. I once worked with a guy who was incredibly skilled at finding the most obscure bugs. He could describe exactly how to replicate the problem, and he would dig into the log files for a better error report, and to get an indication of the location of the defect.

Joel Spolsky wrote a great article on the Top Five (Wrong) Reasons You Don’t Have Testers—and why you shouldn’t put developers on this task. We’re just not that good at it.

Code reviews

Code reviews and formal code inspections are incredibly effective at finding defects (studies show they are more effective at finding defects than developer testing, and cheaper too), and the peer pressure of knowing your code will be scrutinized helps ensure higher quality right off the bat.

I still remember my first code review. I was doing the ArsDigita Boot Camp, which was a 2-week course on building web applications. At the end of the first week, we had to walk through our code in front of the group and face questions from the instructor. It was incredibly nerve-wracking! But I worked hard to make the code as good as I could.

This stresses the importance of what Robert L. Glass calls the “sociological aspects” of peer review. Reviewing code is a delicate activity. Remember to review the code…not the author.

Usability tests

Another huge problem with developer tests is that they won’t tell you if your software sucks. You can have 100% test coverage and no known defects and your software can still be an unusable mess.

Jeff Atwood calls this the ultimate unit test failure:

I often get frustrated with the depth of our obsession over things like code coverage. Unit testing and code coverage are good things. But perfectly executed code coverage doesn’t mean users will use your program. Or that it’s even worth using in the first place. When users can’t figure out how to use your app, when users pass over your app in favor of something easier or simpler to use, that’s the ultimate unit test failure. That’s the problem you should be trying to solve.

Fortunately, usability tests are easy and cheap to run. Don’t Make Me Think is your Bible here (the chapters about usability testing are available online). For Tumblon, we’ve been conducting usability tests with screen recording software that costs $20. The problems we’ve found with usability tests have been amazing. It punctures your ego, while at the same time giving you the motivation to fix the problems.

Why testing works

Unit testing forces us to think about our code. Michael Feathers gets at this in his post The Flawed Theory Behind Unit Testing:

One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense….It’s a nice theory, but it’s wrong….

In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.

All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.

That’s the magic, and it’s why unit testing works also. When you write unit tests, TDD-style or after your development, you scrutinize, you think, and often you prevent problems without even encountering a test failure.

So: adopt practices that make you think about your code, and supplement them with other defect detection techniques.

Testing testing testing

Why do we developers read, hear, and write so much about (developer) testing?

I think it’s because it’s something that we can control. Most programmers can’t hire a QA person or conduct even a $50 usability test. And perhaps most places don’t have a culture of code reviews. But they can write tests. Unit tests! Specs! Mocks! Stubs! Integration tests! Fuzz tests!

But the truth is, no single technique is effective at detecting all defects. We need manual testing, peer reviews, usability testing and developer testing (and that’s just the start) if we want to produce high-quality software.


Measuring your test coverage with Heckle and RCov

Posted by Jon
on Thursday, November 29

I gave a presentation at RUM on Monday about code metrics. In particular, I showed tools for measuring two aspects of code: test coverage and complexity. Here are my slides.

Saikuro and Flog measure code complexity. Saikuro measures cyclomatic complexity, the number of independent paths through a method. Flog, on the other hand, parses your code and assigns a complexity value to assignments, branches, and calls. The goal, of course, is to minimize code complexity. This is an important goal, but I’m not sure yet what I think of these measurement tools. I haven’t used them enough to know if they have practical value.
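To give a feel for what a cyclomatic complexity number means, here is a rough sketch that counts branch points in a snippet using Ruby's standard Ripper lexer. It is not how Saikuro or Flog work internally. The keyword list and the "branches plus one" rule are simplifying assumptions of mine, and the shipping_cost sample is hypothetical.

```ruby
require "ripper"

# ASSUMPTION: these tokens each add one independent path. Real tools
# use more careful rules (ternaries, case/else, blocks, and so on).
BRANCH_KEYWORDS  = %w[if elsif unless while until when rescue].freeze
BRANCH_OPERATORS = %w[&& ||].freeze

# Approximates cyclomatic complexity as (number of branch tokens) + 1.
def approximate_cyclomatic_complexity(source)
  branches = Ripper.lex(source).count do |_position, type, text|
    (type == :on_kw && BRANCH_KEYWORDS.include?(text)) ||
      (type == :on_op && BRANCH_OPERATORS.include?(text))
  end
  branches + 1
end

# Hypothetical method to measure.
SAMPLE = <<~RUBY
  def shipping_cost(order)
    return 0 if order.total > 100
    if order.express? && order.domestic?
      15
    elsif order.express?
      40
    else
      5
    end
  end
RUBY

puts "approximate complexity: #{approximate_cyclomatic_complexity(SAMPLE)}"
```

The sample has four branch tokens (two if keywords, one &&, one elsif), so the count comes out one higher than that.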

Heckle and RCov, on the other hand, are useful. I’m going to look at each in more detail here.

RCov

RCov measures C0 code coverage. That is, it runs your test suite, and looks at what lines of your application were run or not run. It then gives you a nice HTML report with red and green lines – red for lines of code that are not run, and green for lines that are run.

If your test suite doesn’t execute a line of your application code, it is safe to say that that line is not tested. On the other hand, if a line of your application is run, it is NOT safe to say that it IS tested. A test method with no asserts works just fine for RCov’s purposes, thank you very much. Take a look at this code.

def test_user_assignment
  User.assign
end

This test is enough to mark the User.assign method as tested. But nothing is asserted, and so nothing is tested. The problem is equally true even if you aren’t in the habit of writing tests without assertions; you may make assertions about some aspects of a method, but forget about other aspects. And RCov won’t tell you this.
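Here is a small runnable sketch of that gap. The User class below is a hypothetical stand-in with a deliberate bug, and the two test methods return true or false rather than using a test framework, so the snippet stays self-contained.

```ruby
# Hypothetical stand-in for the User class, with a deliberate bug.
class User
  attr_accessor :role

  def self.assign
    user = new
    user.role = :guest # BUG: the intended role is :member
    user
  end
end

# Mirrors the assertion-free test: it executes User.assign, so a line
# coverage tool would mark the method as covered, but it checks nothing.
def test_user_assignment_without_assert
  User.assign
  true # getting here without an exception counts as a pass
end

# Same call, but this time the result is actually checked.
def test_user_assignment_with_assert
  User.assign.role == :member
end

puts "without assert passes: #{test_user_assignment_without_assert}"
puts "with assert passes:    #{test_user_assignment_with_assert}"
```

Both tests execute exactly the same lines of User.assign, so they look identical in a coverage report. Only the second one notices the bug.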

Logically speaking, RCov tells you that if line_is_red, then !line_is_tested. From this, you can also infer the contrapositive: if line_is_tested, then !line_is_red. But that’s all you know. If a line is green, RCov tells you nothing at all. Saying if !line_is_red, then line_is_tested is a formal fallacy (denying the antecedent). And that’s bad.

So 100% RCov coverage is not equal to 100% test coverage. In fact, the two have nothing to do with each other. Your code could have 100% or 95% or 75% RCov coverage, and be extremely poorly tested.

In my experience, RCov is a one-time tool. That’s because green lines in RCov don’t tell you anything at all about your test coverage. Red lines provide the real value. If you run RCov, find an untested method, and write up a quick test hack that provides C0 coverage, RCov will never complain about that method again. It will be off your RCov radar. This is too bad, because it is really useful to know what is poorly tested. So whenever you see red in RCov, take the time to write comprehensive tests to cover the untested code.

Heckle

Heckle is a mutation tester that changes your code and checks to see whether your tests catch the changes. If Heckle is able to change instances of true to false (or 32 to nil, or remove method calls) in your application without creating a test failure, then your code isn’t tested well enough. To run it effectively, do this:

heckle Class method -t /test/units/class_test.rb -T 30

heckle is the tool, installed as a Ruby gem. Class is the name of the Ruby class you want to heckle. method is a method on the class; you can leave this out, but I don’t recommend it. -t /test/units/class_test.rb is the path to the unit test you want to use (also optional). Finally, -T 30 specifies a timeout for the test, in case your mutation creates an infinite loop.

You can leave out the last three options and just run Heckle with a class:

heckle Class

But I don’t recommend it.

First, it will take forever.

Second, you may run into infinite loops.

Third, heckle will unfortunately test EVERY method available to a class, including methods included by modules, superclasses, etc. So if you’re heckling an ActiveRecord class, you’re going to see dozens of Rails magic methods, not just the methods that you wrote.

Fourth, your UserTest should cover your User class on its own, if your code is well written and well tested; it shouldn’t rely on the ProductTest class (or another test). One problem with Heckle is that it doesn’t distinguish between well tested code and highly coupled code, where a small change somewhere causes the application to fall apart somewhere else. This problem can be minimized by only comparing a single method to a single test class.

I like Heckle and find it pretty useful. Unfortunately, it needs a little developer love. The -T timeout parameter is flaky; it doesn’t always play nice with its dependencies (especially ParseTree 2.0.x, the current version); and it would be more useful if by default it only heckled the methods directly added by a class, not methods brought in through parent classes, includes, or fancy metaprogramming. This is a shame, because it is really a great tool. Hopefully Kevin Clark and Ryan Davis have an update in the works.
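If mutation testing is new to you, the idea can be sketched by hand in a few lines. This is not Heckle's implementation, just an illustration of the principle using one of the mutations mentioned above (true changed to false). The free_shipping rule and both tests are hypothetical.

```ruby
# Original implementation (hypothetical business rule).
ORIGINAL = lambda do |total|
  return true if total >= 50
  false
end

# Hand-made mutant: the literal `true` has been flipped to `false`.
MUTANT = lambda do |total|
  return false if total >= 50
  false
end

# Weak test: only exercises the "no free shipping" case.
WEAK_TEST = ->(free_shipping) { free_shipping.call(10) == false }

# Stronger test: also checks an order that should ship free.
STRONG_TEST = lambda do |free_shipping|
  free_shipping.call(10) == false && free_shipping.call(50) == true
end

# A mutant "survives" when the test still passes against mutated code.
def mutant_survives?(test, mutant)
  test.call(mutant)
end

puts "weak test lets mutant survive:   #{mutant_survives?(WEAK_TEST, MUTANT)}"
puts "strong test lets mutant survive: #{mutant_survives?(STRONG_TEST, MUTANT)}"
```

Both tests pass against the original code, so a coverage report would treat them the same. Only the surviving mutant exposes the weak one.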

Testing ActiveRecord Transactions

Posted by Luke Francl
on Wednesday, March 28

ActiveRecord allows you to start transactions that will be rolled back in the event of an error.

A good example is importing records from a CSV file. If you want the entire import to roll back if any of the rows fail to import, you could write your code like this:

def import_csv
  csv_file = params[:csv_file]
  
  begin
    Record.transaction do 
      fastercsv = FasterCSV.new( csv_file )
      while row = fastercsv.readline
        foo, bar = row
        Record.create!( :foo => foo, :bar => bar )
      end
    end
    redirect_to success_action_path
  rescue 
    # do something with the error
    flash[:error] = "CSV import failed"
    redirect_to import_path
  end
end

The Record.create! call will raise an ActiveRecord::RecordInvalid error if one of the rows can’t be saved. Then the rescue block catches the error and reports it to the user instead of showing them an ugly 500 error (or, worse, a corrupted import).
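The all-or-nothing behavior is easy to see in a toy model. The sketch below is plain Ruby with no database and is not ActiveRecord: ToyTable, InvalidRow, and import_rows are hypothetical names, and the "transaction" is just a snapshot that gets restored when the block raises.

```ruby
# Toy stand-in for a table with all-or-nothing writes. NOT ActiveRecord.
class ToyTable
  class InvalidRow < StandardError; end

  attr_reader :rows

  def initialize
    @rows = []
  end

  # Snapshot the rows, run the block, and restore the snapshot if the
  # block raises. The error is re-raised so the caller can report it.
  def transaction
    snapshot = @rows.dup
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end

  # Like a bang method: raises instead of returning false on bad input.
  def create!(foo, bar)
    raise InvalidRow, "foo is required" if foo.nil? || foo.empty?

    @rows << { foo: foo, bar: bar }
  end
end

# Mirrors the shape of import_csv: wrap every row in one transaction and
# turn a failure into a status instead of letting it escape.
def import_rows(table, csv_rows)
  table.transaction do
    csv_rows.each { |foo, bar| table.create!(foo, bar) }
  end
  :ok
rescue ToyTable::InvalidRow
  :failed
end

table = ToyTable.new
status = import_rows(table, [["a", "1"], ["", "2"], ["c", "3"]])
puts "status: #{status}, rows kept: #{table.rows.size}"
```

The first row is written before the second one fails, but the restore puts the table back to empty, which is the behavior the import relies on.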

However, this doesn’t play nicely with your tests.

You’d like to do something like this:

def test_import_csv_failure
  assert_no_difference Record, :count do
    post :import_csv, :csv_file => fixture_file_upload('files/invalid.csv')
  end
end

But this won’t work, because running the test starts a transaction, and ActiveRecord doesn’t support nested transactions. There’s been a patch open on this problem for 9 months, but no action has been taken.
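Here is a toy model of why the count changes inside a test. It assumes that a nested transaction call simply joins the one already open, so only the outermost block can roll anything back. That is my reading of the behavior, not something verified against the ActiveRecord source, and FlatteningTable and import_in_test are hypothetical names.

```ruby
# Toy table whose nested transactions join the outer one. NOT ActiveRecord.
class FlatteningTable
  class InvalidRow < StandardError; end

  attr_reader :rows

  def initialize
    @rows = []
    @depth = 0
  end

  # ASSUMPTION: only the outermost transaction takes a snapshot and
  # restores it on error; nested calls just run their block.
  def transaction
    outermost = @depth.zero?
    snapshot = @rows.dup if outermost
    @depth += 1
    yield
  rescue StandardError
    @rows = snapshot if outermost
    raise
  ensure
    @depth -= 1
  end

  def create!(foo)
    raise InvalidRow, "foo is required" if foo.nil? || foo.empty?

    @rows << foo
  end
end

# Same shape as the controller action: transaction inside, rescue outside.
def import_in_test(table, csv_rows)
  table.transaction { csv_rows.each { |foo| table.create!(foo) } }
  :ok
rescue FlatteningTable::InvalidRow
  :failed
end

table = FlatteningTable.new
leaked = nil

# The outer block plays the role of the transaction wrapped around a test.
table.transaction do
  before = table.rows.size
  import_in_test(table, ["a", ""])
  leaked = table.rows.size - before
end

puts "rows leaked by the failed import: #{leaked}"
```

Inside the outer block the failed import leaves its first row behind, because the inner call never rolls back and the rescue swallows the error before the outer block sees it. Run the same import without the outer block and nothing leaks.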

I was able to work around the problem by turning off transactional fixtures for the entire test case class.

class MyTest < Test::Unit::TestCase
  self.use_transactional_fixtures = false

  def test_import_csv_failure
    assert_no_difference Record, :count do
      post :import_csv, :csv_file => fixture_file_upload('files/invalid.csv')
    end
  end
end

This makes the test run slower, but now it passes. If you’re feeling adventurous, you can install the ActiveRecord nested transactions plugin.

Lots of people have hit this problem. Jerry Kuch blogged about it in January 2006 and ticket 5457 was filed back in June. But hopefully this post will help someone else figure out the problem.