Testing is overrated

Posted by Luke Francl
on Friday, July 11

Next week at RubyFringe, I’ll be taking on one of the programming world’s favorite topics: testing.

Hear me out. Like everyone who’s had their bacon saved by a unit test, I think testing is great. In a dynamic language like Ruby, tests are especially important to give us the confidence our code works. And once written, unit tests provide a regression framework that helps catch future errors.

However, testing is over-emphasized. If our goal is high-quality software, developer testing is not enough.

This is important because of what Steve McConnell calls the General Principle of Software Quality: improving quality reduces development costs. Most development time is spent debugging, so “the most obvious method of shortening a development schedule is to improve the quality of the product.” (Code Complete 2, p. 474)

Problems with developer testing

Developer testing has some limitations. Here are a few that I’ve noticed.

Testing is hard...and most developers aren’t very good at it!

Programmers tend to write “clean” tests that verify the code works, not “dirty” tests that probe error conditions. Steve McConnell reports, “Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it’s done by creating 25 times as many dirty tests.” (Code Complete 2, p. 504)
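
To make the distinction concrete, here’s a sketch in Ruby’s Test::Unit (PriceParser is hypothetical, defined inline so the example runs): the “clean” test exercises the happy path, while the “dirty” tests feed the code hostile input.

    require 'test/unit'

    # A hypothetical parser, just enough to make the tests runnable.
    module PriceParser
      # Parses "$19.99" into cents (1999); raises on anything else.
      def self.parse(str)
        unless str =~ /\A\$(\d+)\.(\d\d)\z/
          raise ArgumentError, "not a price: #{str.inspect}"
        end
        $1.to_i * 100 + $2.to_i
      end
    end

    class PriceParserTest < Test::Unit::TestCase
      # The "clean" test: the happy path most of us write first.
      def test_parses_a_valid_price
        assert_equal 1999, PriceParser.parse("$19.99")
      end

      # "Dirty" tests: error conditions and bad input. McConnell's mature
      # organizations write five of these for every clean test.
      def test_rejects_nil
        assert_raise(ArgumentError) { PriceParser.parse(nil) }
      end

      def test_rejects_garbage
        assert_raise(ArgumentError) { PriceParser.parse("banana") }
      end

      def test_rejects_negative_prices
        assert_raise(ArgumentError) { PriceParser.parse("$-5.00") }
      end
    end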

You can’t test code that isn’t there

Robert L. Glass discusses this several times in his book Facts and Fallacies of Software Engineering. Missing requirements are the hardest errors to correct, because oftentimes only the customer can detect them. Unit tests with total code coverage (and even code inspections) can easily fail to detect missing code, so these errors can slip into production (or your iteration release).

Tests alone won’t solve this problem, but I have found that writing tests is often a good way to suss out missing requirements.

Tests are just as likely to contain bugs

Numerous studies have found that test cases are as likely to have errors as the code they’re testing (see Code Complete 2, p. 522).

So who tests the tests? Only review can find deficiencies in the tests themselves.
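
For illustration, here’s a hypothetical Ruby test containing exactly this kind of bug: an assignment (=) where a comparison (==) was intended. The assertion always sees a truthy value, so the test passes no matter what the code does.

    require 'test/unit'

    # Hypothetical code under test.
    def discount(total)
      total > 100 ? total * 0.9 : total
    end

    class DiscountTest < Test::Unit::TestCase
      def test_applies_discount_over_100
        result = discount(200)
        # Bug: this assigns 180.0 to result instead of comparing.
        # The assertion always receives a truthy value, so the test
        # can never fail -- even if discount() is completely broken.
        assert(result = 180.0)
      end
    end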

Developer testing isn’t very effective at finding defects

To cap it all off, developer testing isn’t all that effective at finding defects.

Defect-Detection Rates of Selected Techniques (Code Complete 2, p. 470)

    Removal Step                  Lowest Rate   Modal Rate   Highest Rate
    Informal design reviews           25%           35%          40%
    Formal design inspections         45%           55%          65%
    Informal code reviews             20%           25%          35%
    Modeling or prototyping           35%           65%          80%
    Formal code inspections           45%           60%          70%
    Unit test                         15%           30%          50%
    System test                       25%           40%          55%

Don’t put all your eggs in one basket

The most interesting thing about these defect detection techniques is that they tend to find different errors. Unit testing finds certain errors; manual testing others; usability testing and code reviews still others.

Manual testing

As mentioned above, programmers tend to test the “clean” path through their code. A human tester can quickly make mincemeat of the developer’s fairy-tale world.

Good QA testers are worth their weight in gold. I once worked with a guy who was incredibly skilled at finding the most obscure bugs. He could describe exactly how to reproduce a problem, and he would dig into the log files to write a better error report and to pinpoint the likely location of the defect.

Joel Spolsky wrote a great article on the Top Five (Wrong) Reasons You Don’t Have Testers—and why you shouldn’t put developers on this task. We’re just not that good at it.

Code reviews

Code reviews and formal code inspections are incredibly effective at finding defects; studies show they find more defects than developer testing, and at lower cost. And the peer pressure of knowing your code will be scrutinized helps ensure higher quality right off the bat.

I still remember my first code review. I was doing the ArsDigita Boot Camp, a two-week course on building web applications. At the end of the first week, we had to walk through our code in front of the group and face questions from the instructor. It was incredibly nerve-wracking! But I worked hard to make the code as good as I could.

This underscores the importance of what Robert L. Glass calls the “sociological aspects” of peer review. Reviewing code is a delicate activity. Remember to review the code…not the author.

Usability tests

Another huge problem with developer tests is that they won’t tell you if your software sucks. You can have 1500% test coverage and no known defects and your software can still be an unusable mess.

Jeff Atwood calls this the ultimate unit test failure:

I often get frustrated with the depth of our obsession over things like code coverage. Unit testing and code coverage are good things. But perfectly executed code coverage doesn’t mean users will use your program. Or that it’s even worth using in the first place. When users can’t figure out how to use your app, when users pass over your app in favor of something easier or simpler to use, that’s the ultimate unit test failure. That’s the problem you should be trying to solve.

Fortunately, usability tests are easy and cheap to run. Don’t Make Me Think is your Bible here (the chapters about usability testing are available online). For Tumblon, we’ve been conducting usability tests with screen recording software that costs $20. The problems we’ve found have been eye-opening. Usability testing punctures your ego, while at the same time giving you the motivation to fix the problems.

Why testing works

Unit testing forces us to think about our code. Michael Feathers gets at this in his post The Flawed Theory Behind Unit Testing:

One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense….It’s a nice theory, but it’s wrong….

In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.

All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.

That’s the magic, and it’s why unit testing works too. When you write unit tests, whether TDD-style or after development, you scrutinize, you think, and often you prevent problems without ever encountering a test failure.
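
The red-green rhythm of TDD bakes that reflection in, and as a bonus it proves each test can actually fail: you watch it go red before you make it pass. A minimal sketch (the Slug module is hypothetical):

    require 'test/unit'

    # Step 1 (red): write the test first and watch it fail.
    class SlugTest < Test::Unit::TestCase
      def test_downcases_and_hyphenates
        assert_equal "hello-world", Slug.generate("Hello World")
      end
    end

    # Step 2 (green): the simplest code that makes the test pass.
    # (Defined here so the file runs standalone; in TDD you'd add
    # this only after seeing the red bar.)
    module Slug
      def self.generate(title)
        title.strip.downcase.gsub(/\s+/, "-")
      end
    end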

So: adopt practices that make you think about your code, and supplement them with other defect detection techniques.

Testing testing testing

Why do we developers read, hear, and write so much about (developer) testing?

I think it’s because it’s something that we can control. Most programmers can’t hire a QA person or conduct even a $50 usability test. And perhaps most places don’t have a culture of code reviews. But they can write tests. Unit tests! Specs! Mocks! Stubs! Integration tests! Fuzz tests!

But the truth is, no single technique is effective at detecting all defects. We need manual testing, peer reviews, usability testing and developer testing (and that’s just the start) if we want to produce high-quality software.

Comments

  1. Jon Dahl, July 11, 2008 @ 02:35 PM

    I really like the quadfecta of testing you propose: unit (developer) testing, human QA testing, usability testing, and peer testing in the form of code review. I’d be happy with software that had high test coverage using all four of these techniques. :)

  2. phillc, July 11, 2008 @ 02:40 PM

    Can you point to any reference that backs up your statement that code reviews are both more effective and cheaper? I’d be really interested!

  3. Luke Francl, July 11, 2008 @ 02:58 PM

    phillc—

    Sure.

    Most studies have found that inspections are cheaper than testing. A study at the Software Engineering Laboratory found that code reading detected about 80 percent more faults per hour than testing (Basili and Selby 1987). Another organization found that it cost six times as much to detect design defects by using testing as by using inspections (Ackerman, Buchwald, and Lewski 1989). A later study at IBM found that only 3.5 staff hours were needed to find each error when using code inspections, whereas 15-25 hours were needed to find each error through testing (Kaplan 1995).

    —Code Complete 2, p. 472

  4. Michael Janke, July 11, 2008 @ 03:52 PM

    At what point in the test cycle do you perform some kind of analysis or test of application and database performance (unit performance) and scalability (load test)?

    As the person who hosts the apps, I regularly see applications written and deployed without the slightest idea how the application will use CPU, Memory and I/O, or how the application will perform under load.

    —Mike

  5. DrMark, July 11, 2008 @ 04:05 PM

    Kudos!

    I completely agree with you. Unit tests by themselves will not help you make a perfect app, nor will Selenium or any of the other testing tools or methods. Only by combining multiple different approaches (including user testing) can you hope to produce a quality app.

    I am glad to see someone admit that Unit tests are important but by no means the end of your testing obligation. Hopefully your Venn diagram will stick in people’s heads :)

  6. Tim, July 11, 2008 @ 06:09 PM

    Excellent article! Thanks very much!

  7. Anonymous, July 11, 2008 @ 06:29 PM

    FYI, The “Code Complete 2” link is broken. There’s an extraneous : at the beginning.

    Great article, BTW.

  8. Anonymous, July 11, 2008 @ 06:32 PM

    huh, seems Mephisto mangled my comment. I was trying to say that the link to Code Complete 2 is broken.

  9. Luke Francl, July 11, 2008 @ 06:34 PM

    Thanks, I fixed the link (and your first comment). Textile was playing games.

  10. akshay, July 11, 2008 @ 08:16 PM

    Great article. I am a full-time integration tester and used Rails for my graduate project. I was quick to point out to my advisor that, coming from a testing background, I know Rails fails miserably in testing. (Can you believe Rails has a lot of bugs on the testing side that no one even cares about?!?) Just curious though… have you tried Django? What is your take on it… how do THEY do their testing? Some friends are urging me to try out Django.

  11. vicaya, July 11, 2008 @ 11:09 PM

    Sensational title is overrated.

    It’s a good article arguing for more complete bug finding approaches, but the title is just sensationalism.

  12. Sho, July 12, 2008 @ 02:21 AM

    Hm, I agree with the above commenter re: the sensationalist headline. Testing is hardly “overrated”; your article simply makes recommendations for a balanced approach, which I agree with.

    I was kind of hoping to see some spearing of sacred cows – I have seen way too much testing put into very implementation-specific operation of very unlikely-to-ever-break methods, etc. Some developers, having heard that “testing is good!”, seem to think that the more time they put into writing highly specific variations of any number of “clean” scenarios, the better. To me, they’re just wasting time.

    I would like to see an article ruthlessly arguing that testing should generally be higher-level “sanity testing” rather than checking again and again whether Ruby’s plus operator still works. For me, testing is not so much about finding errors, although that is obviously useful – the main value is in forcing the developer to think more about what they write at the time of writing, and in providing a kind of insurance against future breakages of long-forgotten functionality.

    Finding every single bug and functional oversight in your program is an impossible task with exponentially increasing effort and proportionally decreasing returns. 5 dirty tests for every clean one? That is an awful lot of overhead and kind of flies in the face of any rapid development advantages you might hope to get from using a dynamic language in the first place. If you’re really writing 20+ times as many test LOC as actual code, why are you using Ruby at all? You’re paying a high price in terms of execution speed for the privilege of a highly dynamic and flexible language – and then you build an insurmountable wall of tests around every little thing. Might as well just use Java.

    If you really need to catch a lot of bad data or input in your app – maybe that’s an upstream problem, catch it there.

    Don’t get me wrong, testing is good and absolutely necessary, but just test what you need, what you use, and what you think is reasonably likely to actually happen. Test for expected behaviour, sanity of results, and as a way of mapping out what you write before you write it. Exception notifications should handle the edge cases and you can add that as necessary.

    Users would rather have an error every now and again on a site that innovates and progresses quickly than no errors on a site that sits still for months at a time because the developers can’t do a thing without rewriting hundreds upon hundreds of mundane low-level tests.

    Now that might be controversial : )

  13. Nick Kallen, July 12, 2008 @ 11:36 AM

    Sorry to be overly critical but I found this article really dumb. Unit testing proponents claim that unit tests help reduce defect rates, support rapid development, and support refactoring. No one claims they are a replacement for usability testing, QA, or good product design. To claim you’re debunking some canard is absolutely ridiculous. Quotes like these: “But perfectly executed code coverage doesn’t mean users will use your program,” are so stupid.

    Some smaller criticisms I have:

    1. Your quote regarding testing vs. code reviews is obviously comparing QA testing to code review, not unit testing to reviews. Either you’ve misread the quote or taken it out of context.

    I have not read Code Complete, but these theses are, in my experience, so wrongheaded that the book should not be treated as an authority:

    2. “Numerous studies have found that test cases are as likely to have errors as the code they’re testing”. Tests are code. Code has bugs. Tests have bugs. QED. This is insight? The point of testing is to have redundancy; it’s less likely that two things are wrong (the code and the test) in just the right way so that the tests are green. It’s not impossible, just sufficiently unlikely as to reduce your defect rate noticeably. If you follow good TDD practice (red-green-red-green), the problem of incorrect tests is even less likely.

    3. “Immature testing organizations tend to have about five clean tests for every dirty test”—this obviously does not reflect good OO design and testing techniques. TDD encourages/forces you to write easily testable code. Tests for edge cases in well-written code are “clean”. Or you refactor your code till the tests are clean. The most unmaintainable codebases are those littered with a suite of highly integrated edge-case tests (for defects detected in production). Good engineers refactor the code so that edge cases are reduced and easy to test at the unit level.

    Sorry to be a dick but I just thought this article was rhetorically horrible and misleading.

  14. Tim Case, July 12, 2008 @ 04:02 PM

    This would have been an awesome article if you had changed the snarky title to something like, “Don’t conflate TDD with QA.” One of the longstanding cardinal rules of software is that developers shouldn’t be the ones who test their own code; that role needs to be performed by outside testers, and TDD doesn’t change this. What??? Then what’s the point of TDD? It’s to focus the developer on writing code in small discrete units, doing only what needs to get done, and providing a check that the code does what the developer intended. 100% defect-free code is only a partial objective in that it’s one tool in a much bigger process that includes outside QA.

    “If you’re the head of QA and you hear that your developers are working test-first,” says Ward Cunningham, an independent consultant pioneering the approach, “you should think, ‘Good for them—now we can focus on the truly diabolical tests instead of worrying about these dead-on-arrival problems.’”

  15. rowan, July 14, 2008 @ 01:10 AM

    Nick—

    I wouldn’t recommend knocking ‘Code Complete’ without reading it.

    It’s on the desk of every serious programmer.

    —rowan

  16. monde, July 14, 2008 @ 07:43 AM

    Great article. It made me realize a false sense of security I get out of testing.

    About 75% of the time I’m able to practice proper BDD while coding by using autotest from ZenTest (http://rubyforge.org/projects/zentest). I run autotest in a terminal next to gvim. I have gvim split into two vertical panes, the upper for application code, the lower for the corresponding RSpec behavior-testing code. While I’m coding I get immediate feedback from autotest about whether my code is working. And by really thinking about the application code by writing its tests first, there’s a higher probability that my code is correct.

    But the system I’ve described is really just enabling me to increase the rate at which I’m writing accurate application code. It’s not necessarily guaranteeing that my solution is correct, or that my solution will hold up in a production environment, where ultimately many more unknowns are introduced.

    When I said “false sense of security” at the beginning, that too is somewhat of a false assumption. I do have some security in my experience as a programmer. Andy Hunt explains in the podcast “Andy Hunt on Pragmatic Wetware” (http://www.pragprog.com/podcasts) that the less experienced we are in a problem domain, the higher the probability that we will select a complicated solution. The corollary: the more experienced we are in a problem domain, the more likely it is that we will choose the simplest solution.

    It seems that simplicity is where magnitudes of ‘code complete’-ness are produced.

  17. Jon Dahl, July 14, 2008 @ 09:22 AM

    Nick: what percentage of software projects do you think employ at least 2 of the 4 kinds of testing discussed in this article? I’m going to guess less than 20%. If that number were 80%, you’d be right, but as it stands, I think a lot of developers and organizations need a better testing strategy. And in the Rails world, unit testing gets far more attention than other types of testing – there is even an attitude that QA testing isn’t important, because users will pick up on problems and fixes can be deployed quickly.

  18. Luke Francl, July 14, 2008 @ 12:27 PM

    monde—very interesting.

    It reminds me of some other stuff I’ve been reading about this, namely that errors tend to cluster. Typically, 80% of errors are in 20% of the code. And of course, this is the complicated code.

    There are two kinds of complicated code: complicated solutions to complicated problems, and complicated solutions to simple problems.

    It seems like testing and QA effort should be focused on that 20%, and that we should always be on the lookout to simplify, simplify, simplify wherever possible.

  19. Phil Kirkham, July 15, 2008 @ 03:17 AM

    I got suckered into visiting after seeing the sensationalist headline, so I suppose you could say that it worked.

    So how are these shops that only do dev testing because they can’t afford to hire a proper QA person (a false economy) going to do all this testing?

    Good luck with your talk though; anything that helps educate devs about testing is a good thing.

  20. Scott Meade, July 16, 2008 @ 02:11 PM

    Built-in code reviews are the best reason to do pair programming. Catching potentially problematic code via a second brain and set of eyes during code design and authoring provides the earliest possible defect detection. And, as McConnell points out, the earlier you spot problem code, the cheaper it is to fix and the better a project stays on schedule.

    Nick – look again at the table Luke posted. McConnell compares all types of testing, including unit testing, to code reviews. A 100% unit test pass rate is little indicator of software quality. Unit, functional, and integration tests often miss out on assessing correctness (was the right thing even built?), usability, efficiency, and code readability. All are attributes of quality software that we as developers strive for, yet cannot say our test suites cover.

  21. Jon Dahl, July 16, 2008 @ 03:05 PM

    Phil: come on, the title isn’t that bad. :) I think your question about shops who don’t think they can afford formal QA is a good one, and the answer is that most software projects can’t afford not to have QA. You can get good QA help in the neighborhood of $20/hour, and could probably find someone for $12/hour if you tried. A good tester is going to find more bugs than a mediocre tester, but even a mediocre tester is better than nothing. On the other hand, a good developer is going to cost $40-$60/hour as an employee, and double that as a contractor. So if you plan on 1 hour of QA for every 4 hours of programming (Joel recommends more QA than that), your QA budget is 5%-12% of your programmer budget. Throw in design, project management, and other project costs, and the total QA burden is even lower. To my mind, it’s a no-brainer: test an application, or save 5% to ship it untested. And that doesn’t take into account the cost savings of not having to ship patches or deploy fixes.
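
    (To make the math concrete: at $20/hour QA and $50/hour development, 1 hour of QA for every 4 hours of programming costs $20 against $200, or 10%; at $12/hour QA against $60/hour development, it’s $12 against $240, or 5%.)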

    The funny thing is that on larger projects, QA is often the main (or only) type of testing that is done, to the neglect of unit/usability/peer testing. Not so in the bootstrapped startup world.

  22. R. Elliott Mason, July 16, 2008 @ 10:20 PM

    You know, when I first started learning about testing, some of these points went through my head. Particularly the idea that tests are just as prone to bugs as the code being tested. This actually came up in practice when I was having trouble with a much earlier version of DataMapper that wasn’t performing correctly. I ran the tests and they all passed, which had me scratching my head for a good while, until I realized that someone had used an = instead of an ==. So I found myself trudging through both the source and the tests, only to ultimately return to the source and try to fix a bug.

    Don’t remember if I ever fixed it or not.

  23. Yarrow, July 17, 2008 @ 11:09 AM

    “To cap it all off, developer testing isn’t all that effective at finding defects.”

    ...followed by Table 20-2, which is drawn from papers published in 1986, 1996, and 2002.

    It seems unlikely that the unit testing given in the table is TDD-style testing, partly from the dates and partly because measuring the defect detection rate of TDD unit tests is problematic: do we count them as detecting an “error” in every few lines of code (because every test fails before it succeeds), or do we count them as detecting no errors except those that are injected when working on another piece of code? In the latter case, low detection rates are good if they indicate low rates of error injection.

    In other words, what Tim Case said: “Don’t conflate TDD with QA.”

  24. Luke Francl, July 17, 2008 @ 04:09 PM

    Yarrow, that’s a valid complaint. More data is needed.

    I would count them as detecting errors for every problem the programmer finds while programming, because those would be errors if they were checked in.

    FWIW, “Don’t conflate TDD [or tests] with QA” is kind of the point I’m trying to make.

  25. Nathan Zook, August 04, 2008 @ 02:09 PM

    Good QA @ $20/hr? Guess again. QA is engineering like any other. What’s more, developers can do whatever they want; it is the QA rep who is in the risk meeting. QA guards a company’s reputation.

    -=

    What’s funny about this article is that it fails to mention the obvious problem with unit tests: they make the code brittle by raising the sunk cost of refactoring.

  26. Jon Dahl, August 05, 2008 @ 11:26 AM

    Nathan: I’ve worked with a few top-notch QA folks, and they (a) cost more than $20/hour, and (b) are worth their weight in gold. But my experience is that $20/hour QA can still be quite effective on a budget.

    As for your other point – that unit tests make refactoring harder – I couldn’t disagree more. Well-written unit tests make refactoring far, far easier. And if unit tests need to be changed every time code is refactored, then the unit tests are poorly written.