Functional programming and looping

Posted by Jon
on Tuesday, July 29

If you’re a programmer, you’ve probably worked through one or more books teaching you the syntax of a new language. I’ve had this experience with half a dozen languages, including C, JavaScript, and Perl. These books typically introduce loops midway through the syntax discussion, after datatypes and control flow, but before I/O and advanced features.

Loops are almost always presented according to this formula.

  • Inane intro text: “What if you want to do an operation more than once?”
  • Introduce the while loop, including the difference between do…while and while…do.
  • Introduce the for loop, the while loop’s crazy cousin.
  • (Bonus) Introduce the foreach loop, if the language is sufficiently high-level.

And that’s it – you know how to loop through code; time to move on.

Not so fast. If you’re lucky enough to use a language that draws from functional programming, you shouldn’t loop like this.

The point

From now on, I’m going to use Ruby for examples, but this article isn’t about Ruby. It is about transitioning from primitive loops to iterating through collections, and from generic collection functions (like each) to more specific functions (like map).

From loops to array traversal

For the last several months, I’ve been working on Tumblon, a medium-sized Rails application. I’ve worked on 15-20 Ruby applications over the last three years, probably totaling 50,000 lines of Ruby code.

I’ve only used a primitive loop once.

That primitive loop was a loop {} loop, forever polling a task list looking for jobs. In other words, a loop with no exit condition beyond ^C or a server crash. As far as I know, Ruby doesn’t have a C-style for loop at all, which would explain why I haven’t used one. It has a foreach-style loop (for item in arr), but that’s syntactic sugar for arr.each {}.
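
For the curious, that sort of loop looks roughly like this. This is a sketch rather than the real thing, with directory polling standing in for the actual task list:

# A never-ending worker loop: poll for pending jobs forever.
# There is no exit condition; you stop it with ^C or by killing the process.
loop do
  Dir.glob("jobs/*.task").each do |path|
    puts "running #{path}"   # stand-in for actually performing the job
    File.delete(path)        # mark the job as done
  end
  sleep 10                   # nothing left to do; wait before polling again
end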

So here is the first reason I’ve only used a primitive loop in one case: the each concept is usually a better option. Its Ruby implementation will be familiar to anyone who’s seen Ruby code before:

["horse", "pig", "cow"].each do |animal|
  puts "Old MacDonald has a #{animal}"
end

(Yes, I have a small child.)

This is far cleaner than its for or while loop alternatives. And it is a better abstract representation of what we’re doing: we aren’t looping with an exit condition, we are iterating through an array. But what if you want to do something a fixed number of times? Even that can be understood as traversing a list, like [1,2,3,4,5,6,7,8,9,10].each {}. Of course, Ruby provides a cleaner version: 10.times {}.

So if your loop is working through a list of some sort, each is a better abstraction of the problem. And in my experience building Ruby applications, every loop but one has been traversing a list. Parsing XML? Traversing a collection. Summing numbers? Traversing a collection. Reading in a text file? Listening to STDIN? Working with rows in a database? Traversing a collection. That’s what each does well.
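
To make that concrete, here is a small sketch of the idea; it isn’t code from any particular application, and the filename is made up:

# Reading a text file? Still just traversing a collection of lines.
# ("animals.txt" is a made-up filename for illustration.)
File.readlines("animals.txt").each do |line|
  puts line.upcase
end

# Listening to STDIN? Same idea: visit each line as it arrives.
STDIN.each_line do |line|
  puts "you said: #{line.chomp}"
end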

Beyond arr.each

But each isn’t the final word. It is a step up from a primitive for or while loop when working with a collection of values, but many each loops should be replaced with other array methods, like map, inject, and select.

When is each useful? Simple: when you want to create side-effects, like saving to the database, printing a result, or sending a web service call. In these cases, you’re not concerned with the return value; you want to change state on the screen, the disk, the database, or something else. Take a look at this code.

User.find(:all).each do |user|
  Notification.deliver_email_newsletter(user)
end

You don’t need a return value from this – you need emails to be delivered.

But don’t use each if you want to extract some new value from an array. That’s not what it’s for. Instead, reach for three other powerful functions: map, inject, and select. To see why, let’s start with select. Here is code that takes in an array and creates a new array from the elements that match a certain condition, using each.

active_users = []
users.each do |user|
  active_users << user if user.active?
end
active_users

Man, the first and last lines are ugly. Why do you have to initialize and return active_users? Answer: because this is a misuse of each. You are much better off using select (or its equivalent, find_all):

users.select do |user|
  user.active?
end

Using select is shorter, easier to understand, and less bug-prone. And more importantly, it clearly encapsulates one common use of each (and looping in general).

Two other key functions – map and inject (or reduce) – complement select and follow a similar pattern. And not surprisingly, they form the foundation of the MapReduce approach to distributed processing. I’ve written more about map and reduce in another article, and here is shorthand for knowing which of these functions to use (with short examples after the table):

Desired Return Value                                  Function
New array with the same number of values              map
New array composed of part of the old array           select
A single value (though this value can be an array)    inject
None (side-effects only)                              each
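
Select was shown above; here are equally minimal sketches of map and inject. These are illustrative snippets of my own, not code from Tumblon:

# map: build a new array with the same number of values
names = ["horse", "pig", "cow"].map do |animal|
  animal.capitalize
end
# => ["Horse", "Pig", "Cow"]

# inject: boil a collection down to a single value
sum = [1, 2, 3, 4].inject(0) do |total, n|
  total + n
end
# => 10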

The point, redux

Use each for changing state. Otherwise, avoid side-effects and use “functional” array methods that return a value. Simple. Your code will be cleaner and less bug prone.

And remember the dead giveaway:

  1. Initialize an empty value, or array, or whatever (new_arr = [])
  2. arr.each, changing the initialized value
  3. Return the value (return new_arr)

Whenever you see this pattern, you know you’ve got an each loop that needs swapping out.

(Edit: I’ve posted a follow-up article with more about map and reduce.)

Why programmers should play Go

Posted by Jon
on Monday, July 14

Go is an ancient strategy game with simple rules and a profound degree of complexity.

Software development is the art of managing complexity using a limited number of rules, structures, and patterns.

Programmers should play Go.

Go in 28 words or less.

The beauty of Go is its combination of simplicity and complexity. On the one hand, Go has only a handful of rules: place stones, don’t get completely surrounded, control territory. As with chess, the mechanics can be picked up in a few minutes, though Go has only a single type of “move” and only one edge case (the ko rule). And like chess, one can spend a lifetime discovering the strategic and tactical layers of the game.

While chess is quite complex and rich, such that it took a 30-node supercomputer to defeat the reigning chess champion, no computer comes close to defeating even a skilled amateur Go player. There are 361 points on a Go board, and with two players, there are roughly 2.08168199382×10^170 valid positions. That’s quite a bit bigger than a googol (yes, that is the correct spelling). Realistically, there are something on the order of 10^400 possible ways that a typical game could play out. And the number of possible moves roughly follows 361!, which means that only 20 moves in, there are many googols of possible ways that the game could shake down. (As a fun exercise, try plugging 361! into an online factorial calculator.)

Managing complexity

So how does one play Go, given this near-infinite complexity? On a tactical level, a player approaches Go like chess, thinking several moves ahead. But this only works in small spaces, like a tight battle in a small sector of the board. Beyond that, there are just too many possibilities. So on a strategic level, a player must think in shapes or patterns. These shapes provide shorthand ways of managing the complexity of Go. As a non-master, I may have no idea how things will proceed in one sector of the board, but I may be able to recognize strong and weak patterns of stones, vulnerable shapes and effective formations.

But there’s more: Go has several sorts of patterns. Beyond shapes, there are Go proverbs. These can be general: “Your opponent’s good move is your good move”; specific: “Don’t try to cut the one-point jump”; funny: “Even a moron connects against a peep”; and meta: “Don’t follow proverbs blindly.” These proverbs are principles which help a player make good decisions. They are less specific than shapes, and so they provide guidelines for whatever situations may arise on the Go board. Proverbs often conflict, and a player must determine when and how to apply them.

Finally, there are joseki. Joseki are patterns of play that are considered even for both sides. They typically happen in the corners of the board, and typically at the beginning of the game. Interestingly, there is a Go proverb that says “Learning joseki costs two stones,” meaning that memorizing these patterns isn’t helpful. Instead, a player should learn from joseki by understanding what is going on in each move.

Patterns in Go, patterns in software design

Each of these Go patterns has a rough programming analogue.

Shapes in Go aren’t unlike software design patterns. While there is nothing preventing you from placing logic in your views, this shape is recognized to be a weak one. Think of well-known design patterns: MVC, Adapter, and Factory are recognized to be helpful in some circumstances (and not appropriate in others). On a lower level, iteration and recursion have commonly recognized shapes, as does the choice between database normalization and denormalization. Even if you can’t hold an entire program or algorithm in your head at once, recognizing common shapes helps you understand what is going on.

Go proverbs are like another type of pattern in software: CapitalizedPrinciples (for lack of a better term) made popular by Extreme Programming. Think DontRepeatYourself, YouArentGonnaNeedIt, CollectiveCodeOwnership, DailyBuild, TestFirst. These aren’t specific code “shapes”, like a singleton class – they are general principles that guide the practice of programming.

Because joseki is about an exchange between competing parties, its programming parallel is a little less clear. The closest comparison, in my mind, is programming exercises. This article, for instance, suggests 9 exercises to help you become a better OO programmer, like:

  • Use only one dot per line
  • Use only one level of indentation per method
  • Don’t use setters, getters, or properties

In a real-world program, you’re unlikely to stick to these principles 100% of the time. But forcing yourself to write code in this way can be an eye-opening experience and can make you a better developer.

So what can Go really do for you?

Obviously, these parallels are structural. Specific Go proverbs (“Your opponent’s good move is your good move”) may not have direct relevance to software development. So can Go really make you a better developer?

I think it can, and I’ll go one further. I think Go can make you smarter. There is a lot of anecdotal evidence to this effect [1] [2] [3], for example [4]:

In fact, all of our minds can benefit from playing Go, which officially has the capacity to make you smarter. Research has shown that children who play Go have the potential for greater intelligence, since it motivates both the right and left sides of the brain.

The research mentioned isn’t footnoted, unfortunately, so take statements like this with a grain of salt.

But it makes sense: like chess, Go requires pattern recognition, a mix of strategic and tactical thinking, and comprehension of complex structures, though in Go the patterns are larger and the complexity is greater. A mind trained to think in these ways is going to have an easier time attacking similar problems in other spheres.

Like software development.

Image by andres_colmen: http://flickr.com/photos/andres-colmen/2539473895/

Testing is overrated

Posted by Luke Francl
on Friday, July 11

Next week at RubyFringe, I’ll be taking on one of the programming world’s favorite topics: testing.

Hear me out. Like everyone who’s had their bacon saved by a unit test, I think testing is great. In a dynamic language like Ruby, tests are especially important to give us the confidence our code works. And once written, unit tests provide a regression framework that helps catch future errors.

However, testing is over-emphasized. If our goal is high-quality software, developer testing is not enough.

This is important because of what Steve McConnell calls The General Principle of Software Quality. Most development time is spent debugging. “Therefore, the most obvious method of shortening a development schedule is to improve the quality of the product.” (Code Complete 2, p. 474.)

Problems with developer testing

Developer testing has some limitations. Here are a few that I’ve noticed.

Testing is hard...and most developers aren’t very good at it!

Programmers tend to write “clean” tests that verify the code works, not “dirty” tests that test error conditions. Steve McConnell reports, “Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it’s done by creating 25 times as many dirty tests.” (Code Complete 2, p. 504)
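
To make the distinction concrete, here is a toy example of my own (not McConnell’s), using Test::Unit and a throwaway parse_price method: a “clean” test exercises the happy path, while a “dirty” test asks what happens when the input is garbage.

require "test/unit"

# A toy method under test, just for illustration: "$1.50" => 150 cents.
def parse_price(string)
  raise ArgumentError, "not a price: #{string}" unless string =~ /\A\$(\d+)\.(\d{2})\z/
  $1.to_i * 100 + $2.to_i
end

class PriceTest < Test::Unit::TestCase
  # A "clean" test: well-formed input, the happy path.
  def test_parses_a_well_formed_price
    assert_equal 150, parse_price("$1.50")
  end

  # A "dirty" test: what happens when the input is garbage?
  def test_rejects_malformed_input
    assert_raise(ArgumentError) { parse_price("one fifty") }
  end
end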

You can’t test code that isn’t there

Robert L. Glass discusses this several times in his book Facts and Fallacies of Software Engineering. Missing requirements are the hardest errors to correct, because oftentimes only the customer can detect them. Unit tests with total code coverage (and even code inspections) can easily fail to detect missing code. Therefore, these errors can slip into production (or your iteration release).

Tests alone won’t solve this problem, but I have found that writing tests is often a good way to suss out missing requirements.

Tests are just as likely to contain bugs

Numerous studies have found that test cases are as likely to have errors as the code they’re testing (see Code Complete 2, p. 522).

So who tests the tests? Only review of the tests can find deficiencies in the tests themselves.

Developer testing isn’t very effective at finding defects

To cap it all off, developer testing isn’t all that effective at finding defects.

Defect-Detection Rates of Selected Techniques (Code Complete 2, p. 470)

Removal Step                 Lowest Rate    Modal Rate    Highest Rate
Informal design reviews      25%            35%           40%
Formal design inspections    45%            55%           65%
Informal code reviews        20%            25%           35%
Modeling or prototyping      35%            65%           80%
Formal code inspections      45%            60%           70%
Unit test                    15%            30%           50%
System test                  25%            40%           55%

Don’t put all your eggs in one basket

The most interesting thing about these defect detection techniques is that they tend to find different errors. Unit testing finds certain errors; manual testing others; usability testing and code reviews still others.

Manual testing

As mentioned above, programmers tend to test the “clean” path through their code. A human tester can quickly make mincemeat of the developer’s fairy world.

Good QA testers are worth their weight in gold. I once worked with a guy who was incredibly skilled at finding the most obscure bugs. He could describe exactly how to replicate the problem, and he would dig into the log files for a better error report, and to get an indication of the location of the defect.

Joel Spolsky wrote a great article on the Top Five (Wrong) Reasons You Don’t Have Testers—and why you shouldn’t put developers on this task. We’re just not that good at it.

Code reviews

Code reviews and formal code inspections are incredibly effective at finding defects (studies show they find more defects than developer testing, and cost less), and the peer pressure of knowing your code will be scrutinized helps ensure higher quality right off the bat.

I still remember my first code review. I was doing the ArsDigita Boot Camp which was a 2-week course on building web applications. At the end of the first week, we had to walk through our code in front of the group and face questions from the instructor. It was incredibly nerve-wracking! But I worked hard to make the code as good as I could.

This stresses the importance of what Robert L. Glass calls the “sociological aspects” of peer review. Reviewing code is a delicate activity. Remember to review the code…not the author.

Usability tests

Another huge problem with developer tests is that they won’t tell you if your software sucks. You can have 1500% test coverage and no known defects and your software can still be an unusable mess.

Jeff Atwood calls this the ultimate unit test failure:

I often get frustrated with the depth of our obsession over things like code coverage. Unit testing and code coverage are good things. But perfectly executed code coverage doesn’t mean users will use your program. Or that it’s even worth using in the first place. When users can’t figure out how to use your app, when users pass over your app in favor of something easier or simpler to use, that’s the ultimate unit test failure. That’s the problem you should be trying to solve.

Fortunately, usability tests are easy and cheap to run. Don’t Make Me Think is your Bible here (the chapters about usability testing are available online). For Tumblon, we’ve been conducting usability tests with screen recording software that costs $20. The problems we’ve found with usability tests have been amazing. It punctures your ego, while at the same time giving you the motivation to fix the problems.

Why testing works

Unit testing forces us to think about our code. Michael Feathers gets at this in his post The Flawed Theory Behind Unit Testing:

One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense…. It’s a nice theory, but it’s wrong….

In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.

All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.

That’s the magic, and it’s why unit testing works also. When you write unit tests, TDD-style or after your development, you scrutinize, you think, and often you prevent problems without even encountering a test failure.

So: adopt practices that make you think about your code, and supplement them with other defect detection techniques.

Testing testing testing

Why do we developers read, hear, and write so much about (developer) testing?

I think it’s because it’s something that we can control. Most programmers can’t hire a QA person or conduct even a $50 usability test. And perhaps most places don’t have a culture of code reviews. But they can write tests. Unit tests! Specs! Mocks! Stubs! Integration tests! Fuzz tests!

But the truth is, no single technique is effective at detecting all defects. We need manual testing, peer reviews, usability testing and developer testing (and that’s just the start) if we want to produce high-quality software.


MapReduce at RailsConf Europe

Posted by Jon
on Thursday, July 03

This September, I’ll be presenting at RailsConf Europe on EC2, MapReduce, and Distributed Processing. The talk will explain the MapReduce approach to distributed processing, will show a few example implementations, and will discuss MapReduce vs. other distributed processing techniques.
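
To give a flavor of the approach before then (this is my own toy illustration, not material from the talk): the canonical example is counting words, where a map step turns each document into key/value pairs and a reduce step folds each key’s values into a total. On a single machine, the idea looks something like this:

# A toy, single-machine sketch of the MapReduce idea: word count.
docs = [
  "the cat sat on the mat",
  "the dog sat"
]

# Map phase: each document becomes a list of [word, 1] pairs.
mapped = docs.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# Shuffle phase: group the pairs by word.
grouped = mapped.group_by { |word, _count| word }

# Reduce phase: fold each group's counts into a single total.
counts = grouped.map { |word, pairs| [word, pairs.map { |_, count| count }.inject(0, :+)] }

p Hash[counts]  # => {"the"=>3, "cat"=>1, "sat"=>2, "on"=>1, "mat"=>1, "dog"=>1}

The point of the real thing, of course, is that the map and reduce phases can be spread across many machines; the resources below cover how.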

Whether you’ll be there or not, if you’re interested in learning more about MapReduce, here are some resources. I’ll write a few more posts on the subject before the conference, so watch this space as well.

Cluster Computing and MapReduce is a great series of video lectures given to Google interns in 2007. The first two are the most relevant: the first introduces distributed processing concepts, while the second covers MapReduce itself.

MapReduce: Simplified Data Processing on Large Clusters is the paper by Jeffrey Dean and Sanjay Ghemawat of Google that got things going in the first place.

MapReduce for Ruby: Ridiculously Easy Distributed Programming discusses MapReduce and introduces Starfish, a Ruby library for distributed processing. Starfish is not a MapReduce implementation, however – it takes a somewhat different approach to distributed processing.

Skynet (a few writeups: InfoQ, Dion Almaer) is another Ruby-based distributed processing system inspired by MapReduce.

Writing Ruby Map-Reduce programs for Hadoop discusses using Ruby to wrap Hadoop, an open-source MapReduce implementation built in Java.

Introduction to Parallel Programming and MapReduce at Google Code University, a good overview of distributed processing and the MapReduce approach.

And finally, one article that you should avoid:

MapReduce: A major step backwards compares MapReduce to relational databases, and says that MapReduce loses out because it doesn’t support database indices, database views, Crystal Reports, etc. Basically, the complaint is that MapReduce isn’t SQL compliant. WTF? Clearly, the authors didn’t understand what MapReduce is. The problem, as explained elsewhere, is that the authors thought that MapReduce == CouchDB/SimpleDB. Which is obviously not true. Run %s/MapReduce/SimpleDB/ on the original article and it makes some sense. But long story short, this article will teach you nothing about MapReduce, and will likely confuse you further. So stay away.