validates_length_of byte counting gotcha

Posted by Luke Francl
on Sunday, July 19

Watch out for validates_length_of if you need to make sure a string is no more than a certain number of bytes long. For example, SMS messages can be no longer than 160 bytes. I recently got bitten by this because some Unicode “curly” quotes slipped into a reply message, and the validation didn’t catch the extra bytes they added.

Here’s the problem.

Consider this string:

str = "€"

It is 3 bytes long:

str.size # => 3

However, ActiveRecord’s validates_length_of counts this as only one character, because it uses str.split(//).size to measure the length.
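To make the difference concrete, here is a minimal sketch of the two measurements in plain Ruby, outside Rails. The helper names character_count and byte_count are made up for illustration. The byte count uses String#bytesize (assuming Ruby 1.8.7 or later) rather than size, so that it reports bytes no matter how size behaves on the Ruby you run it on, and the source file is assumed to be UTF-8.

```ruby
# encoding: utf-8

# Character count, measured the way the validation is described as doing it.
# (Hypothetical helper, not part of Rails.)
def character_count(str)
  str.split(//).size
end

# Byte count via String#bytesize, which reports bytes regardless of
# whether String#size counts bytes or characters.
# (Hypothetical helper, not part of Rails.)
def byte_count(str)
  str.bytesize
end

puts "characters: #{character_count("€")}, bytes: #{byte_count("€")}"
```

A message made entirely of characters like this one would pass a 160-character check while being three times too long in bytes.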

If you NEED to be certain that a string is no more than a certain number of bytes, you’ll need to override the default behavior of validates_length_of.

Fortunately, you can supply your own tokenizer, which makes this easy. The validation calls the tokenizer with the attribute value, then calls size on whatever it returns to find out how many tokens there are. Since String responds to size, which returns the number of bytes, you can simply have the tokenizer return the attribute value itself, like this:

validates_length_of :message, :maximum => 160, :tokenizer => lambda { |str| str }

Will this still work in Ruby 1.9? I’m not sure. I now have a test case which will warn me if it doesn’t…
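In case it helps, here is a sketch of a tokenizer that should not depend on what String#size returns. It assumes String#bytes is available (Ruby 1.8.7 and later) and converts the result to an Array, so that size on the return value is always a byte count. The within_maximum? method is a hypothetical stand-in for the "call the tokenizer, then call size" step described above, not Rails's own code, and I haven't run this against Rails itself.

```ruby
# encoding: utf-8

# Hypothetical stand-in for the length check: call the tokenizer,
# then call size on whatever it returns. Not Rails's implementation.
def within_maximum?(value, maximum, tokenizer)
  tokenizer.call(value).size <= maximum
end

# Tokenizer that returns an Array of bytes, so size is a byte count
# whether String#size counts bytes or characters.
BYTE_TOKENIZER = lambda { |str| str.bytes.to_a }

# 160 ASCII characters: 160 bytes.
ASCII_MESSAGE = "a" * 160

# 160 characters, but each curly quote takes 3 bytes in UTF-8: 164 bytes.
CURLY_MESSAGE = "“" + ("a" * 158) + "”"

puts within_maximum?(ASCII_MESSAGE, 160, BYTE_TOKENIZER)
puts within_maximum?(CURLY_MESSAGE, 160, BYTE_TOKENIZER)
```

If that holds up, the validation itself would presumably become :tokenizer => lambda { |str| str.bytes.to_a }, at the cost of building an array for every message.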

Comments


  1. kb, July 20, 2009 @ 02:50 PM

    This problem should not exist in Ruby 1.9 because strings are no longer a stream of bytes.

  2. Luke Francl, July 20, 2009 @ 09:43 PM

    Actually, that will probably make it worse, because SMS messages ARE streams of bytes. I need to know if a given Unicode string is longer than 160 bytes. Hopefully there will be an easy way to figure that out.