Watch out for validates_length_of if you need to make sure a string is no more than a certain number of bytes long. For example, SMS messages can be no longer than 160 bytes. I recently got bitten by this because some Unicode “curly” quotes slipped into a reply message, and the validation didn’t catch the extra bytes.
Here’s the problem.
Consider this string:
str = "€"
It is 3 bytes long (in Ruby 1.8, String#size counts bytes):
str.size # => 3
However, ActiveRecord’s validates_length_of counts this as only one character, because it uses str.split(//).size to measure the length.
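To make the gap concrete, here is a minimal sketch in plain Ruby (no Rails needed) of the kind of message that bit me. The message itself is made up for illustration, and the sketch assumes Ruby 1.9 or later, where split(//) yields characters without any extra setup. It uses String#bytesize for the byte count so the result does not depend on what size returns.

```ruby
# encoding: utf-8

# Hypothetical reply: 158 plain characters wrapped in curly quotes.
# "\u201C" and "\u201D" are the left and right curly double quotes,
# each of which takes 3 bytes in UTF-8.
message = "\u201C" + ("x" * 158) + "\u201D"

# What the default validation measures: one token per character.
char_count = message.split(//).size

# What the SMS limit is actually about: the number of bytes.
byte_count = message.bytesize

puts "characters: #{char_count}" # 160, so the validation passes
puts "bytes:      #{byte_count}" # 164, so the message is too long
```

The message passes a 160-character check while being 4 bytes over the limit.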
If you NEED to be certain that a string is no longer than a certain number of bytes, you’ll need to override the default behavior of validates_length_of.
Fortunately, you can supply your own tokenizer, which makes this easy. The tokenizer is called with the attribute value, and size is called on its return value to find out how many tokens there are. Since String responds to size, which returns the number of bytes, you can simply have the tokenizer return the attribute value itself, like this:
validates_length_of :message, :maximum => 160, :tokenizer => lambda { |str| str }
Will this still work in Ruby 1.9? I’m not sure. I now have a test case which will warn me if it doesn’t…
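A test along those lines might look something like the sketch below. It is not the actual test from my app: measured_length is a hypothetical stand-in for the validation's measuring step (call the tokenizer, then call size on the result), and the byte tokenizer is a variation I'm assuming would work, chosen because it does not depend on what String#size returns.

```ruby
# encoding: utf-8

# Hypothetical stand-in for the validation's measuring step: call the
# tokenizer, then call size on whatever it returns.
def measured_length(value, tokenizer)
  tokenizer.call(value).size
end

default_tokenizer  = lambda { |str| str.split(//) }   # one token per character
identity_tokenizer = lambda { |str| str }             # relies on String#size
byte_tokenizer     = lambda { |str| str.bytes.to_a }  # one token per byte

euro = "\u20AC" # the euro sign, 3 bytes in UTF-8

puts "default:  #{measured_length(euro, default_tokenizer)}"
puts "identity: #{measured_length(euro, identity_tokenizer)}"
puts "bytes:    #{measured_length(euro, byte_tokenizer)}"

# The check the test case needs: complain if the identity tokenizer has
# stopped counting bytes on this Ruby.
if measured_length(euro, identity_tokenizer) != 3
  warn "identity tokenizer no longer counts bytes; use the byte tokenizer"
end
```

On a Ruby where size counts characters, the identity line prints 1 and the warning fires, while the byte tokenizer still reports 3.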


This problem should not exist in Ruby 1.9, because strings are no longer a stream of bytes.
Actually, that will probably make it worse, because SMS messages ARE streams of bytes. I need to know if a given Unicode string is longer than 160 bytes. Hopefully there will be an easy way to figure that out.
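For what it's worth, String#bytesize (present in Ruby 1.8.7 and in 1.9) reports the byte length regardless of how size behaves, so a check along these lines should hold up on either version. This is only a minimal sketch: the constant and the fits_in_sms? helper are hypothetical names of mine, and it assumes the message is UTF-8 encoded and that the limit is counted in UTF-8 bytes.

```ruby
# encoding: utf-8

SMS_MAX_BYTES = 160 # the limit discussed in the post

# Hypothetical helper: true when the message fits within the byte limit.
def fits_in_sms?(message, limit = SMS_MAX_BYTES)
  message.bytesize <= limit
end

short = "\u20AC" * 53 # 53 euro signs, 159 bytes
long  = "\u20AC" * 54 # 54 euro signs, 162 bytes

puts fits_in_sms?(short) # true
puts fits_in_sms?(long)  # false
```

Both strings are far below 160 characters, so only a byte-based check tells them apart.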