Dealing with text from the wild is often a painful experience. Even though Ruby has great libraries like Nokogiri, which help you parse XML and HTML without losing your mind, they can’t save you from the subtle string encoding issues that occasionally crop up.
On a current project, I’ve been scraping a bunch of data from a variety of websites that offer similar products to each other. Each product has a code that looks something like AAA 0213; i.e. three or four letters, a space, and then four digits. This is easy to pull out with a Ruby regular expression - just use string.scan(/([A-Z]+\s[0-9]+)/), stick the result in the database, and go.
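As a minimal sketch of that extraction step, assuming a made-up page fragment (the sample text and the variable names are illustrative, not taken from the project):

```ruby
# Hypothetical scraped text; the real input came from product pages.
page_text = "In stock: AAA 0213 and KAAB 1011."

# scan returns one array per match, holding that match's capture groups,
# so flatten collapses the result into a plain list of codes.
codes = page_text.scan(/([A-Z]+\s[0-9]+)/).flatten

p codes
```

Because the pattern has a capture group, scan yields nested arrays; flatten is one simple way to get a flat list before storing the codes.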
Eventually, in the application's search component, we ran into problems - although I could clearly see record KAA 1011 in the product listing and in the database, it wouldn’t turn up in search.
To debug the problem, I opened up a REPL with pry and confirmed the bug. Even though the product.code string appeared indistinguishable from the string that I created by typing out ‘K - A - A - spacebar - 1 - 0 - 1 - 1’, the two were not equal. Ruby’s .ord method, which returns the integer codepoint of the first character of a string, helped diagnose the problem.
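To illustrate how .ord can tell apart characters that look identical, here is a small sketch. The two strings are hypothetical stand-ins for the scraped and hand-typed codes, and the scraped one is assumed to contain a non-breaking space, written explicitly as \u00A0:

```ruby
# Hypothetical stand-ins: "scraped" contains U+00A0, "typed" a plain space.
scraped = "KAA\u00A01011"
typed   = "KAA 1011"

# The strings render the same but do not compare equal.
puts scraped == typed

# ord reports the codepoint of the first character of its receiver,
# so index into the string to inspect the separator at position 3.
puts scraped[3].ord   # 160
puts typed[3].ord     # 32

# codepoints lists the value of every character at once.
p scraped.codepoints  # [75, 65, 65, 160, 49, 48, 49, 49]

# Swapping the odd character for a plain space makes the strings equal.
puts scraped.tr("\u00A0", " ") == typed
```

Calling codepoints on the whole string is often quicker than checking one index at a time, since any unexpected value stands out in the list.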
Using .ord on the fourth character of each string, I confirmed that product.code[3].ord was 160, while "KAA 1011"[3].ord was 32. A quick search confirmed that the Unicode character with integer value 160 is in fact the “no-break space” (U+00A0), which explains why the two strings were indistinguishable on screen.
The fix from that point was trivial - in the pre-processing filters, replace all non-breaking spaces with regular, breaking spaces - but this issue would have been far harder to debug without the Ruby String