Another in our ongoing series on "Parsing Real-World HTML".

It's wrong, of course. But Firefox will accept as HTML escapes

    &amp &gt &lt

as well as the correct forms

    &amp; &gt; &lt;

To be "compatible", a Python screen scraper at has a function "htmldecode", which is supposed to recognize HTML escapes and generate Unicode. (Why isn't this a standard Python library function? Its inverse is available.) This uses the regular expression

    charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes. Note the ";?", which makes the closing ";" optional.

This seems fine until we hit something valid but unusual like

    &#1234567

for which "htmldecode" tries to convert "1234567" into a Unicode character with that decimal number, and gets a Unicode overflow.

For our own purposes, I rewrote "htmldecode" to require a sequence ending in ";". This means some bogus HTML escapes won't be recognized, but correct HTML will be processed correctly.

What's the general opinion of this behavior? Too strict, or OK?

John Nagle
SiteTruth
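
For anyone who wants to experiment with the stricter behavior described above, here is a minimal sketch. It is not the scraper's code or the actual rewrite. It assumes Python 3, the name "htmldecode_strict" is made up for illustration, and leaving unknown or out-of-range escapes untouched is just one possible choice.

```python
import re
from html.entities import name2codepoint

# Same pattern as charrefpat above, except the closing ";" is mandatory
# (the "?" after ";" has been dropped).
STRICT_CHARREF = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')


def htmldecode_strict(text):
    """Decode HTML escapes that end in ";" and leave all other text alone.

    Hypothetical helper: the name and the fallback behavior are assumptions.
    """
    def replace(match):
        body = match.group(1)
        if body.startswith('#'):
            digits = match.group(2)
            try:
                if digits[0] == 'x':
                    code = int(digits[1:], 16)   # hexadecimal form, e.g. &#x42;
                else:
                    code = int(digits)           # decimal form, e.g. &#65;
                return chr(code)
            except (ValueError, OverflowError):
                # Number is not a valid code point: keep the original text.
                return match.group(0)
        # Named escape: look it up, keep the original text if unknown.
        code = name2codepoint.get(body)
        return chr(code) if code is not None else match.group(0)

    return STRICT_CHARREF.sub(replace, text)


if __name__ == "__main__":
    print(htmldecode_strict("&amp; &gt; &lt;"))   # well-formed escapes
    print(htmldecode_strict("&amp &gt &lt"))      # no ";", left as-is
    print(htmldecode_strict("foo=1&#1234567"))    # no ";", left as-is
```

With the mandatory ";", the sloppy forms that Firefox tolerates pass through undecoded. The try/except is a second guard, so that a well-formed but out-of-range number is left alone instead of raising.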