Another in our ongoing series on "Parsing Real-World HTML".

   It's wrong, of course.  But Firefox will accept as HTML escapes

        &amp
        &gt
        &lt

as well as the correct forms

        &amp;
        &gt;
        &lt;

To be "compatible", a Python screen scraper at

http://zesty.ca/python/scrape.py

has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode.  (Why isn't this a standard
Python library function?  Its inverse is available.)

This uses the regular expression

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes.

Note the ";?", which makes the closing ";" optional.
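To make the effect of the optional ";" concrete, here is a small sketch that runs the pattern quoted above over two strings. The helper name "find_escapes" is made up for illustration and is not part of scrape.py.

```python
import re

# Pattern as quoted above; the trailing ";?" makes the semicolon optional.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

def find_escapes(text):
    """Return every substring the pattern treats as an HTML escape.

    Hypothetical helper, for illustration only.
    """
    return [m.group(0) for m in charrefpat.finditer(text)]

# Both the terminated and the unterminated form are picked up.
print(find_escapes("a &lt; b &gt c"))        # ['&lt;', '&gt']

# A query string containing "&#" followed by digits also matches.
print(find_escapes("http://www.example.com?foo=1&#1234567"))   # ['&#1234567']
```

So with ";?" in place, anything that merely starts like an escape is treated as one.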

This seems fine until we hit something valid but unusual like

        http://www.example.com?foo=1&#1234567

for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number.  That is above
the largest Unicode code point, 1114111 (0x10FFFF), so the
conversion gets a Unicode overflow.
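The failure is easy to reproduce on its own. The sketch below assumes Python 3, where the conversion is chr() and the error surfaces as ValueError. Code written for Python 2 would call unichr() instead.

```python
import sys

codepoint = 1234567     # the number taken from "&#1234567" in the URL

# sys.maxunicode is 0x10FFFF (1114111) on current Python 3 builds,
# so this code point is out of range.
print(codepoint > sys.maxunicode)

try:
    chr(codepoint)
except ValueError as exc:
    # chr() rejects anything above the Unicode range.
    print("conversion failed:", exc)
```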

For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
What's the general opinion of this behavior?  Too strict, or OK?
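For discussion, here is a minimal sketch of what a strict decoder along those lines might look like. It is not the code from scrape.py or the rewrite mentioned above. The names "strict_charrefpat" and "strict_htmldecode" are made up, it assumes Python 3, and it takes entity names from the standard html.entities.name2codepoint table. Leaving unknown or out-of-range escapes untouched is a choice made for this sketch.

```python
import re
from html.entities import name2codepoint

# Hypothetical strict pattern: same alternatives as before, but the
# closing ";" is mandatory.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')

def strict_htmldecode(text):
    """Decode only escapes that end in ";"; leave everything else as-is."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#x'):
                return chr(int(body[2:], 16))   # hex reference, e.g. &#x42;
            if body.startswith('#'):
                return chr(int(body[1:]))       # decimal reference, e.g. &#65;
            return chr(name2codepoint[body])    # named entity, e.g. &lt;
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)

print(strict_htmldecode("a &lt; b &gt c"))
print(strict_htmldecode("http://www.example.com?foo=1&#1234567"))
```

With this version the unterminated "&gt" and the "&#1234567" in the URL pass through unchanged, while "&lt;" is decoded.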

                                John Nagle
                                SiteTruth