On Jul 23, 11:53 am, Paul McGuire <pt...@austin.rr.com> wrote:
> On Jul 22, 5:43 pm, Filip <pink...@gmail.com> wrote:

>
> # Needs re.IGNORECASE, and can have tag attributes, such as <BR
> CLEAR="ALL">
> line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />

> # what about HTML entities defined using hex syntax, such as &#xxxx;
> amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
but also &#160;

Also, entity names can contain digits e.g. &sup1; &frac34;
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to