Re: HTML Encoded Translation

Fredrik Lundh Tue, 17 Oct 2006 11:28:02 -0700

Dave wrote:

> How can I translate this:
> 
> &#103;&#105;
> 
> to this:
> 
> "gi"


the easiest way is to run it through an HTML or XML parser (depending on 
what the source is).  or you could use something like this:

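     import re

     def fix_charrefs(text):
         # replace numeric character references (&#103; or &#x67;)
         # with the corresponding unicode characters
         def fixup(m):
             text = m.group(0)
             try:
                 if text[:3] == "&#x":
                     # hexadecimal reference
                     return unichr(int(text[3:-1], 16))
                 else:
                     # decimal reference
                     return unichr(int(text[2:-1]))
             except ValueError:
                 pass
             return text # leave as is
         # raw string, so that \w reaches the regex engine untouched
         return re.sub(r"&#?\w+;", fixup, text)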

     >>> fix_charrefs("&#103;&#105;")
     u'gi'
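
for readers on Python 3, here is a minimal sketch of the same idea. it assumes Python 3.4 or later, where unichr no longer exists (chr covers the full unicode range) and the standard library provides html.unescape. the name fix_charrefs_py3 is made up for this illustration and is not part of any library:

```python
# sketch only: assumes Python 3.4+ (chr() instead of unichr(), html.unescape available)
import html
import re

def fix_charrefs_py3(text):
    # hypothetical Python 3 port of fix_charrefs; numeric references only
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3] == "&#x":
                # hexadecimal reference, e.g. &#x67;
                return chr(int(ref[3:-1], 16))
            # decimal reference, e.g. &#103;
            return chr(int(ref[2:-1]))
        except (ValueError, OverflowError):
            # not a number (e.g. &amp;) or not a valid code point
            pass
        return ref  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

# the hand-written version decodes numeric references only
print(fix_charrefs_py3("&#103;&#105;"))
print(fix_charrefs_py3("&#x67;&#x69;"))

# html.unescape decodes numeric references and named entities alike
print(html.unescape("&#103;&#105; &amp; &lt;"))
```

if the input is known to be HTML, html.unescape is probably the simpler choice, since it also knows the named entities; the regex version is mostly useful when only numeric references should be touched.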

also see:

     http://effbot.org/zone/re-sub.htm#strip-html

> I've tried urllib.unencode and it doesn't work.

those are HTML/XML character references, not encoded URL characters.
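
to make the difference concrete, here is a small sketch, assuming Python 3, where URL unquoting lives in urllib.parse.unquote and character references are handled by html.unescape. the sample strings are made up for the illustration:

```python
# sketch only: assumes Python 3 (urllib.parse.unquote, html.unescape)
from html import unescape
from urllib.parse import unquote

url_encoded = "%67%69"        # percent-encoded URL characters
char_refs = "&#103;&#105;"    # HTML/XML character references

# unquote decodes percent-escapes...
print(unquote(url_encoded))

# ...but has nothing to do in a string of character references
print(unquote(char_refs))

# unescape is the tool that understands character references
print(unescape(char_refs))
```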

</F>

