Got it, great. This worked like a charm. I knew I was barking up the wrong tree with urllib, but I didn't know which tree to bark up...
Thanks!

Fredrik Lundh wrote:
> Dave wrote:
>
> > How can I translate this:
> >
> > &#103;&#105;
> >
> > to this:
> >
> > "gi"
>
> the easiest way is to run it through an HTML or XML parser (depending on
> what the source is).  or you could use something like this:
>
>     import re
>
>     def fix_charrefs(text):
>         def fixup(m):
>             text = m.group(0)
>             try:
>                 if text[:3] == "&#x":
>                     return unichr(int(text[3:-1], 16))
>                 else:
>                     return unichr(int(text[2:-1]))
>             except ValueError:
>                 pass
>             return text # leave as is
>         return re.sub("&#?\w+;", fixup, text)
>
>     >>> fix_charrefs("&#103;&#105;")
>     'gi'
>
> also see:
>
>     http://effbot.org/zone/re-sub.htm#strip-html
>
> > I've tried urllib.unencode and it doesn't work.
>
> those are HTML/XML character references, not encoded URL characters.
>
> </F>

--
http://mail.python.org/mailman/listinfo/python-list
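
For anyone finding this thread on Python 3: the quoted function relies on unichr, which no longer exists there. Below is a rough adaptation, not Fredrik's code. It assumes Python 3.4 or later, swaps unichr for chr, and narrows the pattern to numeric references only (my choice, so named entities such as &amp; are left untouched). The standard library's html.unescape should also cover this case, and handles named entities as well.

```python
import html
import re

def fix_charrefs(text):
    """Replace numeric character references (&#103; or &#x67;) with their characters."""
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3].lower() == "&#x":
                # hexadecimal reference, e.g. &#x67;
                return chr(int(ref[3:-1], 16))
            # decimal reference, e.g. &#103;
            return chr(int(ref[2:-1]))
        except (ValueError, OverflowError):
            # not a valid number or code point
            return ref  # leave as is
    return re.sub(r"&#\w+;", fixup, text)

print(fix_charrefs("&#103;&#105;"))
print(html.unescape("&#103;&#105;"))
```

Both calls should print the decoded text. If the input is a whole HTML document rather than a stray string, running it through a real parser, as suggested in the quoted reply, is probably still the safer route.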