Re: converting html escape sequences to unicode characters

Craig Ringer Fri, 10 Dec 2004 00:10:14 -0800

On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:


I'm pretty sure this somewhat horrifying code does it, but it is
probably an example of what not to do:

>>> escapeseq = '&#48708;'
>>> # "\U%08x" zero-pads, so code points below 0x1000 or above 0xFFFF also work
>>> uescape = ("\\U%08x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I may not have a proper font for it, but I think that's right: my
terminal seems to show it correctly.)

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
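
For comparison, here is a rough sketch of the same three steps without
the round trip through an escape string. It assumes Python 3, where
chr() accepts any code point; decode_decimal_entity is only an
illustrative name, not something from the original question.

```python
# Sketch for Python 3; decode_decimal_entity is a hypothetical helper name.
def decode_decimal_entity(escapeseq):
    # Strip the leading '&#' and the trailing ';' to get the decimal digits.
    codepoint = int(escapeseq[2:-1])
    # chr() maps the code point straight to its character.
    return chr(codepoint)

char = decode_decimal_entity('&#48708;')
print(char)                  # the character U+BE44
print(char.encode('utf-8'))  # its UTF-8 bytes: b'\xeb\xb9\x84'
```

Encoding to UTF-8 is a separate, final step: the entity names a code
point, and the bytes only appear once you call encode().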

>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     # "\U%08x" zero-pads, so any code point yields a valid escape
...     return ("\\U%08x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
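
If the escapes are embedded in running text rather than held in a list,
a regular expression can probably do the substitution in one pass. This
is a minimal sketch assuming Python 3; unescape_decimal is a made-up
name, and html.unescape from the standard library is shown only as a
cross-check.

```python
import html
import re

# Hypothetical helper: replaces only decimal entities such as '&#48708;'.
def unescape_decimal(text):
    # Capture the digits between '&#' and ';' and convert each to a character.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

sample = '&#48708;&#54665;&#44592;'
converted = unescape_decimal(sample)
print(converted)
# html.unescape also handles named and hexadecimal entities.
print(html.unescape(sample) == converted)
```

The regex version leaves named entities like &amp; untouched, which may
or may not be what you want; html.unescape converts those as well.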

--
Craig Ringer

