[EMAIL PROTECTED] wrote:
> Some web feeds use decimal character entities that seem to confuse
> Python (or me). For example, the string "doesn't" may be coded as
> "doesn&#8217;t", which should produce a right-leaning apostrophe.
> Python hates decimal entities beyond 128, so it chokes unless you do
> something like string.encode('utf-8'). Even then, what should have
> been a right-leaning apostrophe ends up as "â€™". The following script
> does just that. Look for the string "The Canuck iPhone: Apple
> doesnâ€™t care" after running it.
>
> # coding: UTF-8
> import feedparser
>
> s = ''
> d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
> title = d.feed.title
> link = d.feed.link
> for i in range(0, 4):
>     title = d.entries[i].title
>     link = d.entries[i].link
>     s += title + '\n' + link + '\n'
>
> f = open('c:/x/test.txt', 'w')
> f.write(s.encode('utf-8'))
> f.close()
>
> This useless script is adapted from a "useful" script. Its only
> purpose is to ask the Python community how I can deal with decimal
> entities > 128. Thanks in advance, Bill
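An aside on the "â€™" garbage first: that is what the three UTF-8 bytes for U+2019 (RIGHT SINGLE QUOTATION MARK) look like when something downstream reads them as cp1252. A quick sketch (written for modern Python 3, not code from the original post) demonstrates the round trip:

```python
# U+2019 (right single quotation mark) encodes to three bytes in UTF-8...
raw = '\u2019'.encode('utf-8')      # b'\xe2\x80\x99'

# ...and those same bytes, misread as cp1252, display as three characters.
mojibake = raw.decode('cp1252')
print(mojibake)                     # â€™
```

So the apostrophe is being encoded correctly; it is the viewer (Notepad, a terminal, etc.) that is decoding the UTF-8 output with the wrong charset.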
This is a two-fold issue: encodings/charsets and entities. An encoding is a way to _encode_ a charset as a sequence of octets. Entities are a way to avoid a (harder) encoding/decoding step at the expense of readability: when you type &#8217; nobody actually sees the intended character, but the entity itself is easily encoded in ASCII.

When dealing with multiple sources of information, as your script may be, I always include a normalization middleware that converts everything to Python's Unicode type, since web sites may use whatever encoding they please. The whole process goes like this:

1. Fetch the content.
2. Use whatever clues are in the content to guess the encoding used by
   the document, e.g. the Content-Type HTTP header,
   <meta http-equiv="content-type" ...>,
   <?xml version="1.0" encoding="utf-8"?>, and so on.
3. If none are present, use chardet to guess an acceptable decoder.
4. Decode, ignoring those characters that cannot be decoded.
5. Process the result further to find entities and "decode" them into
   actual Unicode characters. (See below.)

You may find these helpful:

http://effbot.org/zone/unicode-objects.htm
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
http://www.amk.ca/python/howto/unicode

This is a function I have used to process entities:

[code]
from htmlentitydefs import name2codepoint

def __processhtmlentities__(text):
    assert type(text) is unicode, "Non-normalized text"
    html = []
    (buffer, amp, text) = text.partition('&')
    while amp:
        html.append(buffer)
        (entity, semicolon, text) = text.partition(';')
        if entity[0] != '#':
            # Named entity, e.g. &rsquo;
            if entity in name2codepoint:
                html.append(unichr(name2codepoint[entity]))
        else:
            # Numeric entity, e.g. &#8217;
            html.append(unichr(int(entity[1:])))
        (buffer, amp, text) = text.partition('&')
    html.append(buffer)
    return u''.join(html)
[/code]

Best regards, Manuel.
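For what it's worth, here is a regex-based sketch of the same entity processing for modern Python 3, where htmlentitydefs became html.entities and unichr became chr (and where the stdlib's html.unescape() already does all of this for you). The function name and hex-entity handling are my own additions, not from the function above:

[code]
import re
from html.entities import name2codepoint

def process_html_entities(text):
    # Replace &name;, &#123; and &#x1F; entities with their characters.
    def repl(match):
        entity = match.group(1)
        if entity.startswith('#x') or entity.startswith('#X'):
            return chr(int(entity[2:], 16))   # hexadecimal numeric entity
        if entity.startswith('#'):
            return chr(int(entity[1:]))       # decimal numeric entity
        if entity in name2codepoint:
            return chr(name2codepoint[entity])  # named entity
        return match.group(0)                 # unknown: leave untouched

    return re.sub(r'&([#\w]+);', repl, text)

print(process_html_entities('doesn&#8217;t'))  # doesn’t
[/code]

Unlike the partition-based version, this one leaves unrecognized entities (and bare '&' characters that never find a ';') in place instead of silently dropping them.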
--
http://mail.python.org/mailman/listinfo/python-list