Bugs item #1599325, was opened at 2006-11-19 14:40
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr() whenever the 
codepoint is <= 0xff.  The threshold should instead be <= 0x7f.
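
The proposed threshold change can be sketched as follows.  This is only an 
illustration under assumptions: it is written for Python 3, where the module 
is named html.entities rather than htmlentitydefs, and build_entitydefs and 
its max_literal parameter are hypothetical names, not part of the library.  
The '&#NNN;' fallback for codepoints above the threshold is also an 
assumption about how such entries would be represented.

```python
# Python 3 spelling; the report concerns the Python 2 module htmlentitydefs.
from html.entities import name2codepoint


def build_entitydefs(max_literal=0x7f):
    """Hypothetical helper: map entity names to replacement text.

    Codepoints up to max_literal become a literal one-character string;
    everything above it becomes a numeric character reference, so no
    particular 8-bit encoding is implied.
    """
    defs = {}
    for name, codepoint in name2codepoint.items():
        if codepoint <= max_literal:
            defs[name] = chr(codepoint)
        else:
            defs[name] = '&#%d;' % codepoint
    return defs


# Threshold as proposed in the report (ASCII only).
ascii_only = build_entitydefs(0x7f)
# Threshold as the report describes the current behaviour.
latin1_style = build_entitydefs(0xff)

print(repr(ascii_only['amp']))
print(repr(ascii_only['nbsp']))
print(repr(latin1_style['nbsp']))
```

With the lower threshold, 'amp' still maps to a literal character, while 
'nbsp' maps to a numeric reference that the caller can decode as it sees fit.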

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'.  But this 
is only "true" in the Latin-1 encoding.  For example, in UTF-8, the same 
character (u'\xa0') would be encoded as '\xc2\xa0'.  While I think it is 
reasonable for entitydefs to use the ASCII codec for characters encodable in 
that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 
encoding.
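
The encoding difference described above can be checked directly.  The 
snippet below is a small sketch in Python 3 syntax (an assumption; the report 
itself uses Python 2 byte strings), and the variable names are illustrative 
only.

```python
# U+00A0 NO-BREAK SPACE, the character behind the 'nbsp' entity.
nbsp = '\u00a0'

# The same character yields different bytes under different codecs.
latin1_bytes = nbsp.encode('latin-1')
utf8_bytes = nbsp.encode('utf-8')

# ASCII cannot represent the character at all.
try:
    nbsp.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The lone Latin-1 byte is not valid UTF-8, so inserting it into
# UTF-8 encoded data produces a string that no longer decodes.
try:
    latin1_bytes.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(latin1_bytes, utf8_bytes, ascii_ok, decodes_as_utf8)
```

A single byte '\xa0' is therefore only meaningful if the consumer already 
knows the data is Latin-1.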

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls 
handle_data.  The passed data can be '\xa0', which handle_data is forced to 
assume is Latin-1, when the source string might be encoded otherwise.

----------------------------------------------------------------------

