Ezio Melotti <ezio.melo...@gmail.com> added the comment:

http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also attached file for a dict generated from that table). Currently html.entities only has 252 entities, organized in 3 dicts:
1) name -> intvalue (e.g. 'amp': 0x0026);
2) intvalue -> name (e.g. 0x0026: 'amp');
3) name -> char (e.g. 'amp': '&').

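
For reference, the three existing dicts can be inspected as in the short sketch below. It assumes the Python 3 module-level names name2codepoint, codepoint2name and entitydefs in html.entities; the variable names are only illustrative.

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

# 1) name -> intvalue
amp_codepoint = name2codepoint['amp']

# 2) intvalue -> name
amp_name = codepoint2name[0x0026]

# 3) name -> char
amp_char = entitydefs['amp']

print(hex(amp_codepoint), amp_name, amp_char)
```

Each of these dicts stores exactly one code point or one character per name, which is the limitation that matters for the HTML 5 table.
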
In HTML 5, some of the entities map to a sequence of 2 characters, for example &NotEqualTilde; (≂̸) corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY). This means that:
1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist dict should be used instead;
2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
3) the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.

Since 1) is not backward-compatible, the HTML 5 entities should probably go in a separate dict. Also note that the entities are case-sensitive and some of them include different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict won't work too well. Having '&' -> 'amp' seems better than '&' -> 'AMP', but this might not be obvious for all the entities and requires some extra logic in the code to get it right.

----------
Added file: http://bugs.python.org/file23803/entities_dict.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11113>
_______________________________________
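
A minimal sketch of what a separate name -> str dict and the "extra logic" for a reverse dict might look like. Everything here is hypothetical: html5_sample is a tiny hand-written sample rather than the full table, and preference_key is just one possible tie-break rule (all-lowercase names first, then shorter, then alphabetical), not something decided in this issue.

```python
# Hypothetical sample of a name -> str dict; not the full HTML 5 table.
html5_sample = {
    'amp': '&',
    'AMP': '&',
    'NotEqualTilde': '\u2242\u0338',  # two code points
    'nesim': '\u2242\u0338',          # alternative spelling, same value
}

# A name -> valuelist view can be derived from the strings when needed.
name2codepoints = {name: tuple(map(ord, text))
                   for name, text in html5_sample.items()}


def preference_key(name):
    # Assumed rule: prefer all-lowercase names, then shorter ones,
    # then alphabetical order as a final tie-break.
    return (not name.islower(), len(name), name)


def build_reverse(name2str):
    """Map each replacement string to a single preferred entity name."""
    reverse = {}
    for name, text in name2str.items():
        current = reverse.get(text)
        if current is None or preference_key(name) < preference_key(current):
            reverse[text] = name
    return reverse


str2name = build_reverse(html5_sample)
```

With this rule '&' resolves to 'amp' rather than 'AMP', and the two-character value resolves to 'nesim'. Using str keys in the reverse dict also avoids the tuple keys mentioned in 2), although whether that rule picks the "obvious" name for every entity would still need checking against the full table.
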