Ezio Melotti <ezio.melo...@gmail.com> added the comment:

http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 
entities (see also attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
  1) name -> intvalue (e.g. 'amp': 0x0026);
  2) intvalue -> name (e.g. 0x0026: 'amp');
  3) name -> char (e.g. 'amp': '&');
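
For reference, a minimal sketch of how the three existing dicts can be looked 
up. It assumes the Python 3 module layout, where html.entities exposes them as 
name2codepoint, codepoint2name and entitydefs:

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

# 1) name -> intvalue
amp_value = name2codepoint['amp']

# 2) intvalue -> name
amp_name = codepoint2name[0x0026]

# 3) name -> char
amp_char = entitydefs['amp']

print(hex(amp_value), amp_name, amp_char)
```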

In HTML 5, some of the entities map to a sequence of 2 characters, for example 
&NotEqualTilde; corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING 
LONG SOLIDUS OVERLAY).
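
The two code points can be checked with unicodedata. This is only a sketch: 
the list below is typed in by hand from the example above, not read from the 
HTML 5 table, and the variable names are made up for illustration:

```python
import unicodedata

# Hand-copied value of &NotEqualTilde; (two code points, not one).
not_equal_tilde = [0x2242, 0x0338]

# Official Unicode names of the two code points.
names = [unicodedata.name(chr(cp)) for cp in not_equal_tilde]

# The replacement text is a 2-character str, so a single int can't hold it.
text = ''.join(chr(cp) for cp in not_equal_tilde)

print(names, len(text))
```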

This means that:
  1) the current approach of having a dict with name -> intvalue doesn't work 
anymore, and a name -> valuelist should be used instead;
  2) the reverse dict for this would have to use tuples as keys, but I'm not 
sure how useful that would be (producing entities is not a common case, 
especially "unusual" ones like these);
  3) the name -> char dict might still be useful, and can easily become a name 
-> str dict in order to deal with the multichar entities.

Since 1) is not backward-compatible, the HTML 5 entities should probably go in a 
separate dict.
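
A rough sketch of what the separate dicts from points 1) to 3) could look 
like. The dict names are hypothetical (nothing with these names exists in 
html.entities) and the sample holds only two hand-picked entries:

```python
# Hypothetical name -> valuelist dict; a tiny sample, not the real table.
html5_name2codepoints = {
    'amp': [0x0026],
    'NotEqualTilde': [0x2242, 0x0338],
}

# 2) Reverse dict: lists are unhashable, so the keys have to be tuples.
html5_codepoints2name = {
    tuple(cps): name for name, cps in html5_name2codepoints.items()
}

# 3) name -> str dict: join the code points into a single string.
html5_entitydefs = {
    name: ''.join(chr(cp) for cp in cps)
    for name, cps in html5_name2codepoints.items()
}

print(html5_codepoints2name[(0x2242, 0x0338)])
print(ascii(html5_entitydefs['NotEqualTilde']))
```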

Also note that entity names are case-sensitive and some of them come in 
several spellings (e.g. both 'amp' and 'AMP' map to '&'), so a reverse dict 
can't be built unambiguously.  Having '&' -> 'amp' seems better than 
'&' -> 'AMP', but the right choice might not be obvious for all the entities 
and requires some extra logic in the code to get it right.
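
One possible shape for that extra logic, as a sketch only. The heuristic 
(prefer all-lowercase names, then shorter ones, then alphabetical order) is an 
assumption of mine, and name2str and preferred() are made-up names operating 
on a hand-picked sample:

```python
# Hypothetical sample of a name -> str dict with a duplicated value.
name2str = {
    'AMP': '&',
    'amp': '&',
    'NotEqualTilde': '\u2242\u0338',
}

def preferred(candidates):
    """Pick one name among several that map to the same string.

    Assumed heuristic: all-lowercase first, then shortest, then alphabetical.
    """
    return min(candidates, key=lambda n: (not n.islower(), len(n), n))

# Group the names by the string they produce, then keep one name per string.
names_by_str = {}
for name, value in name2str.items():
    names_by_str.setdefault(value, []).append(name)

str2name = {value: preferred(names) for value, names in names_by_str.items()}

print(str2name['&'])
```

A heuristic like this picks 'amp' over 'AMP', but it may still pick a 
questionable name for other entities, so the result would need reviewing.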

----------
Added file: http://bugs.python.org/file23803/entities_dict.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11113>
_______________________________________