[issue18059] Add multibyte encoding support to pyexpat

Serhiy Storchaka Fri, 22 Nov 2013 14:54:31 -0800

Serhiy Storchaka added the comment:

> I'm not sure that multibyte encodings other than UTF-8 are used in the world.


I don't use any of them, but I have heard that some are still widely used.

This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs 
> can be used with your patch?

All codecs which can be supported by expat.

"""
   1. Every ASCII character that can appear in a well-formed XML document,
      other than the characters

      $@\^`{}~

      must be represented by a single byte, and that byte must be the
      same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
      0xFFFF, (i.e., characters that would be encoded by surrogates in
      UTF-16 are  not allowed).  Note that this restriction doesn't
      apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
      sequence of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, 
cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, 
shift-jis-2004, shift-jisx0213.
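
As a rough illustration of the first criterion quoted above, here is a minimal
Python sketch that checks whether a codec encodes the required ASCII characters
as their own single bytes. The helper name ascii_is_transparent is hypothetical
and is not taken from the patch; the set of ASCII characters treated as valid
in XML (tab, LF, CR and 0x20..0x7E) is also an assumption. It covers criterion
1 only, so passing it does not mean a codec is usable with expat.

```python
# Characters that expat exempts from the single-byte rule (criterion 1).
EXEMPT = set("$@\\^`{}~")

# Assumption: the ASCII characters that can appear in well-formed XML are
# tab, LF, CR and the printable range 0x20..0x7E.
XML_ASCII = [chr(c) for c in (0x09, 0x0A, 0x0D, *range(0x20, 0x7F))]


def ascii_is_transparent(codec_name):
    """Return True if the codec passes criterion 1 (hypothetical helper)."""
    for ch in XML_ASCII:
        if ch in EXEMPT:
            continue
        try:
            encoded = ch.encode(codec_name)
        except UnicodeEncodeError:
            return False
        # The character must map to exactly its own ASCII byte.
        if encoded != ch.encode("ascii"):
            return False
    return True


if __name__ == "__main__":
    for name in ("gbk", "utf-16", "cp037"):
        print(name, ascii_is_transparent(name))
```

The remaining criteria (maximum length of 4 bytes, no code points above 0xFFFF,
unique byte sequences) would need separate checks over the whole code space.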

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch implements this approach. It hardcodes a 
list of supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding which satisfies the 
expat criteria and builds all needed data at first access (tens of kilobytes). 
After the heavy start it works much faster than the previous patch.
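
To illustrate the build-at-first-access idea, here is a simplified Python
sketch. It is not the code from either patch. It builds a 256-entry first-byte
table in the style of the map in expat's XML_Encoding structure (a value >= 0
is the code point of a single-byte character, -n marks the first byte of an
n-byte sequence, -1 marks an invalid byte) and caches it per codec. The helper
names are hypothetical, and as a simplifying assumption only 2-byte sequences
are probed, so lead bytes of longer sequences would be reported as -1.

```python
from functools import lru_cache


def _starts_two_byte_sequence(lead, codec_name):
    """Return True if some trail byte completes a one-character sequence."""
    for trail in range(256):
        try:
            text = bytes([lead, trail]).decode(codec_name)
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return True
    return False


@lru_cache(maxsize=None)
def lead_byte_map(codec_name):
    """Build the first-byte table once per codec (hypothetical helper)."""
    table = []
    for b in range(256):
        try:
            text = bytes([b]).decode(codec_name)
        except UnicodeDecodeError:
            text = ""
        if len(text) == 1:
            table.append(ord(text))   # single-byte character
        elif _starts_two_byte_sequence(b, codec_name):
            table.append(-2)          # first byte of a 2-byte sequence
        else:
            table.append(-1)          # invalid, or longer than probed here
    return tuple(table)
```

The first call for a codec pays for the probing; later calls return the cached
tuple, which mirrors the "heavy start, then fast" trade-off described above.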

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18059>
_______________________________________