Serhiy Storchaka added the comment:

> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them, but I have heard that some of them are still widely used. This issue was prompted by issue13612. See also the related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs
> can be used with your patch?

All codecs that can be supported by expat:

"""
1. Every ASCII character that can appear in a well-formed XML document,
   other than the characters $@\^`{}~ must be represented by a single byte,
   and that byte must be the same byte that represents that character in ASCII.
2. No character may require more than 4 bytes to encode.
3. All characters encoded must have Unicode scalar values <= 0xFFFF,
   (i.e., characters that would be encoded by surrogates in UTF-16 are not
   allowed). Note that this restriction doesn't apply to the built-in support
   for UTF-8 and UTF-16.
4. No Unicode character may be encoded by more than one distinct sequence
   of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch takes this approach: it hardcodes a list of supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding that satisfies the expat criteria and builds all the needed data on first access (tens of kilobytes). After the heavy start-up it works much faster than the previous patch.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18059>
_______________________________________
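
For illustration only, the first expat criterion quoted above can be tested from pure Python. This is a rough sketch and not the code from either patch. The helper name is_ascii_transparent is hypothetical, and the set of "ASCII characters that can appear in a well-formed XML document" is simplified here to tab, LF, CR and the printable range 0x20-0x7E. Passing this check is necessary but not sufficient: criteria 2-4 are not examined.

```python
# Characters that criterion 1 exempts from the "same byte as ASCII" rule.
EXEMPT = set("$@\\^`{}~")

# Simplifying assumption: tab, LF, CR and printable ASCII (0x20-0x7E).
XML_ASCII = [chr(c) for c in (0x09, 0x0A, 0x0D, *range(0x20, 0x7F))]


def is_ascii_transparent(encoding):
    """Return True if every non-exempt ASCII character encodes to its ASCII byte.

    Hypothetical helper: it covers criterion 1 only.
    """
    for ch in XML_ASCII:
        if ch in EXEMPT:
            continue
        try:
            encoded = ch.encode(encoding)
        except (LookupError, UnicodeError):
            # Unknown codec, non-text codec, or unencodable character.
            return False
        if encoded != ch.encode("ascii"):
            return False
    return True


if __name__ == "__main__":
    for name in ("gbk", "big5", "euc-jp", "utf-16", "cp037"):
        print(name, is_ascii_transparent(name))
```

An encoding such as utf-16 or the EBCDIC codec cp037 fails this check on the first character tried, because it does not map ASCII characters to their ASCII bytes.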