Marc-Andre Lemburg <m...@egenix.com> added the comment:

Antoine Pitrou wrote:
>
> Antoine Pitrou <pit...@free.fr> added the comment:
>
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy
> when codec endianness doesn't match the native endianness (not to mention it
> could also crash if the underlying CPU arch doesn't support unaligned access
> to 4-byte integers):
>
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif

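To make the problem in the quoted loop concrete: on a little-endian host, the UTF-32BE bytes 00 00 00 41 read through a native 4-byte cast come out as 0x41000000, so a plain "A" is counted as needing a surrogate pair. Below is a minimal sketch of a scan that avoids both the endianness and the alignment issue by assembling each code unit byte by byte. The helper names read_utf32 and count_surrogate_pairs and the big_endian flag are made up for illustration and are not part of CPython.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CPython API): read one UTF-32 code unit byte
   by byte, so neither pointer alignment nor native endianness matters.
   big_endian != 0 selects UTF-32BE, 0 selects UTF-32LE. */
static uint32_t
read_utf32(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Hypothetical helper: count the code points that would need a
   surrogate pair on a UCS-2 build (anything at or above U+10000).
   Trailing bytes that do not form a full 4-byte unit are ignored. */
static size_t
count_surrogate_pairs(const unsigned char *s, size_t size, int big_endian)
{
    size_t i, pairs = 0;

    for (i = 0; i + 4 <= size; i += 4)
        if (read_utf32(s + i, big_endian) >= 0x10000)
            pairs++;
    return pairs;
}
```

This keeps the two-pass structure of the original code and only fixes how each code unit is read; it does not address the cost of the extra pass.
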
Good catch!

I wonder whether it wouldn't be better to preallocate a Unicode object
with a size of, e.g., size/4 + 16 and then resize the object as necessary
in case a surrogate pair needs to be created (which won't happen that
often in practice).

The extra scan for pairs can take a long time, depending on how much data
you have to decode, and likely doesn't go down well with CPU caches.

----------
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8941>
_______________________________________
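
The preallocate-and-resize idea from the comment above could look roughly like the following single-pass sketch. It is not the CPython implementation: a malloc'ed array of UTF-16 code units stands in for the Unicode object, realloc() stands in for resizing it, and the function name decode_utf32_to_utf16, the big_endian flag and the 1.5x growth factor are all assumptions made for illustration. Input validation (lone surrogates, values above U+10FFFF) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-alone decoder (not CPython code). Returns a
   malloc'ed buffer of UTF-16 code units and stores its length in
   *out_len, or returns NULL if memory runs out. */
static uint16_t *
decode_utf32_to_utf16(const unsigned char *s, size_t size,
                      int big_endian, size_t *out_len)
{
    size_t cap = size / 4 + 16;   /* initial guess: size/4 plus slack */
    size_t len = 0, i;
    uint16_t *buf = malloc(cap * sizeof *buf);

    if (buf == NULL)
        return NULL;
    for (i = 0; i + 4 <= size; i += 4) {
        const unsigned char *p = s + i;
        /* Assemble the code unit byte by byte in the codec's order. */
        uint32_t ch = big_endian
            ? ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]
            : ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
              ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];

        /* Grow only once the slack is used up, i.e. when there is no
           longer room for a possible surrogate pair. */
        if (len + 2 > cap) {
            size_t new_cap = cap + cap / 2;
            uint16_t *tmp = realloc(buf, new_cap * sizeof *buf);

            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        if (ch >= 0x10000) {
            /* Split a non-BMP code point into a surrogate pair. */
            ch -= 0x10000;
            buf[len++] = (uint16_t)(0xD800 | (ch >> 10));
            buf[len++] = (uint16_t)(0xDC00 | (ch & 0x3FF));
        } else {
            buf[len++] = (uint16_t)ch;
        }
    }
    *out_len = len;
    return buf;
}
```

With mostly BMP input the buffer is never reallocated, so the common case costs one allocation and one pass over the data; the caller is responsible for calling free() on the result.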