[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

Marc-Andre Lemburg Wed, 09 Jun 2010 05:17:06 -0700

Marc-Andre Lemburg <[email protected]> added the comment:

Antoine Pitrou wrote:
> 
> Antoine Pitrou <[email protected]> added the comment:
> 
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy 
> when codec endianness doesn't match the native endianness (not to mention it 
> could also crash if the underlying CPU arch doesn't support unaligned access 
> to 4-byte integers):
> 
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif


Good catch!
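
To make the two problems with the quoted scan concrete: casting s to
Py_UCS4 * reads each value in native byte order and assumes 4-byte
alignment. A minimal sketch of a scan that avoids both by assembling
each code unit from individual bytes is shown below. The helper names
read_utf32 and count_pairs and the byteorder convention (negative for
little-endian, otherwise big-endian) are assumptions made for
illustration; this is not the actual CPython code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CPython: assemble one UTF-32 code
   unit from four bytes. byteorder < 0 selects little-endian, anything
   else big-endian. No pointer cast is involved, so unaligned input is
   safe on every architecture. */
static uint32_t
read_utf32(const unsigned char *p, int byteorder)
{
    if (byteorder < 0)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Count the code points that would need a surrogate pair on a UCS-2
   build. Trailing bytes that do not form a full code unit are ignored. */
static size_t
count_pairs(const unsigned char *s, size_t size, int byteorder)
{
    size_t i, pairs = 0;

    for (i = 0; i + 4 <= size; i += 4)
        if (read_utf32(s + i, byteorder) >= 0x10000)
            pairs++;
    return pairs;
}
```

This keeps the pre-scan but makes it independent of the host's
endianness and alignment rules.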

I wonder whether it would be better to preallocate a
Unicode object of, e.g., size/4 + 16 code units and then
resize the object as necessary whenever a surrogate pair
needs to be created (which won't happen that often in
practice).
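
A rough sketch of that idea, under stated assumptions: a plain
malloc/realloc buffer of 16-bit units stands in for the Unicode object
of a UCS-2 build, the input is big-endian, and the function name
decode_utf32be and the 1.5x growth factor are made up for illustration.
Validation of surrogates and of values above 0x10FFFF is left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch, not CPython code: decode big-endian UTF-32 into
   UTF-16 code units in a single pass. Returns a malloc'ed buffer that
   the caller must free, or NULL if memory runs out. */
static uint16_t *
decode_utf32be(const unsigned char *s, size_t size, size_t *outlen)
{
    size_t cap = size / 4 + 16;   /* one unit per input value plus slack */
    size_t n = 0, i;
    uint16_t *out = malloc(cap * sizeof *out);

    if (out == NULL)
        return NULL;
    for (i = 0; i + 4 <= size; i += 4) {
        /* Byte-wise read: independent of host endianness and alignment. */
        uint32_t ch = ((uint32_t)s[i] << 24) | ((uint32_t)s[i + 1] << 16) |
                      ((uint32_t)s[i + 2] << 8) | (uint32_t)s[i + 3];

        if (n + 2 > cap) {        /* grow only once the slack is used up */
            uint16_t *tmp;

            cap += cap / 2;
            tmp = realloc(out, cap * sizeof *out);
            if (tmp == NULL) {
                free(out);
                return NULL;
            }
            out = tmp;
        }
        if (ch >= 0x10000) {      /* non-BMP: emit a surrogate pair */
            ch -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (ch >> 10));
            out[n++] = (uint16_t)(0xDC00 | (ch & 0x3FF));
        }
        else
            out[n++] = (uint16_t)ch;
    }
    *outlen = n;
    return out;
}
```

With this layout the input is read only once, and text with 16 or
fewer non-BMP characters never triggers a resize.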

The extra scan for pairs can take a long time, depending
on how much data you have to decode, and it is unlikely to
be kind to CPU caches, since the input is read twice.

----------
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> 
utf-32be codec failing on UCS-2 python build for 32-bit value

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue8941>
_______________________________________
