Serhiy Storchaka added the comment:
Yes, I have come to the same conclusion as random832. String objects have a
"fast path" for concatenation, and in this path the cached UTF-8 representation
is not cleared.
Pickle is one of the simplest ways to reproduce this issue. Maybe it can be
reproduced with compile() or ty
random832 added the comment:
I can't reproduce without pickle. I did some further digging, though, and it
*looks like*...
1. Pickle causes the built-in UTF-8 representation of a string to be populated,
whereas encode('utf-8') does not. Can anyone think of any other operations that
do this?
2.
Eryk Sun added the comment:
unicode_modifiable in Objects/unicodeobject.c should return 0 if there's cached
PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in place but
will instead concatenate into a new string.
--
nosy: +eryksun
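
To illustrate the suggestion above, here is a toy model in Python. It is not CPython code: the class CachedStr and its methods modifiable() and append() are hypothetical names invented for this sketch. It only shows why an in-place append is unsafe while a cached encoding exists, and how refusing the fast path avoids a stale cache.

```python
class CachedStr:
    """Toy stand-in for a string object with a lazily cached UTF-8 form.

    Hypothetical illustration only; not how CPython is implemented.
    """

    def __init__(self, text=""):
        self.text = text
        self._utf8 = None  # cache, filled on first request

    def as_utf8(self):
        # Compute the encoded form once and reuse it afterwards.
        if self._utf8 is None:
            self._utf8 = self.text.encode("utf-8")
        return self._utf8

    def modifiable(self):
        # Refuse in-place edits while a cached encoding exists.
        return self._utf8 is None

    def append(self, other):
        if self.modifiable():
            self.text += other  # fast path: mutate this object
            return self
        # Slow path: build a fresh object, which starts with no cache.
        return CachedStr(self.text + other)


s = CachedStr("\xe0")
s.as_utf8()            # populates the cache
t = s.append("\xe0")   # cache present, so a new object is returned
```

In this model, skipping the modifiable() check would leave s with two characters of text but a cached encoding of only one, which matches the symptom described in this thread.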
random832 added the comment:
I've looked at the raw bytes [through a ctypes pointer to id(s)] of a string
affected by the issue, and decoded enough to be able to tell that the bad
string has an incorrect UTF-8 length and data, which pickle presumably relies
on.
HEADlength..hash...
Serhiy Storchaka added the comment:
Here is a reproducer without IDLE. It looks like pickle is the culprit.
>>> import pickle
>>> s = ''
>>> for i in range(5):
...     s += chr(0xe0)
...     print(len(s), s, s.encode(), repr(s))
...     print(' ', pickle.dumps(s))
...
1 à b'\xc3\xa0' 'à'
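
The same check can be written as a standalone script. This is a minimal sketch, and the function name check_roundtrip is made up for the example. It grows a string with += and verifies that each intermediate value survives a pickle round trip. On an unaffected interpreter every step should compare equal; on an affected build a mismatch (or possibly an error while unpickling) would be expected instead.

```python
import pickle


def check_roundtrip(n=5, ch=chr(0xe0)):
    """Grow a string with += and pickle-round-trip every intermediate value.

    Returns a list of (length, round_trip_ok) pairs.
    """
    results = []
    s = ''
    for _ in range(n):
        s += ch  # concatenation that may take the in-place fast path
        restored = pickle.loads(pickle.dumps(s))
        results.append((len(s), restored == s))
    return results


if __name__ == "__main__":
    for length, ok in check_roundtrip():
        print(length, "ok" if ok else "MISMATCH")
```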
Serhiy Storchaka added the comment:
Confirmed on IDLE.
>>> s = ''
>>> for i in range(5):
    s += '\xe0'
    print(s)
à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
    s += chr(0xe0)
    print(s)
à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
    s += '
Steven D'Aprano added the comment:
I'm afraid I'm unable to replicate this bug report in Python 3.4.
If you are able to replicate it, can you tell us the exact version number of
Python you are using? Also, which operating system are you using?
--
nosy: +steven.daprano
New submission from Árpád Kósa:
One of my students found this bug. For ASCII characters it works as you expect,
but for the Greek alphabet it works unexpectedly.
The program works correctly for Python 3.2.x, but for 3.4.x and 3.5 it gives
an erroneous result.
--
files: greekbug.py
messages: 25