On Mon, Feb 1, 2016 at 9:34 AM, Fillmore <fillmore_rem...@hotmail.com> wrote:
> On 01/30/2016 05:26 AM, wxjmfa...@gmail.com wrote:
>
>>> Python 2 vs python 3 is anything but "solved".
>>
>> Python 3.5.1 is still suffering from the same buggy
>> behaviour as in Python 3.0 .
>
> Can you elaborate?

This is jmf. His posts are suppressed from the mailing list, because the only thing he ever says is that Python 3's "Unicode by default" behaviour is fundamentally and mathematically wrong, on the basis of microbenchmarks showing a performance regression compared to his beloved - and buggy - narrow build of Python 2.7. (I'm not certain, but I think the regression might even have been fixed now. Or maybe he has other regressions to moan about.)

Here's a facts-only summary of Unicode handling in several different CPython [0] builds.

* Python 2.7 comes in two flavours, selected at compile time. A "Wide" build is what most Linux distributions ship, and it uses 32-bit Unicode characters. In other words, the string b"abc" takes up three bytes, but the string u"abc" takes up twelve. [1] These builds are perfectly consistent; a Unicode character *always* takes exactly 4 bytes, and indexing and subscripting are perfectly correct.

* A "Narrow" build of Python 2.7 (the default on Windows) uses 16-bit code units. The string b"abc" still takes up three bytes, but u"abc" takes only six - however, a string of three astral characters would take up twelve bytes, because each one is stored as a surrogate pair. These builds are thus inconsistent (len() and indexing count code units, not characters), but potentially more efficient - a thousand BMP characters followed by a single SMP character would take up only 2004 bytes, rather than 4004 as a wide build would use.

* Starting with Python 3.0, a default quoted string is a Unicode string. That doesn't change anything about these considerations, but it does mean that "abc" suddenly takes up a lot more room than it used to (because it's now equivalent to u"abc" rather than b"abc").

* Python 3.3 introduced a new "Flexible String Representation", which you can read about in detail in PEP 393. Strings are now stored as compactly as possible; u"Hello!" (all ASCII) takes up six bytes, u"¡Hola!" (Latin-1) also takes up six bytes, u"Привет" (Basic Multilingual Plane) takes up twelve, and u"Hi! 😀😁" (or u"Hi! \U0001f600\U0001f601" if your mailer doesn't have those characters) takes up twenty-four. Each string has a length of 6, as given by len(x), but takes up differing amounts of space according to actual needs.

The issue jmf has is with the way the FSR has to "widen" a string. If you take a megabyte of all-ASCII text (stored one byte per character) and append one astral character to it, the resulting string has to be stored four bytes per character, even for the ASCII ones. This is to make sure that indexing and slicing work correctly and efficiently, but it does come at a cost - it takes time to copy all those characters into the new wider string. On microbenchmarks doing exactly this, it's clear that Python 3 is paying a price. But has it truly suffered?

rosuav@sikorsky:~$ python -m timeit -s "s=u'a'*1048576" "len(s+u'\U0001f600')"
10000 loops, best of 3: 197 usec per loop
rosuav@sikorsky:~$ python3 -m timeit -s "s=u'a'*1048576" "len(s+u'\U0001f600')"
10000 loops, best of 3: 148 usec per loop
rosuav@sikorsky:~$ python -m timeit -s "s=u'a'*1048576" "len(s+u'b')"
10000 loops, best of 3: 187 usec per loop
rosuav@sikorsky:~$ python3 -m timeit -s "s=u'a'*1048576" "len(s+u'b')"
10000 loops, best of 3: 31.6 usec per loop
rosuav@sikorsky:~$ python -c 'import sys; print(sys.version)'
2.7.11 (default, Jan 11 2016, 21:04:40)
[GCC 5.3.1 20160101]
rosuav@sikorsky:~$ python3 -c 'import sys; print(sys.version)'
3.6.0a0 (default:5452e4b5c007, Feb 1 2016, 07:28:50)
[GCC 5.3.1 20160121]

So the widening append is several times slower than the all-ASCII append on Python 3 (148 usec against 31.6), yet Python 3 still beats this 2.7 build in both tests.

The other consideration is that, *on Windows only*, this operation takes more memory under 3.6 than under 2.7, because 2.7 will keep storing each 'a' in 16 bits and then just slap a two-code-unit smiley onto the end; but on the flip side, 3.6 has been storing that all-ASCII string in *8* bits per character.

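For anyone who wants to check the FSR's storage widths and the memory side of widening for themselves, here's a rough sketch. It assumes CPython 3.3 or later (other implementations may report something quite different), and bytes_per_char is a hypothetical helper name made up for this illustration, not anything in the standard library. It subtracts two sys.getsizeof() results so that the fixed per-object overhead cancels out and only the per-character cost is left.

```python
import sys


def bytes_per_char(sample, n=1000):
    """Estimate per-character storage for strings built from `sample`.

    Hypothetical helper for illustration only. Assumes CPython 3.3+,
    where PEP 393 stores every character of a string at the same width.
    Taking the difference of two sizes cancels the fixed object header.
    """
    small = sample * n
    large = sample * (2 * n)
    extra_bytes = sys.getsizeof(large) - sys.getsizeof(small)
    extra_chars = len(large) - len(small)
    return extra_bytes / extra_chars


# One representative character from each range mentioned above.
for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\xa1"),       # inverted exclamation mark
    ("BMP", "\u041f"),         # Cyrillic capital Pe
    ("astral", "\U0001f600"),  # grinning face
]:
    print(label, bytes_per_char(ch))

# Widening: one astral character forces the whole string to 4 bytes/char.
ascii_text = "a" * 1048576
widened = ascii_text + "\U0001f600"
print(len(ascii_text), sys.getsizeof(ascii_text))
print(len(widened), sys.getsizeof(widened))
```

On a PEP 393 build the loop should report 1, 1, 2 and 4 bytes per character, and the widened string should come out at roughly four times the size of the all-ASCII one despite being only one character longer.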
Most of your programs will be full of ASCII strings - remember, all your variable names are string keys into some dictionary [2], and every time you call up a built-in function or standard library module, you'll be using an ASCII-only name to reference it. Halving their storage space makes a significant difference; and doubling the size of a very few strings in a very few programs is worth the correctness we gain by not having to worry about string index bugs.

So in summary: Take no notice of jmf; he's a crank.

ChrisA

[0] Other Python implementations may be very different, but it's CPython that most people are looking at.

[1] If you use sys.getsizeof() on these strings, you'll find that they actually take up a lot more space than I'm talking about. That's because there are overheads on string objects, which dominate tiny strings. But for large strings, where the performance difference actually matters, the storage space of the characters themselves dominates the overhead.

[2] Local names in functions might get compiled out and replaced with numeric slot indices. But module-level names, names of built-ins, attribute names, etc, are all stored in the code as actual strings.

--
https://mail.python.org/mailman/listinfo/python-list