On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <dfnsonfsdu...@gmx.de> wrote:
> On 19.12.2012 15:23, wxjmfa...@gmail.com wrote:
>> I was using the German word "Straße" (Strasse) - German
>> translation from "street" - to illustrate the catastrophic and
>> completely wrong-by-design Unicode handling in Py3.3, this
>> time from a memory point of view (not speed):
>>
>> >>> sys.getsizeof('Straße')
>> 43
>> >>> sys.getsizeof('STRAẞE')
>> 50
>>
>> instead of a sane (Py3.2)
>>
>> >>> sys.getsizeof('Straße')
>> 42
>> >>> sys.getsizeof('STRAẞE')
>> 42
>
> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.

You may not be familiar with jmf. He's one of our resident trolls, and
he has a bee in his bonnet about PEP 393 strings, on the basis that
they take up more space in memory than a narrow build of Python 3.2
would, for a string with lots of BMP characters and one non-BMP.

In 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
pairs* for non-BMP characters. This means that len() counts them
twice, as does string indexing/slicing. That's a major bug, especially
as your Python code will do different things on different platforms -
most Linux builds of 3.2 are "wide" builds, storing characters in four
bytes each.

PEP 393 brings wide build semantics to all Pythons, while achieving
memory savings better than a narrow build can (with PEP 393 strings,
any all-ASCII or all-Latin-1 strings will be stored one byte per
character). Every now and then, though, jmf points out *yet again*
that his beloved and buggy narrow build consumes less memory and runs
faster than the oh so terrible 3.3 on some contrived example. It gets
rather tiresome.

Interestingly, IDLE on my Windows box can't handle the bolded
characters very well...

>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>> print(s)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(s)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407' in position 0: Non-BMP character not supported in Tk

I think this is most likely a case of "yeah, Windows XP just sucks".
But I have no reason or inclination to get myself a newer Windows to
find out if it's any different.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list
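
For anyone who wants to check the PEP 393 behaviour described above, here is a minimal sketch. It assumes CPython 3.3 or later (other implementations may store strings differently), and the helper names per_char_bytes and utf16_units are made up for illustration. It differences two sys.getsizeof() results so the fixed object header cancels out, leaving only the per-character storage cost, and it compares len() with the number of UTF-16 code units, which is what a 3.2 narrow build would have counted.

```python
import sys


def per_char_bytes(ch, n=1000):
    """Estimate storage per character for a string made only of `ch`.

    Assumes CPython 3.3+ (PEP 393). Subtracting the sizes of two strings
    of the same kind cancels the fixed header, leaving n * width bytes.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


def utf16_units(s):
    """Number of UTF-16 code units in `s` (surrogate pairs count as two)."""
    return len(s.encode("utf-16-le")) // 2


samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 U+00DF", "\u00df"),
    ("BMP U+1E9E", "\u1e9e"),
    ("non-BMP U+1D407", "\U0001d407"),
]
for label, ch in samples:
    print(label, "->", per_char_bytes(ch), "byte(s) per character")

# Two bold mathematical letters, both outside the BMP.
bold = "\U0001d407\U0001d41e"
print("len():", len(bold))                      # code points
print("UTF-16 code units:", utf16_units(bold))  # surrogate pairs counted twice
```

The width of the whole string is set by its widest character, which is why a single non-BMP character makes an otherwise narrow string cost four bytes per character.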