On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjre...@udel.edu> wrote:
> On 7/24/2013 11:00 AM, Michael Torrie wrote:
>>
>> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>>>
>>> Frankly, Python's strings are a *terrible* internal representation
>>> for an editor widget - not because of PEP 393, but simply because
>>> they are immutable, and every keypress would result in a rebuilding
>>> of the string. On the flip side, I could quite plausibly imagine
>>> using a list of strings;
>
>
> I used exactly this, a list of strings, for a Python-coded text-only mock
> editor to replace the tk Text widget in idle tests. It works fine for the
> purpose. For small test texts, the inefficiency of immutable strings is not
> relevant.
>
> Tk apparently uses a C-coded btree rather than a Python list. All details
> are hidden, unless one finds and reads the source ;-), but it uses C
> arrays rather than Python strings.
>
>
>>> In this usage, the FSR is beneficial, as it's possible to have
>>> different strings at different widths.
>
>
> For my purpose, the mock Text works the same in 2.7 and 3.3+.
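
For anyone following along, here is a minimal sketch of the list-of-strings idea discussed above. It is not Terry's mock Text and not how Tk does it; the class name LineBuffer and its methods are hypothetical, made up purely for illustration. The point is only that an edit rebuilds the one affected line, not the whole document.

```python
class LineBuffer:
    """Hypothetical text buffer that stores one Python str per line."""

    def __init__(self, text=""):
        # "".split("\n") gives [""], so there is always at least one line.
        self.lines = text.split("\n")

    def insert(self, row, col, s):
        """Insert s (which may contain newlines) at position (row, col)."""
        line = self.lines[row]
        pieces = (line[:col] + s + line[col:]).split("\n")
        # Only the affected line is rebuilt; all other strings are untouched.
        self.lines[row:row + 1] = pieces

    def delete(self, row, col, count=1):
        """Delete count characters within a single line, starting at (row, col)."""
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]

    def text(self):
        """Return the whole buffer as one string."""
        return "\n".join(self.lines)


buf = LineBuffer("hello\nworld")
buf.insert(0, 5, ", there")   # edit inside the first line
buf.insert(1, 0, "big\n")     # inserting a newline splits a line in two
buf.delete(0, 0)              # drop the first character of the first line
print(buf.text())
```

Each line is still an immutable str, so a keypress copies that line, but the cost is bounded by the line length rather than the document length.
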

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.

>> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
>> pros and cons,
>
> They both have the pro that indexing is direct *and correct*. The cons are
> different.

They're close enough, though. It's simply a performance tradeoff - use
the memory all the time, or take a bit of overhead to give yourself
the option of using less memory. The difference is negligible compared
to...

>> and the cons of using UCS-2 (the old narrow builds) are
>> well known. UCS-2 simply cannot represent all of unicode correctly.
>
> Python's narrow builds, at least for several releases, were in between UCS-2
> and UTF-16 in that they used surrogates to represent all unicodes but did
> not correct indexing for the presence of astral chars. This is a nuisance
> for those who do use astral chars, such as emotes and CJK name chars, on an
> everyday basis.

... this. If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (e.g. ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so. To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained. But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.

ChrisA
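
To make the surrogate problem concrete, here is a rough sketch that runs on Python 3.3+. It can't run on a real narrow build, so it simulates what a UTF-16-based string would expose by encoding to UTF-16 and counting 16-bit code units. The sample string and variable names are mine, and the sys.getsizeof comparison assumes CPython; exact sizes vary between versions.

```python
import sys

s = "a\U0001F600b"  # 'a', U+1F600 (an astral code point), 'b'

# Python 3.3+ (PEP 393): one index per code point.
print(len(s))          # 3
print(hex(ord(s[1])))  # 0x1f600

# Simulated UTF-16 view: the astral character becomes two surrogate
# code units, so lengths and indexes no longer line up with characters.
data = s.encode("utf-16-le")
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]
print(len(units))               # 4
print([hex(u) for u in units])  # ['0x61', '0xd83d', '0xde00', '0x62']

# The flexible representation picks a width per string (CPython only):
# an all-ASCII string is stored more compactly than one containing an
# astral character, even at the same length.
ascii_size = sys.getsizeof("a" * 100)
astral_size = sys.getsizeof("a" * 99 + "\U0001F600")
print(ascii_size < astral_size)
```

In the simulated view, index 1 lands on a lone high surrogate rather than on a character, which is exactly the indexing nuisance described above.
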