On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody <rustompm...@gmail.com> wrote:
> My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> for their early adoption. You can blame the unicode consortium, you can
> blame the babel of human languages, particularly that some use characters
> and some only (the equivalent of) what we call words.
>
> Or you can skip the blame-game and simply note the fact that large segments of
> extant code-bases are currently in bug-prone or plain buggy state.

For most of the 1990s, I was writing code in REXX, on OS/2. An even
earlier adopter, REXX didn't have Unicode support _at all_, but instead
had facilities for working with DBCS strings. You can't get everything
right AND be the first to produce anything. Python didn't make Unicode
strings the default until 3.0, but that's not Unicode's fault.

> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
>
> Here is Roy's Smith post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
>
> Some parts are here some earlier and from my memory.
> If details wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
>   So whole system broke due to those 4 in 200,000,000 records
>
> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':

Hang on hang on. Why are you blaming Python or SMP characters for
this? The problem here is MySQL, which doesn't adequately cope with
the full Unicode range. (Or, didn't then, or doesn't with its default
settings. I believe you can configure current versions of MySQL to
work correctly, though I haven't actually checked. PostgreSQL gets it
right, that's good enough for me.)

> SMP-chars can break systems.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program
that breaks when it gets a NUL character (not unknown in C code); is
the fault with the Unicode consortium for allocating something at
codepoint 0, or the code that can't cope with a perfectly normal
character?

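To make the "different code-paths" point concrete, here is a minimal Python 3 sketch that picks out the astral (non-BMP) characters in a string and checks whether the string would survive a store that accepts at most three UTF-8 bytes per character. It is purely illustrative and not taken from the system described above: the three-byte limit is an assumption standing in for that kind of back-end restriction, and the names `astral_chars` and `fits_in_three_byte_utf8` are made up for this example.

```python
def astral_chars(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_three_byte_utf8(text):
    """True if every character needs at most 3 bytes in UTF-8.

    The 3-byte ceiling is an assumed storage limit, used only for illustration.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


sample = "caf\u00e9 \U0001F4A9"          # one BMP accent, one astral character
print(len(sample))                        # code points, not bytes or UTF-16 units
print(astral_chars(sample))
print(fits_in_three_byte_utf8(sample))    # False: the astral character needs 4 bytes
```

Running something like this over test data is a cheap way to make sure the non-BMP paths get exercised before production data does it for you.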
> You could also choose do with "astral crap" (Roy's words) what we all do with
> crap -- throw it out as early as possible.

There's only one character that fits that description, and that's
1F4A9. Everything else is just "astral characters", and you shouldn't
have any difficulties with them.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list