Re: Newbie question about text encoding

Chris Angelico Thu, 05 Mar 2015 21:22:25 -0800

On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody <rustompm...@gmail.com> wrote:
> My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> for their early adoption.  You can blame the unicode consortium, you can
> blame the babel of human languages, particularly that some use characters
> and some only (the equivalent of) what we call words.
>
> Or you can skip the blame-game and simply note the fact that large segments of
> extant code-bases are currently in bug-prone or plain buggy state.


For most of the 1990s, I was writing code in REXX, on OS/2. An even
earlier adopter, REXX didn't have Unicode support _at all_, but
instead had facilities for working with DBCS strings. You can't get
everything right AND be the first to produce anything. Python didn't
make Unicode strings the default until 3.0, but that's not Unicode's
fault.

> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
>
> Here is Roy Smith's post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
>
> Some parts are here, some earlier, and some from my memory.
> If any details are wrong, please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
>   So whole system broke due to those 4 in 200,000,000 records
>
> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':

Hang on, hang on. Why are you blaming Python or SMP characters for
this? The problem here is MySQL, which doesn't adequately cope with
the full Unicode range. (Or, didn't then, or doesn't with its default
settings. I believe you can configure current versions of MySQL to
work correctly, though I haven't actually checked. PostgreSQL gets it
right, that's good enough for me.)
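
To make the failure mode concrete, here is a minimal Python 3 sketch.
It assumes the database column used a character set limited to three
UTF-8 bytes per character (an assumption on my part; the thread
doesn't say how the tables were configured). Characters outside the
BMP need four bytes in UTF-8, so they are exactly the ones such a
column can't store. The function names are made up for illustration.

```python
def astral_chars(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf8_width(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical sample records: ASCII, a BMP character, an SMP character.
    records = ["plain ascii", "price: \u20ac5", "oops \U0001F4A9"]
    for record in records:
        bad = astral_chars(record)
        if bad:
            # These are the records a 3-byte-limited column would choke on.
            print(repr(record), "contains", [hex(ord(ch)) for ch in bad])
```

Running a check like this before the insert would have found the 4
records out of 200,000,000 without touching the rest.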

> SMP-chars can break systems.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program
that breaks when it gets a NUL character (not unknown in C code); is
the fault with the Unicode consortium for allocating something at
codepoint 0, or the code that can't cope with a perfectly normal
character?

> You could also choose to do with "astral crap" (Roy's words) what we all do with
> crap -- throw it out as early as possible.

There's only one character that fits that description, and that's
U+1F4A9. Everything else is just "astral characters", and you shouldn't
have any difficulties with them.
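
For anyone who wants to poke at that character, here is a small
sketch assuming Python 3.3 or later, where a string is a sequence of
code points regardless of plane. The `describe` helper is
hypothetical, written just for this example.

```python
import unicodedata


def describe(ch):
    """Summarise how one character looks in a few common representations."""
    return {
        "name": unicodedata.name(ch),
        "codepoint": "U+%04X" % ord(ch),
        "length": len(ch),  # code points, as Python 3.3+ counts them
        "utf8": ch.encode("utf-8"),
        # UTF-16 needs a surrogate pair (two 16-bit units) for astral characters.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    for key, value in describe("\U0001F4A9").items():
        print(key, value)
```

It is one character to Python, four bytes in UTF-8, and two code
units in UTF-16. Code that assumes one UTF-16 unit per character is
where the trouble starts.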

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
