> From: Ludovic Courtès <l...@gnu.org>
>
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >>
> >> At first glance:
> >>
> >>   - “Straße” is normally stored as a Latin1 string, so it would need to
> >>     be converted to UTF-* before it can be passed to one of the
> >>     unicase.h functions.  *Or*, we could check with bug-libunistring
> >>     what it would take to add Latin1 string case mapping functions.
> >>
> >>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
> >>     one-to-one case mapping.  All other Latin1 strings can be handled by
> >>     iterating over characters, as is currently done.
> >
> > There is the micro sign, which, when case folded, becomes a Greek mu.
> > It is still a single character, but it is the only Latin-1 character
> > that, when folded, becomes a non-Latin-1 character.
>
> Blech.
>
> It would have worked better with narrow == ASCII instead of
> narrow == Latin1.  It’s a change we can still make, I think.
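[Editor's illustration, not part of the original message: the two Latin-1 corner cases quoted above can be demonstrated concretely.  Guile's implementation is C, but Python's case mappings come from the same Unicode character database that libunistring draws on, so a short sketch suffices.]

```python
# ‘ß’ (U+00DF) has no one-to-one uppercase mapping: it expands to "SS",
# so "Straße" grows by one character when upcased.
assert "ß".upper() == "SS"
assert "Straße".upper() == "STRASSE"
assert len("Straße") == 6 and len("STRASSE") == 7

# The micro sign (U+00B5) stays a single character under case folding,
# but its folded form is Greek small mu (U+03BC), outside Latin-1.
micro = "\u00b5"
assert micro.casefold() == "\u03bc"
assert ord(micro.casefold()) > 0xFF  # not representable in Latin-1
```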
It would be easy enough to do.  If someone were to fight for a narrow
encoding of Latin-1, I would expect it to be you, since you're the only
committer whose name requires ISO-8859-1.  So if you're okay with it,
who am I to complain?

> >>   - Case insensitive comparison is more difficult, as you already
> >>     pointed out.  To do it right we’d probably need to convert Latin1
> >>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
> >>     do the conversion every time, though: we could just change Latin1
> >>     strings in-place so they now point to a wide stringbuf upon the
> >>     first ‘string-ci=’.
> >>
> >> Thoughts?
>
> [...]
>
> Indeed it’s quite inelegant. ;-)
>
> How about changing to narrow == ASCII and then string comparison would
> be:
>
>   if (narrow (s1) != narrow (s2))
>     {

It would be easier and cleaner, as you demonstrate.  I guess the
question is about future-proofing.  If the complications of the
Latin-1 / UTF-32 dual encoding are confined to upcase/downcase and the
string-ci comparison ops, then it doesn't seem worth it to change.  But
if the dual encoding is going to cause endless problems down the road,
ASCII/UTF-32 is simpler.

A lot of this debate is about expectations, I think.  For my part, I
think the string-ci ops only have real value for English-language,
ASCII text.  For non-English, non-ASCII processing, sorting
case-insensitively by numeric codepoint values, in the absence of
locale sorting rules, seems like an odd thing to want to do.

So I guess I'm not bothered by the ugly C necessary to make ISO-8859-1
work.  It is bad for the string-ci ops but not too bad for
upcase/downcase.

I also am not too concerned that the string-ci comparison ops may be
inefficient for non-English, non-ASCII text.  It does seem vital that
the string-locale comparison ops be efficient, though.

Thanks,
Mike
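[Editor's illustration, not part of the original message: a rough Python analogue of the string-ci strategy discussed above.  Decoding Latin-1 bytes to a full Unicode string stands in for widening to UTF-32, and comparing casefolded strings stands in for u32_casecmp; the real u32_casecmp can additionally apply a locale language and a normalization form, which this sketch omits.  The function name is hypothetical.]

```python
def ci_equal(s1: bytes, s2: bytes) -> bool:
    """Hypothetical sketch: widen Latin-1 bytes to full Unicode strings
    (the analogue of converting to UTF-32), then compare their casefolded
    forms (a rough analogue of u32_casecmp, minus normalization)."""
    return s1.decode("latin-1").casefold() == s2.decode("latin-1").casefold()

# Character-by-character folding within Latin-1 cannot get this right,
# because ß casefolds to the two-character sequence "ss":
assert ci_equal("Straße".encode("latin-1"), b"STRASSE")
assert not ci_equal(b"abc", b"abd")
```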