> From: Ludovic Courtès <l...@gnu.org>
>
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >>
> >> At first glance:
> >>
> >>   - “Straße” is normally stored as a Latin1 string, so it would need to
> >>     be converted to UTF-* before it can be passed to one of the
> >>     unicase.h functions.  *Or*, we could check with bug-libunistring
> >>     what it would take to add Latin1 string case mapping functions.
> >>
> >>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
> >>     one-to-one case mapping.  All other Latin1 strings can be handled by
> >>     iterating over characters, as is currently done.
> >
> > There is the micro sign, which, when case folded, becomes a Greek mu.
> > It is still a single character, but it is the only Latin-1 character
> > that, when folded, becomes a non-Latin-1 character.
>
> Blech.
>
> It would have worked better with narrow == ASCII instead of
> narrow == Latin1.  It’s a change we can still make, I think.
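[Editor's illustration, not part of the original message: the two Latin-1 corner cases quoted above can be demonstrated concretely.  Guile's implementation is C, but Python's case mappings come from the same Unicode character database that libunistring draws on, so a short sketch suffices.]

```python
# ‘ß’ (U+00DF) has no one-to-one uppercase mapping: it expands to "SS",
# so "Straße" grows by one character when upcased.
assert "ß".upper() == "SS"
assert "Straße".upper() == "STRASSE"
assert len("Straße") == 6 and len("STRASSE") == 7

# The micro sign (U+00B5) stays a single character under case folding,
# but its folded form is Greek small mu (U+03BC), outside Latin-1.
micro = "\u00b5"
assert micro.casefold() == "\u03bc"
assert ord(micro.casefold()) > 0xFF  # not representable in Latin-1
```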
It would be easy enough to do.  If someone were to fight for a narrow
encoding of Latin-1, I would expect it to be you, since you're the only
committer whose name requires ISO-8859-1.  So if you're okay with it,
who am I to complain?

> >>   - Case insensitive comparison is more difficult, as you already
> >>     pointed out.  To do it right we’d probably need to convert Latin1
> >>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
> >>     do the conversion every time, though: we could just change Latin1
> >>     strings in-place so they now point to a wide stringbuf upon the
> >>     first ‘string-ci=’.
> >>
> >> Thoughts?
>
> [...]
>
> Indeed it’s quite inelegant. ;-)
>
> How about changing to narrow == ASCII and then string comparison would
> be:
>
>   if (narrow (s1) != narrow (s2))
>     {

It would be easier and cleaner, as you demonstrate.  I guess the
question is about future-proofing.  If the complications of the
Latin-1 / UTF-32 dual encoding are confined to upcase/downcase and the
string-ci comparison ops, then it doesn't seem worth it to change.  But
if the dual encoding is going to cause endless problems down the road,
ASCII/UTF-32 is simpler.

A lot of this debate is about expectations, I think.  For my part, I
think the string-ci ops only have real value for English-language,
ASCII text.  For non-English, non-ASCII processing, sorting
case-insensitively by numeric codepoint values, in the absence of
locale sorting rules, seems like an odd thing to want to do.

So I guess I'm not bothered by the ugly C necessary to make ISO-8859-1
work.  It is bad for the string-ci ops but not too bad for
upcase/downcase.

I also am not too concerned that the string-ci comparison ops may be
inefficient for non-English, non-ASCII text.  It does seem vital that
the string-locale comparison ops be efficient, though.

Thanks,
Mike
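[Editor's illustration, not part of the original message: a rough Python analogue of the string-ci strategy discussed above.  Decoding Latin-1 bytes to a full Unicode string stands in for widening to UTF-32, and comparing casefolded strings stands in for u32_casecmp; the real u32_casecmp can additionally apply a locale language and a normalization form, which this sketch omits.  The function name is hypothetical.]

```python
def ci_equal(s1: bytes, s2: bytes) -> bool:
    """Hypothetical sketch: widen Latin-1 bytes to full Unicode strings
    (the analogue of converting to UTF-32), then compare their casefolded
    forms (a rough analogue of u32_casecmp, minus normalization)."""
    return s1.decode("latin-1").casefold() == s2.decode("latin-1").casefold()

# Character-by-character folding within Latin-1 cannot get this right,
# because ß casefolds to the two-character sequence "ss":
assert ci_equal("Straße".encode("latin-1"), b"STRASSE")
assert not ci_equal(b"abc", b"abd")
```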