On 21 Apr 2013, at 20:07, Branko Čibej wrote:
Yes, the obvious ones are German (ß == SS) equivalence and Turkic
(i == İ) and (ı == I) equivalences (and that's already three
characters); but then in French, lowercase accented letters are
equivalent to uppercase unaccented letters, whereas for example in
Spanish that's not the case.
And that's just looking at European and West Asian Latin scripts.
There are at least 7 distinct Cyrillic scripts in roughly the same
area that I'm aware of, and I certainly don't know the case-folding
rules for all of them.
Not only is the above true, but one should also be careful to
distinguish case conversion from case-insensitive matching; these
follow different rules.
For instance, converting lower-case letters to upper case in French
will retain the accents (most of the time - this is locale-dependent),
but they are generally expected to be ignored when searching. By
contrast, it would be an error to match "a" with "ä" in Swedish when
searching, or to drop the dots in a case conversion.
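
To make the distinction concrete, here is a rough sketch in Python.
The helpers strip_accents and search_key are names I made up for
illustration, and the ignore_accents flag merely stands in for a real
locale-aware collation setting; Python's built-in string methods are
locale-independent, so this is an approximation, not a proper
implementation.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_key(text: str, ignore_accents: bool) -> str:
    """Hypothetical matching key: case-folded, optionally accent-folded."""
    key = text.casefold()
    return strip_accents(key) if ignore_accents else key

# Case conversion keeps the accent ...
french_upper = "\u00e9t\u00e9".upper()          # "été" -> "ÉTÉ"
# ... while a French-style search may ignore both case and accent.
french_match = (search_key("\u00c9T\u00c9", ignore_accents=True)
                == search_key("ete", ignore_accents=True))
# A Swedish-style search must keep "a" and "ä" distinct.
swedish_match = (search_key("a", ignore_accents=False)
                 == search_key("\u00e4", ignore_accents=False))

# Conversion and matching also disagree for ß: upper-casing expands it,
# lower-casing leaves it alone, and case folding maps it to "ss".
eszett_forms = ("\u00df".upper(), "\u00df".lower(), "\u00df".casefold())
```

Whether accents are folded has to be a per-locale decision; a single
global switch like the one above is only there to show the two
behaviours side by side.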
Clearly a case- and accent-sensitive search is much easier to
implement, but it would still benefit from normalisation, so that
different encodings of the same character compare equal. Bytewise
matching is on the lowest rung.
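
A small sketch of why normalisation matters even for a fully
sensitive search, again in Python. The function names bytewise_equal
and normalised_equal are my own, and NFC is just one reasonable choice
of normalisation form here.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

def bytewise_equal(a: str, b: str) -> bool:
    """Lowest rung: compare the UTF-8 encodings byte for byte."""
    return a.encode("utf-8") == b.encode("utf-8")

def normalised_equal(a: str, b: str) -> bool:
    """Case- and accent-sensitive comparison after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two spellings of "ä" differ bytewise but match once normalised.
same_bytes = bytewise_equal(composed, decomposed)
same_normalised = normalised_equal(composed, decomposed)
# Normalisation does not fold accents: "a" and "ä" remain distinct.
a_matches_umlaut = normalised_equal("a", composed)
```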