On 21 Apr 2013, at 20:07, Branko Čibej wrote:

Yes, the obvious ones are German (ß == SS) equivalence and Turkic
(i == İ) and (ı == I) equivalences (and that's already three
characters); but then in French, lowercase accented letters are
equivalent to uppercase unaccented letters, whereas for example in
Spanish that's not the case. And that's just looking at European and
West Asian Latin scripts. There are at least 7 distinct Cyrillic
scripts in roughly the same area that I'm aware of, and I certainly
don't know the case-folding rules for all of them.

Not only is the above true, one should also be careful to distinguish case conversion from case-insensitive matching; these follow different rules.

For instance, converting lower-case letters to upper case in French will retain the accents (most of the time; this is locale-dependent), but the accents are generally expected to be ignored when searching. By contrast, in Swedish it would be an error to match "a" with "ä" when searching, or to drop the dots in a case conversion.
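
To make the distinction concrete, here is a rough sketch in Python (my choice of language for illustration, not something from this thread). The helpers strip_accents and search_key are hypothetical names, and the ignore_accents flag merely stands in for a real locale-specific collation policy, which the standard library does not provide. Note also that stripping combining marks only affects letters that have a canonical decomposition.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (accents, dots) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def search_key(text, ignore_accents):
    """Build a key for matching only; it is not suitable for display."""
    folded = unicodedata.normalize("NFC", text.casefold())
    return strip_accents(folded) if ignore_accents else folded


# Case conversion: the accent survives, the result is meant to be shown.
french_upper = "école".upper()

# Matching under an assumed French-style policy: case and accents ignored.
french_match = search_key("École", True) == search_key("ecole", True)

# Matching under an assumed Swedish-style policy: case ignored only,
# so "ä" and "a" remain different letters.
swedish_case_match = search_key("Ä", False) == search_key("ä", False)
swedish_accent_match = search_key("ä", False) == search_key("a", False)

# German sharp s: folding and upper-casing give different strings,
# which is why a folded key should never be reused as a conversion.
german_folded = "Straße".casefold()
german_upper = "Straße".upper()
```

The point of keeping search_key separate from upper() is that the two answer different questions: one produces text for a reader, the other produces a throwaway key whose only job is to compare equal for strings the user considers the same.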

Clearly a case- and accent-sensitive search is much easier to implement, but it would still benefit from Unicode normalisation, so that composed and decomposed forms of the same character compare equal. Bytewise matching is on the lowest rung.
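
A small Python sketch of why normalisation matters even for a fully case- and accent-sensitive search. The function names are hypothetical, and NFC is just one reasonable choice of normalisation form, assumed here for illustration.

```python
import unicodedata

# The same visible letter "ä" in two canonically equivalent encodings.
composed = "\u00e4"      # single code point
decomposed = "a\u0308"   # "a" followed by a combining diaeresis


def bytewise_equal(a, b):
    """Lowest rung: compare the raw UTF-8 bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


def normalised_equal(a, b):
    """Case- and accent-sensitive comparison after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Bytewise matching treats the two encodings as different strings.
same_bytes = bytewise_equal(composed, decomposed)

# Normalised matching treats them as the same letter ...
same_normalised = normalised_equal(composed, decomposed)

# ... while still keeping "ä" distinct from plain "a".
accent_kept = not normalised_equal(composed, "a")
```

Normalising both sides before comparing costs little and leaves case and accents fully significant, so it does not touch any of the locale-dependent questions above.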
