[Serge]
> I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.

I only have to look at the length of the document to understand it's not
so easy. 8-) I'll take your two-line normalization function any day.

> IMHO It is perfectly acceptable to declare you don't interpret those
> symbols. After all they are called *compatibility* code points. I
> tried "a quarter" symbol: Google and MSN don't interpret it. Yahoo
> doesn't support it at all. [...]
> if you have character "digit two" followed by "superscript
> digit two"; they look like 2 power 2, but NFKD will convert them into
> 22 (twenty two), which is wrong. So if you want to use NFKD for search
> you will have to preprocess your data, for example inserting space
> between the twos.

I'm not sure it's obvious that it's wrong. How might a user enter
"2<superscript digit 2>" into a search box? They might enter a genuine
"<superscript digit 2>", in which case you're fine, or they might enter
"2^2", in which case it depends how you deal with punctuation. They
probably won't enter "2 2".

It's certainly not wrong in the case of ligatures like LATIN SMALL
LIGATURE FI: it's quite likely that the user will search for "fish"
rather than finding and (somehow) typing the ligature. Some superscripts
are similar. I imagine there's a code point for the "superscript st" in
"1st" (though I can't find it offhand), and you'd definitely want to
convert that to "st".

NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into
"1/4". I wonder whether there's some way to do that?

> After all they are called *compatibility* code points.

Yes, compatible with what the user types. 8-)

-- 
Richie Hindle
[EMAIL PROTECTED]
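
For reference, here is a minimal sketch of the cases discussed above, assuming Python 3 and the standard `unicodedata` module. The helper names `nfkd` and `search_key` are made up for illustration and are not Serge's function. It suggests that NFKD does decompose the one-quarter sign, but into "1", FRACTION SLASH (U+2044), "4" rather than an ASCII "/", so one possible approach is to map the fraction slash to a solidus afterwards.

```python
import unicodedata


def nfkd(text):
    """Apply compatibility decomposition (NFKD) to a string."""
    return unicodedata.normalize("NFKD", text)


def search_key(text):
    """Hypothetical helper: NFKD, then map FRACTION SLASH (U+2044) to '/'."""
    return nfkd(text).replace("\u2044", "/")


# DIGIT TWO followed by SUPERSCRIPT TWO collapses to two plain digits.
print(nfkd("2\u00b2"))          # 22

# LATIN SMALL LIGATURE FI becomes the two letters "f" and "i".
print(nfkd("\ufb01sh"))         # fish

# VULGAR FRACTION ONE QUARTER becomes "1", U+2044, "4" (not an ASCII slash).
print(ascii(nfkd("\u00bc")))    # '1\u20444'

# After replacing the fraction slash, the result matches what a user would type.
print(search_key("\u00bc"))     # 1/4
```

Whether folding "2<superscript 2>" into "22" is acceptable still depends on the data being indexed; the sketch only shows what the normalization does, not which behaviour is right for search.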