On 24.02.2016 14:30, Evgeny Kotkov wrote:
> Branko Čibej <br...@apache.org> writes:
>> Instead of relying on the Unicode spec, I propose a different approach:
>> to treat accented letters as if they don't have diacriticals at all.
>> This should be fairly easy to do with utf8proc: in the intermediate,
>> 32-bit NFD string, remove any character that's in the
>> combining-diacritical group, and then convert the result to NFC UTF-8.
>> I've done this before with fairly good results; it's also much easier to
>> explain this behaviour to users than to tell them, "read the Unicode spec".
> I see that utf8proc has UTF8PROC_STRIPMARK flag that does something
> similar to what you describe. The difference is that this option strips the
> codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
> Me (Enclosing_Mark) categories [1].
>
> Although that's more than just removing the characters that are marked as
> Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
> this flag. How does this cope with what you propose?

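
For reference, the filtering step from the quoted proposal (drop combining diacriticals from the intermediate 32-bit NFD string) might look roughly like the sketch below. It is illustrative only: strip_diacriticals and is_combining_diacritical are hypothetical names, not utf8proc or Subversion functions. The input is assumed to be already NFD-decomposed, and the NFC recomposition step is left out. Only the Combining Diacritical Marks block (U+0300..U+036F) is checked, so this is narrower than the Mn/Mc/Me categories mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8proc or Subversion.
   Covers only the Combining Diacritical Marks block, U+0300..U+036F. */
static int is_combining_diacritical(uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Remove combining diacriticals in place from an array of code points.
   ASSUMPTION: cps[] already holds NFD-decomposed text.
   Returns the new length; NFC recomposition is not done here. */
size_t strip_diacriticals(uint32_t *cps, size_t len)
{
  size_t in, out = 0;

  assert(cps != NULL || len == 0);
  for (in = 0; in < len; ++in)
    {
      if (!is_combining_diacritical(cps[in]))
        cps[out++] = cps[in];
    }
  return out;
}
```

Given the decomposed sequence U+0065 U+0301 U+0074 (e, combining acute, t), the function keeps the two base letters and returns a length of 2.
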
This is probably even better than just removing combining diacriticals,
because it should work well with non-Latin/Cyrillic scripts, too.

> Another question is about exposing this ability in the API. I'd say that we
> could do something like this:
>
>   svn_utf__transform(svn_boolean_t normalize,
>                      svn_boolean_t casefold,
>                      svn_boolean_t remove_diacritics)
>
> (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)
>
> Do you have an opinion or suggestions about that?

The big question here is what we'll use the API for. Currently we have a
'normalize' function that's used by svn_fs_verify (IIRC). Since we're
talking about a function that transforms a UTF-8 string to a shape
suitable for stuff-insensitive comparison, we could follow the example of
the standard strxfrm() -> svn_utf__xfrm(); but if that's too ugly, my
preference is for svn_utf__fold().

However, I'd not add arguments for normalization/case folding/etc; I'd
just make this function DTRT without any additional flags, because
otherwise we'll always be second-guessing the correct invocation. If
there's a use case for case-folding vs. non-case folding, then make two
functions: svn_utf__xfrm and svn_utf__xfrm_casefold. (Again, obviously,
all of these -- including svn_utf__normalize -- need only one private
implementation in the source.)

-- Brane
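
To make the "two functions, one private implementation" shape concrete, here is a rough sketch. All names (xfrm_impl, demo_utf_xfrm, demo_utf_xfrm_casefold) are hypothetical and this is not Subversion code. The transformations are toy stand-ins: mark stripping only drops U+0300..U+036F from input assumed to be NFD code points, and case folding only lowercases ASCII A-Z. The point is only that callers never pass flags; the flag lives in the private helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single private implementation shared by both entry points.
   HYPOTHETICAL: toy stand-ins for the real transformations.
   ASSUMPTION: src[] holds NFD-decomposed code points and dst[] has
   room for at least len entries.  Returns the output length. */
static size_t xfrm_impl(const uint32_t *src, size_t len,
                        uint32_t *dst, int casefold)
{
  size_t i, out = 0;

  assert((src != NULL && dst != NULL) || len == 0);
  for (i = 0; i < len; ++i)
    {
      uint32_t cp = src[i];

      /* Toy mark stripping: Combining Diacritical Marks block only. */
      if (cp >= 0x0300 && cp <= 0x036F)
        continue;

      /* Toy case folding: ASCII letters only. */
      if (casefold && cp >= 'A' && cp <= 'Z')
        cp += 'a' - 'A';

      dst[out++] = cp;
    }
  return out;
}

/* Public entry point without case folding; takes no behaviour flags. */
size_t demo_utf_xfrm(const uint32_t *src, size_t len, uint32_t *dst)
{
  return xfrm_impl(src, len, dst, 0);
}

/* Public entry point with case folding; same private implementation. */
size_t demo_utf_xfrm_casefold(const uint32_t *src, size_t len, uint32_t *dst)
{
  return xfrm_impl(src, len, dst, 1);
}
```

With this layout a caller picks behaviour by function name rather than by a combination of boolean arguments, which is the second-guessing the flag-based signature would invite.
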