utf-test.c

Branko Čibej Wed, 24 Feb 2016 06:07:31 -0800

On 24.02.2016 14:30, Evgeny Kotkov wrote:
> Branko Čibej <[email protected]> writes:
>> Instead of relying on the Unicode spec, I propose a different approach:
>> to treat accented letters as if they don't have diacriticals at all.
>> This should be fairly easy to do with utf8proc: in the intermediate,
>> 32-bit NFD string, remove any character that's in the
>> combining-diacritical group, and then convert the result to NFC UTF-8.
>> I've done this before with fairly good results; it's also much easier to
>> explain this behaviour to users than to tell them, "read the Unicode spec".
> I see that utf8proc has UTF8PROC_STRIPMARK flag that does something
> similar to what you describe.  The difference is that this option strips the
> codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
> Me (Enclosing_Mark) categories [1].
>
> Although that's more than just removing the characters that are marked as
> Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
> this flag.  How does this cope with what you propose?


This is probably even better than just removing combining diacriticals,
because it should work well with non-latin/cyrillic scripts, too.

> Another question is about exposing this ability in the API.  I'd say that we
> could do something like this:
>
>   svn_utf__transform(svn_boolean_t normalize,
>                      svn_boolean_t casefold,
>                      svn_boolean_t remove_diacritics)
>
>   (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)
>
> Do you have an opinion or suggestions about that?

The big question here is what we'll use the API for. Currently we have a
'normalize' function that's used by svn_fs_verify (IIRC). Since we're
talking about a funciton that transforms a UTF-8 string to a shape
suitable for stuff-insensitive comparison, we could follow the example
of the standard strxfrm() -> svn_utf__xfrm(); but if that's too ugly, my
preference is for svn_utf__fold().

However, I'd not add arguments for normalization/case folding/etc; I'd
just make this function DTRT without any additional flags, because
otherwise we'll always be second-guessing the correct invocation.

If there's a use case for case-folding vs. non-case folding, then make
two functions: svn_utf__xfrm and svn_utf__xfrm_casefold.

(Again, obviously, all of these -- including svn_utf__normalize -- need
only one private impltmentation in the source.)

-- Brane

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Reply via email to