On Tue, Jul 05, 2022 at 09:24:49PM +0200, Przemysław Sztoch wrote:
> I do not add more, because they probably concern older languages.
> An alternative might be to rely entirely on Unicode decomposition ...
> However, after the change, only one additional Ukrainian letter with an
> accent was added to the rule file.

Hmm.  I was wondering about the decomposition part, actually.  How
much simpler would things become if we treated the full range of
Cyrillic characters, aka from U+0400 to U+04FF, scanning all of them
and building rules only where decompositions exist?  Is it worth
considering the Cyrillic Supplement as well, as of U+0500-U+052F?

I was also thinking about the regression tests.  As the characters
handled by unaccent are more spread out here than for Latin and Greek,
it could be a good thing to have complete coverage.  We could for
example use a query like this one to check whether a character is
treated properly or not:

  SELECT chr(i.a) = unaccent(chr(i.a))
    FROM generate_series(1024, 1327) AS i(a); -- range of Cyrillic.

--
Michael
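
For illustration, the scan described above (walk a code point range, emit a rule only where a decomposition exists) could look roughly like the Python below. This is a minimal sketch under stated assumptions, not the script that generates the unaccent rule file: the function name cyrillic_unaccent_rules is hypothetical, it relies only on Python's standard unicodedata module, and it follows a single level of canonical decomposition.

```python
import unicodedata


def cyrillic_unaccent_rules(start=0x0400, end=0x04FF):
    """Hypothetical helper: return (accented, base) pairs for code points
    in [start, end] whose canonical decomposition is one base character
    followed only by combining marks."""
    rules = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a decomposition, and compatibility
        # decompositions, which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [chr(int(p, 16)) for p in decomp.split()]
        base, marks = parts[0], parts[1:]
        # Keep the rule only if everything after the base is a combining mark.
        if marks and all(unicodedata.combining(m) for m in marks):
            rules.append((ch, base))
    return rules


if __name__ == "__main__":
    # Default range is the Cyrillic block; pass end=0x052F to also scan
    # the Cyrillic Supplement.
    for accented, base in cyrillic_unaccent_rules():
        print(f"U+{ord(accented):04X} {accented}\t{base}")
```

Each printed pair is a candidate "accented, unaccented" rule. Whether every such pair is wanted in the rule file is a separate question that this sketch does not try to answer.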