Re: FTS Tokenization filters normalizer-icu vs lowercase

Michael Slusarz Thu, 20 Jan 2022 09:37:20 -0800

> On 01/20/2022 9:20 AM Alessio Cecchi <ales...@skye.it> wrote:
> 
> I'm trying to setup fts-flatcurve with tokenization.
> 
> What are the differences/benefits with "fts_filters = normalizer-icu" vs 
> "fts_filters = lowercase"?
> 
> Reading the Doc I found about normalizer-icu "This is potentially very 
> resource intensive." and about lowercase "Supports UTF8, when compiled with 
> libicu".
> 
> So, is using lowercase almost the same as normalizer-icu, but faster?
> 
No, these are two different operations.


Lowercase tries to use language rules to map characters to a "lowercase" 
equivalent, which is character/language dependent.

Normalization tries to take a string and reduce it to a unique, normalized 
form that can be directly compared to other normalized strings.  Unicode text, 
for example, can contain strings that display the same to the user but contain 
very different byte data.  For example, it is possible to create more 
complicated glyphs either by using a single precomposed code point or by using 
a sequence of code points (a base character plus combining marks) that, when 
combined, create an identical display of the character.
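
A minimal sketch of that point, using Python's standard unicodedata module 
as a stand-in (this is not the ICU code path the Dovecot filter uses, and the 
variable names are only illustrative).  It builds the same visible character 
two ways and shows that the raw strings differ until both are normalized:

```python
import unicodedata

# Two ways to encode what a user sees as the same character.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# The raw strings differ in both code points and UTF-8 bytes.
print(composed == decomposed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# After normalizing both to the same form (NFC here), they compare equal.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
```

NFC is only one of the standard normalization forms; which form a search 
index should use is a separate design decision.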

Normalization is a very complicated topic.  
https://en.wikipedia.org/wiki/Unicode_equivalence might help with further 
understanding.

The ICU library deals with general internationalization support, and these two 
filters are using different parts of that library to do different things.  They 
are not replacements for each other; they are complementary.  You could 
normalize a string and then lowercase it, for example.
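
As a rough illustration of why the two steps complement each other, here is a 
sketch using Python's standard library rather than ICU or Dovecot itself.  The 
helper names are hypothetical, and NFC plus str.lower() are assumptions chosen 
for the example, not what the Dovecot filters do internally:

```python
import unicodedata

def normalize_only(s: str) -> str:
    # Reduce to one canonical form (NFC chosen for this sketch).
    return unicodedata.normalize("NFC", s)

def lowercase_only(s: str) -> str:
    # Map characters to lowercase without touching their composition.
    return s.lower()

def normalize_then_lowercase(s: str) -> str:
    # Apply both steps in sequence.
    return unicodedata.normalize("NFC", s).lower()

a = "\u00c9cole"    # precomposed capital E with acute
b = "E\u0301cole"   # capital E followed by a combining acute accent
c = "\u00e9cole"    # precomposed small e with acute

# Lowercasing alone leaves a and b different (composition still differs).
print(lowercase_only(a) == lowercase_only(b))
# Normalizing alone leaves a and c different (case still differs).
print(normalize_only(a) == normalize_only(c))
# Doing both makes all three spellings compare equal.
print(normalize_then_lowercase(a) == normalize_then_lowercase(b) == c)
```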

michael


> 
> 
> FYI
> 
> to use fts-flatcurve with the dovecot RPM packages from repo.dovecot.org you 
> have to rebuild with --with-icu --with-stemmer --with-textcat and the related 
> libraries.
> 
> Thanks
> 
> --
> Alessio Cecchi
> Postmaster @ http://www.qboxmail.it
> https://www.linkedin.com/in/alessice
>
