Nice trick, but it won't work for us: we use stemming later in the chain, so we 
don't want every original term to survive the entire chain. We only want the 
tokens that get modified by the ICU filter to also emit the original.

Input: Genéve
LCF:   genéve
ICU:   genéve geneve
STEM:  genév genev

Query: Genéve -> (genéve OR geneve) -> (genév OR genev) ==> match on both 
terms, with a higher score than a match on 'Geneve' alone.

This is a made-up example, and it is a terrible idea to stem proper names, but 
you get the idea. A preserveOriginal flag would have solved it.
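
To make the intended behavior concrete, here is a minimal Python sketch of the chain above. It is not Lucene or Solr code: fold() only approximates ICU folding by stripping combining marks, toy_stem() is a hypothetical stemmer that exists only to mirror the example, and all function names are made up for illustration.

```python
import unicodedata


def fold(term: str) -> str:
    """Strip combining marks after NFKD decomposition.

    Rough stand-in for ICU folding; it does not cover everything
    ICUFoldingFilter does.
    """
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserving_original(tokens):
    """Fold each (position, term) pair.

    The original term is kept, at the same position, only when
    folding actually changed it.
    """
    out = []
    for position, term in tokens:
        folded = fold(term)
        if folded != term:
            out.append((position, term))  # original survives at same position
        out.append((position, folded))
    return out


def toy_stem(term: str) -> str:
    """Hypothetical stemmer: drops one trailing 'e', nothing more."""
    return term[:-1] if term.endswith("e") else term


def analyze(text: str):
    """Whitespace tokenize, lowercase, fold with preserved originals, stem."""
    tokens = [(pos, word.lower()) for pos, word in enumerate(text.split())]
    tokens = fold_preserving_original(tokens)
    return [(pos, toy_stem(term)) for pos, term in tokens]


print(analyze("Genéve"))  # [(0, 'genév'), (0, 'genev')]
print(analyze("Oslo"))    # [(0, 'oslo')]
```

The point is that only tokens changed by folding get a second term, and both variants still pass through the stemmer, so unmodified terms are not duplicated.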

Jan

> 26. aug. 2021 kl. 00:35 skrev Markus Jelsma <[email protected]>:
> 
> Hoi Jan,
> 
> I think ICUFoldingFilter and ASCIIFoldingFilter did not respect the
> keyword=true attribute when I last checked. If you use
> KeywordRepeatFilter and modify said TokenFilters to respect the
> keyword attribute, the problem seems solved.
> 
> Regards,
> Markus
> 
> 2021-08-25 16:32 GMT+02:00, André Widhani <[email protected]>:
>> Not with ICUFoldingFilter, but with the MappingCharFilter.
>> 
>> There you can supply a mapping file and skip baseletter mappings for the
>> users' native language, because in their own language, they know the correct
>> spelling ... most of the time ... sometimes.
>> 
>> This does not really help with multiple languages, though, and you lose the
>> convenience of ICUFoldingFilter.
>> 
>> André
>> ________________________________
>> From: Jan Høydahl <[email protected]>
>> Sent: Wednesday, 25 August 2021 15:43
>> To: [email protected] <[email protected]>
>> Subject: ICUFoldingFilter with preserveOriginal option?
>> 
>> Hi,
>> 
>> I'm looking at using ICUFoldingFilter for a customer, to fold e.g. Genéve to
>> Geneve and thus get better recall.
>> However, for some common Norwegian words, the folding makes them clash with
>> super-common words so it becomes impossible to find exactly what you want.
>> I imagined that if ICUFoldingFilter had a preserveOriginal=true option, it
>> could leave the original word in the index on the same position, and an
>> exact match for "Genéve" would score better than the normalized one. But
>> this filter does not support this.
>> 
>> Has anyone found a workaround for this, apart from duplicating all content
>> in different fields with different analysis and searching across them with
>> different weights?
>> 
>> Jan
>> 