Re: ICUFoldingFilter with preserveOriginal option?

Ere Maijala Thu, 26 Aug 2021 05:14:29 -0700

Thanks for the explanation.

I wonder if it would be feasible to create a token filter that would just call two token filters and merge the results. That could be really powerful and cater for other situations as well.
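
A minimal sketch of that merging idea, in plain Python rather than as a real Lucene TokenFilter. It assumes each wrapped filter emits exactly one term per input term; `merge_filters`, `strip_accents` and `identity` are hypothetical names, and `strip_accents` is only a crude stand-in for ICU folding.

```python
import unicodedata


def strip_accents(term):
    # Crude stand-in for ICU folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def identity(term):
    # Second "filter": passes the term through unchanged.
    return term


def merge_filters(tokens, first, second):
    """Yield (term, position_increment) pairs from both filters.

    The output of `second` is stacked on the position of `first`
    (increment 0) and is skipped when both filters agree.
    """
    for term in tokens:
        a, b = first(term), second(term)
        yield a, 1
        if b != a:
            yield b, 0


merged = list(merge_filters(["genéve", "oslo"], strip_accents, identity))
```

Here "genéve" produces two stacked terms on one position, while "oslo" produces a single term because both filters return the same output.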


--Ere

Jan Høydahl wrote on 26.8.2021 at 14.36:

Duplicating fields is a last resort.
My proposal is simple term stacking on the same position. The ICU filter will never 
output a different number of terms, so there is no need for a graph.
I don't need full control over the boost for the original term; it is enough that 
docs that match both the original and the normalized term score higher, 
which they will with a plain OR. The original term will likely score slightly 
higher in many cases due to higher IDF.
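
The scoring argument can be illustrated with a toy model. This is only a sketch under loud assumptions: `fold` is a crude stand-in for ICU folding, the IDF formula is a simplified one and not Lucene's BM25, positions are ignored, and the three-document corpus is made up.

```python
import math
import unicodedata


def fold(term):
    # Crude stand-in for ICU folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Return the index terms for a text: each token plus its folded form."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.add(fold(token))
    return terms


def idf(term, corpus):
    # Toy IDF: rarer terms get a larger weight. Returns 0 for unseen terms.
    df = sum(1 for doc in corpus if term in analyze(doc))
    return math.log(1 + len(corpus) / df) if df else 0.0


def or_score(query, doc, corpus):
    """Plain OR: sum the IDF of every query term that the document contains."""
    doc_terms = analyze(doc)
    return sum(idf(t, corpus) for t in analyze(query) if t in doc_terms)


corpus = ["genéve airport", "geneve hotel", "oslo airport"]
exact = or_score("genéve", corpus[0], corpus)        # matches both terms
folded_only = or_score("genéve", corpus[1], corpus)  # matches folded term only
```

In this toy corpus the document containing "genéve" matches both stacked terms and so scores higher than the one containing only "geneve", and the original term carries the larger IDF because it occurs in fewer documents.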

Jan

On 26 Aug 2021 at 11:26, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Hi,

Right, ok.

For an exact match boost to work, you'd have to index into another field with a 
different analysis chain anyway, or am I missing something? I may not be 
experienced enough in this, but I can't see a way to give the original term 
a higher boost. Also, would the filter need to be a graph filter for positions etc. 
to work properly?

I suppose it would be relatively simple to add support for protwords to 
ICUFoldingFilter, but I'd be concerned about the potential inconsistency from 
the user's perspective if you required an exact match for certain characters only for 
some words.
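
A rough sketch of what such a protwords exception could look like, assuming a hypothetical `fold_with_protwords` helper and a crude accent-stripping stand-in for ICU folding. The protected word "fôr" is purely an illustrative choice and does not come from any real list.

```python
import unicodedata


def fold(term):
    # Crude stand-in for ICU folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_with_protwords(tokens, protected):
    """Fold every token except those listed in `protected`."""
    return [t if t in protected else fold(t) for t in tokens]


# Hypothetical protwords list: "fôr" would otherwise collapse into "for".
protected = {"fôr"}
result = fold_with_protwords(["fôr", "genéve", "for"], protected)
```

The inconsistency mentioned above is visible here: "ô" requires an exact match in "fôr", while "é" in "genéve" is still folded.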

--Ere

Jan Høydahl wrote on 26.8.2021 at 11.39:

Hi,
Thanks for the input. We already use the filter parameter to guard æøåäö. We 
could of course guard é or ô against normalization too, but this becomes quite 
broad, and much of the benefit disappears.
If the filter supported some kind of protwords list for exceptions, we could 
start assembling words that we know for sure clash and should be excepted; 
however, an exact-match rank boost approach would seem more flexible.
Jan

On 26 Aug 2021 at 10:08, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Hi,

For our Finnish audience we avoid folding some characters to alleviate the 
problem. Along with MappingCharFilter this works pretty well. See 
https://github.com/NatLibFi/finna-solr/blob/dev/vufind/biblio/conf/schema.xml#L7
 for examples. Depending on your use case this could be a solution as well. 
Note that the filter parameter hasn't always been there, so a recent-enough 
Solr version is needed (I fail to recall the exact version).
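
To illustrate the effect of leaving some characters unfolded, here is a small simulation. It is not the Solr configuration itself: `fold_except` is a hypothetical helper, the guarded set is an example, and the accent stripping is only a crude stand-in for ICU folding.

```python
import unicodedata


def fold_char(ch):
    # Crude stand-in for ICU folding of one character:
    # decompose it and drop any combining marks.
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_except(term, keep):
    """Fold every character of `term` except those in `keep`."""
    # Normalize to NFC first so guarded letters are single code points.
    composed = unicodedata.normalize("NFC", term)
    return "".join(ch if ch in keep else fold_char(ch) for ch in composed)


guarded = set("åäö")  # example set of characters left unfolded
folded = fold_except("hämeenlinna café", guarded)
```

The guarded "ä" survives while "é" is folded to "e".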

--Ere

Jan Høydahl wrote on 25.8.2021 at 16.43:

Hi,
I'm looking at using ICUFoldingFilter for a customer, to fold e.g. Genéve to 
Geneve and thus get better recall.
However, for some common Norwegian words, the folding makes them clash with 
super-common words so it becomes impossible to find exactly what you want.
I imagined that if ICUFoldingFilter had a preserveOriginal=true option, it could leave 
the original word in the index on the same position, and an exact match for 
"Genéve" would score better than the normalized one. But this filter does not 
support this.
Has anyone found a workaround for this, apart from duplicating all content in 
different fields with different analysis and searching across them with different 
weights?
Jan


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

