Re: Compound words in English

Ramkumar Krishnamoorthy Wed, 23 Aug 2023 06:35:43 -0700

Thanks Tim & Walter.

I have managed to get it working with shingles and edge n-grams. Initially it
brought up a lot of false positives, but I mitigated that by tweaking the
parameters and by splitting this into a separate copy field with a lower
boost than a normal match.
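
For anyone finding this thread later, here is a rough sketch of what the
shingle approach does to a token stream. It is plain Python for illustration
only, not Solr configuration, and the function names are made up for this
example. It assumes two-word shingles joined with no separator, as Tim
describes in the quoted message.

```python
def concatenated_bigrams(tokens):
    """Join each adjacent pair of tokens with no separator: t1t2, t2t3, ..."""
    return [a + b for a, b in zip(tokens, tokens[1:])]


def index_terms(text):
    """Return (normal tokens, concatenated bigrams) for a piece of text.

    Whitespace tokenization and lowercasing stand in for a real analyzer.
    The two lists model two separate fields, so the bigram field can be
    given a lower boost at query time.
    """
    tokens = text.lower().split()
    return tokens, concatenated_bigrams(tokens)


if __name__ == "__main__":
    words, shingles = index_terms("Mental well being matters")
    print(words)
    print(shingles)
```

The query "wellbeing" finds the shingle built from "well being", while the
nonsense shingles next to it are the source of the occasional false positive.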


On Wed, Aug 16, 2023, 11:48 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> There are two cases.
>
> Index has “well being”, query is “wellbeing”. This is solved by using a
> shingle filter. That will make lots of nonsense compounds, too, but they
> won’t match real queries. Well, almost never.
>
> Index has “wellbeing”, query is “well being”. Best approach for this is
> synonym expansion at index time. Yes, you have to maintain that set of
> synonyms, but the list should grow to cover most cases quickly. These
> should be unidirectional mappings, like “wellbeing => well being”, with the
> synonym filter configured to keep the original term.
>
> This is what I did at Netflix back when Solr was new (version 1.3). The
> synonyms covered “superman”, “babysitter”, “manhunt”, “fullmetal”, etc. The
> last was for “Full Metal Jacket” and “Fullmetal Alchemist”. There were
> about 300 synonyms.
>
> You might also need to consider hyphenated versions, like “Spider-man”.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Aug 15, 2023, at 10:25 PM, Tim Casey <tca...@gmail.com> wrote:
> >
> > Index all bigrams.  If you use a dictionary then there is a lot of work to
> > maintain it.  Also this does not translate well to other languages.  The
> > downside to this is having partial token hits which decrease precision.
> > But, usually people who are looking for "well being" or "wellbeing" will
> > not expect to look for 'well*' in documents.  You would have to measure the
> > results in your data.  An obvious example would be first and last names.
> >
> > For every stream of tokens: t1 t2 t3...tn, you would index t1t2
> > t2t3...tn-1tn as well as the normal tokens.  Index them into a separate
> > non-stored field to allow control at query time.
> >
> > On Tue, Aug 15, 2023 at 8:08 PM Ramkumar Krishnamoorthy <
> > ramkumar1...@gmail.com> wrote:
> >
> >> Hi All,
> >>
> >> I am struggling to find the right filter that can make it work for search
> >> queries like "well being" and "play space" to be able to match terms like
> >> wellbeing and playspace in documents.
> >>
> >> Tried to make it work with wordDelimiterGraph. But that only works if the
> >> word in the document is "WellBeing". Another option I am considering is
> >> using DictionaryCompoundWordTokenFilterFactory but I need to find a
> >> dictionary file for English that I can pass to it.
> >>
> >> Any suggestions on how this can be handled?
> >>
> >> Thanks,
> >> Kumar
> >>
>
>
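
The index-time synonym expansion Walter describes in the quoted message above
can be sketched roughly as follows. This is a simplified Python illustration,
not how Solr's synonym filter is implemented: the mapping table and function
name are hypothetical, and token positions are ignored.

```python
# Hypothetical one-way mappings, in the spirit of "wellbeing => well being".
SYNONYMS = {
    "wellbeing": ["well", "being"],
    "playspace": ["play", "space"],
}


def expand_keeping_original(tokens, synonyms=SYNONYMS):
    """Emit each token, followed by its split form when a mapping exists.

    Keeping the original term means the document still matches the
    query "wellbeing" as well as the query "well being".
    """
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded


if __name__ == "__main__":
    print(expand_keeping_original(["employee", "wellbeing", "program"]))
```

Because the mapping only runs from the compound to its parts, a document that
contains "well" and "being" separately does not gain a "wellbeing" term.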
