Thanks Tim & Walter. I have managed to get it working with shingles and edge n-grams. Initially it brought up a lot of false positives, but I mitigated that by tuning the parameters and by splitting this analysis out into a separate copy field with a lower boost than a normal match.
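
For anyone finding this thread later, here is a rough illustration of the shingle idea. It is a plain-Python sketch, not the Solr configuration used here, and the function names are made up for the example. In Solr the equivalent is, as far as I know, a ShingleFilterFactory with an empty tokenSeparator, but please check the documentation for your version.

```python
def concatenated_shingles(tokens, size=2):
    """Glue each run of `size` adjacent tokens together with no separator."""
    return ["".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def analyze(text):
    """Lowercase, split on whitespace, and return unigrams plus glued bigrams."""
    tokens = text.lower().split()
    return set(tokens) | set(concatenated_shingles(tokens))


if __name__ == "__main__":
    # "well being" also produces the glued term "wellbeing".
    print(sorted(analyze("Well being")))
    # Longer text also produces nonsense compounds such as "spacedesign",
    # which is where the false positives can come from.
    print(concatenated_shingles("play space design".split()))
```

The glued terms are the ones that went into the separate, lower-boosted field, so an exact match on the normal field still ranks higher.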

On Wed, Aug 16, 2023, 11:48 PM Walter Underwood <wun...@wunderwood.org> wrote:

> There are two cases.
>
> Index has “well being”, query is “wellbeing”. This is solved by using a
> shingle filter. That will make lots of nonsense compounds, too, but they
> won’t match real queries. Well, almost never.
>
> Index has “wellbeing”, query is “well being”. Best approach for this is
> synonym expansion at index time. Yes, you have to maintain that set of
> synonyms, but the list should grow to cover most cases quickly. These
> should be unidirectional mappings, like “wellbeing => well being”, with the
> synonym filter configured to keep the original term.
>
> This is what I did at Netflix back when Solr was new (version 1.3). The
> synonyms covered “superman”, “babysitter”, “manhunt”, “fullmetal”, etc. The
> last was for “Full Metal Jacket” and “Fullmetal Alchemist”. There were
> about 300 synonyms.
>
> You might also need to consider hyphenated versions, like “Spider-man”.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Aug 15, 2023, at 10:25 PM, Tim Casey <tca...@gmail.com> wrote:
> >
> > Index all bigrams. If you use a dictionary then there is a lot of work to
> > maintain it. Also this does not translate well to other languages. The
> > downside to this is having partial token hits which decrease precision.
> > But, usually people who are looking for "well being" or "wellbeing" will
> > not expect to look for 'well*' in documents. You would have to measure the
> > results in your data. An obvious example would be first and last names.
> >
> > For every stream of tokens: t1 t2 t3...tn, you would index t1t2
> > t2t3...tn-1tn as well as the normal tokens. Index them into a separate
> > non-stored field to allow control at query time.
> >
> > On Tue, Aug 15, 2023 at 8:08 PM Ramkumar Krishnamoorthy <
> > ramkumar1...@gmail.com> wrote:
> >
> >> Hi All,
> >>
> >> I am struggling to find the right filter that can make it work for search
> >> queries like "well being" and "play space" to be able to match terms like
> >> wellbeing and playspace in documents.
> >>
> >> Tried to make it work with wordDelimiterGraph. But that only works if the
> >> word in the document is "WellBeing". Another option I am considering is
> >> using DictionaryCompoundWordTokenFilterFactory but I need to find a
> >> dictionary file for English that I can pass to it.
> >>
> >> Any suggestions on how this can be handled?
> >>
> >> Thanks,
> >> Kumar