This is also the sort of thing CommonGramsFilter was designed for...

https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#common-grams-filter
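
For illustration only, here is a rough Python sketch of the idea behind common grams. It is not Lucene's implementation; the function names, the "_" separator, and the choice of "of" as the only common word are assumptions made for this example. At index time every unigram is kept and a bigram is added wherever one side of an adjacent pair is a common word; at query time the bigrams are preferred, so a phrase like "bill of sale" no longer has to walk the huge positions list for "of" on its own.

```python
def common_grams(tokens, common_words, sep="_"):
    """Index-time sketch: emit (position, term) pairs, keeping every
    unigram and adding a bigram at the same position whenever either
    token of an adjacent pair is a common word."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        if pos + 1 < len(tokens):
            nxt = tokens[pos + 1]
            if tok in common_words or nxt in common_words:
                out.append((pos, tok + sep + nxt))
    return out


def common_grams_query(tokens, common_words, sep="_"):
    """Query-time sketch: emit bigrams where possible and drop the
    unigrams that a bigram already covers."""
    out = []
    covered_by_prev = False  # was this token the tail of the previous bigram?
    for pos, tok in enumerate(tokens):
        made_bigram = False
        if pos + 1 < len(tokens):
            nxt = tokens[pos + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + sep + nxt)
                made_bigram = True
        if not made_bigram and not covered_by_prev:
            out.append(tok)
        covered_by_prev = made_bigram
    return out


if __name__ == "__main__":
    common = {"of"}  # assumed common-words list for this example
    print(common_grams(["bill", "of", "sale"], common))
    print(common_grams_query(["bill", "of", "sale"], common))
```

With this sketch the query "bill of sale" turns into the two terms "bill_of" and "of_sale", each of which should be far rarer than "of" alone.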


: Date: Mon, 25 Mar 2024 10:17:48 -0400
: From: Doug Turnbull <douglas.turnb...@reddit.com.invalid>
: Reply-To: users@solr.apache.org
: To: users@solr.apache.org
: Subject: Re: Slow performance for phrases with terms with high ttf
: 
: As someone currently implementing a lot of positional search from scratch
: (in a different side-project), I can say it's totally expected behavior
: that high TTF / DF terms would be harder. To match the phrase there are
: simply more candidate documents and positions to intersect, so it's
: naturally a tougher problem.
: 
: If you think about how phrase search works, you might roughly think you:
: 1. Find all documents with every term
: 2. Iterate positions of these documents so that "Bill" is exactly one
: before "Of" exactly one before "sale"... etc
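
(Inline note: the two steps above can be sketched in Python roughly as follows. This is a toy model, not how Lucene actually stores or iterates postings; the `postings` layout, the function name, and the sample data are all assumptions for illustration. It shows why the cost grows with df and ttf: step 1 touches every document list, and step 2 scans the positions of each candidate document.)

```python
def phrase_match(postings, phrase):
    """Toy exact-phrase matcher.

    postings: dict mapping term -> {doc_id: sorted list of positions}
    phrase:   list of terms, in order
    Returns {doc_id: [start positions where the phrase occurs]}.
    """
    if not phrase:
        return {}
    per_term = [postings.get(term, {}) for term in phrase]

    # Step 1: find all documents that contain every term.
    docs = set(per_term[0])
    for doc_positions in per_term[1:]:
        docs &= set(doc_positions)

    # Step 2: within each candidate, require term i at position start + i.
    hits = {}
    for doc in sorted(docs):
        later = [set(doc_positions[doc]) for doc_positions in per_term[1:]]
        starts = [
            p for p in per_term[0][doc]
            if all(p + i + 1 in positions for i, positions in enumerate(later))
        ]
        if starts:
            hits[doc] = starts
    return hits


if __name__ == "__main__":
    # Made-up sample data, not taken from the index discussed in this thread.
    postings = {
        "bill": {1: [0, 7], 2: [3]},
        "of":   {1: [1, 4, 8], 2: [0], 3: [2]},
        "sale": {1: [9], 2: [5]},
    }
    print(phrase_match(postings, ["bill", "of", "sale"]))
```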
: 
: I'd say the best you could do is:
: 
: 1. Make sure your index can fit in memory.
: 2. Ensure you add any filters (fq) if you have any mandatory requirements.
: Add a filter cache. Don't cache anything that's query-dependent.
: 3. If it's a really common phrase, think about tokenizing it into a single
: term "bill of sale" -> "bill_of_sale", which you could do outside the search
: engine or with text analysis. The downside is that you lose the ability to
: match the individual terms. You could of course create a different field
: for these significant phrases if it's important.
: 
: Best
: -Doug
: 
: On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
: 
: > There is a typo in my email. The term list should be like this:
: >
: >
: >    - "bill" -> df = 1.879.324, ttf = 14.145.950
: >    - "note" -> df = 8.479.826, ttf = 151.249.542
: >    - "sale" -> df = 7.557.685, ttf = 12.0948.163
: >    - "of" -> df = 21.244.060, ttf = 6.879.196.700
: >
: >
: > On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
: >
: > > Hi,
: > >
: > > We are experiencing quite a performance decrease when searching for
: > > phrases that have terms with a high ttf value.
: > >
: > > E.g. searching for "note of sale" is around 3 times slower (~10 sec) than
: > > the "bill of sale" (~3 sec). This behaviour is consistent and can be
: > > reproduced also when we use other terms that have a high ttf. We are
: > > querying the unstemmed index.
: > >
: > > Terms (numDocs: 26220184):
: > >
: > >    - "bill" -> df = 1.879.324, ttf = 14.145.950
: > >    - "note" -> df = 8.479.826, ttf = 151.249.542
: > >    - "sale" -> df = 7.557.685, ttf = 12.0948.163
: > >    - "bill" -> df = 21.244.060, ttf = 6.879.196.700
: > >
: > >
: > > Is this the expected behaviour or is there something that can be
: > > tuned, like a cache setting?
: > >
: > > Thanks,
: > > Sjoerd
: > >
: >
: 

-Hoss
http://www.lucidworks.com/
