This is also the sort of thing CommonGramsFilter was designed for... https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#common-grams-filter
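
To make the idea concrete, here is a toy Python sketch of what common grams do to a token stream. It is only an illustration, not Solr's actual code: the names `index_grams` and `query_grams` and the word list are made up here, and the "_" separator is simply the one used in the example. At index time every token is kept and a bigram is added wherever a pair contains a common word; at query time the bigrams replace the unigrams they cover.

```python
# Toy illustration of the common-grams idea (not Solr's implementation).
# COMMON_WORDS is a hypothetical list; in Solr it would come from a words file.
COMMON_WORDS = {"of", "the", "in"}


def index_grams(tokens, common=COMMON_WORDS):
    """Index side: keep every token, and also emit a bigram for each
    adjacent pair in which at least one token is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


def query_grams(tokens, common=COMMON_WORDS):
    """Query side: emit the bigrams, and keep a plain token only if it is
    not already covered by one of those bigrams."""
    out = []
    covered = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")
            covered[i] = covered[i + 1] = True
        elif not covered[i]:
            out.append(tok)
    return out


print(index_grams(["bill", "of", "sale"]))
print(query_grams(["bill", "of", "sale"]))
```

With this sketch the phrase "bill of sale" is searched as the two terms "bill_of" and "of_sale", so the very frequent "of" never has to be looked up on its own. Each bigram should occur far less often than "of" does, which is where the speedup would come from.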

: Date: Mon, 25 Mar 2024 10:17:48 -0400
: From: Doug Turnbull <douglas.turnb...@reddit.com.invalid>
: Reply-To: users@solr.apache.org
: To: users@solr.apache.org
: Subject: Re: Slow performance for phrases with terms with high ttf
:
: As someone currently implementing a lot of positional search from scratch
: (in a different side-project), I can say it's totally expected behavior
: that high TTF / DF terms would be harder. To match the phrase there are
: simply more candidate documents and positions to intersect, so it's
: naturally a tougher problem.
:
: If you think about how phrase search works, you might roughly think you:
: 1. Find all documents with every term
: 2. Iterate positions of these documents so that "Bill" is exactly one
: before "Of" exactly one before "sale"... etc
:
: I'd say the best you could do is:
:
: 1. Make sure your index can fit in memory.
: 2. Ensure you add any filters (fq) if you have any mandatory requirements.
: Add a filter cache. Don't cache anything that's query-dependent.
: 3. If it's a really common phrase, think about tokenizing it into a single
: term "bill of sale" -> "bill_of_sale", which you could do outside the search
: engine or with text analysis. The downside is that you lose the ability to
: match the individual terms. You could of course create a different field
: for these significant phrases if it's important.
:
: Best
: -Doug
:
: On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
:
: > There is a typo in my email. The term list should be like this:
: >
: >
: > - "bill" -> df = 1.879.324, ttf = 14.145.950
: > - "note" -> df = 8.479.826, ttf = 151.249.542
: > - "sale" -> df = 7.557.685, ttf = 12.0948.163
: > - "of" -> df = 21.244.060, ttf = 6.879.196.700
: >
: >
: > On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
: >
: > > Hi,
: > >
: > > We are experiencing quite a performance decrease when searching for
: > > phrases that have terms with a high ttf value.
: > >
: > > E.g. searching for "note of sale" is around 3 times slower (~10 sec) than
: > > the "bill of sale" (~3 sec). This behaviour is consistent and can be
: > > reproduced also when we use other terms that have a high ttf. We are
: > > querying the unstemmed index.
: > >
: > > Terms (numDocs: 26220184):
: > >
: > > - "bill" -> df = 1.879.324, ttf = 14.145.950
: > > - "note" -> df = 8.479.826, ttf = 151.249.542
: > > - "sale" -> df = 7.557.685, ttf = 12.0948.163
: > > - "bill" -> df = 21.244.060, ttf = 6.879.196.700
: > >
: > >
: > > Is this the expected behaviour or is there something that can be
: > > tuned, like a cache setting?
: > >
: > > Thanks,
: > > Sjoerd
: > >
: >
:

-Hoss
http://www.lucidworks.com/