Yeah, it could help with parallelizing it, with the tradeoff that you'll only be as fast as your slowest shard (i.e. tail latency). So with more shards, the risk goes up that one shard having a bad hair day, doing a GC, or whatever slows the whole request down, and the variance in the overall response time will probably increase too. So definitely look at p90+ changes, not just p50.
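For a feel of why the higher percentiles matter here, a quick simulated sketch (totally made-up latency numbers, assuming independent shards where a shard occasionally hits a GC pause) of the point that a fan-out query is only as fast as its slowest shard:

import random

random.seed(0)

def shard_latency_ms():
    # Mostly ~200 ms, but ~1% of the time a shard hits a long pause
    # (GC, cold disk read, etc.). Numbers are illustrative only.
    base = random.gauss(200, 30)
    return base + (1500 if random.random() < 0.01 else 0)

def distributed_query_ms(num_shards):
    # The coordinator has to wait for every shard, so the overall
    # latency is the latency of the slowest shard.
    return max(shard_latency_ms() for _ in range(num_shards))

for shards in (4, 8, 16):
    samples = sorted(distributed_query_ms(shards) for _ in range(10_000))
    p50 = samples[len(samples) // 2]
    p90 = samples[int(len(samples) * 0.9)]
    print(f"{shards:2d} shards: p50 = {p50:6.0f} ms, p90 = {p90:6.0f} ms")

With made-up numbers like these, p50 stays roughly flat as shards are added, while p90 eventually jumps, because the chance that at least one shard is having a bad moment grows with the shard count.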
On Mon, Mar 25, 2024 at 10:51 AM Sjoerd Smeets <ssme...@gmail.com> wrote:

> Thanks Doug!
>
> Do you think adding more shards would help in this case? Putting the index
> in memory is not really possible, as the index is up to 2.5 TB. We have
> SSDs though, so that is the closest we can get. We have 16 CPUs and
> configured it for 4 shards. Would splitting it up into more shards
> potentially help? We'll run some experiments anyway.
>
> On Mon, Mar 25, 2024 at 3:19 PM Doug Turnbull
> <douglas.turnb...@reddit.com.invalid> wrote:
>
> > As someone currently implementing a lot of positional search from
> > scratch (in a different side project), I can say it's totally expected
> > behavior that high-TTF/DF terms would be harder. To match the phrase
> > there are simply more candidate documents and positions to intersect,
> > so it's naturally a tougher problem.
> >
> > If you think about how phrase search works, you might roughly think of
> > it as:
> > 1. Find all documents that contain every term
> > 2. Iterate the positions in those documents so that "bill" is exactly
> > one position before "of", which is exactly one position before "sale",
> > etc.
> >
> > I'd say the best you could do is:
> >
> > 1. Make sure your index can fit in memory.
> > 2. Ensure you add filters (fq) if you have any mandatory requirements.
> > Add a filter cache. Don't cache anything that's query-dependent.
> > 3. If it's a really common phrase, think about tokenizing it into a
> > single term, "bill of sale" -> "bill_of_sale", which you could do
> > outside the search engine or with text analysis. The downside is that
> > you lose the ability to match the individual terms. You could of course
> > create a different field for these significant phrases if that's
> > important.
> >
> > Best
> > -Doug
> >
> > On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
> >
> > > There is a typo in my email. The term list should be like this:
> > >
> > > - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > - "note" -> df = 8.479.826, ttf = 151.249.542
> > > - "sale" -> df = 7.557.685, ttf = 120.948.163
> > > - "of" -> df = 21.244.060, ttf = 6.879.196.700
> > >
> > > On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > We are experiencing quite a performance decrease when searching for
> > > > phrases that have terms with a high ttf value.
> > > >
> > > > E.g. searching for "note of sale" is around 3 times slower (~10 sec)
> > > > than "bill of sale" (~3 sec). This behaviour is consistent and can
> > > > be reproduced also when we use other terms that have a high ttf. We
> > > > are querying the unstemmed index.
> > > >
> > > > Terms (numDocs: 26220184):
> > > >
> > > > - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > > - "note" -> df = 8.479.826, ttf = 151.249.542
> > > > - "sale" -> df = 7.557.685, ttf = 120.948.163
> > > > - "bill" -> df = 21.244.060, ttf = 6.879.196.700
> > > >
> > > > Is this the expected behaviour, or is there something that can be
> > > > tuned, like a cache setting?
> > > >
> > > > Thanks,
> > > > Sjoerd
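To make Doug's two steps above concrete, here's a toy sketch of phrase matching over an in-memory postings map. This is nothing like Lucene's actual implementation, just the shape of the work: the more documents and positions a term like "of" has, the more there is to intersect.

from typing import Dict, List

# positions[term][doc_id] -> sorted positions of the term in that document
Postings = Dict[str, Dict[int, List[int]]]

def phrase_match(postings: Postings, terms: List[str]) -> List[int]:
    """Return doc ids that contain the terms as an exact phrase."""
    # Step 1: documents containing every term (intersection of doc sets).
    candidates = set(postings[terms[0]])
    for term in terms[1:]:
        candidates &= set(postings[term])

    hits = []
    for doc in sorted(candidates):
        # Step 2: check positions -- term i must sit at position p + i
        # for some start position p of the first term.
        for p in postings[terms[0]][doc]:
            if all((p + i) in postings[t][doc] for i, t in enumerate(terms)):
                hits.append(doc)
                break
    return hits

# Tiny example: doc 1 contains the phrase, doc 2 only has the terms scattered.
postings = {
    "bill": {1: [0], 2: [5]},
    "of":   {1: [1], 2: [0]},
    "sale": {1: [2], 2: [9]},
}
print(phrase_match(postings, ["bill", "of", "sale"]))  # -> [1]

Both steps scale with the size of the posting lists involved, which is why swapping the comparatively rare "bill" (df ~1.9M) for "note" (df ~8.5M) makes the query noticeably slower.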
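On the fq suggestion, a minimal example of what that looks like from a client; the core name and field names here are made up, so adjust them to your schema. The point is that the mandatory restriction goes into fq, where the filterCache can reuse it across queries, rather than into q:

import requests

# Hypothetical core and fields -- replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/documents/select"

params = {
    "q": '"bill of sale"',       # the phrase query itself
    "df": "body_unstemmed",       # hypothetical default search field
    "fq": [                       # mandatory filters, cacheable in filterCache
        "doc_type:contract",
        "year:[2020 TO *]",
    ],
    "rows": 10,
}

resp = requests.get(SOLR_SELECT, params=params)
print(resp.json()["response"]["numFound"])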
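And a rough sketch of the "outside the search engine" variant of suggestion 3: rewriting known significant phrases into single tokens before documents (and queries) are sent to the search engine. The phrase list is hypothetical; the same effect can also be had inside Solr with text analysis.

import re

# Hypothetical list of significant phrases worth collapsing into one token.
SIGNIFICANT_PHRASES = ["bill of sale", "note of sale"]

def collapse_phrases(text: str) -> str:
    """Rewrite known phrases as single underscore-joined tokens."""
    for phrase in SIGNIFICANT_PHRASES:
        token = phrase.replace(" ", "_")
        text = re.sub(re.escape(phrase), token, text, flags=re.IGNORECASE)
    return text

# Apply the same rewrite at index time and at query time so they match.
print(collapse_phrases("Attached is the Bill of Sale for the vehicle."))
# -> "Attached is the bill_of_sale for the vehicle."

As Doug notes, you then lose the ability to match "bill" or "sale" on their own in that field, unless you keep a separate field with the normal tokenization.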