Yeah, it could help with parallelizing it, with the tradeoff that you'll only be as fast as your slowest shard (i.e. tail latency). So with more shards, the risk goes up that one shard having a bad hair day, doing a GC, or whatever slows the whole request down, and the variance in the overall response time will probably increase too. So definitely look at p90+ changes, not just p50.
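For a feel of why the higher percentiles matter here, a quick simulated sketch (totally made-up latency numbers, assuming independent shards where a shard occasionally hits a GC pause) of the point that a fan-out query is only as fast as its slowest shard:

import random

random.seed(0)

def shard_latency_ms():
    # Mostly ~200 ms, but ~1% of the time a shard hits a long pause
    # (GC, cold disk read, etc.). Numbers are illustrative only.
    base = random.gauss(200, 30)
    return base + (1500 if random.random() < 0.01 else 0)

def distributed_query_ms(num_shards):
    # The coordinator has to wait for every shard, so the overall
    # latency is the latency of the slowest shard.
    return max(shard_latency_ms() for _ in range(num_shards))

for shards in (4, 8, 16):
    samples = sorted(distributed_query_ms(shards) for _ in range(10_000))
    p50 = samples[len(samples) // 2]
    p90 = samples[int(len(samples) * 0.9)]
    print(f"{shards:2d} shards: p50 = {p50:6.0f} ms, p90 = {p90:6.0f} ms")

With made-up numbers like these, p50 stays roughly flat as shards are added, while p90 eventually jumps, because the chance that at least one shard is having a bad moment grows with the shard count.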
On Mon, Mar 25, 2024 at 10:51 AM Sjoerd Smeets <ssme...@gmail.com> wrote:

> Thanks Doug!
>
> Do you think adding more shards would help in this case? Putting the index
> in memory is not really possible, as the index is up to 2.5 TB. We have
> SSDs though, so that is the closest we can get. We have 16 CPUs and
> configured it for 4 shards. Would splitting it up into more shards
> potentially help? We'll run some experiments anyway.
>
> On Mon, Mar 25, 2024 at 3:19 PM Doug Turnbull
> <douglas.turnb...@reddit.com.invalid> wrote:
>
> > As someone currently implementing a lot of positional search from
> > scratch (in a different side project), I can say it's totally expected
> > behavior that high-TTF/DF terms would be harder. To match the phrase
> > there are simply more candidate documents and positions to intersect,
> > so it's naturally a tougher problem.
> >
> > If you think about how phrase search works, you might roughly think of
> > it as:
> > 1. Find all documents that contain every term
> > 2. Iterate the positions in those documents so that "bill" is exactly
> > one position before "of", which is exactly one position before "sale",
> > etc.
> >
> > I'd say the best you could do is:
> >
> > 1. Make sure your index can fit in memory.
> > 2. Ensure you add filters (fq) if you have any mandatory requirements.
> > Add a filter cache. Don't cache anything that's query-dependent.
> > 3. If it's a really common phrase, think about tokenizing it into a
> > single term, "bill of sale" -> "bill_of_sale", which you could do
> > outside the search engine or with text analysis. The downside is that
> > you lose the ability to match the individual terms. You could of course
> > create a different field for these significant phrases if that's
> > important.
> >
> > Best
> > -Doug
> >
> > On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
> >
> > > There is a typo in my email. The term list should be like this:
> > >
> > > - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > - "note" -> df = 8.479.826, ttf = 151.249.542
> > > - "sale" -> df = 7.557.685, ttf = 120.948.163
> > > - "of" -> df = 21.244.060, ttf = 6.879.196.700
> > >
> > > On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > We are experiencing quite a performance decrease when searching for
> > > > phrases that have terms with a high ttf value.
> > > >
> > > > E.g. searching for "note of sale" is around 3 times slower (~10 sec)
> > > > than "bill of sale" (~3 sec). This behaviour is consistent and can
> > > > be reproduced also when we use other terms that have a high ttf. We
> > > > are querying the unstemmed index.
> > > >
> > > > Terms (numDocs: 26220184):
> > > >
> > > > - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > > - "note" -> df = 8.479.826, ttf = 151.249.542
> > > > - "sale" -> df = 7.557.685, ttf = 120.948.163
> > > > - "bill" -> df = 21.244.060, ttf = 6.879.196.700
> > > >
> > > > Is this the expected behaviour, or is there something that can be
> > > > tuned, like a cache setting?
> > > >
> > > > Thanks,
> > > > Sjoerd
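To make Doug's two steps above concrete, here's a toy sketch of phrase matching over an in-memory postings map. This is nothing like Lucene's actual implementation, just the shape of the work: the more documents and positions a term like "of" has, the more there is to intersect.

from typing import Dict, List

# positions[term][doc_id] -> sorted positions of the term in that document
Postings = Dict[str, Dict[int, List[int]]]

def phrase_match(postings: Postings, terms: List[str]) -> List[int]:
    """Return doc ids that contain the terms as an exact phrase."""
    # Step 1: documents containing every term (intersection of doc sets).
    candidates = set(postings[terms[0]])
    for term in terms[1:]:
        candidates &= set(postings[term])

    hits = []
    for doc in sorted(candidates):
        # Step 2: check positions -- term i must sit at position p + i
        # for some start position p of the first term.
        for p in postings[terms[0]][doc]:
            if all((p + i) in postings[t][doc] for i, t in enumerate(terms)):
                hits.append(doc)
                break
    return hits

# Tiny example: doc 1 contains the phrase, doc 2 only has the terms scattered.
postings = {
    "bill": {1: [0], 2: [5]},
    "of":   {1: [1], 2: [0]},
    "sale": {1: [2], 2: [9]},
}
print(phrase_match(postings, ["bill", "of", "sale"]))  # -> [1]

Both steps scale with the size of the posting lists involved, which is why swapping the comparatively rare "bill" (df ~1.9M) for "note" (df ~8.5M) makes the query noticeably slower.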
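On the fq suggestion, a minimal example of what that looks like from a client; the core name and field names here are made up, so adjust them to your schema. The point is that the mandatory restriction goes into fq, where the filterCache can reuse it across queries, rather than into q:

import requests

# Hypothetical core and fields -- replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/documents/select"

params = {
    "q": '"bill of sale"',       # the phrase query itself
    "df": "body_unstemmed",       # hypothetical default search field
    "fq": [                       # mandatory filters, cacheable in filterCache
        "doc_type:contract",
        "year:[2020 TO *]",
    ],
    "rows": 10,
}

resp = requests.get(SOLR_SELECT, params=params)
print(resp.json()["response"]["numFound"])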
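And a rough sketch of the "outside the search engine" variant of suggestion 3: rewriting known significant phrases into single tokens before documents (and queries) are sent to the search engine. The phrase list is hypothetical; the same effect can also be had inside Solr with text analysis.

import re

# Hypothetical list of significant phrases worth collapsing into one token.
SIGNIFICANT_PHRASES = ["bill of sale", "note of sale"]

def collapse_phrases(text: str) -> str:
    """Rewrite known phrases as single underscore-joined tokens."""
    for phrase in SIGNIFICANT_PHRASES:
        token = phrase.replace(" ", "_")
        text = re.sub(re.escape(phrase), token, text, flags=re.IGNORECASE)
    return text

# Apply the same rewrite at index time and at query time so they match.
print(collapse_phrases("Attached is the Bill of Sale for the vehicle."))
# -> "Attached is the bill_of_sale for the vehicle."

As Doug notes, you then lose the ability to match "bill" or "sale" on their own in that field, unless you keep a separate field with the normal tokenization.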