Thanks Doug, we'll experiment and let you know how it went.

On Mon, Mar 25, 2024 at 3:59 PM Doug Turnbull
<douglas.turnb...@reddit.com.invalid> wrote:

> It could help, yeah, by parallelizing the work. The tradeoff is that
> you'll only be as fast as your slowest shard (i.e. tail latency). So with
> more shards, one shard having a bad hair day, doing a GC, or something
> similar is more likely to slow things down, and will probably increase the
> variance in the overall response time. So definitely look at p90+ changes,
> not just p50.
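> A toy simulation (the latency distribution here is an assumption for
> illustration, not a measurement of Solr) shows why fan-out amplifies
> tail latency: the query finishes only when the slowest shard does.

```python
import random

# Toy model: each shard's response time is drawn from a skewed
# (lognormal) distribution, so an occasional shard is very slow.
# A distributed query must wait for the slowest shard.
random.seed(42)

def query_latency(num_shards: int) -> float:
    # Overall latency = max over per-shard latencies
    return max(random.lognormvariate(0, 0.5) for _ in range(num_shards))

def percentile(samples, p):
    s = sorted(samples)
    return s[int(p / 100 * (len(s) - 1))]

for shards in (1, 4, 16):
    runs = [query_latency(shards) for _ in range(10_000)]
    print(f"shards={shards:2d}  p50={percentile(runs, 50):.2f}  "
          f"p90={percentile(runs, 90):.2f}")
```

> Both p50 and p90 climb with shard count in this model, which is why
> Doug suggests watching p90+ and not just the median.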
>
> On Mon, Mar 25, 2024 at 10:51 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
>
> > Thanks Doug!
> >
> > Do you think adding more shards would help in this case? Putting the
> > index in memory is not really possible, as the index is up to 2.5 TB.
> > We have SSDs though, so that is the closest we can get. We have 16 CPUs
> > and configured it for 4 shards. Would splitting it into more shards
> > potentially help? We'll run some experiments anyway.
> >
> > On Mon, Mar 25, 2024 at 3:19 PM Doug Turnbull
> > <douglas.turnb...@reddit.com.invalid> wrote:
> >
> > > As someone currently implementing a lot of positional search from
> > > scratch (in a different side project), I can say it's totally expected
> > > behavior that high-TTF/DF terms would be harder. To match the phrase,
> > > there are simply more candidate documents and positions to intersect,
> > > so it's naturally a tougher problem.
> > >
> > > If you think about how phrase search works, you might roughly break it
> > > down as:
> > > 1. Find all documents containing every term.
> > > 2. Iterate the positions in those documents so that "bill" is exactly
> > > one position before "of", which is exactly one before "sale", etc.
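> > > A minimal sketch of those two steps, with made-up postings (a toy
> > > model for intuition, not Lucene's actual implementation):

```python
# Hypothetical in-memory postings: term -> {doc_id: [positions]}.
postings = {
    "bill": {1: [0, 40], 2: [7]},
    "of":   {1: [1, 9, 41], 2: [3]},
    "sale": {1: [2, 42], 2: [4]},
}

def phrase_match(terms, postings):
    # Step 1: documents containing every term (work grows with df)
    docs = set.intersection(*(set(postings[t]) for t in terms))
    hits = []
    for doc in docs:
        # Step 2: for each position of the first term, require each
        # following term exactly one position further along
        # (work grows with ttf, i.e. positions per document)
        for start in postings[terms[0]][doc]:
            if all(start + i in postings[t][doc]
                   for i, t in enumerate(terms[1:], 1)):
                hits.append(doc)
                break
    return sorted(hits)

print(phrase_match(["bill", "of", "sale"], postings))  # -> [1]
```

> > > The step-1 intersection cost scales with df and the step-2 position
> > > scans with ttf, which is why phrases built from frequent terms like
> > > "note" and "of" are slower.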
> > >
> > > I'd say the best you could do is:
> > >
> > > 1. Make sure your index can fit in memory.
> > > 2. Ensure you add filters (fq) for any mandatory requirements, and
> > > add a filter cache. Don't cache anything that's query-dependent.
> > > 3. If it's a really common phrase, think about tokenizing it into a
> > > single term ("bill of sale" -> "bill_of_sale"), which you could do
> > > outside the search engine or with text analysis. The downside is that
> > > you lose the ability to match the individual terms. You could of
> > > course create a different field for these significant phrases if it's
> > > important.
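> > > A rough sketch of option 3 done outside the search engine. The phrase
> > > list and helper below are hypothetical; the same rewrite has to be
> > > applied to both documents and queries so the single tokens match.
> > > (Inside Solr, a shingle filter in the analysis chain is the analogous
> > > index-time approach.)

```python
# Hypothetical list of significant phrases to collapse into one token
# before indexing; not a built-in Solr feature.
SIGNIFICANT_PHRASES = {
    "bill of sale": "bill_of_sale",
    "note of sale": "note_of_sale",
}

def shingle_phrases(text: str) -> str:
    """Rewrite known phrases into single tokens (applied to docs AND queries)."""
    out = text.lower()
    for phrase, token in SIGNIFICANT_PHRASES.items():
        out = out.replace(phrase, token)
    return out

print(shingle_phrases("Signed a Bill of Sale yesterday"))
# -> signed a bill_of_sale yesterday
```

> > > The phrase query then becomes a single-term lookup, at the cost of
> > > losing matches on the individual words, per Doug's caveat.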
> > >
> > > Best
> > > -Doug
> > >
> > > On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com>
> wrote:
> > >
> > > > There is a typo in my email. The term list should be like this:
> > > >
> > > >
> > > >    - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > >    - "note" -> df = 8.479.826, ttf = 151.249.542
> > > >    - "sale" -> df = 7.557.685, ttf = 120.948.163
> > > >    - "of" -> df = 21.244.060, ttf = 6.879.196.700
> > > >
> > > >
> > > > On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com>
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are experiencing quite a performance decrease when searching for
> > > > > phrases that have terms with a high ttf value.
> > > > >
> > > > > E.g. searching for "note of sale" is around 3 times slower (~10
> > > > > sec) than "bill of sale" (~3 sec). This behaviour is consistent
> > > > > and can be reproduced also when we use other terms that have a
> > > > > high ttf. We are querying the unstemmed index.
> > > > >
> > > > > Terms (numDocs: 26220184):
> > > > >
> > > > >    - "bill" -> df = 1.879.324, ttf = 14.145.950
> > > > >    - "note" -> df = 8.479.826, ttf = 151.249.542
> > > > >    - "sale" -> df = 7.557.685, ttf = 12.0948.163
> > > > >    - "bill" -> df = 21.244.060, ttf = 6.879.196.700
> > > > >
> > > > >
> > > > > Is this the expected behaviour or is there something that can be
> > > > > tuned, like a cache setting?
> > > > >
> > > > > Thanks,
> > > > > Sjoerd
> > > > >
> > > >
> > >
> >
>
