As someone currently implementing a lot of positional search from scratch
(in a different side-project), I can say it's totally expected behavior
that phrases with high TTF / DF terms would be slower. To match the phrase
there are simply more candidate documents and positions to intersect, so
it's naturally a tougher problem.
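
As a rough illustration, here is a back-of-the-envelope comparison using
the df / ttf figures from the quoted message below. I've left out "sale"
since it appears in both phrases. These are index-wide averages, so treat
them as a hint about relative work, not a prediction of query time.

```python
# df / ttf figures copied from the quoted message below
# (the dots there are thousands separators).
stats = {
    "bill": {"df": 1_879_324, "ttf": 14_145_950},
    "note": {"df": 8_479_826, "ttf": 151_249_542},
    "of": {"df": 21_244_060, "ttf": 6_879_196_700},
}


def avg_positions_per_doc(term):
    """Average number of positions to scan in each document containing term."""
    s = stats[term]
    return s["ttf"] / s["df"]


# How much bigger "note" is than "bill" in documents and in positions.
df_ratio = stats["note"]["df"] / stats["bill"]["df"]
ttf_ratio = stats["note"]["ttf"] / stats["bill"]["ttf"]

if __name__ == "__main__":
    for term in stats:
        print(f"{term}: {avg_positions_per_doc(term):.1f} positions per doc")
    print(f"note vs bill: {df_ratio:.1f}x the docs, {ttf_ratio:.1f}x the positions")
```

So "note" brings several times more candidate documents than "bill", and
more positions per document on top of that.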

If you think about how phrase search works, the engine roughly has to:
1. Find all documents that contain every term
2. Iterate the positions in those documents to check that "bill" is
exactly one position before "of", which is exactly one before "sale", etc.
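
The two steps above can be sketched as a toy in-memory positional index.
This is only an illustration of the idea, not how Lucene/Solr actually
implements it; build_index and phrase_match are names I made up, and
tokenization is just a lowercase whitespace split.

```python
from collections import defaultdict


def build_index(docs):
    """Build term -> {doc_id: [positions]} from a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted doc ids in which the terms of phrase occur consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]

    # Step 1: keep only documents that contain every term.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)

    # Step 2: for each candidate, look for a start position of the first
    # term such that term i sits at start + i for every later term.
    hits = []
    for doc_id in sorted(candidates):
        later = [set(p[doc_id]) for p in postings[1:]]
        starts = postings[0][doc_id]
        if any(all(s + i + 1 in ps for i, ps in enumerate(later)) for s in starts):
            hits.append(doc_id)
    return hits


docs = {
    1: "bill of sale",
    2: "sale of the bill",
    3: "a note of sale and a bill of sale",
}
index = build_index(docs)
```

Both steps get more expensive as df grows (more candidates survive step 1)
and as ttf grows (more positions to check per candidate in step 2).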

I'd say the best you could do is:

1. Make sure your index can fit in memory.
2. Put any mandatory requirements into filters (fq), and make sure the
filter cache is enabled. Don't cache filters that are query-dependent.
3. If it's a really common phrase, think about tokenizing it into a single
term ("bill of sale" -> "bill_of_sale"), which you could do outside the
search engine or with text analysis. The downside is that you lose the
ability to match the individual terms. You could of course create a
separate field for these significant phrases if that's important.
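
For point 3, a minimal sketch of doing the rewrite outside the search
engine might look like this. The phrase list and the collapse_phrases
helper are hypothetical; you would apply the same function to the text
you index (into the separate phrase field) and to incoming queries.

```python
import re

# Hypothetical list of phrases worth collapsing into single terms.
SIGNIFICANT_PHRASES = ["bill of sale", "note of sale"]


def collapse_phrases(text, phrases=SIGNIFICANT_PHRASES):
    """Replace each known phrase with one underscore-joined token."""
    # Longest phrases first, so a longer phrase wins over a shorter one
    # that it contains.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        # Allow any run of whitespace between the words of the phrase.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(pattern, "_".join(words), text, flags=re.IGNORECASE)
    return text
```

A phrase query then becomes a single term lookup on that field, with no
positions to intersect.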

Best
-Doug

On Mon, Mar 25, 2024 at 6:40 AM Sjoerd Smeets <ssme...@gmail.com> wrote:

> There is a typo in my email. The term list should be like this:
>
>
>    - "bill" -> df = 1.879.324, ttf = 14.145.950
>    - "note" -> df = 8.479.826, ttf = 151.249.542
>    - "sale" -> df = 7.557.685, ttf = 120.948.163
>    - "of" -> df = 21.244.060, ttf = 6.879.196.700
>
>
> On Mon, Mar 25, 2024 at 8:56 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
>
> > Hi,
> >
> > We are experiencing quite a performance decrease when searching for
> > phrases that have terms with a high ttf value.
> >
> > E.g. searching for "note of sale" is around 3 times slower (~10 sec) than
> > "bill of sale" (~3 sec). This behaviour is consistent and can also be
> > reproduced when we use other terms that have a high ttf. We are
> > querying the unstemmed index.
> >
> > Terms (numDocs: 26220184):
> >
> >    - "bill" -> df = 1.879.324, ttf = 14.145.950
> >    - "note" -> df = 8.479.826, ttf = 151.249.542
> >    - "sale" -> df = 7.557.685, ttf = 120.948.163
> >    - "bill" -> df = 21.244.060, ttf = 6.879.196.700
> >
> >
> > Is this the expected behaviour or is there something that can be
> > tuned, like a cache setting?
> >
> > Thanks,
> > Sjoerd
> >
>
