RE: ts_headline performance note

PG Doc comments form Sat, 06 Apr 2024 04:24:01 -0700

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/textsearch-controls.html
Description:


This is a followup to
https://www.postgresql.org/message-id/171175476214.7104.6233899851600749789%40wrigleys.postgresql.org

I am a developer on the team that fixed the performance issues with our
usage of ts_headline. I want to give a little bit of context on how we
noticed the issue and fixed it.

We first noticed the issue when we began indexing large numbers of
multi-megabyte text content in our database. It was causing our search
endpoint to slow down significantly and, in many cases, time out with query
times greater than 10 seconds. We assumed it was a performance issue with
search (maybe with our indexing or vector implementation), but in a process
of elimination we happened to remove our ts_headline annotations and that
fixed the issue.

We realized that adding ts_headline back and slowly increasing our result
set showed that for some results in the set, ts_headline came back quickly
(less than 10ms) and in other cases it added hundreds (occasionally
thousands) of milliseconds for a single result. From there we computed that
in our particular use case, ts_headline seems to have a maximum processing
speed of between 3-5 MB/s, and processing entire pages of results with some
results containing multiple megabytes of text was not feasible.

We solved this issue in three ways:

First, we are ensuring that our frontend is fully utilizing pagination to
reduce the number of results ts_headline must process per request.

Second, we are forcing our Django application to paginate the data before
generating headlines, thereby ensuring that ts_headline is not running for
the entire result set of the search operation, but only a subset. This was
not immediately obvious as Django's ORM seems to attempt to optimize to the
smallest number of queries for a given operation by default, instead of
optimizing toward least time. We accomplished this by searching, then
paginating, then extracting the primary keys of the pagination result into a
list which we feed into another query for ts_headline to work on.

Finally, we are annotating a substring (currently the first 50,000
characters) of the field that often contains large amounts of text and
running ts_headline on that instead. We found that if we stick to a maximum
per result text field size of 50k, ts_headline consistently takes 10ms or
less per result in our use case, and we find that acceptable for the page
sizes that we are returning.

The tradeoff with this solution is that ts_headline will not always find
text to highlight in the substring, but we find that acceptable as well due
to our need for better performance.

It may be helpful to include something like this in the docs for
ts_headline:

> ts_headline uses the original document, not a tsvector summary, so it can
be slow and should be used with care. For example:
> * Consider using pagination when searching through large data sets.
> * If you're using pagination, ensure your application is paginating the
data before generating headlines, so that ts_headline is not running for the
entire result set of the search operation, but only a subset.
> * Consider limiting ts_headline to running on a limited amount of text
from each document, instead of the entire document.
> * Consider setting up a timeout for the ts_headline operation, so that if
it takes too long, search results are delivered without headlines.

RE: ts_headline performance note

Reply via email to