The following documentation comment has been logged on the website: Page: https://www.postgresql.org/docs/16/textsearch-controls.html Description:
This is a followup to https://www.postgresql.org/message-id/171175476214.7104.6233899851600749789%40wrigleys.postgresql.org I am a developer on the team that fixed the performance issues with our usage of ts_headline. I want to give a little bit of context on how we noticed the issue and fixed it. We first noticed the issue when we began indexing large numbers of multi-megabyte text content in our database. It was causing our search endpoint to slow down significantly and, in many cases, time out with query times greater than 10 seconds. We assumed it was a performance issue with search (maybe with our indexing or vector implementation), but in a process of elimination we happened to remove our ts_headline annotations and that fixed the issue. We realized that adding ts_headline back and slowly increasing our result set showed that for some results in the set, ts_headline came back quickly (less than 10ms) and in other cases it added hundreds (occasionally thousands) of milliseconds for a single result. From there we computed that in our particular use case, ts_headline seems to have a maximum processing speed of between 3-5 MB/s, and processing entire pages of results with some results containing multiple megabytes of text was not feasible. We solved this issue in three ways: First, we are ensuring that our frontend is fully utilizing pagination to reduce the number of results ts_headline must process per request. Second, we are forcing our Django application to paginate the data before generating headlines, thereby ensuring that ts_headline is not running for the entire result set of the search operation, but only a subset. This was not immediately obvious as Django's ORM seems to attempt to optimize to the smallest number of queries for a given operation by default, instead of optimizing toward least time. We accomplished this by searching, then paginating, then extracting the primary keys of the pagination result into a list which we feed into another query for ts_headline to work on. Finally, we are annotating a substring (currently the first 50,000 characters) of the field that often contains large amounts of text and running ts_headline on that instead. We found that if we stick to a maximum per result text field size of 50k, ts_headline consistently takes 10ms or less per result in our use case, and we find that acceptable for the page sizes that we are returning. The tradeoff with this solution is that ts_headline will not always find text to highlight in the substring, but we find that acceptable as well due to our need for better performance. It may be helpful to include something like this in the docs for ts_headline: > ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care. For example: > * Consider using pagination when searching through large data sets. > * If you're using pagination, ensure your application is paginating the data before generating headlines, so that ts_headline is not running for the entire result set of the search operation, but only a subset. > * Consider limiting ts_headline to running on a limited amount of text from each document, instead of the entire document. > * Consider setting up a timeout for the ts_headline operation, so that if it takes too long, search results are delivered without headlines.