I wrote:
> (I wonder if we need to try to make it faster.  I'd supposed that the
> loop was cheap enough to be a non-problem, but with large enough
> documents maybe not?  It seems like converting to a hash table could
> be worthwhile for a large doc.)
OK, I dug into Stephen's test case off-list.  While some CFIs would be a good idea, that's just band-aid'ing the symptom.  The actual problem is that hlCover() is taking way too much time.  The test case boils down to "common_word & rare_word" matched to a very long document, wherein the rare_word appears only near the front.  Once we're past that match, hlCover() tries all the remaining matches for common_word, of which there are plenty ... and for each one, it scans clear to the end of the document, looking vainly for a substring that will satisfy the "common_word & rare_word" query.  So obviously, this is O(N^2) in the length of the document :-(.

I have to suppose that I introduced this problem in c9b0c678d, since AFAIR we weren't getting ts_headline-takes-forever complaints before that.  It's not immediately obvious why the preceding algorithm didn't have a similar issue, but then there wasn't anything at all that was obvious about the preceding algorithm.

The most obvious way to fix the O(N^2) hazard is to put a limit on the length of "cover" (matching substring) that we'll consider.  For the mark_hl_words caller, we won't highlight more than max_words words anyway, so it would be reasonable to bound covers to that length or some small multiple of it.  The other caller, mark_hl_fragments, is willing to highlight up to max_fragments fragments of up to max_words each, and there can be some daylight between the fragments, so it's not quite clear what the longest reasonable match is.  Still, I doubt it's useful to show a headline consisting of a few words from the start of the document and a few words from thousands of words later, so a limit of max_fragments times max_words times some multiplier would probably be reasonable.

We could hard-code a rule like that, or we could introduce a new explicit parameter for the maximum cover length.  The latter would be more flexible, but we need something back-patchable, and I'm concerned about the compatibility hazards of adding a new parameter in minor releases.  So on the whole I propose hard-wiring a multiplier of, say, 10 for both these cases.

			regards, tom lane
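PS: to make the proposal concrete, here is roughly the shape of the bound I have in mind.  This is a standalone sketch, not a diff against hlCover() --- find_cover, cover_check_fn, and COVER_MULTIPLIER are names invented for illustration:

/*
 * Illustrative sketch of capping cover length (all names invented).
 * For each candidate start position, we extend the candidate cover
 * rightward looking for a substring satisfying the query, but give
 * up after max_cover words instead of scanning to the document's end.
 */
#include <stdbool.h>

#define COVER_MULTIPLIER 10		/* proposed hard-wired multiplier */

/* assumed callback: does words[first..last] satisfy the query? */
typedef bool (*cover_check_fn) (const int *words, int first, int last);

static bool
find_cover(const int *words, int doc_len, int start, int max_cover,
		   cover_check_fn satisfies, int *first, int *last)
{
	for (int b = start; b < doc_len; b++)
	{
		int			limit = b + max_cover;

		if (limit > doc_len)
			limit = doc_len;
		/* without this limit, the inner loop would run to doc_len */
		for (int e = b; e < limit; e++)
		{
			if (satisfies(words, b, e))
			{
				*first = b;
				*last = e;
				return true;
			}
		}
	}
	return false;
}

The callers would compute the cap as max_words * COVER_MULTIPLIER (for mark_hl_words) or max_fragments * max_words * COVER_MULTIPLIER (for mark_hl_fragments).  Either way the work per start position is bounded by the cap, so the whole scan becomes O(N * cap) rather than O(N^2).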