I looked into the report in bug #14563 about pg_trgm giving wrong answers for regexp searches. There is a whole lot of complex mechanism in trgm_regexp.c for extracting trigram patterns out of regexps, and so I was afraid that it might be a very subtle bug, but actually the problem seems to be in an apparently-trivial step. That's the part in selectColorTrigrams() where we discard trigrams if there are too many, fixing the graph by merging the states linked by a discarded trigram. We are not supposed to merge initial and final states, because if we did the graph would be broken, and the code that checks for that looks simple enough ... but it's wrong. Consider
C C Sinit ---> Sx ---> Sfin where the two arcs depicted are labeled by the same color trigram. We will examine each arc and decide that it doesn't link the init and fin states, and thereby decide that it's okay to remove C. But removing the first arc merges Sx into Sinit, and removing the second arc merges Sfin into Sx and thereby into Sinit, and we've blown it. I think a reasonable fix for this is to add a "tentative parent" field into struct TrgmState that is kept NULL except during this specific logic. In the loop looking for disallowed merges, set "tentative parent" to indicate a merge we would make if successful, and chase up tentative as well as real parent links. Then reset all the tentative links before moving on, whether or not the merge is allowed. In the example above, we'd set Sx's tentative parent to Sinit, then while looking at the second arc we would chase up from Sx to Sinit and realize we can't merge. It seems like a good idea, as well, for either mergeStates() or its sole caller to have an Assert, if not indeed a full-fledged runtime check, that we aren't merging init and final states. Also, I notice that the "children" state lists are useless and could just be dropped. The only thing they're used for at all is reducing the length of the parent-link chains, but I see no point in that considering the code isn't relying on those chains to be length 1. It's very unlikely that we save enough cycles by having shorter chains to pay for the overhead of allocating and maintaining the children lists. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers