On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Alexander Korotkov <aekorot...@gmail.com> writes: > > [ trgm-regexp-0.15.patch.gz ] > > I spent the weekend hacking on this, making a number of bug fixes and a > whole lot of cosmetic changes. I think there are large parts of this > that are in committable shape now, but I still find the actual graph > transformation logic to be mostly unintelligible. I think what's most > obscure is the distinction between the arcs list and the keys list of > each state in the expanded graph. I get the impression that the > general idea is for the arcs to represent exactly-known transitions > while the keys represent imprecisely-known transitions ... but there > seems to be at least some leakage between those categories. Could > you write down a specification for what's supposed to be happening > there? >
Here is my try to specify it. At first some notions. I know, they are already in the comments, but just in order to put it together. Extended color - any color of source CNFA or one of two special values: 1) Unknown color - may represent any character either alphanumeric or non-alphanumeric. 2) Blank color - may represent any non-alphanumeric character Prefix is extended colors of last two characters read by CNFA. Key is pair of CNFA state and prefix. So, key is a extended state which is containing additional information which can influence further trigrams. So, if you are in some key and traverse some CNFA arc then you moves info another key. But there are two possible cases (or, sometimes, both of them): 1) Your move from one key into another necessary means read of some trigram. Then you create new arc labeled with that trigram. 2) You move into another key, but you doesn't necessary read an useful trigram. For example, arc of source CNFA is labeled by "unextractable" color. Then you add new key into "keys" array. And outgoing arcs from this key will also be processed similarly to source key. Therefore "keys" array is a set of keys which are achievable from "stateKey" without reading of useful trigram. We could get rid of "keys" array and produce some "empty arcs" in the second case. You can imagine that "keys" array is set of keys which are achivable by "empty arcs" from "stateKey". States connected with "empty arcs" could be merged on the next stage. However, the reason of having separated addKeys stage is optimization. In addKey function more generals keys absorb less general ones. In many cases resulting graph becomes much simplier because of this. For example, if your regex is not prefix, then engine puts self-referencing arcs of all possible colors to initial state. Straight-forward processing of this could produce enormous output graph. I had similar situation in early version of patch where keys didn't absorb earch other. ------ With best regards, Alexander Korotkov.