Here's another little piece of regex performance hacking. This is based on looking at a slow regexp I found in Tcl's bug tracker:
-- Adapted from http://core.tcl.tk/tcl/tktview?name=446565 select regexp_matches( repeat('<script> 123 </script> <script> 345 </script> <script> 123 </script>', 100000), '<script(.(?!</script>))*?(doubleclick|flycast|burstnet|spylog)+?.*?</script>'); The core of the problem here is the lookahead constraint (?!</script>), which gets applied O(N^2) times for an N-character data string. The present patch doesn't do anything to cut down the big-O problem, but it does take a swipe at cutting the constant factor, which should remain useful even if we find a way to avoid the O(N^2) issue. Poking at this with perf, I was surprised to observe that the dominant cost is not down inside lacon() as one would expect, but in the loop in miss() that is deciding where to call lacon(). 80% of the runtime is going into these three lines: for (i = 0; i < d->nstates; i++) if (ISBSET(d->work, i)) for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++) So there are two problems here. The outer loop is iterating over all the NFA states, even though only a small fraction of the states are likely to have LACON out-arcs. (In the case at hand, the main NFA has 78 states, of which just one has LACON out-arcs.) Then, for every reachable state, we're scanning all its out-arcs to find the ones that are LACONs. (Again, just a fraction of the out-arcs are likely to be LACONs.) So the main thrust of this patch is to rearrange the "struct cnfa" representation to separate plain arcs from LACON arcs, allowing this loop to not waste time looking at irrelevant states or arcs. This also saves some time in miss()'s preceding main loop, which is only interested in plain arcs. Splitting the LACON arcs from the plain arcs complicates matters in a couple of other places, but none of them are in the least performance-critical. The other thing I noticed while looking at miss() is that it will call lacon() for each relevant arc, even though it's quite likely to see multiple arcs labeled with the same constraint number, for which the answer must be the same. So I added some simple logic to cache the last answer and re-use it if the next arc of interest has the same color. (We could imagine working harder to cache in the presence of multiple interesting LACONs, but I'm doubtful that it's worth the trouble. The one-entry cache logic is so simple it can hardly be a net loss, though.) On my machine, the combination of these two ideas reduces the runtime of the example above from ~150 seconds to ~53 seconds, or nearly 3x better. I see something like a 2% improvement on Joel's test corpus, which might just be noise. So this isn't any sort of universal panacea, but it sure helps when LACON evaluation is the bottleneck. Any objections? or better ideas? regards, tom lane
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c index 69929eddfc..3da394b9a5 100644 --- a/src/backend/regex/regc_nfa.c +++ b/src/backend/regex/regc_nfa.c @@ -3182,81 +3182,145 @@ compact(struct nfa *nfa, { struct state *s; struct arc *a; - size_t nstates; - size_t narcs; - struct carc *ca; - struct carc *first; + size_t nstates = 0; + size_t nlastates = 0; + size_t nplainarcs = 0; + size_t nlaarcs = 0; + struct carc *cpa; + struct carc *cla; assert(!NISERR()); - nstates = 0; - narcs = 0; for (s = nfa->states; s != NULL; s = s->next) { + size_t nthislaarcs = 0; + nstates++; - narcs += s->nouts + 1; /* need one extra for endmarker */ + for (a = s->outs; a != NULL; a = a->outchain) + { + switch (a->type) + { + case PLAIN: + nplainarcs++; + break; + case LACON: + nthislaarcs++; + break; + default: + NERR(REG_ASSERT); + return; + } + } + nplainarcs++; /* each state needs an endmarker arc */ + if (nthislaarcs > 0) + { + nlastates++; + nlaarcs += nthislaarcs + 1; /* endmarker needed */ + } } cnfa->stflags = (char *) MALLOC(nstates * sizeof(char)); - cnfa->states = (struct carc **) MALLOC(nstates * sizeof(struct carc *)); - cnfa->arcs = (struct carc *) MALLOC(narcs * sizeof(struct carc)); - if (cnfa->stflags == NULL || cnfa->states == NULL || cnfa->arcs == NULL) + cnfa->firstarc = (struct carc **) MALLOC(nstates * sizeof(struct carc *)); + cnfa->arcs = (struct carc *) + MALLOC((nplainarcs + nlaarcs) * sizeof(struct carc)); + if (nlastates > 0) + { + cnfa->lastates = (int *) MALLOC(nlastates * sizeof(int)); + cnfa->firstlaarc = (struct carc **) + MALLOC(nlastates * sizeof(struct carc *)); + } + else + { + cnfa->lastates = NULL; + cnfa->firstlaarc = NULL; + } + if (cnfa->stflags == NULL || cnfa->firstarc == NULL || cnfa->arcs == NULL || + (nlastates > 0 && (cnfa->lastates == NULL || cnfa->firstlaarc == NULL))) { if (cnfa->stflags != NULL) FREE(cnfa->stflags); - if (cnfa->states != NULL) - FREE(cnfa->states); + if (cnfa->firstarc != NULL) + FREE(cnfa->firstarc); if (cnfa->arcs != NULL) FREE(cnfa->arcs); + if (cnfa->lastates != NULL) + FREE(cnfa->lastates); + if (cnfa->firstlaarc != NULL) + FREE(cnfa->firstlaarc); NERR(REG_ESPACE); return; } + assert(nstates > 0); cnfa->nstates = nstates; + cnfa->ncolors = maxcolor(nfa->cm) + 1; + cnfa->flags = nfa->flags; + if (nlastates > 0) + cnfa->flags |= HASLACONS; + cnfa->nlastates = 0; /* will set below */ cnfa->pre = nfa->pre->no; cnfa->post = nfa->post->no; cnfa->bos[0] = nfa->bos[0]; cnfa->bos[1] = nfa->bos[1]; cnfa->eos[0] = nfa->eos[0]; cnfa->eos[1] = nfa->eos[1]; - cnfa->ncolors = maxcolor(nfa->cm) + 1; - cnfa->flags = nfa->flags; cnfa->minmatchall = nfa->minmatchall; cnfa->maxmatchall = nfa->maxmatchall; - ca = cnfa->arcs; + cpa = cnfa->arcs; /* where to store plain arcs */ + cla = cnfa->arcs + nplainarcs; /* where to store LA arcs */ for (s = nfa->states; s != NULL; s = s->next) { + struct carc *firstp = cpa; + struct carc *firstla = cla; + assert((size_t) s->no < nstates); cnfa->stflags[s->no] = 0; - cnfa->states[s->no] = ca; - first = ca; + cnfa->firstarc[s->no] = cpa; for (a = s->outs; a != NULL; a = a->outchain) + { switch (a->type) { case PLAIN: - ca->co = a->co; - ca->to = a->to->no; - ca++; + cpa->co = a->co; + cpa->to = a->to->no; + cpa++; break; case LACON: assert(s->no != cnfa->pre); assert(a->co >= 0); - ca->co = (color) (cnfa->ncolors + a->co); - ca->to = a->to->no; - ca++; - cnfa->flags |= HASLACONS; + if (cla == firstla) + { + cnfa->lastates[cnfa->nlastates] = s->no; + cnfa->firstlaarc[cnfa->nlastates] = cla; + cnfa->nlastates++; + } + cla->co = a->co; + cla->to = a->to->no; + cla++; break; default: NERR(REG_ASSERT); return; } - carcsort(first, ca - first); - ca->co = COLORLESS; - ca->to = 0; - ca++; + } + /* sort plain arcs, just for neatness */ + carcsort(firstp, cpa - firstp); + /* add endmarker for plain arcs */ + cpa->co = COLORLESS; + cpa->to = 0; + cpa++; + /* and the same for LA arcs, if this state has any */ + if (cla != firstla) + { + carcsort(firstla, cla - firstla); + cla->co = COLORLESS; + cla->to = 0; + cla++; + } } - assert(ca == &cnfa->arcs[narcs]); - assert(cnfa->nstates != 0); + assert(cpa == &cnfa->arcs[nplainarcs]); + assert(cla == &cnfa->arcs[nplainarcs + nlaarcs]); + assert(nlastates == cnfa->nlastates); /* mark no-progress states */ for (a = nfa->pre->outs; a != NULL; a = a->outchain) @@ -3299,7 +3363,11 @@ freecnfa(struct cnfa *cnfa) { assert(!NULLCNFA(*cnfa)); /* not empty already */ FREE(cnfa->stflags); - FREE(cnfa->states); + FREE(cnfa->firstarc); + if (cnfa->lastates != NULL) + FREE(cnfa->lastates); + if (cnfa->firstlaarc != NULL) + FREE(cnfa->firstlaarc); FREE(cnfa->arcs); ZAPCNFA(*cnfa); } @@ -3518,18 +3586,19 @@ dumpcstate(int st, FILE *f) { struct carc *ca; + bool hasarc = false; int pos; + int i; fprintf(f, "%d%s", st, (cnfa->stflags[st] & CNFA_NOPROGRESS) ? ":" : "."); pos = 1; - for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) + /* Dump its plain out-arcs */ + for (ca = cnfa->firstarc[st]; ca->co != COLORLESS; ca++) { if (ca->co == RAINBOW) fprintf(f, "\t[*]->%d", ca->to); - else if (ca->co < cnfa->ncolors) - fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to); else - fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to); + fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to); if (pos == 5) { fprintf(f, "\n"); @@ -3537,8 +3606,28 @@ dumpcstate(int st, } else pos++; + hasarc = true; + } + /* Dump its LACON out-arcs, if any */ + for (i = 0; i < cnfa->nlastates; i++) + { + if (cnfa->lastates[i] != st) + continue; + for (ca = cnfa->firstlaarc[i]; ca->co != COLORLESS; ca++) + { + fprintf(f, "\t:%ld:->%d", (long) ca->co, ca->to); + if (pos == 5) + { + fprintf(f, "\n"); + pos = 1; + } + else + pos++; + hasarc = true; + } + break; /* needn't look through rest of lastates[] */ } - if (ca == cnfa->states[st] || pos != 1) + if (pos != 1 || !hasarc) fprintf(f, "\n"); fflush(f); } diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c index 1db52d1005..ad2c98bd91 100644 --- a/src/backend/regex/rege_dfa.c +++ b/src/backend/regex/rege_dfa.c @@ -677,6 +677,8 @@ miss(struct vars *v, int gotstate; int dolacons; int sawlacons; + color lastlacon; + int lastlaresult; /* for convenience, we can be called even if it might not be a miss */ if (css->outs[co] != NULL) @@ -709,7 +711,7 @@ miss(struct vars *v, gotstate = 0; for (i = 0; i < d->nstates; i++) if (ISBSET(css->states, i)) - for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++) + for (ca = cnfa->firstarc[i]; ca->co != COLORLESS; ca++) if (ca->co == co || (ca->co == RAINBOW && !ispseudocolor)) { @@ -725,27 +727,33 @@ miss(struct vars *v, return NULL; /* character cannot reach any new state */ dolacons = (cnfa->flags & HASLACONS); sawlacons = 0; + lastlacon = COLORLESS; + lastlaresult = 0; /* doesn't matter */ /* outer loop handles transitive closure of reachable-by-LACON states */ while (dolacons) { + int j; + dolacons = 0; - for (i = 0; i < d->nstates; i++) + for (j = 0; j < cnfa->nlastates; j++) + { + i = cnfa->lastates[j]; if (ISBSET(d->work, i)) - for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++) + for (ca = cnfa->firstlaarc[j]; ca->co != COLORLESS; ca++) { - if (ca->co < cnfa->ncolors) - continue; /* not a LACON arc */ if (ISBSET(d->work, ca->to)) continue; /* arc would be a no-op anyway */ sawlacons = 1; /* this LACON affects our result */ - if (!lacon(v, cnfa, cp, ca->co)) + /* we might have just eval'd this same LACON */ + if (ca->co != lastlacon) { + lastlacon = ca->co; + lastlaresult = lacon(v, cnfa, cp, lastlacon); if (ISERR()) return NULL; - continue; /* LACON arc cannot be traversed */ } - if (ISERR()) - return NULL; + if (!lastlaresult) + continue; /* LACON arc cannot be traversed */ BSET(d->work, ca->to); dolacons = 1; if (ca->to == cnfa->post) @@ -754,6 +762,7 @@ miss(struct vars *v, noprogress = 0; FDEBUG(("%d :> %d\n", i, ca->to)); } + } } h = HASH(d->work, d->wordsper); @@ -805,9 +814,8 @@ static int /* predicate: constraint satisfied? */ lacon(struct vars *v, struct cnfa *pcnfa, /* parent cnfa */ chr *cp, - color co) /* "color" of the lookaround constraint */ + int n) /* index of the lookaround constraint */ { - int n; struct subre *sub; struct dfa *d; chr *end; @@ -820,7 +828,6 @@ lacon(struct vars *v, return 0; } - n = co - pcnfa->ncolors; assert(n > 0 && n < v->g->nlacons && v->g->lacons != NULL); FDEBUG(("=== testing lacon %d\n", n)); sub = &v->g->lacons[n]; diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index cf95989948..69acc285f7 100644 --- a/src/backend/regex/regexec.c +++ b/src/backend/regex/regexec.c @@ -160,7 +160,7 @@ static void freedfa(struct dfa *); static unsigned hash(unsigned *, int); static struct sset *initialize(struct vars *, struct dfa *, chr *); static struct sset *miss(struct vars *, struct dfa *, struct sset *, color, chr *, chr *); -static int lacon(struct vars *, struct cnfa *, chr *, color); +static int lacon(struct vars *, struct cnfa *, chr *, int); static struct sset *getvacant(struct vars *, struct dfa *, chr *, chr *); static struct sset *pickss(struct vars *, struct dfa *, chr *, chr *); diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c index a493dbe88c..964a017f84 100644 --- a/src/backend/regex/regexport.c +++ b/src/backend/regex/regexport.c @@ -95,6 +95,7 @@ traverse_lacons(struct cnfa *cnfa, int st, regex_arc_t *arcs, int arcs_len) { struct carc *ca; + int i; /* * Since this function recurses, it could theoretically be driven to stack @@ -103,20 +104,24 @@ traverse_lacons(struct cnfa *cnfa, int st, */ check_stack_depth(); - for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) + /* Count and possibly emit state's ordinary out-arcs */ + for (ca = cnfa->firstarc[st]; ca->co != COLORLESS; ca++) { - if (ca->co < cnfa->ncolors) + int ndx = (*arcs_count)++; + + if (ndx < arcs_len) { - /* Ordinary arc, so count and possibly emit it */ - int ndx = (*arcs_count)++; - - if (ndx < arcs_len) - { - arcs[ndx].co = ca->co; - arcs[ndx].to = ca->to; - } + arcs[ndx].co = ca->co; + arcs[ndx].to = ca->to; } - else + } + + /* Find state's LACON out-arcs, if any */ + for (i = 0; i < cnfa->nlastates; i++) + { + if (cnfa->lastates[i] != st) + continue; + for (ca = cnfa->firstlaarc[i]; ca->co != COLORLESS; ca++) { /* LACON arc --- assume it's satisfied and recurse... */ /* ... but first, assert it doesn't lead directly to post state */ @@ -124,6 +129,7 @@ traverse_lacons(struct cnfa *cnfa, int st, traverse_lacons(cnfa, ca->to, arcs_count, arcs, arcs_len); } + break; /* needn't look through rest of lastates[] */ } } diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c index ec435b6f5f..31e158000f 100644 --- a/src/backend/regex/regprefix.c +++ b/src/backend/regex/regprefix.c @@ -131,7 +131,7 @@ findprefix(struct cnfa *cnfa, */ st = cnfa->pre; nextst = -1; - for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) + for (ca = cnfa->firstarc[st]; ca->co != COLORLESS; ca++) { if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1]) { @@ -146,6 +146,8 @@ findprefix(struct cnfa *cnfa, if (nextst == -1) return REG_NOMATCH; + /* The "pre" state won't have LACON out-arcs, so don't bother looking */ + /* * Scan through successive states, stopping as soon as we find one with * more than one acceptable transition character (either multiple colors @@ -164,18 +166,14 @@ findprefix(struct cnfa *cnfa, st = nextst; nextst = -1; thiscolor = COLORLESS; - for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) + for (ca = cnfa->firstarc[st]; ca->co != COLORLESS; ca++) { /* We can ignore BOS/BOL arcs */ if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1]) continue; - - /* - * ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs - * and LACONs - */ + /* ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs */ if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1] || - ca->co == RAINBOW || ca->co >= cnfa->ncolors) + ca->co == RAINBOW) { thiscolor = COLORLESS; break; @@ -198,6 +196,21 @@ findprefix(struct cnfa *cnfa, break; } } + /* If good so far, check to see if state has LACON outarcs */ + if (thiscolor != COLORLESS) + { + int i; + + for (i = 0; i < cnfa->nlastates; i++) + { + if (cnfa->lastates[i] == st) + { + /* Yes, so terminate the search */ + thiscolor = COLORLESS; + break; + } + } + } /* Done if we didn't find exactly one color on plain outarcs */ if (thiscolor == COLORLESS) break; @@ -235,7 +248,7 @@ findprefix(struct cnfa *cnfa, * even if the string is of zero length. */ nextst = -1; - for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) + for (ca = cnfa->firstarc[st]; ca->co != COLORLESS; ca++) { if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1]) { @@ -253,6 +266,20 @@ findprefix(struct cnfa *cnfa, break; } } + /* If good so far, check to see if state has LACON outarcs */ + if (nextst != -1) + { + int i; + + for (i = 0; i < cnfa->nlastates; i++) + { + if (cnfa->lastates[i] == st) + { + nextst = -1; + break; + } + } + } if (nextst == cnfa->post) return REG_EXACT; diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 306525eb5f..6b3b399b2c 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -345,17 +345,15 @@ struct nfa * * The main space savings in a compacted NFA is from making the arcs as small * as possible. We store only the transition color and next-state number for - * each arc. The list of out arcs for each state is an array beginning at - * cnfa.states[statenumber], and terminated by a dummy carc struct with - * co == COLORLESS. + * each arc. The list of plain out-arcs for each state is an array beginning + * at cnfa.firstarc[statenumber], and terminated by a dummy carc struct with + * co == COLORLESS. If a state has LACON out-arcs (most don't), then its + * state number appears in cnfa.lastates[k] for some k, and its first LACON + * out-arc is at cnfa.firstlaarc[k]. That array is likewise terminated by a + * dummy carc struct with co == COLORLESS. The "co" values for LACON out-arcs + * are lookaround constraint numbers. * - * The non-dummy carc structs are of two types: plain arcs and LACON arcs. - * Plain arcs just store the transition color number as "co". LACON arcs - * store the lookaround constraint number plus cnfa.ncolors as "co". LACON - * arcs can be distinguished from plain by testing for co >= cnfa.ncolors. - * - * Note that in a plain arc, "co" can be RAINBOW; since that's negative, - * it doesn't break the rule about how to recognize LACON arcs. + * Note that in a plain arc, "co" can be RAINBOW rather than a regular color. * * We have special markings for "trivial" NFAs that can match any string * (possibly with limits on the number of characters therein). In such a @@ -375,17 +373,20 @@ struct cnfa { int nstates; /* number of states */ int ncolors; /* number of colors (max color in use + 1) */ - int flags; + int flags; /* bitmask of the following flags: */ #define HASLACONS 01 /* uses lookaround constraints */ #define MATCHALL 02 /* matches all strings of a range of lengths */ + int nlastates; /* number of states with lookaround outarcs */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ color eos[2]; /* colors, if any, assigned to EOS and EOL */ char *stflags; /* vector of per-state flags bytes */ #define CNFA_NOPROGRESS 01 /* flag bit for a no-progress state */ - struct carc **states; /* vector of pointers to outarc lists */ - /* states[n] are pointers into a single malloc'd array of arcs */ + struct carc **firstarc; /* per-state pointers to plain outarc lists */ + int *lastates; /* numbers of states with lookaround outarcs */ + struct carc **firstlaarc; /* pointers to LA outarcs for such states */ + /* firstarc[n] & firstlaarc[n] are pointers into a single array of arcs */ struct carc *arcs; /* the area for the lists */ /* these fields are used only in a MATCHALL NFA (else they're -1): */ int minmatchall; /* min number of chrs to match */