I don't know the details of the algorithm for generating a state machine to match a PCRE. I was wondering if there is an optimization so that matching an alteration of subpatterns is faster than than matching each subpattern sequentially.
On Tue, Jan 15, 2019 at 11:21 AM Leif Hedstrom <zw...@apache.org> wrote: > > > > > On Jan 15, 2019, at 10:09 AM, Walt Karas <wka...@oath.com.INVALID> wrote: > > > > When we remap we go through the list of regexs looking for a match. > > What if we took all the regexes, removed subpattern captures, and > > created one giant pattern with alternation and capture of the regex > > for each rule, like: > > > > (regex1)|(regex2)|...|(regexN) > > > > We could then just step through the array of subpattern capture > > offsets to find the matching rule. Then do a second match against > > just the original rule regex to get any captures it contains. Would > > this work, and would it be faster? > > > That could potentially be a lot slower, no? Imagine you have a remap today > with > > regex_map <regex1> ... > regex_map <regex2> ... > regex_map <regex3> ... > regex_map <regex4> … > > > If you know what you are doing, and have some luck, you could order these > such as the most common regex is matched early. And first match wins, and > therefore, you could stop evaluating regexes early. Even in an average case, > you’d only evaluate half of the regexes, right? > > But if you combine all these into one regex, you essentially have to evaluate > all regexes all the time, regardless of order and regardless if something has > matched. > > — Leif > >