[Bryan Olson]
>>> Unfortunately, the stuff about NFA's is wrong.  Friedl's awful
>>> book
[Tim Peters]
>> Strongly disagree: [...]

[Bryan]
> I know I'm disagreeing with a lot of smart people in panning
> the book.

That's allowed :-)

>>> What Python uses is search-and-backtrack.  Unfortunately such
>>> engines don't have much theory behind them, and it's hard to
>>> reason generally about what they do.

>> Yup X 3, and the last is precisely why Friedl's book is valuable for
>> people using "NFA" implementations:  Friedl does a good job of
>> explaining when and why you're likely to trigger atrocious runtime
>> performance, and gives practical general techniques for avoiding those
>> problems.  None of that has anything to do with CompSci regexps
>> either, but is vital knowledge for people hoping to make happy
>> non-trivial use of Python/Perl/etc regexps.

> I read the first edition, minus some engine-specific stuff.
> General techniques are what I want, and I didn't see them in the
> book.  He had things to do and not do, but it didn't add up to
> building (ir-)regular expressions with reliably tractable
> behavior.

My guess then is that the book moved at too slow a pace for you, and you
got bored.  The most valuable general technique he (eventually ;-)
explained he called "unrolling", and it consists of writing a regexp in
the form:

    normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can start
are disjoint.  Under most "NFA" implementations, such a regexp is immune
to most bad behaviors, whether finding very long matches, or searching a
very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end up
writing a regexp that matches quickly when a short match exists, but
"blows up" (stack fault, or some checked internal limit hit) if only a
very long match exists; and/or takes even exponential time to fail when
no match exists at all.

For example, a direct application is writing a robust regular expression
to match C string literals (also one form of Python string literal:
a double-quote character at each end, and backslash escapes, including
backslash-newline to span lines):

    " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character: not a double
quote, backslash, or newline.  The "special case" is a backslash followed
by any character whatsoever.  The "normal" and "special" cases obviously
don't intersect, so the interior of this regexp (i.e., ignoring the
double quotes at each end) fits the "unrolling" pattern to a tee.  As a
result, you can use it with reasonably high confidence.

For /why/ that's so, read the rest of his book ;-)  In effect, it gives
the search engine only one possible course of action at each character in
the target string.  So long as the character is "normal", the engine has
to stay in the "normal" part of the regexp, and as soon as it's "special"
it has to leave the "normal" part.  The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).

The unrolling idea was new to me, and as soon as I learned it I replaced
lots of regexps in the Python library with rewritten versions that
demonstrably performed very much better -- even allowed closing some bug
reports concerning extreme cases (extremely long matches, and unbearable
runtime on long strings without any matches).
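For concreteness, here is a minimal sketch of that pattern with Python's
re module; the unrolled regexp is the one above, while the overlapping
"naive" variant and the small test strings are illustrative additions for
contrast, not anything taken from Friedl's book:

    import re

    # The unrolled C-string-literal regexp from above, spelled out with
    # re.VERBOSE so the normal*/special structure stays visible.
    unrolled = re.compile(r'''
        "                    # opening double quote
        [^"\\\n]*            # normal*: anything but quote, backslash, newline
        (?:
            \\[\x00-\xff]    # special: a backslash plus any one character
            [^"\\\n]*        # normal*
        )*
        "                    # closing double quote
    ''', re.VERBOSE)

    # An overlapping-alternatives version (illustrative only): [^"] can also
    # match a backslash, so a run of escapes can be carved up in
    # exponentially many ways when the overall match fails.
    naive = re.compile(r'"(?:\\[\x00-\xff]|[^"])*"')

    for s in [r'"hello"', r'"say \"hi\""', '"spans a\\\nline"']:
        assert unrolled.match(s) and naive.match(s)

    # On an *unterminated* literal full of escapes, the unrolled form fails
    # at once; the naive form backtracks harder and harder as the number of
    # escapes grows (bump 20 up and time naive.match to see it).
    horror = '"' + r'\a' * 20
    assert unrolled.match(horror) is None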
A different lesson to take from this is that nobody is likely to stumble
into effective "NFA" techniques based on luck or "common sense".  It's a
set of obscure & specialized learned skills.