Bugs item #1515829, was opened at 2006-07-02 08:26
Message generated for change (Comment added) made by tim_one
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1515829&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Regular Expressions
>Group: Feature Request
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Erik Demaine (edemaine)
>Assigned to: Nobody/Anonymous (nobody)
Summary: Exponential behavior in regular expression

Initial Comment:
're' seems to have serious performance trouble with nested Kleene stars in
the regular expression, if the matched text is fairly long. Attached is an
example, naturally arising in some LaTeX parsing [admittedly not the only
way to do it], along with a text generator parameterized by a repetition
count n. Here is simple timing data on a Pentium 4 1.5GHz with 1.5GB RAM
as a function of n:

[...]
n=4:  0:00:00.015000
n=5:  0:00:00.032000
n=6:  0:00:00.140000
n=7:  0:00:00.594000
n=8:  0:00:02.203000
n=9:  0:00:08.859000
n=10: 0:00:39.641000
n=11: 0:02:44.172000
n=12: 0:10:23.500000

This seems far slower than it should be, but I don't know the algorithm
used by 're'. Is this behavior expected? If so, should it be optimized
away by changing the algorithm?

The generated text consists of a few lines of preamble, then a variable
number n of copies of a particular line, followed by a few lines of
postscript. The first line of the postscript causes the regular expression
*not* to match, and 're' spends a long time to find that out. Removing
that line from the postscript, and causing the regular expression to
match, makes the program run instantly.

I get the same behavior on Python 2.4 and 2.5b1, on Windows and Linux,
and with re.sub and re.search.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2006-07-02 17:40

Message:
Logged In: YES
user_id=31435

Since this isn't going to change, I'm closing this.
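[Editorial aside: the reporter's attachments aren't reproduced in this
archive, but the kind of failing-match blow-up measured above is easy to
sketch with the toy nested-star pattern Tim gives in his 09:13 comment
further down; the pattern and sizes here are illustrative, not the
reporter's actual regexp or data.]

```python
import re
import time

# Toy stand-in for the reporter's nested-Kleene-star regexp:
# ^((a+)*)*$ can never match "a"*n + "b", and a backtracking
# engine takes exponential time to prove that failure.
pat = re.compile(r'^((a+)*)*$')

for n in range(6, 14):
    text = 'a' * n + 'b'
    start = time.perf_counter()
    assert pat.match(text) is None  # never matches; the runtime is the point
    print('n=%d: %.6fs' % (n, time.perf_counter() - start))
```

As with the reporter's numbers, each increment of n multiplies the failing-match time by a roughly constant factor.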
I don't know exactly what your regexp is intended to match, but I expect
this will help speed it enormously: inside a group, one of the
alternatives is the negated character class (NCC):

    [^{}%]

One of the other alternatives starts with "{" and another with "%".
That's very good, because those three alternatives are mutually exclusive
based on just the current character in the target string.

However, yet another alternative starts with a backslash, and it's thus
ambiguous whether the backslash should be matched by that alternative or
by the NCC. Because this is a backtracking engine, and the NCC is the
first alternative, it tries the NCC first and won't try the backslash
alternative unless it's impossible to find a match having tried the NCC.
That can cause exponential-time failing-match behavior all by itself.

If it's the case that a backslash in this context is _always_ supposed to
match the \\. alternative, then adding a backslash to the NCC removes the
ambiguity and greatly speeds (at least) failing matches:

    [^{}%\\]

Then which alternative is supposed to match is entirely determined by the
current character in the target string, so when backtracking on failure
all other alternatives fail at once, and backtracking continues with at
worst insignificant delay.

Adding a backslash to the "inner" NCC helps a little, but adding one to
the "outer" NCC too is very effective:

n=0: 0:00:00
n=1: 0:00:00
n=2: 0:00:00
...
n=97: 0:00:00
n=98: 0:00:00
n=99: 0:00:00

See Friedl's book for much more along these lines.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2006-07-02 09:13

Message:
Logged In: YES
user_id=31435

Yes, it's easy to provoke exponential-time behavior. For a simple
example, the regexp

    ^((a+)*)*$

takes O(3**n) time to fail to match strings of the form

    "a"*n + "b"

Python's matcher is a backtracking engine, much like Perl's, and most
other languages' non-POSIX re facilities.
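[Editorial aside: the NCC disambiguation described in the 17:40 comment
above can be sketched with a hypothetical two-alternative pattern (not
the reporter's actual LaTeX regexp); the only difference between the two
compiled patterns is the extra \\ inside the negated character class.]

```python
import re
import time

# Ambiguous: [^{}%] also matches a backslash, so every backslash in the
# target can be taken by either alternative -- exponentially many ways
# to carve up the string when the overall match is doomed to fail.
ambiguous = re.compile(r'^(?:[^{}%]|\\.)*\{$')
# Unambiguous: backslash is excluded from the NCC, so the current
# character alone decides the alternative and failure is found at once.
unambiguous = re.compile(r'^(?:[^{}%\\]|\\.)*\{$')

text = '\\a' * 16  # 16 escape pairs, no trailing '{' -> both must fail

for name, pat in (('ambiguous', ambiguous), ('unambiguous', unambiguous)):
    start = time.perf_counter()
    assert pat.match(text) is None
    print('%-11s %.6fs' % (name, time.perf_counter() - start))
```

Doubling the number of escape pairs roughly squares the ambiguous pattern's failing-match time, while the unambiguous one stays effectively instant.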
There's nary a DFA in sight ;-)

Jeffrey Friedl's thick O'Reilly book "Mastering Regular Expressions" is
mostly about the pragmatics of using such engines efficiently:

    http://regex.info/

Note that there's no current hope that this will change: because of
gimmicks like backreferences, these aren't CompSci's idea of regular
expressions, and no "thoroughly efficient" implementation technique is
known. For example, this teensy regexp:

    ^(aa+)\1+$

matches strings of a's whose length isn't prime, and finds a non-trivial
factor when the length is composite. For harder-to-solve but messier
examples:

    http://perl.plover.com/NPC/

----------------------------------------------------------------------
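[Editorial aside: both claims about ^(aa+)\1+$ can be checked in a few
lines; the helper name below is ours, not from the thread. The group
grabs d >= 2 a's and \1+ demands at least one further copy, so a match
exists exactly when the length factors as d*k with d, k >= 2.]

```python
import re

# ^(aa+)\1+$ matches 'a'*n iff n is composite. Greedy matching tries
# large group lengths first, so the group that succeeds holds n's
# largest proper divisor (i.e. n divided by its smallest prime factor).
def nontrivial_factor(n):
    m = re.match(r'^(aa+)\1+$', 'a' * n)
    return len(m.group(1)) if m else None  # None -> n is prime or n < 4

for n in range(2, 20):
    print(n, nontrivial_factor(n))
```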