Philippe Verdy <verd...@wanadoo.fr> added the comment:

Umm... I said that the attribution to Thompson was wrong; in fact it was correct. Thompson designed and documented the algorithm in 1968, long before the Aho/Sethi/Ullman green book... so the algorithm is more than 40 years old, and still not in Python, Perl and PCRE (but it is present in GNU awk...).
The paper published on swtch.com was indeed written in 2007, but its conclusions are perfectly valid. The interesting aspect of this paper is that it demonstrates that Thompson's multi-state NFA approach is still the best one, better than what Perl, PCRE (and PHP), Python, Ruby and others are using, and that it can also be optimized further by caching the DFA construction "on the fly" (see the blue curve on the displayed diagram) while simulating the already precompiled multi-state NFA. The cache of DFA states fills up while matching the regexp against actual strings, so matching is much faster (and much less memory- and time-greedy than generating the full DFA transition table at compilation time, as in Bison/Yacc).

However, the paper still does not discuss how to keep the DFA state cache limited in size, notably because the longer the input text is, the more DFA states the cache will contain. One simple rule is to limit the number of cached DFA states, and to let each stored transition point to multiple states in the NFA and, optionally, to a single DFA state in the cache. The DFA cache can then be managed in a LIFO manner, purging it automatically of unused states so they can be reused (to cache another new DFA state that is not yet present, once the cache has reached its maximum size). When this occurs, the other existing DFA states that point to the purged state must be cleaned up: their DFA state pointer or reference, stored in their NFA or DFA transitions, must be cleared (set to null), so that they only contain the list of pointers to outgoing NFA states. The remaining problem is how to look up a multi-state in the DFA cache: this requires a fast lookup, but it can be implemented efficiently with a hash table, by hashing the list of NFA states represented in the DFA state. (Rough sketches of both ideas follow below.)

Apparently, GNU awk does not use the cached DFA approach: it just uses the NFA directly when the input text is shorter than about two dozen characters, then builds the full DFA as soon as the input text gets larger (this explains the sudden but moderate increase in time). But I've seen that GNU awk has the defect of this unlimited DFA generation approach: its memory use becomes excessive as the input text grows, because the number of DFA states added to the cache keeps growing with the input, and the time to match each character of the input slowly increases too. At some point it will crash and exit due to memory exhaustion, when no more DFA states can be stored. That's why the DFA cache MUST be kept below some limit.

I'll try to implement this newer approach first in Java (just because I know this language better than Python, and because I think it is more solid in terms of type safety, so it can reduce the number of bugs to fix before getting something stable). In Java, there's a clean way to automatically clean objects out of collections when they are no longer needed: you just use weak references (the garbage collector will automatically clean up the older objects when needed). But this approach is not easy to port, and in fact, even if it can effectively solve some problems, it still will not forbid the cache from growing up to the maximum VM size. For performance reasons, I see little interest in storing more than about 1 million DFA states in the cache (also because the hash table would become less efficient when looking up the NFA multi-state key under which a DFA state is stored).
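As a rough illustration of the on-the-fly DFA caching described above (not the code of any existing engine), here is a minimal Python sketch: DFA states are simply frozensets of NFA state numbers, the transition cache is a hash table keyed by (multi-state, character), and a crude size limit evicts an old entry when the cache is full. The NFA is hand-built for the toy pattern (a|b)*abb, and all names (NFA_TRANS, dfa_step, DFA_CACHE_LIMIT, ...) are invented for the example.

    from collections import OrderedDict

    # Hand-built NFA for the toy pattern (a|b)*abb (anchored match).
    # NFA_TRANS[state][char] -> set of next NFA states; state 3 accepts.
    NFA_TRANS = {
        0: {'a': {0, 1}, 'b': {0}},
        1: {'b': {2}},
        2: {'b': {3}},
        3: {},
    }
    NFA_START = frozenset({0})
    NFA_ACCEPT = {3}

    DFA_CACHE_LIMIT = 1 << 20      # keep at most ~1 million cached transitions
    dfa_cache = OrderedDict()      # (frozenset of NFA states, char) -> frozenset

    def dfa_step(states, ch):
        """Compute (and cache) the DFA transition for one multi-state."""
        key = (states, ch)
        nxt = dfa_cache.get(key)
        if nxt is None:
            # Fall back to the NFA: follow every outgoing transition.
            nxt = frozenset(t for s in states
                              for t in NFA_TRANS[s].get(ch, ()))
            if len(dfa_cache) >= DFA_CACHE_LIMIT:
                dfa_cache.popitem(last=False)   # evict the oldest cached entry
            dfa_cache[key] = nxt
        return nxt

    def match(text):
        """Anchored match of (a|b)*abb against the whole of `text`."""
        states = NFA_START
        for ch in text:
            states = dfa_step(states, ch)
            if not states:
                return False
        return bool(states & NFA_ACCEPT)

    print(match("ababb"))   # True
    print(match("abab"))    # False

The eviction policy shown here (drop the oldest entry) is only one possible choice; a LIFO or LRU discipline as discussed above would only change which entry popitem() removes. The important property is the one the paper relies on: each distinct (multi-state, character) pair is recomputed from the NFA at most once between two purges.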
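The point about cleaning up references when a cached DFA state is purged can also be sketched with explicit state objects (again with invented names, not actual engine code): each cached state may hold direct pointers to other cached states, and evicting one must reset every pointer that targets it, so the matcher falls back to the plain NFA transition the next time that edge is taken.

    class DFAState:
        def __init__(self, nfa_states):
            self.nfa_states = nfa_states   # frozenset of NFA state numbers
            self.next = {}                 # char -> DFAState, filled lazily

    state_cache = {}                       # frozenset -> DFAState

    def evict(victim):
        """Remove one cached DFA state and clear every pointer to it."""
        del state_cache[victim.nfa_states]
        for state in state_cache.values():
            for ch, target in list(state.next.items()):
                if target is victim:
                    del state.next[ch]     # back to the NFA path for this char

    # Tiny usage example:
    s1 = DFAState(frozenset({0}))
    s2 = DFAState(frozenset({0, 1}))
    state_cache[s1.nfa_states] = s1
    state_cache[s2.nfa_states] = s2
    s1.next['a'] = s2
    evict(s2)
    assert 'a' not in s1.next              # s1 no longer points to the purged state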
To come back to weak references: I will probably not use them, but will first use a maximum size (even if weak references could help keep the cache at even lower bounds than the allowed maximum when the VM memory size is more constrained: it is a good idea in all Java programs to let caches introduced only for performance reasons also collaborate with the garbage collector, in order to avoid the explosion of all the caches used by various programs or libraries). I don't know if Python supports the concept of weak references for handling performance caches. Maybe I'll port it to Python later, but don't expect me to port it to C/C++ (as a replacement for PCRE), because I now hate these unsafe languages even though I have practiced them for many years: others can do that for me once I have published my Java implementation.
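As it turns out, Python does support weak references through the standard weakref module (including WeakValueDictionary), so the same collaborate-with-the-garbage-collector idea is available there too. A minimal sketch of such a weak cache (the DFAState class here is only a placeholder, not actual engine code):

    import weakref

    class DFAState:
        """Placeholder for a cached DFA state (illustrative only)."""
        def __init__(self, nfa_states):
            self.nfa_states = nfa_states

    # Values are held weakly: an entry disappears once nothing else
    # references the DFAState object any more.
    cache = weakref.WeakValueDictionary()

    key = frozenset({0, 1})
    state = DFAState(key)
    cache[key] = state
    print(key in cache)    # True while a strong reference to `state` exists
    del state
    print(key in cache)    # False once the object has been collected
                           # (immediately in CPython, thanks to refcounting)

Note that a purely weak cache like this empties as soon as the matcher drops its last strong reference to a state, so in practice it would still be combined with an explicit maximum size (or some strongly referenced working set), which is consistent with the conclusion above.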