Kay Schluehr replied to my question:

> > Why do you want to use a regex for this?
>
> Because it is part of a tokenizer that already uses regexps and I do
> not intend to rewrite / replace it.
Switching to pytst is not a big change - there will be little impact on
the rest of your code. On the other hand, it does change your build and
deployment process because of the need for another package. Because you
are constraining yourself to regexes, there isn't much of an improvement
you can make.

> I found that certain groups of tokens
> might be represented in a more compact manner: for matching ['+',
> '++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
> there is some generic approach to solve the "inverse regexp" problem in
> a non-trivial fashion.

A few questions come up. Why do you want a "more compact"
representation? Do you think it will make the result faster? It likely
will, because of the reduced need for backtracking, but unless you have
a lot of strings with the same prefix, the number of backtracks should
grow as something like the log of the number of words, which scales
pretty well.

Are you speed limited by the existing implementation? Likely not, in
which case performance isn't an issue. Consider this anecdote from
http://www.technicat.com/writing/programming.html :

  I was asked in a phone interview with Google how I would search for
  a number in an array of ordered numbers. Obviously, the questioner
  was asking for a CS 101 answer - binary search. But in real life, I
  would probably do the "wrong" thing - search the array from
  beginning to end. There's no point in taking twice the time to write
  twice as much code that has to be maintained and debugged if the
  application performance is good enough, and in particular if that
  piece of code is not the bottleneck in the application. (And I
  seriously doubt you'd have that data laid out linearly in a fixed
  array like that if it was the bottleneck)

As it turns out, the regexp implementation could do the optimization
you want.
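If you do stay with regexps, here is a minimal sketch of generating an
alternation from a token list (the function name tokens_to_pattern is
mine, not from your code). Sorting longest-first matters because
Python's re module takes the first alternative that matches, not the
longest one, so '\+|\+\+' would never match the second '+' of '++':

```python
import re

def tokens_to_pattern(tokens):
    # Sort longest-first so '++' is tried before '+'; Python's re
    # module takes the first alternative that matches, not the longest.
    escaped = [re.escape(t) for t in sorted(tokens, key=len, reverse=True)]
    return "|".join(escaped)

pattern = re.compile(tokens_to_pattern(["+", "++"]))
print(pattern.match("++a").group(0))  # '++'
```

This doesn't do the prefix-factoring you asked about ('\+\+?'), but it
is correct, and until you've measured, correct and simple is probably
enough.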
That is, it could recognize that the input is a set of fixed words (no
special characters) and implement something like Aho-Corasick, in which
case you wouldn't need to change your input. I don't know if Python's
regexp implementation does this already. Neither do you. Test the
functionality and performance first before starting other work.

I am not the first to give you this advice.

John Machin:
> Optimised with respect to what? speed? ease of construction?

gry@ll.mit.edu:
> As others have suggested, you should first try the most naive
> implementation before making a hard optimization problem out of this.

As for the general question of the existence of a "generic approach to
solve the 'inverse regexp' problem in a non-trivial fashion": yes,
there is. Kind of.

All (formal/theoretical) regular expressions can be rewritten so that
matching requires no backtracking. In the theory of regular
expressions, backtracking occurs when there are ambiguities in
traversing the state machine. Such machines are called nondeterministic
finite automata, or NFAs. It turns out that every NFA can be converted
to a DFA, which has no need for backtracking.

Except. Actually, two excepts. First, the normal algorithm for
converting an NFA to a DFA doesn't support grouping, which you'll need
to figure out which group matched. Wikipedia says there is an
alternate, slower algorithm which does support groups.

Second, regexps in Python, Perl, etc. are not regular expressions in
the language-theory sense. Some patterns, like r"([A-Z][a-z]+)\1",
which matches "WikiWiki", are not "regular grammars" under the
definition used in language theory. It's possible to use regexps to
match strings of prime length, for example. So in the general case
there is no solution. There are many possible optimizations, but no
general solution.

For your problem, which is a limited subset of the general one because
it involves only constant strings, the solution is a prefix tree, also
called a trie.
See http://en.wikipedia.org/wiki/Prefix_tree

John Machin suggested in an earlier post that you search along these
lines when he said:

> I would have thought the way to approach this would be a simple
> character-by-character tree/trie-driven lookup. This would be worst case
> O(n) where n is the length of the longest string in your list. There may
> well be a Python-callable gadget for this on the net somewhere. Google
> "Danny Yoo ahocorasick" for a Python-callable solution to a similar but
> more complex problem.

For that matter, I suggested it when I pointed you to pytst. The "tst"
means "ternary search tree" (some write it "ternary search trie"),
which is a different implementation of the same idea. Your posts read
as if you ignored those pointers because they didn't fall into the
category "implemented with a regular expression". If that's the case
then no, there is no solution; you're letting a sense of purity
overwhelm practicality.

If you are really interested in this problem, try taking a CS course
with a title like "Automata Theory" or "Foundations of Computer
Science" - the one that covers DFAs, NFAs, CFGs, Turing machines, etc.
It's the CS class I enjoyed the most.

Andrew
[EMAIL PROTECTED]

--
http://mail.python.org/mailman/listinfo/python-list