> Why are people getting stuck on infinite regular > languages? I've made it quite clear that I'm only really > interested in doing this for finite languages, but that > shouldn't matter anyway.
The power of regular expressions is that they define a consice means to encapsulate an infinite number of inputs. If you're not tapping this power, the general wisdom is to not use regular expressions. That said, unless you can ensure you don't have an infinite grammer (in particular, no asterisks or plus-signs or unbounded ranges like "{4,}"), you then have the problem of how to navigate that infinite set. Taking for example the most common/simple regexp: ".*" That matches any and every string. There's a whole library of Congress for which every written work will match that string. It sounds a lot like Jorge Luis Borges's 1956 "The Library of Babel". Do you begin searching depth-first with "a" and then "aa" and then "aaa"? Or do you begin breadth-first with "a", then "b"...then "z", then "aa", "ab", ...? Depth-first is far more efficient when searching a branching set of data, but when you have infinite depth, it will take a while ;) Alternatively, you can use some sort of iterative-deepening search which puts an increasing bound on how deep the depth-first search will go before giving up for that particular iteration. Googling on "iteritave deepening" should bring back some ideas for that one. If you truely have a finite grammer, it would be feasible to write some sort of parser, ugly as it might be. As a first pass, I'd have my parser emit python code to write N nested loops to try each combination of each element, so something like "[a-zA-Z_][a-zA-Z0-9_]" became something like def tries(): s1 = set(range(ord('a'),ord('z')+1)) s1.add(range(ord('A'),ord('Z')+1)) s1.add(ord('_')) s2 = set(range(ord('a'),ord('z')+1)) s2.add(range(ord('A'),ord('Z')+1)) s2.add(range(ord('0'),ord('9')+1)) s2.add(ord('_')) for a1 in s1: for a2 in s2: yield "%s%s" % (chr(a1), chr(a2)) Thus, the input regexp would generate python code that you could then include in your project. Ugly, but a possible solution. There are lots of peculiarities though that would need to be handled. Branching, escaped characters in character sets (well, escaping in general is a headache). An alternative might be brute-forcing your regexp: for a1 in range(0,256): for a2 in range(0,256): s = "%s%s" % (chr(a1),chr(a2)) m = re.match(s) if m: yield s nested to however deep you're interested in. All will be slow. All are pretty ugly. -tkc -- http://mail.python.org/mailman/listinfo/python-list