Re: How to print all expressions that match a regular expression

Alf P. Steinbach Sat, 06 Feb 2010 18:57:36 -0800

* Steven D'Aprano:

On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:

Regular expressions are programs in a "regex" programming language.
What you are asking for is the same as saying:

"Is there a program that can enumerate every possible set of data that
is usable as valid input for a given program?"

This, in turn, is equivalent to the Halting Problem -- if you can solve
one, you can solve the other. You might like to google on the Halting
Problem before you spend too much time on this.

Hm, well, text editors /regularly/ do repeated regular expression
searches, producing match after match after match, on request.


I think you have completely misunderstood what I'm saying.


Yes.

I'm not saying that you can't *run* a regular expression against text and generate output. That truly would be a stupid thing to say, because I clearly can do this:
>>> import re
>>> mo = re.search("p.rr.t",
... "Some text containing parrots as well as other things")
>>> mo.group()
'parrot'
As you point out, it's not hard to embed a regex interpreter inside a text editor or other application, or to call an external library.
What is difficult, and potentially impossible, is to take an arbitrary regular expression such as "p.rr.t" (the program in the regex language) and generate every possible data ("parrot", "pbrrat", ...) that would give a match when applied to that regular expression.

Hm, that's not difficult to do, it's just like counting, but it's rather meaningless since either the output is trivial or it's in general exponential or infinite.


So it seems we both misunderstood the problem.

I didn't read the top level article until now, and reading it, I can't make sense of it.

It sounds like some kind of homework problem, but without the constraints that surely would be there in a homework problem.

Now, in this case, my example is very simple, and it would be easy to enumerate every possible data: there's only 65025 of them, limiting to the extended ASCII range excluding NUL (1-255). But for an arbitrary regex, it won't be that easy. Often it will be unbounded: the example of enumerating every string that matches .* has already been given.
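
As a rough illustration of the 65025 figure, here is a minimal sketch that enumerates the strings for "p.rr.t" by filling in the two dots from the 1-255 range. The names ALPHABET and enumerate_parrots are made up for this example. One assumption worth flagging: Python's "." does not match a newline unless re.DOTALL is set, so the count of 65025 only holds with that flag.

```python
import re
from itertools import product

# Assumed alphabet: code points 1-255 (extended ASCII, excluding NUL).
ALPHABET = [chr(i) for i in range(1, 256)]


def enumerate_parrots():
    """Yield every candidate for p.rr.t by filling in the two dots."""
    for a, b in product(ALPHABET, repeat=2):
        yield f"p{a}rr{b}t"


matches = list(enumerate_parrots())

# With re.DOTALL, "." also matches "\n", so every candidate matches.
dotall = re.compile("p.rr.t", re.DOTALL)
count_dotall = sum(1 for s in matches if dotall.fullmatch(s))

# Without the flag, candidates containing "\n" in a dot position drop out.
plain = re.compile("p.rr.t")
count_plain = sum(1 for s in matches if plain.fullmatch(s))

print(count_dotall)  # 65025 == 255 * 255
print(count_plain)   # 64516 == 254 * 254
```
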
The second problem is, generating the data which gives the output you want is potentially very, very, difficult, potentially as difficult as finding collisions in cryptographic hash functions:
"Given the function hashlib.sha256, enumerate all the possible inputs that give the hexadecimal result 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."


I tried some "parrot" variants but no dice. :-(


[snip]

I'm suggesting that, in general, there's no way to tell in advance which regexes will be easy and which will be hard, and even when they are easy, the enumeration will often be infinite.

I agree about the (implied) meaningless, exponential/infinite output, which means that possibly that's not what the OP meant, but disagree about the reasoning about "no way": really, regular expressions are /very/ limited so it's not hard to compute up front the number of strings it can generate from some given character set, in time linear in the length of the regexp.

Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation that denotes "digit sequence") yields an infinite number of strings.

And otherwise the regexp is of the form ABCDE..., where A, B, C etc are parts that each can generate a certain finite number of strings; multiplying these numbers gives the total number of strings that the regexp can generate.
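
The counting rule above can be sketched roughly as follows, for a deliberately tiny regex subset only. The function name count_strings and the supported subset (literals, '.', simple sets like [abc] without ranges or negation, and the quantifiers '*' and '+') are assumptions of this sketch, not a general regex parser. Alternation, groups and optional parts are left out on purpose.

```python
import math


def count_strings(pattern, alphabet_size):
    """Count the strings a tiny regex subset can generate.

    Supported (assumed for this sketch): literal characters, '.',
    simple sets such as [abc] whose members are taken to lie in the
    alphabet, and the quantifiers '*' and '+'.
    """
    total = 1
    infinite = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "*+":
            if i == 0:
                raise ValueError("nothing to repeat")
            infinite = True              # unbounded repetition
        elif ch == ".":
            total *= alphabet_size       # any one character
        elif ch == "[":
            end = pattern.index("]", i + 1)
            members = set(pattern[i + 1:end])
            if not members:
                raise ValueError("empty set")
            total *= len(members)        # one of the listed characters
            i = end
        elif ch in "()|?{}\\^$]":
            raise ValueError(f"unsupported syntax: {ch!r}")
        # any other character is a literal and contributes a factor of 1
        i += 1
    return math.inf if infinite else total


print(count_strings("p.rr.t", 255))   # 65025
print(count_strings("p.rr.*t", 255))  # inf
```

The single left-to-right pass is what makes the count linear in the length of the pattern, at least for this restricted subset.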



Cheers,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list
