Diez B. Roggisch wrote:
> [EMAIL PROTECTED] wrote:
>> Hello
>>
>> I am looking for python code that takes as input a list of strings
>> (most similar,
>> but not necessarily, and rather short: say not longer than 50 chars)
>> and that computes and outputs the python regular expression that
>> matches
>> these string values (not necessarily strictly, perhaps the code is able
>> to determine
>> patterns, i.e. families of strings...).
>
> For matching the exact set, of course an alternation can be used.
> However, this is of limited use, because then a simple string find will
> suffice.
>
> In general, this can't be done. It might be possible that the underlying
> structure of the language isn't regular at all - for example, the simple
> language of palindromes isn't.
>
> And even if it was - the search space is just too large to explore. You
> could generate a bazillion matching expressions, but .* will always
> match - so how do you prevent the generator from converging towards that?
>
> If all you have are those strings, you are better off trying to infer
> some structure yourself.
>
> Diez
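
For what it's worth, the exact-set expression Diez mentions is mechanical to produce, whatever its use. A minimal sketch (the function name is mine):

   >>> import re
   >>> def exact_set_pattern(strings):
   ...     # Escape metacharacters; sort longest-first so no alternative
   ...     # is shadowed by a shorter target that is a prefix of it.
   ...     return '|'.join(sorted(map(re.escape, strings), key=len, reverse=True))
   ...
   >>> exact_set_pattern(['be', 'bee', 'a.c'])
   'a\\.c|bee|be'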


Here is how regular expressions raise confusion. One starts with a relatively simple matching problem and before long finds oneself in a "search space too large to explore", having to formulate the right expression out of "a bazillion" permutations, with a syntax replete with potential traps: backslashes, special characters, precedence order, etc. The fact emerges (once more) that a regular expression does not spare one the effort of acquiring a crystal-clear conception of what needs to be done, which is an absolute prerequisite of writing code that can be relied upon to do what it is supposed to do. And a "space too large to explore" is hardly the working environment in which to acquire that crystal-clear conception. Yet regular expressions seem to exert a strange attraction and are often sought for problems that don't really require them. I suspect this is because regular expression parsers are made by highly qualified experts, and so they may recommend themselves to the programmer with a less than perfect understanding of a task as a sort of magic bullet that "does it right" all by itself. Regex coding as a substitute for problem analysis, so to speak ...
      Most OPs are rather sketchy and leave it to their readers to guess purpose, requirements, incidentals, environment, restrictions, etc. Accordingly, threads often veer off topic very quickly into lofty realms of academic interest, the OP never to be heard from again. So I would suggest that the OP explain what he intends to do with his regular expression. A lot depends on it.
      In the interim I'd like to resize the search space to a manageable dimension with the proven hypothesis that the number of useless permutations of the kind being discussed is a bazillion minus one, because, fortunately, there is only one that makes sense. It follows two rules: 1. First come, first served. 2. Of two contending targets, the longer prevails. That is not to say other rules never make sense. It is to say that in ninety-nine percent of cases these two rules apply, and that the remaining one percent can also be understood in terms of these two rules plus exception rules.
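
The two rules are easy to demonstrate in code. A minimal sketch, independent of any regex engine (the function name is mine):

   >>> def leftmost_longest(text, targets):
   ...     # Rule 1: scan left to right.  Rule 2: at each position,
   ...     # the longest matching target prevails.
   ...     hits, i = [], 0
   ...     while i < len(text):
   ...         candidates = [t for t in targets if text.startswith(t, i)]
   ...         if candidates:
   ...             best = max(candidates, key=len)
   ...             hits.append(best)
   ...             i += len(best)
   ...         else:
   ...             i += 1
   ...     return hits
   ...
   >>> leftmost_longest('bee belonging', ['be', 'bee', 'belong', 'long'])
   ['bee', 'belong']

Note that 'belonging' yields 'belong', not 'be' or 'long': the first and longest contender wins.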

Chapter 3.1 of the SE doc visualizes the point by replacing a set of overlapping targets with their upper-cased selves:

   >>> import SE
   >>> overlapping_targets = 'be=BE being=BEING been=BEEN bee=BEE belong=BELONG long=LONG longer=LONGER'
   >>> high_strung_bee_story = "There was a bee belonging to hive nine longing to be a beetle and thinking that being a bee was okay, but she had been a bee long enough and wouldn't be one much longer."
   >>> SE.SE(overlapping_targets)(high_strung_bee_story)
   "There was a BEE BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER."

The overlapping targets resolve correctly, though BELONGing, LONGing and BEEtle exemplify unintended matches inside longer words. These could be avoided by refining the target set, and the partially successful result points out the direction the refinement has to take. Since precedence is unrelated to the order of the definitions, editing the set is carefree: it imposes no precedence considerations. The benefit in terms of modularity is substantial, as target sets can be combined like modules.
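
Readers without the SE module can approximate the replacement with plain re.sub. A sketch, not the SE implementation; note the descending-length sort of the alternatives, whose necessity the findall comparison below makes explicit:

   >>> import re
   >>> pairs = dict(item.split('=') for item in overlapping_targets.split())
   >>> pattern = '|'.join(sorted(map(re.escape, pairs), key=len, reverse=True))
   >>> re.sub(pattern, lambda m: pairs[m.group(0)], high_strung_bee_story)
   "There was a BEE BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER."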

Doing the same example with a regular expression and findall:

   >>> r = re.compile('be|being|been|bee|belong|long|longer')
   >>> r.findall(high_strung_bee_story)
   ['be', 'be', 'long', 'long', 'be', 'be', 'be', 'be', 'be', 'be', 'long', 'be', 'long']

The regular expression also applies rule one, first come first served, but applies a different rule two: precedence follows the order in which the targets line up in the expression, left to right. To implement the SE rule that the longer target prevails, the patterns should be lined up reverse-sorted, which places each target ahead of its own prefixes.
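
That line-up need not be hand-made: since a prefix always sorts before its extensions, a plain lexicographic reverse sort produces it:

   >>> '|'.join(sorted('be being been bee belong long longer'.split(), reverse=True))
   'longer|long|belong|being|been|bee|be'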

   >>> r = re.compile('longer|long|belong|being|been|bee|be')
   >>> r.findall(high_strung_bee_story)
   ['bee', 'belong', 'long', 'be', 'bee', 'being', 'bee', 'been', 'bee', 'long', 'be', 'longer']

Which of the two is correct? Without knowing the purpose of the search it is impossible to say. I expect it to be the second one. But only the OP knows, and he could help his helpers help by letting them know what he is up to.
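
If, for instance, whole words are what he is after, anchoring the reverse-sorted alternation between \b word boundaries would silence the matches inside longer words (BELONGing, LONGing, BEEtle):

   >>> r = re.compile(r'\b(?:longer|long|belong|being|been|bee|be)\b')
   >>> r.findall(high_strung_bee_story)
   ['bee', 'be', 'being', 'bee', 'been', 'bee', 'long', 'be', 'longer']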


Frederic



