"Paul McGuire" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > "Andy Dingley" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] >> >> [EMAIL PROTECTED] wrote: >> >>> I am looking for python code that takes as input a list of strings >>> [...] and outputs the python regular expression >> >> (s1|s2|s3|s4|s5) >> for strings of "s1" etc. >> >> Regex compilers are themselves quite good at optimising beyond this >> > > It turns out this problem is a little trickier, especially when one of > your strings is a leading subset of another. For instance, let's say we > are looking for comparison operators, one of <, >, =, >=, <=, or !=. > Simply concatenating with intervening '|' characters gives this regexp: > > "<|>|=|<=|>=|!=" > > However, the leading '<' and '>' alternatives mask the later '<=', '<>', > and '>=' alternatives, and so the regexp never matches the longer options > (I was not able to find a greediness switch that would override this). So > when searching "a >= b" we get this: > >>>> re.findall("<|>|=|<=|>=|!=", "a >= b") > ['>', '='] > > By moving the longer option to the front of the regexp, the longer option > is no longer masked by the shorter: > >>>> re.findall(">=|<|>|=|<=|!=", "a >= b") > ['>='] > > > You also can't just concatenate input strings, since it is very likely > they will contain one of the magic re symbols ()[]?*./\+, etc. So > re.escape needs to be called to add the necessary '\'s. > > Here is an extract from pyparsing's oneOf function that does something > similar, that handles the leading substring masking problem, and escapes > the input strings, before concatenating them to a valid regexp. Of > course, the simplest approach would be to just sort the symbols by > descending length, but you may have some a priori knowledge that 'X' is a > very common match, and want that option tested as early as possible. So > this method only reorders strings if there is a masking problem. > > > def createREfrom( symbols ): #symbols is a list of strings > isequal = ( lambda a,b: a == b ) > masks = ( lambda a,b: b.startswith(a) ) > i = 0 > while i < len(symbols)-1: > cur = symbols[i] > for j,other in enumerate(symbols[i+1:]): > if ( isequal(other, cur) ): > del symbols[i+j+1] > break > elif ( masks(cur, other) ): > del symbols[i+j+1] > symbols.insert(i,other) > cur = other > break > else: > i += 1 > return "|".join( [ re.escape(sym) for sym in symbols] ) > >>>> print createREfrom(["ABC","ABCDEF","ABCGHI"]) > ABCDEF|ABCGHI|ABC >>>> print createREfrom("> < = <= >= != << >> <<< >>>".split()) > \>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\= >>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= >>>> b") > ['<='] > > > Note, this does not do any optimization, such as collapsing > "ABCDEF|ABCGHI" to "ABC(DEF|GHI)". I think there are some recipes in the > Python cookbook for such optimization. > > -- Paul > >
-- http://mail.python.org/mailman/listinfo/python-list