compound regex

spir Mon, 09 Feb 2009 12:44:15 -0800

Hello,

(new here)


Below is an extension to the standard module re. The point is to allow writing
and testing sub-expressions individually, then nesting them into a
super-expression. It is more or less like using a parser generator, but it
keeps regex grammar and power.
I used the format {sub_expr_name} for inclusion: in standard regexes, {} is
only used to express a repetition count, so a pair of curly braces enclosing
an identifier should not conflict.
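
To illustrate the claim about curly braces, here is a minimal sketch (not part of the module itself). It uses the same inclusion pattern as the code further down, with the braces escaped for readability; the names `inclusion` and `fmt` are only illustrative. Repetition counts such as {5} or {2,3} start with a digit, so they are left alone, while {identifier} is picked up.

```python
import re

# Same inclusion pattern as in the module, braces escaped for clarity.
inclusion = re.compile(r"\{[a-zA-Z_][a-zA-Z_0-9]*\}")

# A format mixing repetition counts and named inclusions (illustrative only).
fmt = r"^\d{2,3} {code}\S{5} {_desc1}"

# Only the brace pairs enclosing an identifier are reported.
names = inclusion.findall(fmt)
print(names)
```

One caveat follows from the same rule: a literal {word} that is meant to be matched as text would also be treated as an inclusion, so it would need escaping in the format.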

The extension is new and has had very little testing. I would welcome
comments, criticism, etc. I would like to know whether you find such a feature
useful. You will probably find the code simple enough ;-)

Denis
------
la vida e estranya

===============
# coding: utf-8

'''     super_regex
        
        Define & check sub-patterns individually,
        then include them in global super-pattern.
        
        uses format {name} for inclusion:
                sub1 = Regex(...)
                sub2 = Regex(...)
                super_format = "...{sub1}...{sub2}..."
                # final regex object:
                super_regex = superRegex(super_format)
        '''

from re import compile as Regex

# sub-pattern inclusion format
sub_pattern = Regex(r"{[a-zA-Z_][a-zA-Z_0-9]*}")

# sub-pattern expander
def sub_pattern_expansion(inclusion, dic=None):
        name = inclusion.group()[1:-1]
        ### namespace dict may be specified -- else globals()
        if dic is None:
                dic = globals()
        if name not in dic:
                raise NameError("Cannot find sub-pattern '%s'." % name)
        return dic[name].pattern

# super-pattern generator
def superRegex(format):
        expanded_format = sub_pattern.sub(sub_pattern_expansion, format)
        return Regex(expanded_format)

if __name__ == "__main__": # purely artificial example use
        # pattern
        time = Regex(r"\d\d:\d\d:\d\d") # hh:mm:ss
        code = Regex(r"\S{5}")                  # non-whitespace x 5
        desc = Regex(r"[\w\s]+$")               # alphanum|space --> EOL
        ref_format = "^ref: {time} #{code} --- {desc}"
        ref_regex = superRegex(ref_format)
        # output
        print('super pattern:\n"%s" ==>\n"%s"\n' % (ref_format, ref_regex.pattern))
        text = "ref: 12:04:59 #%+.?% --- foo 987 bar"
        result = ref_regex.match(text)
        print('text: "%s" ==>\n"%s"' % (text, result.group()))
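
A possible variation, sketched under assumptions rather than taken from the module above: the expander accepts a namespace dict, but superRegex never passes one, so lookups always go through globals(). The hypothetical helper `super_regex_from` below takes the dict explicitly. It also wraps each included sub-pattern in a non-capturing group, which is an addition of this sketch: without it, a sub-pattern containing `|` or followed by a quantifier would combine with its surroundings in unintended ways.

```python
import re

# Inclusion pattern with a capturing group around the identifier.
INCLUSION = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def super_regex_from(fmt, namespace):
    """Expand {name} inclusions from an explicit dict (hypothetical helper)."""
    def expand(match):
        name = match.group(1)
        if name not in namespace:
            raise NameError("Cannot find sub-pattern '%s'." % name)
        # Non-capturing group: a following quantifier or an inner "|"
        # then applies to the whole sub-pattern only.
        return "(?:%s)" % namespace[name].pattern
    return re.compile(INCLUSION.sub(expand, fmt))

# Illustrative sub-patterns, kept in their own dict instead of globals().
subs = {
    "sign": re.compile(r"\+|-"),
    "digits": re.compile(r"\d+"),
}
number = super_regex_from(r"^{sign}?{digits}$", subs)
print(number.pattern)
```

With plain textual inclusion, the same format would expand to `^\+|-?\d+$`, which reads as "a leading plus, or an optional minus followed by digits", not as an optionally signed number.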
--
http://mail.python.org/mailman/listinfo/python-list
