Hi,

I have a script that trawls through log files looking for certain error 
conditions. Each condition is identified by its own keyword in the log 
line.

I then process those lines using regex groups to extract certain fields.

Currently, I'm using a for loop to iterate through the file, and a dict of 
regexes:

    import re

    breaches = {
        'type1': re.compile(r'some_regex_expression'),
        'type2': re.compile(r'some_regex_expression'),
        'type3': re.compile(r'some_regex_expression'),
        'type4': re.compile(r'some_regex_expression'),
        'type5': re.compile(r'some_regex_expression'),
    }
    ...
    with open('blah.log', 'r') as f:
        for line in f:
            # Try every compiled regex against every line.
            for breach in breaches:
                results = breaches[breach].search(line)
                if results:
                    self.logger.info('We found an error - {0} - {1}'.format(
                        results.group('errorcode'), results.group('errormsg')))
                    # We do other things with other regex groups as well.

(This isn't the *exact* code, but it shows the logic/flow fairly closely).

For completeness, the actual regexes look something like this (they could 
possibly be tuned):

    
    (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*\]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]\s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]\s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)

(Feel free to suggest any tuning, if you think they need it).

My question is - I've heard that using the "in" membership operator is 
substantially faster than using Python regexes.

Is this true? What is the technical explanation for this? And what sort of 
performance characteristics are there between the two?
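
For what it's worth, this is roughly how I would try to measure the 
difference myself. It is only a sketch: the log line, the keyword and the 
pattern below are made up for illustration, and the numbers will depend on 
line length and on how complex the real regexes are.

```python
import re
import timeit

# Made-up sample data, not taken from my real logs.
LINE = 'x' * 200 + ' failed order: BUY 100 @ 12.5'
KEYWORD = 'failed order'
PATTERN = re.compile(r'failed order')


def time_in(number=100000):
    """Time the plain substring test ("in") against the sample line."""
    return timeit.timeit(lambda: KEYWORD in LINE, number=number)


def time_regex(number=100000):
    """Time a precompiled regex search against the same line."""
    return timeit.timeit(lambda: PATTERN.search(LINE), number=number)


if __name__ == '__main__':
    print('in:    %.4f s' % time_in())
    print('regex: %.4f s' % time_regex())
```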

(I couldn't find much in the way of docs for "in", just the brief mention here 
- http://docs.python.org/2/reference/expressions.html#not-in )

Would I be substantially better off using a list of strings and using "in" 
against each line, then using a second pass of regex only on the matched lines?
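
To make the idea concrete, here is a minimal sketch of the two-pass approach 
I have in mind. The keywords and patterns are placeholders (not my real ones), 
and scan() is just a name I picked for the example.

```python
import re

# Hypothetical keyword -> compiled regex mapping. Each keyword is a literal
# substring that must be present before the (more expensive) regex is tried.
BREACHES = {
    'failed order': re.compile(
        r'failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+)'),
    'limit breach': re.compile(r'limit breach: code=(?P<errorcode>\d+)'),
}


def scan(lines):
    """Yield (keyword, match) for every line that passes both tests."""
    for line in lines:
        for keyword, pattern in BREACHES.items():
            # First pass: cheap substring test.
            if keyword in line:
                # Second pass: full regex, only on candidate lines.
                match = pattern.search(line)
                if match:
                    yield keyword, match
```

Usage would be something like `for keyword, m in scan(f): ...`, pulling the 
fields out with m.group('side') and so on.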

(The log files are compressed and I'm actually using bz2 to read them in; the 
uncompressed size is around 40-50 GB.)
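
The reading side looks roughly like the sketch below. It assumes Python 3.3+, 
where bz2.open() supports text mode (on Python 2 it would have to be 
bz2.BZ2File instead); iter_log_lines() and the UTF-8 encoding are assumptions 
made for the example.

```python
import bz2


def iter_log_lines(path):
    """Yield decoded lines from a bz2-compressed log, one at a time."""
    # Text mode ('rt') decompresses and decodes lazily, so the whole file
    # never has to fit in memory.
    with bz2.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            yield line
```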



Cheers,
Victor