Hi,

That was actually *one* regex... lol.
And yes, it probably is a bit convoluted. Thanks for the tip about using
VERBOSE - I'll use that, and comment my regex. Are there any other general
pointers you might give for that regex?

The lines I'm trying to match look something like this:

07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] [SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 ErrorText="some error message" ID=0:0x0000000000000000 Foo=1 Bar=5 Joe=5]

Essentially, I'd want to strip out the timestamp, logging level, module,
function etc. - and possibly parse the tag-value pairs?

And yes, based on what you said, I probably will use the "in" test first,
outside the regex - the lines I'm searching for are fairly few compared to
the overall log size.

Cheers,
Victor

On Thursday, 13 December 2012 12:09:33 UTC+11, Steven D'Aprano wrote:
> On Wed, 12 Dec 2012 14:35:41 -0800, Victor Hooi wrote:
>
> > Hi,
> >
> > I have a script that trawls through log files looking for certain error
> > conditions. These are identified via certain keywords (all different)
> > in those lines.
> >
> > I then process those lines using regex groups to extract certain
> > fields.
> [...]
> > Also, my regexes could possibly be tuned; they look something like
> > this:
> >
> > (?P<timestamp>\d{2}:\d{2}:\d{2}.\d{9})\s*\[(?P<logginglevel>\w+)\s*
> > \]\s*\[(?P<module>\w+)\s*\]\s*\[{0,1}\]{0,1}\s*\[(?P<function>\w+)\s*\]
> > \s*level\(\d\) broadcast\s*\(\[(?P<f1_instance>\w+)\]\s*\[(?P<foo>\w+)\]
> > \s*(?P<bar>\w{4}):(?P<feedcode>\w+) failed order: (?P<side>\w+) (?
> > P<volume>\d+) @ (?P<price>[\d.]+), error on update \(\d+ : Some error
> > string. Active Orders=(?P<value>\d+) Limit=(?P<limit>\d+)\)\)
> >
> > (Feel free to suggest any tuning, if you think they need it.)
>
> "Tuning"? I think it needs to be taken out and killed with a stake to
> the heart, then buried in concrete!
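As a rough illustration, here is one way the sample line above could be parsed with a commented, VERBOSE-flag regex, plus a simpler second pass over the tag=value block. The group names and the pair-splitting pattern are my own guesses, not part of the real log format:

```python
import re

# A commented re.VERBOSE version of a pattern for the sample line above.
# Group names (timestamp, level, module, ...) are assumptions.
LOG_LINE = re.compile(r"""
    (?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{9})   # 07:40:05.793627975 (dot escaped)
    \s*\[(?P<level>\w+)\s*\]                  # [Info ]
    \s*\[(?P<module>\w+)\s*\]                 # [SOME_MODULE]
    \s*\[(?P<function>\w+)\s*\]               # [SOME_FUNCTION]
    \s*\[(?P<flag>\w+)\s*\]                   # [SOME_OTHER_FLAG]
    \s*\[(?P<fields>[^]]*)\]                  # [RequestTag=0 ErrorCode=3 ...]
""", re.VERBOSE)

# Second pass: split the tag=value pairs, allowing quoted values with spaces.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

line = ('07:40:05.793627975 [Info ] [SOME_MODULE] [SOME_FUNCTION] '
        '[SOME_OTHER_FLAG] [RequestTag=0 ErrorCode=3 '
        'ErrorText="some error message" ID=0:0x0000000000000000 '
        'Foo=1 Bar=5 Joe=5]')

m = LOG_LINE.match(line)
# Note: quoted values keep their surrounding quotes; strip('"') if unwanted.
fields = dict(PAIR.findall(m.group('fields')))
```

With this sketch, `m.group('timestamp')` gives the timestamp and `fields` is a dict of the tag-value pairs, so the rest of the line can simply be ignored.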
> :-)
>
> An appropriate quote:
>
>     Some people, when confronted with a problem, think "I know,
>     I'll use regular expressions." Now they have two problems.
>     -- Jamie Zawinski
>
> Is this actually meant to be a single regex, or did your email somehow
> mangle multiple regexes into a single line?
>
> At the very least, you should write your regexes using the VERBOSE flag,
> so you can use non-significant whitespace and comments. There is no
> performance cost to using VERBOSE once they are compiled, but a huge
> maintainability benefit.
>
> > My question is - I've heard that using the "in" membership operator is
> > substantially faster than using Python regexes.
> >
> > Is this true? What is the technical explanation for this? And what
> > sort of performance characteristics are there between the two?
>
> Yes, it is true. The technical explanation is simple:
>
> * the "in" operator implements simple substring matching, which is
>   trivial to perform and fast;
>
> * regexes are an interpreted mini-language which operates via a complex
>   state machine that needs to do a lot of work, which is complicated to
>   perform and slow.
>
> Python's regex engine is not as finely tuned as (say) Perl's, but even
> in Perl simple substring matching ought to be faster, simply because
> you are doing less work to match a substring than to run a regex.
>
> But the real advantage to using "in" is readability and maintainability.
>
> As for the performance characteristics, you really need to do your own
> testing. Performance will depend on what you are searching for, where
> you are searching for it, whether it is found or not, your version of
> Python, your operating system, and your hardware.
>
> At some level of complexity, you are better off just using a regex
> rather than implementing your own buggy, complicated expression
> matcher: for some matching tasks, there is no reasonable substitute for
> regexes. But for *simple* uses, you should prefer *simple* code:
>
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > "'xyz' in data"
> 100000 loops, best of 3: 4.17 usec per loop
>
> [steve@ando ~]$ python -m timeit \
> > -s "data = 'abcd'*1000 + 'xyz' + 'abcd'*1000" \
> > -s "from re import search" \
> > "search('xyz', data)"
> 100000 loops, best of 3: 10.9 usec per loop
>
> > (I couldn't find much in the way of docs for "in", just the brief
> > mention here -
> > http://docs.python.org/2/reference/expressions.html#not-in )
> >
> > Would I be substantially better off using a list of strings and using
> > "in" against each line, then using a second pass of regex only on the
> > matched lines?
>
> That's hard to say. It depends on whether you are matching on a
> substring that will frequently be found close to the start of each
> line, or something else.
>
> Where I expect you *will* see a good benefit is:
>
> * you have many lines to search;
> * but only a few actually match the regex;
> * the regex is quite complicated, and needs to backtrack a lot;
> * but you can eliminate most of the "no match" cases with a simple
>   substring match.
>
> If you are in this situation, then very likely you will see a big
> benefit from a two-pass search:
>
> for line in log:
>     if any(substr in line for substr in list_of_substrings):
>         # now test against a regex
>
> Otherwise, maybe, maybe not.
>
> --
> Steven
--
http://mail.python.org/mailman/listinfo/python-list
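For completeness, a runnable fleshing-out of the two-pass sketch quoted above. The substrings and the regex here are placeholders loosely based on the earlier example, not the real log format:

```python
import re

# Cheap first-pass filters: plain substring membership tests.
list_of_substrings = ["failed order", "error on update"]

# Expensive second pass: a full regex, run only on lines that pass the filter.
pattern = re.compile(
    r'failed order: (?P<side>\w+) (?P<volume>\d+) @ (?P<price>[\d.]+)')

def interesting_lines(log):
    for line in log:
        # First pass: eliminate most "no match" lines with "in".
        if any(substr in line for substr in list_of_substrings):
            # Second pass: extract fields with the regex.
            m = pattern.search(line)
            if m:
                yield m.groupdict()

log = [
    "boring line",
    "x failed order: BUY 100 @ 25.5, error on update (7 : Some error)",
]
matches = list(interesting_lines(log))
```

Here `matches` holds one dict of named groups per matching line; the boring lines never reach the regex at all, which is where the speed-up comes from.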