Tim Peters wrote:
> """
> Some people, when confronted with a problem, think “I know, I'll use
> regular expressions.” Now they have two problems.
> - Jamie Zawinski
> """
Maybe so, but I'm committed now :). I have dozens of regexes that parse the
specific log messages I'm interested in. I made a little DSL built on regexes
with capture groups: if a regex matches, it takes the resulting groupdict and
optionally applies further transformations to the individual fields. This
lets me specify very concisely what I want to extract before doing further
analysis and aggregation on the resulting fields. For example:
flush_end = Rule(
    Capture(
        # Completed flushing
        # /u01/data02/tb_tbi_project02_prd/data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db
        # (46.528MiB) for commitlog position CommitLogPosition(segmentId=1615955816662,
        # position=223538288)
        #
        # Completed flushing
        # /dse/data02/OpsCenter/rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db
        # (7.763MiB/2.197MiB on disk/1 files) for commitlog position
        # CommitLogPosition(segmentId=1637403836277, position=9927158)
        r"Completed flushing (?P<sstable>[^ ]+) "
        r"\((?P<bytes_flushed>[^)/]+)(/(?P<bytes_on_disk>[^ ]+) on disk/(?P<file_count>[^ ]+) files)?\) "
        r"for commitlog position "
        r"CommitLogPosition\(segmentId=(?P<commitlog_segment>[^,]+), "
        r"position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(
        normval,
        "bytes_flushed",
        "bytes_on_disk",
        "commitlog_segment",
        "commitlog_position",
    ),
    table_from_sstable,
)
I know there are specialized tools like logstash, but it's nice to be able to
specify the extraction and subsequent analysis together in Python.
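
For anyone curious how such a rule could hang together, here is a minimal
sketch. It is not the actual implementation: the class bodies are assumptions
reconstructed from the description above, int stands in for the real normval
converter, and the commitlog rule is a hypothetical, cut-down pattern.

```python
import re
from typing import Any, Callable, Dict, Optional

Fields = Dict[str, Any]


class Capture:
    """First step of a rule: turn a log line into a dict of named groups."""

    def __init__(self, pattern: str) -> None:
        self.regex = re.compile(pattern)

    def __call__(self, line: str) -> Optional[Fields]:
        match = self.regex.search(line)
        return match.groupdict() if match else None


class Convert:
    """Apply func to each named field that is present and not None."""

    def __init__(self, func: Callable[[Any], Any], *names: str) -> None:
        self.func = func
        self.names = names

    def __call__(self, fields: Fields) -> Fields:
        for name in self.names:
            # Optional groups that did not take part in the match are None.
            if fields.get(name) is not None:
                fields[name] = self.func(fields[name])
        return fields


class Rule:
    """Run the capture, then each transformation in order, on one log line."""

    def __init__(
        self, capture: Capture, *transforms: Callable[[Fields], Fields]
    ) -> None:
        self.capture = capture
        self.transforms = transforms

    def __call__(self, line: str) -> Optional[Fields]:
        fields = self.capture(line)
        if fields is None:
            return None  # the regex did not match this line
        for transform in self.transforms:
            fields = transform(fields)
        return fields


# Hypothetical rule: int is a stand-in for the real normval converter.
commitlog = Rule(
    Capture(
        r"segmentId=(?P<commitlog_segment>[^,]+), "
        r"position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(int, "commitlog_segment", "commitlog_position"),
)
```

Calling commitlog(line) returns the converted field dict for a matching line
and None otherwise, so a list of rules can simply be tried in turn on each
log line.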
> reason to change that. Naive regexps are both clumsy and prone to bad
> timing in many tasks that "should be" very easy to express. For
> example, "now match up to the next occurrence of 'X'". In SNOBOL and
> Icon, that's trivial. 75% of regexp users will write ".*X", with scant
> understanding that it may match waaaay more than they intended.
> Another 20% will write ".*?X", with scant understanding that may
> extend beyond _just_ "the next" X in some cases. That leaves the happy
> 5% who write "[^X]*X", which finally says what they intended from the
> start.
If you look at the regex in my example above, you will see that "[^X]*X" is
exactly what I did. The pathological case arose from a simple typo: I had
an extra + after a capture group that I failed to notice. It somehow
worked correctly on the expected input but ran forever when the expected
terminating character appeared more times than expected in the input string.
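
To make both points concrete, here is a small sketch. The first two parts
show how the three spellings of "match up to the next X" differ. The third
shows how a stray + after a group turns a linear pattern into one that
backtracks exponentially. The TYPO pattern is a hypothetical stand-in, not
the regex I actually had, and the inputs are made up for illustration.

```python
import re
import time

# 1. "Match up to the next X": three spellings, one input.
text = "aXbXc"
greedy = re.match(r".*X", text).group()       # runs on to the last X
lazy = re.match(r".*?X", text).group()        # stops at the first X here
negated = re.match(r"[^X]*X", text).group()   # can never cross an X

# 2. The lazy form can still run past the next X when more pattern follows,
#    because it keeps expanding until the rest of the pattern matches.
log = "start X end, start X1"
lazy_tail = re.search(r"start .*?X\d", log).group()
negated_tail = re.search(r"start [^X]*X\d", log).group()

# 3. Hypothetical typo: an extra + after the group nests two quantifiers.
GOOD = re.compile(r"\((?P<value>[^)]+)\)")
TYPO = re.compile(r"\((?P<value>[^)]+)+\)")


def time_failed_match(regex, n):
    """Match against '(' followed by n letters and no closing paren.

    Returns (match_result, elapsed_seconds). The match has to fail, so the
    engine must exhaust every way of splitting the letters between the
    inner and outer quantifier before it can give up.
    """
    subject = "(" + "a" * n
    start = time.perf_counter()
    result = regex.match(subject)
    return result, time.perf_counter() - start
```

On well-formed input such as "(46.528MiB)" both GOOD and TYPO return the same
capture, which is why the typo is easy to miss. Timing time_failed_match(TYPO, n)
for slowly growing n (keep it small, say below 25) should show the time roughly
doubling with each extra character, while GOOD stays flat.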