On Wed, Dec 14, 2016 at 10:43 PM, Paul Moore <p.f.mo...@gmail.com> wrote: > This is a messy format to parse, but it's manageable. However, there's a > catch. Because the logging software involved is broken, I can occasionally > get a log record prematurely terminated with a new record starting > mid-stream. So something like the following: > > [2016-11-30T20:04:08.000+00:00] [Component] > [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description > of the issue goes here > > I'm struggling to find a "clean" way to parse this. I've managed a clumsy > approach, by splitting the file contents on the pattern > [ddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where > this gets truncated) and then treating each entry as a record and parsing it > individually. But the resulting code isn't exactly maintainable, and I'm > looking for something cleaner. >
Is the "[Component]" section something you could verify? (That is - is there a known list of components?) If so, I would include that as a secondary check. Ditto anything else you can check (I'm guessing the [level] is one of a small set of values too.) The logic would be something like this: Read line from file. Verify line as a potential record: Assert that line begins with timestamp. Verify as many fields as possible (component, level, etc) Search line for additional timestamp. If additional timestamp found: Recurse. If verification fails, assume we didn't really have a corrupted line. (Process partial line? Or discard?) If "[[" in line: Until line is "]]": Read line from file, append to description If timestamp found: Recurse. If verification succeeds, break out of loop. Unfortunately it's still not really clean; but that's the nature of working with messy data. Coping with ambiguity is *hard*. ChrisA -- https://mail.python.org/mailman/listinfo/python-list