On Wednesday, 14 December 2016 12:57:23 UTC, Chris Angelico wrote:
> Is the "[Component]" section something you could verify? (That is - is
> there a known list of components?) If so, I would include that as a
> secondary check. Ditto anything else you can check (I'm guessing the
> [level] is one of a small set of values too.)
Possibly, although this is to analyze the structure of a basically
undocumented log format. So if I validate too tightly, I end up just
checking my assumptions rather than checking the data :-(

> The logic would be something like this:
>
> Read line from file.
> Verify line as a potential record:
>     Assert that line begins with timestamp.
>     Verify as many fields as possible (component, level, etc)
>     Search line for additional timestamp.
>     If additional timestamp found:
>         Recurse. If verification fails, assume we didn't really have a
>         corrupted line.
>         (Process partial line? Or discard?)
> If "[[" in line:
>     Until line is "]]":
>         Read line from file, append to description
>         If timestamp found:
>             Recurse. If verification succeeds, break out of loop.
>
> Unfortunately it's still not really clean; but that's the nature of
> working with messy data. Coping with ambiguity is *hard*.

Yeah, that's essentially what I have now. As I say, it's working but
nobody could really love it. But you're right, it's more the fault of
the data than of the code.

One thought I had, which I might try, is to go with the timestamp as the
one assumption I make about the data, and read the file in as, in effect,
a text stream, spitting out a record every time I see something matching
the [timestamp] pattern. Then parse record by record. Truncated records
should either be obvious (because the delimited fields have start and end
markers, so unmatched markers = truncated record) or acceptable (because
undelimited fields are free text). I'm OK with ignoring the possibility
that the free text contains something that looks like a timestamp.

The only problem with this approach is that I have more data than I'd
really like to read into memory all at once, so I'd need to do some sort
of streamed match/split processing. But thinking about it, that sounds
like the sort of job a series of chained generators could manage. Maybe
I'll look at that approach...

Paul
--
https://mail.python.org/mailman/listinfo/python-list
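For what it's worth, here is a minimal sketch of how that chained-generator
split might look. The timestamp regex, chunk size, and file name are
assumptions made up for illustration (the real log format isn't shown in
this thread); the point is just that each stage consumes the previous one
lazily, so memory use is bounded by the largest record rather than by the
whole file.

import re

# Assumed timestamp pattern -- a placeholder, since the real format isn't
# documented; adjust to whatever the actual [timestamp] field looks like.
TIMESTAMP = re.compile(r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]")

def chunks(f, size=64 * 1024):
    """Stage 1: stream the file in fixed-size chunks."""
    while True:
        chunk = f.read(size)
        if not chunk:
            return
        yield chunk

def records(chunk_iter):
    """Stage 2: split the character stream into records, starting a new
    record at every timestamp match and carrying the partial tail across
    chunk boundaries."""
    buf = ""
    for chunk in chunk_iter:
        buf += chunk
        starts = [m.start() for m in TIMESTAMP.finditer(buf)]
        if len(starts) > 1:
            # Everything between consecutive timestamps is a complete
            # record; anything before the first one is discarded as junk.
            for a, b in zip(starts, starts[1:]):
                yield buf[a:b]
            buf = buf[starts[-1]:]
    # Flush the final (possibly truncated) record, if any.
    m = TIMESTAMP.search(buf)
    if m:
        yield buf[m.start():]

# Usage: chain the generators, then parse record by record.
# with open("app.log") as f:          # "app.log" is a made-up name
#     for rec in records(chunks(f)):
#         fields, problems = parse_record(rec)   # see the next sketch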
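And one possible shape for the per-record parse, with the "unmatched
markers = truncated record" check and Chris's secondary field validation
folded in. The field layout, the [[ ]] marker check, and the set of known
levels are all guesses for illustration, not the real format.

import re

# Assumed record layout: [timestamp] [component] [level] free text.
# Purely illustrative -- the real field order and delimiters aren't
# documented in the thread.
RECORD = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s*"
    r"\[(?P<component>[^\]]+)\]\s*"
    r"\[(?P<level>[^\]]+)\]\s*"
    r"(?P<text>.*)",
    re.DOTALL,
)

KNOWN_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}  # assumed values

def parse_record(rec):
    """Parse one record; return (fields, problems) so that suspect
    records can be reported rather than silently dropped."""
    problems = []
    m = RECORD.match(rec)
    if not m:
        return None, ["record does not match the expected field layout"]
    fields = m.groupdict()
    # Unbalanced "[[" / "]]" markers suggest a truncated record.
    if fields["text"].count("[[") != fields["text"].count("]]"):
        problems.append("unbalanced [[ ]] markers (truncated record?)")
    # Loose secondary check on [level], per Chris's suggestion -- kept
    # loose so it checks the data rather than just my assumptions.
    if fields["level"].upper() not in KNOWN_LEVELS:
        problems.append("unexpected level: %r" % fields["level"])
    return fields, problems

Returning the problems alongside the fields keeps the "corrupted versus
just messy" decision out of the parsing layer, which fits the aim of
analysing the data rather than validating assumptions about it.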