On Wednesday, 14 December 2016 12:57:23 UTC, Chris Angelico wrote:
> Is the "[Component]" section something you could verify? (That is - is
> there a known list of components?) If so, I would include that as a
> secondary check. Ditto anything else you can check (I'm guessing the
> [level] is one of a small set of values too.)
Possibly, although this is to analyze the structure of a basically
undocumented log format. So if I validate too tightly, I end up just
checking my assumptions rather than checking the data :-(

> The logic would be something like this:
>
> Read line from file.
> Verify line as a potential record:
>     Assert that line begins with timestamp.
>     Verify as many fields as possible (component, level, etc)
>     Search line for additional timestamp.
>     If additional timestamp found:
>         Recurse. If verification fails, assume we didn't really have a
>         corrupted line.
>         (Process partial line? Or discard?)
> If "[[" in line:
>     Until line is "]]":
>         Read line from file, append to description
>         If timestamp found:
>             Recurse. If verification succeeds, break out of loop.
>
> Unfortunately it's still not really clean; but that's the nature of
> working with messy data. Coping with ambiguity is *hard*.

Yeah, that's essentially what I have now. As I say, it's working but
nobody could really love it. But you're right, it's more the fault of
the data than of the code.

One thought I had, which I might try, is to go with the timestamp as the
one assumption I make about the data, and read the file in as, in effect,
a text stream, spitting out a record every time I see something matching
the [timestamp] pattern. Then parse record by record. Truncated records
should either be obvious (because the delimited fields have start and end
markers, so unmatched markers = truncated record) or acceptable (because
undelimited fields are free text). I'm OK with ignoring the possibility
that the free text contains something that looks like a timestamp.

The only problem with this approach is that I have more data than I'd
really like to read into memory all at once, so I'd need to do some sort
of streamed match/split processing. But thinking about it, that sounds
like the sort of job a series of chained generators could manage. Maybe
I'll look at that approach...

Paul
--
https://mail.python.org/mailman/listinfo/python-list
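For what it's worth, here is a minimal sketch of how that chained-generator
split might look. The timestamp regex, chunk size, and file name are
assumptions made up for illustration (the real log format isn't shown in
this thread); the point is just that each stage consumes the previous one
lazily, so memory use is bounded by the largest record rather than by the
whole file.

import re

# Assumed timestamp pattern -- a placeholder, since the real format isn't
# documented; adjust to whatever the actual [timestamp] field looks like.
TIMESTAMP = re.compile(r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]")

def chunks(f, size=64 * 1024):
    """Stage 1: stream the file in fixed-size chunks."""
    while True:
        chunk = f.read(size)
        if not chunk:
            return
        yield chunk

def records(chunk_iter):
    """Stage 2: split the character stream into records, starting a new
    record at every timestamp match and carrying the partial tail across
    chunk boundaries."""
    buf = ""
    for chunk in chunk_iter:
        buf += chunk
        starts = [m.start() for m in TIMESTAMP.finditer(buf)]
        if len(starts) > 1:
            # Everything between consecutive timestamps is a complete
            # record; anything before the first one is discarded as junk.
            for a, b in zip(starts, starts[1:]):
                yield buf[a:b]
            buf = buf[starts[-1]:]
    # Flush the final (possibly truncated) record, if any.
    m = TIMESTAMP.search(buf)
    if m:
        yield buf[m.start():]

# Usage: chain the generators, then parse record by record.
# with open("app.log") as f:          # "app.log" is a made-up name
#     for rec in records(chunks(f)):
#         fields, problems = parse_record(rec)   # see the next sketch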
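And one possible shape for the per-record parse, with the "unmatched
markers = truncated record" check and Chris's secondary field validation
folded in. The field layout, the [[ ]] marker check, and the set of known
levels are all guesses for illustration, not the real format.

import re

# Assumed record layout: [timestamp] [component] [level] free text.
# Purely illustrative -- the real field order and delimiters aren't
# documented in the thread.
RECORD = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s*"
    r"\[(?P<component>[^\]]+)\]\s*"
    r"\[(?P<level>[^\]]+)\]\s*"
    r"(?P<text>.*)",
    re.DOTALL,
)

KNOWN_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}  # assumed values

def parse_record(rec):
    """Parse one record; return (fields, problems) so that suspect
    records can be reported rather than silently dropped."""
    problems = []
    m = RECORD.match(rec)
    if not m:
        return None, ["record does not match the expected field layout"]
    fields = m.groupdict()
    # Unbalanced "[[" / "]]" markers suggest a truncated record.
    if fields["text"].count("[[") != fields["text"].count("]]"):
        problems.append("unbalanced [[ ]] markers (truncated record?)")
    # Loose secondary check on [level], per Chris's suggestion -- kept
    # loose so it checks the data rather than just my assumptions.
    if fields["level"].upper() not in KNOWN_LEVELS:
        problems.append("unexpected level: %r" % fields["level"])
    return fields, problems

Returning the problems alongside the fields keeps the "corrupted versus
just messy" decision out of the parsing layer, which fits the aim of
analysing the data rather than validating assumptions about it.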