Ok thanks. Actually we ran something very similar this weekend. It works but
is very slow.
The Spark method I included in my original post is about 5-6 times faster.
Just wondering if there is something even faster than that. I see this as
being a recurring problem over the next few months.
Sure. I would just like to remove the ones that fail the basic checks done by
the Parquet readFooters function, i.e. files whose length is wrong or whose
magic number is incorrect, which throws exceptions in the read method.
Errors like:
java.io.IOException: Could not read footer: java.lang.RuntimeExcepti
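For what it's worth, the length and magic-number checks mentioned above can be approximated without invoking the Parquet reader at all. The following is only a rough sketch, not the Parquet library's own code. It relies on the documented file layout (a 4-byte "PAR1" magic at the start, and a 4-byte little-endian footer length followed by "PAR1" at the end). The function name looks_like_parquet is made up for illustration, and it assumes the files are reachable as local paths; on HDFS the same reads would have to go through a filesystem client instead. Files written with an encrypted footer use a different magic and would be rejected by this check.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte magic at both ends of a plain (unencrypted-footer) Parquet file


def looks_like_parquet(path):
    """Cheap structural check: magic at both ends and a footer length that fits.

    Hypothetical helper for illustration; it does not parse the footer metadata.
    """
    try:
        size = os.path.getsize(path)
        # Smallest possible layout: head magic + footer length + tail magic = 12 bytes
        if size < 12:
            return False
        with open(path, "rb") as f:
            if f.read(4) != MAGIC:
                return False
            f.seek(size - 8)
            tail = f.read(8)
    except OSError:
        # Missing or unreadable files are treated as bad
        return False
    if tail[4:] != MAGIC:
        return False
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer has to fit between the head magic and the 8-byte tail
    return 0 < footer_len <= size - 12
```

This only reads 12 bytes per file, so it should be much cheaper than a full footer read, but a file can pass it and still have corrupt metadata. It is a pre-filter, not a replacement for the real check.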
Ah, yes, I see that it has been turned off now, that’s why it wasn’t working.
Thank you, this is helpful! The problem now is to filter out bad (miswritten)
Parquet files, as they are causing this operation to fail.
Any suggestions on detecting them quickly and easily?
From: Cheng Lian [mailto
Dear Michael,
Thank you very much for your help.
I should have mentioned in my original email, I did try the sequence notation.
It doesn’t seem to have the desired effect. Maybe I should say that each one
of these files has a different schema. When I use that call, I’m not ending up
with a