RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Ok thanks. Actually we ran something very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Sure. I would just like to remove ones that fail the basic checks done by the Parquet readFooters function, in that their length is wrong or magic number is incorrect, which throws exceptions in the read method. Errors like: java.io.IOException: Could not read footer: java.lang.RuntimeExcepti
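
For illustration only, the two checks mentioned above (file length and magic number) can be approximated cheaply without going through Spark. The sketch below is not the Parquet library's own code. It relies on the Parquet file layout, which begins and ends with the 4-byte magic `PAR1` and stores a 4-byte little-endian footer length just before the trailing magic. The function name `looks_like_parquet` is hypothetical, and the code assumes the files are reachable through the local filesystem; files on HDFS would need a filesystem client instead of `open`.

```python
import os
import struct

MAGIC = b"PAR1"   # Parquet files start and end with these 4 bytes
MIN_SIZE = 12     # leading magic + footer length field + trailing magic


def looks_like_parquet(path):
    """Hypothetical helper: cheap sanity check on a local file.

    Returns False if the file is missing, too short, lacks the magic
    bytes at either end, or declares a footer longer than the file
    can hold. It does not parse or validate the footer metadata.
    """
    try:
        size = os.path.getsize(path)
        if size < MIN_SIZE:
            return False
        with open(path, "rb") as f:
            if f.read(4) != MAGIC:
                return False
            # Last 8 bytes: 4-byte footer length, then the magic.
            f.seek(-8, os.SEEK_END)
            tail = f.read(8)
    except OSError:
        return False

    if tail[4:] != MAGIC:
        return False
    footer_len = struct.unpack("<I", tail[:4])[0]
    # The footer must fit between the leading magic and the length field.
    return footer_len <= size - MIN_SIZE
```

A list of candidate paths could then be filtered with something like `[p for p in paths if looks_like_parquet(p)]` before handing the survivors to the reader. Because only 12 bytes per file are read, this should be much cheaper than reading every footer, though a file can pass this check and still have corrupt metadata.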

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Ah, yes, I see that it has been turned off now, that’s why it wasn’t working. Thank you, this is helpful! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? From: Cheng Lian [mailto

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Dear Michael, Thank you very much for your help. I should have mentioned in my original email that I did try the sequence notation. It doesn’t seem to have the desired effect. Maybe I should say that each one of these files has a different schema. When I use that call, I’m not ending up with a