from:"jordan.thomas"

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas

Ok thanks. Actually we ran something very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas

Sure. I would just like to remove the ones that fail the basic checks done by the Parquet readFooters function, i.e. files whose length is wrong or whose magic number is incorrect, which throws exceptions in the read method. Errors like: java.io.IOException: Could not read footer: java.lang.RuntimeExcepti
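
For what it's worth, the length and magic-number checks mentioned above can be approximated cheaply without reading any column data. The following is only a minimal sketch, assuming the files are reachable through the local filesystem (for HDFS the same byte reads would have to go through a Hadoop client instead). The function name `looks_like_parquet` is hypothetical and is not part of Spark or Parquet. It relies only on the documented file layout: a 4-byte `PAR1` magic at the start, and a 4-byte little-endian footer length followed by `PAR1` at the end. It does not validate the footer metadata itself, so a file that passes may still fail a full read.

```python
import os
import struct

MAGIC = b"PAR1"
# Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
MIN_SIZE = len(MAGIC) + 4 + len(MAGIC)


def looks_like_parquet(path):
    """Cheap structural check (hypothetical helper, not a Parquet API).

    Returns False if the file is unreadable, too short, lacks the PAR1
    magic at either end, or declares a footer length that cannot fit.
    """
    try:
        size = os.path.getsize(path)
        if size < MIN_SIZE:
            return False
        with open(path, "rb") as f:
            if f.read(len(MAGIC)) != MAGIC:
                return False
            # Last 8 bytes: footer length (little-endian uint32) + magic.
            f.seek(size - 8)
            tail = f.read(8)
    except OSError:
        return False

    if len(tail) != 8 or tail[4:] != MAGIC:
        return False

    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must be non-empty and fit between the leading magic
    # and the trailing 8 bytes.
    return 0 < footer_len <= size - MIN_SIZE
```

A list of candidate paths could then be filtered with something like `good = [p for p in paths if looks_like_parquet(p)]` before handing them to the reader. Since each check reads only 12 bytes per file, it should be much cheaper than attempting a full read and catching the exception.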

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas

Ah, yes, I see that it has been turned off now, that’s why it wasn’t working. Thank you, this is helpful! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? From: Cheng Lian [mailto

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas

Dear Michael, Thank you very much for your help. I should have mentioned in my original email, I did try the sequence notation. It doesn’t seem to have the desired effect. Maybe I should say that each one of these files has a different schema. When I use that call, I’m not ending up with a

4 matches
