On 12 Feb 2018, at 19:35, Dong Jiang <dji...@dataxu.com> wrote:

I got no error messages from EMR. We write directly from a DataFrame to S3. There
doesn't appear to be an issue with the S3 object itself: we can still download the
Parquet file and read most of the columns; only one column is corrupted in the Parquet.
I suspect we need to write to HDFS first, make sure we can read back the entire
data set, and then copy from HDFS to S3. Any other thoughts?
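A minimal sketch of that HDFS-first workflow in Spark/Scala, with a stand-in DataFrame and hypothetical paths; the final copy step assumes EMR's s3-dist-cp tool:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()

// Stand-in for the real DataFrame being written out.
val df = spark.range(1000L).toDF("id")

// Hypothetical staging location on the cluster's HDFS.
val hdfsPath = "hdfs:///tmp/staging/my_table"

// 1. Write to HDFS first.
df.write.mode("overwrite").parquet(hdfsPath)

// 2. Read the whole dataset back and force every row (and so every column)
//    to be materialised. DataFrame.count() alone can be answered from the
//    Parquet footer metadata, so count the underlying RDD of Rows instead.
val readBack = spark.read.parquet(hdfsPath)
val rowsRead = readBack.rdd.count()
require(rowsRead == df.count(), s"row count mismatch after write: $rowsRead")

// 3. Only after that check passes, copy HDFS -> S3 outside Spark, e.g. on EMR:
//    s3-dist-cp --src hdfs:///tmp/staging/my_table --dest s3://my-bucket/my_table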


The S3 object store clients mostly buffer to a local temp filesystem before they
write, at least all the ASF connectors do, so that data can be PUT/POSTed in 5+ MB
blocks without requiring enough heap to buffer all the data written by all
threads. That buffering is done to file://, not HDFS. So even if you do that copy up
later from HDFS to S3, there is still going to be that local HDD buffering: it is not
going to fix the problem, not if this really is corrupted local HDD data.
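If the local buffering itself is the suspect, its location and mechanism are configurable. A sketch below, assuming the ASF s3a connector (Hadoop 2.8+) rather than EMR's default EMRFS client, whose settings are different; the paths and values are illustrative only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-buffer-config")
  // On Hadoop 2.x the incremental block upload has to be switched on first.
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // Buffer multipart blocks on heap ("array") or off-heap ("bytebuffer")
  // instead of the default "disk", taking a suspect local HDD out of the
  // write path entirely.
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
  // Alternatively keep disk buffering but point it at a known-good volume:
  // .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt1/s3a-buffer")
  // Size of each multipart PUT block, in bytes (must be at least 5 MB).
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
  .getOrCreate()

Memory buffering of course trades back the heap pressure the disk buffering is there to avoid, so it is more of a diagnostic aid on a small job than a production setting.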
