Re: Corrupt parquet file

2018-02-13 Thread Steve Loughran
On 12 Feb 2018, at 20:21, Ryan Blue <rb...@netflix.com> wrote: I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when memory holding a value is corrupted just before it is handed off to the writer? …

Re: Corrupt parquet file

2018-02-12 Thread Ryan Blue
I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when memory holding a value is corrupted just before it is handed off to the writer? That's why we track …
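The snippet breaks off at "we track", but the defense being described is to record a checksum of each output file at write time and re-verify it on read, so a bit flip anywhere in the pipeline is caught. A minimal stdlib sketch of that idea (the file name and the simulated bit flip are illustrative, not a real parquet file):

```python
import hashlib
import os
import tempfile

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 so large files fit in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Record the checksum at write time, then verify before trusting the copy.
path = os.path.join(tempfile.mkdtemp(), "part-00000.parquet")
with open(path, "wb") as f:
    f.write(b"PAR1" + b"\x00" * 64 + b"PAR1")  # stand-in bytes only

recorded = file_sha256(path)

# Simulate a single flipped byte, as a failing DIMM might produce.
with open(path, "r+b") as f:
    f.seek(10)
    f.write(b"\xff")

assert file_sha256(path) != recorded  # corruption is detected on re-check
```

Checksumming after the bytes leave the writer is what makes this robust: it does not matter whether the corruption happened in memory, in the writer, or in transit, as long as the recorded digest was taken from the bytes you intended to persist.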

Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
On 12 Feb 2018, at 19:35, Dong Jiang <dji...@dataxu.com> wrote: I got no error messages from EMR. We write directly from dataframe to S3. There doesn’t appear to be an issue with the S3 file; we can still download the parquet file and read most of the columns, just one column is corrupted in p…

Re: Corrupt parquet file

2018-02-12 Thread Dong Jiang
…back the entire data set, and then copy from HDFS to S3. Any other thoughts? From: Steve Loughran Date: Monday, February 12, 2018 at 2:27 PM To: "rb...@netflix.com" Cc: Dong Jiang, Apache Spark Dev Subject: Re: Corrupt parquet file What failure mode is likely here? As the uploads …

Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
Reply-To: "rb...@netflix.com" <rb...@netflix.com> Date: Monday, February 5, 2018 at 1:34 PM To: Dong Jiang <dji...@dataxu.com> Cc: Spark Dev List <dev@spark.apache.org> Subject: Re: Corrupt parquet file We ensure the bad node is removed from our cluster and reprocess to replace the data. …

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
Date: Monday, February 5, 2018 at 1:34 PM To: Dong Jiang Cc: Spark Dev List Subject: Re: Corrupt parquet file We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem. …

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
…before, what do you do to prevent a recurrence? Thanks, Dong From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Monday, February 5, 2018 at 12:46 PM To: Dong Jiang Cc: Spark Dev List Subject: Re: Corrupt parquet file If you can still access the logs, then you should be able to find where the write task ran. …

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
…"rb...@netflix.com" Date: Monday, February 5, 2018 at 1:34 PM To: Dong Jiang Cc: Spark Dev List Subject: Re: Corrupt parquet file We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem. We've d…

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
Date: Monday, February 5, 2018 at 12:46 PM To: Dong Jiang Cc: Spark Dev List Subject: Re: Corrupt parquet file If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and op…
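Ryan's suggestion (find the node from the logs) usually comes down to correlating the corrupt part file's name with the executor that committed it. A stdlib sketch of that correlation; the log line format below is purely illustrative, since real Spark/EMR executor logs vary by version and cluster manager:

```python
import re
from typing import Optional

# Illustrative log excerpt only; real executor logs differ in shape.
LOG = """\
18/02/05 12:02:11 INFO Executor: task 1734.0 on ip-10-0-1-17.ec2.internal
18/02/05 12:02:14 INFO FileOutputCommitter: committed part-00042-abc.snappy.parquet
18/02/05 12:03:09 INFO Executor: task 1735.0 on ip-10-0-1-23.ec2.internal
18/02/05 12:03:12 INFO FileOutputCommitter: committed part-00043-def.snappy.parquet
"""

def host_for_part(log: str, part_file: str) -> Optional[str]:
    """Walk log lines in order, remembering the last host seen before the commit."""
    host = None
    for line in log.splitlines():
        m = re.search(r"on (\S+\.internal)", line)
        if m:
            host = m.group(1)
        if part_file in line:
            return host
    return None

print(host_for_part(LOG, "part-00042-abc.snappy.parquet"))  # prints the suspect host
```

Once the host is identified, the remediation in this thread applies: pull the instance out of the cluster (and, on EMR, map the internal hostname back to an instance ID) before reprocessing the data.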

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
…_SUCCESS file, does that suggest all data is good? How can we prevent a recurrence? Can you share your experience? Thanks, Dong From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Monday, February 5, 2018…

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
…a recurrence? Can you share your experience? Thanks, Dong From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Monday, February 5, 2018 at 12:38 PM To: Dong Jiang Cc: Spark Dev List Subject: Re: Corrupt parquet file Dong, We see this from time to time as well. In my experience, it is almost always caused by a bad node. …

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
Dong, We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible. As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is …