In that case, I'd recommend tracking down the node where the files were created and reporting it to EMR.
On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang <dji...@dataxu.com> wrote:

> Thanks for the response, Ryan.
>
> We have transient EMR clusters, and we do rerun a cluster whenever it
> fails. However, in this particular case, the cluster succeeded without
> reporting any errors. I was able to null out the corrupted column and
> recover the rest of the 133 columns. I do feel the issue occurs more often
> than 1-2 times a year. This is the second time I am aware of the issue
> within a month, and we certainly don't run as large a data infrastructure
> as Netflix.
>
> I will keep an eye on this issue.
>
> Thanks,
> Dong
>
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 1:34 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> We ensure the bad node is removed from our cluster and reprocess to
> replace the data. We only see this once or twice a year, so it isn't a
> significant problem.
>
> We've discussed options for adding write-side validation, but it is
> expensive and still unreliable if you don't trust the hardware.
>
> rb
>
> On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi, Ryan,
>
> Do you have any suggestions on how we could detect and prevent this issue?
> This is the second time we have encountered it. We have a wide table, with
> 134 columns in the file. The issue seems to impact only one column and is
> very hard to detect. It seems you have encountered this issue before; what
> do you do to prevent a recurrence?
>
> Thanks,
> Dong
>
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 12:46 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> If you can still access the logs, then you should be able to find where
> the write task ran. Maybe you can get an instance ID and open a ticket
> with Amazon. Otherwise, it will probably start failing the HW checks when
> the instance hardware is reused, so I wouldn't worry about it.
>
> The _SUCCESS file convention means that the job ran successfully, at least
> to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
> indicate actual job success (you could do other tasks after that fail),
> and it carries no guarantee about the data that was written.
>
> rb
>
> On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi, Ryan,
>
> Many thanks for your quick response.
> We ran Spark on transient EMR clusters. Nothing in the logs or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on S3. If we see the _SUCCESS file, does that suggest all data is
> good?
> How can we prevent a recurrence? Can you share your experience?
>
> Thanks,
> Dong
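A rough sketch of the recovery Dong describes at the top of this thread,
nulling out the corrupted column and keeping the other 133: it assumes the
Spark Scala API with an active `spark` session, and the paths, column name,
and column type below are placeholders rather than values confirmed in the
thread. Because Parquet is columnar, dropping the column before any action
means its damaged pages are never decoded, which is why the rest of the file
stays readable.

    import org.apache.spark.sql.functions.lit

    // Placeholders -- substitute the real file/output paths and column name.
    val corruptedPath = "s3://bucket/table/part-00122-....snappy.parquet"
    val recoveredPath = "s3://bucket/table_recovered/"
    val badColumn     = "incoming_aliases_array"

    // Read everything except the corrupted column; its pages are never decoded.
    val good = spark.read.parquet(corruptedPath).drop(badColumn)

    // Re-add the column as nulls so the schema matches the original table.
    // The cast below is a placeholder -- use the column's real type.
    good.withColumn(badColumn, lit(null).cast("array<map<string,string>>"))
        .write.parquet(recoveredPath)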
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 12:38 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> Dong,
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file
> was written and remove that node as soon as possible.
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense, and corruption in encoded values
> often looks like different data. When you see a decoding exception like
> this, we find it is usually that the compressed data was corrupted and is
> no longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
> Even if you could find a single record that was affected, that's not
> valuable, because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess the data so we know
> everything is correct from the upstream source.
>
> rb
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi,
>
> We are running Spark 2.2.1 and generating parquet files with the following
> pseudo code:
>
>     df.write.parquet(...)
>
> We have recently noticed parquet file corruption when reading the parquet
> in Spark or Presto, such as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears that only one column in one of the rows in the file is corrupt;
> the file has 111041 rows.
>
> My questions are:
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Is it a Spark issue or a Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
> Dong

--
Ryan Blue
Software Engineer
Netflix
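On the write-side validation idea Ryan mentions above, a cheap approximation
is to read the freshly written output back and force every column to decode
before declaring the job successful. The sketch below assumes a Spark Scala
job where `spark` is the active SparkSession, `df` is the DataFrame being
written, and the output path is a placeholder; as discussed in the thread,
this adds cost and still cannot catch corruption that decodes cleanly into
wrong values, but it does surface a ParquetDecodingException before
downstream readers such as Presto hit the bad file.

    val out = "s3://bucket/table/"    // placeholder output path
    val expectedRows = df.count()     // source-side row count (recomputes df; fine for a sketch)

    df.write.parquet(out)

    // Re-read the output and force a full decode of every column. A plain
    // count() on a Parquet source may be answered from footer metadata without
    // touching column data, so go through the row RDD instead.
    val decodedRows = spark.read.parquet(out).rdd.count()

    require(decodedRows == expectedRows,
      s"Row count mismatch after writing $out: $decodedRows vs $expectedRows")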