Re: Corrupt parquet file

Ryan Blue Mon, 05 Feb 2018 09:46:47 -0800

If you can still access the logs, then you should be able to find where the
write task ran. Maybe you can get an instance ID and open a ticket with
Amazon. Otherwise, it will probably start failing the HW checks when the
instance hardware is reused, so I wouldn't worry about it.


The _SUCCESS file convention means that the job ran successfully, at least
to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
indicate actual job success (tasks that run after it is written could still
fail), and it carries no guarantee about the data that was written.
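
Since _SUCCESS says nothing about the bytes that were written, one option is
to read every output file back in full before trusting it. The following is
only a minimal sketch of that idea, not something described in this thread.
It assumes the part files have been copied to a local directory and that the
optional pyarrow package is available. The names read_with_pyarrow and
find_unreadable_files are hypothetical. A clean result shows only that the
files decode; corruption that still decodes to plausible values would go
unnoticed.

```python
from pathlib import Path
from typing import Callable, Dict


def read_with_pyarrow(path: Path) -> int:
    """Decode every column of one Parquet file and return its row count.

    Assumption: the optional pyarrow package is installed. It is imported
    lazily so the rest of this sketch works without it.
    """
    import pyarrow.parquet as pq

    return pq.read_table(str(path)).num_rows


def find_unreadable_files(
    output_dir: Path,
    read_file: Callable[[Path], int] = read_with_pyarrow,
) -> Dict[str, str]:
    """Fully read each *.parquet file directly under output_dir.

    Returns a mapping of file name -> error message for every file that
    fails to decode. Files such as _SUCCESS are skipped by the glob.
    """
    failures: Dict[str, str] = {}
    for path in sorted(output_dir.glob("*.parquet")):
        try:
            read_file(path)
        except Exception as exc:  # any decode error marks the file as suspect
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling find_unreadable_files(Path("/some/local/copy")) returns an empty dict
when every part file decodes. Any entry names a file that should be
regenerated from the upstream source rather than repaired.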

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <[email protected]> wrote:

> Hi, Ryan,
>
>
>
> Many thanks for your quick response.
>
> We ran Spark on transient EMR clusters. Nothing in the log or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on S3. If we see the _SUCCESS file, does that suggest all data is
> good?
>
> How can we prevent a recurrence? Can you share your experience?
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Monday, February 5, 2018 at 12:38 PM
> *To: *Dong Jiang <[email protected]>
> *Cc: *Spark Dev List <[email protected]>
> *Subject: *Re: Corrupt parquet file
>
>
>
> Dong,
>
>
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file was
> written and remove that node as soon as possible.
>
>
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense, and corruption in encoded values
> often looks like different data. When we see a decoding exception like
> this, it is usually because the compressed data was corrupted and is no
> longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
>
>
> Even if you could find a single record that was affected, that's not
> valuable because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess data so we know
> everything is correct from the upstream source.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <[email protected]> wrote:
>
> Hi,
>
> We are running Spark 2.2.1 and generating Parquet files with code like the
> following pseudocode:
> df.write.parquet(...)
> We have recently noticed Parquet file corruption when reading the files in
> Spark or Presto, with errors like the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears that only one column in one row of the file is corrupt; the
> file has 111041 rows.
>
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix
