In that case, I'd recommend tracking down the node where the files were created and reporting it to EMR.
On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang <dji...@dataxu.com> wrote:

> Thanks for the response, Ryan.
>
> We have transient EMR clusters, and we do rerun a cluster whenever it
> fails. However, in this particular case, the cluster succeeded without
> reporting any errors. I was able to null out the corrupted column and
> recover the rest of the 133 columns. I do feel the issue occurs more often
> than 1-2 times a year. This is the second time I am aware of the issue
> within a month, and we certainly don't run as large a data infrastructure
> as Netflix.
>
> I will keep an eye on this issue.
>
> Thanks,
> Dong
>
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 1:34 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> We ensure the bad node is removed from our cluster and reprocess to
> replace the data. We only see this once or twice a year, so it isn't a
> significant problem.
>
> We've discussed options for adding write-side validation, but it is
> expensive and still unreliable if you don't trust the hardware.
>
> rb
>
> On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi, Ryan,
>
> Do you have any suggestions on how we could detect and prevent this issue?
> This is the second time we have encountered it. We have a wide table, with
> 134 columns in the file. The issue seems to impact only one column and is
> very hard to detect. It seems you have encountered this issue before; what
> do you do to prevent a recurrence?
>
> Thanks,
> Dong
>
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 12:46 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> If you can still access the logs, then you should be able to find where
> the write task ran. Maybe you can get an instance ID and open a ticket
> with Amazon. Otherwise, it will probably start failing the HW checks when
> the instance hardware is reused, so I wouldn't worry about it.
>
> The _SUCCESS file convention means that the job ran successfully, at least
> to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
> indicate actual job success (you could do other tasks after that fail),
> and it carries no guarantee about the data that was written.
>
> rb
>
> On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi, Ryan,
>
> Many thanks for your quick response.
> We ran Spark on transient EMR clusters. Nothing in the logs or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on S3. If we see the _SUCCESS file, does that suggest all data is
> good?
> How can we prevent a recurrence? Can you share your experience?
>
> Thanks,
> Dong
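A rough sketch of the recovery Dong describes at the top of this thread,
nulling out the corrupted column and keeping the other 133: it assumes the
Spark Scala API with an active `spark` session, and the paths, column name,
and column type below are placeholders rather than values confirmed in the
thread. Because Parquet is columnar, dropping the column before any action
means its damaged pages are never decoded, which is why the rest of the file
stays readable.

    import org.apache.spark.sql.functions.lit

    // Placeholders -- substitute the real file/output paths and column name.
    val corruptedPath = "s3://bucket/table/part-00122-....snappy.parquet"
    val recoveredPath = "s3://bucket/table_recovered/"
    val badColumn     = "incoming_aliases_array"

    // Read everything except the corrupted column; its pages are never decoded.
    val good = spark.read.parquet(corruptedPath).drop(badColumn)

    // Re-add the column as nulls so the schema matches the original table.
    // The cast below is a placeholder -- use the column's real type.
    good.withColumn(badColumn, lit(null).cast("array<map<string,string>>"))
        .write.parquet(recoveredPath)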
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rb...@netflix.com" <rb...@netflix.com>
> Date: Monday, February 5, 2018 at 12:38 PM
> To: Dong Jiang <dji...@dataxu.com>
> Cc: Spark Dev List <dev@spark.apache.org>
> Subject: Re: Corrupt parquet file
>
> Dong,
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file
> was written and remove that node as soon as possible.
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense, and corruption in encoded values
> often looks like different data. When you see a decoding exception like
> this, we find it is usually that the compressed data was corrupted and is
> no longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
> Even if you could find a single record that was affected, that's not
> valuable, because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess the data so we know
> everything is correct from the upstream source.
>
> rb
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dji...@dataxu.com> wrote:
>
> Hi,
>
> We are running Spark 2.2.1 and generating parquet files with the following
> pseudo code:
>
>     df.write.parquet(...)
>
> We have recently noticed parquet file corruption when reading the parquet
> in Spark or Presto, such as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears that only one column in one of the rows in the file is corrupt;
> the file has 111041 rows.
>
> My questions are:
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Is it a Spark issue or a Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
> Dong

--
Ryan Blue
Software Engineer
Netflix
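On the write-side validation idea Ryan mentions above, a cheap approximation
is to read the freshly written output back and force every column to decode
before declaring the job successful. The sketch below assumes a Spark Scala
job where `spark` is the active SparkSession, `df` is the DataFrame being
written, and the output path is a placeholder; as discussed in the thread,
this adds cost and still cannot catch corruption that decodes cleanly into
wrong values, but it does surface a ParquetDecodingException before
downstream readers such as Presto hit the bad file.

    val out = "s3://bucket/table/"    // placeholder output path
    val expectedRows = df.count()     // source-side row count (recomputes df; fine for a sketch)

    df.write.parquet(out)

    // Re-read the output and force a full decode of every column. A plain
    // count() on a Parquet source may be answered from footer metadata without
    // touching column data, so go through the row RDD instead.
    val decodedRows = spark.read.parquet(out).rdd.count()

    require(decodedRows == expectedRows,
      s"Row count mismatch after writing $out: $decodedRows vs $expectedRows")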