Hi,

We are running Spark 2.2.1 and generating Parquet files with code along the lines of this pseudo code:

    df.write.parquet(...)
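For context, a minimal self-contained version of our write path looks roughly like the sketch below. The schema and output path are illustrative, not our real job; the nested array-of-map column is shaped like the incoming_aliases_array column that appears in the error further down.

    import org.apache.spark.sql.SparkSession

    object WriteParquetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-parquet-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // A column of array-of-map values; in Parquet this becomes the
        // nested path incoming_aliases_array.list.element.key_value.value,
        // matching the column named in the decoding error.
        val df = Seq(
          (1L, Seq(Map("alias" -> "a1"), Map("alias" -> "a2"))),
          (2L, Seq(Map("alias" -> "b1")))
        ).toDF("id", "incoming_aliases_array")

        // Plain Parquet write with the default snappy compression.
        df.write.parquet("/tmp/parquet-example")

        spark.stop()
      }
    }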
We have recently noticed Parquet file corruption when reading these files back in Spark or Presto, with errors like the following:

    Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 40870 in block 0 in file file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
    Caused by: org.apache.parquet.io.ParquetDecodingException: could not read page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594] in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears that only one column, in one of the rows, is corrupt; the file has 111041 rows in total. My questions are:

1) How can I identify the corrupted row?
2) What could cause the corruption: a Spark issue or a Parquet issue?

Any help is greatly appreciated.

Thanks,
Dong
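P.S. To make question 1 more concrete: the kind of thing we imagine doing is bisecting the file by row position until we bracket the failing values, along the lines of the untested sketch below. It assumes that limit() stops scanning before the corrupt page is decoded, which we are not certain holds; if there is a better tool for this (e.g. parquet-tools), we would love to hear about it.

    import org.apache.spark.sql.SparkSession
    import scala.util.Try

    object FindCorruptRows {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("find-corrupt-rows")
          .master("local[*]")
          .getOrCreate()

        val path = "file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet"

        // True if reading the first n rows of the suspect column fails.
        def failsAt(n: Int): Boolean =
          Try(spark.read.parquet(path)
                .select("incoming_aliases_array")
                .limit(n)
                .count()).isFailure

        // Binary search for the smallest prefix that still fails,
        // assuming failures are monotone in the prefix length.
        var lo = 1
        var hi = 111041  // total rows in the file
        while (lo < hi) {
          val mid = lo + (hi - lo) / 2
          if (failsAt(mid)) hi = mid else lo = mid + 1
        }
        println(s"First failing prefix length: $lo")

        spark.stop()
      }
    }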