After checking the code, I think there are a few issues with the ignoreCorruptFiles config, so you can't actually use it with Parquet files right now.
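For background on the error quoted below: a valid Parquet file must end with the 4-byte magic "PAR1" ([80, 65, 82, 49] in ASCII), which is why the reader complains when it finds [65, 82, 49, 10] at the tail. A minimal sketch of such a footer check (a hypothetical helper, not the Spark or parquet-mr API):

```scala
import java.nio.file.{Files, Paths}

// Returns true if the file ends with the Parquet magic bytes "PAR1".
// A tail like [65, 82, 49, 10] (as in the error below) suggests the footer
// was shifted or garbled, e.g. by truncation or a text-mode copy.
def hasParquetMagic(path: String): Boolean = {
  val bytes = Files.readAllBytes(Paths.get(path))
  bytes.length >= 4 &&
    bytes.takeRight(4).sameElements("PAR1".getBytes("US-ASCII"))
}
```

This is only a diagnostic sketch for spotting corrupt files by hand; the actual footer parsing happens inside parquet-mr's ParquetFileReader.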
I opened a JIRA https://issues.apache.org/jira/browse/SPARK-19082 and also submitted a PR for it.

khyati wrote
> Hi Reynold Xin,
>
> In Spark 2.1.0, I tried setting spark.sql.files.ignoreCorruptFiles = true
> using the commands:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
> // or
> sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")
>
> but I am still getting an error while reading Parquet files with:
>
> val newDataDF =
>   sqlContext.read.parquet("/data/tempparquetdata/corruptblock.0", "/data/tempparquetdata/data1.parquet")
>
> Error: ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
> java.io.IOException: Could not read footer: java.lang.RuntimeException:
> hdfs://192.168.1.53:9000/data/tempparquetdata/corruptblock.0 is not a
> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
> [65, 82, 49, 10]
>   at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
>
> Please let me know if I am missing anything.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tp20418p20466.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
