I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here.
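In case it helps anyone who hits this before a fix lands: the "found" bytes in the errors are just the tails of the text files under .metadata (for example [116, 34, 10, 125] is 't', '"', '\n', '}', the end of the Avro schema JSON), whereas Parquet expects its "PAR1" magic [80, 65, 82, 49]. One possible workaround — an untested sketch from the spark-shell, assuming the layout in Don's example below (a flat directory whose non-data entries all start with "." or "_") — is to list the data files yourself and pass them to parquetFile, which accepts multiple paths in 1.3:

import org.apache.hadoop.fs.{FileSystem, Path}

// Untested sketch: enumerate the directory ourselves and skip hidden
// entries (".metadata", "_SUCCESS", etc.) before handing paths to Spark.
val dir = new Path("/user/ddrak/parq_dir")
val fs = dir.getFileSystem(sc.hadoopConfiguration)

val dataFiles = fs.listStatus(dir)
  .map(_.getPath)
  .filterNot { p => p.getName.startsWith(".") || p.getName.startsWith("_") }
  .map(_.toString)

// parquetFile takes varargs paths in Spark 1.3, so only the real Parquet
// files are read and the .metadata directory is never touched.
val d = sqlContext.parquetFile(dataFiles: _*)

Moving the .metadata directory aside, as Don did, obviously works as well.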
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake <dondr...@gmail.com> wrote:
> As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that
> Spark is behaving differently when reading Parquet directories that contain
> a .metadata directory.
>
> It seems that in Spark 1.2.x it would just ignore the .metadata
> directory, but now that I'm using Spark 1.3, reading these files causes the
> following exceptions:
>
> scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>
> scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
> during a parallel computation:
>
> java.lang.RuntimeException:
> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a
> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
> [116, 34, 10, 125]
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>   scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>   ...
>
> java.lang.RuntimeException:
> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a
> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
> [116, 34, 10, 125]
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>   scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>   ...
>
> java.lang.RuntimeException:
> hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties
> is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but
> found [117, 101, 116, 10]
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
>   parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>   org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>   scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>   ...
>   at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
>   at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
>   at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
>   at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
>   at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
>   at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
>   at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>   at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>   at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>   at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> When I remove the .metadata directory, it is able to read these Parquet
> files just fine.
>
> I feel that Spark should ignore the dot files/directories when attempting
> to read these Parquet files. I'm seeing this in CDH 5.4.2 (Spark 1.3.0 +
> patches).
>
> Thoughts?
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> http://www.MailLaunder.com/
> 800-733-2143

-- 
Marcelo