I'm trying to list and then process all files in an HDFS directory. I'm able to run the code below when I supply a specific Avro sequence file, but it fails if I use a wildcard to match all the Avro sequence files in the directory.
Anyone know how to do this?

    val avroRdd = sc.newAPIHadoopFile(
      "hdfs://<url>:8020/<my dir>/*",
      classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], NullWritable]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])
    avroRdd.collect()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 1.0 failed 1 times, most recent failure: Lost task 8.0 in stage 1.0 (TID 20, localhost): java.io.EOFException (null)
    java.io.DataInputStream.readFully(DataInputStream.java:197)
    java.io.DataInputStream.readFully(DataInputStream.java:169)
    org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
    org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
    org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
    org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
    org.apache.avro.hadoop.io.AvroSequenceFile.getMetadata(AvroSequenceFile.java:727)
    org.apache.avro.hadoop.io.AvroSequenceFile.access$100(AvroSequenceFile.java:71)
    org.apache.avro.hadoop.io.AvroSequenceFile$Reader$Options.getConfigurationWithAvroSerialization(AvroSequenceFile.java:672)
    org.apache.avro.hadoop.io.AvroSequenceFile$Reader.<init>(AvroSequenceFile.java:709)
    org.apache.avro.mapreduce.AvroSequenceFileInputFormat$AvroSequenceFileRecordReader.initialize(AvroSequenceFileInputFormat.java:86)
    org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
    org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
    org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
    org.apache.spark.scheduler.Task.run(Task.scala:51)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
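As a workaround I've been experimenting with listing the directory myself via the Hadoop FileSystem API, skipping zero-length files, and unioning one RDD per file. My guess is that some file the glob picks up (an empty or truncated one, say) makes SequenceFile.Reader hit EOF while reading the header, but that's an assumption, not a confirmed diagnosis. A minimal sketch under that assumption, with the same placeholder path as above and the size filter being a guess:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.NullWritable

    // Same placeholder directory as above; substitute a real NameNode host and path.
    val dir = new Path("hdfs://<url>:8020/<my dir>")
    val fs = FileSystem.get(dir.toUri, sc.hadoopConfiguration)

    // Keep only non-empty regular files. The zero-length filter is an
    // assumption about what trips SequenceFile.Reader, not a known fix.
    val dataFiles = fs.listStatus(dir)
      .filter(s => s.isFile && s.getLen > 0)
      .map(_.getPath.toString)

    // Read each file as its own RDD and union them, instead of handing
    // one wildcard path to the input format.
    val avroRdd = sc.union(dataFiles.map { f =>
      sc.newAPIHadoopFile(
        f,
        classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], NullWritable]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
    })
    avroRdd.collect()

The per-file union sidesteps whichever file in the directory isn't a readable sequence file, at the cost of creating one RDD per input file, but I'd still like to understand why the wildcard form fails.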