I'm trying to list and then process all of the files in an HDFS directory.

I'm able to run the code below when I supply a specific Avro sequence file,
but if I use a wildcard to pick up all of the Avro sequence files in the
directory, it fails with the EOFException below.

Does anyone know how to do this?

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
import org.apache.hadoop.io.NullWritable

val avroRdd = sc.newAPIHadoopFile("hdfs://<url>:8020/<my dir>/*",
  classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], NullWritable]],
  classOf[AvroKey[GenericRecord]], classOf[NullWritable])
avroRdd.collect()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in
stage 1.0 failed 1 times, most recent failure: Lost task 8.0 in stage 1.0
(TID 20, localhost): java.io.EOFException (null)
        java.io.DataInputStream.readFully(DataInputStream.java:197)
        java.io.DataInputStream.readFully(DataInputStream.java:169)
        org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
        org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        org.apache.avro.hadoop.io.AvroSequenceFile.getMetadata(AvroSequenceFile.java:727)
        org.apache.avro.hadoop.io.AvroSequenceFile.access$100(AvroSequenceFile.java:71)
        org.apache.avro.hadoop.io.AvroSequenceFile$Reader$Options.getConfigurationWithAvroSerialization(AvroSequenceFile.java:672)
        org.apache.avro.hadoop.io.AvroSequenceFile$Reader.<init>(AvroSequenceFile.java:709)
        org.apache.avro.mapreduce.AvroSequenceFileInputFormat$AvroSequenceFileRecordReader.initialize(AvroSequenceFileInputFormat.java:86)
        org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
        org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
        org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
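
From the trace, the reader seems to die while reading the SequenceFile
header (SequenceFile$Reader.init -> readFully -> EOFException), which is
what I'd expect if the wildcard also matches files that aren't Avro
sequence files at all (an empty _SUCCESS marker left by an earlier job
would be enough). Below is the untested workaround I plan to try next:
list the directory myself, keep only non-empty regular files, and union
one RDD per file. The "non-empty regular file" filter is an assumption
about my directory layout, not anything Spark requires.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the directory ourselves instead of letting the input format
// expand the glob, and skip anything that cannot be a sequence file
// (directories, zero-length files such as _SUCCESS markers).
val fs = FileSystem.get(new URI("hdfs://<url>:8020"), new Configuration())
val dataFiles = fs.listStatus(new Path("/<my dir>"))
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)

// One RDD per remaining file, then union them into a single RDD.
val perFileRdds = dataFiles.map { path =>
  sc.newAPIHadoopFile(path,
    classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], NullWritable]],
    classOf[AvroKey[GenericRecord]], classOf[NullWritable])
}
val avroRdd = sc.union(perFileRdds.toSeq)

If that fixes it, it would at least confirm that the glob is pulling in
something the SequenceFile reader can't parse.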



