Hi All,

I have some Avro data, which I am reading in the following way.

Query:

> val data = sc.newAPIHadoopFile(file,
>   classOf[AvroKeyInputFormat[GenericRecord]],
>   classOf[AvroKey[GenericRecord]],
>   classOf[NullWritable]).map(_._1.datum)

But when I try to print the data, it produces duplicate records.

> data.take(10).foreach(println)


One workaround I found was to repartition the data with RDD.repartition(10), after which the samples come back without any duplicates. I have read this post:
http://stackoverflow.com/questions/35951616/repeat-duplicate-records-with-avro-and-spark
but it does not solve my problem. I am curious to know the reason for this behaviour.
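For reference, the workaround looks like this (a sketch, assuming `data` is the RDD built by the query above; the partition count 10 is just what I tried, not a tuned value):

```scala
// Sketch of the workaround, assuming `data` is the RDD of GenericRecord
// produced by the newAPIHadoopFile query above.
val repartitioned = data.repartition(10) // forces a shuffle across 10 partitions

// After the repartition, take(10) no longer shows duplicate records.
repartitioned.take(10).foreach(println)
```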

Thank you for your time,
AnoopShiralige
