Hi all, I have some Avro data, which I am reading in the following way.
Query:

> val data = sc.newAPIHadoopFile(file, classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]], classOf[NullWritable]).map(_._1.datum)

But when I try to print the data, it generates duplicates:

> data.take(10).foreach(println)

One workaround I found was to repartition the data with RDD.repartition(10), after which I get the samples without any duplicates. I have read this post: http://stackoverflow.com/questions/35951616/repeat-duplicate-records-with-avro-and-spark, but it does not solve my problem. I am curious to know the reason for this behaviour.

Thank you for your time,
AnoopShiralige
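For reference, here is a minimal sketch of the read plus the repartition workaround described above, assuming a live SparkContext `sc`, a `file` variable holding the Avro input path, and the Avro/Hadoop dependencies on the classpath:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read the Avro file as (AvroKey[GenericRecord], NullWritable) pairs,
// then extract the GenericRecord from each key.
val data = sc.newAPIHadoopFile(
    file,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable]
  )
  .map(_._1.datum)
  // Workaround from above: repartition forces a shuffle, after which
  // take() no longer shows duplicated records.
  .repartition(10)

data.take(10).foreach(println)
```

This is not runnable standalone; it is the same pipeline as in the question with the repartition step folded in.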