This is because the HadoopRDD (and the underlying Hadoop InputFormat)
reuses objects to avoid allocation. It is somewhat tricky to fix in Spark
itself. However, in most cases you can clone the records yourself to make
sure you are not collecting the same object over and over again.
https://issues.apache.org/jira/browse/
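A minimal sketch of that cloning workaround, assuming the RDD of Avro
records produced by the hadoopFile call quoted further down in this thread
(the deepCopy helper below is illustrative, not from the original reply):

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD

// Copy each Avro record before collecting it, because the Hadoop
// RecordReader keeps handing back the same key/value instances.
def materialize(rdd: RDD[(AvroWrapper[GenericRecord], NullWritable)]): Array[GenericRecord] =
  rdd.map { case (wrapper, _) =>
    val record = wrapper.datum()
    GenericData.get().deepCopy(record.getSchema, record)
  }.collect()

deepCopy gives each element its own backing object, so the array returned
by collect() no longer aliases a single reused record.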
I posted an example in my previous post. It was tested on Spark 1.0.2,
1.2.0-SNAPSHOT and 1.1.0 for Hadoop 2.4.0, on Windows and on Linux servers
with Hortonworks Hadoop 2.4, in local[4] mode. Any ideas about this Spark
behavior?
Akhil Das-2 wrote
> Can you dump out a small piece of data while doing rdd.collect and
> rdd.foreach(println)?
Full code example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

def main(args: Array[String]) {
  val conf = new SparkConf()
    .setAppName("ErrorExample")
    .setMaster("local[8]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
  val sc = new SparkContext(conf)
  val rdd = sc.hadoopFile(
    "hdfs://./user.avro"
Can you dump out a small piece of data while doing rdd.collect and
rdd.foreach(println)?
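For what it's worth, with the rdd from the example above, the mismatch
being described shows up roughly like this (a sketch of the diagnostic,
not taken from the original mail):

// foreach prints each record as it is read, so the output looks correct.
rdd.foreach(println)

// collect() first materializes each partition into an array; every slot
// holds a reference to the same reused object, so the driver sees the
// last record repeated.
rdd.collect().foreach(println)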
Thanks
Best Regards
On Wed, Sep 17, 2014 at 12:26 PM, vasiliy wrote:
> it also appears in streaming hdfs fileStream
It also appears with the streaming HDFS fileStream.
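A sketch of the streaming variant, in case it helps reproduce it there as
well (the Avro classes and the directory path are placeholders, since the
streaming code is not shown in the thread):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))
// fileStream goes through the same Hadoop input machinery, so records
// read from it are subject to the same object reuse and need to be
// copied before being collected.
val stream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("hdfs://./in")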
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368p14425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.