Would this be an issue with the raw data? I use the following simple code and don't hit the issue you mentioned. If that doesn't help, it would be better to share your code.
val rdd = sc.sequenceFile("/Users/hadoop/Temp/Seq", classOf[IntWritable], classOf[Text])
rdd.map { case (k, v) => (k.get(), v.toString) }.collect() foreach println

On Thu, Nov 19, 2015 at 12:04 PM, jeff saremi <jeffsar...@hotmail.com> wrote:
> I sent this to the user forum. I got no responses. Could someone here
> please help? Thanks,
> Jeff
>
> ------------------------------
> From: jeffsar...@hotmail.com
> To: u...@spark.apache.org
> Subject: SequenceFile and object reuse
> Date: Fri, 13 Nov 2015 13:29:58 -0500
>
> So we tried reading a SequenceFile in Spark and realized that all our
> records had ended up becoming the same.
> Then one of us found this:
>
> Note: Because Hadoop's RecordReader class re-uses the same Writable object
> for each record, directly caching the returned RDD or directly passing it
> to an aggregation or shuffle operation will create many references to the
> same object. If you plan to directly cache, sort, or aggregate Hadoop
> writable objects, you should first copy them using a map function.
>
> Can anyone shed some light on this bizarre behavior and the decisions
> behind it?
> I would also like to know if anyone has been able to read a binary file
> without incurring the additional map() suggested above. What format did
> you use?
>
> Thanks,
> Jeff

--
Best Regards

Jeff Zhang
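For completeness, here is a minimal sketch of the difference the quoted note describes, in spark-shell style (sc predefined), assuming the same path and IntWritable/Text key/value types as in the snippet above; the variable names are just for illustration. Caching or collecting the reused Writables directly is the unsafe part; copying them out in a map() first is the fix.

import org.apache.hadoop.io.{IntWritable, Text}

// The RecordReader reuses the same IntWritable/Text instances for every record,
// so caching or collecting this RDD directly can leave every element pointing at
// the same (last-read) key/value objects.
val raw = sc.sequenceFile("/Users/hadoop/Temp/Seq", classOf[IntWritable], classOf[Text])
// raw.cache()  // unsafe without copying first

// Copy the contents into plain Scala values before caching, sorting, or aggregating.
val copied = raw.map { case (k, v) => (k.get(), v.toString) }
copied.cache()
copied.collect().foreach(println)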