Could this be an issue with the raw data? I use the following simple code
and don't hit the issue you mentioned. Otherwise, it would be better to
share your code.

import org.apache.hadoop.io.{IntWritable, Text}

// Copy each reused Writable into plain Scala values before collecting
val rdd = sc.sequenceFile("/Users/hadoop/Temp/Seq",
  classOf[IntWritable], classOf[Text])
rdd.map { case (k, v) => (k.get(), v.toString) }.collect() foreach println
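
For what it's worth, here is a minimal sketch of the failure mode described
in the note you quoted, assuming the same test file path as above: caching
the raw Writable pairs stores many references to the single object the
RecordReader reuses, so the cached entries can all end up holding a
partition's last record, while copying into plain Scala types first behaves
as expected.

import org.apache.hadoop.io.{IntWritable, Text}

// Problematic: cache the raw Writable pairs directly. The RecordReader
// reuses one IntWritable/Text instance per partition, so the cached
// entries end up as references to the same mutated object.
val raw = sc.sequenceFile("/Users/hadoop/Temp/Seq",
  classOf[IntWritable], classOf[Text])
raw.cache().count()
raw.map { case (k, v) => (k.get(), v.toString) }
  .collect() foreach println    // may print each partition's last record repeatedly

// Safe: copy into plain Scala types before caching, sorting, or shuffling.
val copied = sc.sequenceFile("/Users/hadoop/Temp/Seq",
    classOf[IntWritable], classOf[Text])
  .map { case (k, v) => (k.get(), v.toString) }
copied.cache().count()
copied.collect() foreach println // distinct records as expected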


On Thu, Nov 19, 2015 at 12:04 PM, jeff saremi <jeffsar...@hotmail.com>
wrote:

> I sent this to the user forum. I got no responses. Could someone here
> please help? thanks
> jeff
>
> ------------------------------
> From: jeffsar...@hotmail.com
> To: u...@spark.apache.org
> Subject: SequenceFile and object reuse
> Date: Fri, 13 Nov 2015 13:29:58 -0500
>
>
> So we tried reading a SequenceFile in Spark and realized that all our
> records ended up being the same.
> Then one of us found this:
>
> Note: Because Hadoop's RecordReader class re-uses the same Writable object
> for each record, directly caching the returned RDD or directly passing it
> to an aggregation or shuffle operation will create many references to the
> same object. If you plan to directly cache, sort, or aggregate Hadoop
> writable objects, you should first copy them using a map function.
>
> Is there anyone who can shed some light on this bizarre behavior and the
> decisions behind it?
> I would also like to know whether anyone has been able to read a binary
> file without incurring the additional map() suggested above. What format
> did you use?
>
> thanks
> Jeff
>



-- 
Best Regards

Jeff Zhang
