Hi Jeff,

Many access patterns simply take the result of hadoopFile and use it to
create some other object, and thus have no need for each input record to
refer to a different object.  In those cases, the current API performs better
than an alternative that creates a fresh object for every record, because it
avoids that per-record allocation overhead.  As you've pointed out, this comes
at the expense of more verbose code when caching.
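
For reference, here's a rough sketch of the copy step the docs recommend
before caching.  The path and the LongWritable/Text key/value types below are
just placeholders; substitute whatever your sequence file actually contains:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object SequenceFileCopyExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SequenceFileCopy"))

        // sequenceFile hands back records backed by re-used Writable instances,
        // so materialize each record into fresh JVM objects before caching.
        val records = sc
          .sequenceFile("/path/to/input", classOf[LongWritable], classOf[Text])
          .map { case (k, v) => (k.get, v.toString) } // copy out of the Writables
          .cache()

        println(records.count())
        sc.stop()
      }
    }

Once the values have been copied into plain Longs and Strings, caching,
sorting, and aggregating all behave as expected.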

-Sandy

On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi <jeffsar...@hotmail.com>
wrote:

> So we tried reading a sequence file in Spark and realized that all our
> records ended up being identical.
> Then one of us found this:
>
> Note: Because Hadoop's RecordReader class re-uses the same Writable object
> for each record, directly caching the returned RDD or directly passing it
> to an aggregation or shuffle operation will create many references to the
> same object. If you plan to directly cache, sort, or aggregate Hadoop
> writable objects, you should first copy them using a map function.
>
> Is there anyone who can shed some light on this bizarre behavior and the
> decisions behind it?
> I would also like to know whether anyone has been able to read a binary
> file without incurring the additional map() suggested above. What format
> did you use?
>
> thanks
> Jeff
>
