Although it feels like you are copying an RDD when you map it, the data is not necessarily being copied literally. Your map function may pass most objects through unchanged, so there may not be as much overhead as you think.
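To illustrate the pass-through idea, here is a minimal sketch, not a definitive implementation. It assumes a local SparkContext, four partitions, and that the element to change is the first element of partition 0; the object name ReplaceFirstElement and the helper replaceFirst are made up for this example. It uses mapPartitionsWithIndex so that every partition other than the first returns its iterator untouched, and no element is reallocated. It still produces a new RDD and does not update anything in place.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReplaceFirstElement {
  // Hypothetical helper: replaces the first element of partition 0 with 1
  // and hands every other partition's iterator back unchanged.
  def replaceFirst(index: Int, iter: Iterator[Int]): Iterator[Int] =
    if (index == 0 && iter.hasNext) Iterator.single(1) ++ iter.drop(1)
    else iter

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, just for the demo.
    val conf = new SparkConf().setAppName("replace-first").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: 4 partitions; partition 0 holds the start of the range.
      val rdd = sc.parallelize(0 until 100, numSlices = 4)
      val updated = rdd.mapPartitionsWithIndex(
        (index: Int, iter: Iterator[Int]) => replaceFirst(index, iter))
      println(updated.take(3).mkString(", ")) // prints "1, 1, 2"
    } finally {
      sc.stop()
    }
  }
}
```

Iterators cannot be assigned to, so the sketch builds a new iterator for the one affected partition instead of writing iter(0) = 1.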
I don't think you can avoid a scan of the data unless you can somehow know that whole partitions do not need to be touched. If this still doesn't work, you may need to reconsider your design, as it may not be a great fit for the RDD model.

And yes, you can't assign to elements of an Iterator.

On Dec 2, 2014 11:23 AM, "Xuelin Cao" <[email protected]> wrote:

> Hi,
>
> I'd like to perform an operation on an RDD that changes *ONLY* the value
> of some items, without making a full copy or a full scan of the data.
>
> It is useful when I need to handle a large RDD, and each time I only need
> to change a small fraction of the data and keep the rest unchanged.
> Certainly I don't want to make a full copy of the data into the new RDD.
>
> For example, suppose I have an RDD that contains integer data from 0
> to 100. What I want is to change the first element of the RDD from 0
> to 1, leaving the other elements untouched.
>
> I tried this, but it doesn't work:
>
>     var rdd = parallelize(Range(0,100))
>     rdd.mapPartitions({iter=> iter(0) = 1})
>
> The reported error is: value update is not a member of Iterator[Int]
>
> Does anyone know how to make it work?
