Hi,

The partition information known to the Spark master is not updated after a stage failure. In the case of HDFS, Spark obtains partition information from the InputFormat; if an HDFS data node goes down while Spark is computing a stage, that stage fails and is resubmitted using the same partition information. So under failures such as a data node loss, I think the partition index you get when you call mapPartitionsWithIndex is consistent.
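To make the idea concrete, here is a toy sketch (plain Scala, not actual Spark internals — the `Partition` class, `runStage`, and the retry loop are all hypothetical stand-ins): partition metadata is computed once up front, and a resubmitted task reuses the same partition object, so the index it sees does not change across attempts.

```scala
// Toy model of stage resubmission. NOT Spark's real scheduler: all names
// here (Partition, runStage, flakyIndices) are invented for illustration.
object PartitionIndexDemo {
  // A partition's index is fixed by the partitioning of the input,
  // not by where or when the task happens to run.
  final case class Partition(index: Int, rows: Seq[Int])

  // Run one task per partition; partitions in `flakyIndices` fail on their
  // first attempt and are retried, mimicking an executor/data node loss.
  def runStage(parts: Seq[Partition], flakyIndices: Set[Int]): Map[Int, Seq[Int]] = {
    val attempts = scala.collection.mutable.Map.empty[Int, Int]

    def runTask(p: Partition): (Int, Seq[Int]) = {
      val n = attempts.getOrElse(p.index, 0) + 1
      attempts(p.index) = n
      if (flakyIndices.contains(p.index) && n == 1)
        throw new RuntimeException(s"lost node while computing partition ${p.index}")
      // Every attempt sees the same (index, rows) pair.
      p.index -> p.rows.map(_ * 10)
    }

    parts.map { p =>
      try runTask(p)
      catch { case _: RuntimeException => runTask(p) } // resubmit same partition
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Partition(0, Seq(1, 2)), Partition(1, Seq(3)), Partition(2, Seq(4, 5)))
    val result = runStage(parts, flakyIndices = Set(1))
    // Partition 1 failed once and was retried, yet kept index 1.
    assert(result == Map(0 -> Seq(10, 20), 1 -> Seq(30), 2 -> Seq(40, 50)))
    println("indices stable across retry: " + result.keys.toSeq.sorted.mkString(","))
  }
}
```

The point of the sketch is only that the index is a property of the partition, not of the task attempt, which matches the behavior described above.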
Hope this helps!
Liquan

On Mon, Oct 6, 2014 at 12:33 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
> Is the RDD partition index you get when you call mapPartitionsWithIndex
> consistent under fault-tolerance conditions?
>
> I.e.
>
> 1. Say index is 1 for one of the partitions when you call
>    data.mapPartitionsWithIndex((index, rows) => ....) // Say index is 1
> 2. The partition fails (maybe along with a bunch of other partitions).
> 3. When the partitions get restarted somewhere else, will they retain the
>    same index value, as well as all the lineage arguments?

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst