I have a use case that would greatly benefit from RDDs having a .scanLeft() method. Are the project developers interested in adding this to the public API?
Looking through past message traffic, this has come up a few times. The recommendation from the list so far has been to implement a parallel prefix scan yourself:

http://comments.gmane.org/gmane.comp.lang.scala.spark.user/1880
https://groups.google.com/forum/#!topic/spark-users/ts-FdB50ltY

The algorithm Reynold sketched in the first link leads to this working implementation:

  val vector = sc.parallelize(1 to 20, 3)

  // First pass: sum each partition, then scan the partition sums on the
  // driver so that sums(i) is the starting offset for partition i.
  val sums = 0 +: vector.mapPartitionsWithIndex { case (partition, iter) =>
    Iterator(iter.sum)
  }.collect.scanLeft(0)(_ + _).drop(1)

  // Second pass: each partition runs a local scanLeft seeded with its offset.
  val prefixScan = vector.mapPartitionsWithIndex { case (partition, iter) =>
    val base = sums(partition)
    iter.scanLeft(base)(_ + _).drop(1)
  }.collect

I'd love to have that replaced with this:

  val vector = sc.parallelize(1 to 20, 3)
  val cumSum: RDD[Int] = vector.scanLeft(0)(_ + _)

Any thoughts on whether this contribution would be accepted? What pitfalls should I be thinking about?

Thanks!
Andrew
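
P.S. For concreteness, here is a rough sketch of what the contribution might look like as a standalone helper that packages the two-pass approach above. The name (PrefixScan) and the restriction to an associative op with zero as its identity are my own assumptions for illustration, not anything that exists in Spark today:

  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Hypothetical helper, not an existing Spark API. Assumes op is associative
  // and that zero is its identity; otherwise the partition folds cannot be composed.
  object PrefixScan {
    def scanLeft[T: ClassTag](rdd: RDD[T], zero: T)(op: (T, T) => T): RDD[T] = {
      // Pass 1: fold each partition locally and collect the per-partition results.
      val partials = rdd.mapPartitions(iter => Iterator(iter.foldLeft(zero)(op))).collect()

      // Scan the partition results on the driver; offsets.value(i) seeds partition i.
      val offsets = rdd.sparkContext.broadcast(partials.scanLeft(zero)(op))

      // Pass 2: each partition re-runs a local scan seeded with its offset.
      rdd.mapPartitionsWithIndex { case (partition, iter) =>
        iter.scanLeft(offsets.value(partition))(op).drop(1)
      }
    }
  }

  // e.g. PrefixScan.scanLeft(vector, 0)(_ + _).collect  // running totals of 1 to 20

One design question this raises: the sketch (like the hand-rolled version above) returns one running total per input element, whereas Scala's own scanLeft also emits the initial zero, so an RDD.scanLeft in the public API would have to pick one of those conventions.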