Regarding the specific problem of generating random folds in a more efficient way, this should help: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
It uses a sort of multiplexing formalism on RDDs:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions

I wrote a blog post to explain the idea here:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/

----- Original Message -----
> Hi there,
>
> I'd like to write some iterative computation, i.e., computation that can be
> done via a for loop. I understand that in Spark foreach is a better choice.
> However, foreach and foreachPartition seem to be for self-contained
> computation that only involves the corresponding Row or Partition,
> respectively. But in my application each computational task does not only
> involve one partition, but also other partitions. It's just that every task
> has a specific way of using the corresponding partition and the other
> partitions. An example application will be cross-validation in machine
> learning, where each fold corresponds to a partition, e.g., the whole data
> is divided into 5 folds, then for task 1, I use fold 1 for testing and folds
> 2,3,4,5 for training; for task 2, I use fold 2 for testing and folds 1,3,4,5
> for training; etc.
>
> In this case, if I were to use foreachPartition, it seems that I need to
> duplicate the data the number of times equal to the number of folds (or
> iterations of my for loop). More generally, I would need to still prepare a
> partition for every distributed task and that partition would need to
> include all the data needed for the task, which could be a huge waste of
> space.
>
> Is there any other solutions? Thanks.
>
> f.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-for-loops-in-Spark-tp26939.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
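
For the 5-fold scheme described in the quoted message, the key point is that each record only needs to be assigned to a fold once; task i can then read fold i as test data and the remaining folds as training data by reference, without a per-task copy. Below is a minimal plain-Python sketch of that idea. It is not Spark code and not the silex API: `assign_folds` and `cv_tasks` are hypothetical names, and in-memory lists stand in for RDD partitions.

```python
import itertools
import random


def assign_folds(records, k, seed=0):
    """Single pass over the data: drop each record into one of k random folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for rec in records:
        folds[rng.randrange(k)].append(rec)
    return folds


def cv_tasks(folds):
    """Yield (task index, test fold, training iterator) for each fold.

    The training side is a lazy chain over the *other* folds, so the
    underlying records are shared between tasks rather than duplicated.
    """
    for i, test in enumerate(folds):
        others = [f for j, f in enumerate(folds) if j != i]  # references only
        yield i, test, itertools.chain.from_iterable(others)


# Hypothetical usage: 5 folds over 100 integer "records".
for i, test, train in cv_tasks(assign_folds(range(100), 5)):
    print(i, len(test), sum(1 for _ in train))
```

Each record is stored exactly once, and the number of cross-validation tasks only changes how many views are taken over the same folds.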
