There is also a `randomSplit` method in the latest version of Spark: https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
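For intuition, here is a minimal, Spark-free Python sketch of what a randomSplit-style fold split amounts to: each element is independently assigned to one of n folds at random. The `random_folds` helper, its seed, and the fold count are illustrative assumptions, not Spark's API.

```python
import random

def random_folds(data, n, seed=42):
    """Split `data` into n random, non-overlapping folds --
    roughly the effect of an equal-weight randomSplit."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n)]
    for item in data:
        # Each element lands in exactly one fold, chosen uniformly.
        folds[rng.randrange(n)].append(item)
    return folds

folds = random_folds(range(100), 5)
# The folds together form a partition of the input: no element
# is dropped and none appears in two folds.
```

Note that, unlike fixed-size slicing, the folds here are only approximately equal in size, which matches the probabilistic behavior of a weighted random split.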
On Tue, Mar 25, 2014 at 1:21 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
> There is also https://github.com/apache/spark/pull/18 against the current
> repo, which may be easier to apply.
>
>
> On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh <a...@adatao.com> wrote:
>
>> Hi Jaonary,
>>
>> You can find the code for k-fold CV in
>> https://github.com/apache/incubator-spark/pull/448. I have not found the
>> time to resubmit the pull request against the latest master.
>>
>>
>> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sanjay_a...@yahoo.com> wrote:
>>
>>> Hi Jaonary,
>>>
>>> I believe the n folds should be mapped to n keys in Spark using a map
>>> function. You can then reduce the returned PairRDD to get your metric.
>>> I don't fully understand partitions, but from what I do understand of
>>> them, they aren't required in your scenario.
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>> On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I need to partition my data, represented as an RDD, into n folds, run a
>>> metrics computation on each fold, and finally compute the mean of my
>>> metrics over all the folds.
>>> Can Spark do this data partitioning out of the box, or do I need to
>>> implement it myself? I know that RDD has a partitions method and
>>> mapPartitions, but I don't really understand the purpose and meaning of
>>> "partition" here.
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>
>>
>> --
>> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
>> http://www.linkedin.com/in/haianh
>>
>
>
> --
> Cell : 425-233-8271
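The key-mapping approach Sanjay describes above (map each record to a fold key, reduce the resulting pairs per key, then average the per-fold metrics) can be sketched in plain Python without Spark. The `kfold_mean_metric` helper and the modulo fold assignment are illustrative assumptions, not Spark code; the grouping step stands in for what reduceByKey/groupByKey would do on a PairRDD.

```python
from collections import defaultdict

def kfold_mean_metric(data, n, metric):
    """Assign each record a fold key (the 'map' step), group records
    by key (the PairRDD 'reduce' step), compute the metric on each
    fold, and return the mean of the per-fold metrics."""
    by_fold = defaultdict(list)
    for i, x in enumerate(data):
        by_fold[i % n].append(x)  # fold key = record index mod n
    per_fold = [metric(records) for records in by_fold.values()]
    return sum(per_fold) / len(per_fold)

# Example: mean of per-fold means over two folds of 0..9.
# Fold 0 holds the evens (mean 4.0), fold 1 the odds (mean 5.0).
avg = kfold_mean_metric(range(10), 2, lambda xs: sum(xs) / len(xs))
# → 4.5
```

In Spark terms, the same shape would be `map` to `(i % n, record)` pairs, `groupByKey` (or an incremental `reduceByKey`/`aggregateByKey` for large folds), a per-key metric, then a final average on the driver.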