Re: N-Fold validation and RDD partitions

2014-03-25 Thread Jaonary Rabarisoa
There is also a "randomSplit" method in the latest version of spark https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala On Tue, Mar 25, 2014 at 1:21 AM, Holden Karau wrote: > There is also https://github.com/apache/spark/pull/18 against the c

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current repo, which may be easier to apply. On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh wrote: > Hi Jaonary, > > You can find the code for k-fold CV in > https://github.com/apache/incubator-spark/pull/448. I have not found the >

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Walrus theCat
If someone wanted / needed to implement this themselves, are partitions the correct way to go? Any tips on how to get started (say, dividing an RDD into 5 parts)? On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa wrote: > Thank you Hai-Anh. Are the files CrossValidation.scala and > RandomS
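
One way to roll this by hand, without touching physical partitions at all, is to tag every element with a pseudo-random fold id and then filter on that id; partitions describe how the data is laid out across the cluster rather than a logical grouping, so they are not really the unit you want for folds. A rough sketch under that assumption follows, with assignFolds and trainValidationSplit as hypothetical helper names; seeding the RNG with the partition index keeps the assignment stable if a partition gets recomputed.

    import scala.reflect.ClassTag
    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Tag each element with a fold id in [0, k).
    def assignFolds[T: ClassTag](data: RDD[T], k: Int, seed: Long): RDD[(Int, T)] =
      data.mapPartitionsWithIndex { (partIdx, iter) =>
        val rng = new Random(seed + partIdx)   // deterministic per partition
        iter.map(x => (rng.nextInt(k), x))
      }

    // Fold i becomes the validation set; everything else is training data.
    def trainValidationSplit[T: ClassTag](tagged: RDD[(Int, T)], i: Int): (RDD[T], RDD[T]) =
      (tagged.filter(_._1 != i).map(_._2), tagged.filter(_._1 == i).map(_._2))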

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Thank you Hai-Anh. Are the files CrossValidation.scala and RandomSplitRDD.scala enough to use it? I'm currently using Spark 0.9.0 and I'd like to avoid rebuilding everything. On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh wrote: > Hi Jaonary, > > You can find the code for k-fold CV in > https:/

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Hai-Anh Trinh
Hi Jaonary, You can find the code for k-fold CV in https://github.com/apache/incubator-spark/pull/448. I have not found the time to resubmit the pull request against the latest master. On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani wrote: > Hi Jaonary, > > I believe the n folds should be mapped into n Keys i

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Sanjay Awatramani
Hi Jaonary, I believe the n folds should be mapped into n keys in Spark using a map function. You can then reduce the returned PairRDD and you should get your metric. I don't understand partitions fully, but from what I understand of them, they aren't required in your scenario. Regards, Sanjay
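
A sketch of that keyed approach, assuming each record can be assigned a fold deterministically, here by an integer id modulo k; a hashed identifier or a seeded random draw would work just as well. The per-fold mean stands in for a real metric.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits (reduceByKey, mapValues)

    object KeyedFoldMetrics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "keyed-folds")
        val k = 5

        // Toy data keyed by fold: the record id modulo k is the fold key.
        val keyed = sc.parallelize(0 until 1000).map(i => (i % k, i.toDouble))

        // Reduce the PairRDD per fold: sum and count, then the per-fold mean.
        val perFold = keyed
          .mapValues(v => (v, 1L))
          .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
          .mapValues { case (s, c) => s / c }
          .collect()

        // Average the per-fold metric to get the overall figure.
        val overall = perFold.map(_._2).sum / perFold.length
        println(s"per-fold: ${perFold.mkString(", ")}; overall: $overall")
        sc.stop()
      }
    }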

N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Hi, I need to partition my data, represented as an RDD, into n folds, run a metric computation on each fold, and finally compute the mean of my metric over all the folds. Can Spark do this data partitioning out of the box, or do I need to implement it myself? I know that RDD has a partitions method an