Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
If all users are equally important, then the average score should be representative. You shouldn't worry about missing one or two. For stratified sampling, Wikipedia has a paragraph about its disadvantages: http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages It depends on the size of th

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
For the MLlib PR, I will add this logic: "If a user is missing in training and appears in test, we can simply ignore it." I was struggling because users appear in test that the model was not trained on... For our internal tests we want to cross-validate on every product / user as all of them are eq
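
The rule quoted above ("If a user is missing in training and appears in test, we can simply ignore it") can be sketched in plain Python. This is only an illustration of the filtering idea, not the logic of the MLlib PR; the function name `drop_unseen_users` and the `(user, product, rating)` tuple layout are assumptions made for the example.

```python
def drop_unseen_users(train, test):
    """Return the test ratings whose user also appears in the training set.

    Both arguments are lists of (user, product, rating) tuples; this layout
    is an assumption for the sketch, not taken from the thread.
    """
    # Collect every user the model would have been trained on.
    seen_users = {user for user, _product, _rating in train}
    # Keep only the test ratings for those users; the rest are ignored.
    return [rating for rating in test if rating[0] in seen_users]


train = [("u1", "p1", 4.0), ("u2", "p1", 3.0)]
test = [("u1", "p2", 5.0), ("u3", "p1", 2.0)]
usable_test = drop_unseen_users(train, test)  # "u3" is dropped
```

With Spark RDDs the same effect would typically come from a join or a filter against the set of training users, but that choice depends on data size and is not specified in the thread.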

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should not be many strata. My question is why we need to split on each user's ratings. If a user is missing in trai
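
The difference between the two operations can be illustrated with a small plain-Python sketch. It does not use Spark and is not Spark's implementation; it only mimics the documented behavior: per-record Bernoulli sampling for `sampleByKey`, and exactly `ceil(fraction * count)` items per stratum for `sampleByKeyExact`. The function names and the assumption that every fraction lies in [0, 1] (sampling without replacement) are choices made for the example.

```python
import math
import random
from collections import defaultdict


def bernoulli_sample_by_key(pairs, fractions, seed=0):
    """Keep each (key, value) independently with probability fractions[key].

    With the same fraction for every key this is just a plain Bernoulli
    sample of the records, so a key with few records can vanish entirely.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


def exact_sample_by_key(pairs, fractions, seed=0):
    """Keep exactly ceil(fractions[key] * count(key)) values for every key."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for k, v in pairs:
        by_key[k].append(v)
    sample = []
    for k, values in by_key.items():
        n = math.ceil(fractions[k] * len(values))
        # rng.sample draws n distinct positions without replacement.
        sample.extend((k, v) for v in rng.sample(values, n))
    return sample


pairs = [("u1", i) for i in range(4)] + [("u2", 0)]
fractions = {"u1": 0.5, "u2": 0.5}
exact = exact_sample_by_key(pairs, fractions)
approx = bernoulli_sample_by_key(pairs, fractions)
```

In this sketch `exact` always contains two ratings for `u1` and one for `u2`, while `approx` may contain none for `u2`. The exact variant has to group or count per key first, which is one reason a very large number of strata (for example one per user) is costly.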

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
Sean, I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key... Xiangrui, what's the expected behavior of sampleByKey? In the dataset sampled using sampleByKey, the keys should match the input dataset keys, right? If it is a bug,

Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely produces disjoint data sets and is efficient. The problem is you can't do it by key. I am not sure why your subtract does not work. I suspect it is because the values do not partition the same way, or they don't evaluate equali
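
Why a weighted random split yields disjoint sets can be sketched in plain Python. This is an illustration of the idea only, not Spark's `randomSplit` implementation; the function name `random_split` and the seed handling are assumptions for the example. Each record draws one uniform number and lands in exactly one split, so the splits cannot overlap.

```python
import random
from itertools import accumulate


def random_split(records, weights, seed=0):
    """Assign every record to exactly one split, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split's share of [0, 1).
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for rec in records:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(rec)
                break
    return splits


records = list(range(100))
train, cv, test = random_split(records, [0.6, 0.2, 0.2], seed=42)
```

Because the draw ignores keys, all ratings of a user with only a few records can end up in a single split, which is the "can't do it by key" limitation mentioned above.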