If all users are equally important, then the average score should be
representative. You shouldn't worry about missing one or two. For
stratified sampling, Wikipedia has a paragraph about its disadvantages:
http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages
It depends on the size of the strata.
For the MLlib PR, I will add this logic: "If a user is missing in training and
appears in test, we can simply ignore it."
I was struggling because users appear in the test set that the model was not
trained on...
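A minimal sketch of that "ignore it" logic, assuming MLlib `Rating` records and two RDDs named `training` and `test` (the helper and names are mine, not from the PR):

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Keep only test ratings whose user also appears in training, so the model
// is evaluated only on users it has actually seen.
def dropUnseenUsers(training: RDD[Rating], test: RDD[Rating]): RDD[Rating] = {
  val trainedUsers = training.map(r => (r.user, ())).distinct()
  test.keyBy(_.user)
    .join(trainedUsers)                       // inner join drops unseen users
    .map { case (_, (rating, _)) => rating }
}
```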
For our internal tests we want to cross-validate on every product / user, as
all of them are equally important.
`sampleByKey` with the same fraction per stratum acts the same as
`sample`. The operation you want is perhaps `sampleByKeyExact` here.
However, when you use stratified sampling, there should not be many
strata. My question is why we need to split on each user's ratings. If
a user is missing in training and appears in test, we can simply ignore it.
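For reference, a rough sketch of the per-user holdout being described, assuming `ratings` is an `RDD[Rating]` and an illustrative 80/20 split (names and numbers are mine):

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Key each rating by user, then hold out ~20% of each user's ratings.
// sampleByKey only hits the fraction in expectation; sampleByKeyExact makes
// additional passes to get the per-stratum counts right.
val byUser: RDD[(Int, Rating)] = ratings.keyBy(_.user)

// Building the fractions map on the driver assumes the number of users
// (strata) is small, which is exactly the caveat above.
val fractions: Map[Int, Double] =
  byUser.keys.distinct().collect().map(u => u -> 0.8).toMap

val train = byUser.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 11L)
val test  = byUser.subtract(train)   // the held-out ~20% per user
```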
Sean,
I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
the problem that randomSplit can't sample by key...
Xiangrui,
What's the expected behavior of sampleByKey? In the dataset sampled using
sampleByKey, the keys should match the input dataset's keys, right? If it is a
bug, ...
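For what it's worth, my reading of the API (an assumption, not something confirmed in this thread): with plain `sampleByKey` the per-key fraction only holds in expectation, so a key with very few values can end up with no sampled values at all; the sampled keys are a subset of the input keys, not necessarily the full set. A quick check, reusing the `byUser` and `fractions` names from the sketch above:

```scala
// Which input keys ended up with no sampled values? With sampleByKey this
// can legitimately happen for small strata, since each element is kept
// independently with probability fractions(key).
val sampled = byUser.sampleByKey(withReplacement = false, fractions = fractions, seed = 11L)
val missingKeys = byUser.keys.distinct().subtract(sampled.keys.distinct())
missingKeys.take(10).foreach(println)
```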
I use randomSplit to make train/CV/test sets in one go. It definitely
produces disjoint data sets and is efficient. The problem is you can't
do it by key.
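For comparison, the one-pass three-way split described here, with illustrative weights and names:

```scala
// randomSplit returns disjoint RDDs with roughly the given proportions in a
// single pass, but the split is global rather than per user.
val Array(training, cv, test) =
  ratings.randomSplit(Array(0.6, 0.2, 0.2), seed = 11L)
```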
I am not sure why your subtract does not work. I suspect it is because
the values do not partition the same way, or they don't evaluate
equality the way you expect.
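If value equality is indeed the culprit (e.g. records that contain Arrays compare by reference, not by content), one workaround, sketched here with made-up names, is to tag every record with a unique id and subtract by that key instead of by value:

```scala
// Assumes ratings: RDD[Rating] as above. Give every record a stable unique
// id, sample on the keyed RDD, and use subtractByKey so only the Long ids
// need to compare equal.
val indexed  = ratings.zipWithUniqueId().map { case (r, id) => (id, r) }
val trainIdx = indexed.sample(withReplacement = false, fraction = 0.8, seed = 11L)
val testIdx  = indexed.subtractByKey(trainIdx)
val (train, test) = (trainIdx.values, testIdx.values)
```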