For the MLlib PR, I will add this logic: "If a user is missing in training and
appears in test, we can simply ignore it."

I was struggling because users appear in the test set that the model was
never trained on...
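A minimal sketch of that "ignore" logic (assuming ratings are keyed by userId as in the thread below; the RDD names `training` and `test` are hypothetical):

```scala
// Sketch: drop test ratings for users the model never saw in training.
// `training` and `test` are assumed to be RDD[(Int, (Int, Double))],
// i.e. userId -> (movieId, rating); names and types are illustrative.
val trainedUsers = training.keys.distinct()
val testSeen = test
  .join(trainedUsers.map(u => (u, ())))   // inner join keeps only users present in training
  .mapValues { case (rating, _) => rating }
```

The inner join is just one way to express the filter; a broadcast of the trained-user set would also work when the number of users is small.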

For our internal tests we want to cross-validate on every product/user, since
all of them are equally important, so I have to come up with a sampling
strategy for each user/product...

In general, what's the bound on the number of strata for stratified sampling?
Something like the number of classes in a labeled dataset, ~100?

On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng <men...@gmail.com> wrote:

> `sampleByKey` with the same fraction per stratum acts the same as
> `sample`. The operation you want is perhaps `sampleByKeyExact` here.
> However, when you use stratified sampling, there should not be many
> strata. My question is why we need to split on each user's ratings. If
> a user is missing in training and appears in test, we can simply
> ignore it. -Xiangrui
>
> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Sean,
> >
> > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
> > the problem that randomSplit can't sample by key...
> >
> > Xiangrui,
> >
> > What's the expected behavior of sampleByKey? In a dataset sampled using
> > sampleByKey, the keys should match the input dataset's keys, right? If it
> > is a bug, I can open a JIRA and look into it...
> >
> > Thanks.
> > Deb
> >
> > On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> >> I use randomSplit to make a train/CV/test set in one go. It definitely
> >> produces disjoint data sets and is efficient. The problem is you can't
> >> do it by key.
> >>
> >> I am not sure why your subtract does not work. I suspect it is because
> >> the values do not partition the same way, or they don't evaluate
> >> equality in the expected way, but I don't see any reason why. Tuples
> >> work as expected here.
> >>
> >> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have an RDD whose key is a userId and whose value is (movieId, rating)...
> >> >
> >> > I want to sample 80% of the (movieId, rating) pairs that each userId
> >> > has seen for training; the rest is for test...
> >> >
> >> > val indexedRating = sc.textFile(...).map{x => val f = x.split(",");
> >> >   Rating(f(0).toInt, f(1).toInt, f(2).toDouble)}
> >> >
> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >> >
> >> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
> >> >
> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >> >
> >> > val blocks = sc.defaultParallelism
> >> >
> >> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
> >> >
> >> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
> >> >
> >> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
> >> >
> >> > My expectation was that the printlns would produce the same number of
> >> > keys for keyedRatings, keyedTraining, and keyedTest, but this is not
> >> > the case...
> >> >
> >> > On MovieLens, for example, I am seeing the following:
> >> >
> >> > Rating keys 3706
> >> >
> >> > Training keys 3676
> >> >
> >> > Test keys 3470
> >> >
> >> > I also tried sampleByKey as follows:
> >> >
> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >> >
> >> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
> >> >
> >> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
> >> >
> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >> >
> >> > Still I get the results as:
> >> >
> >> > Rating keys 3706
> >> >
> >> > Training keys 3682
> >> >
> >> > Test keys 3459
> >> >
> >> > Any idea what is wrong here...
> >> >
> >> > Are my assumptions about the behavior of sample/sampleByKey on a
> >> > key-value RDD correct? If this is a bug, I can dig deeper...
> >> >
> >> > Thanks.
> >> >
> >> > Deb
> >>
>
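Following Xiangrui's suggestion, a per-key 80/20 split via `sampleByKeyExact` might look like the sketch below (it reuses the `keyedRatings` RDD from the thread; exact per-stratum fractions come at the cost of extra passes over the data):

```scala
// Sketch: exact per-key split. For each key k with n_k values,
// sampleByKeyExact returns exactly math.ceil(0.8 * n_k) of them,
// so every key in the input also appears in the training sample.
val fractions = keyedRatings.keys.distinct().collect().map(k => (k, 0.8)).toMap
val keyedTraining =
  keyedRatings.sampleByKeyExact(withReplacement = false, fractions, seed = 1L)
val keyedTest = keyedRatings.subtract(keyedTraining)
```

Note the remaining caveat: a key with a single value gets its one item placed in training (ceil(0.8 * 1) = 1), so such keys can still be absent from the test split even though training covers every key.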
