`sampleByKey` with the same fraction for every stratum behaves the same
as `sample`. The operation you probably want here is `sampleByKeyExact`.
However, stratified sampling is intended for a small number of strata.
My question is why you need to split each user's ratings at all: if a
user is missing from training and appears in test, we can simply ignore
that user. -Xiangrui
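
PS: roughly what I have in mind (untested sketch), reusing the keyedRatings
RDD from your example:

val fractions = keyedRatings.keys.distinct().map(k => (k, 0.8)).collect().toMap
// sampleByKeyExact hits the per-key sample sizes exactly, at the cost of extra passes
val keyedTraining = keyedRatings.sampleByKeyExact(false, fractions, 1L)
// subtract compares whole (key, value) pairs, so the leftover ratings per key form the test set
val keyedTest = keyedRatings.subtract(keyedTraining)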

On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> Sean,
>
> I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
> the problem that randomSplit can't sample by key...
>
> Xiangrui,
>
> What's the expected behavior of sampleByKey? The keys in the dataset sampled
> by sampleByKey should match the keys of the input dataset, right? If this is
> a bug, I can open a JIRA and look into it...
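>
> For example, I would expect a check like this (roughly) to hold with the
> code from my earlier mail:
>
> val inputKeys = keyedRatings.keys.distinct().count()
> val sampledKeys = keyedTraining.keys.distinct().count()
> assert(inputKeys == sampledKeys)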
>
> Thanks.
> Deb
>
> On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I use randomSplit to make a train/CV/test set in one go. It definitely
>> produces disjoint data sets and is efficient. The problem is you can't
>> do it by key.
>>
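>> Something like this (roughly):
>>
>> val Array(train, cv, test) =
>>   keyedRatings.randomSplit(Array(0.8, 0.1, 0.1), seed = 1L)
>>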
>> I am not sure why your subtract does not work. I suspect it is because the
>> values do not partition the same way, or do not compare equal in the
>> expected way, but I don't see a reason why that would happen here. Tuples
>> work as expected.
>>
>> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I have an RDD whose key is a userId and whose value is (movieId, rating)...
>> >
>> > I want to sample 80% of the (movieId, rating) pairs that each userId has
>> > seen for training; the rest is for testing...
>> >
>> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>> >
>> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >
>> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>> >
>> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >
>> > val blocks = sc.defaultParallelism
>> >
>> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>> >
>> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>> >
>> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>> >
>> > My expectation was that these printlns would report the same number of keys
>> > for keyedRatings, keyedTraining and keyedTest, but this is not the case...
>> >
>> > On MovieLens for example I am noticing the following:
>> >
>> > Rating keys 3706
>> >
>> > Training keys 3676
>> >
>> > Test keys 3470
>> >
>> > I also tried sampleByKey as follows:
>> >
>> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >
>> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
>> >
>> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>> >
>> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >
>> > Still I get the results as:
>> >
>> > Rating keys 3706
>> >
>> > Training keys 3682
>> >
>> > Test keys 3459
>> >
>> > Any idea what is wrong here...
>> >
>> > Are my assumptions about the behavior of sample/sampleByKey on a key-value
>> > RDD correct? If this is a bug I can dig deeper...
>> >
>> > Thanks.
>> >
>> > Deb
>>
