Hi,

I have an RDD whose key is a userId and whose value is (movieId, rating)...

I want to sample 80% of the (movieId, rating) pairs that each userId has seen
for training; the rest is for testing...

val indexedRating = sc.textFile(...).map{x=> Rating(x(0), x(1), x(2))}

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val keyedTraining = keyedRatings.sample(true, 0.8, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

val blocks = sc.maxParallelism

println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")

println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")

println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")

I expected the three println calls to report the same number of keys for
keyedRatings, keyedTraining and keyedTest, but this is not the case...

On MovieLens for example I am noticing the following:

Rating keys 3706

Training keys 3676

Test keys 3470

I also tried sampleByKey as follows:

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap

val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

Still I get the results as:

Rating keys 3706

Training keys 3682

Test keys 3459
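
The same key-count check can be reproduced on a toy dataset without Spark. This is only a minimal sketch: the data is made up, and a plain Bernoulli sampler (each record kept independently with probability 0.8) is assumed as a stand-in for sampling without replacement. SplitSketch, split and keyCount are hypothetical names, not Spark APIs.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical toy data in the same shape as keyedRatings: (key, (id, rating)).
  // Key 3 deliberately has only one record.
  val ratings: Seq[(Int, (Int, Double))] = Seq(
    (1, (10, 4.0)), (1, (11, 3.0)), (1, (12, 5.0)), (1, (13, 2.0)), (1, (14, 4.0)),
    (2, (10, 1.0)), (2, (15, 3.5)),
    (3, (16, 5.0))
  )

  // Assumed sampler: keep each record independently with probability `fraction`.
  // Returns (kept, rest); together they contain every input record exactly once.
  def split[K, V](data: Seq[(K, V)], fraction: Double, seed: Long): (Seq[(K, V)], Seq[(K, V)]) = {
    val rng = new Random(seed)
    // Draw one flag per record first, so the random generator is advanced once per record.
    val flagged = data.map(record => (record, rng.nextDouble() < fraction))
    val (kept, rest) = flagged.partition(_._2)
    (kept.map(_._1), rest.map(_._1))
  }

  // Number of distinct keys, the local analogue of groupByKey(...).count().
  def keyCount[K, V](data: Seq[(K, V)]): Int = data.map(_._1).distinct.size

  def main(args: Array[String]): Unit = {
    val (train, test) = split(ratings, 0.8, 1L)
    println(s"Rating keys ${keyCount(ratings)}")
    println(s"Training keys ${keyCount(train)}")
    println(s"Test keys ${keyCount(test)}")
  }
}
```

In this sketch a key with a single record can land in only one of the two splits, so the three counts need not agree. Whether the same thing explains the numbers from sample/sampleByKey is exactly what I am unsure about.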

Any idea what is wrong here?

Are my assumptions about the behavior of sample/sampleByKey on a key-value RDD
correct? If this is a bug I can dig deeper...

Thanks.

Deb
