Hi,
I have an RDD whose key is a userId and whose value is (movieId, rating).
I want to sample 80% of the (movieId, rating) pairs that each userId has seen for
training; the rest is for testing:
val indexedRating = sc.textFile(...).map { x => Rating(x(0), x(1), x(2)) }
val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
val keyedTest = keyedRatings.subtract(keyedTraining)
val blocks = sc.maxParallelism
println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
My expectation was that the println calls would report the same number of keys
for keyedRatings, keyedTraining and keyedTest, but this is not the case.
On MovieLens, for example, I see the following:
Rating keys 3706
Training keys 3676
Test keys 3470
I also tried sampleByKey as follows:
val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
val keyedTest = keyedRatings.subtract(keyedTraining)
I still get:
Rating keys 3706
Training keys 3682
Test keys 3459
Any idea what is wrong here?
Are my assumptions about the behavior of sample/sampleByKey on a key-value RDD
correct? If this is a bug, I can dig deeper.
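To make the expectation concrete, here is a rough sketch of the per-key behavior I have in mind, written against a plain Scala Seq rather than an RDD. The names splitPerKey and PerKeySplitSketch, the toy data, and the clamping rule are made up for illustration; none of them are Spark APIs, and this is not meant as the way sample or sampleByKey work internally.

```scala
import scala.util.Random

object PerKeySplitSketch {
  // (key, (otherId, rating)): same shape as keyedRatings, but a local Seq.
  type Entry = (Int, (Int, Double))

  // Hypothetical helper: sends roughly `fraction` of each key's values to
  // training and the rest to test. A key with at least two values lands in
  // both sets; a key with a single value lands in training only.
  def splitPerKey(data: Seq[Entry], fraction: Double, seed: Long): (Seq[Entry], Seq[Entry]) = {
    val rng = new Random(seed)
    val perKey = data.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, entries) =>
      val shuffled = rng.shuffle(entries)
      // Clamp so that at least one value goes to training and, when the key
      // has two or more values, at least one is left for test.
      val wanted = math.round(entries.size * fraction).toInt
      val nTrain = math.max(1, math.min(entries.size - 1, wanted))
      shuffled.splitAt(nTrain)
    }
    (perKey.flatMap(_._1), perKey.flatMap(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Toy data (assumption): key k has k + 1 values, so every key has >= 2.
    val toy: Seq[Entry] =
      for { key <- 1 to 5; other <- 1 to (key + 1) } yield (key, (other, 3.0))

    val (train, test) = splitPerKey(toy, 0.8, 1L)

    println(s"Rating keys ${toy.map(_._1).distinct.size}")
    println(s"Training keys ${train.map(_._1).distinct.size}")
    println(s"Test keys ${test.map(_._1).distinct.size}")
  }
}
```

On this toy data every key has at least two values, so all three printed counts come out equal. That equality is the property I was expecting from the RDD version as well.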
Thanks.
Deb