Hi, I have an RDD whose key is a userId and whose value is (movieId, rating).
I want to sample 80% of the (movieId, rating) pairs that each userId has seen for training; the rest is for test:

    val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
    val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
    val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
    val keyedTest = keyedRatings.subtract(keyedTraining)

    val blocks = sc.maxParallelism
    println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
    println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
    println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")

My expectation was that the println calls would report the same number of keys for keyedRatings, keyedTraining and keyedTest, but this is not the case.

On MovieLens, for example, I am seeing the following:

    Rating keys 3706
    Training keys 3676
    Test keys 3470

I also tried sampleByKey as follows:

    val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
    val fractions = keyedRatings.map{x => (x._1, 0.8)}.collect.toMap
    val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
    val keyedTest = keyedRatings.subtract(keyedTraining)

Still I get these results:

    Rating keys 3706
    Training keys 3682
    Test keys 3459

Any idea what is wrong here? Are my assumptions about the behavior of sample/sampleByKey on a key-value RDD correct? If this is a bug I can dig deeper.

Thanks.
Deb
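P.S. To make the question concrete without Spark, here is a minimal sketch using plain Scala collections. It assumes that sampling is an independent keep-or-drop draw per record, which may or may not match what sample/sampleByKey do internally. The object name `SplitSketch`, the helpers `bernoulliSplit` and `keyCount`, and the toy data are all hypothetical and are not taken from MovieLens or from my actual job.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical toy data in the same shape as keyedRatings: (key, (otherId, rating)).
  val ratings: Seq[(Int, (Int, Double))] = Seq(
    (1, (10, 4.0)), (1, (11, 3.0)), (1, (12, 5.0)), (1, (13, 2.0)), (1, (14, 4.5)),
    (2, (10, 1.0)), (2, (15, 3.5)),
    (3, (16, 5.0)) // a key with a single record
  )

  // Assumption: each record is kept independently with probability `fraction`.
  // Returns (kept records, remaining records).
  def bernoulliSplit[K, V](data: Seq[(K, V)], fraction: Double, seed: Long): (Seq[(K, V)], Seq[(K, V)]) = {
    val rng = new Random(seed)
    // Draw one flag per record first, so the random generator is consulted exactly once per record.
    val tagged = data.map(record => (record, rng.nextDouble() < fraction))
    val (kept, dropped) = tagged.partition(_._2)
    (kept.map(_._1), dropped.map(_._1))
  }

  // Number of distinct keys present in a collection of (key, value) pairs.
  def keyCount[K, V](data: Seq[(K, V)]): Int = data.map(_._1).distinct.size

  def main(args: Array[String]): Unit = {
    val (train, test) = bernoulliSplit(ratings, 0.8, 1L)
    println(s"Rating keys ${keyCount(ratings)}")
    println(s"Training keys ${keyCount(train)}")
    println(s"Test keys ${keyCount(test)}")
  }
}
```

In this toy data key 3 has only one record, so it can end up in only one of the two splits, and at least one of the two key counts must be below 3 whatever the seed. If Spark's sampling behaves like this per-record draw, keys with very few ratings would drop out of one side in the same way, which might explain the numbers above.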