I use randomSplit to make train/CV/test sets in one go. It definitely
produces disjoint data sets and is efficient. The problem is that you
can't do it per key.
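
For example, a minimal sketch of a three-way split, reusing your
keyedRatings (the 60/20/20 weights are just for illustration):

  // Split the whole RDD into disjoint pieces with a fixed seed.
  // Note the fractions apply to the RDD as a whole, not per key.
  val Array(train, cv, test) =
    keyedRatings.randomSplit(Array(0.6, 0.2, 0.2), seed = 1L)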

I am not sure why your subtract does not behave as expected. I suspect
it is because the values either do not partition the same way or do not
compare equal the way you expect, but I don't see any reason why that
would happen. Tuples compare by value and work as expected here.
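
As a quick sanity check with toy data (this assumes a live
SparkContext sc; the values are made up), subtract does remove
tuple-valued pairs that compare equal:

  val a = sc.parallelize(Seq((1, (10, 4.0)), (2, (20, 3.0))))
  val b = sc.parallelize(Seq((1, (10, 4.0))))
  // Only the pair with no equal counterpart in b survives:
  a.subtract(b).collect()  // Array((2,(20,3.0)))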

On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi,
>
> I have an RDD whose key is a userId and value is (movieId, rating)...
>
> I want to sample 80% of the (movieId, rating) pairs that each userId has
> seen for training; the rest is for test...
>
> val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>
> val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>
> val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>
> val keyedTest = keyedRatings.subtract(keyedTraining)
>
> val blocks = sc.defaultParallelism
>
> println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>
> println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>
> println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>
> My expectation was that the printlns would report the same number of keys
> for keyedRatings, keyedTraining, and keyedTest, but this is not the case...
>
> On MovieLens for example I am noticing the following:
>
> Rating keys 3706
>
> Training keys 3676
>
> Test keys 3470
>
> I also tried sampleByKey as follows:
>
> val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>
> val fractions = keyedRatings.map{x => (x._1, 0.8)}.collect.toMap
>
> val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>
> val keyedTest = keyedRatings.subtract(keyedTraining)
>
> I still get:
>
> Rating keys 3706
>
> Training keys 3682
>
> Test keys 3459
>
> Any idea what is wrong here...
>
> Are my assumptions about the behavior of sample/sampleByKey on a key-value
> RDD correct? If this is a bug, I can dig deeper...
>
> Thanks.
>
> Deb
