The question was not about what kind of sampling, but about random sampling per user. There is no value associated with the items to create strata from. If you read Matteo's answer, that's the way to go about it.
-Surender
On Thursday, 12 April, 2018, 5:49:43 PM IST, Gourav Sengupta wrote:
Hi,
There is an option for stratified sampling available in Spark:
https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling
Also, there is a method called randomSplit which can be called on DataFrames in case we want to split them into training and test data.
Please let me know if this helps.
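
A minimal sketch of both APIs mentioned above, on toy data; the names, fractions, weights, and seeds are illustrative, not from the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Stratified sampling on a pair RDD: keep 10% of key "a", 50% of key "b".
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("b", 4)])
    sampled = pairs.sampleByKey(withReplacement=False,
                                fractions={"a": 0.1, "b": 0.5}, seed=42)

    # randomSplit on a DataFrame, e.g. an 80/20 train/test split.
    df = spark.range(1000)
    train, test = df.randomSplit([0.8, 0.2], seed=42)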
Thanks Matteo, this should work!
-Surender
On Thursday, 12 April, 2018, 1:13:38 PM IST, Matteo Cossu wrote:
I don't think it's trivial. Anyway, the naive solution would be a cross join between users and items, but this can be very, very expensive. I encountered a similar problem once; here is how I solved it:
- create a new RDD with (itemID, index), where the index is a unique integer between 0 and the number of items minus one
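
A minimal sketch of the (itemID, index) approach, assuming the recipe continues by drawing random indices per user and joining them back on the index; the names items, users, and k are illustrative, not from the original mail:

    import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    items = sc.parallelize([10.5, 3.2, 7.7, 1.0])  # toy item list
    users = sc.parallelize(["u1", "u2", "u3"])     # toy user list
    n = items.count()
    k = 2  # samples per user

    # (index, item): a unique integer between 0 and n - 1 for every item.
    indexed_items = items.zipWithIndex().map(lambda x: (x[1], x[0]))

    # For each user, draw k random indices -> (index, user) pairs.
    user_picks = users.flatMap(
        lambda u: [(random.randrange(n), u) for _ in range(k)])

    # Join on the index to attach the actual item values.
    samples = user_picks.join(indexed_items).map(lambda x: x[1])
    # samples: (user, item) pairs, k per user (duplicate picks are possible).

This way only about k rows per user are shuffled, instead of the full user x items cross product.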
Right, this is what I did when I said I tried to persist and create an RDD out of it to sample from. But how to do it for each user? You have one RDD of users on one hand and an RDD of items on the other. How to go from here? Am I missing something trivial?
On Thursday, 12 April, 2018, 2:10:51 AM IST, Matteo Cossu wrote:
Why broadcast this list, then? You should use an RDD or DataFrame instead. For example, RDD has a sample() method that returns a random sample of its elements.
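
For instance, a small sketch of RDD.sample(); the fraction and seed are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    items = spark.sparkContext.parallelize(range(1000000))

    # Keep roughly 1% of the elements, without replacement.
    subset = items.sample(withReplacement=False, fraction=0.01, seed=7)
    print(subset.count())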
On 11 April 2018 at 22:34, surender kumar wrote:
I'm using pySpark. I have a list of 1 million items (all float values) and 1 million users. For each user I want to randomly sample some items from the item list. Broadcasting the item list results in an OutOfMemory error on the driver; I tried setting driver memory up to 10G. I tried to persist this array and create an RDD out of it to sample from, but I don't know how to do that for each user.
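
For reference, a hedged reconstruction of the attempt described above; the names and sizes are illustrative. Broadcasting the full list is the step the question reports as exhausting driver memory:

    import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    items = [random.random() for _ in range(1000000)]  # 1M float items
    users = sc.parallelize(range(1000000))             # 1M user ids

    # One copy of the list lives on the driver and another is shipped to
    # every executor; this broadcast is where the reported OOM occurred.
    b_items = sc.broadcast(items)
    k = 10
    samples = users.map(lambda u: (u, random.sample(b_items.value, k)))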