Why broadcast this list at all? You should use an RDD or DataFrame instead. For example, RDD has a sample() method that returns a new RDD containing a random sample of its elements, so the items never have to be collected on the driver.
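Something along these lines might work (a rough sketch only; the item values, partition count, fraction and seed below are placeholders, not taken from your setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("item-sampling").getOrCreate()
    sc = spark.sparkContext

    # Keep the 1M float items distributed as an RDD instead of broadcasting them
    items_rdd = sc.parallelize([float(i) for i in range(1000000)], numSlices=100)

    # sample(withReplacement, fraction, seed) is evaluated on the executors and
    # returns a new RDD with roughly fraction * count elements
    sampled = items_rdd.sample(withReplacement=False, fraction=0.001, seed=42)
    print(sampled.count())

For per-user samples you would still need to decide how to pair users with sampled items (e.g. repeating the sample per user or joining users against sampled items), but the point is that the sampling itself runs on the workers, not the driver.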
On 11 April 2018 at 22:34, surender kumar <skiit...@yahoo.co.uk.invalid> wrote:
> I'm using PySpark.
> I have a list of 1 million items (all float values) and 1 million users. For
> each user I want to randomly sample some items from the item list.
> Broadcasting the item list results in an OutOfMemory error on the driver,
> even after setting driver memory up to 10G. I tried to persist this array on
> disk, but I'm not able to figure out a way to read it back on the workers.
>
> Any suggestion would be appreciated.