Oh... so if I had something more reasonable, like RDDs full of tuples of,
say, (Int, Set, Set), could I expect a more uniform distribution?

Thanks
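To see why the integers all ended up in one partition (per Matei's explanation below), here is a minimal sketch in plain Scala, no Spark needed. It assumes the behavior Matei describes: with coalesce(5, shuffle = true), an element's target partition is driven by its hashCode mod 5, and Int.hashCode is just the integer itself.

```scala
// All surviving elements are multiples of 20, hence 0 mod 5.
val elems = Array.tabulate(100)(i => i).filter(_ % 20 == 0)  // 0, 20, 40, 60, 80

// Int.hashCode returns the int itself, so every element maps to bucket 0.
val partitionOf = elems.map(e => math.abs(e.hashCode) % 5)
println(partitionOf.mkString(", "))  // 0, 0, 0, 0, 0

// Tuples mix their fields' hashes, so composite keys generally
// spread across buckets much more evenly (exact spread varies).
val tuplePartitionOf = elems.map(e => math.abs((e, e.toString).hashCode) % 5)
println(tuplePartitionOf.mkString(", "))
```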


On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> This happened because they were integers equal to 0 mod 5, and we used the
> default hashCode implementation for integers, which will map them all to 0.
> There's no API method that will look at the resulting partition sizes and
> rebalance them, but you could use another hash function.
>
> Matei
>
> On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthe...@gmail.com> wrote:
>
> > Hi,
> >
> > sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0
> ).coalesce(5,true).glom.collect  yields
> >
> > Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
> Array(), Array())
> >
> > How do I get something more like:
> >
> >  Array(Array(0), Array(20), Array(40), Array(60), Array(80))
> >
> > Thanks
>
>
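For the exact one-element-per-partition layout asked for above, one hypothetical route (in the spirit of Matei's "use another hash function" suggestion) is a custom Partitioner. This is only a sketch: it assumes a live SparkContext `sc` and that each element divided by 20 gives a usable partition index.

```scala
import org.apache.spark.Partitioner

// Routes each record to the partition named by its Int key.
class ExactPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val rdd = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
val spread = rdd
  .map(x => (x / 20, x))                 // keys 0..4 pick the target partition
  .partitionBy(new ExactPartitioner(5))
  .values
// spread.glom.collect should now yield
// Array(Array(0), Array(20), Array(40), Array(60), Array(80))
```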
