Oh... so if I had something more reasonable, like RDDs full of tuples of, say, (Int, Set, Set), could I expect a more uniform distribution?
Thanks

On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> This happened because they were integers equal to 0 mod 5, and we used the
> default hashCode implementation for integers, which will map them all to 0.
> There's no API method that will look at the resulting partition sizes and
> rebalance them, but you could use another hash function.
>
> Matei
>
> On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthe...@gmail.com> wrote:
>
> > Hi,
> >
> > sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 ).coalesce(5,true).glom.collect
> >
> > yields
> >
> > Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array())
> >
> > How do I get something more like:
> >
> > Array(Array(0), Array(20), Array(40), Array(60), Array(80))
> >
> > Thanks
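[A minimal pure-Scala sketch of the behavior discussed above, with no Spark dependency so it runs standalone. The `nonNegativeMod` helper mirrors the modulo step a hash partitioner applies to `key.hashCode`; the "key by index" fix shown below is one illustration of Matei's "use another hash function" suggestion, not the only option, and the Spark calls named in the comments are how it would roughly translate, not code from the thread.]

```scala
object PartitionDemo {
  // Like x % mod, but always in [0, mod), matching hash-partitioning behavior.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  val values = Seq(0, 20, 40, 60, 80)

  // Int.hashCode is the value itself, and all five values are 0 mod 5,
  // so with 5 partitions every record lands in partition 0.
  def skewed: Seq[Int] = values.map(v => nonNegativeMod(v.hashCode, 5))

  // One fix: key each record by its position instead of its value
  // (in Spark, roughly rdd.zipWithIndex.map(_.swap)
  //   .partitionBy(new org.apache.spark.HashPartitioner(5)).values).
  // Indices 0..4 hash to five distinct partitions.
  def balanced: Seq[Int] = values.indices.map(i => nonNegativeMod(i, 5))

  def main(args: Array[String]): Unit = {
    println(skewed)    // List(0, 0, 0, 0, 0)
    println(balanced)  // Vector(0, 1, 2, 3, 4)
  }
}
```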