I've got an RDD where each element is a long string (a whole document). I'm
using pyspark so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:
def count_partitions(id, iterator):
c = sum(1 for _ in iterator)
yield (id,c)
> rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()
This returns the following:
{0: 866, 1: 1158, 2: 828, 3: 876}
But if I do:
> rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap()
I get
{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134}
Why this strange redistribution of elements? I'm obviously misunderstanding how
spark does the partitioning -- is it a problem with having a list of strings as
an RDD?
Help vey much appreciated!
Thanks,
Rok
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]