I've got an RDD where each element is a long string (a whole document). I'm
using PySpark, so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:

def count_partitions(id, iterator):
    # Count the elements in this partition, tagging the count with
    # the partition index so collectAsMap() gives {index: count}.
    c = sum(1 for _ in iterator)
    yield (id, c)

> rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()

This returns the following: 

{0: 866, 1: 1158, 2: 828, 3: 876}
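
(As a sanity check on the counting function, I'd expect glom() to give the
same numbers -- it collects each partition into a single list, so len() of
that list is the partition's element count:)

sizes = rdd.glom().map(len).collect()  # one length per partition, in partition order
dict(enumerate(sizes))                 # same shape as the collectAsMap() result above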

But if I do: 

> rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap()

I get:

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134}

Why this strange redistribution of elements? I'm obviously misunderstanding
how Spark does the partitioning -- could the problem be that the RDD's
elements are long strings?
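
In case it helps to reproduce this, here's a minimal sketch along the lines
of what I'm running in the pyspark shell (sc is the shell's SparkContext, and
the repeated strings are just hypothetical stand-ins for my documents):

# 3728 longish strings (matching the total element count above),
# spread over 4 initial partitions.
docs = ["some document text " * 200 for _ in range(3728)]
test = sc.parallelize(docs, 4)
test.mapPartitionsWithSplit(count_partitions).collectAsMap()
test.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap()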

Help very much appreciated!

Thanks,

Rok

