I've got an RDD where each element is a long string (a whole document). I'm using pyspark, so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with:
    def count_partitions(id, iterator):
        c = sum(1 for _ in iterator)
        yield (id, c)

    rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()

This returns the following:

    {0: 866, 1: 1158, 2: 828, 3: 876}

But if I do:

    rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap()

I get:

    {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134}

Why this strange redistribution of elements? I'm obviously misunderstanding how spark does the partitioning -- is it a problem with having an RDD of long strings?

Help very much appreciated!

Thanks,

Rok
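P.S. In case a runnable reproduction helps, here's a minimal standalone sketch. The data, app name, and local master are made up for illustration (my real RDD holds whole documents); mapPartitionsWithIndex is the non-deprecated name for mapPartitionsWithSplit, and glom() gives an independent way to check the same counts:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-counts")

    # Stand-in for my real RDD of long document strings: 3728 fabricated
    # docs (matching my element total) spread over 4 initial partitions.
    docs = sc.parallelize(["document %d" % i for i in range(3728)], 4)

    def count_partitions(split_index, iterator):
        # Emit (partition index, number of elements in that partition).
        yield (split_index, sum(1 for _ in iterator))

    # Same counting as above, via the non-deprecated method name.
    print(docs.mapPartitionsWithIndex(count_partitions).collectAsMap())

    # glom() materializes each partition as a list, so mapping len over
    # it is an independent way to read off per-partition sizes.
    print(docs.glom().map(len).collect())
    print(docs.repartition(8).glom().map(len).collect())

    sc.stop()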