I have equal-sized partitions now, but I want the RDD to be partitioned such
that the partitions are equally weighted by some attribute of each RDD
element (e.g. size or complexity).

I have been looking at the RangePartitioner code and have come up with
something like:

EquallyWeightedPartitioner(noOfPartitions, weightFunction)

1) take the sum of the weights of all elements (or of a sample) and
calculate the average weight per partition
2) take a histogram of the weights
3) assign a list of partitions to each histogram bucket, in proportion to
the bucket's share of the total weight
4) getPartition(key: Any): Int would
  a) compute the key's weight and then find its bucket
  b) return a random partition from the list of partitions associated with
that bucket
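
To make this concrete, here is a rough, untested Scala sketch of steps 1-4
against Spark's Partitioner API. The noOfBuckets parameter, the sampling
step, and the assumption that the weight is non-negative and computable
from the key alone are all my own additions:

import scala.util.Random

import org.apache.spark.Partitioner

// Rough sketch only. Assumes weights are non-negative and that
// sampledWeights comes from a prior pass over the data (e.g. takeSample).
class EquallyWeightedPartitioner[K](
    noOfPartitions: Int,
    weightFunction: K => Double,
    sampledWeights: Seq[Double],
    noOfBuckets: Int = 20)
  extends Partitioner {

  require(noOfPartitions > 0 && sampledWeights.nonEmpty)

  private val bucketWidth =
    math.max(sampledWeights.max / noOfBuckets, Double.MinPositiveValue)

  private def bucketOf(weight: Double): Int =
    math.min((weight / bucketWidth).toInt, noOfBuckets - 1)

  // Step 2: histogram of the sampled weights (total weight per bucket).
  private val bucketWeights: Array[Double] = {
    val acc = new Array[Double](noOfBuckets)
    sampledWeights.foreach(w => acc(bucketOf(w)) += w)
    acc
  }

  // Step 3: give each bucket a list of partition ids, sized roughly in
  // proportion to the bucket's share of the total sampled weight.
  private val bucketPartitions: Array[Array[Int]] = {
    val total = math.max(bucketWeights.sum, Double.MinPositiveValue)
    var next = 0
    bucketWeights.map { bw =>
      val n = math.max(1, math.round(bw / total * noOfPartitions).toInt)
      val ids = Array.tabulate(n)(i => (next + i) % noOfPartitions)
      next += n
      ids
    }
  }

  override def numPartitions: Int = noOfPartitions

  // Step 4: weight -> bucket -> random partition from the bucket's list.
  override def getPartition(key: Any): Int = {
    val candidates = bucketPartitions(bucketOf(weightFunction(key.asInstanceOf[K])))
    candidates(Random.nextInt(candidates.length))
  }
}

Hypothetical usage against a pair RDD would be something like

val weights = rdd.keys.map(weightFunction).takeSample(false, 10000, 42)
val balanced = rdd.partitionBy(
  new EquallyWeightedPartitioner(64, weightFunction, weights))

One caveat I am unsure about: the random pick in 4b makes getPartition
non-deterministic per key, which operations like joins rely on, so hashing
the key over the bucket's partition list may be safer.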

Just wanted to know if someone else had come across this issue before and
whether there was a better way of doing this.


