Weichen Xu created SPARK-18078:
----------------------------------
Summary: Add option to customize zipPartitions task preferred
locations
Key: SPARK-18078
URL: https://issues.apache.org/jira/browse/SPARK-18078
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Weichen Xu
By default, the preferred locations of an `RDD.zipPartitions` task are the
intersection of the locations of the corresponding zipped partitions; if the
intersection is empty, the union of these locations is used instead.
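As a rough illustration of that default rule, here is a minimal sketch in plain Scala, with hosts modelled as strings. The names `DefaultZipLocations` and `locations` are made up for this example and are not Spark APIs; the sketch also assumes at least one RDD is being zipped.

```scala
object DefaultZipLocations {
  // prefs(k) holds the preferred hosts of the k-th zipped partition.
  // Assumption: prefs is non-empty (reduce would fail otherwise).
  def locations(prefs: Seq[Seq[String]]): Seq[String] = {
    // Hosts that every zipped partition lists as a preferred location.
    val common = prefs.reduce((a, b) => a.intersect(b))
    // Fall back to the deduplicated union when no host is shared.
    if (common.nonEmpty) common else prefs.flatten.distinct
  }

  // Two partitions that share host2.
  val shared: Seq[String] =
    locations(Seq(Seq("host1", "host2"), Seq("host2", "host3")))

  // Two partitions with no host in common.
  val disjoint: Seq[String] =
    locations(Seq(Seq("host1"), Seq("host3")))
}
```

When the partitions share no host, the task may be scheduled on any host in the union, including one that holds only the smaller input.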
In some special cases, however, I want to customize the task preferred
locations for better performance. A typical case is in spark-tfocs: multiplying
a distributed matrix (`DMatrix`) by a vector (`DVector`) uses `RDD.zipPartitions`.
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala
Usually the `DMatrix` RDD is much larger than the `DVector` RDD, so we want the
zipPartitions task to always be placed at the `DMatrix` partition's location.
This gives better data locality than the default preferred location strategy.
I think it makes sense to add an option for this.
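One possible shape for such an option, sketched in plain Scala with hosts modelled as strings. Everything here is hypothetical: `PrimaryZipLocations`, `locations`, and the `primary` parameter are illustration names only, not an existing or proposed Spark API. The idea is that the caller names one input (for example the `DMatrix` RDD) whose partition locations win, and the default rule applies when no input is named or the named one has no location information.

```scala
object PrimaryZipLocations {
  // prefs(k) holds the preferred hosts of the k-th zipped partition.
  // primary is a hypothetical option: the index of the input whose
  // locations should be used as-is. Assumption: prefs is non-empty
  // and primary, when set, is a valid index into prefs.
  def locations(prefs: Seq[Seq[String]], primary: Option[Int]): Seq[String] =
    primary match {
      // The designated input has known locations: use only those.
      case Some(i) if prefs(i).nonEmpty => prefs(i)
      // Otherwise keep the default rule: intersection, then union.
      case _ =>
        val common = prefs.reduce((a, b) => a.intersect(b))
        if (common.nonEmpty) common else prefs.flatten.distinct
    }

  // Index 0 stands for the large DMatrix partition, index 1 for the DVector one.
  val prefs: Seq[Seq[String]] = Seq(Seq("host1"), Seq("host2"))

  val byDefault: Seq[String] = locations(prefs, None)
  val matrixFirst: Seq[String] = locations(prefs, Some(0))
}
```

With disjoint locations, the default rule offers both hosts to the scheduler, while naming the matrix input restricts the task to the host that holds the larger partition.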