After using Spark for many years, including for SystemML's Spark backend, I'd
like to give some feedback on potential PairRDD API extensions that I would
find very useful:

1) mapToPair with a preservesPartitioning flag: For many binary operations
with broadcast inputs, we always need to use mapPartitionsToPair just to flag
the operation as partitioning-preserving (see the sketch below). A simple
mapValues cannot be used in these situations because the keys are needed to
look up the matching broadcast blocks.
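
As a minimal sketch of this pattern (assuming the Spark 2.x Java API; the
block type double[], the element-wise add, and names like in1/in2 are just
placeholders, not our actual operator code):

  import java.util.ArrayList;
  import java.util.Map;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.broadcast.Broadcast;
  import scala.Tuple2;

  public class BroadcastBinaryOp {
    // Element-wise add of each RDD block and its broadcast counterpart;
    // mapPartitionsToPair is used only to pass preservesPartitioning=true,
    // even though the keys themselves remain unchanged.
    public static JavaPairRDD<Long, double[]> plusBroadcast(
        JavaPairRDD<Long, double[]> in1,
        Broadcast<Map<Long, double[]>> in2) {
      return in1.mapPartitionsToPair(iter -> {
        ArrayList<Tuple2<Long, double[]>> out = new ArrayList<>();
        while (iter.hasNext()) {
          Tuple2<Long, double[]> kv = iter.next();
          double[] a = kv._2();
          double[] b = in2.value().get(kv._1()); // key needed for lookup
          double[] c = new double[a.length];
          for (int i = 0; i < a.length; i++)
            c[i] = a[i] + b[i];
          out.add(new Tuple2<>(kv._1(), c));
        }
        return out.iterator();
      }, true); // preservesPartitioning
    }
  }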

2) Multi-key lookups: For indexing operations over hash-partitioned
out-of-core RDDs, a lookup with a list of keys would be very helpful
because filter and join have to read all RDD partitions just to deserialize
them and inspect the keys. We currently emulate an efficient multi-key
lookup by creating a probe set of partition ids and wrapping Spark's
PartitionPruningRDD around the input RDD (see the sketch below).
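
A minimal sketch of this emulation, assuming the input is hash-partitioned
(so its partitioner can map keys to partition ids); the class and method
names here are hypothetical:

  import java.io.Serializable;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.rdd.PartitionPruningRDD;
  import scala.Tuple2;
  import scala.runtime.AbstractFunction1;

  public class MultiKeyLookup {
    public static JavaPairRDD<Long, double[]> lookup(
        JavaPairRDD<Long, double[]> in, List<Long> keys) {
      // map requested keys to a probe set of candidate partition ids
      // via the existing hash partitioner
      HashSet<Integer> probeSet = new HashSet<>();
      for (Long key : keys)
        probeSet.add(in.partitioner().get().getPartition(key));
      // wrap a PartitionPruningRDD around the input so that only
      // partitions in the probe set are read and deserialized
      PartitionPruningRDD<Tuple2<Long, double[]>> pruned =
        PartitionPruningRDD.create(in.rdd(), new ProbeFunction(probeSet));
      // final key filter within the few remaining partitions
      HashSet<Long> keySet = new HashSet<>(keys);
      return JavaPairRDD.fromRDD(pruned, in.kClassTag(), in.vClassTag())
        .filter(kv -> keySet.contains(kv._1()));
    }

    // boxed Int => Boolean partition filter for the Scala API
    private static class ProbeFunction
        extends AbstractFunction1<Object, Object> implements Serializable {
      private static final long serialVersionUID = 1L;
      private final Set<Integer> _probeSet;
      public ProbeFunction(Set<Integer> probeSet) { _probeSet = probeSet; }
      @Override public Object apply(Object partitionId) {
        return _probeSet.contains((Integer) partitionId);
      }
    }
  }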

3) Streaming collect/parallelize: For applications that manage their own
driver and executor memory, collect and parallelize pose challenges. The
collect action requires memory for both the collected list of key-value
pairs and the target in-memory representation. Unfortunately, the existing
toLocalIterator is even slower than writing the RDD to HDFS and reading it
back into the driver (see the sketch below). Similarly, strong references
to parallelized RDDs pin memory at the driver. Having an unsafe flag to
instead pin the RDD at a persistent storage level on the executors would
significantly simplify driver memory management.
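
For illustration, a minimal sketch of an incremental collect via
toLocalIterator (assuming one double[] block per key; collectMatrix and the
block layout are placeholders). It avoids the doubled footprint of collect
because at most one fetched partition is held alongside the target, but
toLocalIterator runs a separate job per partition, which is why it can end
up slower than the HDFS round trip mentioned above:

  import java.util.Iterator;
  import org.apache.spark.api.java.JavaPairRDD;
  import scala.Tuple2;

  public class StreamingCollect {
    // Incrementally copies blocks into a preallocated target, so driver
    // memory holds the target representation plus one partition at a
    // time (unlike collect, which also materializes the full list of
    // key-value pairs).
    public static double[][] collectMatrix(
        JavaPairRDD<Long, double[]> blocks, int numBlocks) {
      double[][] target = new double[numBlocks][];
      Iterator<Tuple2<Long, double[]>> iter = blocks.toLocalIterator();
      while (iter.hasNext()) {
        Tuple2<Long, double[]> kv = iter.next();
        target[kv._1().intValue()] = kv._2(); // place block by key
      }
      return target;
    }
  }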

One can work around these minor issues, but it might be useful for many
people to address them once in Spark via API extensions.

Regards,
Matthias
