This feature is called "block affinity groups" and it's been under
discussion for a while, but isn't fully implemented yet. HDFS-2576 is
not a complete solution because it doesn't change the way the balancer
works, just the initial placement of blocks. Once heterogeneous
storage management (HDFS-
Hi Michael,
I think once that work makes it into HDFS, it would be great to expose this
functionality via Spark. This is something worth pursuing because it could
yield order-of-magnitude performance improvements in cases where people need
to join data.
The second item would be very interesting, could yield s
It seems like there are two things here:
- Co-locating blocks with the same keys to avoid network transfer.
- Leveraging partitioning information to avoid a shuffle when data is
already partitioned correctly (even if those partitions aren't yet on the
same machine).
The former seems more complicated.
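To make the distinction concrete, here is a rough sketch (the paths, key parsing,
and partition count are all made up). The second item is something Spark can
already see through an RDD's partitioner; the first depends on where HDFS
physically placed the blocks, which Spark can only observe as locality hints:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Local master only for the sketch; normally supplied by spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("colocation-demo").setMaster("local[*]"))

    // Two inputs that logically share a key (paths are illustrative).
    val users  = sc.hadoopFile("hdfs:///data/users", classOf[TextInputFormat],
                               classOf[LongWritable], classOf[Text])
    val events = sc.hadoopFile("hdfs:///data/events", classOf[TextInputFormat],
                               classOf[LongWritable], classOf[Text])

    // Second item: partitioning information. Spark tracks a partitioner per
    // RDD, and a join between two RDDs that share one needs no extra shuffle.
    val byKey = new HashPartitioner(64)
    val usersByKey = users
      .map { case (_, line) => (line.toString.split(",")(0), line.toString) }
      .partitionBy(byKey)
    println(usersByKey.partitioner)  // Some(HashPartitioner@...)

    // First item: physical placement. HDFS decides which datanodes hold each
    // block; Spark only sees the resulting locality hints and cannot steer them.
    users.partitions.take(3).foreach(p => println(users.preferredLocations(p)))
    events.partitions.take(3).foreach(p => println(events.preferredLocations(p)))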
Christopher, can you expand on the co-partitioning support?
We have a number of Spark SQL tables (saved in Parquet format) that all
could be considered to have a common hash key. Our analytics team wants to
do frequent joins across these different datasets based on this key. It
makes sense that
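For concreteness, the kind of workload in question looks roughly like the
sketch below, using the SQLContext API of that era; the paths, table names,
and the account_id column are invented. Today each such join shuffles both
tables, even though they share a hash key:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    // Local master only for the sketch; normally supplied by spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("hash-key-joins").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Two Parquet-backed tables that share a hash key (names are made up).
    sqlContext.parquetFile("hdfs:///warehouse/accounts").registerTempTable("accounts")
    sqlContext.parquetFile("hdfs:///warehouse/transactions").registerTempTable("transactions")

    // Each run of this join shuffles both sides, because neither Spark nor
    // HDFS knows the tables are laid out compatibly on account_id.
    val joined = sqlContext.sql(
      """SELECT a.account_id, t.amount
        |FROM accounts a
        |JOIN transactions t ON a.account_id = t.account_id""".stripMargin)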
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
If the former, Spark does support copartitioning.
If the latter, that's an HDFS concern outside of Spark's scope. On that note,
Hadoop does also make some attempts to co-locate data, e.g., rack awareness. I'm
sure the paper makes useful c
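By copartitioning I mean RDD-level copartitioning: give two pair RDDs the same
Partitioner and key-based operations between them become narrow dependencies,
so the join itself adds no extra shuffle. A minimal sketch with toy data:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Local master only for the sketch; normally supplied by spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("copartitioning").setMaster("local[*]"))

    // Give both RDDs the same partitioner and keep the partitioned layout around.
    val p = new HashPartitioner(32)
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(p).persist()
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y"))).partitionBy(p).persist()

    println(left.partitioner == right.partitioner)  // true: both Some(HashPartitioner(32))
    val joined = left.join(right)                   // narrow dependencies, no extra shuffle
    println(joined.collect().mkString(", "))        // (a,(1,x))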
It appears support for this type of control over block placement is going
out in the next version of HDFS:
https://issues.apache.org/jira/browse/HDFS-2576
On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf wrote:
> One of my colleagues has been questioning me as to why Spark/HDFS makes no
> attempts
One of my colleagues has been questioning me as to why Spark/HDFS makes no
attempts to co-locate related data blocks. He pointed to this paper from 2011
on the CoHadoop research and the performance improvements it yielded for
MapReduce jobs: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf