+1
In fact, Shuffle is not the only component that can benefit from RDMA. The Broadcast module could also benefit with modest modifications, gaining lower CPU occupation and better network performance. Tencent is evaluating this in the lab, and we observe a roughly 50% improvement in TeraSort in a 100G network environment. We believe this will be a key speed improvement for most distributed frameworks in the near future.

________________
AndyHuang

Original Message
From: Yuval Degani <yuval...@gmail.com>
To: dev <dev@spark.apache.org>
Sent: Tuesday, October 10, 2017, 09:40
Subject: [SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine (Internet mail)

Dear Spark community,

I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle Engine". The purpose of this request is to embed an RDMA-accelerated Shuffle Manager into mainstream Spark. Such an implementation is available as an external plugin as part of the "SparkRDMA" project: https://github.com/Mellanox/SparkRDMA. SparkRDMA has already demonstrated enormous potential for accelerating shuffles seamlessly, both in benchmarks and in actual production environments. Adding RDMA capabilities to Spark will be one more important step in enabling lower-level acceleration, as conveyed by the "Tungsten" project.

SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
PDF version: https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf

Overview

An RDMA-accelerated shuffle engine can provide enormous performance benefits to shuffle-intensive Spark jobs, as demonstrated in the "SparkRDMA" plugin open-source project (https://github.com/Mellanox/SparkRDMA). Using RDMA for shuffle improves CPU utilization significantly and reduces I/O processing overhead by bypassing the kernel and networking stack, as well as avoiding memory copies entirely. Those valuable CPU cycles are then consumed directly by the actual Spark workloads, and help reduce the job runtime significantly. This performance gain is demonstrated with both the industry-standard HiBench TeraSort (which shows a 1.5x speedup in sorting) and shuffle-intensive customer applications.

Background and Motivation

Spark's current Shuffle engine implementation over "Netty" faces many performance issues often seen in other socket-based applications. Using the standard socket-based TCP/IP communication model for heavy data transfers usually requires copying the data multiple times and going through many system calls in the I/O path. These consume significant amounts of CPU cycles and memory that could otherwise have been assigned to the actual job at hand. This becomes even more critical for latency-sensitive Spark Streaming and Deep Learning applications over SparkML.

RDMA (Remote Direct Memory Access) is already a commodity technology, supported on most mid-range to high-end network adapter cards manufactured by various companies. Furthermore, RDMA-capable networks are already offered on public clouds such as Microsoft Azure, and will probably be supported in AWS soon to appeal to MPI users.
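For context, enabling the plugin today is a pure configuration change. A minimal sketch in Scala, assuming the RdmaShuffleManager class name documented in the SparkRDMA repository (the jar path is illustrative):

    import org.apache.spark.SparkConf

    // Minimal sketch: enabling the external SparkRDMA plugin via configuration.
    // The jar path is illustrative; the class name follows the SparkRDMA
    // repository's documentation and should be verified against the release used.
    val conf = new SparkConf()
      .setAppName("ShuffleHeavyJob")
      .set("spark.driver.extraClassPath", "/opt/sparkrdma/spark-rdma.jar")
      .set("spark.executor.extraClassPath", "/opt/sparkrdma/spark-rdma.jar")
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")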
Existing users of Spark on Microsoft Azure servers can get the benefits of RDMA by running on a suitable instance with this plugin, without needing any application changes.

RDMA provides a unique approach for accessing memory locations over the network, without the need for copying on either the transmitter side or the receiver side. These remote memory read and write operations are enabled by a standard interface that has been part of mainstream Linux releases for many years. This standardized interface allows direct access to remote memory from user-space, while skipping costly system calls in the I/O path. Due to its many virtues, RDMA has become a standard data transfer protocol in HPC (High Performance Computing) applications, with MPI being the most prominent. RDMA has traditionally been associated with InfiniBand networks, but with the standardization of RDMA over Converged Ethernet (RoCE), RDMA has been supported and widely used on Ethernet networks for many years now.

Since Spark is all about performing everything in-memory, RDMA seems like a perfect fit for filling in the gap of transferring intermediate in-memory data between the participating nodes. SparkRDMA (https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle performance can dramatically improve with the use of RDMA. Today, it is gaining significant traction with many users, and has successfully demonstrated major performance improvements on production applications in high-profile technology companies.

SparkRDMA is a generic and easy-to-use plugin. However, in order to gain wide adoption with its effortless acceleration, it must be integrated into Apache Spark itself. The purpose of this SPIP is to introduce RDMA into mainstream Spark for improving Shuffle performance, and to pave the way for further accelerations such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMeoF (NVMe over Fabrics) and more.

Target Personas

Any Spark user who cares about performance.

Goals

• Use RDMA to improve Shuffle data transfer performance and reduce total job runtime.
• Automatically activate RDMA-accelerated shuffles where supported.

Non-Goals

This SPIP limits the usage of RDMA to Shuffle data transfers only. It is the first step in introducing RDMA to Spark, and opens up a range of possibilities for future improvements. In the future, SPIPs can address other network consumers that can benefit from RDMA, such as, but not limited to: Broadcast, RDD transfers, RPC messaging, storage access, GPU access and an HDFS-RDMA interface.

API Changes

There will be no API changes required.

Proposed Design

SparkRDMA currently utilizes the ShuffleManager interface (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala) to provide RDMA-accelerated shuffles. This interface is sufficient to allow network-savvy users to take advantage of RDMA capabilities in the context of Spark. However, to make this technology more easily accessible, we propose to add the code to mainstream Spark, and to implement a method to automatically use the RdmaShuffleManager when RDMA is supported on the system. This way, any Spark user that already has the hardware support for RDMA can seamlessly enjoy its performance benefits.
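A minimal sketch of what such automatic selection could look like; the sysfs probe and the selector object are assumptions for illustration, not the proposed implementation (only the two manager class names refer to real components):

    import java.nio.file.{Files, Paths}

    // Illustrative sketch only: automatic shuffle-manager selection.
    object ShuffleManagerSelector {
      // RDMA-capable devices are exposed by the kernel under /sys/class/infiniband.
      private def rdmaDevicePresent(): Boolean = {
        val sysfs = Paths.get("/sys/class/infiniband")
        if (!Files.isDirectory(sysfs)) return false
        val entries = Files.list(sysfs)
        try entries.findAny().isPresent finally entries.close()
      }

      // Respect an explicit user setting; otherwise pick RDMA when hardware allows.
      def select(userSetting: Option[String]): String = userSetting.getOrElse {
        if (rdmaDevicePresent()) "org.apache.spark.shuffle.rdma.RdmaShuffleManager"
        else "org.apache.spark.shuffle.sort.SortShuffleManager"
      }
    }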
Furthermore, SparkRDMA in its current plugin form is limited by several constraints that can be removed once it is introduced into mainstream Spark. Among those are:

• SparkRDMA manages its own memory, off-heap. When integrated into Spark, it can use Tungsten physical memory for all of its needs, allowing for faster allocations and memory registrations that can increase performance significantly. Also, any data that resides in Tungsten memory can be transferred with almost no overhead (see the sketch at the end of this message).
• MapStatuses are redundant – no need for those extra transfers that take precious seconds in many jobs.

Rejected Designs

Support RDMA with the SparkRDMA plugin:
• The SparkRDMA plugin approach introduces limitations and overhead that reduce performance.
• Plugins are awkward to build, install and deploy, and that is why they are usually avoided.
• Forward support is difficult to maintain for plugins that are not part of the upstream project, specifically for Spark, which is a rapidly changing project.

To ensure maximum performance and to allow mass adoption of this general solution, RDMA capabilities must be introduced into Spark itself.
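Referring back to the Tungsten memory point above, the sketch below illustrates the intended direction: allocate shuffle buffers from Tungsten-style off-heap memory and register them with the NIC once, so remote executors can fetch blocks with zero copies. registerMemoryRegion is a hypothetical stand-in for a native verbs binding (an ibv_reg_mr equivalent) and is not an existing Spark or SparkRDMA API; Platform is Spark's real off-heap allocation utility used by Tungsten.

    import org.apache.spark.unsafe.Platform

    // Illustrative sketch only; `registerMemoryRegion` is a stub, not a real API.
    object TungstenRdmaSketch {
      // Hypothetical: pin and register [address, address + length) with the NIC,
      // returning the remote key peers would use for zero-copy RDMA reads.
      def registerMemoryRegion(address: Long, length: Long): Int = 0 // stub

      def main(args: Array[String]): Unit = {
        val length = 4L * 1024 * 1024
        val address = Platform.allocateMemory(length) // Tungsten-style allocation
        try {
          val rkey = registerMemoryRegion(address, length) // one-time registration
          // Shuffle blocks written into this region could then be fetched by
          // remote executors directly, with no copies on either side.
          println(s"registered $length bytes at $address, rkey=$rkey")
        } finally {
          Platform.freeMemory(address)
        }
      }
    }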