+1

In fact, Shuffle is not the only component that can benefit from RDMA. Broadcast 
modules can also benefit with a reasonable amount of modification, gaining lower 
CPU usage and better network performance.


Tencent is evaluating this in our lab, and we observe a roughly 50% improvement 
in TeraSort on a 100G network. We believe this will be a key speed improvement 
for most distributed frameworks in the near future.

________________

AndyHuang

Original message
From: Yuval Degani<yuval...@gmail.com>
To: dev<dev@spark.apache.org>
Sent: Tuesday, October 10, 2017, 09:40
主题: [SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine(Internet mail)

Dear Spark community,
I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle 
Engine".
The purpose of the request is to embed an RDMA-accelerated Shuffle Manager into 
mainstream Spark.

Such an implementation is already available as an external plugin, as part of the 
"SparkRDMA" project: https://github.com/Mellanox/SparkRDMA.
SparkRDMA has already demonstrated enormous potential for accelerating shuffles 
seamlessly, both in benchmarks and in actual production environments.

Adding RDMA capabilities to Spark will be one more important step in enabling 
lower-level acceleration as conveyed by the "Tungsten" project.

SparkRDMA will be presented at Spark Summit 2017 in Dublin 
(https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
PDF version: 
https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf

Overview
An RDMA-accelerated shuffle engine can provide enormous performance benefits to 
shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
open-source project (https://github.com/Mellanox/SparkRDMA).
Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
processing overhead by bypassing the kernel and networking stack as well as 
avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
directly by the actual Spark workloads, helping to reduce the job runtime 
significantly.
This performance gain is demonstrated both with the industry-standard HiBench 
TeraSort benchmark (which shows a 1.5x speedup in sorting) and with 
shuffle-intensive customer applications.
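
For context on how these results are obtained today, enabling the plugin is purely 
a configuration step. Below is a minimal Scala sketch, assuming the SparkRDMA jar 
is already on the driver and executor classpaths and that the shuffle manager 
class name matches the one published in the SparkRDMA repository (verify it 
against the version you deploy):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: route shuffle traffic through the plugin by pointing
    // spark.shuffle.manager at its ShuffleManager implementation.
    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")
      // Class name as published by the SparkRDMA project (an assumption here;
      // check the plugin documentation for the exact name in your version).
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
      // Hypothetical jar location; the plugin jar must be visible to both
      // the driver and the executors.
      .set("spark.driver.extraClassPath", "/path/to/spark-rdma.jar")
      .set("spark.executor.extraClassPath", "/path/to/spark-rdma.jar")
    val sc = new SparkContext(conf)
    // The job itself is unchanged; only the shuffle transport differs.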

Background and Motivation
Spark's current Shuffle engine implementation over “Netty” faces many 
performance issues often seen in other socket-based applications.
Using the standard socket-based TCP/IP communication model for heavy data 
transfers usually requires copying the data multiple times and going through many 
system calls in the I/O path. These consume significant amounts of CPU cycles and 
memory that could otherwise have been assigned to the actual job at hand. This 
becomes even more critical with latency-sensitive Spark Streaming and Deep 
Learning applications over SparkML.
RDMA (Remote Direct Memory Access) is already a commodity technology that is 
supported on most mid-range to high-end Network Adapter cards, manufactured by 
various companies. Furthermore, RDMA-capable networks are already offered on 
public clouds such as Microsoft Azure, and will probably be supported in AWS 
soon to appeal to MPI users. Existing users of Spark on Microsoft Azure servers 
can get the benefits of RDMA by running on a suitable instance with this 
plugin, without needing any application changes.
RDMA provides a unique approach for accessing memory locations over the 
network, without the need for copying on either the transmitter side or the 
receiver side.
These remote memory read and write operations are enabled by a standard 
interface that has been part of mainstream Linux releases for many years now. This
standardized interface allows direct access to remote memory from user-space, 
while skipping costly system calls in the I/O-path. Due to its many virtues, 
RDMA has found its way to be a standard data transfer protocol in HPC (High 
Performance Computing) applications, with MPI being the most prominent.
RDMA has been traditionally associated with InfiniBand networks, but with the 
standardization of RDMA over Converged Ethernet (RoCE), RDMA has been supported 
and widely used on Ethernet networks for many years.
Since Spark is all about performing everything in-memory, RDMA seems like a 
perfect fit for filling in the gap of transferring intermediate in-memory data
between the participating nodes. SparkRDMA 
(https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle 
performance can dramatically improve with the use of RDMA. Today, it is gaining 
significant traction with many users, and has successfully demonstrated major 
performance improvement on production applications in high-profile technology 
companies. SparkRDMA is a generic and easy-to-use plugin. However, in order to 
gain wide adoption through its effortless acceleration, it must be integrated 
into Apache Spark itself.
The purpose of this SPIP is to introduce RDMA into mainstream Spark to improve 
Shuffle performance, and to pave the way for further accelerations such as 
GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMeoF (NVMe over Fabrics), 
and more.

Target Personas
Any Spark user who cares about performance

Goals
Use RDMA to improve Shuffle data transfer performance and reduce total job 
runtime.
Automatically activate RDMA-accelerated shuffles where supported.

Non-Goals
This SPIP limits the usage of RDMA to Shuffle data transfers only. It is the 
first step in introducing RDMA to Spark and opens a range of possibilities for 
future improvements. Future SPIPs can address other network consumers that can 
benefit from RDMA, such as, but not limited to: Broadcast, RDD transfers, RPC 
messaging, storage access, GPU access, and an HDFS-RDMA interface.

API Changes
There will be no API changes required.

Proposed Design
SparkRDMA currently utilizes the ShuffleManager interface 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala)
to allow RDMA-accelerated Shuffles. This interface is sufficient to allow 
network-savvy users to take advantage of RDMA capabilities in the context of 
Spark. However, to make this technology more easily accessible, we propose to 
add the code to mainstream Spark and to implement a method to automatically use 
the RdmaShuffleManager when RDMA is supported on the system. This way, any Spark 
user who already has hardware support for RDMA can seamlessly enjoy its 
performance benefits.
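
As an illustration only (not a committed design), the automatic selection could 
conceptually look like the following Scala sketch; the probe isRdmaAvailable and 
the class name RdmaShuffleManager are assumptions for the purpose of this example:

    import org.apache.spark.SparkConf

    object ShuffleManagerSelector {

      // Hypothetical probe: a real implementation might check for an RDMA-capable
      // NIC or a usable verbs provider. Here we simply look for the Linux RDMA
      // device directory.
      private def isRdmaAvailable: Boolean =
        new java.io.File("/sys/class/infiniband").exists()

      // Respect an explicit user setting; otherwise prefer the RDMA shuffle
      // manager when the hardware supports it, falling back to the default
      // sort-based shuffle when it does not.
      def shuffleManagerClass(conf: SparkConf): String =
        conf.getOption("spark.shuffle.manager").getOrElse {
          if (isRdmaAvailable) "org.apache.spark.shuffle.rdma.RdmaShuffleManager"
          else "org.apache.spark.shuffle.sort.SortShuffleManager"
        }
    }

Falling back to SortShuffleManager keeps behavior identical on hosts without RDMA 
hardware, so users who cannot benefit from the feature see no change.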
Furthermore, SparkRDMA, in its current plugin form, is limited by several 
constraints that can be removed once it is introduced into mainstream Spark. 
Among those are:

• SparkRDMA manages its own memory off-heap. When integrated into Spark, it can 
use Tungsten physical memory for all of its needs, allowing faster allocations 
and memory registrations that can increase performance significantly. Also, any 
data that already resides in Tungsten memory can be transferred with almost no 
overhead.

• MapStatuses become redundant: no need for those extra transfers, which take 
precious seconds in many jobs.

Rejected Designs
Support RDMA with the SparkRDMA plugin:

• The SparkRDMA plugin approach introduces limitations and overhead that reduce 
performance.

• Plugins are awkward to build, install, and deploy, which is why they are 
usually avoided.

• Forward compatibility is difficult to maintain for plugins that are not part 
of the upstream project, especially for Spark, which is a rapidly evolving 
project.

To ensure maximum performance and to allow mass adoption of this general 
solution, RDMA capabilities must be introduced into Spark itself.
