+1
In fact, Shuffle is not the only component that can benefit from RDMA. The Broadcast module could also benefit with modest modifications, gaining lower CPU occupation and better network performance. Tencent is evaluating this in the lab, and we observe a roughly 50% improvement in TeraSort in a 100G network environment. We believe this will be a key speed improvement for most distributed frameworks in the near future.

________________
AndyHuang

Original Message
From: Yuval Degani <yuval...@gmail.com>
To: dev <dev@spark.apache.org>
Sent: Tuesday, October 10, 2017, 09:40
Subject: [SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine (Internet mail)

Dear Spark community,

I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle Engine". The purpose of this request is to embed an RDMA-accelerated Shuffle Manager into mainstream Spark. Such an implementation is available as an external plugin as part of the "SparkRDMA" project: https://github.com/Mellanox/SparkRDMA. SparkRDMA has already demonstrated enormous potential for accelerating shuffles seamlessly, both in benchmarks and in actual production environments. Adding RDMA capabilities to Spark will be one more important step in enabling lower-level acceleration, as conveyed by the "Tungsten" project.

SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
PDF version: https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf

Overview

An RDMA-accelerated shuffle engine can provide enormous performance benefits to shuffle-intensive Spark jobs, as demonstrated in the "SparkRDMA" plugin open-source project (https://github.com/Mellanox/SparkRDMA). Using RDMA for shuffle improves CPU utilization significantly and reduces I/O processing overhead by bypassing the kernel and networking stack, as well as avoiding memory copies entirely. Those valuable CPU cycles are then consumed directly by the actual Spark workloads, and help reduce the job runtime significantly. This performance gain is demonstrated with both the industry-standard HiBench TeraSort (which shows a 1.5x speedup in sorting) and shuffle-intensive customer applications.

Background and Motivation

Spark's current Shuffle engine implementation over "Netty" faces many performance issues often seen in other socket-based applications. Using the standard socket-based TCP/IP communication model for heavy data transfers usually requires copying the data multiple times and going through many system calls in the I/O path. These consume significant amounts of CPU cycles and memory that could otherwise have been assigned to the actual job at hand. This becomes even more critical for latency-sensitive Spark Streaming and Deep Learning applications over SparkML.

RDMA (Remote Direct Memory Access) is already a commodity technology, supported on most mid-range to high-end network adapter cards manufactured by various companies. Furthermore, RDMA-capable networks are already offered on public clouds such as Microsoft Azure, and will probably be supported in AWS soon to appeal to MPI users.
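For context, enabling the plugin today is a pure configuration change. A minimal sketch in Scala, assuming the RdmaShuffleManager class name documented in the SparkRDMA repository (the jar path is illustrative):

    import org.apache.spark.SparkConf

    // Minimal sketch: enabling the external SparkRDMA plugin via configuration.
    // The jar path is illustrative; the class name follows the SparkRDMA
    // repository's documentation and should be verified against the release used.
    val conf = new SparkConf()
      .setAppName("ShuffleHeavyJob")
      .set("spark.driver.extraClassPath", "/opt/sparkrdma/spark-rdma.jar")
      .set("spark.executor.extraClassPath", "/opt/sparkrdma/spark-rdma.jar")
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")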
Existing users of Spark on Microsoft Azure servers can get the benefits of RDMA by running on a suitable instance with this plugin, without needing any application changes.

RDMA provides a unique approach for accessing memory locations over the network, without the need for copying on either the transmitter side or the receiver side. These remote memory read and write operations are enabled by a standard interface that has been part of mainstream Linux releases for many years. This standardized interface allows direct access to remote memory from user-space, while skipping costly system calls in the I/O path. Due to its many virtues, RDMA has become a standard data transfer protocol in HPC (High Performance Computing) applications, with MPI being the most prominent. RDMA has traditionally been associated with InfiniBand networks, but with the standardization of RDMA over Converged Ethernet (RoCE), RDMA has been supported and widely used on Ethernet networks for many years now.

Since Spark is all about performing everything in-memory, RDMA seems like a perfect fit for filling in the gap of transferring intermediate in-memory data between the participating nodes. SparkRDMA (https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle performance can dramatically improve with the use of RDMA. Today, it is gaining significant traction with many users, and has successfully demonstrated major performance improvements on production applications in high-profile technology companies.

SparkRDMA is a generic and easy-to-use plugin. However, in order to gain wide adoption with its effortless acceleration, it must be integrated into Apache Spark itself. The purpose of this SPIP is to introduce RDMA into mainstream Spark for improving Shuffle performance, and to pave the way for further accelerations such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMeoF (NVMe over Fabrics) and more.

Target Personas

Any Spark user who cares about performance.

Goals

• Use RDMA to improve Shuffle data transfer performance and reduce total job runtime.
• Automatically activate RDMA-accelerated shuffles where supported.

Non-Goals

This SPIP limits the usage of RDMA to Shuffle data transfers only. It is the first step in introducing RDMA to Spark, and opens up a range of possibilities for future improvements. In the future, SPIPs can address other network consumers that can benefit from RDMA, such as, but not limited to: Broadcast, RDD transfers, RPC messaging, storage access, GPU access and an HDFS-RDMA interface.

API Changes

There will be no API changes required.

Proposed Design

SparkRDMA currently utilizes the ShuffleManager interface (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala) to provide RDMA-accelerated shuffles. This interface is sufficient to allow network-savvy users to take advantage of RDMA capabilities in the context of Spark. However, to make this technology more easily accessible, we propose to add the code to mainstream Spark, and to implement a method to automatically use the RdmaShuffleManager when RDMA is supported on the system. This way, any Spark user that already has the hardware support for RDMA can seamlessly enjoy its performance benefits.
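A minimal sketch of what such automatic selection could look like; the sysfs probe and the selector object are assumptions for illustration, not the proposed implementation (only the two manager class names refer to real components):

    import java.nio.file.{Files, Paths}

    // Illustrative sketch only: automatic shuffle-manager selection.
    object ShuffleManagerSelector {
      // RDMA-capable devices are exposed by the kernel under /sys/class/infiniband.
      private def rdmaDevicePresent(): Boolean = {
        val sysfs = Paths.get("/sys/class/infiniband")
        if (!Files.isDirectory(sysfs)) return false
        val entries = Files.list(sysfs)
        try entries.findAny().isPresent finally entries.close()
      }

      // Respect an explicit user setting; otherwise pick RDMA when hardware allows.
      def select(userSetting: Option[String]): String = userSetting.getOrElse {
        if (rdmaDevicePresent()) "org.apache.spark.shuffle.rdma.RdmaShuffleManager"
        else "org.apache.spark.shuffle.sort.SortShuffleManager"
      }
    }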
Furthermore, SparkRDMA in its current plugin form is limited by several constraints that can be removed once it is introduced into mainstream Spark. Among those are:

• SparkRDMA manages its own memory, off-heap. When integrated into Spark, it can use Tungsten physical memory for all of its needs, allowing for faster allocations and memory registrations that can increase performance significantly. Also, any data that resides in Tungsten memory can be transferred with almost no overhead (see the sketch at the end of this message).
• MapStatuses are redundant – no need for those extra transfers that take precious seconds in many jobs.

Rejected Designs

Support RDMA with the SparkRDMA plugin:
• The SparkRDMA plugin approach introduces limitations and overhead that reduce performance.
• Plugins are awkward to build, install and deploy, and that is why they are usually avoided.
• Forward support is difficult to maintain for plugins that are not part of the upstream project, specifically for Spark, which is a rapidly changing project.

To ensure maximum performance and to allow mass adoption of this general solution, RDMA capabilities must be introduced into Spark itself.
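Referring back to the Tungsten memory point above, the sketch below illustrates the intended direction: allocate shuffle buffers from Tungsten-style off-heap memory and register them with the NIC once, so remote executors can fetch blocks with zero copies. registerMemoryRegion is a hypothetical stand-in for a native verbs binding (an ibv_reg_mr equivalent) and is not an existing Spark or SparkRDMA API; Platform is Spark's real off-heap allocation utility used by Tungsten.

    import org.apache.spark.unsafe.Platform

    // Illustrative sketch only; `registerMemoryRegion` is a stub, not a real API.
    object TungstenRdmaSketch {
      // Hypothetical: pin and register [address, address + length) with the NIC,
      // returning the remote key peers would use for zero-copy RDMA reads.
      def registerMemoryRegion(address: Long, length: Long): Int = 0 // stub

      def main(args: Array[String]): Unit = {
        val length = 4L * 1024 * 1024
        val address = Platform.allocateMemory(length) // Tungsten-style allocation
        try {
          val rkey = registerMemoryRegion(address, length) // one-time registration
          // Shuffle blocks written into this region could then be fetched by
          // remote executors directly, with no copies on either side.
          println(s"registered $length bytes at $address, rkey=$rkey")
        } finally {
          Platform.freeMemory(address)
        }
      }
    }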