"Can normal hardware benifit from this RDMA optimization? 100G network is very expensive and I think only a few Spark clusters are running on it."
Hi Wenchen,
Yes, RDMA-capable network adapters with lower speeds, as low as 10G, will also show an improvement in transfer time. In fact, in most cases we found that 25G gives the same results as 40/50G, because Spark's transfers are bounded by the machine's CPU power rather than by network bandwidth.
On Tue, Oct 17, 2017 at 3:33 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> Can normal hardware benefit from this RDMA optimization? 100G networks are very expensive and I think only a few Spark clusters are running on them.
>
> On Wed, Oct 11, 2017 at 3:53 PM, andymhuang(黄明) <andymhu...@tencent.com> wrote:
>
>> +1
>>
>> In fact, Shuffle is not the only module that can benefit from RDMA. Broadcast can also benefit with modest modification, gaining lower CPU usage and better network performance.
>>
>> Tencent is evaluating this in the lab, and we observe a roughly 50% improvement in TeraSort in a 100G network environment. We believe this will be a key speed improvement for most distributed frameworks in the near future.
>>
>> ________________
>>
>> AndyHuang
>>
>> Original Message
>> *From:* Yuval Degani <yuval...@gmail.com>
>> *To:* dev <dev@spark.apache.org>
>> *Sent:* Tuesday, October 10, 2017, 09:40
>> *Subject:* [SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine (Internet mail)
>>
>> Dear Spark community,
>>
>> I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle Engine".
>>
>> The purpose of this request is to embed an RDMA-accelerated Shuffle Manager into mainstream Spark.
>>
>> Such an implementation is already available as an external plugin as part of the "SparkRDMA" project: https://github.com/Mellanox/SparkRDMA.
>>
>> SparkRDMA has already demonstrated enormous potential for accelerating shuffles seamlessly, in both benchmarks and actual production environments.
>>
>> Adding RDMA capabilities to Spark will be one more important step in enabling lower-level acceleration, as envisioned by the "Tungsten" project.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
>>
>> PDF version: https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>>
>> *Overview*
>>
>> An RDMA-accelerated shuffle engine can provide enormous performance benefits to shuffle-intensive Spark jobs, as demonstrated in the "SparkRDMA" open-source plugin project (https://github.com/Mellanox/SparkRDMA).
>>
>> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O processing overhead by bypassing the kernel and networking stack, as well as avoiding memory copies entirely. Those valuable CPU cycles are then consumed directly by the actual Spark workloads, and help reduce the job runtime significantly.
>>
>> This performance gain has been demonstrated both with the industry-standard HiBench TeraSort (showing a 1.5x speedup in sorting) and with shuffle-intensive customer applications.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
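For readers who want to try the existing plugin while this SPIP is under review, the sketch below shows roughly how an external ShuffleManager such as SparkRDMA is wired in today, via Spark's standard spark.shuffle.manager setting. The fully qualified class name and the HDFS paths are illustrative assumptions (check the plugin's README for the exact values), and the plugin jar must also be on the driver and executor classpath.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: point Spark's shuffle-manager setting at the plugin class.
    // The class name below is an assumption based on the SparkRDMA project.
    val conf = new SparkConf()
      .setAppName("rdma-shuffle-sketch")
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")

    val sc = new SparkContext(conf)

    // Any shuffle-producing job now goes through the configured ShuffleManager.
    sc.textFile("hdfs:///data/input")          // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                      // shuffle stage
      .saveAsTextFile("hdfs:///data/output")   // placeholder output path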
>>
>> *Background and Motivation*
>>
>> Spark's current Shuffle engine implementation over “Netty” faces many performance issues often seen in other socket-based applications.
>>
>> Using the standard socket-based TCP/IP communication model for heavy data transfers usually requires copying the data multiple times and going through many system calls in the I/O path. These consume significant amounts of CPU cycles and memory that could otherwise have been assigned to the actual job at hand. This becomes even more critical with latency-sensitive Spark Streaming and deep learning applications over SparkML.
>>
>> RDMA (Remote Direct Memory Access) is already a commodity technology that is supported on most mid-range to high-end network adapter cards manufactured by various companies. Furthermore, RDMA-capable networks are already offered on public clouds such as Microsoft Azure, and will probably be supported on AWS soon to appeal to MPI users. Existing users of Spark on Microsoft Azure servers can get the benefits of RDMA by running on a suitable instance with this plugin, without needing any application changes.
>>
>> RDMA provides a unique approach for accessing memory locations over the network, without the need for copying on either the transmitter or the receiver side.
>>
>> These remote memory read and write operations are enabled by a standard interface that has been part of mainstream Linux releases for many years. This standardized interface allows direct access to remote memory from user space while skipping costly system calls in the I/O path. Due to its many virtues, RDMA has become a standard data transfer protocol in HPC (High Performance Computing) applications, with MPI being the most prominent.
>>
>> RDMA has traditionally been associated with InfiniBand networks, but with the standardization of RDMA over Converged Ethernet (RoCE), it has also been supported and widely used on Ethernet networks for many years.
>>
>> Since Spark is all about performing everything in memory, RDMA seems like a perfect fit for filling the gap of transferring intermediate in-memory data between the participating nodes. SparkRDMA (https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle performance can improve dramatically with the use of RDMA. Today it is gaining significant traction with many users, and has successfully demonstrated major performance improvements on production applications in high-profile technology companies. SparkRDMA is a generic and easy-to-use plugin. However, in order to gain wide adoption with its effortless acceleration, it must be integrated into Apache Spark itself.
>>
>> The purpose of this SPIP is to introduce RDMA into mainstream Spark to improve Shuffle performance, and to pave the way for further accelerations such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMe-oF (NVMe over Fabrics), and more.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
>>
>> *Target Personas*
>>
>> Any Spark user who cares about performance.
>>
>> *Goals*
>>
>> Use RDMA to improve Shuffle data transfer performance and reduce total job runtime.
>>
>> Automatically activate RDMA-accelerated shuffles where supported.
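One way to read the second goal ("automatically activate where supported") is a small selection step at startup that probes for RDMA-capable hardware and otherwise keeps the default sort-based shuffle. The sketch below is only illustrative: the sysfs probe and the RdmaShuffleManager class name are assumptions, not part of the proposal text.

    import java.nio.file.{Files, Paths}

    // Illustrative sketch of automatic activation: pick the RDMA shuffle manager
    // only when the OS reports at least one RDMA-capable device, otherwise fall
    // back to Spark's default sort-based shuffle. The probe below (sysfs on Linux)
    // and the class name are assumptions for the sketch.
    object ShuffleManagerSelector {

      private def rdmaDevicePresent(): Boolean = {
        val sysfs = Paths.get("/sys/class/infiniband")
        if (!Files.isDirectory(sysfs)) return false
        val entries = Files.list(sysfs)
        try entries.findFirst().isPresent finally entries.close()
      }

      /** Value to use for spark.shuffle.manager when none is set explicitly. */
      def select(): String = {
        if (rdmaDevicePresent()) {
          "org.apache.spark.shuffle.rdma.RdmaShuffleManager" // assumed class name
        } else {
          "sort" // built-in alias for Spark's default SortShuffleManager
        }
      }
    }

A user-supplied spark.shuffle.manager value would naturally take precedence over any such automatic choice.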
>>
>> *Non-Goals*
>>
>> This SPIP limits the usage of RDMA to Shuffle data transfers only. It is the first step in introducing RDMA to Spark and opens a range of possibilities for future improvements. Future SPIPs can address other network consumers that can benefit from RDMA, such as, but not limited to: Broadcast, RDD transfers, RPC messaging, storage access, GPU access, and an HDFS-RDMA interface.
>>
>> *API Changes*
>>
>> There will be no API changes required.
>>
>> *Proposed Design*
>>
>> SparkRDMA currently utilizes the ShuffleManager interface (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala) to provide RDMA-accelerated Shuffles. This interface is sufficient to allow network-savvy users to take advantage of RDMA capabilities in the context of Spark. However, to make this technology more easily accessible, we propose to add the code to mainstream Spark, and to implement a method to automatically use the RdmaShuffleManager when RDMA is supported on the system. This way, any Spark user who already has hardware support for RDMA can seamlessly enjoy its performance benefits.
>>
>> Furthermore, SparkRDMA in its current plugin form is limited by several constraints that can be removed once it is introduced into mainstream Spark. Among those are:
>>
>> · SparkRDMA manages its own memory off-heap. When integrated into Spark, it can use Tungsten physical memory for all of its needs, allowing faster allocations and memory registrations that can increase performance significantly. Also, any data that resides in Tungsten memory can be transferred with almost no overhead.
>>
>> · MapStatuses are redundant: there is no need for those extra transfers, which take precious seconds in many jobs.
>>
>> *Rejected Designs*
>>
>> Support RDMA with the SparkRDMA plugin:
>>
>> · The SparkRDMA plugin approach introduces limitations and overhead that reduce performance.
>>
>> · Plugins are awkward to build, install, and deploy, which is why they are usually avoided.
>>
>> · Forward compatibility is difficult to maintain for plugins that are not part of the upstream project, especially for a rapidly changing project like Spark.
>>
>> To ensure maximum performance and to allow mass adoption of this general solution, RDMA capabilities must be introduced into Spark itself.
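To make the Proposed Design above more concrete, the skeleton below shows the hooks that Spark's ShuffleManager interface requires an RDMA-backed implementation to provide. It is a sketch only: the package name is hypothetical, and the delegation to SortShuffleManager is a placeholder standing in for the real write path and an RDMA-based read path, not a description of how the SparkRDMA plugin is actually implemented.

    // Hypothetical package; the SPIP does not prescribe one.
    package org.apache.spark.shuffle.rdma

    import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
    import org.apache.spark.shuffle._
    import org.apache.spark.shuffle.sort.SortShuffleManager

    // Skeleton only: shows the ShuffleManager contract an RDMA shuffle engine
    // has to satisfy when integrated into the Spark source tree.
    private[spark] class RdmaShuffleManagerSketch(conf: SparkConf) extends ShuffleManager {

      private val sortManager = new SortShuffleManager(conf)

      override def registerShuffle[K, V, C](
          shuffleId: Int,
          numMaps: Int,
          dependency: ShuffleDependency[K, V, C]): ShuffleHandle =
        sortManager.registerShuffle(shuffleId, numMaps, dependency)

      // Map side: shuffle output could be written into RDMA-registered buffers here.
      override def getWriter[K, V](
          handle: ShuffleHandle,
          mapId: Int,
          context: TaskContext): ShuffleWriter[K, V] =
        sortManager.getWriter(handle, mapId, context)

      // Reduce side: this is where RDMA reads would replace Netty-based fetches.
      override def getReader[K, C](
          handle: ShuffleHandle,
          startPartition: Int,
          endPartition: Int,
          context: TaskContext): ShuffleReader[K, C] =
        sortManager.getReader(handle, startPartition, endPartition, context)

      override def unregisterShuffle(shuffleId: Int): Boolean =
        sortManager.unregisterShuffle(shuffleId)

      override def shuffleBlockResolver: ShuffleBlockResolver =
        sortManager.shuffleBlockResolver

      override def stop(): Unit = sortManager.stop()
    }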