"Can normal hardware benifit from this RDMA optimization? 100G network is very expensive and I think only a few Spark clusters are running on it."
Hi Wenchen,
Yes, RDMA-capable network adapters with lower speeds, as low as 10G, will also show an improvement in transfer time. In fact, in most cases we found that 25G gives the same results as 40/50G, because Spark's transfers are bounded by the machine's CPU power rather than by network bandwidth.
On Tue, Oct 17, 2017 at 3:33 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> Can normal hardware benefit from this RDMA optimization? 100G networks are very expensive and I think only a few Spark clusters are running on them.
>
> On Wed, Oct 11, 2017 at 3:53 PM, andymhuang(黄明) <andymhu...@tencent.com> wrote:
>
>> +1
>>
>> In fact, Shuffle is not the only module that can benefit from RDMA. Broadcast can also benefit with modest modification, gaining lower CPU usage and better network performance.
>>
>> Tencent is evaluating this in the lab, and we observe a roughly 50% improvement in TeraSort in a 100G network environment. We believe this will be a key speed improvement for most distributed frameworks in the near future.
>>
>> ________________
>>
>> AndyHuang
>>
>> Original Message
>> *From:* Yuval Degani <yuval...@gmail.com>
>> *To:* dev <dev@spark.apache.org>
>> *Sent:* Tuesday, October 10, 2017, 09:40
>> *Subject:* [SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine (Internet mail)
>>
>> Dear Spark community,
>>
>> I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle Engine".
>>
>> The purpose of this request is to embed an RDMA-accelerated Shuffle Manager into mainstream Spark.
>>
>> Such an implementation is already available as an external plugin as part of the "SparkRDMA" project: https://github.com/Mellanox/SparkRDMA.
>>
>> SparkRDMA has already demonstrated enormous potential for accelerating shuffles seamlessly, in both benchmarks and actual production environments.
>>
>> Adding RDMA capabilities to Spark will be one more important step in enabling lower-level acceleration, as envisioned by the "Tungsten" project.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
>>
>> PDF version: https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>>
>> *Overview*
>>
>> An RDMA-accelerated shuffle engine can provide enormous performance benefits to shuffle-intensive Spark jobs, as demonstrated in the "SparkRDMA" open-source plugin project (https://github.com/Mellanox/SparkRDMA).
>>
>> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O processing overhead by bypassing the kernel and networking stack, as well as avoiding memory copies entirely. Those valuable CPU cycles are then consumed directly by the actual Spark workloads, and help reduce the job runtime significantly.
>>
>> This performance gain has been demonstrated both with the industry-standard HiBench TeraSort (showing a 1.5x speedup in sorting) and with shuffle-intensive customer applications.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
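For readers who want to try the existing plugin while this SPIP is under review, the sketch below shows roughly how an external ShuffleManager such as SparkRDMA is wired in today, via Spark's standard spark.shuffle.manager setting. The fully qualified class name and the HDFS paths are illustrative assumptions (check the plugin's README for the exact values), and the plugin jar must also be on the driver and executor classpath.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: point Spark's shuffle-manager setting at the plugin class.
    // The class name below is an assumption based on the SparkRDMA project.
    val conf = new SparkConf()
      .setAppName("rdma-shuffle-sketch")
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")

    val sc = new SparkContext(conf)

    // Any shuffle-producing job now goes through the configured ShuffleManager.
    sc.textFile("hdfs:///data/input")          // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                      // shuffle stage
      .saveAsTextFile("hdfs:///data/output")   // placeholder output path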
>>
>> *Background and Motivation*
>>
>> Spark's current Shuffle engine implementation over “Netty” faces many performance issues often seen in other socket-based applications.
>>
>> Using the standard socket-based TCP/IP communication model for heavy data transfers usually requires copying the data multiple times and going through many system calls in the I/O path. These consume significant amounts of CPU cycles and memory that could otherwise have been assigned to the actual job at hand. This becomes even more critical with latency-sensitive Spark Streaming and deep learning applications over SparkML.
>>
>> RDMA (Remote Direct Memory Access) is already a commodity technology that is supported on most mid-range to high-end network adapter cards manufactured by various companies. Furthermore, RDMA-capable networks are already offered on public clouds such as Microsoft Azure, and will probably be supported on AWS soon to appeal to MPI users. Existing users of Spark on Microsoft Azure servers can get the benefits of RDMA by running on a suitable instance with this plugin, without needing any application changes.
>>
>> RDMA provides a unique approach for accessing memory locations over the network, without the need for copying on either the transmitter or the receiver side.
>>
>> These remote memory read and write operations are enabled by a standard interface that has been part of mainstream Linux releases for many years. This standardized interface allows direct access to remote memory from user space while skipping costly system calls in the I/O path. Due to its many virtues, RDMA has become a standard data transfer protocol in HPC (High Performance Computing) applications, with MPI being the most prominent.
>>
>> RDMA has traditionally been associated with InfiniBand networks, but with the standardization of RDMA over Converged Ethernet (RoCE), it has also been supported and widely used on Ethernet networks for many years.
>>
>> Since Spark is all about performing everything in memory, RDMA seems like a perfect fit for filling the gap of transferring intermediate in-memory data between the participating nodes. SparkRDMA (https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle performance can improve dramatically with the use of RDMA. Today it is gaining significant traction with many users, and has successfully demonstrated major performance improvements on production applications in high-profile technology companies. SparkRDMA is a generic and easy-to-use plugin. However, in order to gain wide adoption with its effortless acceleration, it must be integrated into Apache Spark itself.
>>
>> The purpose of this SPIP is to introduce RDMA into mainstream Spark to improve Shuffle performance, and to pave the way for further accelerations such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMe-oF (NVMe over Fabrics), and more.
>>
>> SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).
>>
>> *Target Personas*
>>
>> Any Spark user who cares about performance.
>>
>> *Goals*
>>
>> Use RDMA to improve Shuffle data transfer performance and reduce total job runtime.
>>
>> Automatically activate RDMA-accelerated shuffles where supported.
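One way to read the second goal ("automatically activate where supported") is a small selection step at startup that probes for RDMA-capable hardware and otherwise keeps the default sort-based shuffle. The sketch below is only illustrative: the sysfs probe and the RdmaShuffleManager class name are assumptions, not part of the proposal text.

    import java.nio.file.{Files, Paths}

    // Illustrative sketch of automatic activation: pick the RDMA shuffle manager
    // only when the OS reports at least one RDMA-capable device, otherwise fall
    // back to Spark's default sort-based shuffle. The probe below (sysfs on Linux)
    // and the class name are assumptions for the sketch.
    object ShuffleManagerSelector {

      private def rdmaDevicePresent(): Boolean = {
        val sysfs = Paths.get("/sys/class/infiniband")
        if (!Files.isDirectory(sysfs)) return false
        val entries = Files.list(sysfs)
        try entries.findFirst().isPresent finally entries.close()
      }

      /** Value to use for spark.shuffle.manager when none is set explicitly. */
      def select(): String = {
        if (rdmaDevicePresent()) {
          "org.apache.spark.shuffle.rdma.RdmaShuffleManager" // assumed class name
        } else {
          "sort" // built-in alias for Spark's default SortShuffleManager
        }
      }
    }

A user-supplied spark.shuffle.manager value would naturally take precedence over any such automatic choice.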
>>
>> *Non-Goals*
>>
>> This SPIP limits the usage of RDMA to Shuffle data transfers only. It is the first step in introducing RDMA to Spark and opens a range of possibilities for future improvements. Future SPIPs can address other network consumers that can benefit from RDMA, such as, but not limited to: Broadcast, RDD transfers, RPC messaging, storage access, GPU access, and an HDFS-RDMA interface.
>>
>> *API Changes*
>>
>> There will be no API changes required.
>>
>> *Proposed Design*
>>
>> SparkRDMA currently utilizes the ShuffleManager interface (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala) to provide RDMA-accelerated Shuffles. This interface is sufficient to allow network-savvy users to take advantage of RDMA capabilities in the context of Spark. However, to make this technology more easily accessible, we propose to add the code to mainstream Spark, and to implement a method to automatically use the RdmaShuffleManager when RDMA is supported on the system. This way, any Spark user who already has hardware support for RDMA can seamlessly enjoy its performance benefits.
>>
>> Furthermore, SparkRDMA in its current plugin form is limited by several constraints that can be removed once it is introduced into mainstream Spark. Among those are:
>>
>> · SparkRDMA manages its own memory off-heap. When integrated into Spark, it can use Tungsten physical memory for all of its needs, allowing faster allocations and memory registrations that can increase performance significantly. Also, any data that resides in Tungsten memory can be transferred with almost no overhead.
>>
>> · MapStatuses are redundant: there is no need for those extra transfers, which take precious seconds in many jobs.
>>
>> *Rejected Designs*
>>
>> Support RDMA with the SparkRDMA plugin:
>>
>> · The SparkRDMA plugin approach introduces limitations and overhead that reduce performance.
>>
>> · Plugins are awkward to build, install, and deploy, which is why they are usually avoided.
>>
>> · Forward compatibility is difficult to maintain for plugins that are not part of the upstream project, especially for a rapidly changing project like Spark.
>>
>> To ensure maximum performance and to allow mass adoption of this general solution, RDMA capabilities must be introduced into Spark itself.
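To make the Proposed Design above more concrete, the skeleton below shows the hooks that Spark's ShuffleManager interface requires an RDMA-backed implementation to provide. It is a sketch only: the package name is hypothetical, and the delegation to SortShuffleManager is a placeholder standing in for the real write path and an RDMA-based read path, not a description of how the SparkRDMA plugin is actually implemented.

    // Hypothetical package; the SPIP does not prescribe one.
    package org.apache.spark.shuffle.rdma

    import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
    import org.apache.spark.shuffle._
    import org.apache.spark.shuffle.sort.SortShuffleManager

    // Skeleton only: shows the ShuffleManager contract an RDMA shuffle engine
    // has to satisfy when integrated into the Spark source tree.
    private[spark] class RdmaShuffleManagerSketch(conf: SparkConf) extends ShuffleManager {

      private val sortManager = new SortShuffleManager(conf)

      override def registerShuffle[K, V, C](
          shuffleId: Int,
          numMaps: Int,
          dependency: ShuffleDependency[K, V, C]): ShuffleHandle =
        sortManager.registerShuffle(shuffleId, numMaps, dependency)

      // Map side: shuffle output could be written into RDMA-registered buffers here.
      override def getWriter[K, V](
          handle: ShuffleHandle,
          mapId: Int,
          context: TaskContext): ShuffleWriter[K, V] =
        sortManager.getWriter(handle, mapId, context)

      // Reduce side: this is where RDMA reads would replace Netty-based fetches.
      override def getReader[K, C](
          handle: ShuffleHandle,
          startPartition: Int,
          endPartition: Int,
          context: TaskContext): ShuffleReader[K, C] =
        sortManager.getReader(handle, startPartition, endPartition, context)

      override def unregisterShuffle(shuffleId: Int): Boolean =
        sortManager.unregisterShuffle(shuffleId)

      override def shuffleBlockResolver: ShuffleBlockResolver =
        sortManager.shuffleBlockResolver

      override def stop(): Unit = sortManager.stop()
    }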