Hi Everyone,

To add my 2 cents here:
The advantage of containers, to me, is that they leave the host system pristine and clean, allowing standardized devops deployment of hardware for any purpose. Way back, when we were using bare metal / Ansible, reusing hardware always involved a full reformat of the base system. That alone is worth the ~1-2% performance tax that cgroup containers carry.

The advantage of Kubernetes is more on the deployment side of things: unified deployment scripts that can be written by devs. The same deployment YAML (or Helm chart) can be used on a local dev env, QA, an integration env, and finally prod (with some tweaks).

Depending on the networking CNI and the storage backend, Kubernetes can get very close to bare-metal performance. In the end it is always a trade-off: you gain some, you pay with extra overhead.

I'm running YARN on Kubernetes and mostly run Spark on top of YARN (some legacy MapReduce jobs too, though). I find it much more manageable to allocate larger memory/CPU chunks to the YARN pods and let the auto-scaler scale YARN out if needed than to manage individual memory/CPU requirements on a Spark-on-Kubernetes deployment.
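To make that concrete, here is a rough sketch of the difference I mean. The config keys are the standard Spark ones, but the app name, the values and the API server placeholder are made up for illustration, not our actual settings:

    from pyspark.sql import SparkSession

    # Illustrative values only -- not what we actually run.
    builder = (
        SparkSession.builder
        .appName("resource-sketch")              # hypothetical app name
        .config("spark.executor.memory", "20g")
        .config("spark.executor.cores", "5")
    )

    # On YARN the two settings above are basically it: YARN, running inside a
    # few large autoscaled pods, packs the executors onto whatever capacity
    # it currently has.
    spark = builder.master("yarn").getOrCreate()

    # On Spark-on-Kubernetes the same job also has to spell out pod-level
    # requests/limits and overhead for itself, which is the per-job tuning I
    # find harder to manage:
    #   .master("k8s://https://<apiserver>:443")
    #   .config("spark.kubernetes.container.image", "<spark-image>")
    #   .config("spark.kubernetes.executor.request.cores", "5")
    #   .config("spark.kubernetes.executor.limit.cores", "5")
    #   .config("spark.executor.memoryOverhead", "2g")

Nothing exotic there; the point is just that on YARN the sizing question gets answered once, at the NodeManager-pod level, while on Spark-on-Kubernetes every job answers it for itself.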
As far as I have tested, Spark on Kubernetes is immature where reliability is concerned (or maybe our homegrown k8s does not do fencing/STONITH well yet). When a node dies or goes down, I find that executors do not get rescheduled to other nodes; the driver just sits there waiting for the executors to come back. This does not happen with the YARN / standalone deployments (even when run on the same k8s cluster).

Sincerely,
Leszek Reimus

On Thu, Sep 29, 2022 at 7:06 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> don't containers finally run on systems, and the only advantage of containers is that you can do better utilisation of system resources by micro-management of the jobs running in them? Some say that containers have their own binaries which isolate the environment, but that is a lie, because in a Kubernetes environment that is running your SPARK jobs you will have the same environment for all your kubes.
>
> And as you can see there are several other configurations, disk mounting, security, etc. issues to handle as an overhead as well.
>
> And the entire goal of all those added configurations is that someone in your devops team feels using containers makes things more interesting, without any real added advantage to large-volume jobs.
>
> But I may be wrong, and perhaps we need data, and not personal attacks like the other person in the thread did.
>
> In case anyone does not know, EMR does run on containers as well, and in EMR running on EC2 nodes you can put all your binaries in containers and use those for running your jobs.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Sep 29, 2022 at 7:46 PM Vladimir Prus <vladimir.p...@gmail.com> wrote:
>
>> Igor,
>>
>> what exact instance types do you use? Unless you use local instance storage and have actually configured your Kubernetes and Spark to use instance storage, your 30x30 exchange can run into EBS IOPS limits. You can investigate that by going to an instance, then to the volume, and looking at the monitoring charts.
>>
>> Another thought is that you're essentially giving 4 GB per core. That sounds pretty low, in my experience.
>>
>> On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria <igor.calab...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I'm running Spark 3.2 on Kubernetes and have a job with a decently sized shuffle of almost 4 TB. The relevant cluster config is as follows:
>>>
>>> - 30 executors, 16 physical cores each, configured with 32 cores for Spark
>>> - 128 GB RAM
>>> - shuffle.partitions is 18k, which gives me tasks of around 150~180 MB
>>>
>>> The job runs fine, but I'm bothered by how underutilized the cluster gets during the reduce phase. During the map (reading data from S3 and writing the shuffle data), CPU usage, disk throughput and network usage are as expected, but during the reduce phase they get really low. It seems the main bottleneck is reading shuffle data from other nodes; task statistics report values ranging from 25s to several minutes (the task sizes are really close, they aren't skewed). I've tried increasing "spark.reducer.maxSizeInFlight" and "spark.shuffle.io.numConnectionsPerPeer" and it did improve performance a little, but not enough to saturate the cluster resources.
>>>
>>> Did I miss some more tuning parameters that could help? One obvious thing would be to scale the machines up vertically and use fewer nodes to minimize traffic, but 30 nodes doesn't seem like much, even considering 30x30 connections.
>>>
>>> Thanks in advance!
>>>
>>
>> --
>> Vladimir Prus
>> http://vladimirprus.com
>

--
--------------
"It is the common fate of the indolent to see their rights become a prey to the active. The condition upon which God hath given liberty to man is eternal vigilance; which condition if he break, servitude is at once the consequence of his crime and the punishment of his guilt."
- John Philpot Curran: Speech upon the Right of Election, 1790.