Thanks a lot for the answers, folks.

It turned out that Spark was just IOPS-starved. Using better disks solved
my issue, so it was nothing related to Kubernetes at all.

Have a nice weekend everyone

On Fri, Sep 30, 2022 at 4:27 PM Artemis User <arte...@dtechspace.com> wrote:

> The reduce phase is typically more resource-intensive than the map phase.
> A couple of suggestions you may want to consider:
>
>    1. Setting the number of shuffle partitions to 18K may be way too high
>    (the default is only 200).  You may want to just use the default and let
>    the scheduler adjust the partitions automatically if needed.
>    2. Turn on dynamic resource allocation (DRA,
>    https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation).
>    It allows executors that have finished their map tasks to return their
>    resources (e.g. RAM, CPU cores) to the cluster so that they can be
>    reallocated to the reduce tasks.  This feature (post Spark 3.0) is also
>    available on Kubernetes, but it is turned off by default; a rough config
>    sketch follows this list.
>    3. With DRA turned on, you may also want to try a smaller number of
>    executors/nodes, which reduces shuffling needs, given that you only have
>    128GB of RAM.
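>
>    A rough sketch of what enabling DRA on Kubernetes might look like
>    (Scala, SparkSession-builder style; the min/max executor values and the
>    app name are placeholders, not recommendations):
>
>       import org.apache.spark.sql.SparkSession
>
>       // Sketch only: enable dynamic resource allocation on K8s. With no
>       // external shuffle service on K8s, shuffle tracking is also needed
>       // so executors holding shuffle data aren't removed too early.
>       val spark = SparkSession.builder()
>         .appName("shuffle-heavy-job")  // placeholder app name
>         .config("spark.dynamicAllocation.enabled", "true")
>         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
>         .config("spark.dynamicAllocation.minExecutors", "5")   // placeholder
>         .config("spark.dynamicAllocation.maxExecutors", "30")  // placeholder
>         .getOrCreate()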
>
> Hope this helps...
>
> On 9/29/22 2:12 PM, Igor Calabria wrote:
>
> Hi Everyone,
>
> I'm running Spark 3.2 on Kubernetes and have a job with a decently sized
> shuffle of almost 4TB. The relevant cluster config is as follows:
>
> - 30 executors, 16 physical cores each, configured with 32 cores for Spark
> - 128 GB RAM
> - shuffle.partitions is 18k, which gives me tasks of around 150~180MB
>   (a rough config sketch of this setup follows below)
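>
> Roughly, that setup translates to configs like these (just a sketch; the
> app name is a placeholder, the values come from the list above):
>
>     import org.apache.spark.sql.SparkSession
>
>     // Rough sketch of the cluster config described above.
>     val spark = SparkSession.builder()
>       .appName("4tb-shuffle-job")                       // placeholder name
>       .config("spark.executor.instances", "30")
>       .config("spark.executor.cores", "32")             // 16 physical cores, 32 configured
>       .config("spark.sql.shuffle.partitions", "18000")
>       .getOrCreate()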
>
> The job runs fine, but I'm bothered by how underutilized the cluster gets
> during the reduce phase. During the map phase (reading data from S3 and
> writing the shuffle data), CPU usage, disk throughput and network usage
> are as expected, but during the reduce phase they drop really low. The
> main bottleneck seems to be reading shuffle data from other nodes: task
> statistics report values ranging from 25s to several minutes (the task
> sizes are really close, they aren't skewed). I've tried increasing
> "spark.reducer.maxSizeInFlight" and "spark.shuffle.io.numConnectionsPerPeer",
> and it did improve performance a little, but not enough to saturate the
> cluster resources.
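>
> This is roughly the kind of change I made (the exact values here are just
> examples, not a recommendation):
>
>     import org.apache.spark.SparkConf
>
>     // Sketch: larger in-flight fetch buffer and more connections per peer
>     // for shuffle reads (the defaults are 48m and 1, respectively).
>     val conf = new SparkConf()
>       .set("spark.reducer.maxSizeInFlight", "96m")
>       .set("spark.shuffle.io.numConnectionsPerPeer", "4")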
>
> Did I miss some other tuning parameters that could help?
> One obvious option would be to scale the machines up vertically and use
> fewer nodes to minimize traffic, but 30 nodes doesn't seem like much, even
> considering 30x30 connections.
>
> Thanks in advance!
>
