Hi Everyone,

I'm running Spark 3.2 on Kubernetes and have a job with a decently sized
shuffle of almost 4 TB. The relevant cluster config is as follows (a rough
sketch of the corresponding Spark settings is below the list):

- 30 executors, each with 16 physical cores, configured as 32 cores for Spark
- 128 GB RAM
- shuffle.partitions is 18k, which gives tasks of around 150-180 MB
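
For reference, this maps roughly to the settings below. This is only a
sketch, not my actual submit command: the app name, builder-style setup and
the executor memory value are assumptions filled in around the numbers
above.

    import org.apache.spark.sql.SparkSession

    // Rough sketch of the cluster config above expressed as Spark settings.
    val spark = SparkSession.builder()
      .appName("large-shuffle-job")                     // placeholder name
      .config("spark.executor.instances", "30")
      .config("spark.executor.cores", "32")             // 16 physical cores, 2x oversubscribed
      .config("spark.executor.memory", "100g")          // assumed: most of the 128 GB, minus overhead
      .config("spark.sql.shuffle.partitions", "18000")  // ~150-180 MB per reduce task
      .getOrCreate()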

The job runs fine, but I'm bothered by how underutilized the cluster gets
during the reduce phase. During the map phase (reading data from S3 and
writing the shuffle data), CPU usage, disk throughput and network usage
are all as expected, but during the reduce phase they drop way down. The
main bottleneck seems to be reading shuffle data from other nodes: the
task statistics report shuffle read times ranging from 25s to several
minutes (the task sizes are really close, so they aren't skewed). I've
tried increasing "spark.reducer.maxSizeInFlight" and
"spark.shuffle.io.numConnectionsPerPeer", which improved performance a
little, but not enough to saturate the cluster resources.

Did I miss some tuning parameters that could help?
One obvious option would be to scale up to bigger machines and use fewer
nodes to minimize network traffic, but 30 nodes doesn't seem like much,
even considering the 30x30 shuffle connections.

Thanks in advance!
