Hi, why not use EMR or Dataproc? Kubernetes provides no benefit at all at this scale of work. It is a classic case of over-engineering and over-complicating things just for the heck of it.
Also, if you are on AWS, Redshift Spectrum or Athena is the more optimal choice for 90% of use cases.

Regards,
Gourav

On Thu, Sep 29, 2022 at 7:13 PM Igor Calabria <igor.calab...@gmail.com> wrote:
> Hi Everyone,
>
> I'm running Spark 3.2 on Kubernetes and have a job with a decently sized
> shuffle of almost 4 TB. The relevant cluster config is as follows:
>
> - 30 executors, each with 16 physical cores, configured with 32 cores for Spark
> - 128 GB RAM
> - shuffle.partitions is 18k, which gives me tasks of around 150-180 MB
>
> The job runs fine, but I'm bothered by how underutilized the cluster gets
> during the reduce phase. During the map phase (reading data from S3 and
> writing the shuffle data), CPU usage, disk throughput, and network usage
> are as expected, but during the reduce phase they drop really low. The
> main bottleneck seems to be reading shuffle data from other nodes; task
> statistics report values ranging from 25 s to several minutes (the task
> sizes are really close, they aren't skewed). I've tried increasing
> "spark.reducer.maxSizeInFlight" and
> "spark.shuffle.io.numConnectionsPerPeer", which improved performance a
> little, but not enough to saturate the cluster's resources.
>
> Did I miss some tuning parameters that could help?
> One obvious option would be to scale the machines vertically and use
> fewer nodes to minimize traffic, but 30 nodes doesn't seem like much,
> even considering 30x30 connections.
>
> Thanks in advance!
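For reference, the shuffle-related settings discussed in the quoted message can be collected in one place. This is a minimal sketch only; the partition count comes from the thread, while the other two values are illustrative guesses, not recommendations (the thread itself reports that raising them helped only a little):

```python
# Spark shuffle tuning knobs mentioned in this thread.
# Values below are illustrative, not tested recommendations.
shuffle_conf = {
    # ~18k partitions gave ~150-180 MB tasks for the ~4 TB shuffle described above
    "spark.sql.shuffle.partitions": "18000",
    # Max size of map output fetched concurrently per reduce task (default is 48m);
    # the value here is an assumed example of "increasing" it
    "spark.reducer.maxSizeInFlight": "96m",
    # Concurrent connections between each pair of hosts (default is 1);
    # again, an assumed example value
    "spark.shuffle.io.numConnectionsPerPeer": "4",
}

# These could be applied via `spark-submit --conf key=value` flags
# or programmatically with SparkConf().setAll(shuffle_conf.items()).
for key, value in sorted(shuffle_conf.items()):
    print(f"--conf {key}={value}")
```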