Hello,

My platform runs hundreds of Spark jobs every day, each with its own data
size, anywhere from 20 MB to 20 TB. This means we need to set resources
dynamically. One major pain point here is spark.sql.shuffle.partitions,
the number of partitions to use when shuffling data for joins or
aggregations. By default it is arbitrarily hard-coded to 200, and the only
way to set this config is in the spark-submit command or in the SparkConf
before the executors are created.
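
For reference, today I have to pin the value up front, roughly like this
(the app name and the 200 are just placeholders):

    // At submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=200 ...
    // or in the SparkConf before the session is built:
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("my-job")                        // hypothetical app name
      .set("spark.sql.shuffle.partitions", "200")  // has to be chosen up front
    val spark = SparkSession.builder().config(conf).getOrCreate()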

This creates a lot of problems when I want to set this config dynamically
based on the in-memory size of a DataFrame, which I only know halfway
through the Spark job. So I would need to stop the context and recreate it
just to change this one config.
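
To make that workaround concrete, this is roughly what I would have to do
(Scala sketch; the input path and the 128 MB-per-partition target are my
assumptions, and the size estimate is just one way to get it on Spark 2.3+):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("my-job").getOrCreate()
    val df: DataFrame = spark.read.parquet("s3://bucket/input")  // hypothetical input

    // Only at this point do I know how big the data actually is.
    val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes.toLong
    val targetBytesPerPartition = 128L * 1024 * 1024  // assumed ~128 MB target
    val numPartitions = math.max(1, (sizeInBytes / targetBytesPerPartition).toInt)

    // To apply it to spark.sql.shuffle.partitions, I would have to stop the
    // context and rebuild the session with the value baked in.
    spark.stop()
    val spark2 = SparkSession.builder()
      .appName("my-job")
      .config("spark.sql.shuffle.partitions", numPartitions.toString)
      .getOrCreate()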

Is there any better way to set this? And how does
spark.sql.shuffle.partitions work differently from .repartition()?
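
Continuing the sketch above, this is the kind of call I mean by
.repartition (numPartitions is the value computed there):

    // Explicit repartition of one DataFrame, as opposed to the global
    // spark.sql.shuffle.partitions value used when shuffling for joins
    // and aggregations.
    val repartitioned = df.repartition(numPartitions)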

Brandon
