Hello,

My platform runs hundreds of Spark jobs every day, each with its own data size ranging from 20 MB to 20 TB, which means we need to set resources dynamically. One major pain point is spark.sql.shuffle.partitions, the number of partitions used when shuffling data for joins or aggregations. It seems to be arbitrarily hard-coded to 200, and the only way I know to set this config is in the spark-submit command or in the SparkConf before the executors are created.
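For reference, this is roughly how the value gets picked today; the app name, path, and numbers are just examples, not my real setup:

    # equivalent to: spark-submit --conf spark.sql.shuffle.partitions=2000 job.py
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("daily_job")                             # placeholder name
        .config("spark.sql.shuffle.partitions", "2000")   # guessed once, up front
        .getOrCreate()
    )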
This creates a lot of problems when I want to set the config dynamically based on the in-memory size of a DataFrame, because I only learn that size halfway through the Spark job. As it stands, I would have to stop the context and recreate it just to change this one setting. Is there a better way to set it? And how does spark.sql.shuffle.partitions behave differently from .repartition?

Brandon
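P.S. To make the sizing problem concrete, here is a rough sketch of the situation mid-job; the table names, paths, and partition counts are placeholders rather than my real pipeline:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily_job").getOrCreate()

    # Placeholder inputs standing in for the real tables
    orders = spark.read.parquet("s3://bucket/orders/")
    users = spark.read.parquet("s3://bucket/users/")

    # Every shuffle Spark plans here (the join, the groupBy) lands in
    # spark.sql.shuffle.partitions pieces -- 200 unless overridden up front --
    # whether the day's data is 20 MB or 20 TB.
    joined = orders.join(users, "user_id")

    # Once I can see how big the data actually is, I can pick a count for one
    # DataFrame explicitly...
    joined = joined.repartition(2000)

    # ...but later joins and aggregations still shuffle into the fixed 200
    # partitions, and the only way I know to change that number is to tear down
    # the context and rebuild it with a new SparkConf.
    summary = joined.groupBy("country").count()
    summary.write.parquet("s3://bucket/output/")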