Hello there,

I have a basic question about how the number of tasks is determined per Spark job. Let's limit the scope of this discussion to Parquet and Spark 2.x.

1. I thought the number of tasks was proportional to the number of part files that exist. Is this correct?

2. I noticed that for a 25-core job, the number of tasks scheduled was around 250, but the same job, when executed with 75 cores, had around 460 tasks. Is the number of tasks proportional to the cores used? Note that the task counts I refer to here are for the `spark.read.parquet("")` operation. I do understand that during a join/reduce operation, the shuffle controls the number of tasks for the next stage (via "spark.sql.shuffle.partitions", which defaults to 200 -- https://spark.apache.org/docs/latest/sql-performance-tuning.html).

3. Also, for "spark.sql.shuffle.partitions", is there any way I can provide a computed value based on the input data, joins, or UDAF functions used? If the task count stays around the default of 200, I might see OutOfMemory issues (depending on the data size), while too large a number would create many small tasks. The right balance probably depends on the input data size plus the shuffle operations involved.
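For question 3, the kind of thing I had in mind is sketched below. The path, the "key" column name, and the 256 MB-per-task target are placeholders/assumptions on my part, and the input size is just the compressed on-disk footprint taken from the Hadoop FileSystem API, so it is only a rough proxy for the actual shuffled data volume:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object ShufflePartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShufflePartitionSketch")
      .getOrCreate()

    // Placeholder path; the real job reads a directory of Parquet part files.
    val inputPath = "hdfs:///tmp/placeholder_dataset"

    val df = spark.read.parquet(inputPath)

    // The partition count here is what shows up as the task count for the
    // initial scan stage (the numbers 250 vs 460 I mentioned above).
    println(s"Scan partitions (tasks for the read stage): ${df.rdd.getNumPartitions}")

    // Rough size estimate from the on-disk footprint (compressed Parquet),
    // via the Hadoop FileSystem API.
    val fs = new Path(inputPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength

    // Aim for roughly 256 MB of input per shuffle task; both this target and
    // the use of compressed size as a proxy are my own assumptions.
    val targetBytesPerTask = 256L * 1024 * 1024
    val shufflePartitions = math.max(1, (inputBytes / targetBytesPerTask).toInt)
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)

    // "key" is a placeholder column; any join/aggregation run after the conf
    // is set should pick up the computed shuffle partition count.
    val counts = df.groupBy("key").count()
    counts.write.mode("overwrite").parquet("hdfs:///tmp/placeholder_output")

    spark.stop()
  }
}
```

Is computing it by hand like this a reasonable approach, or is there something built in that derives this from the input statistics?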
Please advise.

Muthu