Hello there,

I have a basic question about how the number of tasks is determined per
Spark job.
Let's limit the scope of this discussion to Parquet and Spark 2.x.
1. I thought that the number of tasks is proportional to the number of part
files that exist. Is this correct?
2. I noticed that for a 25-core job, the number of tasks scheduled was
around 250, but the same job, when executed with 75 cores, had around 460
tasks. Is the number of tasks proportional to the number of cores used?
Note: the number of tasks I refer to here is the task count during the
`spark.read.parquet("")` operation (the first sketch below shows how I am
observing it). I do understand that during a join / reduce operation, the
shuffle controls the number of tasks for the next stage (via
"spark.sql.shuffle.partitions", which defaults to 200 --
https://spark.apache.org/docs/latest/sql-performance-tuning.html).
3. Also, for "spark.sql.shuffle.partitions", is there any way I can provide
a computed value based on the input data / join / UDAF functions used? The
second sketch below shows the kind of computation I have in mind. If the
task count stays around 200, I might see OutOfMemory issues (depending on
the data size), while too large a number would create many small tasks. The
right balance probably depends on the input data size plus the shuffle
operation.
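For reference, this is roughly how I observe the read task count. The path
is a placeholder, and the two "spark.sql.files.*" settings below are only
the knobs I suspect influence the scan parallelism -- that is part of my
question:

import org.apache.spark.sql.SparkSession

// Placeholder session; the two values below are the documented defaults.
val spark = SparkSession.builder()
  .appName("read-task-count")
  // Maximum bytes packed into a single read partition (default 128 MB).
  .config("spark.sql.files.maxPartitionBytes", 134217728L)
  // Estimated cost of opening a file, used when packing small files together.
  .config("spark.sql.files.openCostInBytes", 4194304L)
  .getOrCreate()

// Hypothetical path; this scan is the operation whose task count I mean.
val df = spark.read.parquet("/path/to/parquet")
println(s"Read partitions: ${df.rdd.getNumPartitions}")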
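And this is a minimal sketch of the kind of "computed value" I have in mind
for "spark.sql.shuffle.partitions": derive the partition count from the
input size and a target partition size. The ~256 MB target, the join key
"id" and the use of Hadoop's getContentSummary are my own assumptions, not
something I have seen recommended:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("computed-shuffle-partitions").getOrCreate()

// Estimate the total input size from the file system (hypothetical path).
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path("/path/to/left")).getLength

// Aim for roughly one shuffle partition per ~256 MB of input,
// never going below the default of 200.
val targetPartitionBytes = 256L * 1024 * 1024
val shufflePartitions = math.max(200, (inputBytes / targetPartitionBytes).toInt)

// Must be set before the shuffle-producing operation (join / groupBy / UDAF).
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)

val left  = spark.read.parquet("/path/to/left")   // hypothetical paths
val right = spark.read.parquet("/path/to/right")
val joined = left.join(right, Seq("id"))          // assumed join key "id"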

Please advise.
Muthu
