The specifics depend on what's going on underneath. At the 10,000-foot level, you probably know that Spark builds up a logical plan as you call transformations. When you call an action, it converts that into a physical execution plan. The execution plan is made up of stages, which run one after another (a stage can only start once the stages it depends on have finished). Each stage is broken up into tasks, which run in parallel.
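To make the logical plan / physical plan / stage distinction concrete, here is a minimal sketch (Scala, as you would type it in spark-shell). The input path, output path, app name and column names are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("plan-demo").getOrCreate()

// Transformations only build up a logical plan; nothing runs yet.
val perUser = spark.read.parquet("/data/events.parquet")   // hypothetical input
  .filter(col("country") === "US")
  .groupBy(col("userId"))
  .agg(count("*").as("events"))

// Print the logical and physical plans. The physical plan contains an
// Exchange operator where the groupBy forces a shuffle -- that is where
// the job gets split into stages.
perUser.explain(true)

// Only an action turns the plan into an actual job with stages and tasks.
perUser.write.parquet("/data/events_per_user.parquet")     // hypothetical output

The write at the end is the action; everything before it just describes what Spark should do.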
So there are two questions to answer: how does Spark determine stages, and how does Spark break a stage up into tasks?

How does Spark determine stages? At the 5,000-foot level, Spark breaks a job into stages at the points where a shuffle is required. That usually happens when you do a join, an aggregation or a repartition. The basic idea is that Spark is trying to minimize data movement: it looks at the plan and divides it into stages so that data stays put within a stage and only moves at the stage boundaries. The specifics depend on what the job is doing.

How does Spark break a stage up into tasks? At the 5,000-foot level, it is trying to break the stage into bite-sized pieces that can be processed independently. The specifics depend on what the stage is doing. For example, say a stage reads a file with 5 million rows, transforms them and writes the result out. Spark can estimate how much memory one row needs by looking at the dataframe schema; say each row is about 10 KB. It also knows how much memory it has per executor; say 100 MB. From that, one executor can hold about 10,000 rows at a time, so each task gets roughly 10,000 rows, and 5M / 10K = 500 tasks. Each task reads its 10,000 rows, transforms them and writes them to the output. (The first sketch at the very end of this mail, below the quoted question, spells out that arithmetic.)

Again, this is just an illustration. How tasks get divided depends a lot on what's happening in the stage. Spark also optimizes the code and tries to push down predicates, which complicates things for us users (the second sketch at the end shows how to spot pushed-down filters in the plan).

On 5/13/21, 10:21 AM, "abhilash.kr" <abhilash.kho...@gmail.com> wrote:

Hello,

What happens when a job is submitted to a cluster? I know the 10,000-foot overview of the Spark architecture, but I need the minute details, such as how Spark estimates the resources to ask YARN for, what YARN's response is, etc. I need the *step by step* understanding of the complete process. I searched the net but couldn't find any good material on this. Can anyone help me here?

Thanks,
Abhilash
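Here is the sizing arithmetic from the example above written out (plain Scala, no Spark needed). The numbers are the made-up ones from the example; this is only the illustration's back-of-envelope math, not Spark's actual splitting logic:

val totalRows   = 5000000L      // 5 million rows in the file
val bytesPerRow = 10000L        // ~10 KB per row, estimated from the schema
val memPerTask  = 100000000L    // ~100 MB available per executor

val rowsPerTask = memPerTask / bytesPerRow   // 100 MB / 10 KB = 10,000 rows
val numTasks    = totalRows / rowsPerTask    // 5M / 10K = 500 tasks

println(s"rowsPerTask=$rowsPerTask, numTasks=$numTasks")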
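And a small sketch of how to see predicate pushdown in practice, assuming the same spark-shell session as the first sketch and a hypothetical Parquet file. In the extended explain output, the scan node lists the filters that were pushed into the file reader (something like "PushedFilters: [IsNotNull(country), EqualTo(country,US)]"):

val users   = spark.read.parquet("/data/users.parquet")    // hypothetical input
val usUsers = users.filter(col("country") === "US")

// The filter is handed to the Parquet reader itself, so the reader can
// skip data that cannot match instead of filtering after the scan.
usUsers.explain(true)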