The specifics depend on what's going on underneath. At the 10,000 foot level, 
you probably know that Spark builds a logical plan as you call transformations 
and converts it into a physical execution plan when you call an action. The 
execution plan is made up of stages that run sequentially, and each stage is 
broken up into tasks that run in parallel.
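
To make that concrete, here is a small sketch (assuming nothing beyond a 
SparkSession called spark) that shows the laziness: the transformations only 
build the plan, explain(true) prints the logical and physical plans, and 
nothing actually runs until the action at the end.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("plan-demo").getOrCreate()

// Transformations are lazy -- this only builds a logical plan, nothing executes yet.
val df = spark.range(1000000L)
  .withColumn("bucket", col("id") % 10)
  .groupBy("bucket")
  .count()

// Prints the parsed/analyzed/optimized logical plans and the physical plan,
// still without running anything.
df.explain(true)

// Only an action (count, collect, write, ...) turns the plan into stages and tasks.
df.collect()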

So, there are two questions to answer: how does Spark determine stages, and 
how does Spark break up stages into tasks?

How does Spark determine stages? At a 5,000 foot view, Spark breaks a job into 
stages at the points where a shuffle is required. This usually happens when 
you do a join, an aggregation, or a repartition. The basic idea is that Spark 
is trying to minimize data movement: it looks at the logical plan and divides 
it into stages so that data movement is minimized. The specifics depend on 
what the job is doing.
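
As a rough illustration (not the exact planner rules), you can actually see a 
stage boundary in the physical plan: the Exchange operator below is the 
shuffle that the aggregation forces, and the work on either side of it becomes 
a separate stage.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("stage-demo").getOrCreate()

val orders = spark.range(0L, 1000000L)
  .withColumn("customerId", col("id") % 1000)
  .withColumn("amount", rand() * 100)

// The groupBy needs all rows with the same customerId on the same partition,
// so the physical plan contains an Exchange (shuffle) -- that is the stage boundary.
val totals = orders.groupBy("customerId").agg(sum("amount").alias("total"))

totals.explain()
// Look for a line like "Exchange hashpartitioning(customerId#..., 200)" in the output.

If you run it, you will also notice the 200: that is spark.sql.shuffle.partitions, 
the default number of partitions (and hence tasks) on the shuffle side.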

How does Spark break up a stage into tasks? At a 5,000 foot view, it tries to 
break the stage into bite-sized pieces that can be processed independently. 
The specifics depend on what the stage is doing. For example, say a stage 
reads a file with 5 million rows, transforms it, and writes it out. Spark can 
estimate how much memory it needs for one row by looking at the DataFrame 
schema. Say each row is 10K. Then it looks at how much memory it has per 
executor, say 100M. From this it determines that one executor can process 10K 
rows at a time, so each task can take 10K rows. 5M / 10K = 500, so it creates 
500 tasks. Each task reads 10K rows, transforms them, and writes them to the 
output.

Again, this is just an example. How tasks get divided depends a lot on what's 
happening in the stage. Spark also optimizes the code and tries to push down 
predicates, which complicates things for us users.
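
If you want to see how a stage maps to tasks for your own job, the simplest 
handle is the partition count: Spark launches one task per partition of the 
stage. For file scans the split size is mostly governed by 
spark.sql.files.maxPartitionBytes (128 MB per split by default), and for 
post-shuffle stages by spark.sql.shuffle.partitions, so the counts you see may 
differ from the back-of-the-envelope numbers above. A quick sketch (the 
parquet path is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("task-demo").getOrCreate()

// Hypothetical input path -- substitute your own file.
val df = spark.read.parquet("/data/events.parquet")

// One task per partition. For a file scan the partition count depends mainly on
// the file sizes and spark.sql.files.maxPartitionBytes (128 MB splits by default).
println(df.rdd.getNumPartitions)

// After a shuffle, the next stage's task count follows spark.sql.shuffle.partitions.
println(spark.conf.get("spark.sql.shuffle.partitions"))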

On 5/13/21, 10:21 AM, "abhilash.kr" <abhilash.kho...@gmail.com> wrote:

    Hello,

           What happens when a job is submitted to a cluster? I know the 10,000
    foot overview of the Spark architecture, but I need the minute details of how
    Spark estimates the resources to ask YARN for, what YARN's response is, etc. I
    need the *step by step* understanding of the complete process. I searched the
    net but couldn't find any good material on this. Can anyone help me here?

    Thanks,
    Abhilash


