Hey all,

Thanks in advance. I ran into a situation where the Spark driver reduced the
total executor count for my job even with dynamic allocation disabled, which
caused the job to hang forever.

Setup:
Spark 1.3.1 on a Hadoop YARN 2.4.0 cluster.
All servers in the cluster run Linux kernel 2.6.32.
Job in yarn-client mode.

Scenario:
1. Application running with the required number of executors.
2. One of the DNs loses connectivity and is timed out.
3. Spark issues a killExecutor for the executor on the DN that timed out.
4. Even with dynamic allocation off, Spark's driver reduces
"targetNumExecutors".

On analysing the code (Spark 1.3.1): 

When my DN goes unreachable:

Spark core's HeartbeatReceiver invokes expireDeadHosts(), which checks
whether dynamic allocation is supported and then invokes "sc.killExecutor()":

    if (sc.supportDynamicAllocation) {
      sc.killExecutor(executorId)
    }

Surprisingly, supportDynamicAllocation in SparkContext.scala is defined to
return true if the dynamicAllocationTesting flag is enabled or Spark is
running on YARN:

    private[spark] def supportDynamicAllocation =
      master.contains("yarn") || dynamicAllocationTesting
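
What I expected the guard to check is the user-facing flag, which defaults
to false. A minimal sketch of that alternative (my illustration, not the
actual Spark source):

    // Check the user-facing config flag instead of supportDynamicAllocation.
    // "spark.dynamicAllocation.enabled" defaults to false when unset.
    val dynamicAllocationEnabled =
      sc.getConf.getBoolean("spark.dynamicAllocation.enabled", false)
    if (dynamicAllocationEnabled) {
      sc.killExecutor(executorId)
    }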

"sc.killExecutor()" matches it to configured "schedulerBackend"
(CoarseGrainedSchedulerBackend in this case) and invokes
"killExecutors(executorIds)"

CoarseGrainedSchedulerBackend calculates a "newTotal" for the total number
of executors required and sends an update to the application master by
invoking "doRequestTotalExecutors(newTotal)".

CoarseGrainedSchedulerBackend then invokes
"doKillExecutors(filteredExecutorIds)" for the lost executors.

Thus the total number of executors is permanently reduced whenever a host
becomes intermittently unreachable.


I noticed that this change to "CoarseGrainedSchedulerBackend" was introduced
while fixing https://issues.apache.org/jira/browse/SPARK-6325



I am new to this code, so it would be a great help if any of you could
comment on why we need "doRequestTotalExecutors" in "killExecutors". Also,
why is "supportDynamicAllocation" true even though I have not enabled
dynamic allocation?

Regards,
Prakhar.



