GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/2840
[WIP][SPARK-3822] Executor scaling mechanism for Yarn
This is part of a broader effort to enable dynamic scaling of executors
([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is
intended to work alongside SPARK-3795 (#2746), SPARK-3796, and SPARK-3797.
The logic builds on @PraveenSeluka's work in #2798, but differs from the
changes there in that the mechanism is implemented within the existing
scheduler backend framework rather than in new `Actor` classes. This also
introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common
logic to communicate with the Yarn `ApplicationMaster`.
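To illustrate the shape of that abstraction, here is a minimal, self-contained sketch of a parent backend holding the scaling logic shared by client and cluster mode. Only the class names `YarnSchedulerBackend`, `YarnClientSchedulerBackend`, and `YarnClusterSchedulerBackend` come from this PR; the `ApplicationMasterLink` stub, method names, and counter are illustrative assumptions, not the PR's actual code.

```scala
// Hypothetical stand-in for the channel to the Yarn ApplicationMaster.
trait ApplicationMasterLink {
  def requestTotalExecutors(total: Int): Unit
  def releaseExecutors(ids: Seq[String]): Unit
}

// Common parent encapsulating the AM communication logic (sketch only).
abstract class YarnSchedulerBackend(am: ApplicationMasterLink) {
  private var targetExecutors = 0

  // Ask the AM for additional executor containers.
  def requestExecutors(numAdditional: Int): Unit = {
    targetExecutors += numAdditional
    am.requestTotalExecutors(targetExecutors)
  }

  // Ask the AM to release the given executors.
  def killExecutors(ids: Seq[String]): Unit = {
    targetExecutors = math.max(0, targetExecutors - ids.size)
    am.releaseExecutors(ids)
  }

  def currentTarget: Int = targetExecutors
}

// Both deploy modes then inherit the same mechanism.
class YarnClientSchedulerBackend(am: ApplicationMasterLink)
  extends YarnSchedulerBackend(am)

class YarnClusterSchedulerBackend(am: ApplicationMasterLink)
  extends YarnSchedulerBackend(am)
```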
I have tested this on a stable Yarn cluster. This is still WIP because when
an executor is removed, `SparkContext` and its components react as if it had
failed, resulting in many scary error messages and eventual timeouts. While
it's not strictly necessary to fix this for a first cut of the mechanism, it
would be good to add logic to distinguish this case if it
doesn't require too much more work.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark yarn-scaling-mechanism
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2840.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2840
----
commit bbee669d414bca292380028070ebd684ddca6c88
Author: Andrew Or <[email protected]>
Date: 2014-10-16T20:33:09Z
Add mechanism in yarn client mode
This commit allows the YarnClientSchedulerBackend to communicate
with the AM to request or kill executors dynamically. This extends
the existing MonitorActor in ApplicationMaster to also accept
messages to scale the number of executors up/down.
TODO: extend this to the yarn cluster backend as well, and put
the common functionality in its own class.
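The message protocol that commit describes might look like the following sketch. The case-class names and handler are assumptions for illustration (the PR's actual messages may differ); the pattern-matching logic is shown outside an actor so it is self-contained, whereas in the real code it would live in the actor's `receive` block.

```scala
// Hypothetical scaling messages the extended MonitorActor could accept.
sealed trait AMScalingMessage
case class AddExecutors(num: Int) extends AMScalingMessage
case class KillExecutors(executorIds: Seq[String]) extends AMScalingMessage

// Pure message-handling logic; in an actor this would be the receive block.
class MonitorLogic(var pendingTarget: Int) {
  def handle(msg: AMScalingMessage): Unit = msg match {
    case AddExecutors(n) =>
      pendingTarget += n                                    // scale up
    case KillExecutors(ids) =>
      pendingTarget = math.max(0, pendingTarget - ids.size) // scale down
  }
}
```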
commit 53e81454a65c2543afdfbd602f2fec3b850872c4
Author: Andrew Or <[email protected]>
Date: 2014-10-17T01:04:08Z
Start yarn actor early to listen for AM registration message
Previously the message was dropped, so the feature was never successfully
enabled. The fix is easy, but it took a long time to understand what was
wrong. This is Akka in its finest glory.
commit c4dfaac0c73689b79663e924e875e8ff93d8e5c6
Author: Andrew Or <[email protected]>
Date: 2014-10-17T01:45:30Z
Avoid thrashing when removing executors
Previously, we immediately added an executor back upon removing one,
simply because we did not keep the relevant counter updated.
An important TODO at this point is to inform the SparkContext
sooner about the executor successfully being killed. Otherwise
we have to wait for the BlockManager timeout, which may take a
long time.
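The counter fix described above can be sketched as follows. This is an illustrative assumption of the bookkeeping, not the PR's code: the idea is to shrink the target when a kill is deliberate, so the allocation loop does not immediately request a replacement container.

```scala
// Hypothetical tracker distinguishing intentional kills from real failures.
class KillTracker(var targetTotal: Int) {
  private var pendingKills = Set.empty[String]

  // Called when we deliberately kill an executor: shrink the target so no
  // replacement container is requested (this is the anti-thrashing fix).
  def recordKill(executorId: String): Unit = {
    pendingKills += executorId
    targetTotal = math.max(0, targetTotal - 1)
  }

  // Called when an executor is reported lost. Returns true if the loss was
  // one we asked for, so callers can suppress the failure-handling path.
  def onExecutorLost(executorId: String): Boolean = {
    val intentional = pendingKills.contains(executorId)
    pendingKills -= executorId
    intentional
  }
}
```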
commit 47466cd9d0f4a4f2f060e82796b25867092f5de0
Author: Andrew Or <[email protected]>
Date: 2014-10-18T01:32:24Z
Refactor common Yarn scheduler backend logic
As of this commit the mechanism is accessible from cluster mode
in addition to just client mode.
commit 7b76d0a1b0724f0c6572b8ffa9a13135d6d63b5f
Author: Andrew Or <[email protected]>
Date: 2014-10-18T02:04:30Z
Expose mechanism in SparkContext as developer API
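A sketch of how that developer API might delegate to the scheduler backend. The trait and class here are stand-ins, not the PR's actual code; in the real `SparkContext` these methods would presumably be annotated `@DeveloperApi` and forward to the active backend.

```scala
// Hypothetical backend interface the developer API delegates to.
trait ExecutorScalingBackend {
  def doRequestExecutors(numAdditional: Int): Boolean
  def doKillExecutors(executorIds: Seq[String]): Boolean
}

// Stand-in for the SparkContext-level developer API (sketch only).
class ScalingApi(backend: ExecutorScalingBackend) {
  // Request additional executors from the cluster manager.
  def requestExecutors(numAdditional: Int): Boolean =
    backend.doRequestExecutors(numAdditional)

  // Ask the cluster manager to kill the given executors.
  def killExecutors(executorIds: Seq[String]): Boolean =
    backend.doKillExecutors(executorIds)
}
```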
commit d987b3e9e33165a482189343c324a0babc6ff3f9
Author: Andrew Or <[email protected]>
Date: 2014-10-18T02:33:54Z
Move addWebUIFilters to Yarn scheduler backend
It's only used by Yarn.
----