Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

Michal Šenkýř Mon, 05 Dec 2016 09:53:13 -0800

Hello Travis,

I am just a short-time member of this list but I can definitely see thebenefit of using built-in OS resource management facilities todynamically manage cluster resources on the node level in this manner.At our company we often fight for resources on our development clusteras well as sometimes cancel running jobs in production to free upimmediately needed resources. If I understand it correctly, this wouldsolve a lot of our problems.



The only downside I see with this is that it is Linux-specific.


Michal


On 5.12.2016 16:36, Hegner, Travis wrote:

My apologies, in my excitement of finding a rather simple way toaccomplish the scheduling goal I have in mind, I hastily jumpedstraight into a technical solution, without explaining that goal, orthe problem it's attempting to solve.
You are correct that I'm looking for an additional running mode forthe standalone scheduler. Perhaps you could/should classify it as adifferent scheduler, but I don't want to give the impression that thiswill be as difficult to implement as most schedulers are. Initially,from a memory perspective, we would still allocate in a FIFOmanner. This new scheduling mode (or new scheduler, if you'drather) would mostly benefit any users with small-ish clusters, bothon-premise and cloud based. Essentially, my end goal is to be able torun multiple *applications* simultaneously with each applicationhaving *access* to the entire core count of the cluster.
I have a very cpu intensive application that I'd like to run weekly. Ihave a second application that I'd like to run hourly. The hourlyapplication is more time critical (higher priority), so I'd like it tofinish in a small amount of time. If I allow the first app to run withall cores (this takes several days on my 64 core cluster), thennothing else can be executed when running with the default FIFOscheduler. All of the cores have been allocated to thefirst application, and it will not release them until it is finished.Dynamic allocation does not help in this case, as there is always abacklog of tasks to run until the first application is nearing the endanyway. Naturally, I could just limit the number of cores that thefirst application has access to, but then I have idle cpu time whenthe second app is not running, and that is not optimal. Secondly inthat case, the second application only has access to the *leftover*cores that the first app has not allocated, and will take aconsiderably longer amount of time to run.
You could also imagine a scenario where a developer has a spark-shellrunning without specifying the number of cores they want to utilize(whether intentionally or not). As I'm sure you know, the default isto allocate the entire cluster to this application. The coresallocated to this shell are unavailable to other applications, even ifthey are just sitting idle while a developer is getting theirenvironment set up to run a very big job interactively. Otherdevelopers that would like to launch interactive shells are stuckwaiting for the first one to exit their shell.
My proposal would eliminate this static nature of core counts andallow as many simultaneous applications to be running as the clustermemory (still statically partitioned, at least initially) will allow.Applications could be configured with a "cpu shares" parameter (justan arbitrary integer relative only to other applications) which isessentially just passed through to the linux cgroup cpu.sharessetting. Since each executor of an application on a given worker runsin it's own process/jvm, then that process could be easily be placedinto a cgroup created and dedicated for that application.
Linux cgroups cpu.shares are pretty well documented, but the gist isthat processes competing for cpu time are allocated a percentage oftime equal to their share count as a percentage of all shares in thatlevel of the cgroup hierarchy. If two applications are both scheduledon the same core with the same weight, each will get to utilize 50% ofthe time on that core. This is all built into the kernel, and the onlything the spark worker has to do is create a cgroup for eachapplication, set the cpu.shares parameter, and assign the executorsfor that application to the new cgroup. If multiple executors arerunning on a single worker, for a single application, the cpu timeavailable to that application is divided among each of those executorsequally. The default for cpu.shares is that they are not limiting inany way. A process can consume all available cpu time if it wouldotherwise be idle anyway.
Another benefit to passing cpu.shares directly to the kernel (asopposed to some abstraction) is that cpu share allocations areheterogeneous to all processes running on a machine. An admin couldhave very fine grained control over which processes get priorityaccess to cpu time, depending on their needs.
To continue my personal example above, my long running cpu intensiveapplication could utilize 100% of all cluster cores if they are idle.Then my time sensitive app could be launched with nine times thepriority and the linux kernel would scale back the first applicationto 10% of all cores (completely seemlessly and automatically: nopre-emption, just fewer time slices of cpu allocated by the kernel tothe first application), while the second application gets 90% of allthe cores until it completes.
The only downside that I can think of currently is that thisscheduling mode would create an increase in context switching on eachhost. This issue is somewhat mitigated by still statically allocatingmemory however, since there wouldn't typically be an exorbitant numberof applications running at once.
In my opinion, this would allow the most optimal usage of clusterresources. Linux cgroups allow you to control access to more than justcpu shares. You can apply the same concept to other resources (memory,disk io). You can also set up hard limits so that an application willnever get more than is allocated to it. I know that those limitationsare important for some use cases involving predictability ofapplication execution times. Eventually, this idea could be expandedto include many more of the features that cgroups provide.
Thanks again for any feedback on this idea. I hope that I haveexplained it a bit better now. Does anyone else can see value in it?
Travis


------------------------------------------------------------------------
*From:* Shuai Lin <linshuai2...@gmail.com>
*Sent:* Saturday, December 3, 2016 06:52
*To:* Hegner, Travis
*Cc:* dev@spark.apache.org
*Subject:* Re: SPARK-18689: A proposal for priority based appscheduling utilizing linux cgroups.Sorry but I don't get the scope of the problem from your description.Seems it's an improvement for spark standalone scheduler (i.e. not foryarn or mesos)?
On Sat, Dec 3, 2016 at 4:27 AM, Hegner, Travis <theg...@trilliumit.com<mailto:theg...@trilliumit.com>> wrote:
    Hello,


    I've just created a JIRA to open up discussion of a new feature
    that I'd like to propose.


    https://issues.apache.org/jira/browse/SPARK-18689
    <https://issues.apache.org/jira/browse/SPARK-18689>


    I'd love to get some feedback on the idea. I know that normally
    anything related to scheduling or queuing automatically throws up
    the "hard to implement" red flags, but the proposal contains a
    rather simple way to implement the concept, which delegates the
    scheduling logic to the actual kernel of each worker, rather than
    in any spark core code. I believe this to be more flexible and
    simpler to set up and maintain than dynamic allocation, and avoids
    the need for any preemption type of logic.


    The proposal does not contain any code. I am not (yet) familiar
    enough with the core spark code to confidently create an
    implementation.


    I appreciate your time and am looking forward to your feedback!


    Thanks,


    Travis

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

Reply via email to