Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

Hegner, Travis Tue, 06 Dec 2016 07:50:58 -0800

Steve,

I appreciate your experience and insight when dealing with large clusters at 
the data-center scale. I'm also well aware of the complex nature of schedulers, 
and that it is an area of ongoing research being done by people/companies with 
many more resources than I have. This might explain my apprehension in even 
calling this idea a *scheduler*: I wanted to avoid this exact kind of debate 
over what I want to accomplish. This is also why I mentioned that this idea 
will mostly benefit users with small clusters.

I've used many of the big named "cluster schedulers" (YARN, Mesos, and 
Kubernetes) and the main thing that they have in common is that they don't work 
well for my use case. Those systems are designed for large scale 1000+ node 
clusters, and become painful to manage in the small cluster range. Most of the 
tools that we've attempted to use don't work well for us, so we've written 
several of our own: https://github.com/trilliumit/.

It can be most easily stated by the fact that *we are not* Google, Facebook, or 
Amazon: we don't have a *data-center* of servers to manage, we barely have half 
of a rack. *We are not trying to solve the problem that you are referring to*. 
We are operating at a level that if we aren't meeting SLAs, then we could just 
buy another server to add to the cluster. I imagine that we are not alone in 
that fact either, I've seen that many of the questions on SO and on the user 
list are from others operating at a level similar to ours.

I understand that pre-emption isn't inherently a bad thing, and that these 
multi-node systems typically handle it gracefully. However, if idle CPU is 
expensive, then how much more does wasted CPU cost when a nearly complete task 
is pre-empted and has to be started over? Fortunately for me, that isn't a 
problem that I have to solve at the moment.

>Instead? Use a multi-user cluster scheduler and spin up different spark 
>instances for the different workloads

See my above comment on how well these cluster schedulers work for us. I have 
considered the avenue of multiple spark clusters, and in reality the 
infrastructure we have set up would allow me to do this relatively easily. In 
fact, in my environment, this is a similar solution to what I'm proposing, just 
managed one layer up the stack and with less flexibility. I am trying to avoid 
this solution however because it does require more overhead and maintenance. 
What if I want two spark apps to run on the same cluster at the same time, 
sharing the available CPU capacity equally? I can't accomplish that easily with 
multiple spark clusters. Also, we are a 1 to 2 man operation at this point, I 
don't have teams of ops people to task with managing as many spark clusters as 
I feel like launching.

>FWIW, it's often memory consumption that's most problematic here.

Perhaps in the use-cases you have experience with, but not currently in mine. 
In fact, my initial proposal is net yet changing the allocation of memory as a 
resource. This would still be statically allocated in a FIFO manner as long as 
memory is available on the cluster, the same way it is now.

>I would strongly encourage you to avoid this topic

Thanks for the suggestion, but I will choose how I spend my time. If I can find 
a simple solution to a problem that I face, and I'm willing to share that 
solution, I'd hope one would encourage that instead.

Perhaps I haven't yet clearly communicated what I'm trying to do. In short, *I 
am not trying to write a scheduler*: I am trying to slightly (and optionally) 
tweak the way executors are allocated and launched, so that I can more 
intuitively and more optimally utilize my small spark cluster.

Thanks,

Travis

________________________________
From: Steve Loughran <ste...@hortonworks.com>
Sent: Tuesday, December 6, 2016 06:54
To: Hegner, Travis
Cc: Shuai Lin; Apache Spark Dev
Subject: Re: SPARK-18689: A proposal for priority based app scheduling 
utilizing linux cgroups.

This is essentially what the cluster schedulers do: allow different people to 
submit work with different credentials and priority; cgroups & equivalent to 
limit granted resources to requested ones. If you have pre-emption enabled, you 
can even have one job kill work off the others. Spark does recognise 
pre-emption failures and doesn't treat it as a sign of problems in the 
executor, that is: it doesn't over-react.

cluster scheduling is one of the cutting edge bits of datacentre-scale 
computing —if you are curious about what is state of the art, look at the 
Morning Paper https://blog.acolyer.org/ for coverage last week of MS and google 
work there. YARN, Mesos, Borg, whatever Amazon use, at scale it's not just 
meeting SLAs, its about how much idle CPU costs, and how expensive even a 1-2% 
drop in throughput would be.

I would strongly encourage you to avoid this topic, unless you want dive deep 
into the whole world of cluster scheduling, the debate over centralized vs 
decentralized, the idelogical one of "should services ever get allocated 
RAM/CPU in times of low overall load?", the challenge of swap, or more 
specifically, "how do you throttle memory consumption", as well as what to do 
when the IO load of a service is actually incurred on a completely different 
host from the one your work is running on.

There's also a fair amount of engineering work; to get a hint download the 
Hadoop tree and look at 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux
 for the cgroup support, and then 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl
 for the native code needed alongside this. Then consider that it's not just a 
matter of writing something similar, it's getting an OSS project to actually 
commit to maintaining such code after you provide that initial contribution.

Instead? Use a multi-user cluster scheduler and spin up different spark 
instances for the different workloads, with different CPU & memory limits, 
queue priorities, etc. Other people have done the work, written the tests, 
deployed it in production, met their own SLAs *and are therefore committed to 
maintaining this stuff*.

-Steve

On 5 Dec 2016, at 15:36, Hegner, Travis 
<theg...@trilliumit.com<mailto:theg...@trilliumit.com>> wrote:

My apologies, in my excitement of finding a rather simple way to accomplish the 
scheduling goal I have in mind, I hastily jumped straight into a technical 
solution, without explaining that goal, or the problem it's attempting to solve.

You are correct that I'm looking for an additional running mode for the 
standalone scheduler. Perhaps you could/should classify it as a different 
scheduler, but I don't want to give the impression that this will be as 
difficult to implement as most schedulers are. Initially, from a memory 
perspective, we would still allocate in a FIFO manner. This new scheduling mode 
(or new scheduler, if you'd rather) would mostly benefit any users with 
small-ish clusters, both on-premise and cloud based. Essentially, my end goal 
is to be able to run multiple *applications* simultaneously with each 
application having *access* to the entire core count of the cluster.

I have a very cpu intensive application that I'd like to run weekly. I have a 
second application that I'd like to run hourly. The hourly application is more 
time critical (higher priority), so I'd like it to finish in a small amount of 
time. If I allow the first app to run with all cores (this takes several days 
on my 64 core cluster), then nothing else can be executed when running with the 
default FIFO scheduler. All of the cores have been allocated to the first 
application, and it will not release them until it is finished. Dynamic 
allocation does not help in this case, as there is always a backlog of tasks to 
run until the first application is nearing the end anyway. Naturally, I could 
just limit the number of cores that the first application has access to, but 
then I have idle cpu time when the second app is not running, and that is not 
optimal. Secondly in that case, the second application only has access to the 
*leftover* cores that the first app has not allocated, and will take a 
considerably longer amount of time to run.

You could also imagine a scenario where a developer has a spark-shell running 
without specifying the number of cores they want to utilize (whether 
intentionally or not). As I'm sure you know, the default is to allocate the 
entire cluster to this application. The cores allocated to this shell are 
unavailable to other applications, even if they are just sitting idle while a 
developer is getting their environment set up to run a very big job 
interactively. Other developers that would like to launch interactive shells 
are stuck waiting for the first one to exit their shell.

My proposal would eliminate this static nature of core counts and allow as many 
simultaneous applications to be running as the cluster memory (still statically 
partitioned, at least initially) will allow. Applications could be configured 
with a "cpu shares" parameter (just an arbitrary integer relative only to other 
applications) which is essentially just passed through to the linux cgroup 
cpu.shares setting. Since each executor of an application on a given worker 
runs in it's own process/jvm, then that process could be easily be placed into 
a cgroup created and dedicated for that application.

Linux cgroups cpu.shares are pretty well documented, but the gist is that 
processes competing for cpu time are allocated a percentage of time equal to 
their share count as a percentage of all shares in that level of the cgroup 
hierarchy. If two applications are both scheduled on the same core with the 
same weight, each will get to utilize 50% of the time on that core. This is all 
built into the kernel, and the only thing the spark worker has to do is create 
a cgroup for each application, set the cpu.shares parameter, and assign the 
executors for that application to the new cgroup. If multiple executors are 
running on a single worker, for a single application, the cpu time available to 
that application is divided among each of those executors equally. The default 
for cpu.shares is that they are not limiting in any way. A process can consume 
all available cpu time if it would otherwise be idle anyway.

That's the issue that surfaces in google papers: should jobs get idle capacity. 
Current consensus is "no". Why not? Because you may end up writing an 
SLA-sensitive app which just happens to meet it's SLAs in times of light 
cluster load, but precisely when the cluster is busy, it suddenly slows down, 
leading to stress all round, in the "why is this service suddenly unusable" 
kind of stress. Instead you keep the cluster busy with low priority preemptible 
work, use labels to allocate specific hosts to high-SLA apps, etc.

Another benefit to passing cpu.shares directly to the kernel (as opposed to 
some abstraction) is that cpu share allocations are heterogeneous to all 
processes running on a machine. An admin could have very fine grained control 
over which processes get priority access to cpu time, depending on their needs.

To continue my personal example above, my long running cpu intensive 
application could utilize 100% of all cluster cores if they are idle. Then my 
time sensitive app could be launched with nine times the priority and the linux 
kernel would scale back the first application to 10% of all cores (completely 
seemlessly and automatically: no pre-emption, just fewer time slices of cpu 
allocated by the kernel to the first application), while the second application 
gets 90% of all the cores until it completes.

FWIW, it's often memory consumption that's most problematic here. If one 
process starts to swap, it hurts everything else. But Java isn't that good at 
handling limited heap/memory size; you have to spec that heap up front.

The only downside that I can think of currently is that this scheduling mode 
would create an increase in context switching on each host. This issue is 
somewhat mitigated by still statically allocating memory however, since there 
wouldn't typically be an exorbitant number of applications running at once.

In my opinion, this would allow the most optimal usage of cluster resources. 
Linux cgroups allow you to control access to more than just cpu shares. You can 
apply the same concept to other resources (memory, disk io). You can also set 
up hard limits so that an application will never get more than is allocated to 
it. I know that those limitations are important for some use cases involving 
predictability of application execution times. Eventually, this idea could be 
expanded to include many more of the features that cgroups provide.

Thanks again for any feedback on this idea. I hope that I have explained it a 
bit better now. Does anyone else can see value in it?

I'm not saying "don't get involved in the scheduling problem"; I'm trying to 
show just how complex it gets in a large system. Before you begin to write a 
line of code, I'd recommend

-you read as much of the published work as you can, including the google and 
microsoft papers, Facebook's FairScheduler work, etc, etc.
-have a look at the actual code inside those schedulers whose source is public, 
that's YARN and Mesos.
-try using these schedulers for your own workloads.

really: scheduling work across a datacentre a complex problem that is still 
considered a place for cutting-edge research. Avoid unless you want to do that.

-Steve

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

Reply via email to