When running Apache OpenWhisk with a high load of heterogeneous action invocations - i.e., high variance in memory / time limits and a large number of different namespaces / actions - single invokers are occasionally overloaded. With this post, I want to share my observations and conclusions with the community. As a consequence, I'm planning to provide a series of pull requests that are meant to give more insight when running into invoker overloads. Please share your observations and conclusions as well - I'm looking forward to your feedback.

Invoker overloads can be inferred from the occurrence of the (in)famous "Rescheduling Run message" log entry on invokers [1]:

"Rescheduling Run message, too many message in the pool, freePoolSize: 0 containers and 0 MB, busyPoolSize: 2 containers and 512 MB, maxContainersMemory 512 MB, userNamespace: namespace, action: ExecutableWhiskAction/namespace/package/action@0.0.1, needed memory: 256 MB, waiting messages: 0"

Excess activations on overloaded invokers need to wait until one of the currently running actions completes. This can take up to the maximum action time limit, potentially causing unacceptable wait times for these excess activations. In large OW installations, we see thousands of these log entries within a few days at times. Do other OW adopters share these observations?
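To make the condition behind this log entry more concrete, here is a minimal sketch of the capacity check that forces an invoker to buffer ("reschedule") a Run message. This is not the actual ContainerPool code referenced in [1]; all names and numbers are invented for illustration.

object InvokerCapacitySketch {

  // Simplified view of an invoker's container pools (memory in MB).
  final case class PoolState(freePoolMB: Int, busyPoolMB: Int, maxPoolMB: Int)

  // An activation can only start if its memory limit still fits next to the busy
  // containers; idle (free) containers can be evicted to make room. If it does not
  // fit, the Run message is buffered ("rescheduled") until a running action
  // completes - in the worst case only after the maximum action time limit.
  def fits(pool: PoolState, neededMB: Int): Boolean =
    pool.busyPoolMB + neededMB <= pool.maxPoolMB

  def main(args: Array[String]): Unit = {
    // The situation from the quoted log entry: 2 busy containers occupy all
    // 512 MB of the invoker's user memory, so a 256 MB activation cannot start.
    val overloaded = PoolState(freePoolMB = 0, busyPoolMB = 512, maxPoolMB = 512)
    println(fits(overloaded, neededMB = 256)) // false -> "Rescheduling Run message"
  }
}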
I'm aware of two conditions that cause overloaded invokers by design:

1. A controller schedules an activation - but no acknowledgement is received from the invoker within the expected time because the invoker takes too long to complete the activation. Based on timer expiration, a forced acknowledgement removes said activation from the load balancer's memory book-keeping. With the released capacity, the load balancer schedules a new activation to the invoker - which may still be running the action that previously timed out in the controller.

2. A controller cannot identify an invoker that has enough free capacity to schedule an activation - in particular, this can happen if the activation's memory limit is equal to the controller's shard memory size on the invoker. If there is at least one usable invoker, the load balancer will select a random invoker to schedule the activation. This situation is called overload in the load balancer code and will yield a book-keeping semaphore with a negative value. Obviously, the selected invoker cannot process the scheduled activation right away (see the sketch below).

Did I miss other conditions that cause overloaded invokers by design?
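The following sketch illustrates the fallback described in condition 2. It is not the upstream ShardingContainerPoolBalancer.schedule() code from [4] - the types and names are invented and heavily simplified - but it shows how forcing a random assignment drives the per-invoker book-keeping negative.

import scala.util.Random

object OverloadFallbackSketch {

  // Per-invoker book-keeping: remaining shard memory in MB (may go negative).
  final case class Invoker(id: Int, healthy: Boolean, var freeMemoryMB: Int)

  // Try to place an activation with the given memory limit. If some usable invoker
  // has enough free shard memory, use it. Otherwise ("overload"), pick a random
  // healthy invoker and force the assignment, driving its book-keeping negative -
  // the chosen invoker cannot run the activation right away.
  def schedule(invokers: Seq[Invoker], neededMB: Int): Option[Invoker] = {
    val healthy = invokers.filter(_.healthy)
    healthy.find(_.freeMemoryMB >= neededMB) match {
      case Some(invoker) =>
        invoker.freeMemoryMB -= neededMB
        Some(invoker)
      case None if healthy.nonEmpty =>
        val forced = healthy(Random.nextInt(healthy.size))
        forced.freeMemoryMB -= neededMB // may become negative: "overload"
        Some(forced)
      case None =>
        None // no usable invoker at all
    }
  }

  def main(args: Array[String]): Unit = {
    // Every shard is smaller than the 512 MB the activation needs, so the fallback
    // picks an invoker at random and its free-memory counter goes negative.
    val invokers = Seq(Invoker(0, healthy = true, freeMemoryMB = 256),
                       Invoker(1, healthy = true, freeMemoryMB = 128))
    val chosen = schedule(invokers, neededMB = 512)
    println(chosen.map(i => s"invoker${i.id}, freeMemoryMB now ${i.freeMemoryMB}"))
  }
}

In this example, neither shard can hold the 512 MB activation, so one of the invokers ends up with a negative free-memory counter - excess work that later has to wait on that invoker.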
I suspect that there are additional causes for overloaded invokers - design flaws in the controller / load balancer or even bugs. I suggest extending log messages and improving existing metrics / introducing new metrics to better understand what's going on with overloaded invokers. We need to be careful when extending log messages - we must neither considerably increase the log volume nor impact performance due to additional operations. Overall, the goal is to eventually fix invoker overloads.

I already opened a pull request (together with Sugandha) to add the action timeout limit to the invoker assignment message in the load balancer, plus some other changes [2]. Please check the PR for details. I'm planning further pull requests in these areas:

* At times, we see a non-negligible number of forced acknowledgements in large OW installations. For this reason, I suggest extending the log messages in the processCompletion() method [3] in the common load balancer code to provide more diagnostic information when forced acknowledgements occur. In particular, I want to add more information about the invoker processing the activation and about the action itself. A metric reflecting forced acknowledgements also seems helpful.

* As discussed above, the load balancer will schedule an activation to a randomly selected usable invoker if it cannot find a usable invoker that has enough free user memory ("overload"). This can also be caused by fragmentation: when all invokers are running activations with small memory limits and the activation to be scheduled has a very high memory limit, the invoker pool may have plenty of free user memory in total, yet no single invoker has enough free memory to fit the large activation. For this reason, I'm planning to extend the schedule() method [4] in the ShardingContainerPoolBalancer to collect more information about scheduling that is logged afterwards: how many invokers were visited? Which minimum, average and maximum free memory did the usable invokers have that were not selected?

* When the "Rescheduling Run message" log entry [1] occurs on an invoker, we don't know what is currently going on in the busy and free container pools. I'm planning to extend the log message with more detailed information about the pools to better understand the scheduling history of this invoker. We need to understand which activations currently occupy the invoker.

Please let me know what you think.

[1] https://github.com/apache/incubator-openwhisk/blob/ee33c22224515d0161b5b595b0d45dbc14164c6e/core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/ContainerPool.scala#L175-L199
[2] https://github.com/apache/incubator-openwhisk/pull/4537
[3] https://github.com/apache/incubator-openwhisk/blob/ee33c22224515d0161b5b595b0d45dbc14164c6e/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L217-L222
[4] https://github.com/apache/incubator-openwhisk/blob/ee33c22224515d0161b5b595b0d45dbc14164c6e/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/ShardingContainerPoolBalancer.scala#L377-L398

Regards,

Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk
E-mail: sven.lange-l...@de.ibm.com