Re: Opaque automatic hook retries from API

Stuart Bishop Fri, 06 Jan 2017 04:10:52 -0800

On 6 January 2017 at 01:39, Casey Marshall <casey.marsh...@canonical.com>
wrote:

> On Thu, Jan 5, 2017 at 3:33 AM, Adam Collard <adam.coll...@canonical.com>
> wrote:
>
>> Hi,
>>
>> The automatic hook retries[0] that landed as part of 2.0 (are documented
>> as) run indefinitely[1] - this causes problems as an API user:
>>
>> Imagine you are driving Juju using the API, and when you perform an
>> operation (e.g. set the configuration of a service, or reboot the unit, or
>> add a relation..) - you want to show the status of that operation.
>>
>> Prior to the automatic retries, you simply perform your operation, and
>> watch the delta streams for the corresponding change to the unit - the
>> success or otherwise of the operation is reflected in the unit
>> agent-status/workload-status pair.
>>
>> Now, with retries, if you see a unit in the error state, you can't
>> accurately reflect the status of the operation, since the unit will
>> undoubtedly retry the hook again. Maybe it succeeds, maybe it fails again.
>> How can one say after receiving the first delta of a unit error if the
>> operation succeeded or failed?
>>
>> With no visibility up front on the retry strategy that Juju will perform
>> (e.g. something representing the exponential backoff and a fixed number of
>> retries before Juju admits defeat) it is impossible to say at any point in
>> the delta stream what the result of a failed-at-least-once operation is.
>>
>
> I think the retry strategy is great -- it leverages the immutability we
> expect hooks to provide, to deliver a robust result over unreliable
> substrates -- and all substrates are unreliable where there's
> internetworking involved!
>
> However I see your point about the retry strategy muddling status. I've
> noticed this sometimes when watching openstack or k8s bundles "shake out"
> the errors as they come up. I don't think this is always a charm quality
> issue, it's maybe because we're trying to show two different things with
> status?
>

Errors being 'shaken out' are almost always unhandled race conditions. I
find destroy-service/remove-application is particularly problematic,
because the doomed units don't know they are being destroyed but rather are
informed about departing one relation at a time. This is inherently racy:
the units the doomed service is related to will process their
relation-departed hooks almost immediately and stop talking to the doomed
service, while the doomed service still thinks it can access their
resources as it falls apart one piece at a time.

I'm becoming more and more a believer that we can't reasonably avoid these
errors, and instead maybe we should assume that they will happen and that
this is perfectly normal. We can stick to writing nice idempotent handlers,
which are simpler because we can ignore failures and let them bubble up. We
can also use simpler protocols (e.g. removing all the handshaking the
PostgreSQL interface does to try to avoid races with authorization). And
going back to Adam's point, hooks could be retried a few times with some
sort of backoff before even being reported as a failure to the end user.
One of the reasons test suites are currently flaky is that there are race
conditions we have no reasonable way of solving, such as a database
restarting itself while a hook on another unit is attempting to use it.
Even though I currently bootstrap test envs with the retry behaviour off,
I'm thinking of changing that.
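
To make the "retry a few times with backoff before reporting a failure"
idea concrete, here is a minimal sketch. It is not Juju code: the function
name, the default limit of 5 attempts and the doubling delay are all
assumptions chosen for illustration. The attempt count is returned so a
caller could surface it alongside the status.

```python
import time


def run_with_retries(hook, max_attempts=5, base_delay=1.0, factor=2.0,
                     sleep=time.sleep):
    """Run hook() until it succeeds or max_attempts is reached.

    Hypothetical helper, not part of Juju. Returns a tuple of
    (succeeded, attempts_made, last_error).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            hook()
            return True, attempt, None
        except Exception as exc:  # an idempotent hook can simply be re-run
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff: base_delay, then base_delay * factor, ...
                sleep(base_delay * factor ** (attempt - 1))
    # Only now would the failure be reported to the end user.
    return False, max_attempts, last_error
```

With a fixed, documented limit like this, an API client would know that a
failure reported after the final attempt is definitive.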

> What if Juju made a clearer distinction between result-state ("what I'm
> doing most recently or last attempted to do") vs. goal-state ("what I'm
> trying to get done") in the status? Would that help?
>

Isn't the goal state just the failed hook? I would certainly like to see
the list of hooks queued to run on each unit though if that is what you
mean (not in the default tabular status, but in the json status dump).

>> Can retries be limited to a small number, with a backoff algorithm
>> explicitly documented and stuck to by Juju, with the retry attempt number
>> included in the delta stream?
>>
>
This sounds like a good idea. The limit could even be dynamic, with a retry
attempted every time a unit it is related to successfully runs a hook,
until the environment is quiescent.
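
A rough sketch of what that dynamic limit might look like, as a toy
simulation rather than anything Juju does today. The names settle, related
and run_hook are hypothetical: related maps each unit to the units it has
relations with, and run_hook(unit) returns True on success.

```python
from collections import deque


def settle(units, related, run_hook):
    """Retry failed units whenever a related unit succeeds.

    Hypothetical simulation, not Juju behaviour. Returns the set of units
    still failed once nothing is left to retry (the quiescent state).
    """
    failed = set()
    pending = deque(units)  # units whose hook should be (re)tried
    while pending:
        unit = pending.popleft()
        if run_hook(unit):
            failed.discard(unit)
            # A success may have fixed what failed peers were waiting on,
            # so queue each failed peer for another attempt.
            for peer in sorted(related.get(unit, ())):
                if peer in failed and peer not in pending:
                    pending.append(peer)
        else:
            failed.add(unit)
    return failed
```

Each unit can succeed at most once here, so the loop always terminates, and
whatever is left in the returned set failed with no further related changes
to wait for.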

-- 
Stuart Bishop <stuart.bis...@canonical.com>

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev
