I am still +1 on the idea to have default paused state on creation. I
think we could still differentiate between initially paused and timed
out states internally by looking at pause reason. It's quite different
if we want to store explicit NACK reasons from the external service
though. That would r
I like the idea of implementing this scheduler-side purely through volatile
state, but the lack of feedback (generic vs specific error messages when an
update is paused) leaves something to be desired. Maybe we can address that
with a metadata field in the initial call to startUpdate (with an optio
The main reason I preferred the lack-of-ACK approach over an explicit
NACK one is simplicity. As Joshua pointed out there is more state to
handle in that case. The lack-of-ACK model can be completely
implemented in volatile memory sidestepping the persistent storage
entirely. With the NACK we would
"The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
we are going to let lack-of-ACK serve as the NACK, i don't think it's safe
to resume when we receive another ACK. In other words, a service toggling
unhealthy might not be deemed safe to proceed."
Lack-of-ACK is the scena
>
> the generic paused-waiting-for-heartbeat message will be quickly replaced
> by "high 502 rate"
From the doc Maxim linked, i don't believe that's the plan:
> External service detects service health problems and stops heartbeats
> Heartbeat timeout occurs. Scheduler pauses the update.
-=Bill
If the service sending the heartbeat RPC is working, the generic
paused-waiting-for-heartbeat message will be quickly replaced by "high 502
rate". If it's not working (or has connectivity issues) we at least won't
give a false sense of progress.
On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner wrote:
Re: user experience, NACK-via-timeout fails here as well.
"PAUSED - Heartbeat not received in 60s" is objectively worse than "PAUSED
- Heartbeat failed: high 502 rate".
This is part of the impedance mismatch i'm calling out.
-=Bill
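
For reference, the lack-of-ACK model debated in this thread can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheduler's implementation: the class name `HeartbeatMonitor`, its methods, the `ROLLING_FORWARD` constant, the "Waiting for first heartbeat" message, and the 60-second default are all hypothetical. Only the ROLL_FORWARD_PAUSED state name, the idea of starting paused, and the "Heartbeat not received in 60s" wording come from the thread.

```python
import time

ROLLING_FORWARD = "ROLLING_FORWARD"          # assumed name for the active state
ROLL_FORWARD_PAUSED = "ROLL_FORWARD_PAUSED"  # state name used in the thread


class HeartbeatMonitor:
    """Volatile (in-memory only) lack-of-ACK tracker for a single update.

    Hypothetical sketch: names and the timeout are assumptions.
    """

    def __init__(self, timeout_secs=60.0, clock=time.monotonic):
        self._timeout = timeout_secs
        self._clock = clock
        self._last_heartbeat = None  # no heartbeat seen yet
        # Start paused so no progress is made before the first heartbeat.
        self.state = ROLL_FORWARD_PAUSED
        self.pause_reason = "Waiting for first heartbeat"

    def heartbeat(self):
        """Record an ACK from the external service and resume the update."""
        self._last_heartbeat = self._clock()
        self.state = ROLLING_FORWARD
        self.pause_reason = None

    def check(self):
        """Pause the update if the last heartbeat is older than the timeout."""
        if self.state == ROLLING_FORWARD:
            if self._clock() - self._last_heartbeat > self._timeout:
                self.state = ROLL_FORWARD_PAUSED
                self.pause_reason = "Heartbeat not received in %ds" % self._timeout
        return self.state
```

Nothing here touches persistent storage, which is the simplicity argument made for lack-of-ACK. The sketch also shows both objections: the pause reason can only ever be generic, and any later heartbeat resumes the update, which is the behavior questioned in the quoted comment about a service toggling unhealthy.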
On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney wrote:
> On Mo
On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner wrote:
> What is the guidance for deploying while the heartbeat service is broken?
> I think i know the answer, but it's important to spell out.
>
> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > state to avoid any prog
What is the guidance for deploying while the heartbeat service is broken?
I think i know the answer, but it's important to spell out.
> Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> state to avoid any progress until the first heartbeat call arrives.
I'm not sold on th
I'm +1 on the sentiment. My main concern is with the potential for effort
to be gated on documentation review. This essentially requires design
review for all new features before development begins. Great in theory, but
I worry in practice things may get held up while we try to work out details
tha
Slight change of position - i think the documentation with the pitch should
reflect the end state. In other words, it should be suitable for
committing on master when the feature is ready; and not be a pitch itself.
This is slightly different from the doc linked above.
-=Bill
On Mon, Oct 13, 201
Maxim's 'External Update Coordination' thread [1] sets a precedent for
pitching a feature via documentation before diving into code. I propose we
standardize on this practice for big and small features. This will take
some trial-and-error to get right, but i think for small changes requiring
docu
Agreed. That would be a logical generalization of the post failover behavior.
I have updated the above document with the following changes:
- Reply with PAUSED any time a job was paused by user;
- Start in paused state by default.
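
The two changes listed above could look roughly like the following Python sketch. It is an assumption-laden illustration, not the documented API: `PauseReason`, `Update`, `handle_heartbeat`, and the "OK" reply are hypothetical names; only the PAUSED reply and the paused-by-default start come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class PauseReason(Enum):
    """Hypothetical internal reasons an update is not making progress."""
    NONE = "none"
    AWAITING_FIRST_HEARTBEAT = "awaiting_first_heartbeat"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    USER = "user"


@dataclass
class Update:
    # Start in a paused state by default, waiting for the first heartbeat.
    pause_reason: PauseReason = PauseReason.AWAITING_FIRST_HEARTBEAT


def handle_heartbeat(update):
    """Return the reply to a heartbeat call for the given update."""
    if update.pause_reason is PauseReason.USER:
        # A pause requested by the user is never lifted by a heartbeat.
        return "PAUSED"
    # Pauses caused by missing heartbeats are lifted by a new heartbeat.
    update.pause_reason = PauseReason.NONE
    return "OK"
```

Keeping the reason separate from the paused state is one way to tell an initially paused update from a timed-out or user-paused one without adding new update states.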
On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney wrote:
> The doc
Sounds good to me.
On Mon, Oct 13, 2014 at 11:51 AM, Bill Farner wrote:
> We had a discussion about this in our weekly community meeting in IRC
> today, and after some debate there was unanimous agreement to avoid all
> time control but to use the presence of the snooze file only. Below is the
We had a discussion about this in our weekly community meeting in IRC
today, and after some debate there was unanimous agreement to avoid all
time control but to use the presence of the snooze file only. Below is the
excerpt from the discussion. If you disagree, feel free to continue the
discussi
Summary of IRC Meeting in #aurora at Mon Oct 13 18:02:25 2014:
Attendees: wickman, jcohen, wfarner, Yasumoto, kts, mkhutornenko, davelester,
zmanji
- Preface
- Aurora doc day
- 0.6.0 release
- Test coverage flakiness
- External update coordination
- Ticket resolution field
- Health check snooze
The doc mentioned that the scheduler will start an update subject to the
heartbeat countdown, and if it doesn't receive a heartbeat it will pause
the update. Why not start with the update paused-due-to-no-heartbeat to
fail-fast any connectivity issues between the service providing the
heartbeats an