+1
-=Bill
On Thu, Oct 16, 2014 at 11:21 AM, Maxim Khutornenko
wrote:
Correct. The presence of the SessionKey does indeed mean that
heartbeats are going to be authenticated. Given that the external service
has to solve the authentication story to use pauseJobUpdate anyway, having
heartbeats authenticated seems like a natural progression. Also, given
our current admin thrift
I inferred that authentication was required due to the presence of a
SessionKey in the RPC. Of course any authentication mechanism here could
have serious scaling issues (barring something like HTTP basic auth in
memory)
On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen
wrote:
What are our thoughts about authentication with regards to heartbeats? It
seems like they should be authenticated since there does exist the
potential for a malicious actor to send its own heartbeats even if the real
monitoring service has detected a problem and ceased sending heartbeats.
I'm not s
David - the plan is to synthesize the waiting state. Exactly how is not
yet certain.
On Wednesday, October 15, 2014, Maxim Khutornenko wrote:
It is certainly possible to add new state or a status message but I
don't think it's a blocker for the first iteration. Provided there is
enough demand a state/message could be synthesized during the 'get'
call based on the volatile state.
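A minimal sketch of what synthesizing a state during the 'get' call could look like. The function name, the `WAITING_FOR_HEARTBEAT` label and the use of `ROLLING_FORWARD` as the active status are assumptions for illustration only, not something the thread or the doc settles on:

```python
# Hypothetical label; the thread only floats "WAITING?" as a possible name.
WAITING_FOR_HEARTBEAT = "WAITING_FOR_HEARTBEAT"


def synthesize_status(persisted_status, heartbeat_fresh):
    """Derive the status reported by a 'get' call.

    persisted_status: the status held in durable storage (never modified here).
    heartbeat_fresh:  volatile flag, True if a recent enough heartbeat was seen.
    """
    # Only an update that would otherwise be making progress can be
    # "blocked on heartbeat"; explicitly paused updates stay as they are.
    if persisted_status == "ROLLING_FORWARD" and not heartbeat_fresh:
        return WAITING_FOR_HEARTBEAT
    return persisted_status
```

Because the waiting state is computed on read, nothing extra is written to storage and a scheduler failover loses nothing.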
On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin wrote:
+1 for pause being explicit RPC pauses, but does it really add complexity
to just add a new state (WAITING?) when no heartbeat is sent? Not being
able to see that an update was blocked because of a lack of heartbeat seems
like a missing feature.
On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko wrote:
+1. Updated the doc:
https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner wrote:
+1 to the scheduler not proceeding on an update when heartbeats are absent,
and requiring the heartbeat service to explicitly call pauseJobUpdate when
it detects problems.
-=Bill
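A rough sketch of the division of labor described above, seen from the external heartbeat service. The client class and its method names are hypothetical stand-ins for the heartbeatJobUpdate and pauseJobUpdate RPCs, and the health check is left as a caller-supplied function:

```python
class RecordingClient:
    """Stand-in for a scheduler client; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def heartbeat_job_update(self, update_id):
        self.calls.append(("heartbeatJobUpdate", update_id))

    def pause_job_update(self, update_id):
        self.calls.append(("pauseJobUpdate", update_id))


def monitor_step(client, update_id, is_healthy):
    """One iteration of the monitoring loop for a single update."""
    if is_healthy():
        # Healthy: keep greenlighting the update.
        client.heartbeat_job_update(update_id)
        return "heartbeat"
    # Unhealthy: pause explicitly rather than merely going silent, so the
    # pause is persisted instead of depending on a heartbeat timeout.
    client.pause_job_update(update_id)
    return "paused"
```

If the monitoring service itself dies or loses connectivity, neither call is made, and the scheduler simply stops proceeding because heartbeats are absent.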
On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney wrote:
Chatted with Maxim and Bill, I think we figured it out.
I think the confusion stems from the fact that there are two types of
pauses in this system: explicit, persisted pauses generated by the
pauseJobUpdate RPC, and implicit, volatile pauses caused by the absence
of a sufficiently fresh heartbe
I think we should assess that after building the rest of the feature. IIUC
the rest of the code doesn't care if the update is initially paused.
-=Bill
On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko
wrote:
Can we get a consensus here? Looks like the only sticking point left is
around starting an update in paused vs. non-paused state. I can argue
either way as it's easy to add later if needed.
On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner wrote:
I'm not arguing against the merits of the approach. Just feeling out
whether that should be done _after_ the rest of the heartbeat support.
Seems like it can be cleanly added at the end to get something usable
earlier.
-=Bill
On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney wrote:
I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
signal, and punting on a more complex NACK signal (which we'd have to
reliably persist).
I think the only disagreement in this thread is whether the default state
for a new update should be running or waiting-for-heartbeat. I th
Wait - simpler solution than what? We're talking about not doing either.
-=Bill
On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney wrote:
I think waiting for the first heartbeat before taking any action is the
simpler solution here as it allows the implementation to be entirely
soft-state and still catches the bugs I described.
The implementation is just PulseMonitorImpl - heartbeat calls
pulse and mutation operations check isAlive.
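As a rough illustration of that soft-state design, here is a small Python stand-in. It is not the actual PulseMonitorImpl; the class, the `mutate_if_alive` helper and the injectable clock are assumptions made for this sketch:

```python
import time


class PulseMonitor:
    """Soft-state liveness tracker: nothing here is persisted."""

    def __init__(self, expiration_secs, clock=time.monotonic):
        self._expiration_secs = expiration_secs
        self._clock = clock
        self._last_pulse = {}  # update ID -> time of the last heartbeat

    def pulse(self, update_id):
        """Called by the heartbeat RPC handler."""
        self._last_pulse[update_id] = self._clock()

    def is_alive(self, update_id):
        """False until the first pulse arrives, and again once it goes stale."""
        last = self._last_pulse.get(update_id)
        return last is not None and self._clock() - last < self._expiration_secs


def mutate_if_alive(monitor, update_id, mutation):
    """Run a mutation only while heartbeats are fresh; otherwise do nothing."""
    if not monitor.is_alive(update_id):
        return False
    mutation()
    return True
```

Since an unknown update ID is treated as not alive, a freshly created update (or one rehydrated after failover) takes no action until its first heartbeat arrives.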
Pausing the update on creation seems like a logical approach when dealing
with an inverted dependency model, i.e. the updater is happy to act as long
as it's greenlighted by the external signal. It's also aligned with a
failover experience where coordinated updates are rehydrated in paused
state waiting for H
If the goal is to reduce complexity now and add features later, why not
nuke both for now - kick off the update right away, and let lack of
heartbeats serve as a uniform "unknown or unhealthy" signal?
-=Bill
On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko wrote:
I am still +1 on the idea to have default paused state on creation. I
think we could still differentiate between initially paused and timed
out states internally by looking at pause reason. It's quite different
if we want to store explicit NACK reasons from the external service
though. That would r
I like the idea of implementing this scheduler-side purely through volatile
state, but the lack of feedback (generic vs specific error messages when an
update is paused) leaves something to be desired. Maybe we can address that
with a metadata field in the initial call to startUpdate (with an optio
The main reason I preferred the lack-of-ACK approach over an explicit
NACK one is simplicity. As Joshua pointed out there is more state to
handle in that case. The lack-of-ACK model can be completely
implemented in volatile memory sidestepping the persistent storage
entirely. With the NACK we would
"The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
to resume when we receive another ACK. In other words, a service toggling
unhealthy might not be deemed safe to proceed."
Lack-of-ACK is the scena
>
> the generic paused-waiting-for-heartbeat message will be quickly replaced
> by "high 502 rate"
From the doc Maxim linked, I don't believe that's the plan:
> External service detects service health problems and stops heartbeats
> Heartbeat timeout occurs. Scheduler pauses the update.
-=Bill
If the service sending the heartbeat RPC is working, the generic
paused-waiting-for-heartbeat message will be quickly replaced by "high 502
rate". If it's not working (or has connectivity issues) we at least won't
give a false sense of progress.
On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner wrote:
Re: user experience, NACK-via-timeout fails here as well.
"PAUSED - Heartbeat not received in 60s" is objectively worse than "PAUSED
- Heartbeat failed: high 502 rate".
This is part of the impedance mismatch I'm calling out.
-=Bill
On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney wrote:
On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner wrote:
What is the guidance for deploying while the heartbeat service is broken?
I think I know the answer, but it's important to spell out.
> Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> state to avoid any progress until the first heartbeat call arrives.
I'm not sold on th
Agreed. That would be a logical generalization of the post failover behavior.
I have updated the above document with the following changes:
- Reply with PAUSED any time a job was paused by user;
- Start in paused state by default.
On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney wrote:
The doc mentioned that the scheduler will start an update subject to the
heartbeat countdown, and if it doesn't receive a heartbeat it will pause
the update. Why not start with the update paused-due-to-no-heartbeat to
fail-fast any connectivity issues between the service providing the
heartbeats an
I presume you are referring to case #5?
Joshua correctly pointed to the assumption I made (I should probably have
documented it): that the pause/resume actions would issue the relevant
stop/start calls for the monitoring service to either suspend or resume
heartbeats. You may argue it applies unnecess
In theory when the user resumes they'd also resume monitoring (and thus
resume heartbeats)? Maybe the resumeJobUpdate RPC needs to support
pauseIfNoHeartbeatsAfterMs as well?
I'm not sure what a monitoring/heartbeating service would do with its
record of a paused job while it's paused. How would i
- A heartbeatJobUpdate RPC is called with the matching update ID.
Scheduler resets countdown and responds with STOP
Paused is a tricky state because the user can resume at any time. I'd
propose we have a different response here. You really don't want to "stop"
monitoring the update while it
Hi all,
We are proposing a new feature for the scheduler updater, which you
may find helpful.
I have posted a brief feature summary here:
https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
Please reply with your feedback/concerns/comments.
Thanks,
Maxim