> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote:
> > src/messages/messages.proto
> > Lines 637-638 (patched)
> > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637>
> >
> >     I think that this list should include pending operations, but should 
> > _not_ include terminal operations with unacknowledged updates.
> >     
> >     When the master receives an update, it needs to decide whether or not 
> > it should update the allocator. If the operation was pending when 
> > `UpdateSlaveMessage` was sent, then the master should update the allocator. 
> > If the operation was not pending but its update had not been acknowledged, 
> > then the result of that operation was already reflected in the last 
> > `UpdateSlaveMessage` from the agent, and the master should not update the 
> > allocator.
> >     
> >     If the agent includes only pending operations in this field, then the 
> > master will be able to make this distinction.
> 
> Jie Yu wrote:
>     The reason we need to send terminal but unacked operations is for 
> reconciliation. Master always keeps a cache about all the known non-completed 
> operations (either pending or terminal but unacked). This information can be 
> used to answer reconciliation requests.
>     
>     The master can distinguish between a pending operation vs. a terminal but 
> unacked operation by looking at latest state of the operation. The master 
> update the allocator if the latest state if terminal, irrespective whether 
> it's acked or not.
>     
>     That does bring up an issue. Consider the following two cases:
>     
>     1) Master knows about an operation that the agent does not know about. 
> This can happen if the operation gets lost on its way to the agent. This will 
> trigger an agent re-registration. The master in this case will need to 
> generate a reconcile message (similar to that in `reconcileKnownSlave`) so 
> that the agent or LRP can reply with a terminal status update. This will 
> guarantee that the resources allocated for this operation are properly 
> recovered to allocator.
>     
>     This case is also possible if there is a speculation failure on old 
> operations and result in an `UpdateSlaveMessage` being sent to the master. 
> The operation that master knows about (but agent does not) will eventually 
> reach agent or LRP, and will be rejected by the LRP or the agent. The master 
> will properly recovered the resources to the allocator when a failed update 
> is received.
>     
>     2) Agent knows about an operation that the master does not know about. 
> For instance, This is a new master that just fails over, and an LRP 
> re-registers with the agent. In that case, the master needs to update the 
> 'allocated' (i.e., 'used') resources for those new operations. This is 
> similar to what we do when slave registers. However, the tricky part is that 
> the allocator does not have an interface currently allowing us to add more 
> allocation to a given agent. We likely need to add that (or make it part of 
> `Allocator::updateSlave` interface).
> 
> Greg Mann wrote:
>     OK, I see why we need to include unacked updates as well. Thanks Jie!
> 
> Greg Mann wrote:
>     Hey Jie, I have another question related to this issue so I re-opened it 
> for discussion.
>     
>     Previously you wrote "The master update the allocator if the latest state 
> if terminal, irrespective whether it's acked or not."
>     This implies that subsequent retries of the update will _not_ have the 
> `latest_state` set; otherwise, the master would not be able to distinguish 
> between the first terminal status update and subsequent retries. Was it your 
> intention to suggest that behavior? Looking at the `Slave::forward` helper 
> that we use when intitializing the status update manager, it looks like we 
> _always_ set `latest_state` for task status updates: 
> https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/slave/slave.cpp#L4867-L4870
>     
>     I suppose we could set the latest state the first time an operation 
> update is sent to master, and then _not_ set it in subsequent retries. Then 
> the value of `.has_latest_state()` would be a direct indication of whether or 
> not an update is a retry.
> 
> Greg Mann wrote:
>     Ah, I see that `latest_status` is optional, so I suspect this was indeed 
> your intention. Just want to confirm :)

Greg, the allocator will be updated only if the state is transitioning from non 
terminal to terminal:
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L9122

No need to strip latest_state when sending the update.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63001/#review188162
-----------------------------------------------------------


On Oct. 19, 2017, 12:08 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63001/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2017, 12:08 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, 
> Greg Mann, Jan Schlicht, and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Updated protobuf definitions related to offer operations.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 859fdff4d9a0604bc506b08af79075084ae23466 
>   include/mesos/resource_provider/resource_provider.proto 
> f5a9073075327019fd133bd51265f695ef464845 
>   include/mesos/v1/mesos.proto cfd4abd3af1d8c9fbd31659161eada9ec9f92282 
>   include/mesos/v1/resource_provider/resource_provider.proto 
> e5cbede5b6e57a8641fca1ebfee5454f292cc24c 
>   src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f 
>   src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 
>   src/resource_provider/validation.cpp 
> d2927227f60ab0d4ae2481ad73a31ee444b48ee0 
>   src/tests/resource_provider_validation_tests.cpp 
> f182bff4670318e9de22c2915c5dbb423a74ad6c 
> 
> 
> Diff: https://reviews.apache.org/r/63001/diff/9/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Reply via email to