> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote:
> > src/messages/messages.proto
> > Lines 637-638 (patched)
> > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637>
> >
> >     I think that this list should include pending operations, but should 
> > _not_ include terminal operations with unacknowledged updates.
> >     
> >     When the master receives an update, it needs to decide whether or not 
> > it should update the allocator. If the operation was pending when 
> > `UpdateSlaveMessage` was sent, then the master should update the allocator. 
> > If the operation was not pending but its update had not been acknowledged, 
> > then the result of that operation was already reflected in the last 
> > `UpdateSlaveMessage` from the agent, and the master should not update the 
> > allocator.
> >     
> >     If the agent includes only pending operations in this field, then the 
> > master will be able to make this distinction.
> 
> Jie Yu wrote:
>     The reason we need to send terminal but unacked operations is for 
> reconciliation. Master always keeps a cache about all the known non-completed 
> operations (either pending or terminal but unacked). This information can be 
> used to answer reconciliation requests.
>     
>     The master can distinguish between a pending operation vs. a terminal but 
> unacked operation by looking at latest state of the operation. The master 
> update the allocator if the latest state if terminal, irrespective whether 
> it's acked or not.
>     
>     That does bring up an issue. Consider the following two cases:
>     
>     1) Master knows about an operation that the agent does not know about. 
> This can happen if the operation gets lost on its way to the agent. This will 
> trigger an agent re-registration. The master in this case will need to 
> generate a reconcile message (similar to that in `reconcileKnownSlave`) so 
> that the agent or LRP can reply with a terminal status update. This will 
> guarantee that the resources allocated for this operation are properly 
> recovered to allocator.
>     
>     This case is also possible if there is a speculation failure on old 
> operations and result in an `UpdateSlaveMessage` being sent to the master. 
> The operation that master knows about (but agent does not) will eventually 
> reach agent or LRP, and will be rejected by the LRP or the agent. The master 
> will properly recovered the resources to the allocator when a failed update 
> is received.
>     
>     2) Agent knows about an operation that the master does not know about. 
> For instance, This is a new master that just fails over, and an LRP 
> re-registers with the agent. In that case, the master needs to update the 
> 'allocated' (i.e., 'used') resources for those new operations. This is 
> similar to what we do when slave registers. However, the tricky part is that 
> the allocator does not have an interface currently allowing us to add more 
> allocation to a given agent. We likely need to add that (or make it part of 
> `Allocator::updateSlave` interface).

OK, I see why we need to include unacked updates as well. Thanks Jie!


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63001/#review188162
-----------------------------------------------------------


On Oct. 17, 2017, 6:52 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63001/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2017, 6:52 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, 
> Greg Mann, Jan Schlicht, and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Updated protobuf definitions related to offer operations.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto ba87339dbe341f4d16ceea74adc09647a3c07f32 
>   include/mesos/resource_provider/resource_provider.proto 
> f5a9073075327019fd133bd51265f695ef464845 
>   include/mesos/v1/mesos.proto a6d662fb26aa4f78ef20ffe6e013f7a45f7f8c21 
>   include/mesos/v1/resource_provider/resource_provider.proto 
> e5cbede5b6e57a8641fca1ebfee5454f292cc24c 
>   src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f 
>   src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 
>   src/resource_provider/validation.cpp 
> d2927227f60ab0d4ae2481ad73a31ee444b48ee0 
>   src/tests/resource_provider_validation_tests.cpp 
> f182bff4670318e9de22c2915c5dbb423a74ad6c 
> 
> 
> Diff: https://reviews.apache.org/r/63001/diff/6/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Reply via email to