> On Oct. 16, 2017, 6:33 p.m., Greg Mann wrote: > > src/messages/messages.proto > > Lines 637-638 (patched) > > <https://reviews.apache.org/r/63001/diff/1-2/?file=1856406#file1856406line637> > > > > I think that this list should include pending operations, but should > > _not_ include terminal operations with unacknowledged updates. > > > > When the master receives an update, it needs to decide whether or not > > it should update the allocator. If the operation was pending when > > `UpdateSlaveMessage` was sent, then the master should update the allocator. > > If the operation was not pending but its update had not been acknowledged, > > then the result of that operation was already reflected in the last > > `UpdateSlaveMessage` from the agent, and the master should not update the > > allocator. > > > > If the agent includes only pending operations in this field, then the > > master will be able to make this distinction. > > Jie Yu wrote: > The reason we need to send terminal but unacked operations is for > reconciliation. Master always keeps a cache about all the known non-completed > operations (either pending or terminal but unacked). This information can be > used to answer reconciliation requests. > > The master can distinguish between a pending operation vs. a terminal but > unacked operation by looking at latest state of the operation. The master > update the allocator if the latest state if terminal, irrespective whether > it's acked or not. > > That does bring up an issue. Consider the following two cases: > > 1) Master knows about an operation that the agent does not know about. > This can happen if the operation gets lost on its way to the agent. This will > trigger an agent re-registration. The master in this case will need to > generate a reconcile message (similar to that in `reconcileKnownSlave`) so > that the agent or LRP can reply with a terminal status update. This will > guarantee that the resources allocated for this operation are properly > recovered to allocator. > > This case is also possible if there is a speculation failure on old > operations and result in an `UpdateSlaveMessage` being sent to the master. > The operation that master knows about (but agent does not) will eventually > reach agent or LRP, and will be rejected by the LRP or the agent. The master > will properly recovered the resources to the allocator when a failed update > is received. > > 2) Agent knows about an operation that the master does not know about. > For instance, This is a new master that just fails over, and an LRP > re-registers with the agent. In that case, the master needs to update the > 'allocated' (i.e., 'used') resources for those new operations. This is > similar to what we do when slave registers. However, the tricky part is that > the allocator does not have an interface currently allowing us to add more > allocation to a given agent. We likely need to add that (or make it part of > `Allocator::updateSlave` interface).
OK, I see why we need to include unacked updates as well. Thanks Jie! - Greg ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63001/#review188162 ----------------------------------------------------------- On Oct. 17, 2017, 6:52 p.m., Jie Yu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63001/ > ----------------------------------------------------------- > > (Updated Oct. 17, 2017, 6:52 p.m.) > > > Review request for mesos, Benjamin Bannier, Benjamin Mahler, Gaston Kleiman, > Greg Mann, Jan Schlicht, and Vinod Kone. > > > Repository: mesos > > > Description > ------- > > Updated protobuf definitions related to offer operations. > > > Diffs > ----- > > include/mesos/mesos.proto ba87339dbe341f4d16ceea74adc09647a3c07f32 > include/mesos/resource_provider/resource_provider.proto > f5a9073075327019fd133bd51265f695ef464845 > include/mesos/v1/mesos.proto a6d662fb26aa4f78ef20ffe6e013f7a45f7f8c21 > include/mesos/v1/resource_provider/resource_provider.proto > e5cbede5b6e57a8641fca1ebfee5454f292cc24c > src/messages/messages.proto 0a32b3457e9143a7d48670610ca3e56dd516136f > src/resource_provider/manager.cpp 31fcb789f5ab907511e868c374c49f7457a33ed3 > src/resource_provider/validation.cpp > d2927227f60ab0d4ae2481ad73a31ee444b48ee0 > src/tests/resource_provider_validation_tests.cpp > f182bff4670318e9de22c2915c5dbb423a74ad6c > > > Diff: https://reviews.apache.org/r/63001/diff/6/ > > > Testing > ------- > > make check > > > Thanks, > > Jie Yu > >
