> On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote: > > src/master/master.cpp > > Lines 7107-7121 (patched) > > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7107> > > > > Do you need to also update allocator for added or removed new > > operations? > > > > For instance, the allocator currently think the new operation A uses > > 2cpus. Now, if A is removed (because it's dropped), do we need to tell the > > allocator that the 2cpus are no longer used and they can be allocated to > > others?
Yes, this was missing. I added a method to inform the allocator of allocations after the fact and adjusted this patch. > On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote: > > src/master/master.cpp > > Lines 7108 (patched) > > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7108> > > > > Think about the case where agent crashes and restarts, not all RP has > > re-registered yet. In that case, some operation from some not yet > > re-registered RP will not be part of this operation list. > > > > I don't think we want to remove those operations just yet. I think we > > should remove those operations only if the corresponding RP has > > re-registered with the agent and show up in the offer operation list (or > > total resources). > > > > This is similar to we don't remove tasks when agent disconnects. In > > fact, you should follow the similar patter in reconcileKnownSlave here. If > > an operation is unknown, instead of calling `removeOfferOperation` > > directly, we should probably send a `reconcileOfferOperationMessage` to the > > agent to ask the RP to generate a status update, and rely on status update > > handler to properly handle the resource accounting. > > > > Also realized that we probably should also do the same in the agent > > code. Instead of directly calling removeOfferOperation and > > addOfferOperation, send a `RECONCILE` message to RP to asking the RP to > > generate a status udpate. If it's unknown to the RP, RP will send > > OFFER_OPERATION_DROPPED, which is terminal. > > Benjamin Bannier wrote: > I am not sure it should be the master's responsibility to keep these > operations alive since it e.g., has no visibility into whether or when a > resource provider might come back. It seems to me that we should instead let > the agent decide when to prune operations from removed resource providers > from the list of operations processed here. While it currently would not send > operations for not yet resubscribed RPs after agent failover, these would > appear together with the RP resources once it resubscribes, so I think the > difference here is purely semantic. For the case of RP failover the agent > will send operations from disconnected RPs with the list processed here. > > In the end I feel the scenario of agent failover's task analogue is > rather a master failover with some agent not yet registered. There we would > not be able to provide information about tasks running on disconnected agents. As we discussed offline, we will now only remove disappeared operations in the master if the resource provider is registered with the agent (we infer that from the total resources the agent reports). I adjusted the patch accordingly. - Benjamin ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63732/#review191135 ----------------------------------------------------------- On Nov. 21, 2017, 10:09 p.m., Benjamin Bannier wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63732/ > ----------------------------------------------------------- > > (Updated Nov. 21, 2017, 10:09 p.m.) > > > Review request for mesos, Jie Yu and Jan Schlicht. > > > Bugs: MESOS-8207 > https://issues.apache.org/jira/browse/MESOS-8207 > > > Repository: mesos > > > Description > ------- > > See summary. > > > Diffs > ----- > > src/master/master.cpp 7417b5d641fd4bb6d91cb0e6456c60201bbc8206 > > > Diff: https://reviews.apache.org/r/63732/diff/4/ > > > Testing > ------- > > `make check`, tested as part of https://reviews.apache.org/r/63843/. > > > Thanks, > > Benjamin Bannier > >
