> On March 30, 2016, 2:08 a.m., Adam B wrote:
> > src/tests/slave_tests.cpp, lines 3541-3545
> > <https://reviews.apache.org/r/45474/diff/3/?file=1318849#file1318849line3541>
> >
> >     Here you shut down the slave and wait (you'll probably want to advance
> >     the clock rather than wait for 90s) for the slave to be declared
> >     SLAVE_LOST. Once this occurs, the master will no longer allow the slave to
> >     reregister with the same slaveId, and the slave will be told to kill all
> >     running tasks. The slave will do so and then restart and register as a new
> >     slaveId.
> >     This is what is meant by the quote from the design doc: "Currently this
> >     can only be handled by stopping / draining a mesos slave entirely (Killing
> >     all of its running jobs), removing it from the cluster, then bringing it
> >     back up as a brand new slave."
> >
> >     To truly observe this behavior, you should start a task on the slave
> >     before you shut it down. Then you will see a TASK_LOST and the task will be
> >     killed.
>
> Deshi Xiao wrote:
>     Thanks Adam, I will update the test case.
>
> Deshi Xiao wrote:
>     @Adam B
>     I'm confused here and need your guidance. If the test case tracks the
>     TASK_LOST across a slave restart, do we expect the slaveId to stay valid
>     (i.e. not become outdated)?
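On the earlier point about advancing the clock instead of waiting out the 90s: here is a minimal, self-contained sketch of the idea. It is not Mesos code. `FakeClock`, `ToyMaster` and `SlaveState` are hypothetical names, and the 30s x 3 values are picked only so the product equals the 90s mentioned above; they are not claimed to be the Mesos defaults. In the real test you would use the libprocess `Clock::pause()` / `Clock::advance()` helpers rather than anything like this.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Hypothetical pausable clock: time only moves when advance() is called,
// so a test never has to sleep for the real timeout.
struct FakeClock {
  seconds now{0};
  void advance(seconds d) { now += d; }
};

enum class SlaveState { REGISTERED, DISCONNECTED, LOST };

// Toy model of the master's view of a single slave (illustration only).
class ToyMaster {
public:
  ToyMaster(const FakeClock& clock, seconds pingTimeout, int maxPingTimeouts)
    : clock_(clock), lostAfter_(pingTimeout * maxPingTimeouts) {}

  // Record the moment the slave process went away.
  void slaveDisconnected() {
    state_ = SlaveState::DISCONNECTED;
    disconnectedAt_ = clock_.now;
  }

  // Re-evaluate the state against the current (fake) time.
  SlaveState state() {
    if (state_ == SlaveState::DISCONNECTED &&
        clock_.now - disconnectedAt_ >= lostAfter_) {
      state_ = SlaveState::LOST;
    }
    return state_;
  }

  // Re-registration with the old slaveId succeeds only before the
  // slave has been declared lost.
  bool reregister() {
    if (state() == SlaveState::LOST) {
      return false;
    }
    state_ = SlaveState::REGISTERED;
    return true;
  }

  seconds lostAfter() const { return lostAfter_; }

private:
  const FakeClock& clock_;
  seconds lostAfter_;
  SlaveState state_ = SlaveState::REGISTERED;
  seconds disconnectedAt_{0};
};
```

A test built on this would call `slaveDisconnected()`, advance the fake clock by `lostAfter()`, and then check that `reregister()` is refused; advancing by anything less should still allow the slave back in with its old slaveId.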

Desired behavior: Operator can kill a slave process and restart it with new
--attributes. Existing tasks will continue to run. No TASK_LOST or SLAVE_LOST
message is sent. The slaveId remains the same. Outstanding offers from that
slave will be rescinded, and those offers will be remade with the updated
attributes.

Current behavior 1: Operator shuts down a slave process and restarts it with
--recover=cleanup, which kills all its tasks, clears the work_dir, and notifies
the master that the old slaveId is "shutdown" and will never be reused again
(SLAVE_LOST, offers rescinded, TASK_KILLED/LOST). Operator then restarts the
slave with new --attributes; it gets a new slaveId, and new offers will be made
with the new slaveId and updated attributes.

Current behavior 2: Slave process dies or is killed and tries to restart with
new --attributes. It errors on recovery.

Current behavior 3: Slave process dies or is killed and doesn't reregister
within `slave_ping_timeout * max_slave_ping_timeouts` (90s). Master considers it
gone and sends SLAVE_LOST and TASK_LOST. Future attempts to reregister with the
same slaveId fail. Slave must be cleaned up (tasks killed, work_dir removed) so
it can register with a new slaveId (and new attributes).


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45474/#review126066
-----------------------------------------------------------


On March 30, 2016, 1:13 a.m., Deshi Xiao wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45474/
> -----------------------------------------------------------
>
> (Updated March 30, 2016, 1:13 a.m.)
>
>
> Review request for mesos, Adam B, Greg Mann, haosdent huang, and Jiang Yan Xu.
>
>
> Bugs: MESOS-1739
>     https://issues.apache.org/jira/browse/MESOS-1739
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Phase 1
> Make SlaveInfo mutable throughout the stack, and allow for expansion of
> resources and attributes only (which allows testing to make sure it
> propagates to the allocator, shows up in offers, etc). Ensure there is
> unified checking for incompatibilities in both the slave and master (the
> slave should validate the config, the master should validate that all
> operations the slave takes are legal).
>
> It is derived from another PR (https://reviews.apache.org/r/25525/).
>
>
> Diffs
> -----
>
>   src/tests/slave_tests.cpp 1f1a31020096efa5db698e86ac74e61dfdb4b94a
>
> Diff: https://reviews.apache.org/r/45474/diff/
>
>
> Testing
> -------
>
> make check on localhost
>
>
> Thanks,
>
> Deshi Xiao
>
>
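
For the Phase 1 rule quoted above ("expansion of resources and attributes only"), the unified compatibility check could look roughly like the sketch below. This is only an illustration under my own assumptions: `ToySlaveInfo` and `isExpansionOnly` are hypothetical names, resources are reduced to named scalars, and I am reading "expansion" as "existing resources may grow, existing attributes must keep their values, new entries of either kind may be added". The real SlaveInfo protobuf is richer than this.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for SlaveInfo: scalar resources and
// text attributes only.
struct ToySlaveInfo {
  std::map<std::string, double> resources;
  std::map<std::string, std::string> attributes;
};

// Returns true if `updated` only adds to or grows what `previous` had.
// Assumption: shrinking or dropping a resource, or changing or dropping
// an existing attribute, counts as an incompatible change.
bool isExpansionOnly(const ToySlaveInfo& previous,
                     const ToySlaveInfo& updated) {
  for (const auto& kv : previous.resources) {
    auto it = updated.resources.find(kv.first);
    if (it == updated.resources.end() || it->second < kv.second) {
      return false;  // Resource removed or reduced.
    }
  }

  for (const auto& kv : previous.attributes) {
    auto it = updated.attributes.find(kv.first);
    if (it == updated.attributes.end() || it->second != kv.second) {
      return false;  // Attribute removed or changed.
    }
  }

  return true;
}
```

The point of keeping this in one function is that the slave (validating its new config on restart) and the master (validating the reregistration) could share the same check, so the two sides cannot disagree about what is legal.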
