Re: Review Request 45474: MESOS-1739: Allow slave reconfiguration on restart, Phase 1.

Adam B Wed, 18 May 2016 12:27:05 -0700


> On March 30, 2016, 2:08 a.m., Adam B wrote:
> > src/tests/slave_tests.cpp, lines 3541-3545
> > <https://reviews.apache.org/r/45474/diff/3/?file=1318849#file1318849line3541>
> >
> >     Here you shut down the slave and wait (you'll probably want to advance 
> > the clock rather than wait for 90s) for the slave to be declared 
> > SLAVE_LOST. Once this occurs, the master will no longer allow the slave to 
> > reregister with the same slaveId, and the slave will be told to kill all 
> > running tasks. The slave will do so and then restart and register as a new 
> > slaveId. 
> >     This is what is meant by the quote from the design doc: "Currently this 
> > can only be handled by stopping / draining a mesos slave entirely (Killing 
> > all of its running jobs), removing it from the cluster, then bringing it 
> > back up as a brand new slave."
> >     
> >     To truly observe this behavior, you should start a task on the slave 
> > before you shut it down. Then you will see a TASK_LOST and the task will be 
> > killed.
> 
> Deshi Xiao wrote:
>     Thanks Adam, I will update the test case.
> 
> Deshi Xiao wrote:
>     @Adam B
>     I am confused here and need your guidance. If the test case tracks 
> TASK_LOST when the slave restarts, do we expect the slave_id to stay valid?


Desired behavior: Operator can kill a slave process and restart it with new 
--attributes. Existing tasks will continue to run. No TASK_LOST or SLAVE_LOST 
message is sent. The slaveId remains the same. Outstanding offers from that 
slave will be rescinded, and those offers will be remade with the updated 
attributes.
Current behavior 1: Operator shuts down a slave process and restarts it with 
--recover=cleanup, which kills all its tasks, clears the work_dir, and notifies 
the master that the old slaveId is "shutdown" and will never be reused 
(SLAVE_LOST, offers rescinded, TASK_KILLED/LOST). Operator then restarts the 
slave with new --attributes; it gets a new slaveId, and new offers are made 
with the new slaveId and updated attributes.
Current behavior 2: Slave process dies or is killed and tries to restart with 
new --attributes. The slave errors out on recovery.
Current behavior 3: Slave process dies or is killed and doesn't reregister 
within `slave_ping_timeout*max_slave_ping_timeouts` (90s). Master considers it 
gone and sends SLAVE_LOST and TASK_LOST. Future attempts to reregister with the 
same slaveId fail. Slave must be cleaned up (tasks killed, work_dir removed) so 
it can register with a new slaveId (and new attributes).
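
To make the difference between the desired and current behaviors easier to 
check in a test, here is a rough sketch of the cases above as a small decision 
model. This is not Mesos code: `Outcome`, `Restart`, `removalTimeout`, 
`currentBehavior`, and `desiredBehavior` are hypothetical names made up for 
illustration. The sketch also assumes that the desired behavior leaves the 
cleanup and timeout cases unchanged, which the list above does not state 
explicitly, and it folds "reregistration fails, then cleanup" into a single 
"new slaveId" outcome.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of the scenarios above; none of these names are Mesos APIs.
enum class Outcome {
  KeepsSlaveId,   // reregisters with the same slaveId; tasks keep running
  RecoveryError,  // slave errors out on recovery because attributes changed
  NewSlaveId      // old slaveId is retired; tasks are lost; registers as new
};

struct Restart {
  std::chrono::seconds downtime;  // time before the slave tries to reregister
  bool attributesChanged;         // restarted with different --attributes
  bool recoverCleanup;            // operator ran --recover=cleanup first
};

// Window after which the master considers the slave gone:
// slave_ping_timeout * max_slave_ping_timeouts.
std::chrono::seconds removalTimeout(std::chrono::seconds pingTimeout,
                                    int maxPingTimeouts) {
  assert(maxPingTimeouts > 0);
  return pingTimeout * maxPingTimeouts;
}

// What happens today, following current behaviors 1 to 3.
Outcome currentBehavior(const Restart& r, std::chrono::seconds timeout) {
  if (r.recoverCleanup) return Outcome::NewSlaveId;        // behavior 1
  if (r.downtime >= timeout) return Outcome::NewSlaveId;   // behavior 3
  if (r.attributesChanged) return Outcome::RecoveryError;  // behavior 2
  return Outcome::KeepsSlaveId;
}

// What the change aims for: changed attributes alone no longer cost the
// slaveId. Assumption: cleanup and timeout cases stay as they are today.
Outcome desiredBehavior(const Restart& r, std::chrono::seconds timeout) {
  if (r.recoverCleanup || r.downtime >= timeout) return Outcome::NewSlaveId;
  return Outcome::KeepsSlaveId;
}
```

For example, `removalTimeout(std::chrono::seconds(15), 6)` gives the 90s 
window; those two inputs are chosen only so the product matches the figure 
quoted above and are not meant as the actual flag values. A restart with 
changed attributes after 10s of downtime maps to `RecoveryError` under 
`currentBehavior` and to `KeepsSlaveId` under `desiredBehavior`, which is the 
case the new test should cover.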


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45474/#review126066
-----------------------------------------------------------


On March 30, 2016, 1:13 a.m., Deshi Xiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45474/
> -----------------------------------------------------------
> 
> (Updated March 30, 2016, 1:13 a.m.)
> 
> 
> Review request for mesos, Adam B, Greg Mann, haosdent huang, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-1739
>     https://issues.apache.org/jira/browse/MESOS-1739
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Phase 1
> Make SlaveInfo mutable throughout the stack, and allow for expansion of 
> resources and attributes only (which allows testing to make sure it 
> propagates to the allocator, shows up in offers, etc). Ensure there is 
> unified checking for incompatibilities in both the slave and master (the 
> slave should validate the config, the master should validate that all 
> operations the slave takes are legal).
> 
> It is derived from another PR (https://reviews.apache.org/r/25525/).
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp 1f1a31020096efa5db698e86ac74e61dfdb4b94a 
> 
> Diff: https://reviews.apache.org/r/45474/diff/
> 
> 
> Testing
> -------
> 
> make check on localhost
> 
> 
> Thanks,
> 
> Deshi Xiao
> 
>
