Re: Review Request 54183: Improved management of unreachable and completed tasks in master.

Neil Conway Mon, 09 Jan 2017 14:38:55 -0800


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.hpp, lines 2634-2641
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588709#file1588709line2634>
> >
> >     this reads like `unreachableTasks` are completed tasks of PA 
> > frameworks. can you split this comment up and move the unreachable specific 
> > comments to #2644?


I added an additional comment describing the purpose of `unreachableTasks`, but 
I felt like it was useful to still keep a note in the `completedTasks` comment 
to describe which tasks are stored in `unreachableTasks` vs `completedTasks` 
when an agent is marked unreachable. Let me know if you think this is still 
confusing.


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5496
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588710#file1588710line5496>
> >
> >     it's not necessary that the agent has failed over if we are here right?

Do you mean "it's not necessary that the master has failed over if we are here"?

If so, that's right: the master may or may not have re-registered in the time 
since the agent was marked unreachable. The comment talks about how these two 
cases are handled differently.


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5507
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588710#file1588710line5507>
> >
> >     can we inline this?

We could, but personally I find it more readable to make it a separate 
function. The logic is a bit involved here (since we need to check two 
different data structures due to backward compatibility requirements), and 
there's already a lot going on in `Master::_reregisterSlave`.


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5539
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588710#file1588710line5539>
> >
> >     what if the agent was marked unreachable by the previous master? do we 
> > still want to add all its tasks?

If the master has failed over, all tasks on the agent will be allowed to 
continue running (whether partition-aware or not) -- so that's why we re-add 
all tasks here.


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/tests/partition_tests.cpp, lines 1819-1847
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588711#file1588711line1819>
> >
> >     hmm. it is unfortunate that we send TASK_LOST followed by TASK_FINISHED 
> > and then send TASK_LOST on reconciliation. can you remind me why the master 
> > forwards status updates for unknown tasks? looks like it can just drop them 
> > if the reason for doing so is no longer valid.

I don't know why we forward status updates for unknown tasks. I can see it 
having some value, though -- in the scenario above, the first `TASK_LOST` just 
means the agent become unreachable. `TASK_FINISHED` tells the framework that 
the task actually did complete successfully. I can imagine situations where, 
say, the framework ignores `TASK_LOST` or uses a large timeout before assuming 
the task is "really lost", whereas `TASK_FINISHED` / `TASK_KILLED` / etc. is a 
more definitive signal.


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5523
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588710#file1588710line5523>
> >
> >     not sure if this is worth making into a function?

I thought it was a bit more readable (against, `Master::_reregisterSlave` has a 
lot of stuff going on), but happy to change it if you disagree.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54183/#review160019
-----------------------------------------------------------


On Dec. 18, 2016, 11:29 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54183/
> -----------------------------------------------------------
> 
> (Updated Dec. 18, 2016, 11:29 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6619
>     https://issues.apache.org/jira/browse/MESOS-6619
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Before partition-awareness, when an agent failed health checks, the
> master removed the agent from the registry, marked all of its tasks
> TASK_LOST, and moved them to the `completedTasks` list in the master's
> memory. Although "lost" tasks might still be running, partitioned agents
> would only be allowed to re-register if the master failed over, in which
> case the `completedTasks` map would be emptied.
> 
> When partition-awareness was introduced, we initially followed the same
> scheme, with the only difference that partition-aware tasks are marked
> TASK_UNREACHABLE, not TASK_LOST.
> 
> This scheme has a few shortcomings. First, partition-aware tasks might
> resume running when the partitioned agent re-registers. Second, we
> re-added non-partition aware tasks when the agent re-registered but then
> marked them completed when the framework is shutdown, resulting in two
> entries in `completedTasks`.
> 
> This commit introduces a separate bounded map, `unreachableTasks`. These
> tasks are reported separately via the HTTP endpoints, because they have
> different semantics (unlike completed tasks, unreachable tasks can
> resume running). The size of this map is limited by a new master flag,
> `--max_unreachable_tasks_per_framework`. This commit also changes the
> master to omit re-adding non-partition-aware tasks on re-registering
> agents (unless the master has failed over): those tasks will shortly be
> shutdown anyway.
> 
> Finally, this commit fixes a minor bug in the previous code: the
> previous coding neglected to shutdown non-partition-aware frameworks
> running on pre-1.0 Mesos agents that re-register with the master after
> a network partition.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md efe3e9bd9d203a7ba44adf4ead24f14b8b577637 
>   include/mesos/master/master.proto 483dc2f59684481c9f549de9f06eef4dda8a46e1 
>   include/mesos/v1/master/master.proto 
> 09a82af88303a2d971da7c56a7075d7005932363 
>   src/master/constants.hpp 5dd0667f62d2c0617cc0d5aed8cc005bd8344c88 
>   src/master/flags.hpp 6a17b763dc76daa10073394f416b049e97a44238 
>   src/master/flags.cpp e5edf3333d0a0c529a10dc602ef9a88a0ec60c69 
>   src/master/http.cpp ee559804c490d53be2d12f233ae0bfb93ed9f96d 
>   src/master/master.hpp 89b3c394b268a8645885412aeb19980db8d73c69 
>   src/master/master.cpp b664550d57ef9805bd23ea35ca7f9cd8f4b1ab78 
>   src/tests/partition_tests.cpp e1e2025bd0f078836323cbd8c6d7836815c4c38d 
> 
> Diff: https://reviews.apache.org/r/54183/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 54183: Improved management of unreachable and completed tasks in master.

Reply via email to