> On Jan. 31, 2020, 12:50 p.m., Qian Zhang wrote:
> > src/launcher/default_executor.cpp
> > Lines 1089-1098 (original), 1095-1104 (patched)
> > <https://reviews.apache.org/r/72029/diff/4/?file=2210076#file2210076line1095>
> >
> >     I see `_shutdown` will be called in some error cases, like:
> >
> >     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
> >
> >     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L1041:L1044
> >     So for such cases the previous behavior is self terminate just after
> >     sleeping 1 second, but now it is after sleeping 60 seconds with your patch.
> >     I do not think we should sleep so long before self termination for those
> >     cases.
>
> Andrei Budnik wrote:
>     Updated.
>
> Qian Zhang wrote:
>     I see you have updated `_shutdown` to:
>     ```
>     void _shutdown()
>     {
>       if (unacknowledgedUpdates.empty()) {
>         terminate(self());
>       } else {
>         // This is a fail safe in case the agent doesn't send an ACK for
>         // a status update for some reason.
>         const Duration duration = Seconds(60);
>
>         LOG(INFO) << "Terminating after " << duration;
>
>         delay(duration, self(), &Self::__shutdown);
>       }
>     }
>     ```
>     That's also what I thought, and I think it can handle the following cases
>     well.
>
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
>
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L406:L408
>
>     But what about the cases like below?
>
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L559:L565
>
>     In such cases, `unacknowledgedUpdates` is likely not empty and agent has
>     failed (i.e. no ACKs can be sent to the executor), so executor will sleep 60s
>     before self termination, but I think the executor should self terminate
>     immediately in this case instead, HDYT?

I think it makes sense to wait for 1 minute before terminating in this
particular case. If the connection is lost due to an agent restart, then there
is a high chance that it'll reconnect to the executor later. So it'd be nice to
give the executor a chance to resend all unacknowledged status updates
(TASK_STARTING). Also, I'd say that this case happens rarely.

There are also a few cases where `_shutdown` is called on an internal error or
a bug. If there are unacknowledged status updates in those cases, then we'd
better give the executor a chance to send these status updates as well.


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219448
-----------------------------------------------------------


On Jan. 30, 2020, 3:28 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> -----------------------------------------------------------
>
> (Updated Jan. 30, 2020, 3:28 p.m.)
>
>
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod
> Kone.
>
>
> Bugs: MESOS-8537
>     https://issues.apache.org/jira/browse/MESOS-8537
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor if all status updates have
> been acknowledged by the agent and no running containers left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
>
>
> Diffs
> -----
>
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c
>
>
> Diff: https://reviews.apache.org/r/72029/diff/5/
>
>
> Testing
> -------
>
> internal CI
>
>
> Thanks,
>
> Andrei Budnik
>
>
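
For reference, the termination rule discussed in this thread can be sketched
outside libprocess roughly as follows. This is a minimal standalone model under
stated assumptions, not the Mesos implementation: `decideShutdown`,
`ShutdownAction`, and `ShutdownDecision` are hypothetical names introduced only
for illustration, and status updates are represented by plain strings rather
than real update UUIDs.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

// Hypothetical model of the shutdown decision; not the Mesos code itself.
enum class ShutdownAction { TerminateNow, TerminateAfterDelay };

struct ShutdownDecision {
  ShutdownAction action;
  std::chrono::seconds delay;
};

// Follows the rule from the review: terminate at once when every status
// update has been acknowledged, otherwise fall back to a fail-safe delay.
// The 60-second default is the value quoted in the thread.
inline ShutdownDecision decideShutdown(
    const std::set<std::string>& unacknowledgedUpdates,
    std::chrono::seconds failSafe = std::chrono::seconds(60))
{
  // A negative fail-safe delay would be meaningless in this model.
  assert(failSafe.count() >= 0);

  if (unacknowledgedUpdates.empty()) {
    return ShutdownDecision{ShutdownAction::TerminateNow,
                            std::chrono::seconds(0)};
  }

  return ShutdownDecision{ShutdownAction::TerminateAfterDelay, failSafe};
}
```

With an empty set the sketch yields `TerminateNow`; with at least one pending
update it yields `TerminateAfterDelay` with the fail-safe delay, which
corresponds to the agent-disconnect case debated above.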
