cf-natali opened a new pull request #380:
URL: https://github.com/apache/mesos/pull/380

If the agent is interrupted after garbage collecting the executor's latest run meta directory but before garbage collecting the top-level executor meta directory, the "latest" symlink will dangle, which causes the agent's executor recovery to fail.

Instead, we can simply ignore a dangling "latest" symlink, since it is always created after the latest run directory it points to, and is never deleted until the top-level executor meta directory is garbage collected.

Example logs showing the problem:

Agent GC'ing the directory:

```
I0129 22:38:45.060012 28292 slave.cpp:7107] Executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055' of framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002 exited with status 0
I0129 22:38:45.060871 28292 slave.cpp:7218] Cleaning up executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055' of framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002 at executor(1)@127.0.1.1:40075
[...]
I0129 22:38:45.061872 29250 gc.cpp:95] Scheduling '/tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab' for gc 4.938180864secs in the future
I0129 22:38:45.061939 29250 gc.cpp:95] Scheduling '/tmp/tmp2y330b17mesos_agent_work_dir/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055' for gc 4.93812992secs in the future
[...]
I0129 22:38:50.019327 29251 gc.cpp:272] Deleting /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
I0129 22:38:50.019573 29251 gc.cpp:288] Deleted '/tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab'
```

The agent got killed before it could GC `/tmp/tmp2y330b17mesos_agent_work_dir/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055`.

Then the agent restarted:

```
[...]
E0129 22:38:54.942884 29402 slave.cpp:8355] EXIT with status 1: Failed to perform recovery: Failed to recover framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002: Failed to recover executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055': Failed to find latest run of executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055': No such file or directory
```

We can see that `latest` for executor `task-72954d99-5719-414f-b7d9-5f35c5d70055` points to run `fa5986f6-777b-42fb-88b4-e4ce339c21ab`, which has already been GCed.

```
cf@thinkpad:~/src/mesos$ ls -l /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/latest
lrwxrwxrwx 1 cf cf 235 janv. 29 22:28 /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/latest -> /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
cf@thinkpad:~/src/mesos$ ls -l /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/
total 4
lrwxrwxrwx 1 cf cf 235 janv. 29 22:28 latest -> /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
cf@thinkpad:~/src/mesos$
```

@bbannier

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
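
For readers who want a concrete picture of the behavior proposed in the description (treat a dangling `latest` symlink as "nothing to recover" instead of a fatal error), here is a minimal sketch. It is not the Mesos implementation: `findLatestRun` is a hypothetical helper, and it uses C++17 `std::filesystem` rather than the stout utilities the agent actually relies on. It also assumes, for simplicity, that a missing symlink can be handled the same way as a dangling one.

```cpp
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Mesos): resolves `<executorMetaDir>/runs/latest`.
// Returns the run directory the symlink points to, or std::nullopt when the
// symlink is absent or dangles, so the caller can skip the executor instead
// of aborting recovery.
std::optional<fs::path> findLatestRun(const fs::path& executorMetaDir)
{
  const fs::path latest = executorMetaDir / "runs" / "latest";
  std::error_code ec;

  // symlink_status() inspects the link itself and does not follow it.
  const fs::file_status link = fs::symlink_status(latest, ec);
  if (ec || !fs::is_symlink(link)) {
    return std::nullopt;
  }

  // exists() follows the link, so it is false when the target run
  // directory has already been garbage collected.
  const bool targetExists = fs::exists(latest, ec);
  if (ec || !targetExists) {
    return std::nullopt;
  }

  // Read the target stored in the symlink.
  const fs::path target = fs::read_symlink(latest, ec);
  if (ec) {
    return std::nullopt;
  }

  return target;
}
```

A caller in a recovery loop would check the returned optional and move on to the next executor when it is empty, leaving the leftover executor meta directory for a later GC pass.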
