Hi Aurora,

I am working on a feature that allows health checks to be temporarily disabled for a running instance of a job. The code review is at https://reviews.apache.org/r/26383/. The idea is that the presence of a special "snooze file" in the task's sandbox disables the health checks for that task.
Currently, the code reviewers have split into two camps:

1. One set of reviewers believes that simplicity is key: disable the health checks if the snooze file is present, and enable them otherwise.

2. The other set of reviewers believes that there should be a snooze duration. The timer starts when the snooze file is touched. Once the snooze duration has elapsed, the health checker deletes the snooze file and health checks resume. This is useful if the process that initially disabled the health checks dies unexpectedly and is no longer there to re-enable them.

I would like to invite anyone interested to chime in with an opinion.

Thanks,
David Pan
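
P.S. To make the two options concrete, here is a rough sketch. It is not the code under review; the file name `.healthchecksnooze`, the function `is_snoozed`, and its parameters are hypothetical and chosen only for illustration. Passing no duration gives the behavior of option 1, and passing a duration gives the behavior of option 2.

```python
import os
import time
from typing import Optional

# Hypothetical snooze file name; the actual review may use a different one.
SNOOZE_FILE = ".healthchecksnooze"


def is_snoozed(sandbox: str,
               snooze_secs: Optional[float] = None,
               now: Optional[float] = None) -> bool:
    """Return True if health checks should be skipped for this sandbox.

    snooze_secs=None -> option 1: snoozed for as long as the file exists.
    snooze_secs=N    -> option 2: snoozed for N seconds after the file was
                        last touched, then the file is deleted.
    """
    path = os.path.join(sandbox, SNOOZE_FILE)
    try:
        touched_at = os.path.getmtime(path)
    except OSError:
        return False  # No snooze file: health checks run as usual.

    if snooze_secs is None:
        return True  # Option 1: presence of the file is all that matters.

    current = time.time() if now is None else now
    if current - touched_at < snooze_secs:
        return True  # Option 2: still inside the snooze window.

    # Option 2: the window has expired, so remove the file and resume checks.
    try:
        os.remove(path)
    except OSError:
        pass  # Someone else already removed it; nothing more to do.
    return False
```

With option 2, touching the file again restarts the timer, because the sketch keys off the file's modification time.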