Hi all,

I'm looking for feedback on the idea of adding a new dynamic config for the coordinator that lets cluster admins pause coordination by setting it to true (the default is false). By "pause coordination" I mean skipping all coordinator helpers on every coordinator run for as long as the config is set. More details are below, along with a link to a PR with the initial implementation that I came up with. Any feedback helps; we want to make sure we are not overlooking any negative side effects!
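To make the intended behavior concrete, here is a rough, self-contained sketch of the idea. It is not the code from the PR: the class, field, and method names (PauseCoordinationSketch, DynamicConfig, pauseCoordination, Helper, runOnce) are hypothetical stand-ins chosen for illustration, and the real coordinator is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: every name here is hypothetical, not Druid's actual API.
final class PauseCoordinationSketch {

    /** Stand-in for the coordinator dynamic config; only the proposed flag is modeled. */
    static final class DynamicConfig {
        final boolean pauseCoordination;

        DynamicConfig(boolean pauseCoordination) {
            this.pauseCoordination = pauseCoordination;
        }
    }

    /** Stand-in for a coordinator helper (segment balancing, kill tasks, and so on). */
    interface Helper {
        void run();
    }

    // Default is false, so coordination proceeds unless an admin opts in to pausing.
    private final AtomicReference<DynamicConfig> config =
        new AtomicReference<>(new DynamicConfig(false));
    private final List<Helper> helpers = new ArrayList<>();

    void addHelper(Helper helper) {
        helpers.add(helper);
    }

    /** Models an admin updating the dynamic config at runtime. */
    void updateConfig(DynamicConfig newConfig) {
        config.set(newConfig);
    }

    /** One coordinator run. Returns how many helpers were executed. */
    int runOnce() {
        // Re-read the config on every run so a change takes effect on the next cycle.
        if (config.get().pauseCoordination) {
            return 0; // paused: skip every helper, but the process (and its API) stays up
        }
        int executed = 0;
        for (Helper helper : helpers) {
            helper.run();
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) {
        PauseCoordinationSketch coordinator = new PauseCoordinationSketch();
        coordinator.addHelper(() -> System.out.println("helper ran"));

        System.out.println("helpers executed: " + coordinator.runOnce());
        coordinator.updateConfig(new DynamicConfig(true));
        System.out.println("helpers executed while paused: " + coordinator.runOnce());
    }
}
```

The point of the sketch is that the flag is checked at the top of each run, so the coordinator keeps running and serving requests while doing no coordination work, and unsetting the flag resumes normal behavior on the next run.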
My organization is preparing for some heavy maintenance on the HDFS cluster that backs our production Druid clusters, and this involves HDFS downtime. Our plan was to stop the coordinators and overlords and do a rolling restart of the Historical nodes during the outage, laying down the new site files while retaining a static picture of the world for client queries to run against. During our tests in stage we realized that the Historicals check in with the coordinator when starting up. Therefore, we wanted to find a way to leave the coordinator up without it actually coordinating segments on the cluster, trying to run kill tasks, etc. (because HDFS is offline and we don't want to be talking to it until we know it is back up and healthy).

Thus, PR 9224 <https://github.com/apache/druid/pull/9224/files> was born. This seemed like an easy and effective way to halt coordination while keeping the API up. We've done some small-scale testing in a dev environment, and I am currently looking into writing some type of integration test that exercises this code path. Despite the change's perceived simplicity, it would be nice to have some test coverage there.

Thanks!
Lucas Capistrant