Hi all,

I'm looking for feedback on the idea of creating a new dynamic config for
the coordinator that allows cluster admins to pause coordination by setting
the new config to true (default is false). By "pause coordination", I mean
skipping all coordinator helpers each time the coordinator runs. More
details are included below, along with a link to a PR with the initial
implementation that I came up with. Any feedback helps; we want to make
sure we are not overlooking any negative side effects!
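
To make the proposed behavior concrete, here is a minimal sketch in Java.
It is not the code from the PR: the class and method names
(DynamicConfigSketch, CoordinatorSketch, isPauseCoordination, runOnce) are
hypothetical, and helpers are modeled as plain Runnables purely for
illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the coordinator's dynamic config (not Druid's class).
class DynamicConfigSketch {
    // Defaults to false, so coordination runs normally unless an admin pauses it.
    private final AtomicBoolean pauseCoordination = new AtomicBoolean(false);

    boolean isPauseCoordination() {
        return pauseCoordination.get();
    }

    void setPauseCoordination(boolean paused) {
        pauseCoordination.set(paused);
    }
}

// Hypothetical stand-in for the coordinator's periodic run.
class CoordinatorSketch {
    private final DynamicConfigSketch config;
    private final List<Runnable> helpers;

    CoordinatorSketch(DynamicConfigSketch config, List<Runnable> helpers) {
        this.config = config;
        this.helpers = helpers;
    }

    // Runs every helper once and returns how many ran.
    // When the config says "paused", no helper runs and 0 is returned.
    int runOnce() {
        // The flag is re-read on every run, so flipping it takes effect
        // on the next run without restarting the process.
        if (config.isPauseCoordination()) {
            return 0;
        }
        for (Runnable helper : helpers) {
            helper.run();
        }
        return helpers.size();
    }
}
```

In this sketch the process itself stays up (so an API could keep serving
requests) while runOnce() becomes a no-op for as long as the flag is true.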

My organization is preparing to undergo some heavy maintenance on our HDFS
cluster that backs our production Druid clusters. This involves HDFS
downtime. Our plan was to stop the coordinators and overlords and do a
rolling restart of the Historical nodes during the outage, to lay down the
new site files while retaining a static picture of the world for client
queries to run against. During our tests in stage, we realized that
Historicals check in with the coordinator when starting up. Therefore, we
wanted a way to leave the coordinator up without it actually coordinating
segments on the cluster, trying to run kill tasks, etc. (because HDFS is
offline and we don't want to talk to it until we know it is back up and
healthy). Thus, PR 9224 <https://github.com/apache/druid/pull/9224/files>
was born. This seemed like an easy and effective way to halt coordination
while keeping the API up.

We've done some small-scale testing in a dev environment, and I am currently
looking into writing some type of integration test that exercises this code
path. Despite the change's perceived simplicity, it would be nice to have
test coverage there.

Thanks!
Lucas Capistrant
