I can't speak to the implementation, but this is a capability we have wanted for quite a while across the several clusters I manage. I'm excited to see it added to the system.

Will

On Mon, Jan 20, 2020, 1:55 PM Lucas Capistrant <capistrant.lu...@gmail.com> wrote:

> Hi all,
>
> Looking for some feedback on the idea of creating a new dynamic config for
> the coordinator that allows cluster admins to pause coordination by setting
> the new config to true (default is false). By pause coordination, I mean
> skipping all coordinator helpers every time the coordinator runs. Some
> more details are included below, as well as a link to a PR with the initial
> implementation that I came up with. Any feedback helps; we want to make
> sure we are not overlooking any negative side effects!
>
> My organization is preparing to undergo some heavy maintenance on the HDFS
> cluster that backs our production Druid clusters. This involves HDFS
> downtime. Our plan was to stop the coordinators and overlords and do a
> rolling restart of the Historical nodes during the outage to lay down the
> new site files and retain a static picture of the world for client queries
> to run against. During our tests in stage we realized that the Historicals
> check in with the coordinator when starting up. Therefore, we wanted to
> find a way to leave the coordinator up, but not actually coordinate
> segments on the cluster, try to run kill tasks, etc. (because HDFS is
> offline and we don't want to be talking with it until we know it is back up
> and healthy). Thus, Pull 9224
> <https://github.com/apache/druid/pull/9224/files> was born. This seemed
> like an easy and effective way to halt coordination and keep the API up.
>
> We've done some small-scale testing in a dev environment, and I am
> currently looking into writing some type of integration test that exercises
> this code path. Despite the change's perceived simplicity, it would be nice
> to have something there.
>
> Thanks!
> Lucas Capistrant
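
For anyone skimming the thread, the behavior Lucas describes can be sketched roughly as below. This is only an illustration in Python, not the code from the PR (Druid itself is Java): the names `DynamicConfig`, `pause_coordination`, `Coordinator`, and `run_once` are all hypothetical, and the only things taken from the message are that the flag defaults to false and that every helper is skipped on each run while it is true.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DynamicConfig:
    # Hypothetical flag name; the thread only says the new config defaults to false.
    pause_coordination: bool = False


class Coordinator:
    """Toy stand-in for a coordinator that runs a list of helpers per cycle."""

    def __init__(self, helpers: List[Callable[[], None]], config: DynamicConfig):
        self.helpers = helpers
        self.config = config

    def run_once(self) -> int:
        """Run one coordination cycle and return how many helpers ran."""
        # The flag is read on every cycle, so an admin can flip it at runtime.
        if self.config.pause_coordination:
            # Paused: skip every helper, but the object (and its API) stays up.
            return 0
        for helper in self.helpers:
            helper()
        return len(self.helpers)


calls: List[str] = []
coordinator = Coordinator(
    helpers=[lambda: calls.append("balance"), lambda: calls.append("kill")],
    config=DynamicConfig(),
)

coordinator.run_once()                        # both helpers run
coordinator.config.pause_coordination = True  # admin pauses coordination
coordinator.run_once()                        # no helpers run while paused
```

The point of checking the flag inside the run loop, rather than stopping the process, is that the coordinator stays reachable for the Historicals' startup check-in while doing no segment or kill work.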