Yes, and you should suggest a solution, e.g. throttling rebalancing threads more to produce less load.
What you are suggesting kills the idea of this enhancement.

--Yakov

2018-09-07 19:03 GMT+03:00 Andrey Kuznetsov <stku...@gmail.com>:

> Yakov,
>
> Thanks for the reply. Indeed, the initial design assumed node termination
> when a hanging critical thread has been detected. But sometimes this looks
> inappropriate. Suppose, for example, that fsync in the WAL writer thread
> takes too long, and we terminate the node. Upon rebalancing, this may lead
> to long fsyncs on other nodes due to increased per-node load, hence we may
> terminate the next node as well. Eventually we can collapse the entire
> cluster. Is this a possible scenario?
>
> Fri, Sep 7, 2018 at 18:44, Yakov Zhdanov <yzhda...@apache.org>:
>
> > Andrey,
> >
> > I don't understand your point. In my opinion, the idea of these changes
> > is to make the cluster more stable and responsive by eliminating hung
> > nodes. I would not draw much of a distinction between threads trapped in
> > a deadlock and threads hanging on fsync calls for too long. Both
> > situations lead to increasing latency in the cluster, up to its full
> > unavailability.
> >
> > So, killing a node hanging on fsync may be reasonable. Agree?
> >
> > You may implement an approach where you have warning messages in the
> > logs by default, but a termination option should also be available.
> >
> > Thanks!
> >
> > --Yakov
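
For illustration only, the "warnings in logs by default, termination as an option" approach mentioned in the quoted message could be sketched roughly as below. This is not Ignite code: the names `HangPolicy`, `CriticalThreadWatchdog`, `heartbeat` and `check` are hypothetical, the timeout value is arbitrary, and Python is used only to keep the sketch short. It assumes each critical thread (e.g. the WAL writer) reports a heartbeat and a watchdog flags threads whose last heartbeat is older than a timeout.

```python
import enum
import time


class HangPolicy(enum.Enum):
    """Hypothetical reaction to a critical thread that stopped responding."""
    WARN = "warn"            # only record a warning (the suggested default)
    TERMINATE = "terminate"  # additionally stop the node (opt-in)


class CriticalThreadWatchdog:
    """Tracks heartbeats of critical threads and reacts to stale ones."""

    def __init__(self, timeout_s, policy=HangPolicy.WARN, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.policy = policy
        self.clock = clock        # injectable so the sketch can be tested
        self.last_beat = {}       # thread name -> time of last heartbeat
        self.warnings = []        # stands in for real log output
        self.terminated = False   # stands in for real node shutdown

    def heartbeat(self, name):
        """Called by a critical thread each time it makes progress."""
        self.last_beat[name] = self.clock()

    def check(self):
        """Return the names of threads whose heartbeat is older than the timeout."""
        now = self.clock()
        hung = [n for n, t in self.last_beat.items() if now - t > self.timeout_s]
        for name in hung:
            self.warnings.append(f"critical thread {name!r} is unresponsive")
        if hung and self.policy is HangPolicy.TERMINATE:
            # A real node would invoke its failure handling here.
            self.terminated = True
        return hung
```

With the default `WARN` policy a slow fsync only produces a warning, so a single overloaded node does not trigger the cascade described above; `TERMINATE` remains available for deployments that prefer to drop hung nodes.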