Hi, We recently had some issues with HDFS: there was a hardware issue with one of the nodes and nodes died. HDFS recovered, but we found that something was wrong with HBase. Checking the HMaster log, we saw that a bunch of our region servers had ended up in the famous failed servers list, and this kept going on and on until we restarted every one of them.
Are we doing something wrong? Is it possible to tune this somehow, so that once a server is in this list, HMaster forgets about it, or something like that? The main question: how does HMaster decide that a server should be in the failed servers list at all, and what exactly does this mean? I looked through the HBase book and googled, but besides some generic answers I wasn't able to find anything about the internals. Thanks in advance!
