We have a cluster of 21 nodes with a total capacity of about 55T, and we have hit a problem twice in the last couple of weeks on startup of the NameNode. We are running 0.18.1. DFS usage is currently just under half of capacity, about 25T.
The symptom: startup logging on the NameNode's part is normal. It analyzes its expected DFS content, reports the number of files known, and begins to accept block reports from the slaves' DataNodes. During this time the NameNode is in safe mode, pending discovery of enough blocks from the slaves. As the fraction of reported blocks rises, it eventually hits the required 0.9990 threshold, and the NameNode announces that it will leave safe mode in 30 seconds. The problem occurs at the point of logging "0 seconds to leave safe mode": the NameNode hangs.

- It uses no more CPU.
- It logs nothing further.
- It stops responding on its port 50070 web interface.
- "hadoop fs" commands report no contact with the NameNode.
- "netstat -atp" shows a number of open connections on 9000 and 50070, indicating the connections are being accepted, but the NameNode never processes them.

This has happened twice in the last two weeks, and it has us fairly concerned. Both times it has been sufficient simply to restart, and the NameNode has come to life the second time around. Is anyone else familiar with this sort of hang, and do you know of any solutions?
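In case it helps with diagnosis, here is roughly what we plan to run the next time it hangs, to capture where the NameNode's threads are stuck before restarting. This is only a sketch; the jps/awk pattern and the *.out log location are assumptions about a standard 0.18.x layout, not something we have confirmed from the hang itself:

```shell
# Find the NameNode JVM's pid (jps ships with the JDK).
NN_PID=$(jps | awk '/NameNode/ {print $1}')

# SIGQUIT does not kill the JVM; it makes it print a full thread dump
# to its stdout, which lands in the NameNode's logs/*.out file.
kill -QUIT "$NN_PID"

# From another shell, confirm what the NameNode thinks its safe-mode
# state is (this may also hang if the RPC layer is wedged).
hadoop dfsadmin -safemode get

# Snapshot the accepted-but-unserviced connections for the report.
netstat -atp | grep -E ':(9000|50070)'
```

A thread dump taken at the moment of the hang would show whether the NameNode is deadlocked (e.g. threads blocked on a monitor) or merely stalled, which seems like the most useful evidence to attach to a bug report.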
