(I don't know anything about the Pacemaker service introduced in Storm 1.0, so the following applies to pre-1.0 versions.)
The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating, it could be one of several things:

1. ZK too busy? (Seems unlikely.)
2. Network too busy? (Seems unlikely.)
3. Worker process died due to an exception. (This is almost always what we see.)
4. Worker process hung, e.g. doing GC. (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)

- Erik

On Saturday, April 30, 2016, Kevin Conaway <[email protected]> wrote:

> We are using Storm 0.10 and we noticed that Nimbus decided to restart our
> topology. From researching past threads it seems like this is related to
> not receiving heartbeats from the supervisors, but I'm unsure if this was
> the case. Our topology was mostly idle at the time that the restart was
> triggered.
>
> We have a 5 node ZooKeeper (3.4.5) ensemble.
> On one of the ZK nodes, I saw the following messages at the time of the
> restart:
>
> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>
> In the nimbus log, there was the following log message:
>
> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>
> Shortly thereafter, the supervisors started restarting the workers. The
> following log message was in the supervisor log:
>
> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040.
> State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>
> Previous threads have suggested that this was due to heavy GC causing the
> heartbeats not to reach ZooKeeper, but the topology was idle at this time,
> so I don't think GC was the culprit. The ParNew GC time was about 50 ms on
> each node (as reported to Graphite).
>
> Thoughts on what could have happened here and how to debug further?
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
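Erik's description of the supervisor-side check can be made concrete. Below is a minimal Python sketch (not Storm's actual Clojure implementation; the function name and the timeout value are assumptions for illustration) of how a supervisor might classify a worker from its local heartbeat file and the current assignment. Note that :disallowed, the state in Kevin's supervisor log, indicates the heartbeat was fresh but the worker's executors no longer matched the assignment -- consistent with Nimbus having reassigned the topology, rather than a local heartbeat timeout:

```python
# Illustrative sketch only -- names and the timeout constant are assumptions,
# not Storm's real code.

WORKER_TIMEOUT_SECS = 30  # analogous to supervisor.worker.timeout.secs


def classify_worker(now_secs, heartbeat_secs, assigned_executors, heartbeat_executors):
    """Return the supervisor's view of a worker's state.

    :timed-out  -- local heartbeat file is stale (e.g. worker hung in GC)
    :disallowed -- heartbeat is fresh but the worker's executors no longer
                   match the current assignment (e.g. Nimbus reassigned them)
    :valid      -- heartbeat fresh and assignment matches
    """
    if now_secs - heartbeat_secs > WORKER_TIMEOUT_SECS:
        return ":timed-out"
    if set(heartbeat_executors) != set(assigned_executors):
        return ":disallowed"
    return ":valid"


# A hung worker whose heartbeat is 40s old gets killed as timed out:
print(classify_worker(1461980040, 1461980000, [(8, 8)], [(8, 8)]))  # :timed-out

# A fresh worker whose executors were reassigned elsewhere is :disallowed,
# which matches the supervisor log above:
print(classify_worker(1461980040, 1461980040, [], [(8, 8)]))        # :disallowed
```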

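The timeouts Erik mentions are governed by a handful of storm.yaml settings. A sketch of the relevant knobs follows; the values shown are the usual defaults as I understand them, so verify them against your cluster's defaults.yaml rather than taking them as authoritative:

```yaml
# Heartbeat-related settings (values assumed to be the stock defaults --
# check your own defaults.yaml).

# ZK session timeout -- note it matches the "timeout of 20000ms exceeded"
# lines in the ZooKeeper log above.
storm.zookeeper.session.timeout: 20000
storm.zookeeper.connection.timeout: 15000

# How often heartbeats are written.
worker.heartbeat.frequency.secs: 1   # local heartbeat file read by the supervisor
task.heartbeat.frequency.secs: 3     # executor heartbeats written to ZooKeeper

# How long before the other side gives up.
supervisor.worker.timeout.secs: 30   # supervisor kills a hung worker (:timed-out)
nimbus.task.timeout.secs: 30         # nimbus marks an executor "not alive"
nimbus.supervisor.timeout.secs: 60
```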