(I don't know anything about the Pacemaker service introduced in Storm 1.0, so the following applies to pre-1.0 versions.)
The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating, it could be one of several things:

1. ZK too busy? (Seems unlikely.)
2. Network too busy? (Seems unlikely.)
3. Worker process died due to an exception. (This is almost always what we see.)
4. Worker process hung, e.g. doing GC. (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)

- Erik

On Saturday, April 30, 2016, Kevin Conaway <[email protected]> wrote:

> We are using Storm 0.10 and we noticed that Nimbus decided to restart our
> topology. From researching past threads it seems like this is related to
> not receiving heartbeats from the supervisors, but I'm unsure if this was
> the case. Our topology was mostly idle at the time that the restart was
> triggered.
>
> We have a 5 node ZooKeeper (3.4.5) ensemble.
> On one of the ZK nodes, I saw the following messages at the time of the
> restart:
>
> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>
> In the nimbus log, there was the following log message:
>
> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>
> Shortly thereafter, the supervisors started restarting the workers. The
> following log message was in the supervisor log:
>
> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040.
> State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>
> Previous threads have suggested that this was due to heavy GC causing the
> heartbeats not to reach ZooKeeper, but the topology was idle at this time,
> so I don't think GC was the culprit. The ParNew GC time was about 50 ms on
> each node (as reported to Graphite).
>
> Thoughts on what could have happened here and how to debug further?
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
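Erik's description of the supervisor-side check can be made concrete. Below is a minimal Python sketch (not Storm's actual Clojure implementation; the function name and the timeout value are assumptions for illustration) of how a supervisor might classify a worker from its local heartbeat file and the current assignment. Note that :disallowed, the state in Kevin's supervisor log, indicates the heartbeat was fresh but the worker's executors no longer matched the assignment -- consistent with Nimbus having reassigned the topology, rather than a local heartbeat timeout:

```python
# Illustrative sketch only -- names and the timeout constant are assumptions,
# not Storm's real code.

WORKER_TIMEOUT_SECS = 30  # analogous to supervisor.worker.timeout.secs


def classify_worker(now_secs, heartbeat_secs, assigned_executors, heartbeat_executors):
    """Return the supervisor's view of a worker's state.

    :timed-out  -- local heartbeat file is stale (e.g. worker hung in GC)
    :disallowed -- heartbeat is fresh but the worker's executors no longer
                   match the current assignment (e.g. Nimbus reassigned them)
    :valid      -- heartbeat fresh and assignment matches
    """
    if now_secs - heartbeat_secs > WORKER_TIMEOUT_SECS:
        return ":timed-out"
    if set(heartbeat_executors) != set(assigned_executors):
        return ":disallowed"
    return ":valid"


# A hung worker whose heartbeat is 40s old gets killed as timed out:
print(classify_worker(1461980040, 1461980000, [(8, 8)], [(8, 8)]))  # :timed-out

# A fresh worker whose executors were reassigned elsewhere is :disallowed,
# which matches the supervisor log above:
print(classify_worker(1461980040, 1461980040, [], [(8, 8)]))        # :disallowed
```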

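The timeouts Erik mentions are governed by a handful of storm.yaml settings. A sketch of the relevant knobs follows; the values shown are the usual defaults as I understand them, so verify them against your cluster's defaults.yaml rather than taking them as authoritative:

```yaml
# Heartbeat-related settings (values assumed to be the stock defaults --
# check your own defaults.yaml).

# ZK session timeout -- note it matches the "timeout of 20000ms exceeded"
# lines in the ZooKeeper log above.
storm.zookeeper.session.timeout: 20000
storm.zookeeper.connection.timeout: 15000

# How often heartbeats are written.
worker.heartbeat.frequency.secs: 1   # local heartbeat file read by the supervisor
task.heartbeat.frequency.secs: 3     # executor heartbeats written to ZooKeeper

# How long before the other side gives up.
supervisor.worker.timeout.secs: 30   # supervisor kills a hung worker (:timed-out)
nimbus.task.timeout.secs: 30         # nimbus marks an executor "not alive"
nimbus.supervisor.timeout.secs: 60
```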