Thanks Erik. We're using Storm 0.10, so Pacemaker doesn't come into play
here.
> 3. Worker process died due to exception (this is almost always what we see)
> 4. Worker process hung (eg doing GC).
I don't think it's either of these, TBH. There were no abnormal
terminations in the worker logs (or in the Storm error logs), and the GC
metrics that were reported to Graphite show that GC time was fairly low
(~50ms).
My feeling is that it is ZooKeeper-related rather than worker-related,
because ZK also expired sessions from the supervisor nodes.
I saw the following messages in the log of one of the ZK nodes:
2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4
cport:-1)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x45453e198e80076
2016-04-30 01:33:46,009 [myid:4] - INFO
[CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
client /<ip>:44500 which had sessionid 0x45453e198e80076
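For context on that "timeout of 20000ms exceeded" line: the ZK server-side session tracker expires a session when no request from the client has touched it within the negotiated timeout. A minimal sketch of that bookkeeping (class and method names are illustrative, not ZooKeeper's actual implementation):

```java
// Simplified model of ZooKeeper's server-side session expiry check.
// Names are illustrative only, not ZooKeeper's real API.
public class SessionExpiry {
    private long lastTouchMs;     // last time any request/ping arrived from the client
    private final long timeoutMs; // negotiated session timeout (20000ms in the logs above)

    public SessionExpiry(long nowMs, long timeoutMs) {
        this.lastTouchMs = nowMs;
        this.timeoutMs = timeoutMs;
    }

    // Called whenever a request or ping from the session's client arrives.
    public void touch(long nowMs) {
        lastTouchMs = nowMs;
    }

    // The session is expired once the timeout has been exceeded with no touch.
    public boolean isExpired(long nowMs) {
        return nowMs - lastTouchMs > timeoutMs;
    }
}
```

So anything that keeps a client from pinging for longer than 20s (a long GC pause, a network partition, a saturated ZK node) gets its session expired server-side, and an expired session cannot be resumed on reconnect, which matches the "session ... has expired" messages in the supervisor log.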
And here are the corresponding log messages in one of the supervisor log
files (for the same ZK session ID). It looks like this occurred right as
the supervisor was shutting the worker down due to Nimbus restarting it:
2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
additional data from server sessionid 0x45453e198e80076, likely server has
closed socket, closing socket connection and attempting reconnect
2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time:
1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id
"<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1]
[170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146]
[110 110] [14 14] [182 182] [134 134]], :port 6700}
2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
4e6934f9-88e4-4506-9d6e-25db72be941b
2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
execution of cleanup threads on worker.
2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
State change: SUSPENDED
2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
:disconnected::none: with disconnected Zookeeper.
2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
connection to server <zk host>:2181. Will not attempt to authenticate using
SASL (unknown error)
2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection
established to <zk host>:2181, initiating session
2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect
to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket
connection
2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
State change: LOST
2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none:
with disconnected Zookeeper.
2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
expired event received
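One thing worth noting in that log: the `:timed-out` state comes from the supervisor comparing the worker's last local heartbeat against its own clock (here 1461980064 - 1461980008 = 56s stale). A rough sketch of that check (names and the 30s timeout are assumptions, not Storm's actual source):

```java
// Illustrative version of the supervisor's worker-liveness check.
// Names and the 30s default are assumptions, not Storm's actual code.
public class WorkerLiveness {
    // Corresponds to the supervisor.worker.timeout.secs config (assumed 30s here).
    public static final long WORKER_TIMEOUT_SECS = 30;

    // Returns the state the supervisor would assign to the worker:
    // ":valid" while the heartbeat is fresh, ":timed-out" once it goes stale.
    public static String workerState(long heartbeatTimeSecs, long supervisorTimeSecs) {
        long stalenessSecs = supervisorTimeSecs - heartbeatTimeSecs;
        return stalenessSecs > WORKER_TIMEOUT_SECS ? ":timed-out" : ":valid";
    }
}
```

With the timestamps from the log above (56s of staleness against a 30s timeout), the worker gets killed locally at roughly the same time the ZK session for the supervisor itself is expiring, which is why I suspect a common ZK-side cause rather than a per-worker problem.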
On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <[email protected]>
wrote:
> (I don't know anything about the pacemaker service introduced in storm
> 1.0, so this statement is pre-1.0).
>
> The executor threads within the worker processes write heartbeats to
> ZooKeeper. If they aren't successfully heartbeating then it could be many
> things:
>
> 1. ZK too busy? (Seems unlikely)
> 2. Network too busy? (Seems unlikely)
> 3. Worker process died due to exception (this is almost always what we see)
> 4. Worker process hung (eg doing GC). (This would usually first be caught
> by the supervisor on that host since it checks a local heartbeat file that
> the worker normally writes to every second -- if the heartbeat doesn't get
> refreshed before the timeout then the supervisor kills the worker process
> with State being :timed-out (or :time-out, something like that). This of
> course depends on the various timeout config values you have on the worker
> and nimbus hosts.)
>
> - Erik
>
>
> On Saturday, April 30, 2016, Kevin Conaway <[email protected]>
> wrote:
>
>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our
>> topology. From researching past threads it seems like this is related to
>> not receiving heartbeats from the supervisors but I'm unsure if this was
>> the case. Our topology was mostly idle at the time that the restart was
>> triggered.
>>
>> We have a 5 node Zookeeper (3.4.5) ensemble. On one of the ZK nodes, I
>> saw the following messages at the time of the restart:
>>
>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x45453e198e8007f, timeout of 20000ms exceeded
>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x25453e1c2640085, timeout of 20000ms exceeded
>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x45453e198e80076, timeout of 20000ms exceeded
>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x35453e1a529008b, timeout of 20000ms exceeded
>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x15453e198d10084, timeout of 20000ms exceeded
>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x35453e1a5290090, timeout of 20000ms exceeded
>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>
>> In the nimbus log, there was the following log message:
>>
>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not
>> alive
>>
>> Shortly thereafter, the supervisors started restarting the workers. The
>> following log message was in the supervisor log:
>>
>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>
>> Previous threads have suggested that this was due to heavy GC causing the
>> heartbeats not to reach Zookeeper but the topology was idle at this time so
>> I don't think GC was the culprit. The GC par new time was about 50ms on
>> each node (as reported to Graphite).
>>
>> Thoughts on what could have happened here and how to debug further?
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>
--
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway