I found a temporary workaround that has kept my toy topology running for over 90 minutes now. I manually restarted all supervisors whenever I found a worker in the hung state, and every component seems happy now. I have only done this once so far, so I don't know if I will need to do it again :-(
I originally thought it might be due to supervisor JVM memory issues, but I tuned all memory parameters and enabled GC logging, and found nothing suspicious. I will test with my real topology later.

Thanks,
Fang

On Sun, Jun 14, 2015 at 11:04 PM, Fang Chen <[email protected]> wrote:

> I also believe something is going on but just can't pin it down.
>
> But I do observe in all my experiments that the first worker to start
> losing heartbeats is the one with the kafka spout task (I have only one
> spout task). When it happens, the whole worker process seems to hang:
> none of the bolt tasks running in the same worker can emit logging
> messages. I double-checked the spout creation process; there is nothing
> suspicious, since I don't really do anything except use the
> forceFromStart param. This continues until the worker gets recycled by
> the supervisor after the 600-second timeout. Then I see another worker
> (on another node) starts to lose heartbeats.
>
> If I kill all supervisors, I can no longer observe the hang issue. It
> seems as if the supervisor is also affecting the worker behavior (maybe
> by reading heartbeats from ZK??)
>
> I am planning to re-arrange the cluster and see if that makes any
> difference.
>
> On Fri, Jun 12, 2015 at 11:52 PM, Erik Weathers <[email protected]> wrote:
>
>> There is something fundamentally wrong. You need to get to the root
>> cause of whatever the worker process is doing that is preventing the
>> heartbeats from arriving.
>>
>> - Erik
>>
>> On Friday, June 12, 2015, Fang Chen <[email protected]> wrote:
>>
>>> I tuned all worker and task timeouts up to 600 seconds, and nimbus
>>> seems happy after running the topology for 40 minutes. But one
>>> supervisor still complained about a timeout from a worker and then
>>> shut it down:
>>>
>>> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
>>> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28.
>>> Current supervisor time: 1434153560. State: :timed-out, Heartbeat:
>>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
>>> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
>>> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>>>
>>> It's hard to believe that even 600 seconds is not enough.
>>>
>>> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected]> wrote:
>>>
>>>> I turned on debug, and the nimbus reassignment does indeed seem to be
>>>> caused by heartbeat timeouts after running the topology for about 20
>>>> minutes. You can see that the non-live executors have an
>>>> ":is-timed-out true" status and an executor-reported time about 100
>>>> seconds behind nimbus time, while the live executors have executor
>>>> times ahead of nimbus time.
>>>>
>>>> ==========
>>>>
>>>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>>>> {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [14 14] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [15 15]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [16 16] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [17 17] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [18 18]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [19 19] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146479}, [20 20] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [21 21]
>>>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>>>> 1434146250}, [22 22] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [23 23] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [24 24]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [25 25] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [26 26] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [27 27]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [28 28] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [1 1] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}}
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>>>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>>>
>>>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> supervisor.heartbeat.frequency.secs 5
>>>>> supervisor.monitor.frequency.secs 3
>>>>>
>>>>> task.heartbeat.frequency.secs 3
>>>>> worker.heartbeat.frequency.secs 1
>>>>>
>>>>> Some nimbus parameters:
>>>>>
>>>>> nimbus.monitor.freq.secs 120
>>>>> nimbus.reassign true
>>>>> nimbus.supervisor.timeout.secs 60
>>>>> nimbus.task.launch.secs 120
>>>>> nimbus.task.timeout.secs 30
>>>>>
>>>>> When a worker dies, the log on one of the supervisors shows it
>>>>> shutting down the worker with a state of :disallowed (which I
>>>>> googled; some people say it's due to nimbus reassignment). The other
>>>>> logs only show the worker being shut down, without any further
>>>>> information.
>>>>>
>>>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]> wrote:
>>>>>
>>>>>> I'll have to look later; I think we are using ZooKeeper v3.3.6
>>>>>> (something like that). Some clusters have 3 ZK hosts, some 5.
>>>>>>
>>>>>> The way the nimbus detects that the executors are not alive is by
>>>>>> not seeing heartbeats updated in ZK. There has to be some cause for
>>>>>> the heartbeats not being updated. The most likely one is that the
>>>>>> worker process is dead. Another could be that the process is too
>>>>>> busy garbage collecting, and so missed the timeout for updating the
>>>>>> heartbeat.
>>>>>>
>>>>>> Regarding Supervisor and Worker: I think it's normal for the worker
>>>>>> to be able to live absent the presence of the supervisor, so that
>>>>>> sounds like expected behavior.
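To make the "not alive" decision concrete: nimbus compares its own clock against the last executor-reported heartbeat time in its cache, and marks an executor timed out once the gap exceeds nimbus.task.timeout.secs. Here is a minimal Python sketch of that comparison (a simplification for illustration, not the actual Clojure implementation in nimbus), using numbers from the heartbeat cache dump in this thread:

```python
# Simplified model of nimbus's executor-liveness check. The real logic
# lives in nimbus's Clojure code; this reproduces only the timeout
# comparison against the heartbeat cache.

NIMBUS_TASK_TIMEOUT_SECS = 30  # the value listed earlier in this thread

def is_timed_out(nimbus_time, executor_reported_time,
                 timeout=NIMBUS_TASK_TIMEOUT_SECS):
    """An executor is considered not alive once its last reported
    heartbeat lags nimbus's clock by more than the timeout."""
    return nimbus_time - executor_reported_time > timeout

# Executor [6 6] from the dump: reported 105 s behind nimbus -> timed out.
print(is_timed_out(1434146355, 1434146250))  # True
# Executor [2 2]: reported slightly ahead of nimbus's clock -> alive.
print(is_timed_out(1434146475, 1434146480))  # False
```

This also shows why a live-but-stalled worker (e.g. one stuck in a long GC pause) gets marked dead: no heartbeat write means the gap grows until it crosses the timeout.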
>>>>>> What are your timeouts for the various heartbeats?
>>>>>>
>>>>>> Also, when the worker dies you should see a log from the supervisor
>>>>>> noticing it.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>> usage. For our use case, we are really puzzled by the outcome so
>>>>>>> far. The initial investigation seems to indicate that workers don't
>>>>>>> die by themselves (I actually tried killing the supervisor, and the
>>>>>>> worker would continue running beyond 30 minutes).
>>>>>>>
>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>> complains that the worker "still has not started" for a few seconds
>>>>>>> right after launching the worker process, then goes silent -->
>>>>>>> after 26 minutes, nimbus complains that the executors (related to
>>>>>>> the worker) are "not alive" and starts to reassign the topology -->
>>>>>>> after another ~500 milliseconds, the supervisor shuts down its
>>>>>>> worker --> other peer workers complain about netty issues. And the
>>>>>>> loop goes on.
>>>>>>>
>>>>>>> Could you kindly tell me what version of ZooKeeper is used with
>>>>>>> 0.9.4, and how many nodes are in the ZooKeeper cluster?
>>>>>>>
>>>>>>> I wonder if this is due to ZooKeeper issues.
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> Fang
>>>>>>>
>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Fang,
>>>>>>>>
>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> One of the challenges with storm is figuring out the root cause
>>>>>>>> when things go haywire. You'll want to examine why the nimbus
>>>>>>>> decided to restart your worker processes.
>>>>>>>> That would happen when workers die and the nimbus notices that
>>>>>>>> storm executors aren't alive. (There are logs in nimbus for this.)
>>>>>>>> Then you'll want to dig into why the workers died by looking at
>>>>>>>> logs on the worker hosts.
>>>>>>>>
>>>>>>>> - Erik
>>>>>>>>
>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not
>>>>>>>>> tried 0.9.5 yet, but I don't see any significant differences
>>>>>>>>> there), and unfortunately we could not even get a clean run of
>>>>>>>>> over 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is
>>>>>>>>> also set up on these nodes, but on different disks.
>>>>>>>>>
>>>>>>>>> I have had huge trouble giving my data analytics topology a
>>>>>>>>> stable run, so I tried the simplest topology I can think of: just
>>>>>>>>> an empty bolt, no I/O except for reading from a kafka queue.
>>>>>>>>>
>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>> (kafka topic partition=1, spout task #=1, bolt #=20 with field
>>>>>>>>> grouping, msg size=1k): after 26 minutes, nimbus orders the
>>>>>>>>> topology killed as it believes the topology is dead; then after
>>>>>>>>> another 2 minutes, another kill, then another after another 4
>>>>>>>>> minutes, and on and on.
>>>>>>>>>
>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>> nimbus, worker, and executor (e.g., heartbeats). But are there
>>>>>>>>> any doable workarounds? I wish there are, as so many of you are
>>>>>>>>> using it in production :-)
>>>>>>>>>
>>>>>>>>> I deeply appreciate any suggestions that could make even my toy
>>>>>>>>> topology work!
>>>>>>>>>
>>>>>>>>> Fang
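Since long GC pauses are one of the suspected causes of missed heartbeats in this thread, it may help to make GC activity visible in the worker logs before the next run. A storm.yaml sketch follows; the heap size, log path, and flag set are illustrative for a HotSpot JVM of that era, not recommendations, and the %ID% substitution for the worker port is what I understand Storm 0.9.x to do with worker.childopts:

```yaml
# Illustrative worker JVM options for diagnosing GC-induced heartbeat loss.
# Any GC pause longer than the heartbeat interval will show up in this log.
worker.childopts: >
  -Xmx2g
  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime
  -Xloggc:/var/log/storm/gc-worker-%ID%.log
```

If the worker hangs with no corresponding GC pause in this log, that would point away from garbage collection and toward something like the supervisor/ZK interaction suspected above.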
