I found a temporary workaround which has just make my toy topology last for
over 90 minutes now. I manually restarted all supervisors when I found out
any worker went into the hung state and it seems like every component is
happy now. I did this just once so I don't know if I need to do it again :-(

I originally thought it might be due to supervisor jvm memory issues, but I
tuned all mem parameters and enabled gc, but found nothing suspicious.  I
will test with my real topology later.

Thanks,
Fang


On Sun, Jun 14, 2015 at 11:04 PM, Fang Chen <[email protected]> wrote:

> I also believe something is going on but just can't find it out.
>
> But I do observe in all my experiments, the first worker that started to
> lose heartbeats is the one with kakfa spout task (I have only one spout
> task). And when it happens, it seems like the whole worker process hangs,
> none of the bolt tasks running in the same worker could spit out logging
> messages.  I double-checked the spout creation process, there is nothing
> suspicious since I don't really do anything except for using the
> forceFromStart param.   This continues until the worker gets recycled by
> supervisor after 600 seconds timeout.  Then I see another worker (on
> another node) starts to lose heartbeats.
>
> If I kill all supervisors, I can no longer observe the hang issue. It
> seems as if the supervisor is also affecting the worker behavior (maybe by
> reading heartbeats for zk??)
>
> I am planning to re-arrange the cluster and see if that makes any
> difference.
>
>
>
>
> On Fri, Jun 12, 2015 at 11:52 PM, Erik Weathers <[email protected]>
> wrote:
>
>> There is something fundamentally wrong.  You need to get to the root
>> cause of what the worker process is doing that is preventing the heartbeats
>> from arriving.
>>
>> - Erik
>>
>>
>> On Friday, June 12, 2015, Fang Chen <[email protected]> wrote:
>>
>>> I tuned up all worker timeout and task time out to 600 seconds, and
>>> seems like nimbus is happy about it after running the topology for
>>> 40minutes. But still one supervisor complained timeout from worker and then
>>> shut it down:
>>>
>>> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
>>> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
>>> supervisor time: 1434153560. State: :timed-out, Heartbeat:
>>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
>>> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
>>> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>>>
>>>
>>> It's so hard to believe that even 600 seconds is not enough.
>>>
>>> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected]> wrote:
>>>
>>>> I turned on debug and seems like the nimbus reassign was indeed caused
>>>> by heartbeat timeouts after running the topology for about 20 minutes. You
>>>> can see that those non-live executors have a ":is-timed-out true"  status
>>>> and executor reported time is about 100 second behind nimbus time, while
>>>> other live executors have executor time head of nimbus time.
>>>>
>>>>
>>>> ==========
>>>>
>>>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
>>>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>>>> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}}
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>>>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>>>
>>>>
>>>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> supervisor.heartbeat.frequency.secs 5
>>>>> supervisor.monitor.frequency.secs 3
>>>>>
>>>>> task.heartbeat.frequency.secs 3
>>>>> worker.heartbeat.frequency.secs 1
>>>>>
>>>>> some nimbus parameters:
>>>>>
>>>>> nimbus.monitor.freq.secs 120
>>>>> nimbus.reassign true
>>>>> nimbus.supervisor.timeout.secs 60
>>>>> nimbus.task.launch.secs 120
>>>>> nimbus.task.timeout.secs 30
>>>>>
>>>>> When worker dies, the log in one of supervisors shows shutting down
>>>>> worker with  state of disallowed (which I googled around and some people
>>>>> say it's due to nimbus reassign). Other logs only show shutting down 
>>>>> worker
>>>>> without any further information.
>>>>>
>>>>>
>>>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>>>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>>>>
>>>>>> The way the nimbus detects that the executors are not alive is by not
>>>>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>>>>> heartbeats not being updated.  Most likely one is that the worker
>>>>>> process is dead.  Another one could be that the process is too busy 
>>>>>> Garbage
>>>>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>>>>
>>>>>> Regarding Supervisor and Worker: I think it's normal for the worker
>>>>>> to be able to live absent the presence of the supervisor, so that sounds
>>>>>> like expected behavior.
>>>>>>
>>>>>> What are your timeouts for the various heartbeats?
>>>>>>
>>>>>> Also, when the worker dies you should see a log from the supervisor
>>>>>> noticing it.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Erik
>>>>>>>
>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. 
>>>>>>> The
>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>> themselves ( i actually tried killing the supervisor and the worker 
>>>>>>> would
>>>>>>> continue running beyond 30 minutes).
>>>>>>>
>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor 
>>>>>>> shuts
>>>>>>> down its worker --> other peer workers complain about netty issues. and 
>>>>>>> the
>>>>>>> loop goes on.
>>>>>>>
>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>
>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> Fang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Fang,
>>>>>>>>
>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>> decided to restart your worker processes.  It would happen when 
>>>>>>>> workers die
>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are 
>>>>>>>> logs
>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died 
>>>>>>>> by
>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>
>>>>>>>> - Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), 
>>>>>>>>> and
>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes 
>>>>>>>>> on a
>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes 
>>>>>>>>> but on
>>>>>>>>> different disks.
>>>>>>>>>
>>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy 
>>>>>>>>> bolt,
>>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>>
>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field 
>>>>>>>>> grouping,
>>>>>>>>> msg size=1k).
>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>>
>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any 
>>>>>>>>> doable
>>>>>>>>> workarounds? I wish there are as so many of you are using it in 
>>>>>>>>> production
>>>>>>>>> :-)
>>>>>>>>>
>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>> topology working!
>>>>>>>>>
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>

Reply via email to