I'll have to check later, but I think we are using ZooKeeper v3.3.6 (or
something close).  Some clusters have 3 ZK hosts, some 5.

The way the nimbus detects that executors are not alive is by not
seeing their heartbeats updated in ZK, so something must be preventing
the heartbeats from being written.  The most likely cause is that the
worker process is dead.  Another is that the process is so busy garbage
collecting that it misses the timeout for updating the heartbeat.
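
If you suspect GC pauses, turning on GC logging in the workers will show
them.  As a sketch (the flags assume a pre-Java-9 HotSpot JVM, and the heap
size and log path are just placeholders for your setup):

```yaml
# storm.yaml -- hypothetical worker.childopts override; %ID% is replaced
# by the worker's port, so each worker gets its own GC log file.
worker.childopts: "-Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"
```

Long pauses in that log that exceed your heartbeat timeout would explain
the "not alive" determinations.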

Regarding the Supervisor and Worker: I think it's normal for the worker
to keep running in the absence of the supervisor, so that sounds like
expected behavior.

What are your timeouts for the various heartbeats?
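
For reference, these are the knobs I mean.  The values shown are the 0.9.x
defaults as I recall them, so double-check them against the defaults.yaml
shipped with your version:

```yaml
# storm.yaml -- heartbeat frequencies and the timeouts that act on them
# (all values in seconds; shown values are believed 0.9.x defaults).
task.heartbeat.frequency.secs: 3           # how often executors write heartbeats to ZK
nimbus.task.timeout.secs: 30               # nimbus marks an executor "not alive" after this
nimbus.task.launch.secs: 120               # extra grace period right after launch
supervisor.worker.timeout.secs: 30         # supervisor kills a worker it deems dead
supervisor.worker.start.timeout.secs: 120  # grace period while a worker starts up
worker.heartbeat.frequency.secs: 1         # worker's local heartbeat to the supervisor
```

If your workers GC-pause for longer than nimbus.task.timeout.secs, you'd
see exactly the reassignment behavior you describe.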

Also, when the worker dies you should see a log line from the supervisor
noticing it.

- Erik

On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

> Hi Erik
>
> Thanks for your reply!  It's great to hear about real production usage.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves (I
> actually tried killing the supervisor, and the worker continued running
> beyond 30 minutes).
>
> The sequence of events is like this: the supervisor immediately complains
> that the worker "still has not started" for a few seconds right after
> launching the worker process, then goes silent --> after 26 minutes, nimbus
> complains that executors (related to the worker) are "not alive" and starts
> to reassign the topology --> ~500 milliseconds later, the supervisor shuts
> down its worker --> other peer workers complain about netty issues, and the
> loop repeats.
>
> Could you kindly tell me what version of ZooKeeper is used with 0.9.4, and
> how many nodes are in the ZooKeeper cluster?
>
> I wonder if this is due to zookeeper issues.
>
> Thanks a lot,
> Fang
>
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>
>> Hey Fang,
>>
>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>
>> One of the challenges with storm is figuring out what the root cause is
>> when things go haywire.  You'll wanna examine why the nimbus decided to
>> restart your worker processes.  It would happen when workers die and the
>> nimbus notices that storm executors aren't alive.  (There are logs in
>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>> looking at logs on the worker hosts.
>>
>> - Erik
>>
>>
>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>
>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried
>>> 0.9.5 yet, but I don't see any significant differences there), and
>>> unfortunately we could not get even a clean 30-minute run on a cluster of
>>> 5 high-end nodes. ZooKeeper is also set up on these nodes, but on
>>> different disks.
>>>
>>> I have had huge trouble getting my data-analytics topology a stable run.
>>> So I tried the simplest topology I can think of: just an empty bolt, no
>>> I/O except for reading from a Kafka queue.
>>>
>>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka
>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>> size=1k):
>>> After 26 minutes, nimbus orders the topology killed as it believes the
>>> topology is dead, then another kill after 2 more minutes, then another
>>> after 4 more minutes, and on and on.
>>>
>>> I can understand there might be issues in the coordination among nimbus,
>>> worker, and executor (e.g., heartbeats). But are there any doable
>>> workarounds? I hope there are, since so many of you are using it in
>>> production :-)
>>>
>>> I deeply appreciate any suggestions that could even get my toy topology
>>> working!
>>>
>>> Fang
>>>
>>>
>
