Yes, the netty errors from a large set of worker deaths really obscure the
original root cause.  You still need to diagnose that first.
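
While you dig into the root cause, one workaround worth trying is relaxing
the heartbeat/timeout settings so nimbus is slower to declare workers and
the topology dead.  A rough storm.yaml sketch (the values below are guesses
to experiment with, not tuned recommendations):

```yaml
# storm.yaml -- relax liveness timeouts (defaults are around 30-60s)
nimbus.task.timeout.secs: 120          # before nimbus considers a task dead
nimbus.supervisor.timeout.secs: 120    # before nimbus considers a supervisor dead
supervisor.worker.timeout.secs: 120    # before a supervisor restarts a worker

# give zookeeper sessions more slack too (milliseconds)
storm.zookeeper.session.timeout: 40000
storm.zookeeper.connection.timeout: 30000
```

If GC pauses or disk contention on the zookeeper nodes are delaying
heartbeats, longer timeouts can at least stop the kill/respawn cycle long
enough to see the first real error in the logs.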

- Erik

On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

> Forgot to add, one complication of this problem is that, after several
> rounds of killing, the re-spawned workers can no longer talk to their
> peers, with all sorts of netty exceptions.
>
> On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <[email protected]> wrote:
>
>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried
>> 0.9.5 yet, but I don't see any significant differences there), and
>> unfortunately we could not even get a clean run of over 30 minutes on a
>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes, but
>> on different disks.
>>
>> I have had huge trouble giving my data analytics topology a stable run.
>> So I tried the simplest topology I can think of: just an empty bolt, no
>> I/O except for reading from a kafka queue.
>>
>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka
>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>> size=1k).
>> After 26 minutes, nimbus orders the topology killed because it believes
>> the topology is dead; then after another 2 minutes, another kill, then
>> another after 4 more minutes, and on and on.
>>
>> I can understand there might be issues in the coordination among nimbus,
>> workers, and executors (e.g., heartbeats). But are there any doable
>> workarounds? I hope there are, since so many of you are using it in
>> production :-)
>>
>> I deeply appreciate any suggestions that could make even my toy topology
>> work!
>>
>> Fang
>>
>>
>