Yes, the netty errors from a large set of worker deaths really obscure the original root cause. That root cause is still what you need to diagnose.
- Erik

On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

> Forgot to add: one complication of this problem is that, after several
> rounds of killing, the re-spawned workers can no longer talk to their
> peers, with all sorts of netty exceptions.
>
> On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <[email protected]> wrote:
>
>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried
>> 0.9.5 yet, but I don't see any significant differences there), and
>> unfortunately we could not even get a clean run for over 30 minutes on a
>> cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but
>> on different disks.
>>
>> I have had huge trouble getting a stable run out of my data analytics
>> topology. So I tried the simplest topology I could think of: just an
>> empty bolt, with no I/O except for reading from a Kafka queue.
>>
>> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka
>> topic partitions=1, spout task count=1, bolt count=20 with fields
>> grouping, msg size=1 KB): after 26 minutes, nimbus orders the topology
>> killed because it believes the topology is dead; then after another
>> 2 minutes, another kill; then another after a further 4 minutes, and on
>> and on.
>>
>> I can understand there might be issues in the coordination among nimbus,
>> workers, and executors (e.g., heartbeats). But are there any practical
>> workarounds? I hope there are, since so many of you are running Storm in
>> production :-)
>>
>> I deeply appreciate any suggestions that could get even my toy topology
>> working!
>>
>> Fang
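Since nimbus kills a topology when worker/executor heartbeats stop arriving in time, a commonly tried mitigation while you diagnose the underlying problem is to loosen the heartbeat timeouts in storm.yaml. The keys below exist in Storm 0.9.x defaults.yaml; the values shown are purely illustrative, not tuned recommendations, and raising them only delays kills rather than fixing whatever stalls the heartbeats:

```yaml
# storm.yaml -- illustrative timeout overrides (defaults in parentheses)
nimbus.task.timeout.secs: 120          # (30) how long nimbus waits for an
                                       # executor heartbeat before declaring it dead
supervisor.worker.timeout.secs: 120    # (30) how long a supervisor waits for a
                                       # worker heartbeat before restarting it
nimbus.supervisor.timeout.secs: 180    # (60) how long nimbus waits for a
                                       # supervisor heartbeat
storm.zookeeper.session.timeout: 30000 # (20000 ms) ZooKeeper session timeout;
                                       # session expiry also looks like death
```

If heartbeats are being delayed by GC pauses or ZooKeeper disk latency, these settings buy headroom but the worker logs and GC logs are still the place to find the real cause.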
