Forgot to add: one complication of this problem is that, after several rounds of killing, the re-spawned workers can no longer talk to their peers, failing with all sorts of netty exceptions.
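(For anyone hitting the same thing: in 0.9.x the netty transport's reconnect behavior is controlled by a few storm.yaml settings, so one thing worth trying is letting workers retry longer before giving up on a peer. A sketch only, with illustrative values, not a confirmed fix:)

```yaml
# storm.yaml - netty client reconnect tuning (Storm 0.9.x keys; values are illustrative)
storm.messaging.netty.max_retries: 60      # default is much lower; retry longer before failing
storm.messaging.netty.min_wait_ms: 100     # initial backoff between reconnect attempts
storm.messaging.netty.max_wait_ms: 2000    # cap on the backoff between attempts
```

This only papers over the symptom (workers that died and came back on new ports); if nimbus keeps killing the topology, the retries just buy time.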
On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <[email protected]> wrote:

> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried
> 0.9.5 yet, but I don't see any significant differences there), and
> unfortunately we could not get even a clean 30-minute run on a cluster of
> 5 high-end nodes. ZooKeeper is also set up on these nodes, but on
> different disks.
>
> I have had huge trouble getting a stable run out of my data analytics
> topology. So I tried the simplest topology I could think of: just an
> empty bolt, no I/O except for reading from a Kafka queue.
>
> To report my latest test on 0.9.4 with this empty bolt (kafka topic
> partition=1, spout task #=1, bolt #=20 with field grouping, msg size=1k):
> after 26 minutes, nimbus orders the topology killed because it believes
> the topology is dead; then after another 2 minutes, another kill; then
> another after another 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among nimbus,
> workers, and executors (e.g., heartbeats). But are there any doable
> workarounds? I hope there are, since so many of you are using it in
> production :-)
>
> I deeply appreciate any suggestions that could make even my toy topology
> work!
>
> Fang
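Since the kills appear to be heartbeat-driven, one workaround people try is loosening the heartbeat timeouts so that a briefly stalled worker (GC pause, slow ZooKeeper disk) is not declared dead. A sketch of the relevant storm.yaml knobs, assuming 0.9.x key names; the values are illustrative, not recommendations:

```yaml
# storm.yaml - heartbeat/timeout tuning (Storm 0.9.x keys; values are illustrative)
nimbus.task.timeout.secs: 120            # how long nimbus waits on task heartbeats (default 30)
nimbus.supervisor.timeout.secs: 120      # how long nimbus waits on supervisor heartbeats
supervisor.worker.timeout.secs: 120      # how long a supervisor waits before restarting a worker
task.heartbeat.frequency.secs: 3         # how often tasks heartbeat to ZooKeeper
storm.zookeeper.session.timeout: 40000   # ZK session timeout in ms; raise if ZK disks are slow
```

Raising these trades slower failure detection for fewer spurious kills; it does not address the underlying cause (e.g. ZooKeeper contention from co-located disks, or long GC pauses in the workers).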
