We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not get even one clean run lasting more than 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but on different disks.
I am having a lot of trouble getting my data analytics topology to run stably. So I tried the simplest topology I can think of: just an empty bolt, with no I/O except for reading from a Kafka queue. Here are the results of my latest test on 0.9.4 with this empty bolt (Kafka topic partition=1, spout task #=1, bolt #=20 with fields grouping, msg size=1k). After 26 minutes, nimbus orders the topology to be killed because it believes the topology is dead; then after another 2 minutes, another kill; then another one 4 minutes later; and on and on.
I can understand that there might be issues in the coordination among nimbus, workers and executors (e.g., heartbeats). But are there any workable workarounds? I hope there are, since so many of you are using it in production :-) I would deeply appreciate any suggestions that could get even my toy topology working!
Fang
