Forgot to add: one complication of this problem is that, after several
rounds of killing, re-spawned workers can no longer talk to their peers,
failing with all sorts of Netty exceptions.
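For reference, these are the storm.yaml knobs that I understand govern the worker timeouts and Netty reconnect behavior involved here (a sketch only; the key names are from 0.9.x, the values shown are my understanding of the defaults, and whether raising them actually helps is an assumption, not something I have verified):

```yaml
# How long Nimbus waits without a task heartbeat before
# considering the task dead and reassigning it (default: 30).
nimbus.task.timeout.secs: 30

# How long the supervisor waits without a worker heartbeat
# before killing and restarting the worker (default: 30).
supervisor.worker.timeout.secs: 30

# Netty client reconnect policy for worker-to-worker links:
# retry count and backoff window in milliseconds.
storm.messaging.netty.max_retries: 30
storm.messaging.netty.min_wait_ms: 100
storm.messaging.netty.max_wait_ms: 1000
```

Raising the timeouts should make Nimbus slower to declare the topology dead, and raising the Netty retry limits should give re-spawned workers more time to reconnect to peers, at the cost of slower recovery from genuine failures.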

On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <[email protected]> wrote:

> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried
> 0.9.5 yet, but I don't see any significant differences there), and
> unfortunately we could not get a clean run lasting even 30 minutes on a
> cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but
> on different disks.
>
> I have had huge trouble getting my data-analytics topology to run stably,
> so I tried the simplest topology I could think of: just an empty bolt,
> with no I/O except for reading from a Kafka queue.
>
> To report my latest testing on 0.9.4 with this empty bolt (Kafka topic
> partitions=1, spout tasks=1, bolts=20 with fields grouping, message
> size=1 KB):
> after 26 minutes, Nimbus orders the topology killed because it believes
> the topology is dead; then after another 2 minutes, another kill, then
> another after a further 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among Nimbus,
> workers, and executors (e.g., heartbeats), but are there any workable
> workarounds? I hope there are, since so many of you are using Storm in
> production :-)
>
> I would deeply appreciate any suggestions that could get even my toy
> topology working!
>
> Fang
>
>
