Hi Storm community,

We have a Storm cluster deployed with 15 workers, and recently we have
often seen failures caused by ack timeouts. Our input source is Kafka, and
we use Ganglia to monitor the cluster. Lately the failures occur every 12
hours. Here is what I observe with our monitoring tools when the problem
happens:

   1. The topology page shows that no worker went down, since the uptime of
   each task is nearly equal to the topology uptime.
   2. I've checked Ganglia. The CPU and memory reports give no clue about
   the problem, but the network report shows something unusual: on some
   workers the inbound rate drops a little while the outbound rate drops to
   nearly zero.
   3. I logged in to one of the machines mentioned above and found that
   one of the JVM survivor spaces always remains 100% full.
   4. dstat shows that csw (context switches) jumps to 4k+ every few
   seconds, whereas it stays around 400 under normal conditions.
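
For anyone who wants to watch the survivor-space occupancy from observation
3 over time, one option is the JVM's standard memory-pool MXBeans. The
sketch below is only an illustration and not necessarily how the figure
above was obtained. The class name SurvivorCheck and the helper percentUsed
are made up for this example, and matching pools by the substring
"Survivor" is an assumption, since pool names vary by garbage collector. It
reports on the JVM it runs in, so it would need to run inside a worker
process (or be adapted to connect over JMX) to say anything about a Storm
worker.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, not part of Storm.
final class SurvivorCheck {

    // Percentage of a pool in use; returns -1 when the limit is not positive.
    static double percentUsed(long used, long limit) {
        if (limit <= 0) {
            return -1.0;
        }
        return 100.0 * used / limit;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: survivor pools have "Survivor" in their name,
            // e.g. "PS Survivor Space" or "G1 Survivor Space".
            if (!pool.getName().contains("Survivor")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // Compare against committed memory, because max may be undefined.
            System.out.printf("%s: used=%d committed=%d (%.1f%%)%n",
                    pool.getName(),
                    usage.getUsed(),
                    usage.getCommitted(),
                    percentUsed(usage.getUsed(), usage.getCommitted()));
        }
    }
}
```

Logging this value periodically alongside the csw spikes from observation 4
would show whether the two line up in time.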

Can anyone give us a hint about what might be causing this problem?
