Hi Storm community,

We have a Storm cluster deployed with 15 workers, and recently we have often seen failures due to ack timeouts. Our input source is Kafka, and we use Ganglia to monitor the cluster. The failures now occur about every 12 hours. Below are my observations from our monitoring tools while the problem is happening:
1. The topology page shows that no worker went down, since the uptime of each task is nearly equal to the topology uptime.
2. In Ganglia, the CPU and memory reports give no clue about the problem. The network report, however, shows something unusual: on some workers the inbound rate drops slightly while the outbound rate drops to nearly zero.
3. I logged in to one of the machines mentioned above and found that one of the JVM survivor spaces always remains 100% full.
4. dstat shows that csw (context switches) jumps to 4k+ every few seconds, whereas it stays around 400 under normal conditions.

Can anyone give us some hints about this problem?