Michal Turek created KAFKA-2978:
-----------------------------------
Summary: Topic partition is not sometimes consumed after
rebalancing of consumer group
Key: KAFKA-2978
URL: https://issues.apache.org/jira/browse/KAFKA-2978
Project: Kafka
Issue Type: Bug
Components: consumer, core
Affects Versions: 0.9.0.0
Reporter: Michal Turek
Assignee: Neha Narkhede
Priority: Critical
Hi there, we are evaluating Kafka 0.9 to find out whether it is stable enough
and ready for production. We wrote a tool that verifies that each produced
message is also properly consumed. We found the issue described below while
stress-testing Kafka with this tool.
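For illustration only, the kind of check such a tool performs can be sketched as follows. This is not the code from the repository linked below; the class and method names are hypothetical, and it assumes every message carries a unique ID and that the partition it was written to is known.

```python
from collections import defaultdict


class DeliveryChecker:
    """Hypothetical helper: tracks produced and consumed message IDs per partition."""

    def __init__(self):
        self._produced = defaultdict(set)
        self._consumed = defaultdict(set)

    def on_produced(self, partition, message_id):
        # Call once the producer has confirmed the write.
        self._produced[partition].add(message_id)

    def on_consumed(self, partition, message_id):
        # Call for every record handed to the client application.
        self._consumed[partition].add(message_id)

    def missing(self):
        """Return {partition: sorted IDs produced but never consumed}.

        Partitions with nothing missing are omitted from the result.
        """
        result = {}
        for partition, produced in self._produced.items():
            lost = produced - self._consumed[partition]
            if lost:
                result[partition] = sorted(lost)
        return result
```

A partition that shows up in `missing()` with all of its recent IDs listed is the symptom described in this report: the data was produced but no consumer in the group is reading it.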
Adding more and more consumers to a consumer group may result in unsuccessful
rebalancing: data from one or more partitions is not consumed and is
effectively unavailable to the client application (e.g. for 15 minutes). The
situation can be resolved externally by touching the consumer group again
(adding or removing a consumer), which forces another rebalancing that may or
may not be successful.
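A stalled partition of this kind could be flagged with a check along the following lines. This is a minimal sketch, not part of the tool: the function name and its inputs are hypothetical, timestamps are plain seconds supplied by the caller, and the 900-second default simply mirrors the 15-minute example above.

```python
def stalled_partitions(last_produced, last_consumed, now, threshold_s=900.0):
    """Return the sorted partitions that look stalled.

    last_produced / last_consumed map partition -> timestamp (seconds) of the
    newest produced / consumed message. A partition is reported when newer
    data has been produced than consumed and nothing has been consumed from
    it for longer than threshold_s. A partition that was never consumed is
    treated as stalled immediately (an assumption of this sketch).
    """
    stalled = []
    for partition, produced_at in last_produced.items():
        consumed_at = last_consumed.get(partition, float("-inf"))
        has_unread_data = produced_at > consumed_at
        silent_too_long = now - consumed_at > threshold_s
        if has_unread_data and silent_too_long:
            stalled.append(partition)
    return sorted(stalled)
```

Running such a check periodically would show whether a later rebalance (for example after adding or removing a consumer) actually restored consumption of the affected partitions.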
Significantly higher CPU utilization was observed in such cases (rising from
about 3% to 17%). According to htop and profiling with jvisualvm, the increase
occurs in both the affected consumer and the Kafka broker.
Jvisualvm indicates the issue may be related to KAFKA-2936 (see the jvisualvm
screenshots in the GitHub repo below), but I'm far from sure. I also don't know
whether the issue is in the consumer or the broker, because both are affected
and I don't know Kafka internals.
The issue is not deterministic, but it can be easily reproduced within a few
minutes just by starting more and more consumers. More parallelism with
multiple CPUs probably gives the issue more chances to appear.
The tool itself, together with very detailed instructions for fairly reliable
reproduction of the issue and an initial analysis, is available here:
- https://github.com/avast/kafka-tests
- https://github.com/avast/kafka-tests/tree/issue1/issues/1_rebalancing
- Prefer the fixed tag {{issue1}} to the branch {{master}}, which may change.
- Note there are also various jvisualvm screenshots together with full logs
from all components of the tool.
My colleague was able to independently reproduce the issue by following the
instructions above. If you have any questions or need any help with the tool,
just let us know. This issue is a blocker for us.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)