I’m seeing behaviour that I don’t understand when I have Consumers fetching 
from multiple Partitions of the same Topic. Two distinct problems are 
arising:

1. A subset of the Partitions allocated to a given Consumer is not consumed 
at all. The Consumer appears healthy: the Thread is running and logging 
activity, and it is successfully processing records from some of the Partitions 
it has been assigned. I don’t think this is due to the first Partition fetched 
filling a Batch (KIP-387). The problem does not occur with one particular 
number of Consumers (3 in this case), but it has occurred with a range of 
larger values. I don’t think there is anything special about 3; it just 
happens to work with that value, although it is the same as the Broker and 
Replica count. When we tried 6 Consumers, 5 were fine but 1 exhibited this 
issue.

2. A delay of up to half a second between the Producer sending and the Consumer 
receiving a message. This looks suspiciously like fetch.max.wait.ms=500, but we 
also have fetch.min.bytes=1, so we should get messages as soon as something is 
available. The only explanation I can think of is that fetch.max.wait.ms is 
applied in full to the first Partition checked, which remains empty for the 
duration, and only then does the fetch move on to a subsequent non-empty 
Partition and deliver messages from there.
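
To check whether the delay in (2) really clusters around fetch.max.wait.ms, 
something like the following could bucket the observed delays. It is only a 
sketch: FetchDelayHistogram is a made-up name, and it assumes the record 
timestamp is the Producer's CreateTime and that Producer and Consumer clocks 
are reasonably in sync.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: buckets produce-to-consume delays by a fixed width. */
final class FetchDelayHistogram {
    private final long bucketWidthMs;
    private final TreeMap<Long, Integer> buckets = new TreeMap<>();

    FetchDelayHistogram(long bucketWidthMs) {
        if (bucketWidthMs <= 0) {
            throw new IllegalArgumentException("bucketWidthMs must be positive");
        }
        this.bucketWidthMs = bucketWidthMs;
    }

    /**
     * recordTimestampMs: the record's timestamp (assumed to be CreateTime).
     * receivedAtMs: wall-clock time at which poll() returned the record.
     */
    void record(long recordTimestampMs, long receivedAtMs) {
        // Clamp negative values, which would only indicate clock skew.
        long delayMs = Math.max(0L, receivedAtMs - recordTimestampMs);
        long bucketStart = (delayMs / bucketWidthMs) * bucketWidthMs;
        buckets.merge(bucketStart, 1, Integer::sum);
    }

    /** Copy of the counts, keyed by the lower bound of each bucket in ms. */
    Map<Long, Integer> snapshot() {
        return new TreeMap<>(buckets);
    }
}
```

In the poll loop this would be fed with record.timestamp() and 
System.currentTimeMillis(). A pile-up in the 400-500 ms bucket would support 
the fetch.max.wait.ms theory; an even spread would point elsewhere.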

Our environment is AWS MSK (Kafka 2.2.1) and Kafka Java client 2.4.0.

All environments appear healthy and under light load, e.g. clients 
operating at only 1-2% CPU and Brokers (3) at 5-10% CPU. No swap, no crashes, 
no dead threads etc.

A typical scenario is a Topic with 60 Partitions, 3 Replicas and a single 
ConsumerGroup with 5 Consumers.  The Partitioning is for semantic purposes with 
the intention being to add more Consumers as the business grows and load 
increases.  Some of the Partitions are always empty due to using short string 
keys and the default Partitioner - we will probably implement a custom 
Partitioner to achieve better distribution in the near future.
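
To illustrate why some Partitions stay empty: for keyed records the default 
Partitioner hashes the key bytes and takes the result modulo the Partition 
count, so a small set of distinct keys can never occupy more Partitions than 
there are keys. The sketch below shows the shape of that calculation. KeySpread 
and the sample keys are invented for illustration, and the hash is a stand-in 
(Arrays.hashCode), not Kafka's murmur2, so the actual Partition numbers will 
differ from what the real client produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of key-to-partition spread. */
final class KeySpread {
    /**
     * Stand-in for the default Partitioner: hash(keyBytes) mod numPartitions.
     * NOTE: uses Arrays.hashCode, NOT Kafka's murmur2, so results differ
     * from the real client; only the overall shape is the same.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    /** The set of Partitions that would receive at least one of the keys. */
    static Set<Integer> occupiedPartitions(List<String> keys, int numPartitions) {
        Set<Integer> occupied = new TreeSet<>();
        for (String key : keys) {
            occupied.add(partitionFor(key, numPartitions));
        }
        return occupied;
    }
}
```

With, say, a dozen distinct short keys and 60 Partitions, at least 48 
Partitions would never receive a record, regardless of the hash used.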

I don’t have access to the detailed JMX metrics yet but am working on that in 
the hope it will help diagnose.
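
Until the JMX metrics are available, a client-side check along the following 
lines might help separate Partitions that are legitimately empty from ones 
that are assigned but never fetched. It is a sketch under assumptions: 
IdlePartitionTracker is a made-up name, Partitions are reduced to plain 
integers, and the caller is assumed to supply the current assignment (e.g. 
from KafkaConsumer.assignment()) and to report the Partitions seen in each 
poll (e.g. from ConsumerRecords.partitions()).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: tracks when each Partition last delivered records. */
final class IdlePartitionTracker {
    private final Map<Integer, Long> lastSeenMs = new HashMap<>();

    /** Call once per Partition that returned records in a poll. */
    void onRecords(int partition, long nowMs) {
        lastSeenMs.put(partition, nowMs);
    }

    /**
     * Returns the assigned Partitions that have never delivered a record,
     * or have not delivered one for longer than idleThresholdMs.
     */
    Set<Integer> idlePartitions(Set<Integer> assigned, long nowMs, long idleThresholdMs) {
        Set<Integer> idle = new TreeSet<>();
        for (int partition : assigned) {
            Long last = lastSeenMs.get(partition);
            if (last == null || nowMs - last > idleThresholdMs) {
                idle.add(partition);
            }
        }
        return idle;
    }
}
```

An idle Partition is only suspicious if its end offset (for example from 
KafkaConsumer.endOffsets()) is ahead of the Consumer's position; otherwise it 
is probably just one of the Partitions that never receives any keys.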

Thoughts and advice appreciated!
