Hello, folks.

I'm encountering a bizarre situation where fetching appears to stall for 
specific partitions when using the new consumer in 0.9.0.  I know that no 
partitions are paused for extended periods; I issue a resume for all assigned 
partitions immediately before doing a poll.  Despite this, I'm ending up with 
approximately 7 partitions (the count varies from 3 to 9) that deliver no 
records to the consumer, even though records continue to be published to those 
partitions.  As a result, I routinely end up with partition lag in the 
thousands for this small subset of partitions, while all other partitions have 
a lag under twenty.
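
To make the symptom concrete, the wedged partitions are the ones whose lag 
(log-end offset minus consumer position) stays far above the rest.  Below is 
a minimal sketch of that check, not the actual monitoring code: the class 
name, method name, and the two offset maps are hypothetical, and it assumes 
the offsets have already been collected by some other means.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LagReport {
    /**
     * Returns the partition numbers whose lag exceeds maxLag, in ascending order.
     * Lag is the log-end offset minus the consumer's position.
     * Assumption: a partition with no recorded position is treated as position 0.
     */
    static List<Integer> stalledPartitions(Map<Integer, Long> logEndOffsets,
                                           Map<Integer, Long> positions,
                                           long maxLag) {
        List<Integer> stalled = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order over partition numbers.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(logEndOffsets).entrySet()) {
            long position = positions.getOrDefault(e.getKey(), 0L);
            long lag = e.getValue() - position;
            if (lag > maxLag) {
                stalled.add(e.getKey());
            }
        }
        return stalled;
    }
}
```

With a threshold of around 1000, this would pick out the handful of wedged 
partitions and skip the ones whose lag is under twenty.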

For scale, I have 3 brokers, 100 partitions, and 16 consumer instances.  
Records range from 20k to 160k, typically around 30-40k.  Processing time is 
mostly linear with record size, on the order of 1 CPU-second per 6k of record 
data.  Because of the high processing time, processing is done multi-threaded 
across 34 cores, and if processing from a single poll hasn't completed in the 
heartbeat interval, I pause all assigned partitions, issue a poll(0) to force 
the heartbeat, and then resume all assigned partitions.
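
The keep-alive step described above can be sketched as follows.  This is a 
simplified illustration rather than the production code: PausableConsumer is 
a hypothetical interface standing in for the three consumer operations 
involved (it is not the Kafka API), and RecordingConsumer is a fake that only 
records the call order.  Resuming in a finally block is an assumption of the 
sketch, so that a failed poll does not leave partitions paused.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the consumer operations used here; not the Kafka API. */
interface PausableConsumer {
    void pauseAll();           // pause every assigned partition
    int poll(long timeoutMs);  // returns the number of records fetched
    void resumeAll();          // resume every assigned partition
}

class KeepAlive {
    /**
     * Forces a heartbeat without fetching: pause, poll(0), resume.
     * Returns the number of records the poll returned (expected 0 while paused).
     */
    static int heartbeat(PausableConsumer consumer) {
        consumer.pauseAll();
        try {
            return consumer.poll(0);
        } finally {
            consumer.resumeAll(); // always resume, even if poll throws
        }
    }
}

/** Fake consumer that records the order of calls, for illustration only. */
class RecordingConsumer implements PausableConsumer {
    final List<String> calls = new ArrayList<>();
    boolean paused = false;

    public void pauseAll() { paused = true; calls.add("pause"); }

    public int poll(long timeoutMs) {
        calls.add("poll(" + timeoutMs + ")");
        return paused ? 0 : 1; // a paused fake delivers nothing
    }

    public void resumeAll() { paused = false; calls.add("resume"); }
}
```

Calling KeepAlive.heartbeat on a RecordingConsumer leaves the fake unpaused, 
with the calls recorded in the order pause, poll(0), resume.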

When partitions get wedged, bouncing one of the consumer instances (not 
necessarily the instance that would receive the partitions) will often unwedge 
the partitions that were wedged... but then other partitions get wedged, 
instead.

I have more than sufficient CPU to process all the records, and much of the 
consumer instance time is spent waiting on a poll(60000) result which doesn't 
return anything from the partitions that are wedged.  Also, my brokers seem to 
be running cold, with less than 30% CPU utilization and less than 2MB/sec disk 
I/O.

Has anyone seen anything like this?  Is it normal for the consumer fetcher to 
be biased in which partitions it fetches from?  Are there any suggestions on 
how to diagnose further?

- Alex
