We did some more testing with logging turned on (I figured out why it wasn't working before). We tried increasing the JVM memory allocation on our test server (it's lower than in production) and increasing the ZooKeeper timeouts. Neither changed the results. With trace logging enabled, we saw that rebalances were happening even though only one high-level consumer is running (there was previously a simple consumer that was told to disconnect, but that consumer only checked offsets and never consumed data).
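For reference, the consumer settings we've been adjusting look roughly like the following (a minimal sketch against the 0.8.2.1 high-level consumer config; the ZooKeeper host, group id, and values are illustrative, not our exact configuration):

    import java.util.Properties;
    import kafka.consumer.ConsumerConfig;

    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1:2181");            // illustrative ZooKeeper host
    props.put("group.id", "our-consumer-group");           // illustrative group id
    props.put("auto.commit.enable", "false");              // we commit offsets ourselves
    props.put("consumer.timeout.ms", "5000");              // illustrative; makes the iterator throw ConsumerTimeoutException when idle
    props.put("zookeeper.session.timeout.ms", "30000");    // raised from the default while testing
    props.put("zookeeper.connection.timeout.ms", "30000");
    props.put("rebalance.backoff.ms", "4000");
    props.put("rebalance.max.retries", "8");
    ConsumerConfig config = new ConsumerConfig(props);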
Two questions:

- Is there possibly a race condition where the simple consumer has a hold on a partition and shutdown is called before starting a high-level consumer, but the shutdown is done asynchronously? (A rough sketch of the ordering in question is below.)
- What are the various things that can cause a consumer rebalance, other than adding or removing high-level consumers?
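To make the first question concrete, the ordering in our code is roughly the following (a simplified sketch, not our actual code; the broker host, client id, topic name, and buffer sizes are placeholders):

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.javaapi.consumer.SimpleConsumer;

    // 1. Simple consumer that only checks offsets and never consumes data.
    SimpleConsumer offsetChecker =
        new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "offset-check");  // placeholder host/client id
    // ... fetch and inspect offsets here ...
    offsetChecker.close();  // does anything here linger after close() returns?

    // 2. High-level consumer started afterwards in the same process.
    Properties props = new Properties();
    props.put("zookeeper.connect", "zk1:2181");       // plus the settings from the earlier sketch
    props.put("group.id", "our-consumer-group");
    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        connector.createMessageStreams(Collections.singletonMap("our-topic", 1));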
Thanks,
Cliff

On Wed, Oct 21, 2015 at 4:20 PM, Cliff Rhyne <crh...@signal.co> wrote:

> Hi Kris,
>
> Thanks for the tip. I'm going to investigate this further. I checked and
> we have fairly short ZooKeeper timeouts and run with a smaller memory
> allocation in the two environments where we encounter this issue. I'll
> let you all know what I find.
>
> I saw this ticket https://issues.apache.org/jira/browse/KAFKA-2049 that
> seems to be related to the problem (but would only report that an issue
> occurred). Are there any other open issues that could be worked on to
> improve Kafka's handling of this situation?
>
> Thanks,
> Cliff
>
> On Wed, Oct 21, 2015 at 2:53 PM, Kris K <squareksc...@gmail.com> wrote:
>
>> Hi Cliff,
>>
>> One other case I observed in my environment is when there were GC pauses
>> on one of the high-level consumers in the group.
>>
>> Thanks,
>> Kris
>>
>> On Wed, Oct 21, 2015 at 10:12 AM, Cliff Rhyne <crh...@signal.co> wrote:
>>
>> > Hi James,
>> >
>> > There are two scenarios we run:
>> >
>> > 1. Multiple partitions with one consumer per partition. This rarely has
>> > consumers starting or stopping, so the pool is very static. There is a
>> > configured consumer timeout, which causes a ConsumerTimeoutException to
>> > be thrown before the test starts. We handle this exception and then
>> > resume consuming.
>> > 2. A single partition with one consumer. This consumer is started by a
>> > triggered condition (the number of messages pending in the Kafka topic,
>> > or a schedule). The consumer is stopped after processing completes.
>> >
>> > In both cases, as I understand it there shouldn't be a rebalance, since
>> > either a) all consumers are running or b) there's only one consumer and
>> > one partition. Also, the same consumer group is used by all consumers
>> > in scenarios 1 and 2. Is there a good way to investigate whether
>> > rebalances are occurring?
>> >
>> > Thanks,
>> > Cliff
>> >
>> > On Wed, Oct 21, 2015 at 11:37 AM, James Cheng <jch...@tivo.com> wrote:
>> >
>> > > Do you have multiple consumers in a consumer group?
>> > >
>> > > I think that when a new consumer joins the consumer group, the
>> > > existing consumers will stop consuming during the group rebalance,
>> > > and when they start consuming again, they will consume from the last
>> > > committed offset.
>> > >
>> > > You should get more verification on this, though. I might be
>> > > remembering wrong.
>> > >
>> > > -James
>> > >
>> > > > On Oct 21, 2015, at 8:40 AM, Cliff Rhyne <crh...@signal.co> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > My team and I are looking into a problem where the Java high-level
>> > > > consumer provides duplicate messages if we turn auto commit off
>> > > > (using version 0.8.2.1 of the server and Java client). The expected
>> > > > sequence of events is:
>> > > >
>> > > > 1. Start the high-level consumer and initialize a KafkaStream to
>> > > > get a ConsumerIterator
>> > > > 2. Consume n items (could be 10,000, could be 1,000,000) from the
>> > > > iterator
>> > > > 3. Commit the new offsets
>> > > >
>> > > > What we are seeing is that during step 2, some of the n messages
>> > > > are returned by the iterator in duplicate (in some cases, we've
>> > > > seen n*5 messages consumed). The problem appears to go away if we
>> > > > turn on auto commit (and committing offsets to Kafka helped too),
>> > > > but auto commit conflicts with our offset rollback logic. The issue
>> > > > seems to happen more in our test environment on a lower-cost cloud
>> > > > provider.
>> > > >
>> > > > Diving into the Java and Scala classes, including the
>> > > > ConsumerIterator, it's not obvious what event causes a duplicate
>> > > > offset to be requested or returned (there's even a loop in that
>> > > > class that is supposed to exclude duplicate messages). I tried
>> > > > turning on trace logging, but my log4j config isn't getting the
>> > > > Kafka client logs to write out.
>> > > >
>> > > > Does anyone have suggestions of where to look or how to enable
>> > > > logging?
>> > > >
>> > > > Thanks,
>> > > > Cliff
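P.S. For anyone skimming the thread, the consume-and-commit flow from my original message boils down to roughly this (a simplified sketch; streams and connector come from the earlier sketch, and n and process() are placeholders for our batch size and message handling):

    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.ConsumerTimeoutException;
    import kafka.consumer.KafkaStream;
    import kafka.message.MessageAndMetadata;

    // 1. Start the high-level consumer and get an iterator
    //    (connector/streams built as in the earlier sketch).
    KafkaStream<byte[], byte[]> stream = streams.get("our-topic").get(0);
    ConsumerIterator<byte[], byte[]> it = stream.iterator();

    // 2. Consume n items from the iterator.
    long consumed = 0;
    while (consumed < n) {
        try {
            MessageAndMetadata<byte[], byte[]> record = it.next();
            process(record.message());   // placeholder for our message handling
            consumed++;
        } catch (ConsumerTimeoutException e) {
            // no message within consumer.timeout.ms; we handle this and keep going
        }
    }

    // 3. Commit the new offsets manually (auto.commit.enable=false).
    connector.commitOffsets();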