OK, still really struggling with this. We have sped up the flush time quite a bit but are still failing. We are seeing all three group members get partitions and tasks assigned, and then the tasks start dropping off.
The log line that I think indicates what is wrong is this:

>[Thread-9] INFO Marking the coordinator 1932295911 dead. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)

I strongly believe this is not the SinkTask consumer (the SinkTask sets the
thread name to the task name) but the WorkGroupCoordinator. Can someone help
me understand how the WorkGroupCoordinator works? It seems there is some
election that happens and one of the connector hosts is chosen as the leader.
I believe this log message indicates that the elected group coordinator did
not respond to the heartbeat in time.

Searching through the source code I cannot figure out how the election happens
or where the heartbeat response is generated. Does anyone have guidance on
where to look or how to debug? Grasping at straws at this moment.

On Fri, Apr 15, 2016 at 10:36 AM Scott Reynolds <sreyno...@twilio.com> wrote:

> Awesome, that is what I thought. The answer seems simple: speed up flush :-D,
> which we should be able to do.
>
> On Fri, Apr 15, 2016 at 10:15 AM Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Scott,
>>
>> It seems that your flush takes longer than consumer.session.timeout.ms.
>> The consumers used in SinkTasks for a SinkConnector are in the same
>> consumer group. If your flush method takes longer than
>> consumer.session.timeout.ms, the consumer for a SinkTask may be kicked
>> out by the coordinator.
>>
>> In this case, you may want to increase consumer.session.timeout.ms or
>> have some timeout mechanism in the implementation of the flush method to
>> return control back to the framework so that it can send a heartbeat to
>> the coordinator.
>>
>> Thanks,
>> Liquan
>>
>> On Fri, Apr 15, 2016 at 9:56 AM, Scott Reynolds <sreyno...@twilio.com>
>> wrote:
>>
>> > List,
>> >
>> > We are struggling with Kafka Connect settings. The processes start up,
>> > handle a bunch of messages, and flush. Then slowly the group coordinator
>> > removes them.
>> >
>> > This has to be an interplay between Connect's flush interval and the
>> > call to poll for each of these tasks. Here are my current settings that
>> > I think are relevant.
>> >
>> > Any insights someone could share with us?
>> >
>> > # On shutdown, wait this long for the tasks to finish their flush.
>> > task.shutdown.graceful.timeout.ms=600000
>> >
>> > # Flush records to S3 every half hour.
>> > offset.flush.interval.ms=1800000
>> >
>> > # Max time to wait for flushing to finish. Wait at *most* this long
>> > # every offset.flush.interval.ms.
>> > offset.flush.timeout.ms=600000
>> >
>> > # Take your time on session timeouts. We do a lot of work. These control
>> > # the length of time a lock on a TopicPartition can be held by the
>> > # coordinator broker.
>> > session.timeout.ms=180000
>> > request.timeout.ms=190000
>> > consumer.session.timeout.ms=180000
>> > consumer.request.timeout.ms=190000
>>
>>
>> --
>> Liquan Pei
>> Software Engineer, Confluent Inc
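
While I wait for pointers, here is how I am currently reading the interplay of
the settings above. This is just my working model, and it assumes the
consumer.* overrides in the worker config are what the SinkTask consumers
actually run with:

# The worker asks each task to flush every 30 minutes and is willing to wait
# up to 10 minutes for a single flush to complete.
offset.flush.interval.ms=1800000
offset.flush.timeout.ms=600000

# But a task's consumer only stays in the group if the coordinator sees a
# heartbeat within the session timeout, and the framework can only send one
# once flush returns control to it. So any flush that blocks for more than
# 3 minutes means a missed heartbeat, the member gets kicked out, and the
# group rebalances, which looks exactly like the tasks dropping off that we
# are seeing.
consumer.session.timeout.ms=180000
consumer.request.timeout.ms=190000

# So either the worst-case flush has to finish comfortably inside 180000 ms,
# or the session timeout has to be raised above the worst-case flush time,
# e.g. something like the following (untested on our side, and the brokers'
# group.max.session.timeout.ms has to allow it):
#consumer.session.timeout.ms=660000
#consumer.request.timeout.ms=670000

Corrections to the above very welcome.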