[ https://issues.apache.org/jira/browse/KAFKA-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973509#comment-15973509 ]
Chris Riccomini commented on KAFKA-5084:
----------------------------------------

[~rhauch], hah, this does indeed look like a dupe of KAFKA-3821. I'll close this then.

> Allow Kafka connect source tasks to commit offsets without messages being sent
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-5084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5084
>             Project: Kafka
>          Issue Type: New Feature
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.0
>            Reporter: Chris Riccomini
>
> We are currently running [Debezium|http://debezium.io/] connectors in Kafka connect. These connectors consume from MySQL's binlog and produce into Kafka.
> One of the things we've observed is that some of our Debezium connectors are not honoring the {{offset.flush.interval.ms}} setting (which is set to 60 seconds). Some of our connectors seem to be committing only sporadically. For low-volume connectors, the commits seem to happen once every hour or two, and sometimes even less often.
> It sounds like the issue is that Kafka connect will only commit source task offsets when the source task produces new source records, because Kafka connect gets the offset to commit from an incoming source record. The problem with this approach is that there are (in my opinion) valid reasons to want to commit consumed offsets WITHOUT sending any new messages. Taking Debezium as an example, there are cases where Debezium consumes messages but filters them out based on a regex or filter rule (e.g. table blacklists). In such a case, Debezium is consuming messages from MySQL's binlog and dropping them before they ever reach the Kafka connect framework. As such, Kafka connect never sees these messages and doesn't commit any progress.
> This results in several problems:
> # In the event of a failure, the connector could fall WAY back, since the last committed offset might be from hours ago, even though it *has* processed all recent messages--it just hasn't sent anything to Kafka.
> # For connectors like Debezium that consume from a source with a *limited* window to fetch messages (MySQL's binlog has time/size-based retention), you can actually fall off the edge of the binlog, because the last commit can be farther back than the binlog goes, even though Debezium has fetched every single message in the binlog--it just hasn't produced anything due to filtering.
> Again, I don't see this as a Debezium-specific issue. I could imagine a similar scenario with an [SST-based Cassandra source|https://github.com/datamountaineer/stream-reactor/issues/162].
> It would be nice if Kafka connect allowed us a way to commit offsets for source tasks even when messages haven't been sent recently. This would allow source tasks to log their progress even if they're opting not to send messages to Kafka due to filtering or for some other reason.
> (See https://issues.jboss.org/browse/DBZ-220 for more context.)
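For illustration, here is a minimal sketch (not Debezium's actual code) of a filtering {{SourceTask}} that hits this problem. Class and method names such as {{FilteringBinlogTask}}, {{readBatchFromBinlog}}, the {{"db-events"}} topic, and the offset/partition keys are assumptions made up for the example; only the Kafka Connect API calls are real. The point is that the framework learns offsets solely from the {{SourceRecord}} objects returned by {{poll()}}, so events consumed but filtered out advance nothing.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Hypothetical filtering source task; the "binlog" plumbing is a stand-in, not Debezium code. */
public class FilteringBinlogTask extends SourceTask {

    /** Stand-in for a consumed change event. */
    static final class Event {
        final String table;
        final long position;
        final String payload;
        Event(String table, long position, String payload) {
            this.table = table;
            this.position = position;
            this.payload = payload;
        }
    }

    // Assumed source-partition key for this example.
    private final Map<String, String> sourcePartition =
            Collections.singletonMap("server", "mysql-1");

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // Connect to the source here; omitted in this sketch.
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        for (Event event : readBatchFromBinlog()) {
            // The task HAS consumed this event and advanced its own position...
            if (isBlacklisted(event.table)) {
                continue; // ...but a filtered event yields no SourceRecord, hence no offset.
            }
            records.add(new SourceRecord(
                    sourcePartition,
                    Collections.singletonMap("pos", event.position), // the offset Connect will commit
                    "db-events",
                    Schema.STRING_SCHEMA,
                    event.payload));
        }
        // If every event in the batch was blacklisted, this list is empty: the framework
        // never sees an offset for the work done here, so offset.flush.interval.ms has
        // nothing new to flush and the last committed offset stays where the last
        // unfiltered record left it.
        return records;
    }

    @Override
    public void stop() {
    }

    private List<Event> readBatchFromBinlog() {
        // Placeholder for actually tailing MySQL's binlog.
        return Collections.singletonList(new Event("audit_log", 42L, "{}"));
    }

    private boolean isBlacklisted(String table) {
        return table.equals("audit_log"); // e.g. a table blacklist rule
    }
}
{code}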