On Thu, Jan 5, 2017 at 11:30 AM, Phillip Mann <pm...@trulia.com> wrote:
> I am working on setting up a Kafka Connect Distributed Mode application
> which will be a Kafka to S3 pipeline. I am using Kafka 0.10.1.0-1 and
> Kafka Connect 3.1.1-1. So far things are going smoothly, but one aspect
> that is important to the larger system I am working with requires knowing
> offset information of the Kafka -> FileSystem pipeline. According to the
> documentation, the offset.storage.topic configuration will be the location
> the distributed mode application uses for storing offset information. This
> makes sense given how Kafka stores consumer offsets in the 'new' Kafka.
> However, after doing some testing with the FileStreamSinkConnector,
> nothing is being written to my offset.storage.topic, which is the default
> value: connect-offsets.

The documentation may need to be clarified about this -- this same question
has come up at least twice in the past 2 days. The offset.storage.topic is
actually only for source connectors. Source connectors need to define their
own format for offsets since we can't make assumptions about how source
systems define an offset in their stream of data. Sink connectors are just
reading data out of Kafka, and there is already a good mechanism for
tracking offsets, so we use that.

> To be specific, I am using a Python Kafka producer to push data to a topic
> and using Kafka Connect with the FileStreamSinkConnector to output the
> data from the topic to a file. This works and behaves as I expect the
> connector to behave. Additionally, when I stop the connector and start the
> connector, the application remembers the state in the topic and there is
> no data duplication. However, when I go to the offset.storage.topic to see
> what offset metadata is stored, there is nothing in the topic.
>
> This is the command that I use:
>
> kafka-console-consumer --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 --topic connect-offsets --from-beginning
>
> I receive this message after letting this command run for a minute or so:
>
> Processed a total of 0 messages
>
> So to summarize, I have 2 questions:
>
> 1. Why is offset metadata not being written to the topic that should be
> storing this, even though my distributed application is keeping state
> correctly?
>
> 2. How do I access offset metadata information for a Kafka Connect
> distributed mode application? This is 100% necessary for my team's Lambda
> Architecture implementation of our system.

We want to improve this (to expose both read and write operations, since
it's also sometimes useful to be able to manually reset committed offsets):
https://issues.apache.org/jira/browse/KAFKA-3820

For sink connectors, however, you can still get this information directly.
You can use the consumer offset checker to look up offsets for any consumer
group. For Connect, the consumer group for a connector will be
connect-<connector_name>.

-Ewen

> Thanks for the help.
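
As a concrete sketch of the lookup Ewen describes, assuming the sink
connector was registered under the (hypothetical) name local-file-sink, the
consumer group to inspect would be connect-local-file-sink:

kafka-consumer-groups --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 --describe --group connect-local-file-sink

This prints the committed offset, log-end offset, and lag for each topic
partition assigned to the connector's consumer group. On 0.10.x brokers the
tool may also need the --new-consumer flag so that it reads committed
offsets from the __consumer_offsets topic rather than from ZooKeeper.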