Hi John,

Kafka offsets are sequential id numbers that identify messages within each partition. They are sequential only within a partition, not across a topic (which can have multiple partitions): each partition numbers its own messages.
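To make the per-partition numbering concrete, here is a rough Python sketch. It is a toy model, not Kafka code: the Partition class and its methods are made up for illustration. It assumes a single partition to which 30024 messages were appended before retention removed everything below offset 26524, an assumption chosen to match the numbers in your mails.

```python
# Toy model of one Kafka partition (hypothetical names, not a Kafka API).
# Each appended message gets the next sequential offset; retention removes
# the oldest messages without renumbering the ones that remain.
class Partition:
    def __init__(self):
        self.next_offset = 0   # offset the next appended message will get
        self.messages = {}     # offset -> payload, retained messages only

    def append(self, payload):
        offset = self.next_offset
        self.messages[offset] = payload
        self.next_offset += 1
        return offset

    def delete_before(self, offset):
        """Simulate retention: drop every message with a smaller offset."""
        for old in [o for o in self.messages if o < offset]:
            del self.messages[old]

    def earliest(self):
        return min(self.messages) if self.messages else self.next_offset

    def latest(self):
        return self.next_offset

    def read_from(self, offset):
        # Simplification of this toy: a start offset below the earliest
        # retained one is clamped to the earliest retained offset.
        start = max(offset, self.earliest())
        return [self.messages[o] for o in range(start, self.next_offset)]


partition = Partition()
for i in range(30024):            # assumed total ever written
    partition.append(f"doc-{i}")
partition.delete_before(26524)    # assumed retention cut-off

earliest = partition.earliest()
latest = partition.latest()
retained = latest - earliest
from_26624 = len(partition.read_from(26624))
print(earliest, latest, retained, from_26624)  # 26524 30024 3500 3400
```

In this model the difference between the latest and earliest offsets is the number of retained messages (3500), and starting 100 offsets later yields 100 fewer messages (3400), which is the "one number per message" behaviour you observed.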
Offsets don't necessarily start at 0, since old messages are deleted. bin/kafka-run-class.sh kafka.tools.GetOffsetShell is pretty neat for looking at the offsets in your topic.

I'm not sure why resetting the offset is needed in your case. If you need to read from the beginning using the high-level consumer, you just need to delete that consumer group in ZooKeeper and set "auto.offset.reset" to "smallest". (This directs the consumer to start from the smallest offset if it doesn't find one in ZooKeeper.)

On Wed, Feb 17, 2016 at 1:06 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:

> Hmmm... more info.
>
> So, inside /var/log/kafka-logs/myTopicName-0 I find two files
>
> 00000000000000026524.index 00000000000000026524.log
>
> Interestingly, they both bear the number of the "lowest" offset returned by
> the command I mention above.
>
> If I "cat" the 000.....26524.log file, I get all my messages on the
> commandline as if I'd issued the --from-beginning command
>
> I'm not sure what the index has, it's unreadable by the simple tools I've
> tried....
>
> I'm still scratching my head a bit - as the link you sent for Kafka
> introduction says this:
>
> The messages in the partitions are each assigned a sequential id number
> called the *offset* that uniquely identifies each message within the
> partition.
>
> I see how that could be exactly what you said (the previous message(s) byte
> count) -- but the picture implies that it's a linear progression - 1,2,3
> etc... (and that could be an oversimplification for purposes of the
> introduction - I get that...)
>
> Feel free to comment or not - I'm going to keep digging into it as best I
> can - any clarifications will be gratefully accepted...
>
> On Wed, Feb 17, 2016 at 1:50 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
>
> > Thank you Christian -- I appreciate your taking the time to help me out on
> > this.
> >
> > Here's what I found while continuing to dig into this.
> >
> > If I take 30024 and subtract the number of messages I know I have in Kafka
> > (3500) I get 26524.
> >
> > If I reset thus: set /kafka/consumers/myGroupName/offsets/myTopicName/0 26524
> >
> > ... and then re-run my consumer - I get all 3500 messages again.
> >
> > If I do this: set /kafka/consumers/myGroupName/offsets/myTopicName/0 26624
> >
> > In other words, I increase the offset number by 100 -- then I get exactly
> > 3400 messages on my consumer -- exactly 100 less than before which I think
> > makes sense, since I started the offset 100 higher...
> >
> > This seems to suggest that each number between 26624 and 30024 in the log
> > represents one of my 3500 messages on this topic, but what you say suggests
> > that they represent byte count of the actual messages and not "one number
> > per message"...
> >
> > I also find that if I issue this command:
> >
> > bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic=myTopicName
> > --broker-list=192.168.56.3:9092 --time=-2
> >
> > I get back that same number -- 26524...
> >
> > Hmmmm.... A little confused still... These messages are literally stored
> > in the Kafka logs, yes? I think I'll go digging in there and see...
> >
> > Thanks again!
> >
> > On Wed, Feb 17, 2016 at 12:38 PM, Christian Posta <christian.po...@gmail.com> wrote:
> >
> >> The number is the log-ordered number of bytes. So really, the offset is
> >> kinda like the "number of bytes" to begin reading from. 0 means read the
> >> log from the beginning. The second message is 0 + size of message. So the
> >> message "ids" are really just the offset of the previous message sizes.
> >>
> >> For example, if I have three messages of 10 bytes each, and set the
> >> consumer offset to 0, i'll read everything. If you set the offset to 10,
> >> I'll read the second and third messages, and so on.
> >>
> >> see more here:
> >> http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
> >> and here: http://kafka.apache.org/documentation.html#introduction
> >>
> >> HTH!
> >>
> >> On Wed, Feb 17, 2016 at 12:16 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
> >>
> >> > *Use Case: Disaster Recovery & Re-indexing SOLR*
> >> >
> >> > I'm using Kafka to hold messages from a service that prepares "documents"
> >> > for SOLR.
> >> >
> >> > A second micro service (a consumer) requests these messages, does any final
> >> > processing, and fires them into SOLR.
> >> >
> >> > The whole thing is (in part) designed to be used for disaster recovery -
> >> > allowing the rebuild of the SOLR index in the shortest possible time.
> >> >
> >> > To do this (and to be able to use it for re-indexing SOLR while testing
> >> > relevancy) I need to be able to "play all messages from the beginning" at
> >> > will.
> >> >
> >> > I find I can use the zkCli.sh tool to delete the Consumer Group Name like
> >> > this:
> >> > rmr /kafka/consumers/myGroupName
> >> >
> >> > After which my microservice will get all the messages again when it runs.
> >> >
> >> > I was trying to find a way to do this programmatically without actually
> >> > using the "low level" consumer api since the high level one is so simple
> >> > and my code already works. So I started playing with Zookeeper api for
> >> > duplicating "rmr /kafka/consumers/myGroupName"
> >> >
> >> > *The Question: What does that offset actually represent?*
> >> >
> >> > It was at this point that I discovered the offset must represent something
> >> > other than what I thought it would. Things obviously work, but I'm
> >> > wondering what - exactly do the offsets represent?
> >> >
> >> > To clarify - if I run this command on a zookeeper node, after the
> >> > microservice has run:
> >> > get /kafka/consumers/myGroupName/offsets/myTopicName/0
> >> >
> >> > I get the following:
> >> >
> >> > 30024
> >> > cZxid = 0x3600000355
> >> > ctime = Fri Feb 12 07:27:50 MST 2016
> >> > mZxid = 0x3600000357
> >> > mtime = Fri Feb 12 07:29:50 MST 2016
> >> > pZxid = 0x3600000355
> >> > cversion = 0
> >> > dataVersion = 2
> >> > aclVersion = 0
> >> > ephemeralOwner = 0x0
> >> > dataLength = 5
> >> > numChildren = 0
> >> >
> >> > Now - I have exactly 3500 messages in this Kafka topic. I verify that by
> >> > running this command:
> >> > bin/kafka-console-consumer.sh --zookeeper 192.168.56.5:2181/kafka
> >> > --topic myTopicName --from-beginning
> >> >
> >> > When I hit Ctrl-C, it tells me it consumed 3500 messages.
> >> >
> >> > So - what does that 30024 actually represent? If I reset that number to 1
> >> > or 0 and re-run my consumer microservice, I get all the messages again -
> >> > and the number again goes to 30024. However, I'm not comfortable to trust
> >> > that because my assumption that the number represents a simple count of
> >> > messages that have been sent to this consumer is obviously wrong.
> >> >
> >> > (I reset the number like this -- to 1 -- and assume there's an API command
> >> > that will do it too.)
> >> > set /kafka/consumers/myGroupName/offsets/myTopicName/0 1
> >> >
> >> > Can someone help me clarify or point me at a doc that explains what is
> >> > getting counted here? You can shoot me if you like for attempting the
> >> > hack-ish solution of re-setting the offset through the Zookeeper API, but I
> >> > would still like to understand what, exactly, is represented by that number
> >> > 30024.
> >> >
> >> > I need to hand off to IT for the Disaster Recovery portion and saying
> >> > "trust me, it just works" isn't going to fly very far...
> >> >
> >> > Thanks.
> >> >
> >>
> >> --
> >> *Christian Posta*
> >> twitter: @christianposta
> >> http://www.christianposta.com/blog
> >> http://fabric8.io
> >>

--
"Dream no small dreams for they have no power to move the hearts of men."

Johann Wolfgang von Goethe