Thanks, Felix, for sharing your work. The contrib hadoop-consumer looks like it works the same way.
I think I need to really understand this offset stuff. So far I have used only the high-level consumer. When the consumer was done reading all the messages, I used to kill the process (because it won't exit on its own). Then I used the producer to pump more messages and a consumer to read the new ones (a new process, since I had killed the last consumer). But I never saw messages getting duplicated. So it is still not clear to me how offsets are tracked when I relaunch the consumer, and why the retention policy does not seem to work when I use the SimpleConsumer. For my experiment I set it to 4 hours.

Please help me understand.

Thanks,
Navneet

On Tue, Jan 15, 2013 at 4:12 AM, Felix GV <fe...@mate1inc.com> wrote:

> I think you may be misunderstanding the way Kafka works.
>
> A Kafka broker is never supposed to clear messages just because a consumer
> read them.
>
> The Kafka broker will instead clear messages after their retention period
> ends, though it will not delete the messages at the exact time when they
> expire. Instead, a background process will periodically delete a batch of
> expired messages. The retention policies guarantee a minimum retention
> time, not an exact retention time.
>
> It is the responsibility of each consumer to keep track of which messages
> it has consumed already (by recording an offset for each consumed
> partition). The high-level consumer stores these offsets in ZK. The simple
> consumer has no built-in capability to store and manage offsets, so it is
> the developer's responsibility to do so. In the case of the hadoop consumer
> in the contrib package, these offsets are stored in offset files within
> HDFS.
>
> I wrote a blog post a while ago that explains how to use the offset files
> generated by the contrib consumer to do incremental consumption (so that
> you don't get duplicated messages by re-consuming everything in subsequent
> runs):
>
> http://felixgv.com/post/69/automating-incremental-imports-with-the-kafka-hadoop-consumer/
>
> I'm not sure how up to date this is with respect to current Kafka versions,
> but it may still give you some useful pointers...
>
> --
> Felix
>
>
> On Mon, Jan 14, 2013 at 1:34 PM, navneet sharma <
> navneetsharma0...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to use the code supplied in the hadoop-consumer package. I am
> > running into the following issues:
> >
> > 1) This code uses the SimpleConsumer, which contacts the Kafka broker
> > without ZooKeeper. Because of this, messages are not getting cleared
> > from the broker, and I am getting duplicate messages in each run.
> >
> > 2) The retention policy specified as log.retention.hours in
> > server.properties is not working. Not sure if this is due to the
> > SimpleConsumer.
> >
> > Is this expected behaviour? Is there any code using the high-level
> > consumer for the same work?
> >
> > Thanks,
> > Navneet Sharma
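To illustrate why the earlier high-level-consumer setup did not re-deliver messages across restarts: that consumer commits its position to ZooKeeper under a consumer group, so a relaunched process with the same group id resumes after the last committed message. Below is a minimal sketch, assuming a local ZooKeeper at localhost:2181 and a made-up group name; the property names shown (zk.connect, groupid) are the 0.7-era ones and were renamed in later Kafka versions.

```java
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

// Minimal sketch: the high-level consumer commits its offsets to ZooKeeper
// under the consumer group, so a new process started with the same group id
// resumes after the last committed message instead of re-reading the log.
public class HighLevelConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zk.connect", "localhost:2181");  // assumed ZooKeeper address
        props.put("groupid", "navneet-test-group"); // assumed group; offsets live in ZK per group

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // ... create message streams for the topic and iterate them here ...

        connector.shutdown(); // disconnects cleanly; committed offsets remain in ZK
    }
}
```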
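And to make the SimpleConsumer side of Felix's point concrete, here is a rough sketch of the bookkeeping the application has to do itself (and which the contrib hadoop-consumer does with offset files in HDFS): restore the last offset before fetching, then persist the new offset after processing. The file name and the fetchFrom() helper are placeholders, not real Kafka APIs; with the actual SimpleConsumer you would issue a fetch request at that point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of manual offset tracking for a SimpleConsumer-style client.
// The broker keeps messages until retention expires regardless of reads,
// so skipping already-consumed data is entirely the application's job.
public class OffsetCheckpointSketch {

    public static void main(String[] args) throws IOException {
        Path offsetFile = Paths.get("test-topic-0.offset"); // hypothetical per-partition file

        // 1) Restore the last committed offset; start from 0 on the first run.
        long offset = Files.exists(offsetFile)
                ? Long.parseLong(new String(Files.readAllBytes(offsetFile), StandardCharsets.UTF_8).trim())
                : 0L;

        // 2) Fetch and process messages starting at that offset
        //    (placeholder for the real fetch-and-process loop).
        long newOffset = fetchFrom(offset);

        // 3) Persist the new offset only after processing succeeded, so a crash
        //    before this point means re-consuming some messages, never losing them.
        Files.write(offsetFile, Long.toString(newOffset).getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical stand-in for the actual fetch-and-process logic.
    private static long fetchFrom(long offset) {
        return offset; // no new messages in this stub
    }
}
```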