> You are also correct and perceptive to notice that if you check the end of
> the log, then begin consuming and read up to that point, compaction may
> have already kicked in (if the reading takes a while), and hence you might
> have an incomplete snapshot.

Isn't it sufficient to repeat the end-of-log check after reading the log,
and keep repeating until you are truly done? At least for the purposes of
a snapshot?
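
Concretely, the loop I have in mind is something like this (a rough sketch
in Java; fetchLatestOffsets() and consumeUpTo() are hypothetical helpers
standing in for the offset lookup and consume steps discussed below):

    // Snapshot loop: read to the last-observed end offsets, then
    // re-check. If the end moved while we were reading (new writes,
    // or updates re-appended with existing keys), go around again;
    // stop only when a pass ends exactly where the last check said.
    Map<Integer, Long> target = fetchLatestOffsets();   // hypothetical
    while (true) {
        consumeUpTo(target);                            // hypothetical
        Map<Integer, Long> latest = fetchLatestOffsets();
        if (latest.equals(target))
            break;        // end did not move: snapshot complete
        target = latest;  // end moved: read the new messages too
    }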

On Wed, Feb 18, 2015 at 02:21:49PM -0800, Jay Kreps wrote:
> If you catch up off a compacted topic and keep consuming then you will
> become consistent with the log.
> 
> I think what you are saying is that you want to create a snapshot from the
> Kafka topic but NOT do continual reads after that point. For example, you
> might be creating a backup of the data to a file.
> 
> I agree that this isn't as easy as it could be. As you say, the only
> solution we have is that timeout, which doesn't differentiate between a GC
> stall in your process and no more messages left, so you would need to tune
> the timeout. This is admittedly kind of a hack.
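> 
> For reference, the hack looks roughly like this with the 0.8.x high-level
> consumer (a minimal sketch; the topic name "content" and the 10s timeout
> are made-up values you would tune for your own setup):
> 
>     import kafka.consumer.Consumer;
>     import kafka.consumer.ConsumerConfig;
>     import kafka.consumer.ConsumerTimeoutException;
>     import kafka.consumer.KafkaStream;
>     import kafka.javaapi.consumer.ConsumerConnector;
>     import kafka.message.MessageAndMetadata;
>     import java.util.Collections;
>     import java.util.Properties;
> 
>     Properties props = new Properties();
>     props.put("zookeeper.connect", "localhost:2181");
>     props.put("group.id", "snapshot-reader");
>     // The hack: treat "no message for 10s" as "end of topic".
>     // Too low and a GC stall ends the snapshot early; too high
>     // and every snapshot waits out the full timeout at the end.
>     props.put("consumer.timeout.ms", "10000");
> 
>     ConsumerConnector connector =
>         Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
>     KafkaStream<byte[], byte[]> stream = connector
>         .createMessageStreams(Collections.singletonMap("content", 1))
>         .get("content").get(0);
>     try {
>         for (MessageAndMetadata<byte[], byte[]> msg : stream) {
>             // ... append msg.message() to the snapshot file ...
>         }
>     } catch (ConsumerTimeoutException e) {
>         // No message within consumer.timeout.ms: assume end reached.
>     }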
> 
> You are also correct and perceptive to notice that if you check the end of
> the log, then begin consuming and read up to that point, compaction may
> have already kicked in (if the reading takes a while), and hence you might
> have an incomplete snapshot.
> 
> I think there are two features we could add that would make this easier:
> 1. Make the cleaner point configurable on a per-topic basis. This feature
> would allow you to control how long the full log is retained and when
> compaction can kick in. This would give a configurable SLA for the reader
> process to catch up.
> 2. Make the log end offset available more easily in the consumer.
> 
> -Jay
> 
> 
> 
> On Wed, Feb 18, 2015 at 10:18 AM, Will Funnell <w.f.funn...@gmail.com>
> wrote:
> 
> > We are currently using Kafka 0.8.1.1 with log compaction in order to
> > provide streams of messages to our clients.
> >
> > As well as constantly consuming the stream, one of our use cases is to
> > provide a snapshot, meaning the user will receive a copy of every message
> > at least once.
> >
> > Each one of these messages represents an item of content in our system.
> >
> >
> > The problem comes when determining if the client has actually reached the
> > end of the topic.
> >
> > The standard Kafka way of dealing with this seems to be the
> > ConsumerTimeoutException, but we frequently get this exception when the
> > end of the topic has not been reached, or it may take a long time before
> > a timeout naturally occurs.
> >
> >
> > At first glance it would seem possible to look up the max offset for
> > each partition when you begin consuming, stopping when this position is
> > reached.
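> >
> > That lookup is possible today with the SimpleConsumer offset API; a
> > minimal sketch (broker host, topic "content" and partition 0 are
> > placeholder values):
> >
> >     import kafka.api.PartitionOffsetRequestInfo;
> >     import kafka.common.TopicAndPartition;
> >     import kafka.javaapi.OffsetResponse;
> >     import kafka.javaapi.consumer.SimpleConsumer;
> >     import java.util.HashMap;
> >     import java.util.Map;
> >
> >     // Ask the partition leader for its current end offset.
> >     SimpleConsumer consumer = new SimpleConsumer(
> >         "broker-host", 9092, 100000, 64 * 1024, "offset-check");
> >     TopicAndPartition tp = new TopicAndPartition("content", 0);
> >     Map<TopicAndPartition, PartitionOffsetRequestInfo> info =
> >         new HashMap<>();
> >     info.put(tp, new PartitionOffsetRequestInfo(
> >         kafka.api.OffsetRequest.LatestTime(), 1));
> >     OffsetResponse response = consumer.getOffsetsBefore(
> >         new kafka.javaapi.OffsetRequest(
> >             info, kafka.api.OffsetRequest.CurrentVersion(),
> >             "offset-check"));
> >     long endOffset = response.offsets("content", 0)[0];
> >     consumer.close();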
> >
> > But log compaction means that if an update to a piece of content arrives
> > with the same message key, it will be written to the end of the log (and
> > the earlier copy may be compacted away before you reach it), so the
> > snapshot will be incomplete.
> >
> >
> > Another thought is to make use of the cleaner point. Currently Kafka
> > writes a "cleaner-offset-checkpoint" file in each data directory after
> > log compaction completes.
> >
> > If the consumer were able to access the cleaner-offset-checkpoint, you
> > would be able to consume up to this point, then check that the point was
> > still the same and compaction had not yet occurred, and therefore
> > determine that you had received everything at least once. (Assuming
> > there was no race condition between compaction and writing to the file.)
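> >
> > To make that concrete, a sketch of parsing the file, assuming the
> > broker's checkpoint layout (a version line, an entry-count line, then
> > one "topic partition offset" line per entry); the path is a placeholder,
> > and since this is an internal broker file the format is not a public
> > contract:
> >
> >     import java.io.IOException;
> >     import java.nio.file.Files;
> >     import java.nio.file.Paths;
> >     import java.util.HashMap;
> >     import java.util.List;
> >     import java.util.Map;
> >
> >     List<String> lines = Files.readAllLines(
> >         Paths.get("/data/kafka/cleaner-offset-checkpoint"));
> >     if (Integer.parseInt(lines.get(0).trim()) != 0)
> >         throw new IOException("unexpected checkpoint format version");
> >     int entries = Integer.parseInt(lines.get(1).trim());
> >     Map<String, Long> cleanerPoints = new HashMap<>();
> >     for (int i = 0; i < entries; i++) {
> >         String[] f = lines.get(2 + i).trim().split("\\s+");
> >         // value = first offset the cleaner has NOT yet compacted
> >         cleanerPoints.put(f[0] + "-" + f[1], Long.parseLong(f[2]));
> >     }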
> >
> >
> > Has anybody got any thoughts?
> >
> > Will
> >
