Added this to our FAQ - https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactlyonemessagingfromKafka ?
On Mon, Feb 10, 2014 at 9:46 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> The out-of-the-box support for this in Kafka isn't great right now.
>
> Exactly-once semantics has two parts: avoiding duplication during data
> production and avoiding duplicates during data consumption.
>
> There are two approaches to getting exactly-once semantics during data
> production:
>
> 1. Use a single writer per partition, and every time you get a network
> error check the last message in that partition to see if your last write
> succeeded (see the first sketch after this message).
> 2. Include a primary key (UUID or something) in the message and
> deduplicate on the consumer (see the second sketch).
>
> If you do one of these things, the log that Kafka hosts will be
> duplicate-free. However, reading without duplicates depends on some
> co-operation from the consumer too. If the consumer is periodically
> checkpointing its position, then when it fails and restarts it will
> resume from the checkpointed position. Thus if the data output and the
> checkpoint are not written atomically, it will be possible to get
> duplicates here as well. This problem is particular to your storage
> system. For example, if you are using a database you could commit these
> together in a transaction. The HDFS loader Camus that LinkedIn wrote
> does something like this for Hadoop loads. The other alternative that
> doesn't require a transaction is to store the offset with the data
> loaded and deduplicate using the topic/partition/offset combination
> (both patterns are sketched after this message).
>
> I think there are two improvements that would make this a lot easier:
>
> 1. I think producer idempotence is something that could be done
> automatically and much more cheaply by optionally integrating support
> for this on the server.
> 2. The existing high-level consumer doesn't expose a lot of the more
> fine-grained control of offsets (e.g. to reset your position). We will
> be working on that soon.
>
> -Jay
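Below is a minimal sketch of Jay's first approach, written against the current Java clients (these APIs postdate this 2014 thread). The sequence-number framing of the message value and all class and field names here are illustrative assumptions, not an established protocol:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

// Approach 1: a single writer per partition keeps its own sequence number.
// After an ambiguous send failure it rereads the last record in the
// partition to decide whether the write actually landed.
public class SingleWriterProducer {
    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> checker; // used only to reread the tail
    private final TopicPartition tp;
    private long seq = 0; // illustrative per-partition sequence number

    SingleWriterProducer(Properties producerProps, Properties checkerProps, TopicPartition tp) {
        this.producer = new KafkaProducer<>(producerProps);
        this.checker = new KafkaConsumer<>(checkerProps);
        this.tp = tp;
        checker.assign(Collections.singletonList(tp));
    }

    void send(String payload) {
        String value = seq + ":" + payload; // embed the sequence number in the value
        try {
            producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), null, value)).get();
            seq++;
        } catch (Exception ambiguousFailure) {
            // The write may or may not have landed. Because we are the only
            // writer on this partition, the last record tells us which.
            if (lastWrittenSeq() == seq) {
                seq++;          // it landed: do not resend
            } else {
                send(payload);  // it was lost: retry
            }
        }
    }

    private long lastWrittenSeq() {
        checker.seekToEnd(Collections.singletonList(tp));
        long end = checker.position(tp);
        if (end == 0) return -1;   // empty partition, nothing ever landed
        checker.seek(tp, end - 1); // position on the final record
        long last = -1;
        for (ConsumerRecord<String, String> r : checker.poll(Duration.ofSeconds(5)))
            last = Long.parseLong(r.value().split(":", 2)[0]);
        return last;
    }
}
```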
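The second approach in sketch form: the producer attaches a UUID key and the consumer skips keys it has already seen. The producer's internal retries resend the same record, so the key stays stable across retries. The in-memory seen-set is an assumption made for brevity; a real deployment would persist it and bound its size:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UuidDedupe {
    // Producer side: attach a UUID key to every message. The client's own
    // retries resend the identical record, so duplicates share the same key.
    static void produce(KafkaProducer<String, String> producer, String topic, String payload) {
        String id = UUID.randomUUID().toString();
        producer.send(new ProducerRecord<>(topic, id, payload));
    }

    // Consumer side: process a message only the first time its key is seen.
    static void consume(KafkaConsumer<String, String> consumer, String topic) {
        Set<String> seen = new HashSet<>(); // would be a persistent store in practice
        consumer.subscribe(Collections.singletonList(topic));
        while (true) {
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                if (seen.add(r.key())) { // add() returns false for duplicates
                    process(r.value());
                }
            }
        }
    }

    static void process(String value) { /* application logic */ }
}
```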
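For the consumer side, here is a sketch of the transactional pattern Jay describes for databases: commit each record and the new position in one transaction, so a crash can never separate the data from the checkpoint. The table and column names are invented for illustration, and the consumer is assumed to manage its own offsets (enable.auto.commit=false):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Checkpoint the consumer's position in the same database transaction as
// the data it loaded. The "events" and "checkpoints" tables are assumed.
public class TransactionalLoader {
    static void run(KafkaConsumer<String, String> consumer, Connection db,
                    TopicPartition tp, long storedOffset) throws Exception {
        consumer.assign(Collections.singletonList(tp));
        consumer.seek(tp, storedOffset); // resume from our own checkpoint, not Kafka's
        db.setAutoCommit(false);
        while (true) {
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                try (PreparedStatement data = db.prepareStatement(
                         "INSERT INTO events(payload) VALUES (?)");
                     PreparedStatement ckpt = db.prepareStatement(
                         "UPDATE checkpoints SET next_offset = ? WHERE topic = ? AND part = ?")) {
                    data.setString(1, r.value());
                    data.executeUpdate();
                    ckpt.setLong(1, r.offset() + 1);
                    ckpt.setString(2, tp.topic());
                    ckpt.setInt(3, tp.partition());
                    ckpt.executeUpdate();
                    db.commit(); // data and position land atomically, or neither does
                }
            }
        }
    }
}
```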
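And the non-transactional alternative Jay mentions, deduplicating on the topic/partition/offset combination by storing those coordinates with each row and letting a unique constraint reject replays. The ON CONFLICT clause is PostgreSQL-specific and the schema is assumed:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Assumed (illustrative) schema:
//   CREATE TABLE events (topic TEXT, part INT, msg_offset BIGINT, payload TEXT,
//                        UNIQUE (topic, part, msg_offset));
public class IdempotentInsert {
    static void insert(Connection db, ConsumerRecord<String, String> r) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO events(topic, part, msg_offset, payload) VALUES (?, ?, ?, ?) " +
                "ON CONFLICT DO NOTHING")) { // PostgreSQL syntax; an assumption
            ps.setString(1, r.topic());
            ps.setInt(2, r.partition());
            ps.setLong(3, r.offset());
            ps.setString(4, r.value());
            ps.executeUpdate(); // a replayed (topic, part, msg_offset) is silently skipped
        }
    }
}
```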
>
> On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington
> <g.turking...@improvedigital.com> wrote:
>
> > Hi,
> >
> > I've been doing some prototyping on Kafka for a few months now and
> > like what I see. It's a good fit for some of my use cases in the areas
> > of data distribution, but also for processing - I like a lot of what I
> > see in Samza. I'm now working through some of the operational issues
> > and have a question for the community.
> >
> > I have several data sources that I want to push into Kafka, but some
> > of the most important are arriving as a stream of files being dropped
> > either into an SFTP location or S3. Conceptually the data is really a
> > stream, but it's being chunked and made more batch-like by the
> > deployment model of the operational servers. So pulling the data into
> > Kafka and seeing it as more of a stream again is a big plus.
> >
> > But I really don't want duplicate messages. I know Kafka provides
> > at-least-once semantics, and that's fine; I'm happy to have the
> > de-dupe logic external to Kafka. And if I look at my producer, I can
> > build a protocol around adding record metadata and using ZooKeeper to
> > give me pretty high confidence that my clients will know whether they
> > are reading from a file that was fully published into Kafka or not.
> >
> > I had assumed that this wouldn't be a unique use case, but after doing
> > a bunch of searches I really don't find much in terms of either tools
> > that help or even just best-practice patterns for handling this type
> > of need to support exactly-once message processing.
> >
> > So now I'm thinking that either I just need better web search skills,
> > or this actually isn't something many others are doing - and if so,
> > there's likely a reason for that.
> >
> > Any thoughts?
> >
> > Thanks
> > Garry