Ack, nice, should have thought of doing that... -Jay
On Mon, Feb 10, 2014 at 10:12 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> Added this to our FAQ -
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactlyonemessagingfromKafka
>
> On Mon, Feb 10, 2014 at 9:46 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > The out-of-the-box support for this in Kafka isn't great right now.
> >
> > Exactly-once semantics has two parts: avoiding duplicates during data
> > production and avoiding duplicates during data consumption.
> >
> > There are two approaches to getting exactly-once semantics during data
> > production:
> >
> > 1. Use a single writer per partition, and every time you get a network
> > error, check the last message in that partition to see if your last
> > write succeeded.
> > 2. Include a primary key (a UUID or similar) in the message and
> > deduplicate on the consumer.
> >
> > If you do one of these things, the log that Kafka hosts will be
> > duplicate-free. However, reading without duplicates depends on some
> > co-operation from the consumer too. If the consumer is periodically
> > checkpointing its position, then when it fails and restarts it will
> > resume from the checkpointed position. Thus, if the data output and the
> > checkpoint are not written atomically, it will be possible to get
> > duplicates here as well. This problem is particular to your storage
> > system. For example, if you are using a database you could commit the
> > data and the checkpoint together in a transaction. The HDFS loader
> > Camus that LinkedIn wrote does something like this for Hadoop loads.
> > The other alternative, which doesn't require a transaction, is to store
> > the offset with the data loaded and deduplicate using the
> > topic/partition/offset combination.
> >
> > I think there are two improvements that would make this a lot easier:
> >
> > 1. I think producer idempotence is something that could be done
> > automatically and much more cheaply by optionally integrating support
> > for this on the server.
> > 2. The existing high-level consumer doesn't expose a lot of the more
> > fine-grained control of offsets (e.g. to reset your position). We will
> > be working on that soon.
> >
> > -Jay
> >
> > On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington <
> > g.turking...@improvedigital.com> wrote:
> >
> > > Hi,
> > >
> > > I've been doing some prototyping on Kafka for a few months now and I
> > > like what I see. It's a good fit for some of my use cases in the
> > > areas of data distribution but also for processing - I like a lot of
> > > what I see in Samza. I'm now working through some of the operational
> > > issues and have a question for the community.
> > >
> > > I have several data sources that I want to push into Kafka, but some
> > > of the most important arrive as a stream of files dropped into either
> > > an SFTP location or S3. Conceptually the data is really a stream, but
> > > it's being chunked and made more batch-like by the deployment model
> > > of the operational servers. So pulling the data into Kafka and seeing
> > > it as a stream again is a big plus.
> > >
> > > But I really don't want duplicate messages. I know Kafka provides
> > > at-least-once semantics and that's fine; I'm happy to have the
> > > de-dupe logic external to Kafka. And looking at my producer, I can
> > > build a protocol around adding record metadata and using ZooKeeper to
> > > give me pretty high confidence that my clients will know whether they
> > > are reading from a file that was fully published into Kafka or not.
> > >
> > > I had assumed that this wouldn't be a unique use case, but after
> > > doing a bunch of searches I really don't find much in terms of either
> > > tools that help or even best-practice patterns for handling this type
> > > of need to support exactly-once message processing.
> > > So now I'm thinking that either I just need better web search
> > > skills, or this actually isn't something many others are doing - and
> > > if so, there's likely a reason for that.
> > >
> > > Any thoughts?
> > >
> > > Thanks,
> > > Garry
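Jay's second producer-side option above (include a primary key such as a UUID in each message and de-duplicate on the consumer) can be sketched roughly like this. It is a standalone illustration, not real Kafka client code: the message shape, `make_message`, and `DedupingConsumer` are invented for the example.

```python
import uuid

def make_message(payload):
    """Producer side: tag each record with a unique primary key."""
    return {"key": str(uuid.uuid4()), "payload": payload}

class DedupingConsumer:
    """Consumer side: process each key at most once."""
    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, msg):
        if msg["key"] in self.seen:
            return False  # a retried/redelivered message - drop it
        self.seen.add(msg["key"])
        self.processed.append(msg["payload"])
        return True

# Simulate at-least-once delivery where one message is sent twice
# (e.g. a producer retry after a network error):
consumer = DedupingConsumer()
m1, m2 = make_message("a"), make_message("b")
for msg in (m1, m2, m1):  # m1 is delivered twice
    consumer.handle(msg)
print(consumer.processed)  # ['a', 'b']
```

Because de-duplication keys on the UUID rather than the payload, two legitimately identical payloads are still kept, while a retried send of the same record is dropped. The cost is that the consumer must retain the key set (or a windowed subset of it) across restarts.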
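On the consumer side, the "commit the data and the checkpoint together in a transaction" pattern Jay describes might look like the following, using SQLite as a stand-in for the target database. Table and column names are invented for the sketch; the `INSERT OR IGNORE` on a (topic, partition, offset) primary key also demonstrates his alternative of de-duplicating on the topic/partition/offset combination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (topic TEXT, partition INTEGER, "
             "offset INTEGER, payload TEXT, "
             "PRIMARY KEY (topic, partition, offset))")
conn.execute("CREATE TABLE checkpoint (topic TEXT, partition INTEGER, "
             "next_offset INTEGER, PRIMARY KEY (topic, partition))")

def load(topic, partition, offset, payload):
    """Insert the record and advance the checkpoint atomically.

    Because both writes happen in one transaction, a crash between
    them cannot leave the output ahead of (or behind) the stored
    position. INSERT OR IGNORE makes a redelivered offset a no-op."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
                     (topic, partition, offset, payload))
        conn.execute("INSERT OR REPLACE INTO checkpoint VALUES (?, ?, ?)",
                     (topic, partition, offset + 1))

# Redeliver offset 0 to simulate a restart after a partial failure:
for off, payload in [(0, "a"), (0, "a"), (1, "b")]:
    load("clicks", 0, off, payload)

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])        # 2
print(conn.execute("SELECT next_offset FROM checkpoint").fetchone()[0])  # 2
```

On restart, the consumer would read `next_offset` from the checkpoint table and seek the Kafka consumer to that position; any messages replayed from before it are silently ignored by the primary-key constraint.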