Added this to our FAQ -
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactlyonemessagingfromKafka
?



On Mon, Feb 10, 2014 at 9:46 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> The out-of-the-box support for this in Kafka isn't great right now.
>
> Exactly-once semantics has two parts: avoiding duplication during data
> production and avoiding duplicates during data consumption.
>
> There are two approaches to getting exactly once semantics during data
> production.
>
> 1. Use a single writer per partition and, every time you get a network
> error, check the last message in that partition to see whether your last
> write succeeded.
> 2. Include a primary key (a UUID or something similar) in the message and
> deduplicate on the consumer (a rough sketch follows below).
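>
> As a minimal sketch of the second approach, this is roughly what stamping
> each message with a UUID could look like using the Java producer client.
> The topic name and connection settings are made up for illustration:
>
> import java.util.Properties;
> import java.util.UUID;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> public class DedupableProducer {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>         props.put("key.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>         props.put("value.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             // The UUID key travels with the message; a consumer that
>             // remembers recently seen keys can drop retried duplicates.
>             String messageId = UUID.randomUUID().toString();
>             producer.send(new ProducerRecord<>("events", messageId, "some event"));
>         }
>     }
> }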
>
> If you do one of these things, the log that Kafka hosts will be
> duplicate-free. However, reading without duplicates depends on some
> co-operation from the consumer too. If the consumer is periodically
> checkpointing its position, then when it fails and restarts it will
> resume from the checkpointed position. Thus, if the data output and the
> checkpoint are not written atomically, it will be possible to get
> duplicates here as well. This problem is particular to your storage
> system. For example, if you are using a database you could commit these
> together in a transaction. The HDFS loader Camus that LinkedIn wrote
> does something like this for Hadoop loads. The other alternative that
> doesn't require a transaction is to store the offset with the data
> loaded and deduplicate using the topic/partition/offset combination.
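>
> A rough sketch of that last idea: write the payload and its
> topic/partition/offset in the same database transaction, and skip
> anything whose position is already recorded. The JDBC URL, table layout
> and the PostgreSQL-style upsert are assumptions for illustration only:
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
>
> public class OffsetAwareLoader {
>     // Inserts the payload and the position it came from in one
>     // transaction, ignoring rows whose topic/partition/offset is already
>     // present (assumes a unique constraint on those three columns).
>     public static void store(String topic, int partition, long offset,
>                              String payload) throws Exception {
>         try (Connection conn =
>                  DriverManager.getConnection("jdbc:postgresql://localhost/demo")) {
>             conn.setAutoCommit(false);
>             try (PreparedStatement stmt = conn.prepareStatement(
>                     "INSERT INTO events(topic, part, msg_offset, payload) " +
>                     "VALUES (?, ?, ?, ?) " +
>                     "ON CONFLICT (topic, part, msg_offset) DO NOTHING")) {
>                 stmt.setString(1, topic);
>                 stmt.setInt(2, partition);
>                 stmt.setLong(3, offset);
>                 stmt.setString(4, payload);
>                 stmt.executeUpdate();
>             }
>             conn.commit(); // payload and position land together or not at all
>         }
>     }
> }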
>
> I think there are two improvements that would make this a lot easier:
> 1. I think producer idempotence is something that could be done
> automatically and much more cheaply by optionally integrating support
> for it on the server.
> 2. The existing high-level consumer doesn't expose much of the more
> fine-grained control of offsets (e.g. to reset your position; see the
> illustration below). We will be working on that soon.
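>
> For a feel of the kind of offset control meant in point 2, this is
> roughly how resetting position looks with a consumer API that exposes
> offsets directly (the later Java consumer is used purely as an
> illustration here; topic and group names are made up):
>
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
>
> public class OffsetRewind {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>         props.put("group.id", "reprocessor");
>         props.put("key.deserializer",
>             "org.apache.kafka.common.serialization.StringDeserializer");
>         props.put("value.deserializer",
>             "org.apache.kafka.common.serialization.StringDeserializer");
>
>         try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>             TopicPartition tp = new TopicPartition("events", 0);
>             consumer.assign(Collections.singletonList(tp));
>             consumer.seek(tp, 0L); // rewind to the beginning of the partition
>             // subsequent poll() calls re-read everything from offset 0
>         }
>     }
> }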
>
> -Jay
>
> On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington <
> g.turking...@improvedigital.com> wrote:
>
> > Hi,
> >
> > I've been doing some prototyping on Kafka for a few months now and like
> > what I see. It's a good fit for some of my use cases in the areas of
> > data distribution but also for processing - I like a lot of what I see
> > in Samza. I'm now working through some of the operational issues and
> > have a question for the community.
> >
> > I have several data sources that I want to push into Kafka, but some of
> > the most important arrive as a stream of files being dropped into either
> > an SFTP location or S3. Conceptually the data is really a stream, but
> > it's being chunked and made more batch-like by the deployment model of
> > the operational servers. So pulling the data into Kafka and seeing it as
> > a stream again is a big plus.
> >
> > But I really don't want duplicate messages. I know Kafka provides
> > at-least-once semantics and that's fine; I'm happy to have the de-dupe
> > logic external to Kafka. And if I look at my producer, I can build up a
> > protocol around adding record metadata and using Zookeeper to give me
> > pretty high confidence that my clients will know whether they are
> > reading from a file that was fully published into Kafka or not.
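> >
> > As a very rough sketch of that kind of protocol (the paths, names and
> > the completion-marker scheme are assumptions, not a settled design):
> > the producer could create a znode for each file once its last record
> > has been acknowledged, and consumers would only trust data from a file
> > whose marker exists.
> >
> > import org.apache.zookeeper.CreateMode;
> > import org.apache.zookeeper.ZooDefs;
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class FileCompletionMarker {
> >     public static void markPublished(String fileName) throws Exception {
> >         ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });
> >         try {
> >             // Created only after every record from the file has been
> >             // acknowledged by Kafka; readers check for this path before
> >             // trusting the file's data. Assumes the /published-files
> >             // parent node already exists.
> >             zk.create("/published-files/" + fileName, new byte[0],
> >                       ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >         } finally {
> >             zk.close();
> >         }
> >     }
> > }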
> >
> > I had assumed this wouldn't be a unique use case, but after a bunch of
> > searching I really don't find much in terms of either tools that help
> > or even best-practice patterns for handling this kind of exactly-once
> > message processing.
> >
> > So now I'm thinking that either I just need better web search skills,
> > or this isn't something many others are doing, and if so there's likely
> > a reason for that.
> >
> > Any thoughts?
> >
> > Thanks
> > Garry
> >
> >
>
