Hi Andrew

Did you take a look at this example?
https://github.com/confluentinc/kafka-streams-examples/blob/5.0.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
We are using it for a case like the one you described, and the state store's growth stays limited with this approach.
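
The trick is that the example keeps previously seen event IDs in a window store with a bounded retention period, so entries older than the retention are purged automatically. Here is a rough sketch of the idea (names like "dedup-store" and "input-topic" are placeholders, and it assumes the transform()/window-store API of Kafka Streams 2.1+, not the exact code of the linked test):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DeduplicationSketch {

  static final String STORE_NAME = "dedup-store";       // placeholder name
  static final Duration DEDUP_WINDOW = Duration.ofDays(1);

  // Forwards a record only if its key has not been seen within the last day.
  static class DeduplicationTransformer
      implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;
    private WindowStore<String, Long> seenKeys;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
      this.context = context;
      this.seenKeys = (WindowStore<String, Long>) context.getStateStore(STORE_NAME);
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
      final long now = context.timestamp();
      // Look for the key within the deduplication window ending at the record time.
      try (final WindowStoreIterator<Long> hits =
               seenKeys.fetch(key, now - DEDUP_WINDOW.toMillis(), now)) {
        if (hits.hasNext()) {
          return null;                   // duplicate: drop it
        }
      }
      seenKeys.put(key, now, now);       // remember the key; old windows expire with retention
      return KeyValue.pair(key, value);  // first occurrence: forward it
    }

    @Override
    public void close() {}
  }

  public static void main(final String[] args) {
    final StreamsBuilder builder = new StreamsBuilder();

    // Window store with bounded retention, so entries older than one day are purged.
    builder.addStateStore(
        Stores.windowStoreBuilder(
            Stores.persistentWindowStore(STORE_NAME, DEDUP_WINDOW, DEDUP_WINDOW, false),
            Serdes.String(),
            Serdes.Long()));

    final KStream<String, String> input =
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()));
    input.transform(DeduplicationTransformer::new, STORE_NAME)
         .to("deduplicated-output");

    // builder.build() returns the Topology to run with KafkaStreams; config omitted here.
    builder.build();
  }
}

The retention period is the knob you would set to your one-day deduplication window; anything older simply drops out of the store.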

Best regards 
Patrik

> On 02.11.2018, at 01:33, Andrew Wilcox <andrew.wil...@gmail.com> wrote:
> 
> Suppose I have a producer which is ingesting a data stream with unique keys
> from an external service and sending it to a Kafka topic.  In my producer I
> can set enable.idempotence and get exactly-once delivery in the presence of
> broker crashes.  However my producer might crash after it delivers a batch
> of messages to Kafka but before it records that the batch was delivered.
> After restarting the crashed producer it would re-deliver the same batch,
> resulting in duplicate messages in the topic.
> 
> With a streams transformer I can deduplicate the topic by using a state
> store to record previously seen keys and then only creating an output
> record if the key hasn't been seen before.  However, without a mechanism to
> remove old keys, the state store will grow without bound.
> 
> Say I only want to deduplicate over a time period such as one day.  (I'm
> confident that I'll be able to restart a crashed producer sooner than that.)  Thus
> I'd like keys older than a day to expire out of the state store, so the
> store only needs to keep track of keys seen in the last day or so.
> 
> Is there a way to do this with Kafka Streams?  Or is there another
> recommended mechanism to keep messages with unique keys free of duplicates in
> the presence of producer crashes?
> 
> Thanks!
> 
> Andrew
