Hi Wim,

One off-the-cuff idea: you may not need to delay anonymizing the data at all. Instead, you can create a separate pathway that anonymizes the data immediately. Something like this:

(raw-input topic, GDPR retention period set)
 |
 |->[stream apps that need non-anonymized data]
 |
 \->[anonymizer app]->(anonymized-input topic, arbitrary retention period)->[any apps that can handle anonymized data]

This way, you can keep your anonymized-input topic forever if you want, using log compaction. You also get very clean data lineage, so you know for sure which components are viewing non-anonymized data (because they consume the raw-input topic) versus which ones are "safe" according to GDPR (because they consume only the anonymized-input topic).

For downstream apps that are happy using anonymized input, it seems like it wouldn't matter whether the input is anonymized right away or six months delayed. And since you know clearly which components may be "dirty" because they are downstream of raw-input, you can audit those components and make sure they are either stateless or that they also have proper retention periods set.

Of course, without knowing the details, I can't say whether this actually works for you, but I wanted to share the thought.

Thanks,
-John

On Thu, Mar 8, 2018 at 11:21 PM, adrien ruffie <adriennolar...@hotmail.fr> wrote:
> Hi Wim,
>
> this topic (processing order) has been cropping up for a while; several
> articles, benchmarks, and other writing on the subject reach this
> conclusion.
>
> After that, you can ask someone else for another opinion on the subject.
>
> regards,
>
> Adrien
>
> ________________________________
> From: Wim Van Leuven <wim.vanleu...@highestpoint.biz>
> Sent: Friday, 9 March 2018 08:03:15
> To: users@kafka.apache.org
> Subject: Re: Delayed processing
>
> Hey Adrien,
>
> thank you for the elaborate explanation!
>
> We are ingesting call data records here, which due to the nature of a telco
> network might not arrive in absolute logical order.
>
> If I understand your explanation correctly, you are saying that with your
> setup, Kafka guarantees the processing in order of ingestion of the
> messages. Correct?
>
> Thanks!
> -wim
>
> On Thu, 8 Mar 2018 at 22:58 adrien ruffie <adriennolar...@hotmail.fr>
> wrote:
>
> > Hello Wim,
> >
> > it does matter (I think), because one of Kafka's big, principal
> > features is to load-balance messages and guarantee their ordering in a
> > distributed cluster.
> >
> > The order of the messages should be guaranteed, except in several cases.
> >
> > Data loss can happen when:
> >
> > 1] the producer runs with block.on.buffer.full = false, retries are
> > exhausted, and messages are sent without acks=all
> >
> > 2] unclean leader election is enabled: if an out-of-sync follower
> > becomes the new leader, messages that were not yet replicated to that
> > new leader are lost.
> >
> > Message reordering might happen when:
> >
> > 1] max.in.flight.requests.per.connection > 1 and retries are enabled
> >
> > 2] a producer is not correctly closed, e.g. without calling close().
> > The close() method ensures the accumulator is closed first, to
> > guarantee that no more appends are accepted after breaking the send
> > loop.
> >
> > If you want to avoid these cases:
> >
> > - close the producer in the error callback
> >
> > - close the producer with close(0) to prevent sending after a previous
> > message send failed
> >
> > Avoid data loss:
> >
> > - block.on.buffer.full=true
> >
> > - retries=Long.MAX_VALUE
> >
> > - acks=all
> >
> > Avoid reordering:
> >
> > - max.in.flight.requests.per.connection=1 (be aware of the latency
> > impact)
> >
> > Also note that if your producer goes down, messages still in its buffer
> > will be lost... (perhaps manage a local store if you are punctilious).
> >
> > Moreover, at least two replicas are needed at any time to guarantee data
> > persistence, e.g. replication factor = 3, min.insync.replicas = 2,
> > unclean leader election disabled.
> >
> > Also keep in mind that the consumer can lose messages when offsets are
> > not correctly committed.
> > Disable the consumer's auto-commit (enable.auto.commit=false) and
> > commit offsets only after your processing of each message is done (or
> > commit several processed messages at one time, keeping them in a local
> > memory buffer in the meantime).
> >
> > I hope these suggestions help you 😊
> >
> > Best regards,
> >
> > Adrien
> >
> > ________________________________
> > From: Wim Van Leuven <wim.vanleu...@highestpoint.biz>
> > Sent: Thursday, 8 March 2018 21:35:13
> > To: users@kafka.apache.org
> > Subject: Delayed processing
> >
> > Hello,
> >
> > I'm wondering how to design a KStreams or regular Kafka application that
> > can hold off processing of messages until a future time.
> >
> > This relates to the EU's data protection regulation: we can store raw
> > messages for a given time; afterwards we have to store the anonymised
> > message. So I was thinking about branching the stream, anonymising the
> > messages into a waiting topic, and then continuing from there once the
> > retention time passes.
> >
> > But that approach has some caveats:
> >
> > - it is not an exact solution, as the order of events is not guaranteed:
> > we might encounter a message that triggers the stop of processing while
> > some events arriving later should normally still pass
> >
> > - how to properly stop processing if we encounter a message that
> > indicates not to continue?
> >
> > - ...
> >
> > Are there better-known solutions or best practices to delay message
> > processing with Kafka Streams / consumers+producers?
> >
> > Thanks for any insights/help here!
> > -wim
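
John's suggestion hinges on an [anonymizer app] sitting between raw-input and anonymized-input. A minimal sketch of the per-record transform such an app could apply, written in Python; the field names, salt handling, and hashing scheme here are illustrative assumptions, not something specified in the thread:

```python
import hashlib
import json

# Hypothetical CDR fields considered sensitive; a real schema will differ.
SENSITIVE_FIELDS = ("caller_msisdn", "callee_msisdn")


def pseudonymize(value: str, salt: str) -> str:
    """Replace a sensitive value with a salted SHA-256 digest, so the
    anonymized-input topic never carries the raw identifier while equal
    inputs still map to equal tokens (useful as log-compaction keys)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def anonymize_record(record: dict, salt: str = "example-salt") -> dict:
    """What the anonymizer app would do per message: return a copy of the
    record with sensitive fields replaced; everything else passes through."""
    out = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in out:
            out[field] = pseudonymize(out[field], salt)
    return out


if __name__ == "__main__":
    raw = {"caller_msisdn": "+3212345678",
           "callee_msisdn": "+3287654321",
           "duration_s": 42}
    print(json.dumps(anonymize_record(raw), indent=2))
```

In a Kafka Streams app this function would be the body of a mapValues() on the stream consuming raw-input, with the result written to the anonymized-input topic; the same logic could equally run in a plain consumer/producer loop.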