Hi Wim,

One off-the-cuff idea: maybe you don't need to delay anonymizing the data
at all. Instead, you can create a separate pathway that anonymizes it
immediately. Something like this:

(raw-input topic, GDPR retention period set)
 |\->[streams apps that need non-anonymized data]
 |
 \->[anonymizer app]->(anonymized-input topic, arbitrary retention period)
     ->[any apps that can handle anonymized data]

This way, you can keep your anonymized-input topic forever if you want,
using log compaction. You also get a very clean data lineage, so you know
for sure which components are viewing non-anonymized data (because they
consume the raw-input topic) versus the ones that are "safe" according to
GDPR, since they consume only the anonymized-input topic.
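
If it helps, here is a minimal Kafka Streams sketch of what the anonymizer
app could look like. The topic names match the diagram above, but the
anonymize() logic is just a placeholder, not a real masking implementation:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class AnonymizerApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "anonymizer-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // read from the raw topic (short, GDPR-bounded retention)
        KStream<String, String> raw = builder.stream("raw-input");
        // anonymize every record and write it to the long-retention topic
        raw.mapValues(AnonymizerApp::anonymize).to("anonymized-input");

        new KafkaStreams(builder.build(), props).start();
    }

    // placeholder: apply whatever masking your records actually need
    private static String anonymize(String value) {
        return value.replaceAll("[0-9]", "*");
    }
}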

For downstream apps that are happy with anonymized input, it seems like it
wouldn't matter whether the input is anonymized right away or delayed by
six months.

And since you know clearly which components may be "dirty" because they are
downstream of raw-input, you can audit those components and make sure they
are either stateless or that they also have proper retention periods set.

Of course, without knowing the details, I can't say whether this actually
works for you, but I wanted to share the thought.

Thanks,
-John

On Thu, Mar 8, 2018 at 11:21 PM, adrien ruffie <adriennolar...@hotmail.fr>
wrote:

> Hi Wim,
> This topic (processing order) has been cropping up for a while; several
> articles, benchmarks, and other discussions on the subject have reached
> this conclusion.
>
> After that, you can ask someone else for another opinion on the
> subject.
>
> regards,
>
> Adrien
>
> ________________________________
> From: Wim Van Leuven <wim.vanleu...@highestpoint.biz>
> Sent: Friday, March 9, 2018 08:03:15
> To: users@kafka.apache.org
> Subject: Re: Delayed processing
>
> Hey Adrien,
>
> thank you for the elaborate explanation!
>
> We are ingesting call data records here, which due to the nature of a telco
> network might not arrive in absolute logical order.
>
> If I understand your explanation correctly, you are saying that with your
> setup, Kafka guarantees the processing in order of ingestion of the
> messages. Correct?
>
> Thanks!
> -wim
>
> On Thu, 8 Mar 2018 at 22:58 adrien ruffie <adriennolar...@hotmail.fr>
> wrote:
>
> > Hello Wim,
> >
> >
> > It does matter (I think), because one of the big and principal features
> > of Kafka is:
> >
> > load balancing of messages and guaranteed ordering (within a partition)
> > in a distributed cluster.
> >
> >
> > The order of the messages should be guaranteed, except in several cases:
> >
> > 1] The producer can cause data loss when block.on.buffer.full = false,
> > retries are exhausted, and messages are sent without using acks=all.
> >
> > 2] Unclean leader election is enabled: if an out-of-sync follower
> > becomes the new leader, messages that were not yet synced to the new
> > leader are lost.
> >
> >
> > Message reordering might happen when:
> >
> > 1] max.in.flight.requests.per.connection > 1 and retries are enabled
> >
> > 2] a producer is not correctly closed, e.g. without calling .close().
> >
> > The close method ensures that the accumulator is closed first, to
> > guarantee that no more appends are accepted after breaking the send
> > loop.
> >
> >
> >
> > If you want to avoid these cases:
> >
> > - close the producer in the error callback
> >
> > - close the producer with close(0) to prevent sending after a previous
> > message send failed (a sketch follows below)
> >
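> > For example, a rough sketch (the topic and record are hypothetical;
> > this uses the close(long, TimeUnit) API of the Java producer):
> >
> > import java.util.concurrent.TimeUnit;
> > import org.apache.kafka.clients.producer.KafkaProducer;
> > import org.apache.kafka.clients.producer.ProducerRecord;
> >
> > KafkaProducer<String, String> producer = new KafkaProducer<>(props);
> > producer.send(new ProducerRecord<>("my-topic", "key", "value"),
> >     (metadata, exception) -> {
> >         if (exception != null) {
> >             // close(0) does not block, so it is safe to call from the
> >             // callback thread, and it stops further sends immediately
> >             producer.close(0, TimeUnit.MILLISECONDS);
> >         }
> >     });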
> >
> > Avoid data loss:
> >
> > - block.on.buffer.full=true
> >
> > - retries=Integer.MAX_VALUE (retries is an int config)
> >
> > - acks=all
> >
> >
> > Avoid reordering:
> >
> > max.in.flight.requests.per.connection=1 (be aware of the latency impact)
> >
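> > Putting those settings together in producer code (just a sketch; the
> > broker address and serializers are example values):
> >
> > import java.util.Properties;
> >
> > Properties props = new Properties();
> > props.put("bootstrap.servers", "localhost:9092");
> > props.put("key.serializer",
> >     "org.apache.kafka.common.serialization.StringSerializer");
> > props.put("value.serializer",
> >     "org.apache.kafka.common.serialization.StringSerializer");
> > props.put("acks", "all");                              // avoid data loss
> > props.put("retries", Integer.MAX_VALUE);               // avoid data loss
> > props.put("max.in.flight.requests.per.connection", 1); // avoid reordering
> > // block.on.buffer.full applied to older producers; newer clients always
> > // block when the buffer is full, bounded by max.block.ms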
> >
> > Pay attention: if your producer goes down, messages still in its buffer
> > will be lost ... (perhaps manage local storage if you are punctilious).
> >
> > Moreover, at least two in-sync replicas are needed at any time to
> > guarantee data persistence, e.g. replication factor = 3,
> > min.insync.replicas = 2, unclean leader election disabled.
> >
> >
> > Also keep in mind that a consumer can lose messages when offsets are
> > not correctly committed. Disable enable.auto.commit and commit offsets
> > only after the work for each message is done (or commit several
> > processed messages at one time, kept in a local memory buffer); see
> > the sketch below.
> >
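> > A rough sketch of that pattern (the topic name and process() are
> > placeholders; props also needs bootstrap.servers, group.id, and
> > deserializers):
> >
> > import java.util.Collections;
> > import org.apache.kafka.clients.consumer.ConsumerRecord;
> > import org.apache.kafka.clients.consumer.ConsumerRecords;
> > import org.apache.kafka.clients.consumer.KafkaConsumer;
> >
> > props.put("enable.auto.commit", "false");
> > KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> > consumer.subscribe(Collections.singletonList("raw-input"));
> > while (true) {
> >     ConsumerRecords<String, String> records = consumer.poll(1000);
> >     for (ConsumerRecord<String, String> record : records) {
> >         process(record); // your work for each message
> >     }
> >     consumer.commitSync(); // commit only after the work is done
> > }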
> >
> > I hope these suggestions help you 😊
> >
> >
> > Best regards,
> >
> > Adrien
> >
> > ________________________________
> > From: Wim Van Leuven <wim.vanleu...@highestpoint.biz>
> > Sent: Thursday, March 8, 2018 21:35:13
> > To: users@kafka.apache.org
> > Subject: Delayed processing
> >
> > Hello,
> >
> > I'm wondering how to design a KStreams or regular Kafka application that
> > can hold off processing messages until a future time.
> >
> > This relates to the EU's data protection regulation: we can store raw
> > messages for a given time; afterwards we have to store the anonymised
> > message. So, I was thinking about branching the stream, anonymising
> > the messages into a waiting topic, and then continuing from there once
> > the retention time passes.
> >
> > But that approach has some caveats:
> >
> >    - This is not an exact solution, as the order of events is not
> >    guaranteed: we might encounter a message that triggers the stop of
> >    processing while some events arriving later should normally still pass
> >    - how to properly stop processing if we encounter a message that
> >    indicates not to continue?
> >    - ...
> >
> > Are there better-known solutions or best practices to delay message
> > processing with Kafka streams / consumers+producers?
> >
> > Thanks for any insights/help here!
> > -wim
> >
>
