Hi All

The use-case is pretty simple. Let's say we have a history of events with
the following shape:
key=userId, value=(timestamp, productId)

and we want to remap it to:
key=productId, value=(original_timestamp, userId)
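For concreteness, here's a throwaway sketch of the remap in Java
(UserEvent, ProductEvent, and Remapper are names I made up just for
illustration, not from any library):

    import java.util.Map;

    // Hypothetical record types, purely to illustrate the remap.
    record UserEvent(long timestamp, String productId) {}
    record ProductEvent(long originalTimestamp, String userId) {}

    class Remapper {
        // key=userId, value=(timestamp, productId)
        //   -> key=productId, value=(original_timestamp, userId)
        static Map.Entry<String, ProductEvent> remap(String userId, UserEvent e) {
            return Map.entry(e.productId(), new ProductEvent(e.timestamp(), userId));
        }
    }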

Now, say I have 30 days of backlog and 2 input topics containing the
original events. I spin up 2 instances and let them process the data from
the beginning of time, but one instance is only half as powerful (less CPU,
memory, etc.), such that instance 0 processes X events/sec while instance 1
processes X/2 events/sec.

My question is: how do I keep these two consumer instances in sync
*according to their timestamps* (not offsets) as they consume from these
topics? I don't want to see events with original_timestamps out of order by
more than, say, 60 seconds. I am looking for any existing tech that would
effectively say *"cap the time difference of events coming out of each
processor at 60 seconds max"*: if one processor is too far ahead of the
other, stop and wait for it to catch up.
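To illustrate, here is a rough in-process sketch of the behaviour I'm
after. SkewGate and awaitPeers are names I made up, and the AtomicLongArray
is a stand-in for what would really have to be shared state across
instances (e.g. a coordination topic or ZooKeeper):

    import java.util.concurrent.atomic.AtomicLongArray;

    // Minimal sketch of "cap the skew at 60s": each instance reports the
    // event-time watermark of the event it is about to emit, then blocks
    // while it would run more than MAX_SKEW_MS ahead of the slowest peer.
    class SkewGate {
        private static final long MAX_SKEW_MS = 60_000;
        private final AtomicLongArray watermarks; // last event time seen per instance

        SkewGate(int instances) {
            watermarks = new AtomicLongArray(instances);
        }

        // Called by instance `me` before emitting an event with time `eventTs`.
        void awaitPeers(int me, long eventTs) throws InterruptedException {
            watermarks.set(me, eventTs);
            while (eventTs - minOther(me) > MAX_SKEW_MS) {
                Thread.sleep(100); // back off and re-check the slowest peer
            }
        }

        private long minOther(int me) {
            long min = Long.MAX_VALUE;
            for (int i = 0; i < watermarks.length(); i++) {
                if (i != me) min = Math.min(min, watermarks.get(i));
            }
            return min;
        }
    }

In other words, each instance advances its own watermark and pauses
whenever it would get more than 60s of event time ahead of the slowest
peer. That's the coordination I'm hoping already exists somewhere.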

To me it seems like the shared consumer may *implicitly* offer something
close, but I am concerned that one input topic would be consumed and
processed much too far ahead of the other, such that the arbitrary 60s
window is not observed. I have been digging through the code, and from my
understanding there is no guarantee of offset coordination between topics;
consumption is managed on a per-subscription basis.

Anyway, if anyone has any information that could help me solve this sort of
issue I would be very grateful. I'm coming from the Kafka world, so I
apologize if my terminology isn't quite on point.

Thanks
Adam
