Hi, Chen,
At-least-once semantics are guaranteed by committing the offsets as the last
step of the commit. Hence, flushing to local disk, the changelog, and the
output topics must all succeed before the offsets are committed to the
checkpoint. If anything fails in between, the offsets will not be committed,
so on restart the task resumes from the last checkpoint and reprocesses.
Thanks Yi.
Another follow-up question: if the 'checkpoint' and the 'changelog' are
managed separately (per your explanation above), how do we still get
at-least-once processing?
For example, what if the task commits offsets to the checkpoint topic but the
producer has not yet sent all the data in its buffer to the changelog and
output topics?
Hi, Chen,
So, is your goal to improve the throughput to the changelog topic, or to
reduce the size of the changelog topic? If you are targeting the latter and
your KV-store truly is the size of the input log, I don't see how that is
possible. In a lot of use cases, users will only need to retain the *latest*
value for each key.
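If only the latest value per key is needed, log compaction on the changelog
topic bounds its size by the number of distinct keys rather than by the input
volume. A configuration sketch, with made-up store and topic names:

{code}
# Samza store backed by a changelog topic (names are illustrative).
stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.my-store.changelog=kafka.my-store-changelog

# On the Kafka side the changelog topic should be compacted, e.g. created
# with cleanup.policy=compact, so only the latest value per key is retained.
{code}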
Thanks Yan.
Very good explanation on 1).
For 2), I understand that users can tune the size of the batch for the Kafka
producer. However, that doesn't change the number of messages sent to the
changelog topic. In our case, we process a high-volume log (1.5MM
records/second) and update the KV store for each record.
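For reference, the producer batching mentioned above is tuned through the
Kafka producer settings that Samza passes through under
systems.<system>.producer.* (the system name below is assumed). As noted,
these reduce the number of produce requests and bytes on the wire, not the
number of records written to the changelog:

{code}
# Kafka producer batching, passed through by Samza (system name illustrative).
systems.kafka.producer.batch.size=262144
systems.kafka.producer.linger.ms=10
systems.kafka.producer.compression.type=lz4
{code}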
Hi Chen Song,
There are two different concepts: *checkpoint* and *changelog*. The checkpoint
records the offsets of the input messages, while the changelog backs up the
kv-store. The code snippet you show is for the checkpoint, not for the
changelog.
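As a concrete illustration of that separation, the two are configured
independently; a sketch with illustrative store and topic names:

{code}
# Checkpoint: where input offsets are periodically committed.
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.commit.ms=60000

# Changelog: where KV-store writes are replicated for state restore.
stores.my-store.changelog=kafka.my-store-changelog
{code}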
{quote}
1. When implementing our Samza task, does each call
We are trying to understand the order of commits when processing each
message in a Samza job.
T1: input offset commit
T2: changelog commit
T3: output commit
By looking at the code snippet in
https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/TaskInsta