Hi all,
I have a question about how to do correct caching in a KTable-like
structure on application startup. I'm not sure whether this belongs on
the user or the dev mailing list, so sorry if I've picked the wrong
one. What I have observed so far:
- if I don't send any data to a Kafka partition for longer than the
data retention interval, then all data in the partition is wiped out
- the index file is not cleared (which is obvious, it has to keep
track of the next offset to assign to a new message); a small check
illustrating this is sketched below
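
To illustrate, something along these lines (just a minimal sketch with
the plain Java consumer; the broker address and topic name are
placeholders) shows the effect - after retention has expired everything,
the log start offset has caught up to the end offset, so the messages
are gone but the next offset to assign is remembered:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetsAfterRetention {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);  // placeholder topic
                long start = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
                long end = consumer.endOffsets(Collections.singleton(tp)).get(tp);
                // once retention has wiped the whole partition, start == end (> 0):
                // no data is left, but the broker still remembers the next offset
                System.out.println("log start=" + start + ", log end=" + end);
            }
        }
    }
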
In my scenario, on startup I want to read all data from a topic (or a
subset of its partitions), wait until all the old data has been cached,
and only then start processing a different stream (basically I'm doing
a join of a KStream and a KTable, but I have implemented it manually
due to some special behavior).

Now, here is the issue: when a specific partition doesn't receive any
message within the retention period, I end up stuck trying to prefetch
data into the "KTable". The broker tells me the offset of the last
message (plus 1), but I never receive any data (until I send a new
message to the partition). The problem I see is that Kafka tells me
what the last offset in a partition is, but there is no upper bound on
when the first message will arrive, even though I reset the offset and
start reading from the beginning of the partition.

My question is: would it be possible not to clear the whole partition,
but to always keep at least the last message? That way, the client
would always receive at least that message, could figure out that it
has reached the end of the partition (i.e. has read all the old data),
and could start processing. I believe the KTable implementation could
hit a very similar issue. Or is there another way around this? I could
add a timeout, but that seems a little fragile.
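
To make the stall concrete, here is a minimal sketch of the kind of
prefetch loop I mean (plain Java consumer, reasonably recent client;
broker address and topic name are placeholders, and my real code is
more involved): read the end offset, seek to the beginning, and treat
the partition as caught up once a polled record reaches that offset.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class Prefetch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.put("enable.auto.commit", "false");          // no group/commits needed here
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("cache-topic", 0);  // placeholder topic
                consumer.assign(Collections.singleton(tp));
                consumer.seekToBeginning(Collections.singleton(tp));

                long endOffset = consumer.endOffsets(Collections.singleton(tp)).get(tp);
                Map<String, String> cache = new HashMap<>();

                long lastSeen = -1L;
                // prefetch: cache everything up to the end offset reported by the broker
                while (lastSeen < endOffset - 1) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        cache.put(r.key(), r.value());
                        lastSeen = r.offset();
                    }
                    // if retention wiped the partition, endOffset is still > 0 but no
                    // record ever arrives, so this loop never terminates - the stall
                }
                // ...only here would I start consuming and joining the other stream...
            }
        }
    }

If the partition kept at least its last message, the loop above would
always receive that record and could terminate deterministically.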
Thanks in advance for any suggestions and opinions,
Jan