Hi all,

I have a question about how to do caching correctly in a KTable-like structure on application startup. I'm not sure whether this belongs on the user or the dev mailing list, so sorry if I've chosen the wrong one. My observations so far:

- if I don't send any data to a Kafka partition for a period longer than the data retention interval, all data in the partition is wiped out

- the index file is not cleared (which is obvious, since it has to keep track of the next offset to assign to a new message); the sketch after this list shows how that looks from a consumer

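To illustrate the second point, this is roughly how I see it with the plain Java consumer (the topic name, broker address and group id are just placeholders, and this assumes a consumer version that has beginningOffsets/endOffsets):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RetentionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "retention-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // "my-cache-topic" is a placeholder for the topic I cache from
            TopicPartition tp = new TopicPartition("my-cache-topic", 0);
            Map<TopicPartition, Long> begin = consumer.beginningOffsets(Collections.singletonList(tp));
            Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singletonList(tp));
            // After retention has removed all segments, beginning == end:
            // the broker still reports the next offset to assign, but holds no data.
            System.out.printf("partition %s: beginning=%d, end=%d%n",
                    tp, begin.get(tp), end.get(tp));
        }
    }
}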
In my scenario, on startup I want to read all data from a topic (or a subset of its partitions), wait until all the old data has been cached, and only then start processing a different stream (basically I'm doing a join of a KStream and a KTable, but I have implemented it manually because of some special behavior I need).

Now, the issue: when a specific partition doesn't receive any message within the retention period, I end up stuck trying to prefetch data into the "KTable". I get the offset of the last message (plus one) from the broker, but I never receive any data (until I send a new message to the partition). The problem as I see it is that Kafka tells me what the last offset in a partition is, but there is no upper bound on when the first message will arrive, even though I reset the offset and start reading from the beginning of the partition.

My question is: would it be possible not to clear the whole partition, but to always keep at least the last message? That way the client would always get at least that message, could figure out that it has reached the end of the partition (i.e. finished reading the old data), and could start processing. I believe the KTable implementation could have a very similar issue. Or is there another way around this? I could add a timeout, but that seems a little fragile.
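
For reference, this is roughly what my manual prefetch does (simplified sketch; the topic name, broker address and group id are placeholders, and I'm using the plain Java consumer here, not Streams):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CachePrefetch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cache-prefetch");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-cache-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));

            // The broker reports the offset of the last message plus one.
            long endOffset = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            // Replay everything into the local cache until we have seen the
            // message just before endOffset.
            long lastSeen = -1L;
            while (lastSeen < endOffset - 1) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // cache.put(record.key(), record.value());  // fill the KTable-like cache
                    lastSeen = record.offset();
                }
                // If retention has already deleted everything, no record with
                // offset endOffset - 1 will ever arrive and this loop never exits.
            }
            // ... only now start processing the other stream (the "join")
        }
    }
}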

Thanks in advance for any suggestions and opinions,

 Jan
