Thanks, Matthias, for clarifying the changelog mechanics. So compacting the changelog is not a kind of "compression" but actually discards information that is not needed to reconstruct the local store? Great. And Kafka brokers are already knowledgeable enough to make this happen? How does this work? Does the scheme use message keys to decide what to keep (only the last message per key is kept and the others are discarded)?

Regards, Peter

On 12/28/18 4:13 PM, Matthias J. Sax wrote:
Is it really necessary to keep the whole log topic of a particular local
store? Such a log will grow indefinitely. Replaying such a log will take
more and more time. Does Kafka Streams write just the changelog to such a
topic, or does it eventually write a snapshot of the current store too?
Yes, it is necessary to keep the log; however, it won't grow
indefinitely because of log compaction. Also, replaying the log is bounded
by the number of keys in the log if it is compacted. Because Kafka Streams
uses change-logging as its fault-tolerance mechanism, it is not required to
snapshot the local stores additionally --- basically, the compacted
changelog is the same thing as a snapshot.
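To make the "compacted changelog = snapshot" point concrete, here is a minimal, self-contained sketch (plain Java, no Kafka dependencies; the `Record`/`CompactionSketch` names and the sample data are invented for illustration). It shows that replaying a compacted changelog — which keeps only the latest record per key, with null values acting as tombstones — rebuilds exactly the same store as replaying the full history:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {
    // A changelog record: key -> value; a null value is a "tombstone" (delete).
    record Record(String key, String value) {}

    // Replaying any changelog (full or compacted) rebuilds the local store.
    static Map<String, String> replay(List<Record> changelog) {
        Map<String, String> store = new LinkedHashMap<>();
        for (Record r : changelog) {
            if (r.value() == null) store.remove(r.key());
            else store.put(r.key(), r.value());
        }
        return store;
    }

    // Compaction keeps only the last record per key; keys whose last
    // record is a tombstone are eventually dropped entirely.
    static List<Record> compact(List<Record> changelog) {
        Map<String, Record> last = new LinkedHashMap<>();
        for (Record r : changelog) last.put(r.key(), r);
        return last.values().stream().filter(r -> r.value() != null).toList();
    }

    public static void main(String[] args) {
        List<Record> full = List.of(
            new Record("a", "1"), new Record("b", "1"),
            new Record("a", "2"),           // overwrites a=1
            new Record("b", null),          // deletes b
            new Record("c", "1"));
        List<Record> compacted = compact(full);
        // Same state either way; the compacted log is bounded by the
        // number of live keys, not by the number of updates ever written.
        System.out.println(replay(full).equals(replay(compacted))); // true
        System.out.println(compacted.size());                       // 2 (keys a and c)
    }
}
```

This is why replay time is bounded by the number of keys rather than by the age of the topic.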

Does this make sense?

If I configure 'num.standby.replicas' with a number > 0, will such
replicas keep their own synchronized local store on disk? In that case,
losing an active stream processor will fall back to a standby processor,
and the log will only be replayed from the offset that has not yet been
applied to the local store of the standby processor,
Yes, that is the idea of standby replicas.
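For reference, enabling standbys is a one-line configuration change; `num.standby.replicas` is the actual Kafka Streams property name, while the other values below are placeholder examples:

```properties
# Kafka Streams configuration (property names are real; values are examples)
application.id=my-streams-app
bootstrap.servers=localhost:9092
# Keep one warm replica of each state store on another instance; on
# failover only the not-yet-applied changelog tail must be replayed.
num.standby.replicas=1
```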

meaning that
we don't need to keep the whole log. But we need to be sure not to lose
all local stores of all standby and active processor(s) then.
No. Standby replicas are independent of the changelog we need to keep
(cf. above).

In case a
standby processor that has been chosen to be promoted to the active role
no longer has a local store, the whole log will be needed again,
right?
Yes, the whole changelog will be replayed -- note again that the size
of the changelog is proportional to the size of the state, due to log
compaction.

Suppose that the store that is needed is a WindowStore, which only keeps
data for a limited number of past windows. Would it be possible to
reconstruct the "usable" part of such a store from the limited number of
past log records, so that the full log would not be necessary?
It works the same as for non-windowed stores. The changelog keeps the
latest update for each window. Older windows that pass the retention
time are no longer maintained and are deleted. However, the
compacted changelog topic keeps the latest update per window, and thus
the full state can be recreated without any data loss.
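One way to picture this: the changelog key of a windowed store can be thought of as the pair (record key, window start), so compaction keeps the latest update per window while retention drops windows older than the cutoff. A small self-contained sketch (plain Java, no Kafka dependencies; the `WindowedKey`/`restore` names and all numbers are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WindowedChangelogSketch {
    // Changelog key for a windowed store: record key plus window start time.
    record WindowedKey(String key, long windowStartMs) {}

    // Restore the store from a compacted changelog: keep every (key, window)
    // entry whose window start is still within retention, drop the rest.
    static Map<WindowedKey, Long> restore(Map<WindowedKey, Long> changelog,
                                          long cutoffMs) {
        Map<WindowedKey, Long> store = new LinkedHashMap<>();
        changelog.forEach((wk, v) -> {
            if (wk.windowStartMs() >= cutoffMs) store.put(wk, v);
        });
        return store;
    }

    public static void main(String[] args) {
        long windowSizeMs = 60_000;
        long retentionMs = 3 * windowSizeMs;   // keep the last 3 windows
        long now = 10 * windowSizeMs;

        // Latest value per (key, window): what a compacted changelog retains.
        Map<WindowedKey, Long> changelog = new LinkedHashMap<>();
        changelog.put(new WindowedKey("a", 5 * windowSizeMs), 3L);  // expired window
        changelog.put(new WindowedKey("a", 9 * windowSizeMs), 7L);  // within retention
        changelog.put(new WindowedKey("b", 9 * windowSizeMs), 1L);  // within retention

        Map<WindowedKey, Long> store = restore(changelog, now - retentionMs);
        // The expired window is dropped; everything still within retention
        // is recovered without data loss.
        System.out.println(store.size()); // 2
    }
}
```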


-Matthias

On 12/28/18 1:42 PM, Peter Levart wrote:
Hi Matthias,

Just a couple of questions about that...

On 12/27/18 9:57 PM, Matthias J. Sax wrote:
All data is backed up in the Kafka cluster. Data that is stored locally is
basically a cache, and Kafka Streams will recreate the local data if you
lose it.

Thus, I am not sure how the KTable data could be stale. One possibility
might be a misconfiguration: I assume that you read the topic directly
as a table (i.e., builder.table("topic")). If you do this, the input
topic must be configured with log compaction --- if it is configured
with retention, you might lose data from the input topic, and if you
also lose the local cache, Kafka Streams cannot recreate the local
state because it was deleted from the topic (log compaction guards
the input topic against data loss).
Is it really necessary to keep the whole log topic of a particular local
store? Such a log will grow indefinitely. Replaying such a log will take
more and more time. Does Kafka Streams write just the changelog to such a
topic, or does it eventually write a snapshot of the current store too?

If I configure 'num.standby.replicas' with a number > 0, will such
replicas keep their own synchronized local store on disk? In that case,
losing an active stream processor will fall back to a standby processor,
and the log will only be replayed from the offset that has not yet been
applied to the local store of the standby processor, meaning that
we don't need to keep the whole log. But we need to be sure not to lose
all local stores of all standby and active processor(s) then. In case a
standby processor that has been chosen to be promoted to the active role
no longer has a local store, the whole log will be needed again,
right?

Suppose that the store that is needed is a WindowStore, which only keeps
data for a limited number of past windows. Would it be possible to
reconstruct the "usable" part of such a store from the limited number of
past log records, so that the full log would not be necessary?

Regards, Peter


-Matthias


On 12/24/18 12:22 PM, Edmondo Porcu wrote:
Hello Kafka users,

we are running a Kafka Streams application as fully stateless, meaning
that we are not persisting /tmp/kafka-streams on a durable volume but are
rather losing it at each restart. This application performs a
KTable-KTable join of data coming from Kafka Connect, and sometimes we
want to force the output to tick, so we update records in the right table
from the database, but we see that the left table is "stale".

Is it possible that, because of reboots, the application loses some
messages? How is the state reconstructed when /tmp/kafka-streams is not
available? Is the state saved in an intermediate topic?

Thanks,
Edmondo
