Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

Bruno Cadonna Mon, 06 May 2024 06:07:00 -0700

Hi Matthias,

I see what you mean.


To sum up:

With this KIP the .checkpoint file is written when the store closes. That is when:

1. a task moves away from Kafka Streams client
2. Kafka Streams client shuts down

A Kafka Streams client needs the information in the .checkpoint file
1. on startup because it does not have any open stores yet.

2. during rebalances for non-empty state directories of tasks that are not assigned to the Kafka Streams client.

With hard crashes, i.e., when the Streams client is not able to close its state stores and write the .checkpoint file, the .checkpoint file might be quite stale. That influences the next rebalance after failover negatively.

My conclusion is that Kafka Streams either needs to open the state stores at start up or we write the checkpoint file more often.

Writing the .checkpoint file during processing more often without controlling the flush to disk would work. However, Kafka Streams would checkpoint offsets that are not yet persisted on disk by the state store. That is, with a hard crash the offsets in the .checkpoint file might be larger than the offsets checkpointed in the state store. That might not be a problem if Kafka Streams uses the .checkpoint file only to compute the task lag. The downside is that it makes the managing of checkpoints more complex because now we have to maintain two checkpoints: one for restoration and one for computing the task lag.

I think we should explore the option where Kafka Streams opens the state stores at start up to get the offsets.

I also checked when Kafka Streams needs the checkpointed offsets to compute the task lag during a rebalance. Turns out Kafka Streams needs them before sending the join request. Now, I am wondering if opening the state stores of unassigned tasks whose state directory exists locally is actually such a big issue due to the expected higher latency since it happens actually before the Kafka Streams client joins the rebalance.
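
To make the effect of a stale .checkpoint file on the task lag concrete, here is a minimal sketch. It assumes the lag of a task is the distance between the changelog end offsets and the checkpointed offsets, summed over the task's changelog partitions. The class TaskLagSketch and its methods are hypothetical and not part of Kafka Streams; partitions are plain strings so the sketch compiles without the Kafka client library.

```java
import java.util.Map;

// Hypothetical helper, not Kafka Streams code.
final class TaskLagSketch {

    // Sum of the checkpointed changelog offsets of one task.
    static long offsetSum(final Map<String, Long> checkpointedOffsets) {
        long sum = 0L;
        for (final long offset : checkpointedOffsets.values()) {
            sum += offset;
        }
        return sum;
    }

    // Lag of a task: how far the checkpointed offsets trail the changelog end offsets.
    // A changelog partition without a checkpoint is counted from offset 0 (assumption).
    static long lag(final Map<String, Long> endOffsets,
                    final Map<String, Long> checkpointedOffsets) {
        long lag = 0L;
        for (final Map.Entry<String, Long> end : endOffsets.entrySet()) {
            final long checkpointed = checkpointedOffsets.getOrDefault(end.getKey(), 0L);
            lag += Math.max(0L, end.getValue() - checkpointed);
        }
        return lag;
    }
}
```

If the store on disk has actually caught up to offset 95 of 100 but the .checkpoint file still says 40 because of a hard crash, the reported lag is 60 instead of 5, and the client looks like a much worse candidate for the task than it really is.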


Best,
Bruno







On 5/4/24 12:05 AM, Matthias J. Sax wrote:

Those are good questions... I could think of a few approaches, but I admit it might all be a little bit tricky to code up...
However, if we don't solve this problem, I think this KIP does not really solve the core issue we are facing? In the end, if we rely on the `.checkpoint` file to compute a task assignment, but the `.checkpoint` file can be arbitrarily stale after a crash because we only write it on a clean close, there would still be a huge gap that this KIP does not close?
For the case in which we keep the checkpoint file, this KIP would still help for "soft errors" in which KS can recover, and roll back the store. A significant win for sure. -- But hard crashes would still be a problem? We might assign tasks to the "wrong" instance, i.e., one that is not the most up to date, as the checkpoint information could be very outdated? Would we end up with a half-baked solution? Would this be good enough to justify the introduced complexity? In the end, for soft failures it's still a win. Just want to make sure we understand the limitations and make an educated decision.
Or do I miss something?


-Matthias

On 5/3/24 10:20 AM, Bruno Cadonna wrote:
Hi Matthias,


200:
I like the idea in general. However, it is not clear to me how the behavior should be with multiple stream threads in the same Kafka Streams client. What stream thread opens which store? How can a stream thread pass an open store to another stream thread that got the corresponding task assigned? How does a stream thread know that a task was not assigned to any of the stream threads of the Kafka Streams client? I have the feeling we should just keep the .checkpoint file on close for now to unblock this KIP and try to find a solution to get totally rid of it later.
Best,
Bruno



On 5/3/24 6:29 PM, Matthias J. Sax wrote:
101: Yes, but what I am saying is that we don't need to flush the .position file to disk periodically, but only maintain it in main memory, and only write it to disk on close() to preserve it across restarts. This way, it would never be ahead, but might only lag? But with my better understanding about (102) it might be moot anyway...
102: Thanks for clarifying. Looked into the code now. Makes sense. Might be something worth calling out explicitly in the KIP writeup. -- Now that I realize that the position is tracked inside the store (not outside as the changelog offsets) it makes much more sense to pull position into RocksDB itself. In the end, it's actually a "store implementation" detail how it tracks the position (and kinda leaky abstraction currently, that we re-use the checkpoint file mechanism to track it and flush to disk).
200: I was thinking about this a little bit more, and maybe it's not too bad? When KS starts up, we could open all stores we find on local disk pro-actively, and keep them all open until the first rebalance finishes: For tasks we get assigned, we hand in the already opened store (this would amortize the cost to open the store before the rebalance) and for non-assigned tasks, we know the offset information won't change and we could just cache it in-memory for later reuse (ie, next rebalance) and close the store to free up resources? -- Assuming that we would get a large percentage of opened stores assigned as tasks anyway, this could work?
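
A minimal, single-threaded sketch of the idea in 200, under stated assumptions: OpenedStore and StartupStoreRegistry are hypothetical names, task IDs are plain strings, and each task is assumed to have a single committed offset. It deliberately ignores how multiple stream threads would share such a registry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical view of a store that was opened at startup; not a Kafka Streams type.
interface OpenedStore {
    long committedOffset();
    void close();
}

// Hypothetical registry: opens happen at startup, hand-over and cleanup after the rebalance.
final class StartupStoreRegistry {
    private final Map<String, OpenedStore> openStores = new HashMap<>();
    private final Map<String, Long> cachedOffsets = new HashMap<>();

    // Called once at startup for every task directory found on local disk.
    void register(final String taskId, final OpenedStore store) {
        openStores.put(taskId, store);
    }

    // Offset used for the lag computation; null if the task is unknown locally.
    Long offsetFor(final String taskId) {
        final OpenedStore store = openStores.get(taskId);
        if (store != null) {
            return store.committedOffset();
        }
        return cachedOffsets.get(taskId);
    }

    // Hands the already opened store to the task that got assigned; null if there is none.
    OpenedStore claim(final String taskId) {
        return openStores.remove(taskId);
    }

    // After the first rebalance: cache the offsets of unclaimed stores, then close them.
    void closeUnclaimed() {
        for (final Map.Entry<String, OpenedStore> entry : openStores.entrySet()) {
            cachedOffsets.put(entry.getKey(), entry.getValue().committedOffset());
            entry.getValue().close();
        }
        openStores.clear();
    }
}
```

The cached offsets stay valid only as long as nothing else writes to the closed store's directory, which matches the assumption above that the offset information of non-assigned tasks won't change.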
-Matthias

On 5/3/24 1:29 AM, Bruno Cadonna wrote:
Hi Matthias,


101:
Let's assume a RocksDB store, but I think the following might be true also for other store implementations. With this KIP, if Kafka Streams commits the offsets, the committed offsets will be stored in an in-memory data structure (i.e. the memtable) and stay there until RocksDB decides that it is time to persist its in-memory data structure. If Kafka Streams writes its position to the .position file during a commit and a crash happens before RocksDB persists the memtable then the position in the .position file is ahead of the persisted offset. If IQ is done between the crash and the state store fully restoring from the changelog, the position might tell IQ that the state store is more up-to-date than it actually is.
In contrast, if Kafka Streams handles persisting positions the same as persisting offsets, the position should always be consistent with the offset, because they are persisted together.
102:
I am confused about your confusion which tells me that we are talking about two different things.
You asked
"Do you intend to add this information [i.e. position] to the map passed via commit(final Map<TopicPartition, Long> changelogOffsets)?"
and with what I wrote I meant that we do not need to pass the position into the implementation of the StateStore interface since the position is updated within the implementation of the StateStore interface (e.g. RocksDBStore [1]). My statement describes the behavior now, not the change proposed in this KIP, so it does not contradict what is stated in the KIP.
200:
This is about Matthias' main concern about rebalance metadata.
As far as I understand the KIP, Kafka Streams will only use the .checkpoint files to compute the task lag for unassigned tasks whose state is locally available. For assigned tasks, it will use the offsets managed by the open state store.
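
The crash scenario from 101 can be played through with a toy model. This is only a sketch: PositionConsistencySketch is hypothetical, it is not RocksDB, and it reduces the memtable to two in-memory fields that are lost on a crash unless flush() ran first.

```java
// Toy model of one store. Values written by commit() live only in memory until
// flush(); crash() throws away everything that was not flushed.
final class PositionConsistencySketch {
    long memOffset;     // changelog offset, in memory only
    long memPosition;   // position kept inside the store, in memory only
    long diskOffset;    // changelog offset that survived the last flush
    long diskPosition;  // in-store position that survived the last flush
    long positionFile;  // position written directly to a separate file on commit

    // Commit: offset and in-store position go to memory, the file is written right away.
    void commit(final long offset, final long position) {
        memOffset = offset;
        memPosition = position;
        positionFile = position;
    }

    // The store decides to persist its in-memory data.
    void flush() {
        diskOffset = memOffset;
        diskPosition = memPosition;
    }

    // Hard crash: in-memory data is lost, files on disk stay as they are.
    void crash() {
        memOffset = diskOffset;
        memPosition = diskPosition;
    }
}
```

After commit(10, 7), flush(), commit(20, 15) and crash(), the store is back at offset 10 with in-store position 7, while the separately written file still says 15. The file is ahead of the data that actually survived, whereas the position persisted together with the offset stays consistent.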
Best,
Bruno
[1] https://github.com/apache/kafka/blob/fcbfd3412eb746a0c81374eb55ad0f73de6b1e71/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java#L397
On 5/1/24 3:00 AM, Matthias J. Sax wrote:
Thanks Bruno.
101: I think I understand this better now. But just want to make sure I do. What do you mean by "they can diverge" and "Recovering after a failure might load inconsistent offsets and positions."
The checkpoint is the offset from the changelog, while the position is the offset from the upstream source topic, right? -- In the end, the position is about IQ, and if we fail to update it, it only means that there is some gap when we might not be able to query a standby task, because we think it's not up-to-date enough even if it is, which would resolve itself soon? Ie, the position might "lag", but it's not "inconsistent". Do we believe that this lag would be highly problematic?
102: I am confused.
The position is maintained inside the state store, but is persisted in the .position file when the state store closes.
This contradicts the KIP:
these position offsets will be stored in RocksDB, in the same column family as the changelog offsets, instead of the .position file
My main concern is currently about rebalance metadata -- opening RocksDB stores seems to be very expensive, but if we follow the KIP:
We will do this under EOS by updating the .checkpoint file whenever a store is close()d.
It seems, having the offset inside RocksDB does not help us at all? In the end, when we crash, we don't want to lose the state, but when we update the .checkpoint only on a clean close, the .checkpoint might be stale (ie, still contains the checkpoint when we opened the store when we got a task assigned).
-Matthias

On 4/30/24 2:40 AM, Bruno Cadonna wrote:
Hi all,

100
I think we already have such a wrapper. It is called AbstractReadWriteDecorator.
101
Currently, the position is checkpointed when an offset checkpoint is written. If we let the state store manage the committed offsets, we also need to let the state store manage the position, otherwise they might diverge. State store managed offsets can get flushed (i.e. checkpointed) to the disk when the state store decides to flush its in-memory data structures, but the position is only checkpointed at commit time. Recovering after a failure might load inconsistent offsets and positions.
102
The position is maintained inside the state store, but is persisted in the .position file when the state store closes. The only public interface that uses the position is IQv2 in a read-only mode. So the position is only updated within the state store and read from IQv2. No need to add anything to the public StateStore interface.
103
Deprecating managesOffsets() right away might be a good idea.


104
I agree that we should try to support downgrades without wipes. At least Nick should state in the KIP why we do not support it.
Best,
Bruno




On 4/23/24 8:13 AM, Matthias J. Sax wrote:
Thanks for splitting out this KIP. The discussion shows that it is a complex beast by itself, so it is worth discussing on its own.
A couple of questions / comments:
100 `StateStore#commit()`: The JavaDoc says "must not be called by users" -- I would propose to put a guard in place for this, by either throwing an exception (preferable) or adding a no-op implementation (at least for our own stores, by wrapping them -- we cannot enforce it for custom stores I assume), and document this contract explicitly.
101 adding `.position` to the store: Why do we actually need this? The KIP says "To ensure consistency with the committed data and changelog offsets" but I am not sure if I can follow? Can you elaborate why leaving the `.position` file as-is won't work?
If it's possible at all, it will need to be done by creating temporary StateManagers and StateStores during rebalance. I think it is possible, and probably not too expensive, but the devil will be in the detail.
This sounds like a significant overhead to me. We know that opening a single RocksDB takes about 500ms, and thus opening RocksDB to get this information might slow down rebalances significantly.
102: It's unclear to me how `.position` information is added. The KIP only says: "position offsets will be stored in RocksDB, in the same column family as the changelog offsets". Do you intend to add this information to the map passed via `commit(final Map<TopicPartition, Long> changelogOffsets)`? The KIP should describe this in more detail. Also, if my assumption is correct, we might want to rename the parameter and also have a better JavaDoc description?
103: Should we make it mandatory (long-term) that all stores (including custom stores) manage their offsets internally? Maintaining both options and thus both code paths puts a burden on everyone and makes the code messy. I would strongly prefer if we could have a mid-term path to get rid of supporting both. -- For this case, we should deprecate the newly added `managesOffsets()` method right away, to point out that we intend to remove it. If it's mandatory to maintain offsets for stores, we won't need this method any longer. In memory stores can just return null from #committedOffset().
104 "downgrading": I think it might be worth adding support for downgrading w/o the need to wipe stores? Leveraging the `upgrade.from` parameter, we could build a two rolling bounce downgrade: (1) the new code is started with `upgrade.from` set to a lower version, telling the runtime to do the cleanup on `close()` -- (ie, ensure that all data is written into `.checkpoint` and `.position` file, and the newly added CL is deleted). In a second rolling bounce, the old code would be able to open RocksDB. -- I understand that this implies much more work, but downgrade seems to be common enough that it might be worth it? Even if we did not always support this in the past, we have to face the fact that KS is getting more and more adopted and as a more mature product should support this?
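
The guard proposed in point 100 could be sketched as a wrapper around the store that is handed to user code. CommittableStore and UserFacingStore are hypothetical stand-ins, and the String-keyed offset map replaces Map<TopicPartition, Long> so that the sketch compiles without Kafka on the classpath. It shows the "throw an exception" variant, not the no-op variant.

```java
import java.util.Map;

// Hypothetical stand-in for the part of the store API discussed here.
interface CommittableStore {
    void put(String key, String value);
    void commit(Map<String, Long> changelogOffsets);
}

// Wrapper handed to user code: data access is delegated, commit() is rejected.
final class UserFacingStore implements CommittableStore {
    private final CommittableStore inner;

    UserFacingStore(final CommittableStore inner) {
        this.inner = inner;
    }

    @Override
    public void put(final String key, final String value) {
        inner.put(key, value);
    }

    @Override
    public void commit(final Map<String, Long> changelogOffsets) {
        // Only the runtime, which holds the unwrapped store, may commit.
        throw new UnsupportedOperationException("commit() may only be called by Kafka Streams");
    }
}
```

As noted in point 100, a wrapper like this can only be applied where the runtime controls what is handed to user code; it cannot stop a custom store from exposing its own commit path.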
-Matthias







On 4/21/24 11:58 PM, Bruno Cadonna wrote:
Hi all,

How should we proceed here?

1. with the plain .checkpoint file
2. with a way to use the state store interface on unassigned but locally existing task state
While I like option 2, I think option 1 is less risky and will give us the benefits of transactional state stores sooner. We should consider the interface approach afterwards, though.
Best,
Bruno



On 4/17/24 3:15 PM, Bruno Cadonna wrote:
Hi Nick and Sophie,
I think the task ID is not enough to create a state store that can read the offsets of non-assigned tasks for lag computation during rebalancing. The state store also needs the state directory so that it knows where to find the information that it needs to return from changelogOffsets().
In general, I think we should proceed with the plain .checkpoint file for now and iterate back to the state store solution later since it seems it is not that straightforward. Alternatively, Nick could timebox an effort to better understand what would be needed for the state store solution. Nick, let us know your decision.
Regarding your question about the state store instance. I am not too familiar with that part of the code, but I think the state store is built when the processor topology is built and the processor topology is built per stream task. So there is one instance of processor topology and state store per stream task. Try to follow the call in [1].
Best,
Bruno
[1] https://github.com/apache/kafka/blob/f52575b17225828d2ff11996030ab7304667deab/streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java#L153
On 4/16/24 8:59 PM, Nick Telford wrote:
That does make sense. The one thing I can't figure out is how per-Task StateStore instances are constructed.
It looks like we construct one StateStore instance for the whole Topology (in InternalTopologyBuilder), and pass that into ProcessorStateManager (via StateManagerUtil) for each Task, which then initializes it.
This can't be the case though, otherwise multiple partitions of the same sub-topology (aka Tasks) would share the same StateStore instance, which they don't.

What am I missing?
On Tue, 16 Apr 2024 at 16:22, Sophie Blee-Goldman <[email protected]> wrote:
I don't think we need to *require* a constructor accept the TaskId, but we would definitely make sure that the RocksDB state store changes its constructor to one that accepts the TaskID (which we can do without deprecation since it's an internal API), and custom state stores can just decide for themselves whether they want to opt-in/use the TaskId param or not. I mean custom state stores would have to opt-in anyways by implementing the new StoreSupplier#get(TaskId) API and the only reason to do that would be to have created a constructor that accepts a TaskId
Just to be super clear about the proposal, this is what I had in mind. It's actually fairly simple and wouldn't add much to the scope of the KIP (I think -- if it turns out to be more complicated than I'm assuming, we should definitely do whatever has the smallest LOE to get this done).
Anyways, the (only) public API changes would be to add this new method to the StoreSupplier API:

default T get(final TaskId taskId) {
     return get();
}
We can decide whether or not to deprecate the old #get but it's not really necessary and might cause a lot of turmoil, so I'd personally say we just leave both APIs in place.
And that's it for public API changes! Internally, we would just adapt each of the rocksdb StoreSupplier classes to implement this new API. So for example with the RocksDBKeyValueBytesStoreSupplier, we just add

@Override
public KeyValueStore<Bytes, byte[]> get(final TaskId taskId) {
     return returnTimestampedStore ?
         new RocksDBTimestampedStore(name, metricsScope(), taskId) :
         new RocksDBStore(name, metricsScope(), taskId);
}

And of course add the TaskId parameter to each of the actual
state store constructors returned here.
Does that make sense? It's entirely possible I'm missing something important here, but I think this would be a pretty small addition that would solve the problem you mentioned earlier while also being useful to anyone who uses custom state stores.
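
To illustrate the compatibility argument for custom stores, here is a self-contained sketch. SketchTaskId and SketchStoreSupplier are stand-ins that only mirror the shape of the proposal so the example compiles without Kafka Streams; LegacySupplier and TaskAwareSupplier are hypothetical custom suppliers, and plain strings stand in for store instances.

```java
// Stand-in for a task ID, printed as "<subtopology>_<partition>".
final class SketchTaskId {
    final int subtopology;
    final int partition;

    SketchTaskId(final int subtopology, final int partition) {
        this.subtopology = subtopology;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return subtopology + "_" + partition;
    }
}

// Stand-in for the supplier API with the proposed default method.
interface SketchStoreSupplier<T> {
    T get();

    // Proposed addition: defaults to the old method for compatibility.
    default T get(final SketchTaskId taskId) {
        return get();
    }
}

// Existing custom supplier: compiles unchanged and ignores the task ID.
final class LegacySupplier implements SketchStoreSupplier<String> {
    @Override
    public String get() {
        return "store";
    }
}

// Custom supplier that opts in and passes the task ID on to its store.
final class TaskAwareSupplier implements SketchStoreSupplier<String> {
    @Override
    public String get() {
        return "store";
    }

    @Override
    public String get(final SketchTaskId taskId) {
        return "store-for-task-" + taskId;
    }
}
```

A caller that always invokes get(taskId) works with both kinds of supplier: the legacy one falls back to get(), while the task-aware one sees the ID at construction time instead of waiting for init().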
On Mon, Apr 15, 2024 at 10:21 AM Nick Telford <[email protected]> wrote:
Hi Sophie,
Interesting idea! Although what would that mean for the StateStore interface? Obviously we can't require that the constructor take the TaskId.
Is it enough to add the parameter to the StoreSupplier?
Would doing this be in-scope for this KIP, or are we over-complicating it?
Nick
On Fri, 12 Apr 2024 at 21:30, Sophie Blee-Goldman <[email protected]> wrote:
Somewhat minor point overall, but it actually drives me crazy that you can't get access to the taskId of a StateStore until #init is called. This has caused me a huge headache personally (since the same is true for processors and I was trying to do something that's probably too hacky to actually complain about here lol)
Can we just change the StateStoreSupplier to receive and pass along the taskId when creating a new store? Presumably by adding a new version of the #get method that takes in a taskId parameter? We can have it default to invoking the old one for compatibility reasons and it should be completely safe to tack on.

Would also prefer the same for a ProcessorSupplier, but that's definitely outside the scope of this KIP
On Fri, Apr 12, 2024 at 3:31 AM Nick Telford <[email protected]> wrote:
On further thought, it's clear that this can't work for one simple reason: StateStores don't know their associated TaskId (and hence, their StateDirectory) until the init() call. Therefore, committedOffset() can't be called before init(), unless we also added a StateStoreContext argument to committedOffset(), which I think might be trying to shoehorn too much into committedOffset().
I still don't like the idea of the Streams engine maintaining the cache of changelog offsets independently of stores, mostly because of the maintenance burden of the code duplication, but it looks like we'll have to live with it.

Unless you have any better ideas?

Regards,
Nick
On Wed, 10 Apr 2024 at 14:12, Nick Telford <[email protected]> wrote:
Hi Bruno,
Immediately after I sent my response, I looked at the codebase and came to the same conclusion. If it's possible at all, it will need to be done by creating temporary StateManagers and StateStores during rebalance. I think it is possible, and probably not too expensive, but the devil will be in the detail.
I'll try to find some time to explore the idea to see if it's possible and report back, because we'll need to determine this before we can vote on the KIP.

Regards,
Nick
On Wed, 10 Apr 2024 at 11:36, Bruno Cadonna <[email protected]> wrote:
Hi Nick,

Thanks for reacting to my comments so quickly!


2.
Some thoughts on your proposal.
State managers (and state stores) are parts of tasks. If the task is not assigned locally, we do not create those tasks. To get the offsets with your approach, we would need to either create kind of inactive tasks besides active and standby tasks or store and manage state managers of non-assigned tasks differently than the state managers of assigned tasks. Additionally, the cleanup thread that removes unassigned task directories needs to concurrently delete those inactive tasks or task-less state managers of unassigned tasks. This seems all quite messy to me.
Could we create those state managers (or state stores) for locally existing but unassigned tasks on demand when TaskManager#getTaskOffsetSums() is executed? Or have a different encapsulation for the unused task directories?


Best,
Bruno



On 4/10/24 11:31 AM, Nick Telford wrote:
Hi Bruno,

Thanks for the review!

1, 4, 5.
Done

3.
You're right. I've removed the offending paragraph. I had originally adapted this from the guarantees outlined in KIP-892. But it's difficult to provide these guarantees without the KIP-892 transaction buffers. Instead, we'll add the guarantees back into the JavaDoc when KIP-892 lands.
2.
Good point! This is the only part of the KIP that was (significantly) changed when I extracted it from KIP-892. My prototype currently maintains this "cache" of changelog offsets in .checkpoint, but doing so becomes very messy. My intent with this change was to try to better encapsulate this offset "caching", especially for StateStores that can cheaply provide the offsets stored directly in them without needing to duplicate them in this cache.
It's clear some more work is needed here to better encapsulate this. My immediate thought is: what if we construct *but don't initialize* the StateManager and StateStores for every Task directory on-disk? That should still be quite cheap to do, and would enable us to query the offsets for all on-disk stores, even if they're not open. If the StateManager (aka. ProcessorStateManager/GlobalStateManager) proves too expensive to hold open for closed stores, we could always have a "StubStateManager" in its place, that enables the querying of offsets, but nothing else?

IDK, what do you think?

Regards,

Nick
On Tue, 9 Apr 2024 at 15:00, Bruno Cadonna <[email protected]> wrote:
Hi Nick,

Thanks for breaking out the KIP from KIP-892!

Here are a couple of comments/questions:

1.
In Kafka Streams, we have a design guideline which says to not use the "get"-prefix for getters on the public API. Could you please change getCommittedOffsets() to committedOffsets()?


2.
It is not clear to me how TaskManager#getTaskOffsetSums() should read offsets of tasks the stream thread does not own but that have a state directory on the Streams client by calling StateStore#getCommittedOffsets(). If the thread does not own a task it also does not create any state stores for the task, which means there is no state store on which to call getCommittedOffsets().
I would have rather expected that a checkpoint file is written for all state stores on close -- not only for the RocksDBStore -- and that this checkpoint file is read in TaskManager#getTaskOffsetSums() for the tasks that have a state directory on the client but are not currently assigned to any stream thread of the Streams client.


3.
In the javadocs for commit() you write

"... all writes since the last commit(Map), or since init(StateStore) *MUST* be available to readers, even after a restart."
This is only true for a clean close before the restart, isn't it?
If the task fails with a dirty close, Kafka Streams cannot guarantee that the in-memory structures of the state store (e.g. memtable in the case of RocksDB) are flushed so that the records and the committed offsets are persisted.


4.
The wrapper that provides the legacy checkpointing behavior is actually an implementation detail. I would remove it from the KIP, but still state that the legacy checkpointing behavior will be supported when the state store does not manage the checkpoints.


5.
Regarding the metrics, could you please add the tags, and the recording level (DEBUG or INFO) as done in KIP-607 or KIP-444.


Best,
Bruno

On 4/7/24 5:35 PM, Nick Telford wrote:
Hi everyone,
Based on some offline discussion, I've split out the "Atomic Checkpointing" section from KIP-892: Transactional Semantics for StateStores, into its own KIP

KIP-1035: StateStore managed changelog offsets
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1035%3A+StateStore+managed+changelog+offsets
While KIP-892 was adopted *with* the changes outlined in KIP-1035, these changes were always the most contentious part, and continued to spur discussion even after KIP-892 was adopted.
All the changes introduced in KIP-1035 have been removed from KIP-892, and a hard dependency on KIP-1035 has been added to KIP-892 in their place.
I'm hopeful that with some more focus on this set of changes, we can deliver something that we're all happy with.

Regards,
Nick
