Thanks for the clarification Jose, that clears up my confusion already :)

Guozhang

On Thu, Oct 1, 2020 at 10:51 AM Jose Garcia Sancio <jsan...@confluent.io> wrote:

> Thanks for the email Guozhang.
>
> > Thanks for the replies and the KIP updates. Just want to clarify one more
> > thing regarding my previous comment 3): I understand that when a snapshot
> > has completed loading, then we can use it in our handling logic of vote
> > requests. And I understand that:
> >
> > 1) Before a snapshot has been completely received (e.g. if we've only
> > received a subset of the "chunks"), then we just handle vote requests
> > "as if" there's no snapshot yet.
> >
> > 2) After a snapshot has been completely received and loaded into main
> > memory, we can handle vote requests "as of" the received snapshot.
> >
> > What I'm wondering is: in between these two synchronization barriers,
> > after all the snapshot chunks have been received but before the snapshot
> > has been completely parsed and loaded into the in-memory metadata cache,
> > if we receive a request (note they may be handled by different threads,
> > hence concurrently), what should we do? Or are you proposing that the
> > fetchSnapshot request would also be handled in that single-threaded raft
> > client loop so it is in order with all other requests? If that's the case
> > then we do not have any concurrency issues to worry about, but on the
> > other hand the reception of the last snapshot chunk and loading the
> > chunks into main memory may also take a long time, during which a client
> > may not be able to handle any other requests.
>
> Yes. The FetchSnapshot request and response handling will be performed
> by the KafkaRaftClient in a single-threaded fashion. The
> KafkaRaftClient doesn't need to load the snapshot to know what state
> it is in. It only needs to scan the "checkpoints" folder, load the
> quorum state file and know the LEO of the replicated log. I would
> modify 2) above to the following:
>
> 3) After the snapshot has been validated by
>    a) Fetching all of the chunks
>    b) Verifying the CRC of the records in the snapshot
>    c) Atomically moving the temporary snapshot to the permanent location
>
> After 3.c), the KafkaRaftClient only needs to scan and parse the
> filenames in the directory called "checkpoints" to find the
> largest/latest permanent snapshot.
>
> As you point out in 1), before 3.c) the KafkaRaftClient, with regard to
> leader election, will behave as if the temporary snapshot didn't
> exist.
>
> The loading of the snapshot will be done by the state machine (Kafka
> Controller or Metadata Cache) and it can perform this on a different
> thread. The KafkaRaftClient will provide an API for finding and
> reading the latest valid snapshot stored locally.
>
> Are you also concerned that the snapshot could have been corrupted after
> 3.c)?
>
> I also updated the "Changes to leader Election" section to make this a
> bit clearer.
>
> Thanks,
> Jose
>

--
-- Guozhang
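
For readers following the thread, steps 3.b), 3.c) and the filename scan described above might look roughly like the Java sketch below. It is only an illustration under stated assumptions, not the KafkaRaftClient API: the class and method names (SnapshotSketch, publish, latest) are hypothetical, the "<endOffset>-<epoch>.checkpoint" naming with a ".part" suffix for temporary files is assumed for the example, and a single whole-file CRC32 stands in for the per-record CRC check mentioned in the email.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.zip.CRC32;

class SnapshotSketch {
    // Assumed naming: "<endOffset>-<epoch>.checkpoint". Temporary files end in
    // ".checkpoint.part", so they never match this pattern.
    static final Pattern PERMANENT = Pattern.compile("(\\d+)-(\\d+)\\.checkpoint");

    /** Steps 3.b) and 3.c): verify a checksum, then atomically publish the file. */
    static void publish(Path temp, Path permanent, long expectedCrc) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = Files.newInputStream(temp)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        if (crc.getValue() != expectedCrc) {
            throw new IOException("CRC mismatch for " + temp);
        }
        // Readers see either no permanent snapshot or the complete one.
        Files.move(temp, permanent, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Finds the latest permanent snapshot by parsing file names only. */
    static Optional<Path> latest(Path checkpointsDir) throws IOException {
        try (Stream<Path> files = Files.list(checkpointsDir)) {
            return files
                .filter(p -> PERMANENT.matcher(p.getFileName().toString()).matches())
                .max(Comparator
                    .comparingLong((Path p) -> part(p, 1))   // end offset first
                    .thenComparingLong(p -> part(p, 2)));    // then epoch
        }
    }

    /** Parses one numeric component out of a permanent snapshot file name. */
    static long part(Path p, int group) {
        Matcher m = PERMANENT.matcher(p.getFileName().toString());
        if (!m.matches()) {
            throw new IllegalArgumentException(p.toString());
        }
        return Long.parseLong(m.group(group));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoints");
        byte[] data = "snapshot-bytes".getBytes(StandardCharsets.UTF_8);
        Path temp = dir.resolve("100-3.checkpoint.part");
        Files.write(temp, data);
        // Before the atomic move the temporary file is invisible to the scan.
        System.out.println(latest(dir).isPresent());

        CRC32 crc = new CRC32();
        crc.update(data);
        publish(temp, dir.resolve("100-3.checkpoint"), crc.getValue());
        System.out.println(latest(dir).get().getFileName());
    }
}
```

The point of the sketch is that the scan never opens a snapshot: a file either matches the permanent pattern or it is ignored, so leader election logic does not depend on a snapshot having been loaded into memory.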