zhijiangW commented on issue #11687: [FLINK-16536][network][checkpointing] Implement InputChannel state recovery for unaligned checkpoint
URL: https://github.com/apache/flink/pull/11687#issuecomment-611465439

Some thoughts to share from the implementation.

1. Which kind of thread should read the channel state? We cannot use the task thread, because it has to process and consume the filled buffers so that they can be recycled to load more channel state. In other words, the task thread and the state-loading thread must work in parallel.

Introducing a separate thread pool for reading state would cause more trouble. First, we would need to manage its lifecycle. Second, it would add thread overhead in the TaskExecutor container. Third, it is hard to determine how many threads are suitable for the demand; one thread might not be enough to serve many input channels at the same time. Fourth, it would make the thread model more complex, on top of the existing task and netty threads.

Therefore, the existing netty threads are the ideal candidates to address all of the above concerns. We do not need to worry about a netty thread being involved in IO operations, because a similar case already exists on the upstream side, where netty threads read blocking partitions. We can also make use of the default load balancing among input channels across netty threads.

2. The current implementation delays the partition request until channel state recovery has finished. This is the easiest way to avoid data ordering issues. But if in the future we want to perform new checkpoints during recovery, the partition request would have to be sent during recovery as well, and that brings some difficulties. One option is to grant 0 credits with the partition request to avoid receiving any data during recovery; then we would need to adjust the logic for when to announce the real credit after recovery. Another problem is that, besides `CheckpointBarrier`, we can also receive `EndOfPartitionEvent` without credits, so how do we store this event in the buffer queue without causing ordering issues? A further potential issue is how to switch between reading channel state and receiving network data in the netty thread if we want to support checkpoints during recovery.

So I take the simple way in the MVP for now, and will think further about how to support checkpoints during recovery in the future.
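
To illustrate the two points above, here is a minimal, self-contained sketch of the ordering and threading argument. It is not the code in this PR: `ChannelRecoverySketch`, `recoverThenRequest`, and `run` are hypothetical names, a plain Java thread stands in for a netty thread, a `Semaphore` stands in for the bounded buffer pool, and strings stand in for buffers. It only shows why the loader must run in parallel with the task thread (the pool is smaller than the recovered state) and why delaying the "partition request" keeps recovered state ahead of network data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch; none of these names are Flink classes. */
public class ChannelRecoverySketch {
    // Marker telling the consumer that no more records will arrive.
    private static final String END = "<end>";

    // Bounded buffer pool: the loader blocks until the task thread recycles a buffer.
    private final Semaphore freeBuffers;
    // Records waiting for the task thread, in arrival order.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public ChannelRecoverySketch(int numBuffers) {
        this.freeBuffers = new Semaphore(numBuffers);
    }

    // Runs on the I/O thread (an assumed stand-in for a netty thread).
    private void recoverThenRequest(List<String> state, List<String> network)
            throws InterruptedException {
        for (String record : state) {
            freeBuffers.acquire();   // wait for a free buffer
            queue.put(record);       // hand the recovered record to the task thread
        }
        // The "partition request" is only issued here, after recovery has finished,
        // so network data can never overtake recovered state.
        for (String record : network) {
            freeBuffers.acquire();
            queue.put(record);
        }
        queue.put(END);
    }

    // The caller plays the task thread: it consumes records and recycles buffers.
    public List<String> run(List<String> state, List<String> network)
            throws InterruptedException {
        Thread io = new Thread(() -> {
            try {
                recoverThenRequest(state, network);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "io-thread");
        io.start();

        List<String> consumed = new ArrayList<>();
        while (true) {
            String record = queue.take();
            if (END.equals(record)) {
                break;
            }
            consumed.add(record);
            freeBuffers.release();   // recycle the buffer so the loader can continue
        }
        io.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        ChannelRecoverySketch sketch = new ChannelRecoverySketch(2);
        List<String> state = java.util.Arrays.asList("s1", "s2", "s3");
        List<String> network = java.util.Arrays.asList("n1");
        System.out.println(sketch.run(state, network)); // prints [s1, s2, s3, n1]
    }
}
```

With only two buffers and three recovered records, the loader would block forever if the same thread also had to consume them. Supporting checkpoints during recovery would mean interleaving the two loops in `recoverThenRequest`, which is exactly where the ordering questions above come from.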