zhijiangW commented on issue #11687: [FLINK-16536][network][checkpointing] 
Implement InputChannel state recovery for unaligned checkpoint
URL: https://github.com/apache/flink/pull/11687#issuecomment-611465439
 
 
   Some thoughts from the implementation:
   
   1. Which thread should read the channel state?
   We cannot use the task thread. The task thread has to process the filled
buffers so that they are recycled and can be reused to load more channel state.
In other words, the task thread and the state-loading thread must work in
parallel.
   
   Introducing a separate thread pool for reading state causes more trouble.
First, we need to manage its lifecycle. Second, it adds thread overhead to the
TaskExecutor container. Third, it is hard to determine how many threads are
suitable; one thread might not be enough to serve many input channels at the
same time. Fourth, it makes the thread model more complex, on top of the
existing task and netty threads.
   
   Therefore, the existing netty threads are the ideal candidate to address
all of the above concerns. We do not need to worry about a netty thread being
involved in IO operations, because a similar case already exists on the
upstream side, where netty threads read blocking partitions. We can also rely
on the default load balancing of input channels across netty threads.
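
   To illustrate why loading and consuming must run on different threads,
here is a minimal sketch with a bounded buffer pool. All names
(`ChannelStateRecoverySketch`, `recover`, the two queues) are hypothetical and
are not Flink classes; the loader thread only stands in for a netty thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch, not Flink code: a loader thread and the task thread share a bounded buffer pool. */
class ChannelStateRecoverySketch {

    // Marker object that tells the consumer no more state will arrive.
    private static final byte[] END_OF_STATE = new byte[0];

    /** Restores stateChunks chunks (fewer than 128) through a pool of poolSize buffers. */
    static List<Integer> recover(int stateChunks, int poolSize) {
        BlockingQueue<byte[]> freeBuffers = new ArrayBlockingQueue<>(poolSize);
        BlockingQueue<byte[]> filledBuffers = new ArrayBlockingQueue<>(poolSize + 1);
        for (int i = 0; i < poolSize; i++) {
            freeBuffers.add(new byte[1]);
        }

        // Stands in for the netty thread: blocks in take() whenever the pool is exhausted.
        Thread loader = new Thread(() -> {
            try {
                for (int chunk = 0; chunk < stateChunks; chunk++) {
                    byte[] buffer = freeBuffers.take();
                    buffer[0] = (byte) chunk; // "read" one chunk of channel state
                    filledBuffers.put(buffer);
                }
                filledBuffers.put(END_OF_STATE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "state-loader");
        loader.start();

        // Stands in for the task thread: consume a filled buffer, then recycle it.
        List<Integer> consumed = new ArrayList<>();
        try {
            while (true) {
                byte[] buffer = filledBuffers.take();
                if (buffer == END_OF_STATE) {
                    break;
                }
                consumed.add((int) buffer[0]);
                freeBuffers.put(buffer); // recycling unblocks the loader
            }
            loader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed;
    }
}
```

   In this sketch, if the task thread also did the loading, it would block on
the empty pool with nobody left to recycle buffers once the state is larger
than the pool.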
   
   2. The current implementation delays the partition request until channel
state recovery has finished. This is the easiest way to avoid data ordering
issues. But if we want to perform new checkpoints during recovery in the
future, the partition request would have to be sent while recovery is still in
progress, which brings some difficulties.
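
   The ordering guarantee of the delayed request can be sketched as follows.
This is only an illustration with hypothetical names
(`DeferredPartitionRequestSketch`, `onRecoveredBuffer`, `onRecoveryFinished`),
not the code in this PR; the log entries stand in for real actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch, not Flink code: the partition request is chained behind recovery. */
class DeferredPartitionRequestSketch {

    final List<String> log = new ArrayList<>();
    private final CompletableFuture<Void> recoveryDone = new CompletableFuture<>();

    DeferredPartitionRequestSketch() {
        // The request is only issued once recovery completes, so no network
        // data can be enqueued ahead of recovered state.
        recoveryDone.thenRun(() -> log.add("partition request sent"));
    }

    void onRecoveredBuffer(int sequenceNumber) {
        log.add("recovered buffer " + sequenceNumber);
    }

    void onRecoveryFinished() {
        recoveryDone.complete(null);
    }
}
```

   Calling `onRecoveredBuffer` any number of times and then
`onRecoveryFinished` always leaves the request entry last in `log`.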
   
   One option is to announce 0 credits in the partition request so that no
data is received during recovery; we would then need to adjust the logic for
when the real credit is announced after recovery. Another problem is that,
besides `CheckpointBarrier`, we can also receive `EndOfPartitionEvent` without
credits. So the question is how to store such an event in the buffer queue
without causing ordering issues.
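
   To make the ordering concern concrete, here is one conceivable handling,
written as a sketch only. It is not what this PR does, and every name in it
(`ZeroCreditRecoverySketch`, `parkedEvents`, `finishRecovery`) is hypothetical:
events that arrive without credit during recovery are parked and appended
behind the recovered state once recovery ends.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch, not Flink code: zero credit during recovery, events parked until it ends. */
class ZeroCreditRecoverySketch {

    private final Deque<String> inputQueue = new ArrayDeque<>();
    private final Deque<String> parkedEvents = new ArrayDeque<>();
    private final int realCredit;
    private boolean recovering = true;
    private int announcedCredit = 0; // 0 credits: no data buffers may be sent to us

    ZeroCreditRecoverySketch(int realCredit) {
        this.realCredit = realCredit;
    }

    void onRecoveredBuffer(String buffer) {
        if (!recovering) {
            throw new IllegalStateException("recovery already finished");
        }
        inputQueue.add(buffer);
    }

    /** Events need no credit, so they can arrive even while announcedCredit is 0. */
    void onNetworkEvent(String event) {
        if (recovering) {
            parkedEvents.add(event); // must not overtake recovered state
        } else {
            inputQueue.add(event);
        }
    }

    void finishRecovery() {
        recovering = false;
        inputQueue.addAll(parkedEvents);
        parkedEvents.clear();
        announcedCredit = realCredit; // now announce the real credit
    }

    int announcedCredit() {
        return announcedCredit;
    }

    List<String> drain() {
        List<String> out = new ArrayList<>(inputQueue);
        inputQueue.clear();
        return out;
    }
}
```

   For example, if `EndOfPartitionEvent` arrives between two recovered
buffers, `drain()` after `finishRecovery()` returns both buffers first and the
event last.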
   
   Another potential issue is how to switch between reading channel state and
receiving network data in the netty thread if we want to support checkpoints
during recovery. So I take the simple approach in the MVP for now, and will
think further about how to support checkpoints during recovery in the future.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
