Cai Liuyang created FLINK-26080:
-----------------------------------

             Summary: PartitionRequest client uses Netty's IdleStateHandler to 
monitor channel status
                 Key: FLINK-26080
                 URL: https://issues.apache.org/jira/browse/FLINK-26080
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Network
            Reporter: Cai Liuyang


In our production environment, we encountered an abnormal case:
    the upstreamTask is backpressured but its downStreamTask is idle, and the 
job stays in this state until the checkpoint times out (we use aligned 
checkpoints). After analysing this case, we found the reason (the kernel on 
our machines may have a bug that loses socket events):
    1. NettyServer encounters a ReadTimeoutException when reading data from the 
channel, so it releases the NetworkSequenceViewReader (which is responsible 
for sending data to the PartitionRequestClient) and writes an ErrorResponse to 
the PartitionRequestClient;
    2. PartitionRequestClient never receives the ErrorResponse (possibly 
because of the kernel bug above);
    3. After writing the ErrorResponse, NettyServer closes the channel (the 
socket moves to fin_wait1 status), but the client machine never receives the 
server's FIN, so it treats the channel as healthy and keeps waiting for the 
server's BufferResponse (even though the server has already released the 
corresponding NetworkSequenceViewReader);

    4. The server machine releases the socket if it stays in fin_wait1 status 
for too long, but the socket on the client machine is still in ESTABLISHED 
status.

To avoid this case, I think there are two methods:
    1. Client enables TCP keep-alive (Flink already enables it): this also 
requires adjusting the machine's tcp-keep-alive time (the default is 7200 
seconds, which is too long).
    2. Client uses Netty's IdleStateHandler to detect whether the channel is 
idle (for read or write); if it is, the client writes a pingMsg to the server 
to check whether the channel is really ok.
Of the two methods, I recommend method-2, because adjusting the machine's 
tcp-keep-alive time would affect other services running on the same machine.
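The probe logic of method-2 can be sketched as a small state machine. All names below (ChannelProbe, PingMsg/PongMsg, the timeout value) are hypothetical illustrations, not existing Flink classes; in a real implementation this logic would live in a Netty ChannelInboundHandler installed behind an IdleStateHandler, with onReadIdle driven by the IdleStateEvents that IdleStateHandler fires via userEventTriggered:

```java
import java.time.Duration;

// Hypothetical sketch of the client-side liveness probe proposed in method-2.
public class ChannelProbe {
    public enum State { HEALTHY, PING_SENT, DEAD }

    private final Duration pingTimeout;
    private State state = State.HEALTHY;
    private long pingSentAtMillis = -1;

    public ChannelProbe(Duration pingTimeout) {
        this.pingTimeout = pingTimeout;
    }

    /** Called when the idle detector reports the channel read-idle. */
    public State onReadIdle(long nowMillis) {
        if (state == State.HEALTHY) {
            // Channel looks idle: probe the server instead of waiting forever.
            // Here the client would actually write a PingMsg to the channel.
            state = State.PING_SENT;
            pingSentAtMillis = nowMillis;
        } else if (state == State.PING_SENT
                && nowMillis - pingSentAtMillis >= pingTimeout.toMillis()) {
            // No PongMsg came back in time: the server side is gone (e.g. its
            // fin_wait1 socket was already reaped), so the client should close
            // the channel and fail the pending partition requests.
            state = State.DEAD;
        }
        return state;
    }

    /** Called when any message (data or PongMsg) arrives from the server. */
    public void onRead() {
        if (state != State.DEAD) {
            state = State.HEALTHY;
            pingSentAtMillis = -1;
        }
    }

    public static void main(String[] args) {
        ChannelProbe probe = new ChannelProbe(Duration.ofSeconds(10));
        probe.onReadIdle(0);                          // idle -> send ping
        System.out.println(probe.onReadIdle(15_000)); // prints DEAD
    }
}
```

Unlike the tcp-keep-alive approach, the timeout here is per-connection and lives entirely inside the Flink process, so it needs no machine-level tuning.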

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
