zhougit86 commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1288691163

   > Could you elaborate more on the motivation behind this? I'm not sure how
   > useful is the idle information provided by this PR. From one hand, if there is
   > some data waiting to be sent, and it is not being sent, that's clearly visible
   > via a number of metrics (backpressured status, number of bytes sent, queues
   > lengths etc). So this is a bit redundant. On the other hand, there can be many
   > different reasons behind this timeout being triggered, like for example:
   > 
   > * idling operator not producing any data
   > * operator aggregating for a longer period of time (window)
   > * filtering out all of the records
   > * operator busy doing some very heavy work for a long period
   > * sorted shuffle service
   > * some unhealthy JVM/TM state (long GC pauses, memory swapping, long blocking IO)
   > 
   > All of the above would produce a false warning that would be misleading.
   
   Hi,
   
   I have updated this PR: the Netty client side now sends a heartbeat
   periodically, which avoids false alarms in the situations you listed above.
   In our k8s environment, we have repeatedly seen
   WriteAndFlushNextMessageIfPossibleListener complete successfully while the
   Netty server's idle handler still raised the idle alarm.
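
   To illustrate why a periodic heartbeat suppresses the false alarms, here is a minimal, self-contained sketch. It is a toy model, not the code in this PR and not Netty's API: the class `IdleCheckSketch`, its methods, and the 10 s timeout / 5 s heartbeat interval are all hypothetical values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a server-side idle check; hypothetical, not Flink or Netty code. */
public class IdleCheckSketch {

    /**
     * Returns the times (ms) at which an idle alarm would first fire in each
     * silent gap, given sorted arrival times of inbound messages, an idle
     * timeout, and the end of the observation window.
     */
    static List<Long> idleAlarms(List<Long> arrivalsMs, long timeoutMs, long endMs) {
        List<Long> alarms = new ArrayList<>();
        long lastRead = 0L; // observation starts at t = 0
        for (long t : arrivalsMs) {
            if (t - lastRead > timeoutMs) {
                alarms.add(lastRead + timeoutMs);
            }
            lastRead = t;
        }
        // Check the trailing gap between the last message and the end.
        if (endMs - lastRead > timeoutMs) {
            alarms.add(lastRead + timeoutMs);
        }
        return alarms;
    }

    /** Arrival times of heartbeats sent every intervalMs up to endMs (assumed zero latency). */
    static List<Long> heartbeats(long intervalMs, long endMs) {
        List<Long> beats = new ArrayList<>();
        for (long t = intervalMs; t <= endMs; t += intervalMs) {
            beats.add(t);
        }
        return beats;
    }

    public static void main(String[] args) {
        long timeout = 10_000L;
        long end = 60_000L;
        // A quiet operator (e.g. a long window) with no heartbeats: idle alarm fires.
        System.out.println(idleAlarms(List.of(), timeout, end));
        // The same quiet operator with a heartbeat every 5 s: no idle alarm.
        System.out.println(idleAlarms(heartbeats(5_000L, end), timeout, end));
    }
}
```

   In this model, an alarm that still fires while heartbeats are being sent points at the connection itself rather than at a quiet operator, which is the case the idle handler is meant to catch.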
   
   Please help review, thanks

