zhougit86 commented on PR #21080:
URL: https://github.com/apache/flink/pull/21080#issuecomment-1288691163

> Could you elaborate more on the motivation behind this? I'm not sure how useful is the idle information provided by this PR. From one hand, if there is some data waiting to be sent, and it is not being sent, that's clearly visible via a number of metrics (backpressured status, number of bytes sent, queues lengths etc). So this is a bit redundant. On the other hand, there can be many different reasons behind this timeout being triggered, like for example:
>
> * idling operator not producing any data
> * operator aggregating for a longer period of time (window)
> * filtering out all of the records
> * operator busy doing some very heavy work for a long period
> * sorted shuffle service
> * some unhealthy JVM/TM state (long GC pauses, memory swapping, long blocking IO)
>
> All of the above would produce a false warning that would be misleading.

Hi,

I have updated this PR: the Netty client now sends a heartbeat periodically, which avoids false alarms in the situations you listed above.

In our k8s environment, we have repeatedly observed WriteAndFlushNextMessageIfPossibleListener reporting success while the Netty server's idle handler still raised the idle alarm.

Please help review, thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org