
Re: remote task manager netty exception

2021-05-05 Thread Roman Khachatryan

Hi,

> Could it be somehow partition info isn't up to date on TM when job is
> restarting?

Partition info should be up to date or become so eventually - but this is assuming that JM is able to detect the failure. The latter may not be the case, as Sihan You wrote previously:

> The strange thing i

Re: remote task manager netty exception

2021-05-04 Thread Yichen Liu

Chiming in here since I work with Sihan. Roman, there aren't many logs beyond this WARN; in fact it should be an ERROR, since it fails our job and the job has to restart. Here is a fresh example of the "Sending the partition request to 'null' failed." exception. The only log we saw before the exception was: tim

Re: remote task manager netty exception

2021-05-03 Thread Roman Khachatryan

Hi, I see that JM and TM failures are different (from TM, it's actually a warning). Could you please share the ERROR message from TM? Have you tried increasing taskmanager.network.retries [1]? [1] https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html#taskmanager-network-r
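
For readers trying the suggestion above, a minimal sketch of how the setting might look in the cluster's Flink configuration file (commonly flink-conf.yaml). The value 3 is an arbitrary illustration and not a recommendation from this thread; check the configuration reference for your Flink version for the default and exact semantics.

```yaml
# Assumed location: the Flink configuration file read by the TaskManagers
# (e.g. conf/flink-conf.yaml).
# The key is the one named in the message above; the value is only an example.
taskmanager.network.retries: 3
```

TaskManagers typically need to be restarted to pick up a change to this file.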

remote task manager netty exception

2021-04-30 Thread Sihan You

Hi, We are experiencing a netty issue with our Flink cluster, and we couldn't figure out the cause. Below is the stack trace of exceptions from the TMs' and JM's perspectives. We have 85 TMs and one JM in HA mode. The strange thing is that only 23 of the TMs are complaining about the connection issu