Hi,
> Could it be somehow partition info isn't up to date on TM when job is
> restarting?
Partition info should be up to date, or become so eventually, but this
assumes that the JM is able to detect the failure.
That may not be the case, as Sihan You wrote previously:
> The strange thing i
Chiming in here since I work with Sihan.
Roman, there aren't many logs beyond this WARN; in fact it should be an ERROR,
since it fails our job and the job has to restart.
Here is a fresh example of the "Sending the partition request to 'null'
failed." exception. The only log we saw before the exception was:
tim
Hi,
I see that the JM and TM failures are different (on the TM, it's actually a
warning). Could you please share the ERROR message from the TM?
Have you tried increasing taskmanager.network.retries [1]?
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html#taskmanager-network-r
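
For reference, a minimal sketch of how that option might be set in
flink-conf.yaml. The value 3 is only an illustrative assumption, not a
recommendation, and whether the option is available depends on the Flink
version in use:

```yaml
# flink-conf.yaml (sketch; the value below is an assumed example)
# Number of retry attempts for network communication between TaskManagers.
taskmanager.network.retries: 3
```

The TaskManagers would need to be restarted for the change to take effect.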
Hi,
We are experiencing a Netty issue with our Flink cluster, and we
couldn't figure out the cause.
Below are the stack traces of the exceptions from the TMs' and JM's
perspectives. We have 85 TMs and one JM in HA mode. The strange thing is that
only 23 of the TMs are complaining about the connection issu