tarting the JobManager JVM does successfully recover the Job, but I'd
>> like to avoid having to do that if possible.
>>
>> Caused by: java.net.UnknownHostException: <>.com: Temporary
>> failure in name resolution
>> at java.net.Inet4AddressImpl.lookupAllH
t install a SecurityManager and therefore the
> JVM should only cache invalid name requests for 10 seconds.
>
> Restarting the JobManager JVM does successfully recover the Job, but I'd
> like to avoid having to do that if possible.
>
> Caused by
cessfully recover the Job, but I'd
like to avoid having to do that if possible.
Caused by: java.net.UnknownHostException: <****>.com: Temporary failure
in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.l
Hi,
The issue might be related to garbage collection pauses during which the TM
JVM cannot communicate with the JM.
The metrics include stats for memory consumption [1] and GC activity [2]
that can help diagnose the problem.
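
As a rough illustration of what those numbers represent, the same kind of
memory and GC data can be read directly from the standard JVM management
beans. This is only a minimal sketch using java.lang.management, not Flink's
own metrics code; the class name JvmHealthSnapshot and its helper methods are
made up for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of Flink): reads heap and GC figures
// from the standard JVM management beans.
class JvmHealthSnapshot {

    // Total number of collections across all collectors.
    // A collector may report -1 when the count is undefined; those are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds.
    // A collector may report -1 when the time is undefined; those are skipped.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Currently used heap memory in bytes.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("GC count:          " + totalGcCount());
        System.out.println("GC time (ms):      " + totalGcTimeMillis());
    }
}
```

If the GC time grows by several seconds between two samples taken a short
time apart, long pauses are a plausible cause of the lost JM/TM connection.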
Best, Fabian
[1]
https://ci.apache.org/projects/flink/flink-docs-re
Hi,
I checked the code again to figure out where the problem could be.
I just wondered whether I'm implementing the Evictor correctly.
Full code:
https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237
public static class EsbTraceEvictor implements Evictor {
org.slf4j.Logger LOG =
Hi Timo, we have a similar issue: a TM got killed by a job. Is there a way
to monitor the JVM status? If it is through the metrics, which metrics should
I look at?
We are running Flink on K8S. Is it possible that a job consumes so much
network bandwidth that the JM and TM cannot connect?
On Tue
Hi Miki,
to me this sounds like your job has a resource leak, so that your
memory fills up and the JVM of the TaskManager is killed at some point.
What does your job look like? I see a WindowedStream.apply, which might
not be appropriate if you have big/frequent windows where the evaluation
h
rting them again worked without a flaw. My bet is on
something Flink-external because of the "Temporary failure in name
resolution" error message.
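
To rule out JVM-side caching of the failed lookup, the negative DNS cache
setting can be checked from inside the JVM. The snippet below is only a
sketch: networkaddress.cache.negative.ttl is the standard Java security
property for this, but the class name NegativeDnsCacheTtl, the helper methods
and the fallback of 10 seconds (the value the stock java.security file
usually ships with) are assumptions made for the example.

```java
import java.security.Security;

// Hypothetical helper: reports the configured TTL (in seconds) for cached
// failed name lookups in this JVM.
class NegativeDnsCacheTtl {

    // Assumption: 10 is the value set in the stock java.security file.
    static final int ASSUMED_DEFAULT_SECONDS = 10;

    // Parses the raw property value; falls back when it is missing or malformed.
    static int parseTtl(String raw, int fallback) {
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Reads the standard security property controlling negative DNS caching.
    static int configuredTtlSeconds() {
        String raw = Security.getProperty("networkaddress.cache.negative.ttl");
        return parseTtl(raw, ASSUMED_DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("negative DNS cache TTL (s): " + configuredTtlSeconds());
    }
}
```

A value of 0 means failed lookups are not cached at all, and -1 means they
are cached forever. If the reported value is small and the error persists for
much longer, the cause is more likely outside the JVM.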
Maybe @Patrick (cc'd) has encountered this before and knows more.
Nico
[1]
https://ci.apache.org/projects/flink/flink-docs-r
visor
> - Association with remote system
> [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed,
> address is now gated for [5000] ms. Reason: [Association failed with
> [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by:
> [flink