Hi Chesnay,

I don't know if this helps, but I'd run into this as well, though I haven't hooked up YourKit to analyze exactly what's causing the memory problem.
E.g. after about 3.5 hours running locally, it failed with memory issues. In the TaskManager logs, I start seeing exceptions in my code...

java.lang.OutOfMemoryError: GC overhead limit exceeded

And then eventually...

2018-04-07 21:55:25,686 WARN  org.apache.flink.runtime.accumulators.AccumulatorRegistry - Failed to serialize accumulators for task.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Immediately after this, one of my custom functions gets a close() call, and I see a log msg about it "switched from RUNNING to FAILED". After this, I see messages that the job is being restarted, but the TaskManager log output abruptly ends.

In the JobManager log, this is what is output following the time of the last TaskManager logging output:

2018-04-07 21:57:33,702 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 129 @ 1523163453702
2018-04-07 21:58:43,916 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-04-07 21:58:51,084 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:58:51,086 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:01,047 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:01,050 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,057 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,058 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms.
Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:21,056 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]
2018-04-07 21:59:21,063 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780/user/taskmanager terminated.
2018-04-07 21:59:21,064 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - FetchUrlsFunction for sitemap -> ParseSiteMapFunction -> OutlinkToStateUrlFunction (1/1) (3e9374d1bf5fdb359e3a624a4d5d659b) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed: c51d3879b6244828eb9fc78c943007ad @ kens-mbp.hsd1.ca.comcast.net (dataPort=63782)

— Ken

> On Apr 9, 2018, at 12:48 PM, Chesnay Schepler <ches...@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations.
>
> Please have a look at the TaskManager logs, they may contain a hint as to why it shut down.
>
> On 09.04.2018 16:01, Javier Lopez wrote:
>> Hi,
>>
>> "are you moving the job jar to the ~/flink-1.4.2/lib path ?" -> Yes, to every node in the cluster.
>>
>> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>> Javier
>> "adding the jar file to the /lib path of every task manager"
>> are you moving the job jar to the ~/flink-1.4.2/lib path ?
>>
>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>> Hi,
>>
>> We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading
>> We also added these Java options: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>
>> From time to time we have the same problem with TaskManagers disconnecting, but the logs are not useful. We are using 1.3.2.
>>
>> On 9 April 2018 at 10:41, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>> I've seen a similar problem, but it was not the heap size, but Metaspace.
>> It was caused by a job restarting in a loop. It looks like for each restart, Flink loads new instances of the classes, and very soon it runs out of metaspace.
>>
>> I've created a JIRA issue for this problem, but got no response from the development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>
>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>> thanks a lot, I will try it
>>
>> On 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>> I have seen this when my task manager ran out of RAM. Increase the heap size.
>>
>> flink-conf.yaml:
>> taskmanager.heap.mb
>> jobmanager.heap.mb
>>
>> Michael
>>
>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>
>>> <QQ图片20180408163927.png>
>>> hi all, recently I found a problem: it runs well when it starts, but after a long run the exception shown above appears. How can I resolve it?

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
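A minimal flink-conf.yaml sketch that pulls together the settings mentioned in this thread: the heap keys from Michael's reply and the metaspace JVM options Javier quoted. The -XX:+HeapDumpOnOutOfMemoryError flag is an extra assumption on my part (not from the thread), added so that an OOM leaves a dump that can be inspected later, e.g. with YourKit. All values are placeholders, not recommendations:

    # Heap sizes for the JobManager and TaskManager processes (illustrative values, tune per cluster).
    jobmanager.heap.mb: 1024
    taskmanager.heap.mb: 4096

    # JVM options passed to the Flink processes. The metaspace limits are the ones
    # quoted earlier in the thread; the heap-dump flags are an added assumption for diagnosis.
    env.java.opts: "-XX:MetaspaceSize=200M -XX:MaxMetaspaceSize=300M -XX:CompressedClassSpaceSize=100M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

Note that, per Javier's reply, capping metaspace only bounds the symptom; what actually resolved the growth for them was putting the job jar into the lib/ directory of every TaskManager so classes are not reloaded on each restart (the dynamic-classloading issue Alexander describes in FLINK-9132).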