Hi Chesnay,

I don't know if this helps, but I'd run into this as well, though I haven't hooked up YourKit to analyze exactly what's causing the memory problem.
E.g. after about 3.5 hours running locally, it failed with memory issues. In the TaskManager logs, I start seeing exceptions in my code...

java.lang.OutOfMemoryError: GC overhead limit exceeded

And then eventually...

2018-04-07 21:55:25,686 WARN  org.apache.flink.runtime.accumulators.AccumulatorRegistry - Failed to serialize accumulators for task.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Immediately after this, one of my custom functions gets a close() call, and I see a log msg about it "switched from RUNNING to FAILED". After this, I see messages that the job is being restarted, but the TaskManager log output abruptly ends.

In the JobManager log, this is what is output following the time of the last TaskManager logging output:

2018-04-07 21:57:33,702 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 129 @ 1523163453702
2018-04-07 21:58:43,916 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-04-07 21:58:51,084 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:58:51,086 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:01,047 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:01,050 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,057 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,058 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms.
Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:21,056 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]
2018-04-07 21:59:21,063 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780/user/taskmanager terminated.
2018-04-07 21:59:21,064 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - FetchUrlsFunction for sitemap -> ParseSiteMapFunction -> OutlinkToStateUrlFunction (1/1) (3e9374d1bf5fdb359e3a624a4d5d659b) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed: c51d3879b6244828eb9fc78c943007ad @ kens-mbp.hsd1.ca.comcast.net (dataPort=63782)

— Ken

> On Apr 9, 2018, at 12:48 PM, Chesnay Schepler <ches...@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations.
>
> Please have a look at the TaskManager logs, they may contain a hint as to why it shut down.
>
> On 09.04.2018 16:01, Javier Lopez wrote:
>> Hi,
>>
>> "are you moving the job jar to the ~/flink-1.4.2/lib path ?" -> Yes, to every node in the cluster.
>>
>> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>> Javier
>> "adding the jar file to the /lib path of every task manager"
>> are you moving the job jar to the ~/flink-1.4.2/lib path ?
>>
>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>> Hi,
>>
>> We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading
>> We also added these Java options: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>
>> From time to time we have the same problem with TaskManagers disconnecting, but the logs are not useful. We are using 1.3.2.
>>
>> On 9 April 2018 at 10:41, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>> I've seen a similar problem, but it was not the heap size, but Metaspace.
>> It was caused by a job restarting in a loop. It looks like for each restart, Flink loads new instances of the classes, and very soon it runs out of metaspace.
>>
>> I've created a JIRA issue for this problem, but got no response from the development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>
>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>> thanks a lot, I will try it
>>
>> On 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>> I have seen this when my task manager ran out of RAM. Increase the heap size.
>>
>> flink-conf.yaml:
>> taskmanager.heap.mb
>> jobmanager.heap.mb
>>
>> Michael
>>
>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>
>>> <QQ图片20180408163927.png>
>>> hi all, recently I found a problem: it runs well when it starts, but after a long run the exception shown above appears. How can I resolve it?

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
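A minimal flink-conf.yaml sketch that pulls together the settings mentioned in this thread: the heap keys from Michael's reply and the metaspace JVM options Javier quoted. The -XX:+HeapDumpOnOutOfMemoryError flag is an extra assumption on my part (not from the thread), added so that an OOM leaves a dump that can be inspected later, e.g. with YourKit. All values are placeholders, not recommendations:

    # Heap sizes for the JobManager and TaskManager processes (illustrative values, tune per cluster).
    jobmanager.heap.mb: 1024
    taskmanager.heap.mb: 4096

    # JVM options passed to the Flink processes. The metaspace limits are the ones
    # quoted earlier in the thread; the heap-dump flags are an added assumption for diagnosis.
    env.java.opts: "-XX:MetaspaceSize=200M -XX:MaxMetaspaceSize=300M -XX:CompressedClassSpaceSize=100M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

Note that, per Javier's reply, capping metaspace only bounds the symptom; what actually resolved the growth for them was putting the job jar into the lib/ directory of every TaskManager so classes are not reloaded on each restart (the dynamic-classloading issue Alexander describes in FLINK-9132).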