Same story here, 1.3.2 on K8s. It is very hard to find the reason why a TM is killed, and it is not likely caused by a memory leak. If there is a logger I should turn on, please let me know.
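For anyone else on Kubernetes: before digging into Flink itself, it may be worth checking whether the kubelet OOM-killed the TaskManager container. A rough sketch of what to look at (pod and namespace names below are placeholders, not from this thread):

    # Was the container killed by Kubernetes (e.g. OOMKilled), and with what exit code?
    kubectl -n flink describe pod my-taskmanager-pod | grep -A 10 'Last State'

    # Logs from the previous (killed) container instance
    kubectl -n flink logs my-taskmanager-pod --previous

    # Cluster events around the time the TM disappeared
    kubectl -n flink get events --sort-by=.metadata.creationTimestamp

If "Last State" shows OOMKilled, the container memory limit was hit, which Flink's own heap metrics will not show directly.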
On Mon, Apr 9, 2018, 13:41 Lasse Nedergaard <lassenederga...@gmail.com> wrote:

> We see the same running 1.4.2 on Yarn hosted on an AWS EMR cluster. The only
> thing I can find in the logs is a SIGTERM with the code 15 or -100.
> Today our simple job reading from Kinesis and writing to Cassandra was
> killed. The other day, in another job, I identified a map state.remove
> command that caused a task manager to be lost without any exception.
> I find it frustrating that it is so hard to find the root cause.
> If I look at historical metrics for CPU, heap and non-heap I can't see
> anything that should cause a problem.
> So any ideas about how to debug this kind of exception are much
> appreciated.
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
>
>
> On 9 Apr 2018, at 21:48, Chesnay Schepler <ches...@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply
> means that a TaskManager shut down, for which there are a myriad of
> possible explanations.
>
> Please have a look at the TaskManager logs; they may contain a hint as to
> why it shut down.
>
> On 09.04.2018 16:01, Javier Lopez wrote:
>
> Hi,
>
> "are you moving the job jar to the ~/flink-1.4.2/lib path?" -> Yes,
> to every node in the cluster.
>
> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>
>> Javier
>> "adding the jar file to the /lib path of every task manager"
>> are you moving the job jar to the ~/flink-1.4.2/lib path?
>>
>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de>
>> wrote:
>>
>>> Hi,
>>>
>>> We had the same metaspace problem; it was solved by adding the jar file
>>> to the /lib path of every task manager, as explained here:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading
>>> We also added these Java options: "-XX:CompressedClassSpaceSize=100M
>>> -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>>
>>> From time to time we have the same problem with TaskManagers
>>> disconnecting, but the logs are not useful. We are using 1.3.2.
>>>
>>> On 9 April 2018 at 10:41, Alexander Smirnov <
>>> alexander.smirn...@gmail.com> wrote:
>>>
>>>> I've seen a similar problem, but it was not the heap size, it was Metaspace.
>>>> It was caused by a job restarting in a loop. It looks like for each
>>>> restart, Flink loads a new instance of the classes, and very soon it runs
>>>> out of metaspace.
>>>>
>>>> I've created a JIRA issue for this problem, but got no response from
>>>> the development team on it:
>>>> https://issues.apache.org/jira/browse/FLINK-9132
>>>>
>>>>
>>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>>>>
>>>>> Thanks a lot, I will try it.
>>>>>
>>>>> At 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>>>>>
>>>>> I have seen this when my task manager ran out of RAM. Increase the
>>>>> heap size.
>>>>>
>>>>> flink-conf.yaml:
>>>>> taskmanager.heap.mb
>>>>> jobmanager.heap.mb
>>>>>
>>>>> Michael
>>>>>
>>>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>>>
>>>>> <QQ图片20180408163927.png>
>>>>> Hi all, recently I found a problem. The job runs well at the start, but
>>>>> after a long run the exception above appears. How can I resolve it?
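For reference, the heap and metaspace settings mentioned above would look roughly like this in flink-conf.yaml. The metaspace flags are the ones quoted in this thread; the heap sizes are only placeholders to illustrate the keys, not a recommendation:

    # flink-conf.yaml (Flink 1.3/1.4-style memory keys)
    taskmanager.heap.mb: 4096        # placeholder value, size for your job
    jobmanager.heap.mb: 1024         # placeholder value

    # JVM options applied to JobManager and TaskManager processes
    env.java.opts: "-XX:MetaspaceSize=200M -XX:MaxMetaspaceSize=300M -XX:CompressedClassSpaceSize=100M"

Capping metaspace this way at least turns a silent TaskManager death into an explicit OutOfMemoryError: Metaspace in the TM log, which makes the classloading-leak case much easier to spot.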