Hi,

OOMs from metaspace probably mean that your jars are not releasing some resources:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html#unloading-of-dynamically-loaded-classes
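To make that failure mode concrete, here is a minimal sketch of such a leak (the jar path, the class name and the static cache are placeholders; the cache stands in for any library on the parent classpath that keeps references to user-code objects across job submissions):

import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClassUnloadingLeakSketch {

    // Loaded by a long-lived classloader, so it survives individual job submissions.
    static final List<Object> CACHE = new ArrayList<>();

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 1000; i++) {
            // Each "job submission" gets its own classloader, similar to Flink's
            // per-job user-code classloader. The jar path is a placeholder.
            URLClassLoader jobClassLoader =
                new URLClassLoader(new URL[] { new URL("file:///path/to/user-job.jar") });
            Object userObject = Class.forName("com.example.UserFunction", true, jobClassLoader)
                .getDeclaredConstructor().newInstance();

            // The leak: the reference escapes into a long-lived structure, so the
            // classloader and every class it defined stay reachable and their
            // metaspace is never reclaimed.
            CACHE.add(userObject);

            jobClassLoader.close(); // does not release classes that are still reachable
        }
    }
}

As long as something like that holds on to user classes, setting -XX:MaxMetaspaceSize only turns the slow native-memory growth into an explicit OutOfMemoryError in metaspace.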
Regarding the second issue (I guess it is probably somehow related to the first one): if it's indeed a heap space OOM, it should be fairly easy to analyse/debug. This article describes how to track down such issues, especially the chapter titled "Using Java VisualVM": https://www.toptal.com/java/hunting-memory-leaks-in-java It should allow you to pinpoint the owner and the source of the leak.

Piotrek

> On 12 Dec 2017, at 14:47, Javier Lopez <javier.lo...@zalando.de> wrote:
>
> Hi Piotr,
>
> We found out what the problem was in the workers. After setting a value for -XX:MaxMetaspaceSize we started to get OOM exceptions from the metaspace. We found out how Flink manages the user classes here https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html and solved the problem by adding the job's jar file to the /lib directory of the nodes (master and workers). Now we have constant memory usage in the workers.
>
> Unfortunately, we still have an OOM problem in the master node. We are using the same configuration as in the workers (200MB for MaxMetaspaceSize and 13000MB for heap) and after ~6000 jobs the master runs out of memory. The metaspace usage is almost constant, around 50MB, and the heap usage grows up to 10000MB; then GC does its work and reduces this usage. But we still have the OOM problems. Do you have any other idea of what could cause this problem? Our workaround is to restart the master, but we cannot keep doing this in the long term.
>
> Thanks for all your support, it has been helpful.
>
> On 16 November 2017 at 15:27, Javier Lopez <javier.lo...@zalando.de> wrote:
> Hi Piotr,
>
> Sorry for the late response, I'm out of the office and with limited access to the Internet. I think we are on the right path to solve this problem. Some time ago we did a memory analysis over 3 different clusters we are using: two of them are running jobs 24/7 and the other is the one deploying thousands of jobs. All of those clusters have the same behavior for arrays of Chars and Bytes (as expected), but for the class "java.lang.Class" the clusters that run 24/7 jobs have less than 20K instances of that class, whereas the other cluster has 383,120 instances. I don't know if this could be related.
>
> I hope that we can test this soon, and will let you know if this fixed the problem.
>
> Thanks.
>
> On 15 November 2017 at 13:18, Piotr Nowojski <pi...@data-artisans.com> wrote:
> Hi,
>
> I have been able to observe some off-heap memory "issues" by submitting the Kafka job provided by Javier Lopez (in a different mailing thread).
>
> TL;DR:
>
> There was no memory leak; the memory pools "Metaspace" and "Compressed Class Space" just grow in size over time and are only rarely garbage collected. In my test case they together were wasting up to ~7GB of memory, while the test case could use as little as ~100MB. Connect with, for example, jconsole to your JVM, check their size, and cut it in half by setting:
>
> env.java.opts: -XX:CompressedClassSpaceSize=***M -XX:MaxMetaspaceSize=***M
>
> in flink-conf.yaml. Everything works fine and memory consumption is still too high? Rinse and repeat.
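If attaching jconsole is not convenient, here is a minimal sketch using only java.lang.management (nothing Flink-specific is assumed) that prints the same two pools from inside the JVM:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class NonHeapPoolCheck {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // On HotSpot 8 the two pools in question are named "Metaspace"
            // and "Compressed Class Space".
            if (name.equals("Metaspace") || name.equals("Compressed Class Space")) {
                MemoryUsage usage = pool.getUsage();
                String max = usage.getMax() < 0
                    ? "unlimited"
                    : (usage.getMax() / (1024 * 1024)) + "MB";
                System.out.println(name
                    + ": used=" + usage.getUsed() / (1024 * 1024) + "MB"
                    + ", committed=" + usage.getCommitted() / (1024 * 1024) + "MB"
                    + ", max=" + max);
            }
        }
    }
}

Checking these numbers between job submissions shows whether the pools stabilise after the settings above or keep growing.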
> Long story:
>
> In the default settings, with a max heap size of 1GB, the consumption of the non-heap memory pools "Metaspace" and "Compressed Class Space" was growing over time, seemingly indefinitely, and Metaspace was always around ~6 times larger than Compressed Class Space. The default max metaspace size is unlimited, while "Compressed Class Space" has a default max size of 1GB.
>
> When I decreased CompressedClassSpaceSize down to 100MB, its memory consumption grew up to 90MB and then started bouncing up and down by a couple of MB. "Metaspace" followed the same pattern, but using ~600MB. When I decreased MaxMetaspaceSize down to 200MB, the memory consumption of both pools was bouncing around ~220MB.
>
> It seems there are no general guidelines on how to configure those values, since it is heavily application dependent. However, this seems like the most likely suspect for the apparent OFF HEAP "memory leak" that was reported a couple of times in use cases where users are submitting hundreds/thousands of jobs to a Flink cluster. For more information please check here:
>
> https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/considerations.html
>
> Please let us know if this solves your issues.
>
> Thanks, Piotrek
>
>> On 13 Nov 2017, at 16:06, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>> Unfortunately the issue I've opened [1] was not a problem of Flink but was just caused by an ever-increasing job plan. So no help from that... Let's hope to find out the real source of the problem. Maybe using -Djdk.nio.maxCachedBufferSize could help (but I didn't try it yet).
>>
>> Best,
>> Flavio
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-7845
>>
>> On Wed, Oct 18, 2017 at 2:07 PM, Kien Truong <duckientru...@gmail.com> wrote:
>> Hi,
>>
>> We saw a similar issue in one of our jobs due to a ByteBuffer memory leak [1]. We fixed it using the solution in the article, setting -Djdk.nio.maxCachedBufferSize.
>>
>> This variable is available for Java > 8u102.
>>
>> Best regards,
>> Kien
>>
>> [1] http://www.evanjones.ca/java-bytebuffer-leak.html
>>
>> On 10/18/2017 4:06 PM, Flavio Pompermaier wrote:
>>> We also faced the same problem, but the number of jobs we can run before restarting the cluster depends on the volume of data shuffled around the network. We even had problems with a single job, and in order to avoid OOM issues we had to add some configuration to limit Netty memory usage, i.e.:
>>> - add to flink-conf.yaml -> env.java.opts: -Dio.netty.recycler.maxCapacity.default=1
>>> - edit taskmanager.sh and change TM_MAX_OFFHEAP_SIZE from 8388607T to 5g
>>>
>>> For this purpose we wrote a small test to reproduce the problem and opened an issue for it [1]. We still don't know if the problems are related, however.
>>>
>>> I hope that could be helpful,
>>> Flavio
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-7845
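For reference, a minimal sketch (plain NIO, nothing Flink-specific; the 64MB size and the temporary file are only for illustration) of the behaviour that -Djdk.nio.maxCachedBufferSize limits: writing a heap ByteBuffer through a channel makes the JDK copy it into a temporary direct buffer, which is then cached per thread, so occasional large writes from many threads can pin a lot of off-heap memory.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferCacheSketch {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("buffer-cache-demo", ".bin");
        ByteBuffer heapBuffer = ByteBuffer.allocate(64 * 1024 * 1024); // 64MB heap buffer

        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            // To perform this write, the JDK copies the heap buffer into a
            // temporary ~64MB direct buffer, which it keeps in a per-thread
            // cache after the write has finished.
            channel.write(heapBuffer);
        }

        // The cached direct buffer stays referenced by this thread even though
        // the write is long done - the off-heap growth described in the linked article.
        System.out.println("Wrote " + heapBuffer.capacity() + " bytes from thread "
            + Thread.currentThread().getName());
        Files.delete(tmp);
    }
}

The property takes a size in bytes; temporary direct buffers larger than the configured value are freed after use instead of being cached, which bounds the per-thread cache at the cost of re-allocating large buffers.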
>>> On Wed, Oct 18, 2017 at 10:48 AM, Javier Lopez <javier.lo...@zalando.de> wrote:
>>> Hi Robert,
>>>
>>> Sorry to reply this late. We did a lot of tests, trying to identify whether the problem was in our custom sources/sinks. We figured out that none of our custom components is causing this problem. We came up with a small test and realized that the Flink nodes run out of non-heap JVM memory and crash after the deployment of thousands of jobs.
>>>
>>> When rapidly deploying thousands or hundreds of thousands of Flink jobs - depending on job complexity in terms of resource consumption - the Flink nodes' non-heap JVM memory consumption grows until there is no more memory left on the machine and the Flink process crashes. Both TaskManagers and the JobManager exhibit the same behavior, though the TaskManagers die faster. The memory consumption doesn't decrease after stopping the deployment of new jobs, with the cluster being idle (no running jobs).
>>>
>>> We could replicate the behavior by rapidly deploying the WordCount job provided in the Quickstart with a Python script. We started 24 instances of the deployment script in parallel.
>>>
>>> The non-heap JVM memory consumption grows faster with more complex jobs, e.g. reading 10K events from Kafka and printing them to STDOUT (*), so fewer deployed jobs are needed until the TaskManagers/JobManager die.
>>>
>>> We employ Flink 1.3.2 in standalone mode on AWS EC2 t2.large nodes with 4GB RAM inside Docker containers. For the test, we used 2 TaskManagers and 1 JobManager.
>>>
>>> (*) A slightly changed Python script was used, which waited 15 seconds after deployment for the 10K events to be read from Kafka and then canceled the freshly deployed job via the Flink REST API.
>>>
>>> If you want, we can provide the scripts and jobs we used for this test. We have a workaround for this, which restarts the Flink nodes once a memory threshold is reached, but this has lowered the availability of our services.
>>>
>>> Thanks for your help.
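A minimal sketch of the kind of check such a restart workaround could be based on, using only the JVM's own accounting (note this only covers JVM-tracked non-heap pools such as metaspace, compressed class space and the code cache, not direct or native allocations; the 2GB threshold and one-minute interval are arbitrary examples):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class NonHeapThresholdCheck {

    // Example threshold only - pick something meaningful for the actual machine size.
    private static final long THRESHOLD_BYTES = 2L * 1024 * 1024 * 1024;

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
            System.out.println("Committed non-heap: " + nonHeap.getCommitted() / (1024 * 1024) + "MB");
            if (nonHeap.getCommitted() > THRESHOLD_BYTES) {
                // In the real workaround this is where the node would be flagged for
                // a restart (e.g. by exiting with a code the orchestrator reacts to).
                System.out.println("Threshold exceeded - node should be restarted");
                return;
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}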
>>> On 30 August 2017 at 10:39, Robert Metzger <rmetz...@apache.org> wrote:
>>> I just saw that your other email is about the same issue.
>>>
>>> Since you've done a heap dump already, did you see any pattern in the allocated objects? Ideally, none of the classes from your user code should stick around when no job is running. What's the size of the heap dump? I'm happy to take a look at it if it's reasonably small.
>>>
>>> On Wed, Aug 30, 2017 at 10:27 AM, Robert Metzger <rmetz...@apache.org> wrote:
>>> Hi Javier,
>>>
>>> I'm not aware of such issues with Flink, but if you could give us some more details on your setup, I might get some more ideas on what to look for.
>>>
>>> Are you using the RocksDBStateBackend? (RocksDB is doing some JNI allocations that could potentially leak memory.)
>>> Also, are you passing any special garbage collector options? (Maybe some classes are not unloaded.)
>>> Are you using anything else that is special (such as protobuf or Avro formats, or any other big library)?
>>>
>>> Regards,
>>> Robert
>>>
>>> On Mon, Aug 28, 2017 at 5:04 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>>> Hi all,
>>>
>>> We are starting a lot of Flink jobs (streaming), and after we have started 200 or more jobs we see that the non-heap memory in the TaskManagers increases a lot, to the point of killing the instances. We found out that every time we start a new job, the committed non-heap memory increases by 5 to 10MB. Is this expected behavior? Are there ways to prevent this?
>>
>> --
>> Flavio Pompermaier
>> Development Department
>>
>> OKKAM S.r.l.
>> Tel. +(39) 0461 041809