[ https://issues.apache.org/jira/browse/FLINK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277008#comment-17277008 ]
Till Rohrmann commented on FLINK-20663:
---------------------------------------

Thanks for driving the discussion. I feel that I lack a bit of context here.
* Which patch did you test and what does it do [~TsReaper]?
* Does this test ensure that we don't have a memory leak?
* If not, have we checked that we don't have a memory leak somewhere in our code?
* Have we been able to reproduce the problem with the {{DataSet/DataStream}} API only?

Again, before jumping to conclusions and action plans, I would like to make sure that we don't overlook anything here.

Concerning the action plan, I agree that option 2) is the best option here. But maybe there is also a fourth option: try to establish clear memory ownership, so that we don't hand {{ByteBuffer}}s into components and then release the memory in a different component.

Concerning thread-safety, I think this might be needed if we hand a {{ByteBuffer}} to one thread and release the underlying {{MemorySegment}} from another thread. That is the problem of unclear ownership we might have in the code base. Just because the existing code base didn't implement it does not mean that this is correct. Moreover, just because something didn't crash before does not mean that it was working either.

> Managed memory may not be released in time when operators use managed memory frequently
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-20663
>                 URL: https://issues.apache.org/jira/browse/FLINK-20663
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.12.0
>            Reporter: Caizhi Weng
>            Priority: Critical
>             Fix For: 1.12.2
>
>
> Some batch operators (like sort merge join or hash aggregate) use managed memory frequently. When these operators are chained together and the cluster load is a bit heavy, it is very likely that the following exception occurs:
> {code:java}
> 2020-12-18 10:04:32
> java.lang.RuntimeException: org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 512 pages
> 	at org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:85)
> 	at org.apache.flink.runtime.io.disk.SimpleCollectingOutputView.<init>(SimpleCollectingOutputView.java:49)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap$RecordArea.<init>(BytesHashMap.java:297)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.<init>(BytesHashMap.java:103)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.<init>(BytesHashMap.java:90)
> 	at LocalHashAggregateWithKeys$209161.open(Unknown Source)
> 	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:401)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:506)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:501)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:530)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547)
> 	at java.lang.Thread.run(Thread.java:834)
> 	Suppressed: java.lang.NullPointerException
> 		at LocalHashAggregateWithKeys$209161.close(Unknown Source)
> 		at org.apache.flink.table.runtime.operators.TableStreamOperator.dispose(TableStreamOperator.java:46)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:739)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:719)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:642)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:551)
> 		... 3 more
> 		Suppressed: java.lang.NullPointerException
> 			at LocalHashAggregateWithKeys$209766.close(Unknown Source)
> 			... 8 more
> Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 512 pages
> 	at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:231)
> 	at org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:83)
> 	... 13 more
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could not allocate 16777216 bytes, only 9961487 bytes are remaining. This usually indicates that you are requesting more memory than you have reserved. However, when running an old JVM version it can also be caused by slow garbage collection. Try to upgrade to Java 8u72 or higher if running on an old Java version.
> 	at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:164)
> 	at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:80)
> 	at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:229)
> 	... 14 more
> {code}
> It seems that this is caused by relying on GC to release managed memory, as {{System.gc()}} may not trigger GC in time. See {{UnsafeMemoryBudget.java}}.
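
To make that last point concrete, here is a minimal, simplified sketch of the GC-dependent reservation pattern the description refers to. It is not the actual {{UnsafeMemoryBudget}} code; the class name, retry count, and sleep interval are illustrative. The key property is that memory only returns to the budget once the previous owners have become unreachable and been collected, so the reserving side can only hint at GC and retry:

{code:java}
// Simplified, illustrative sketch of a GC-dependent memory budget.
// Not the actual Flink UnsafeMemoryBudget implementation.
import java.util.concurrent.atomic.AtomicLong;

public class GcDependentBudget {

    private final long totalBytes;
    private final AtomicLong availableBytes;

    public GcDependentBudget(long totalBytes) {
        this.totalBytes = totalBytes;
        this.availableBytes = new AtomicLong(totalBytes);
    }

    /** Called (e.g. by a Cleaner) once an owning object has become unreachable and was collected. */
    public void release(long bytes) {
        availableBytes.addAndGet(bytes);
    }

    /**
     * Tries to reserve memory. If not enough is available, the memory may still be
     * "in flight" (owners dropped but not yet collected), so we can only hint at GC
     * and retry; the JVM is free to ignore the hint, which is the timing problem here.
     */
    public void reserve(long bytes, int maxRetries) throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            long available = availableBytes.get();
            if (available >= bytes
                    && availableBytes.compareAndSet(available, available - bytes)) {
                return; // reservation succeeded
            }
            System.gc();       // only a hint; may not run, or not run in time
            Thread.sleep(10L); // give reference processing a chance to catch up
        }
        throw new IllegalStateException(
                "Could not reserve " + bytes + " bytes, only " + availableBytes.get()
                        + " of " + totalBytes + " bytes are available.");
    }
}
{code}

If an operator chain releases and re-acquires managed memory in quick succession, the reservation can fail even though the memory is logically free, because the budget is only replenished after the collector has actually run. That matches the {{MemoryReservationException}} at the bottom of the stack trace above.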
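
On the fourth option mentioned in the comment (clear memory ownership), a hedged sketch of what single-owner release could look like. {{OwnedBuffer}} is a hypothetical illustration, not an existing Flink class: only the owner may release, other components only receive duplicate views, and the release happens deterministically when the owning scope closes instead of waiting for GC:

{code:java}
// Hypothetical ownership sketch; OwnedBuffer is not a Flink API.
import java.nio.ByteBuffer;

public final class OwnedBuffer implements AutoCloseable {

    private final ByteBuffer buffer;
    private boolean released;

    private OwnedBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    public static OwnedBuffer allocate(int capacity) {
        return new OwnedBuffer(ByteBuffer.allocateDirect(capacity));
    }

    /** Other components get duplicate views; they can read and write but never release. */
    public ByteBuffer view() {
        if (released) {
            throw new IllegalStateException("Buffer was already released by its owner");
        }
        return buffer.duplicate();
    }

    /** Only the owner releases, exactly once, e.g. via try-with-resources. */
    @Override
    public void close() {
        released = true;
        // Here the owner would synchronously return the underlying memory to the
        // memory manager instead of waiting for the GC to reclaim it.
    }
}
{code}

The owner would allocate in a try-with-resources block and pass {{owned.view()}} to other components: the memory comes back the moment the owning scope ends, and requesting a new view after release fails fast. Whether such a pattern can be retrofitted onto the existing code paths is exactly the open ownership question raised above.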