[ 
https://issues.apache.org/jira/browse/FLINK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466592#comment-17466592
 ] 

Anton Kalashnikov commented on FLINK-25426:
-------------------------------------------

What exactly happens:
* In 
`org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent#incrementMemoryCounter`
 `DIRECT_MEMORY_COUNTER` is incremented on each allocation of a direct buffer. 
`DIRECT_MEMORY_COUNTER` is a static field, so its value is shared by all tests 
that run in the same JVM.
* `DIRECT_MEMORY_COUNTER` is decremented in 
`org.apache.flink.shaded.netty4.io.netty.buffer.PoolArena#finalize`. This 
means the decrement depends on GC (because of `finalize`). If GC collects all 
PoolArena objects between the parametrized tests, there is no problem; if it 
does not, the counter keeps growing from one test to the next.
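
To make the mechanism above concrete, here is a minimal sketch of a static counter that is only decremented when an arena is cleaned up. It is a simplified model, not Netty or Flink code: `DirectMemoryCounterModel`, `Arena`, `LIMIT_BYTES`, `ARENA_BYTES` and `simulate` are hypothetical names, and the explicit `release()` call stands in for what `finalize` would do after GC collects the object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of a process-wide direct-memory counter; not Netty code. */
final class DirectMemoryCounterModel {
    // Hypothetical limit and arena size, chosen only for illustration.
    static final long LIMIT_BYTES = 64;
    static final long ARENA_BYTES = 16;

    // Static, so the value survives from one simulated test to the next.
    static final AtomicLong COUNTER = new AtomicLong();

    /** Stands in for an arena whose memory is only given back on cleanup. */
    static final class Arena {
        Arena() {
            long used = COUNTER.addAndGet(ARENA_BYTES);
            if (used > LIMIT_BYTES) {
                COUNTER.addAndGet(-ARENA_BYTES); // roll back the failed reservation
                throw new OutOfMemoryError("model limit reached: " + used + " > " + LIMIT_BYTES);
            }
        }

        /** Stands in for the decrement a finalizer would perform after GC. */
        void release() {
            COUNTER.addAndGet(-ARENA_BYTES);
        }
    }

    /** Simulates {@code runs} tests and returns how many completed before the limit was hit. */
    static int simulate(int runs, boolean arenasReclaimedBetweenRuns) {
        COUNTER.set(0);
        List<Arena> stillReachable = new ArrayList<>(); // models arenas that GC cannot collect
        int completed = 0;
        try {
            for (int i = 0; i < runs; i++) {
                Arena arena = new Arena();
                if (arenasReclaimedBetweenRuns) {
                    arena.release();
                } else {
                    stillReachable.add(arena);
                }
                completed++;
            }
        } catch (OutOfMemoryError e) {
            // The model stops here, just as the real test run fails.
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("reclaimed between runs: " + simulate(10, true));
        System.out.println("never reclaimed:        " + simulate(10, false));
    }
}
```

With the assumed numbers, every run completes when arenas are reclaimed, while only the first four runs fit under the limit when they are not. The real limit and allocation sizes differ; only the shape of the failure is the point.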

The reason it started to fail recently is 
https://issues.apache.org/jira/browse/FLINK-25085.
Unfortunately, that change has a bug with closing the thread pool. Because the 
threads stay alive, GC doesn't collect the PoolArena objects, so the static 
`DIRECT_MEMORY_COUNTER` is only ever incremented until it reaches its maximum, 
and then we fail with OOM.
I have fixed the bug, but I don't fully agree with my fix, so let's discuss it 
in the PR (https://github.com/apache/flink/pull/18239). [~zjureel], 
[~guoyangze], can you please take a look at my PR and help find the right 
solution?
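
For readers unfamiliar with why live threads matter here: a running thread is a GC root, so everything it references stays reachable until the thread exits. The sketch below shows the generic pattern of always shutting a pool down and waiting for its threads to finish. It is not the change in the PR; `PoolShutdownSketch` and `runAndShutDown` are hypothetical names, and the pool size and timeout are arbitrary.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Generic shutdown pattern for a thread pool; not the fix from the PR. */
final class PoolShutdownSketch {

    /** Runs the task on a fresh pool, always shuts the pool down, and reports whether it terminated. */
    static boolean runAndShutDown(Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            pool.submit(task);
        } finally {
            // Without this call the worker threads stay alive and keep
            // everything they reference reachable.
            pool.shutdown();
        }
        try {
            // Wait until the worker threads have actually exited.
            return pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        boolean terminated = runAndShutDown(() -> System.out.println("task ran"));
        System.out.println("pool terminated: " + terminated);
    }
}
```

Once the pool has terminated, objects that were reachable only through its threads become eligible for collection; when a finalizer-based decrement then runs is still up to the GC.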

> UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint fails on 
> AZP because it cannot allocate enough network buffers
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25426
>                 URL: https://issues.apache.org/jira/browse/FLINK-25426
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Anton Kalashnikov
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.15.0
>
>
> The test 
> {{UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint}} fails 
> with
> {code}
> 2021-12-23T02:54:46.2862342Z Dec 23 02:54:46 [ERROR] 
> UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint  Time 
> elapsed: 2.992 s  <<< ERROR!
> 2021-12-23T02:54:46.2865774Z Dec 23 02:54:46 java.lang.OutOfMemoryError: 
> Could not allocate enough memory segments for NetworkBufferPool (required 
> (Mb): 64, allocated (Mb): 14, missing (Mb): 50). Cause: Direct buffer memory. 
> The direct out-of-memory error has occurred. This can mean two things: either 
> job(s) require(s) a larger size of JVM direct memory or there is a direct 
> memory leak. The direct memory can be allocated by user code or some of its 
> dependencies. In this case 'taskmanager.memory.task.off-heap.size' 
> configuration option should be increased. Flink framework and its 
> dependencies also consume the direct memory, mostly for network 
> communication. The most of network memory is managed by Flink and should not 
> result in out-of-memory error. In certain special cases, in particular for 
> jobs with high parallelism, the framework may require more direct memory 
> which is not managed by Flink. In this case 
> 'taskmanager.memory.framework.off-heap.size' configuration option should be 
> increased. If the error persists then there is probably a direct memory leak 
> in user code or some of its dependencies which has to be investigated and 
> fixed. The task executor has to be shutdown...
> 2021-12-23T02:54:46.2868239Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.<init>(NetworkBufferPool.java:138)
> 2021-12-23T02:54:46.2868975Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:140)
> 2021-12-23T02:54:46.2869771Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:94)
> 2021-12-23T02:54:46.2870550Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:79)
> 2021-12-23T02:54:46.2871312Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:58)
> 2021-12-23T02:54:46.2872062Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.taskexecutor.TaskManagerServices.createShuffleEnvironment(TaskManagerServices.java:414)
> 2021-12-23T02:54:46.2872767Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:282)
> 2021-12-23T02:54:46.2873436Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:523)
> 2021-12-23T02:54:46.2877615Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.minicluster.MiniCluster.startTaskManager(MiniCluster.java:645)
> 2021-12-23T02:54:46.2878247Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.minicluster.MiniCluster.startTaskManagers(MiniCluster.java:626)
> 2021-12-23T02:54:46.2878856Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:379)
> 2021-12-23T02:54:46.2879487Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.testutils.MiniClusterResource.startMiniCluster(MiniClusterResource.java:209)
> 2021-12-23T02:54:46.2880152Z Dec 23 02:54:46  at 
> org.apache.flink.runtime.testutils.MiniClusterResource.before(MiniClusterResource.java:95)
> 2021-12-23T02:54:46.2880821Z Dec 23 02:54:46  at 
> org.apache.flink.test.util.MiniClusterWithClientResource.before(MiniClusterWithClientResource.java:64)
> 2021-12-23T02:54:46.2881519Z Dec 23 02:54:46  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:151)
> 2021-12-23T02:54:46.2882310Z Dec 23 02:54:46  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:534)
> 2021-12-23T02:54:46.2882978Z Dec 23 02:54:46  at 
> jdk.internal.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
> 2021-12-23T02:54:46.2883574Z Dec 23 02:54:46  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-12-23T02:54:46.2884171Z Dec 23 02:54:46  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> 2021-12-23T02:54:46.2884732Z Dec 23 02:54:46  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> 2021-12-23T02:54:46.2885527Z Dec 23 02:54:46  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-12-23T02:54:46.2886135Z Dec 23 02:54:46  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> 2021-12-23T02:54:46.2886755Z Dec 23 02:54:46  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-12-23T02:54:46.2887387Z Dec 23 02:54:46  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-12-23T02:54:46.2887892Z Dec 23 02:54:46  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> 2021-12-23T02:54:46.2888435Z Dec 23 02:54:46  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> 2021-12-23T02:54:46.2889007Z Dec 23 02:54:46  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-12-23T02:54:46.2889568Z Dec 23 02:54:46  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> 2021-12-23T02:54:46.2890104Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 2021-12-23T02:54:46.2890686Z Dec 23 02:54:46  at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> 2021-12-23T02:54:46.2891259Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> 2021-12-23T02:54:46.2891819Z Dec 23 02:54:46  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> 2021-12-23T02:54:46.2892421Z Dec 23 02:54:46  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> 2021-12-23T02:54:46.2892978Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 2021-12-23T02:54:46.2893508Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 2021-12-23T02:54:46.2894049Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 2021-12-23T02:54:46.2894588Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 2021-12-23T02:54:46.2895203Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 2021-12-23T02:54:46.2895721Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 2021-12-23T02:54:46.2896304Z Dec 23 02:54:46  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-12-23T02:54:46.2896781Z Dec 23 02:54:46  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-12-23T02:54:46.2897359Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 2021-12-23T02:54:46.2897892Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 2021-12-23T02:54:46.2898429Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 2021-12-23T02:54:46.2898968Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 2021-12-23T02:54:46.2899487Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 2021-12-23T02:54:46.2900025Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 2021-12-23T02:54:46.2900542Z Dec 23 02:54:46  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 2021-12-23T02:54:46.2901044Z Dec 23 02:54:46  at 
> org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> 2021-12-23T02:54:46.2901540Z Dec 23 02:54:46  at 
> org.junit.runner.JUnitCore.run(JUnitCore.java:115)
> 2021-12-23T02:54:46.2902086Z Dec 23 02:54:46  at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
> 2021-12-23T02:54:46.2902702Z Dec 23 02:54:46  at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
> 2021-12-23T02:54:46.2903297Z Dec 23 02:54:46  at 
> org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
> 2021-12-23T02:54:46.2903944Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
> 2021-12-23T02:54:46.2904712Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
> 2021-12-23T02:54:46.2905493Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
> 2021-12-23T02:54:46.2906245Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
> 2021-12-23T02:54:46.2906968Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
> 2021-12-23T02:54:46.2907692Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
> 2021-12-23T02:54:46.2908303Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
> 2021-12-23T02:54:46.2908971Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
> 2021-12-23T02:54:46.2909664Z Dec 23 02:54:46  at 
> org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
> 2021-12-23T02:54:46.2910347Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:188)
> 2021-12-23T02:54:46.2911042Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
> 2021-12-23T02:54:46.2911743Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
> 2021-12-23T02:54:46.2912399Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
> 2021-12-23T02:54:46.2913009Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
> 2021-12-23T02:54:46.2913589Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
> 2021-12-23T02:54:46.2914162Z Dec 23 02:54:46  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28502&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=b78d9d30-509a-5cea-1fef-db7abaa325ae&l=14634
> Maybe the test instability is caused by exceeding our available memory on the 
> CI machines by running too many tests concurrently.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
