[ https://issues.apache.org/jira/browse/FLINK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695931#comment-17695931 ]
Roman Khachatryan edited comment on FLINK-31278 at 3/2/23 11:47 PM:
--------------------------------------------------------------------

Given that MemoryExecutionGraphInfoStoreTest was executed last, I'd suppose the problem is there (if not in the environment). Looking at its code, I see that it uses a single-threaded executor per test class, but creates a MemoryExecutionGraphInfoStore per test method (i.e. the executor is shared across the methods). So probably that executor got stuck in one test, which prevented the cleanup in subsequent tests (a minimal sketch of the pattern is at the end of this comment). But this is just a guess without the logs.

I'm thinking about making sure that all previous processes are stopped at the beginning of the Upload stage, so that the Upload doesn't get killed by the OOM killer (disabling forks completely might affect test run times). WDYT?

{quote}[~roman] [~chesnay] can you help with the guessing game based on what was added in 1.17? In the mean time, the only thing I can think of is disabling fork reuse and hoping that we get more insights with future failures.
{quote}

Could you elaborate on how disabling fork reuse would help? (See the Surefire snippet at the end of this comment for what I understand it to mean.)

I looked at the memory available at the [beginning|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=60]:
{code:java}
Mar 01 05:19:57 Memory information
Mar 01 05:19:57 MemTotal:       7110656 kB
Mar 01 05:19:57 MemFree:         401188 kB
Mar 01 05:19:57 MemAvailable:   6089948 kB
{code}
It doesn't look any smaller than in [successful|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46290&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=60] runs:
{code:java}
Feb 19 03:54:33 MemTotal:       7110656 kB
Feb 19 03:54:33 MemFree:         346696 kB
Feb 19 03:54:33 MemAvailable:   6094648 kB
{code}
So the environment (memory) was fine, at least at the beginning.
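For reference, here is a minimal sketch of the shared-executor pattern I mean (this is not the actual MemoryExecutionGraphInfoStoreTest code; the InfoStore class and the test/job names are made up for illustration). One single-threaded executor is shared by the whole test class while each test method creates its own store on top of it, so a task left stuck on that one thread blocks the cleanup of every following test:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.Test;

class SharedExecutorPatternTest {

    // One executor for the whole test class...
    private static final ExecutorService sharedExecutor = Executors.newSingleThreadExecutor();

    @AfterAll
    static void shutdownExecutor() {
        sharedExecutor.shutdownNow();
    }

    @Test
    void firstTest() throws Exception {
        // ...but a fresh store per test method, all funneling work through the same single thread.
        InfoStore store = new InfoStore(sharedExecutor);
        store.put("job-1");
        // If a task queued by an earlier test blocks the single thread, this cleanup
        // (and every later test's cleanup) waits behind it.
        store.close();
    }

    @Test
    void secondTest() throws Exception {
        InfoStore store = new InfoStore(sharedExecutor);
        store.put("job-2");
        store.close();
    }

    /** Hypothetical stand-in for MemoryExecutionGraphInfoStore: does its work and cleanup on the shared executor. */
    static final class InfoStore {
        private final ExecutorService executor;

        InfoStore(ExecutorService executor) {
            this.executor = executor;
        }

        void put(String jobId) {
            executor.execute(() -> { /* store the graph info for jobId */ });
        }

        void close() throws Exception {
            // Cleanup is scheduled on the shared single thread and waits for it to complete.
            executor.submit(() -> { /* release resources */ }).get(10, TimeUnit.SECONDS);
        }
    }
}
{code}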
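And on the fork-reuse question, this is roughly what I understand "disabling fork reuse" to mean, as a Surefire configuration sketch rather than Flink's actual pom: each test class would then get a fresh forked JVM instead of reusing one across classes, which would isolate a stuck executor but likely make the runs slower:

{code:xml}
<!-- Sketch only (not Flink's actual pom): with fork reuse disabled, Surefire discards
     each forked JVM after one test class instead of reusing it for the next class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
{code}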
> exit code 137 (i.e. OutOfMemoryError) in core module
> ----------------------------------------------------
>
>                 Key: FLINK-31278
>                 URL: https://issues.apache.org/jira/browse/FLINK-31278
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>
> The following build failed due to a 137 exit code indicating an OutOfMemoryError:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847
> {code}
> [...]
> Mar 01 05:29:06 [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.65 s - in org.apache.flink.runtime.io.compression.BlockCompressionTest
> Mar 01 05:29:06 [INFO] Running org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:07 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.142 s - in org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:08 [INFO] Running org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', arguments 'exec -i -u 1001 -w /home/vsts_azpcontainer 5953b171e8ed4caba7af2b326533e249211ed4dcc48640edb3c1b0cbbcdf1a21 /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - core
> {code}
> This build ran on an Azure pipeline machine (Azure Pipelines 9) and, therefore, cannot be caused by FLINK-18356. That said, there was a concurrent 137 exit code build failure happening on agent "Azure Pipelines 21" (see [20230301.3|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847]) ~10mins later