[ https://issues.apache.org/jira/browse/FLINK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694943#comment-17694943 ]
Matthias Pohl edited comment on FLINK-31278 at 3/1/23 9:18 AM:
---------------------------------------------------------------

There is no heap dump provided due to a failure in the upload step. I extracted the tests that were running while the error happened based on the Maven output:
{code}
$ grep -e " Tests run: " -e "\[INFO\] Running" 20230301.3.txt | grep -o "org.apache.flink.[a-zA-Z\.]*" | sort | uniq -c | sort -n | head -5
      1 org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
      1 org.apache.flink.runtime.io.disk.ChannelViewsTest
      1 org.apache.flink.runtime.io.disk.FileChannelManagerImplTest
      1 org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannelTest
      2 org.apache.flink.api.common.accumulators.AverageAccumulatorTest
{code}
That is not necessarily an indication of the cause, though. {{ChannelViewsTest}} runs a bit longer than the rest before the error occurs:
{code}
2023-03-01T05:28:56.0284123Z Mar 01 05:28:56 [INFO] Running org.apache.flink.runtime.io.disk.ChannelViewsTest
2023-03-01T05:29:03.2024639Z Mar 01 05:29:03 [INFO] Running org.apache.flink.runtime.io.disk.FileChannelManagerImplTest
2023-03-01T05:29:03.8510602Z Mar 01 05:29:03 [INFO] Running org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannelTest
2023-03-01T05:29:20.9205409Z Mar 01 05:29:08 [INFO] Running org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
{code}
...but {{ChannelViewsTest}} seems to take longer in general (e.g. build [20230301.4|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46644&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=7194] lists the test with a 36s runtime).


was (Author: mapohl):
There is no heap dump provided due to a failure in the upload step.
I extracted the tests that were running while the error happened based on the Maven output:
{code}
$ grep -e " Tests run: " -e "\[INFO\] Running" 20230301.3.txt | grep -o "org.apache.flink.[a-zA-Z\.]*" | sort | uniq -c | sort -n | head -5
      1 org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
      1 org.apache.flink.runtime.io.disk.ChannelViewsTest
      1 org.apache.flink.runtime.io.disk.FileChannelManagerImplTest
      1 org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannelTest
      2 org.apache.flink.api.common.accumulators.AverageAccumulatorTest
{code}
That is not necessarily an indication of the cause, though. {{ChannelViewsTest}} runs a bit longer than the rest before the error occurs:
{code}
2023-03-01T05:28:56.0284123Z Mar 01 05:28:56 [INFO] Running org.apache.flink.runtime.io.disk.ChannelViewsTest
2023-03-01T05:29:03.2024639Z Mar 01 05:29:03 [INFO] Running org.apache.flink.runtime.io.disk.FileChannelManagerImplTest
2023-03-01T05:29:03.8510602Z Mar 01 05:29:03 [INFO] Running org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannelTest
2023-03-01T05:29:20.9205409Z Mar 01 05:29:08 [INFO] Running org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
{code}


> exit code 137 (i.e. OutOfMemoryError) in core module
> ----------------------------------------------------
>
> Key: FLINK-31278
> URL: https://issues.apache.org/jira/browse/FLINK-31278
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Priority: Blocker
> Labels: test-stability
>
> The following build failed due to a 137 exit code indicating an
> OutOfMemoryError:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847
> {code}
> [...]
> Mar 01 05:29:06 [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.65 s - in org.apache.flink.runtime.io.compression.BlockCompressionTest
> Mar 01 05:29:06 [INFO] Running org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:07 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.142 s - in org.apache.flink.runtime.dispatcher.DispatcherCachedOperationsHandlerTest
> Mar 01 05:29:08 [INFO] Running org.apache.flink.runtime.dispatcher.MemoryExecutionGraphInfoStoreTest
> ##[error]Exit code 137 returned from process: file name '/usr/bin/docker', arguments 'exec -i -u 1001 -w /home/vsts_azpcontainer 5953b171e8ed4caba7af2b326533e249211ed4dcc48640edb3c1b0cbbcdf1a21 /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - core
> {code}
> This build ran on an Azure pipeline machine (Azure Pipelines 9) and,
> therefore, cannot be caused by FLINK-18356. That said, there was a concurrent
> 137 exit code build failure happening on agent "Azure Pipelines 21" (see
> [20230301.3|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46643&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7847])
> ~10mins later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
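
A note on the extraction step in the comment above: the grep/uniq pipeline infers "still running" from a class name that appears only once in the log. A more direct variant might look like the sketch below. It assumes that every started class logs an "[INFO] Running <class>" line and every finished class logs a "Tests run: ... in <class>" line, with the class name as the last field, as in the excerpts above. The function name unfinished_tests is hypothetical and not part of the Flink build tooling.

```shell
#!/bin/sh
# Sketch: print test classes that logged "[INFO] Running <class>" but never
# a matching "Tests run: ... in <class>" result line in a Maven log.
# Assumption: the class name is the last whitespace-separated field of both
# line types. unfinished_tests is a hypothetical helper name.
unfinished_tests() {
  awk '
    # A class was started: remember it.
    /\[INFO\] Running /  { started[$NF] = 1 }
    # A class reported its result: it is no longer pending.
    /Tests run: .* in /  { delete started[$NF] }
    # Whatever is left was started but never finished.
    END { for (t in started) print t }
  ' "$1" | sort
}
```

Usage would be `unfinished_tests 20230301.3.txt`. Like the original pipeline, this only narrows down which tests were in flight when the process was killed; it does not identify the cause.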