[ https://issues.apache.org/jira/browse/FLINK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
EMERSON WANG updated FLINK-36182: --------------------------------- Description: PyFlink SQL job was running in the AWS EKS cluster. When the task manager pods were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory throwing "Caused by: java.lang.IllegalStateException: Reference count must not be negative.". Since refCount should be always >= 0, it should never happen. Please look into what root cause was. One task manager log is as follows: { "message": "2024-08-29 17:28:11,630 WARN org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - Expiring environment urn: \"beam:env:process:v1\"", "time": "2024-08-29T17:28:11+00:00" } \{ "message": " with 6 remaining bundle references. Taking note to clean it up during shutdown if the references are not removed by then.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "2024-08-29 17:28:11,635 WARN org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "2024-08-29 17:28:11,634 WARN org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> Calc[35] -> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] -> StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 (ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception while processing timer.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: java.lang.RuntimeException: Failed to close remote bundle", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: java.lang.IllegalStateException: Reference count must not be negative.", "time": "2024-08-29T17:28:11+00:00" }{ "message": "2024-08-29 17:28:11,636 ERROR org.apache.beam.runners.fnexecution.control.FnApiControlClient [] - FnApiControlClient closed, clearing outstanding requests {12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 1 dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not completed, 1 dependents]} ", "time": "2024-08-29T17:28:11+00:00" }\{ "message": "2024-08-29 17:28:11,637 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 (ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } was: PyFlink SQL job was running in the AWS EKS cluster. When the task manager pods were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory throwing "Caused by: java.lang.IllegalStateException: Reference count must not be negative.". Since refCount should be always >= 0, it should never happen. Please look into what root cause was. One task manager log is as follows: { "message": "2024-08-29 17:28:11,630 WARN org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - Expiring environment urn: \"beam:env:process:v1\"", "time": "2024-08-29T17:28:11+00:00" } { "message": " with 6 remaining bundle references. Taking note to clean it up during shutdown if the references are not removed by then.", "time": "2024-08-29T17:28:11+00:00" } { "message": "2024-08-29 17:28:11,635 WARN org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" } { "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } { "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" } { "message": "2024-08-29 17:28:11,634 WARN org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> Calc[35] -> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] -> StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 (ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" } { "message": "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception while processing timer.", "time": "2024-08-29T17:28:11+00:00" } { "message": "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" } { "message": "Caused by: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" } { "message": "Caused by: java.lang.RuntimeException: Failed to close remote bundle", "time": "2024-08-29T17:28:11+00:00" } { "message": "Caused by: java.lang.IllegalStateException: Reference count must not be negative.", "time": "2024-08-29T17:28:11+00:00" } { "message": "2024-08-29 17:28:11,636 ERROR org.apache.beam.runners.fnexecution.control.FnApiControlClient [] - FnApiControlClient closed, clearing outstanding requests {12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 1 dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not completed, 1 dependents]}", "time": "2024-08-29T17:28:11+00:00" } { "message": "2024-08-29 17:28:11,637 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 (ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" } { "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } > PyFlink SQL Job Got IllegalStateException During Task Manager Shutdown > ---------------------------------------------------------------------- > > Key: FLINK-36182 > URL: https://issues.apache.org/jira/browse/FLINK-36182 > Project: Flink > Issue Type: Improvement > Components: API / Python > Affects Versions: 1.18.1 > Environment: EKS prod cluster > Reporter: EMERSON WANG > Priority: Major > > PyFlink SQL job was running in the AWS EKS cluster. When the task manager > pods were scaled down, Preconditions.checkState in the class > DefaultJobBundleFactory throwing "Caused by: java.lang.IllegalStateException: > Reference count must not be negative.". Since refCount should be always >= 0, > it should never happen. Please look into what root cause was. One task > manager log is as follows: > { "message": "2024-08-29 17:28:11,630 WARN > org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - > Expiring environment urn: \"beam:env:process:v1\"", "time": > "2024-08-29T17:28:11+00:00" } \{ "message": " with 6 remaining bundle > references. Taking note to clean it up during shutdown if the references are > not removed by then.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": > "2024-08-29 17:28:11,635 WARN > org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for > unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": > "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible > for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } > \{ "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The > TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" } \{ > "message": "2024-08-29 17:28:11,634 WARN > org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> > Calc[35] -> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] > -> StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 > (ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) > switched from RUNNING to FAILED with failure cause:", "time": > "2024-08-29T17:28:11+00:00" } \{ "message": > "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception > while processing timer.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": > "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: > java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner > flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: > java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner > flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: > java.lang.RuntimeException: Failed to close remote bundle", "time": > "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: > java.lang.IllegalStateException: Reference count must not be negative.", > "time": "2024-08-29T17:28:11+00:00" }{ > "message": "2024-08-29 17:28:11,636 ERROR > org.apache.beam.runners.fnexecution.control.FnApiControlClient [] - > FnApiControlClient closed, clearing outstanding requests > {12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 > dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, > 1 dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not > completed, 1 dependents], > 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 1 > dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, > 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not > completed, 1 dependents]} > ", > "time": "2024-08-29T17:28:11+00:00" > }\{ "message": "2024-08-29 17:28:11,637 WARN > org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> > Calc[8] -> LocalWindowAggregate[9] (1/4)#14 > (ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) > switched from RUNNING to FAILED with failure cause:", "time": > "2024-08-29T17:28:11+00:00" } \{ "message": > "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible > for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } -- This message was sent by Atlassian Jira (v8.20.10#820010)