EMERSON WANG created FLINK-36182:
------------------------------------

             Summary: PyFlink SQL Job Got IllegalStateException During Task Manager Shutdown
                 Key: FLINK-36182
                 URL: https://issues.apache.org/jira/browse/FLINK-36182
             Project: Flink
          Issue Type: Improvement
          Components: API / Python
    Affects Versions: 1.18.1
         Environment: EKS prod cluster
            Reporter: EMERSON WANG

A PyFlink SQL job was running in an AWS EKS cluster. When the task manager pods were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory threw "java.lang.IllegalStateException: Reference count must not be negative.". Since refCount should always be >= 0, this should never happen. Please look into what the root cause was.

One task manager log is as follows:

{ "message": "2024-08-29 17:28:11,630 WARN org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - Expiring environment urn: \"beam:env:process:v1\"", "time": "2024-08-29T17:28:11+00:00" }
{ "message": " with 6 remaining bundle references. Taking note to clean it up during shutdown if the references are not removed by then.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "2024-08-29 17:28:11,635 WARN org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "2024-08-29 17:28:11,634 WARN org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> Calc[35] -> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] -> StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 (ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception while processing timer.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "Caused by: java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner flush", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "Caused by: java.lang.RuntimeException: Failed to close remote bundle", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "Caused by: java.lang.IllegalStateException: Reference count must not be negative.", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "2024-08-29 17:28:11,636 ERROR org.apache.beam.runners.fnexecution.control.FnApiControlClient [] - FnApiControlClient closed, clearing outstanding requests {12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 1 dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not completed, 1 dependents]}", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "2024-08-29 17:28:11,637 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 (ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) switched from RUNNING to FAILED with failure cause:", "time": "2024-08-29T17:28:11+00:00" }
{ "message": "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" }

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
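
For readers unfamiliar with the invariant named in the description, the following is a minimal sketch of how a reference-count check of this kind can trip. It is not Beam's DefaultJobBundleFactory code and it is not a confirmed root cause: the class RefCountedEnvironment, its methods, and the duplicate-release sequence are all hypothetical, chosen only to show that a single extra release after the count has reached zero is enough to produce the message seen in the log.

```python
import threading


class RefCountedEnvironment:
    """Hypothetical stand-in for a reference-counted resource (not Beam's class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ref_count = 1  # assumption: the creator (e.g. a cache) holds one reference
        self.closed = False

    def ref(self):
        """Take one more reference, e.g. when a bundle starts using the resource."""
        with self._lock:
            self._ref_count += 1
            return self._ref_count

    def unref(self):
        """Release one reference; close the resource when the count reaches zero."""
        with self._lock:
            self._ref_count -= 1
            count = self._ref_count
        # Invariant from the report: the count must never drop below zero.
        if count < 0:
            raise RuntimeError("Reference count must not be negative.")
        if count == 0:
            self.closed = True
        return count


env = RefCountedEnvironment()
env.ref()    # a bundle acquires the environment
env.unref()  # the bundle closes normally
env.unref()  # the owner releases its reference; the environment is closed
try:
    env.unref()  # hypothetical duplicate release, e.g. from a second cleanup path
except RuntimeError as exc:
    print(exc)
```

In this sketch the failure needs only two code paths that each believe they own the last reference. Whether something similar happens between environment expiry and bundle close during task manager shutdown is exactly what the report asks to be investigated.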