EMERSON WANG created FLINK-36182:
------------------------------------

             Summary: PyFlink SQL Job Got IllegalStateException During Task Manager Shutdown
                 Key: FLINK-36182
                 URL: https://issues.apache.org/jira/browse/FLINK-36182
             Project: Flink
          Issue Type: Improvement
          Components: API / Python
    Affects Versions: 1.18.1
         Environment: EKS prod cluster
            Reporter: EMERSON WANG


A PyFlink SQL job was running in an AWS EKS cluster. When the task manager pods 
were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory 
threw "Caused by: java.lang.IllegalStateException: Reference count must not 
be negative.". Since refCount should always be >= 0, this should never happen. 
Please look into the root cause. One task manager log is as follows:
{
"message": "2024-08-29 17:28:11,630 WARN 
org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] -
Expiring environment urn: \"beam:env:process:v1\"",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": " with 6 remaining bundle references. Taking note to clean it up 
during shutdown if the
references are not removed by then.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,635 WARN 
org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] -
Hanged up for unknown endpoint.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.util.FlinkException: Disconnect from JobManager 
responsible for
d1c6852e22e8553ea2e13e19b5c60954.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: org.apache.flink.util.FlinkExpectedException: The 
TaskExecutor is shutting down.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,634 WARN 
org.apache.flink.runtime.taskmanager.Task [] -
WindowAggregate[34] -> Calc[35] -> (PythonCalc[36] -> Calc[37] -> 
StreamRecordTimestampInserter[38] -> StreamingFileWriter, PythonCalc[88] -> 
Calc[89]) (1/6)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14)
switched from RUNNING to FAILED with failure cause:",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught 
exception while processing timer.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: 
java.lang.RuntimeException:
Error while waiting for BeamPythonFunctionRunner flush",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.RuntimeException: Error while waiting for 
BeamPythonFunctionRunner flush",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.RuntimeException: Failed to close remote 
bundle",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.IllegalStateException: Reference count must 
not be negative.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,636 ERROR 
org.apache.beam.runners.fnexecution.control.FnApiControlClient [] -
FnApiControlClient closed, clearing outstanding requests 
{12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 dependents],
15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 dependents], 
18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 1 
dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 
1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 
1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not completed, 
1 dependents]}",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,637 WARN 
org.apache.flink.runtime.taskmanager.Task [] -
Source: utlb_res_vasn[7] -> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) 
switched from RUNNING to FAILED with failure cause:",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.util.FlinkException: Disconnect from JobManager 
responsible for d1c6852e22e8553ea2e13e19b5c60954.",
"time": "2024-08-29T17:28:11+00:00"
}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
