[ 
https://issues.apache.org/jira/browse/FLINK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EMERSON WANG updated FLINK-36182:
---------------------------------
    Description: 
PyFlink SQL job was running in the AWS EKS cluster. When the task manager pods 
were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory 
throwing "Caused by: java.lang.IllegalStateException: Reference count must not 
be negative.". Since refCount should be always >= 0, it should never happen. 
Please look into what root cause was. One task manager log is as follows:

{ "message": "2024-08-29 17:28:11,630 WARN 
org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - 
Expiring environment urn: \"beam:env:process:v1\"", "time": 
"2024-08-29T17:28:11+00:00" } \{ "message": " with 6 remaining bundle 
references. Taking note to clean it up during shutdown if the references are 
not removed by then.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
"2024-08-29 17:28:11,635 WARN 
org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for 
unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
"org.apache.flink.util.FlinkException: Disconnect from JobManager responsible 
for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } 
\{ "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The 
TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" } \{ 
"message": "2024-08-29 17:28:11,634 WARN 
org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> Calc[35] 
-> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] -> 
StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) 
switched from RUNNING to FAILED with failure cause:", "time": 
"2024-08-29T17:28:11+00:00" } \{ "message": 
"org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception 
while processing timer.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
"Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: 
java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner 
flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner 
flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
java.lang.RuntimeException: Failed to close remote bundle", "time": 
"2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
java.lang.IllegalStateException: Reference count must not be negative.", 
"time": "2024-08-29T17:28:11+00:00" }{
"message": "2024-08-29 17:28:11,636 ERROR 
org.apache.beam.runners.fnexecution.control.FnApiControlClient [] -
FnApiControlClient closed, clearing outstanding requests

{12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 
dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 
dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 
1 dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not 
completed, 1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not 
completed, 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not 
completed, 1 dependents]}

",
"time": "2024-08-29T17:28:11+00:00"
}\{ "message": "2024-08-29 17:28:11,637 WARN 
org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> 
Calc[8] -> LocalWindowAggregate[9] (1/4)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) 
switched from RUNNING to FAILED with failure cause:", "time": 
"2024-08-29T17:28:11+00:00" } \{ "message": 
"org.apache.flink.util.FlinkException: Disconnect from JobManager responsible 
for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" }

  was:
PyFlink SQL job was running in the AWS EKS cluster. When the task manager pods 
were scaled down, Preconditions.checkState in the class DefaultJobBundleFactory 
throwing "Caused by: java.lang.IllegalStateException: Reference count must not 
be negative.". Since refCount should be always >= 0, it should never happen. 
Please look into what root cause was. One task manager log is as follows:
{
"message": "2024-08-29 17:28:11,630 WARN 
org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] -
Expiring environment urn: \"beam:env:process:v1\"",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": " with 6 remaining bundle references. Taking note to clean it up 
during shutdown if the
references are not removed by then.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,635 WARN 
org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] -
Hanged up for unknown endpoint.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.util.FlinkException: Disconnect from JobManager 
responsible for
d1c6852e22e8553ea2e13e19b5c60954.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: org.apache.flink.util.FlinkExpectedException: The 
TaskExecutor is shutting down.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,634 WARN 
org.apache.flink.runtime.taskmanager.Task [] -
WindowAggregate[34] -> Calc[35] -> (PythonCalc[36] -> Calc[37] -> 
StreamRecordTimestampInserter[38] -> StreamingFileWriter, PythonCalc[88] -> 
Calc[89]) (1/6)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14)
switched from RUNNING to FAILED with failure cause:",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught 
exception while processing timer.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: 
java.lang.RuntimeException:
Error while waiting for BeamPythonFunctionRunner flush",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.RuntimeException: Error while waiting for 
BeamPythonFunctionRunner flush",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.RuntimeException: Failed to close remote 
bundle",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "Caused by: java.lang.IllegalStateException: Reference count must 
not be negative.",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,636 ERROR 
org.apache.beam.runners.fnexecution.control.FnApiControlClient [] -
FnApiControlClient closed, clearing outstanding requests 
{12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 dependents],
15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 1 dependents], 
18=java.util.concurrent.CompletableFuture@46a954c2[Not completed, 1 
dependents], 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 
1 dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 
1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not completed, 
1 dependents]}",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "2024-08-29 17:28:11,637 WARN 
org.apache.flink.runtime.taskmanager.Task [] -
Source: utlb_res_vasn[7] -> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 
(ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) 
switched from RUNNING to FAILED with failure cause:",
"time": "2024-08-29T17:28:11+00:00"
}
{
"message": "org.apache.flink.util.FlinkException: Disconnect from JobManager 
responsible for d1c6852e22e8553ea2e13e19b5c60954.",
"time": "2024-08-29T17:28:11+00:00"
}


> PyFlink SQL Job Got IllegalStateException During Task Manager Shutdown
> ----------------------------------------------------------------------
>
>                 Key: FLINK-36182
>                 URL: https://issues.apache.org/jira/browse/FLINK-36182
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Python
>    Affects Versions: 1.18.1
>         Environment: EKS prod cluster
>            Reporter: EMERSON WANG
>            Priority: Major
>
> PyFlink SQL job was running in the AWS EKS cluster. When the task manager 
> pods were scaled down, Preconditions.checkState in the class 
> DefaultJobBundleFactory throwing "Caused by: java.lang.IllegalStateException: 
> Reference count must not be negative.". Since refCount should be always >= 0, 
> it should never happen. Please look into what root cause was. One task 
> manager log is as follows:
> { "message": "2024-08-29 17:28:11,630 WARN 
> org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory [] - 
> Expiring environment urn: \"beam:env:process:v1\"", "time": 
> "2024-08-29T17:28:11+00:00" } \{ "message": " with 6 remaining bundle 
> references. Taking note to clean it up during shutdown if the references are 
> not removed by then.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
> "2024-08-29 17:28:11,635 WARN 
> org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer [] - Hanged up for 
> unknown endpoint.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
> "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible 
> for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" } 
> \{ "message": "Caused by: org.apache.flink.util.FlinkExpectedException: The 
> TaskExecutor is shutting down.", "time": "2024-08-29T17:28:11+00:00" } \{ 
> "message": "2024-08-29 17:28:11,634 WARN 
> org.apache.flink.runtime.taskmanager.Task [] - WindowAggregate[34] -> 
> Calc[35] -> (PythonCalc[36] -> Calc[37] -> StreamRecordTimestampInserter[38] 
> -> StreamingFileWriter, PythonCalc[88] -> Calc[89]) (1/6)#14 
> (ee3e0e638b2bc15ec5ad42a94435f170_7145381c4bbb09912ff683559937e2f3_0_14) 
> switched from RUNNING to FAILED with failure cause:", "time": 
> "2024-08-29T17:28:11+00:00" } \{ "message": 
> "org.apache.flink.runtime.taskmanager.AsynchronousException: Caught exception 
> while processing timer.", "time": "2024-08-29T17:28:11+00:00" } \{ "message": 
> "Caused by: org.apache.flink.streaming.runtime.tasks.TimerException: 
> java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner 
> flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
> java.lang.RuntimeException: Error while waiting for BeamPythonFunctionRunner 
> flush", "time": "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
> java.lang.RuntimeException: Failed to close remote bundle", "time": 
> "2024-08-29T17:28:11+00:00" } \{ "message": "Caused by: 
> java.lang.IllegalStateException: Reference count must not be negative.", 
> "time": "2024-08-29T17:28:11+00:00" }{
> "message": "2024-08-29 17:28:11,636 ERROR 
> org.apache.beam.runners.fnexecution.control.FnApiControlClient [] -
> FnApiControlClient closed, clearing outstanding requests
> {12=java.util.concurrent.CompletableFuture@7945ef9[Not completed, 1 
> dependents], 15=java.util.concurrent.CompletableFuture@3abdad3[Not completed, 
> 1 dependents], 18=java.util.concurrent.CompletableFuture@46a954c2[Not 
> completed, 1 dependents], 
> 19=java.util.concurrent.CompletableFuture@52632b1d[Not completed, 1 
> dependents], 5=java.util.concurrent.CompletableFuture@77a5d43e[Not completed, 
> 1 dependents], 6=java.util.concurrent.CompletableFuture@2a85f55b[Not 
> completed, 1 dependents]}
> ",
> "time": "2024-08-29T17:28:11+00:00"
> }\{ "message": "2024-08-29 17:28:11,637 WARN 
> org.apache.flink.runtime.taskmanager.Task [] - Source: utlb_res_vasn[7] -> 
> Calc[8] -> LocalWindowAggregate[9] (1/4)#14 
> (ee3e0e638b2bc15ec5ad42a94435f170_6cdc5bb954874d922eaee11a8e7b5dd5_0_14) 
> switched from RUNNING to FAILED with failure cause:", "time": 
> "2024-08-29T17:28:11+00:00" } \{ "message": 
> "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible 
> for d1c6852e22e8553ea2e13e19b5c60954.", "time": "2024-08-29T17:28:11+00:00" }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to