[ 
https://issues.apache.org/jira/browse/FLINK-37639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selman Kayrancioglu updated FLINK-37639:
----------------------------------------
    Description: 
KeyedStateBootstrapFunction hangs if an exception is thrown within

When an exception occurs within 
`KeyedStateBootstrapFunction<>.processElement`, the job fails to terminate 
properly. Instead of failing with an error, the operator appears to hang 
indefinitely when viewed from the UI or monitoring tools.

I've created a minimal reproducer demonstrating this issue at: 
[https://github.com/seruman/flink-bootstrap-function-hangs-on-exception-reproducer]

This behavior is consistent across multiple environments - I've confirmed it 
occurs when:
 - Running locally with `start-cluster.sh` and `flink run ...`
 - Deploying to Kubernetes with flink-kubernetes-operator
 - Executing unit tests with `MiniCluster` (as shown in the repository)

I would expect it to fail with the appropriate exception rather than 
becoming unresponsive.
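
As a stopgap in tests, a hang like this can be turned into a visible failure by bounding how long the job is allowed to run. The following is only a minimal sketch using `java.util.concurrent`; `HangGuard` and `runWithTimeout` are hypothetical names and are not part of Flink or of the reproducer repository:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Flink API): bounds the wait on a job so that
// a hang is reported as a timeout instead of blocking the test forever.
final class HangGuard {

    // Returns "OK", "FAILED: <message of the thrown exception>", or "TIMED_OUT".
    static String runWithTimeout(Callable<Void> job, long timeoutMillis)
            throws InterruptedException {
        // Daemon thread, so a stuck job cannot keep the JVM alive afterwards.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "job-under-test");
            t.setDaemon(true);
            return t;
        });
        Future<Void> future = executor.submit(job);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "OK";
        } catch (ExecutionException e) {
            // The job threw: this is the behaviour expected from the bootstrap job.
            return "FAILED: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            // The job neither finished nor failed within the limit.
            future.cancel(true);
            return "TIMED_OUT";
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a `MiniCluster` test the `Callable` would presumably wrap the call that executes the bootstrap job; with the behaviour described here I would expect such a guard to report `TIMED_OUT` where `FAILED: ...` is the desired outcome.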

Notably, there were no error logs generated in either the JobManager or 
TaskManager.

So far I've reproduced this on versions 1.19.1, 1.19.2, and 1.20.0.

Sample config:
```
pipeline.max-parallelism=10
parallelism.default=2
execution.runtime-mode=BATCH
execution.batch-shuffle-mode=ALL_EXCHANGES_PIPELINED
jobmanager.scheduler=Default
```

Please let me know if you need any additional information.

  was:
KeyedStateBootstrapFunction hangs if an exception is thrown within

When an exception occurs within 
`KeyedStateBootstrapFunction<>.processElement`, the job fails to terminate 
properly. Instead of failing with an error, the operator appears to hang 
indefinitely when viewed from the UI or monitoring tools.

I've created a minimal reproducer demonstrating this issue at: 
https://github.com/seruman/flink-bootstrap-function-hangs-on-exception-reproducer

This behavior is consistent across multiple environments - I've confirmed it 
occurs when:
- Running locally with `start-cluster.sh`
- Deploying to Kubernetes with flink-kubernetes-operator
- Executing unit tests with `MiniCluster` (as shown in the repository)

I would expect it to fail with the appropriate exception rather than 
becoming unresponsive.

Notably, there were no error logs generated in either the JobManager or 
TaskManager.

So far I've tried versions 1.19.1, 1.19.2, 1.20.0.

Sample config:
```
pipeline.max-parallelism=10
parallelism.default=2
execution.runtime-mode=BATCH
execution.batch-shuffle-mode=ALL_EXCHANGES_PIPELINED
jobmanager.scheduler=Default
```

Please let me know if you need any additional information.


> KeyedStateBootstrapFunction hangs if an exception is thrown within
> ------------------------------------------------------------------
>
>                 Key: FLINK-37639
>                 URL: https://issues.apache.org/jira/browse/FLINK-37639
>             Project: Flink
>          Issue Type: Bug
>          Components: API / State Processor
>    Affects Versions: 1.20.0, 1.19.1, 1.19.2
>            Reporter: Selman Kayrancioglu
>            Priority: Major
>
> KeyedStateBootstrapFunction hangs if an exception is thrown within
> When an exception occurs within 
> `KeyedStateBootstrapFunction<>.processElement`, the job fails to terminate 
> properly. Instead of failing with an error, the operator appears to hang 
> indefinitely when viewed from the UI or monitoring tools.
> I've created a minimal reproducer demonstrating this issue at: 
> [https://github.com/seruman/flink-bootstrap-function-hangs-on-exception-reproducer]
> This behavior is consistent across multiple environments - I've confirmed it 
> occurs when:
>  - Running locally with `start-cluster.sh` and `flink run ...`
>  - Deploying to Kubernetes with flink-kubernetes-operator
>  - Executing unit tests with `MiniCluster` (as shown in the repository)
> I would expect it to fail with the appropriate exception rather than 
> becoming unresponsive.
> Notably, there were no error logs generated in either the JobManager or 
> TaskManager.
> So far I've reproduced this on versions 1.19.1, 1.19.2, and 1.20.0.
> Sample config:
> ```
> pipeline.max-parallelism=10
> parallelism.default=2
> execution.runtime-mode=BATCH
> execution.batch-shuffle-mode=ALL_EXCHANGES_PIPELINED
> jobmanager.scheduler=Default
> ```
> Please let me know if you need any additional information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
