[ 
https://issues.apache.org/jira/browse/FLINK-39692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Zeger updated FLINK-39692:
--------------------------------
    Description: 
*Where*

One site where `catch (Exception e)` is followed by a `LOG.warn` / `LOG.error` 
that does not pass `e` as the last argument, so the stack trace is silently 
discarded:

`flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/snapshot/StateSnapshotReconciler.java` line 173 - failure to dispose savepoint.

When something abnormal happens - a savepoint can't be disposed, scaling data 
can't be decompressed, a memory config is rejected - operators want to see the 
*root cause*, not just "something failed."

Today, when an operator team gets paged for one of these warnings, they have to:

1. Read the code to figure out what `Exception` types could have been thrown.
2. Re-derive (often incorrectly) what the underlying problem might be.
3. Hope it reproduces while they have a debugger attached.

Passing the exception costs nothing and saves debugging time.

*Proposed fix*

Mechanical, one-line change at each site:

```diff
-LOG.warn("Error while decompressing scaling data, treating as uncompressed");
+LOG.warn("Error while decompressing scaling data, treating as uncompressed", e);
```
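
To make the effect concrete, here is a minimal, self-contained sketch of the difference between the two calls. It is an illustration under stated assumptions, not the operator's code: it uses `java.util.logging` from the JDK as a stand-in for the operator's `LOG` object so that it runs without extra dependencies, and `ExceptionLoggingSketch`, `CapturingHandler`, `decompress()`, and `logFailure()` are hypothetical names invented for the example. Only the log message is taken from the diff above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class ExceptionLoggingSketch {

    /** Collects log records in memory so we can inspect what the logger received. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Hypothetical stand-in for an operation that fails inside the try block. */
    static void decompress() throws Exception {
        throw new IllegalStateException("corrupt header");
    }

    /** Runs the failing operation and returns the single record that was logged. */
    static LogRecord logFailure(boolean passException) {
        Logger log = Logger.getLogger("sketch-" + passException);
        log.setUseParentHandlers(false); // keep console output quiet for the demo
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);
        try {
            decompress();
        } catch (Exception e) {
            String msg = "Error while decompressing scaling data, treating as uncompressed";
            if (passException) {
                log.log(Level.WARNING, msg, e); // the cause travels with the record
            } else {
                log.log(Level.WARNING, msg);    // the cause is dropped here
            }
        } finally {
            log.removeHandler(handler);
        }
        return handler.records.get(0);
    }

    public static void main(String[] args) {
        System.out.println("without e: " + logFailure(false).getThrown());
        System.out.println("with e:    " + logFailure(true).getThrown());
    }
}
```

In this sketch the record logged without `e` carries no throwable at all, while the record logged with `e` carries the `IllegalStateException` and its stack trace, which is the information a paged operator team would otherwise have to reconstruct by hand.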

  was:
*Where*

Five sites where `catch (Exception e)` is followed by a `LOG.warn` / 
`LOG.error` that does not pass `e` as the last argument, so the stack trace is 
silently discarded:

1. `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/snapshot/StateSnapshotReconciler.java` line 173 - failure to dispose savepoint.
2. `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/state/KubernetesAutoScalerStateStore.java` line 393 - failure to decompress scaling data.
3. `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/IngressUtils.java` line 369 - failure to parse Kubernetes server version.
4. `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigManager.java` lines 264 and 266 - failure to parse a Flink-version-prefixed config key.
5. `flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java` line 90 - failure to parse memory configuration.

When something abnormal happens - a savepoint can't be disposed, scaling data 
can't be decompressed, a memory config is rejected - operators want to see the 
**root cause**, not just "something failed."

Today, when an operator team gets paged for one of these warnings, they have to:

1. Read the code to figure out what `Exception` types could have been thrown.
2. Re-derive (often incorrectly) what the underlying problem might be.
3. Hope it reproduces while they have a debugger attached.

Passing the exception costs nothing and saves debugging time.

*Proposed fix*

Mechanical, one-line change at each site:

```diff
-LOG.warn("Error while decompressing scaling data, treating as uncompressed");
+LOG.warn("Error while decompressing scaling data, treating as uncompressed", e);
```


> Include caught exceptions in warn/error log statements
> ------------------------------------------------------
>
>                 Key: FLINK-39692
>                 URL: https://issues.apache.org/jira/browse/FLINK-39692
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Pavel Zeger
>            Priority: Major
>              Labels: pull-request-available
>
> *Where*
> One site where `catch (Exception e)` is followed by a `LOG.warn` / 
> `LOG.error` that does not pass `e` as the last argument, so the stack trace 
> is silently discarded:
> `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/snapshot/StateSnapshotReconciler.java` line 173 - failure to dispose savepoint.
> When something abnormal happens - a savepoint can't be disposed, scaling data 
> can't be decompressed, a memory config is rejected - operators want to see 
> the *root cause*, not just "something failed."
> Today, when an operator team gets paged for one of these warnings, they have 
> to:
> 1. Read the code to figure out what `Exception` types could have been thrown.
> 2. Re-derive (often incorrectly) what the underlying problem might be.
> 3. Hope it reproduces while they have a debugger attached.
> Passing the exception costs nothing and saves debugging time.
> *Proposed fix*
> Mechanical, one-line change at each site:
> ```diff
> -LOG.warn("Error while decompressing scaling data, treating as uncompressed");
> +LOG.warn("Error while decompressing scaling data, treating as uncompressed", e);
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
