Lorenzo Nicora created FLINK-36319:
--------------------------------------

             Summary: FAIL behavior on non-retriable write errors causes an infinite loop when restarting from checkpoint
                 Key: FLINK-36319
                 URL: https://issues.apache.org/jira/browse/FLINK-36319
             Project: Flink
          Issue Type: Sub-task
            Reporter: Lorenzo Nicora

The {{FAIL}} (default) error handling behavior for write requests rejected as non-retriable ({{onPrometheusNonRetriableError}}) causes the job to fail and restart. Restarting from a checkpoint replays some records, producing out-of-order (duplicate) writes, which Prometheus by default rejects as non-retriable. As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any restart from a checkpoint puts the job in an infinite restart loop.

Changes:
1. The default {{onPrometheusNonRetriableError}} should be {{DISCARD_AND_CONTINUE}}.
2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}.

We can keep the rest of the implementation as-is for the moment and just prevent {{FAIL}} from being set for this behavior, as we may later extend the handling of this error with a different behavior.
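
A minimal sketch of the two proposed changes, for illustration only. The class {{ErrorHandlingConfig}}, its factory methods, and the nested enum are hypothetical names chosen for this example and are not taken from the connector's code; only the values {{FAIL}} and {{DISCARD_AND_CONTINUE}} and the setting name {{onPrometheusNonRetriableError}} come from the issue text. The enum value stays in place, and validation alone prevents it from being selected for this error type.

```java
import java.util.Objects;

// Hypothetical sketch: names are illustrative, not the connector's actual API.
final class ErrorHandlingConfig {

    /** The two behaviors named in the issue. */
    enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

    /** Change 1: the proposed default for non-retriable write errors. */
    static final OnErrorBehavior DEFAULT_ON_NON_RETRIABLE =
            OnErrorBehavior.DISCARD_AND_CONTINUE;

    private final OnErrorBehavior onPrometheusNonRetriableError;

    private ErrorHandlingConfig(OnErrorBehavior behavior) {
        this.onPrometheusNonRetriableError = behavior;
    }

    /** Builds a configuration that uses the proposed default. */
    static ErrorHandlingConfig withDefaults() {
        return new ErrorHandlingConfig(DEFAULT_ON_NON_RETRIABLE);
    }

    /** Change 2: reject FAIL at configuration time instead of at runtime. */
    static ErrorHandlingConfig of(OnErrorBehavior behavior) {
        Objects.requireNonNull(behavior, "behavior");
        if (behavior == OnErrorBehavior.FAIL) {
            throw new IllegalArgumentException(
                    "onPrometheusNonRetriableError cannot be set to FAIL: "
                            + "a restart from checkpoint would loop forever");
        }
        return new ErrorHandlingConfig(behavior);
    }

    OnErrorBehavior onPrometheusNonRetriableError() {
        return onPrometheusNonRetriableError;
    }

    public static void main(String[] args) {
        // The default no longer fails the job on a non-retriable rejection.
        System.out.println(withDefaults().onPrometheusNonRetriableError());
        try {
            of(OnErrorBehavior.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the value in the factory, rather than removing it from the enum, matches the intent of keeping the rest of the implementation as-is while leaving room for a different behavior later.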