Lorenzo Nicora created FLINK-36319:
--------------------------------------

             Summary: FAIL behavior on non-retriable write errors causes an 
infinite loop when restarting from checkpoint
                 Key: FLINK-36319
                 URL: https://issues.apache.org/jira/browse/FLINK-36319
             Project: Flink
          Issue Type: Sub-task
            Reporter: Lorenzo Nicora


With the {{FAIL}} (default) error handling behavior, a write request rejected 
as non-retriable ({{onPrometheusNonRetriableError}}) causes the job to fail 
and restart.


Restarting from a checkpoint causes some out-of-order (duplicate) writes, which 
Prometheus rejects as non-retriable by default.

As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any 
restart from a checkpoint puts the job in an infinite loop: the replayed writes 
are rejected, the job fails, and it restarts from the same checkpoint again.

Changes:

1. The default {{onPrometheusNonRetriableError}} should be {{DISCARD_AND_CONTINUE}}
2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}

We can keep the rest of the implementation as-is for the moment and just 
prevent {{FAIL}} from being set for this behaviour, as we may later expand the 
handling of this error with a different behaviour.
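
A minimal sketch of what the two changes could look like, assuming a small 
validation helper. Only the values {{FAIL}} and {{DISCARD_AND_CONTINUE}} come 
from this ticket; the class {{ErrorHandlingSketch}}, the enum name 
{{OnErrorBehavior}} and the method {{resolveNonRetriableBehavior}} are 
hypothetical and are not the connector's actual API.

```java
/** Hypothetical sketch: the names here are illustrative, not the connector's API. */
final class ErrorHandlingSketch {

    /** The two behaviours mentioned in this ticket. */
    enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

    /** Change 1: proposed default for onPrometheusNonRetriableError. */
    static final OnErrorBehavior DEFAULT_ON_NON_RETRIABLE =
            OnErrorBehavior.DISCARD_AND_CONTINUE;

    /**
     * Resolves the behaviour to use for non-retriable write errors.
     * A null request means "not configured" and falls back to the default.
     */
    static OnErrorBehavior resolveNonRetriableBehavior(OnErrorBehavior requested) {
        if (requested == null) {
            return DEFAULT_ON_NON_RETRIABLE;
        }
        // Change 2: reject FAIL up front, since it would make the job loop
        // forever when restarting from a checkpoint.
        if (requested == OnErrorBehavior.FAIL) {
            throw new IllegalArgumentException(
                    "onPrometheusNonRetriableError cannot be set to FAIL");
        }
        return requested;
    }

    private ErrorHandlingSketch() {}
}
```

Rejecting {{FAIL}} when the configuration is built, rather than removing the 
value, leaves the rest of the error handling untouched and surfaces the 
misconfiguration before the job starts.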



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
