[ https://issues.apache.org/jira/browse/FLINK-36319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong Liang Teoh reassigned FLINK-36319:
---------------------------------------

    Assignee: Lorenzo Nicora

> FAIL behavior on non-retriable write errors causes an infinite loop when restarting from checkpoint
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-36319
>                 URL: https://issues.apache.org/jira/browse/FLINK-36319
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Lorenzo Nicora
>            Assignee: Lorenzo Nicora
>            Priority: Major
>
> The {{FAIL}} (default) error handling behavior, applied when a write request is
> rejected as non-retriable ({{onPrometheusNonRetriableError}}), causes the
> job to fail and restart.
> Restarting from checkpoint causes some out-of-order (duplicate) writes, which
> Prometheus rejects by default as non-retriable.
> As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any
> restart from checkpoint puts the job in an infinite loop: the replayed writes
> are rejected, the job fails again, and it restarts from the same checkpoint.
> Changes:
> 1. The default {{onPrometheusNonRetriableError}} should be
> {{DISCARD_AND_CONTINUE}}
> 2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}
> 3. Amend docs
> We can keep the rest of the implementation as-is for the moment and just
> prevent {{FAIL}} from being set for this behaviour, as we may later extend the
> handling of this error with a different behaviour.
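
Changes 1 and 2 above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the connector's actual code: the class names `ErrorHandlingSketch` and `ErrorHandlingConfig`, the `OnErrorBehavior` enum, and the factory methods are hypothetical names chosen for the example. Only the setting name and the two behavior values come from the issue text.

```java
import java.util.Objects;

public class ErrorHandlingSketch {

    /** Hypothetical stand-in for the two behaviors named in the issue. */
    public enum OnErrorBehavior {
        FAIL,
        DISCARD_AND_CONTINUE
    }

    /** Hypothetical holder for the onPrometheusNonRetriableError setting. */
    public static final class ErrorHandlingConfig {
        private final OnErrorBehavior onPrometheusNonRetriableError;

        private ErrorHandlingConfig(OnErrorBehavior behavior) {
            this.onPrometheusNonRetriableError = behavior;
        }

        /** Change 1: the default is DISCARD_AND_CONTINUE. */
        public static ErrorHandlingConfig withDefaults() {
            return new ErrorHandlingConfig(OnErrorBehavior.DISCARD_AND_CONTINUE);
        }

        /** Change 2: FAIL is rejected at configuration time. */
        public static ErrorHandlingConfig of(OnErrorBehavior behavior) {
            Objects.requireNonNull(behavior, "behavior must not be null");
            if (behavior == OnErrorBehavior.FAIL) {
                throw new IllegalArgumentException(
                        "onPrometheusNonRetriableError cannot be set to FAIL: "
                                + "restarting from checkpoint would loop indefinitely");
            }
            return new ErrorHandlingConfig(behavior);
        }

        public OnErrorBehavior getOnPrometheusNonRetriableError() {
            return onPrometheusNonRetriableError;
        }
    }

    public static void main(String[] args) {
        // The default configuration discards rejected writes and continues.
        System.out.println(
                ErrorHandlingConfig.withDefaults().getOnPrometheusNonRetriableError());

        // Explicitly requesting FAIL is refused before the job can start.
        try {
            ErrorHandlingConfig.of(OnErrorBehavior.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Keeping `FAIL` in the enum while rejecting it in validation matches the intent of leaving the rest of the implementation as-is, so a different behaviour can be added later without reshaping the configuration.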