[jira] [Assigned] (FLINK-36319) FAIL behavior on non-retriable write errors causes an infinite loop when restarting from checkpoint

Hong Liang Teoh (Jira) Wed, 18 Sep 2024 09:00:30 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-36319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hong Liang Teoh reassigned FLINK-36319:
---------------------------------------

    Assignee: Lorenzo Nicora

> FAIL behavior on non-retriable write errors causes an infinite loop when 
> restarting from checkpoint
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-36319
>                 URL: https://issues.apache.org/jira/browse/FLINK-36319
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Lorenzo Nicora
>            Assignee: Lorenzo Nicora
>            Priority: Major
>
> With the {{FAIL}} (default) error-handling behavior, a write request 
> rejected as non-retriable ({{onPrometheusNonRetriableError}}) causes the 
> job to fail and restart.
> Restarting from a checkpoint causes some out-of-order (duplicate) writes, 
> which Prometheus rejects as non-retriable by default.
> As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any 
> restart from a checkpoint puts the job in an infinite loop.
> Changes:
> 1. default {{onPrometheusNonRetriableError}} should be 
> {{DISCARD_AND_CONTINUE}}
> 2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}
> 3. Amend docs
> We can keep the rest of the implementation as-is for the moment and just 
> prevent {{FAIL}} from being set for this behaviour, as we may later expand 
> the handling of this error with a different behaviour.
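
For illustration only, changes 1 and 2 could be sketched roughly as follows. This is a minimal, self-contained sketch and not the connector's actual code: the class names ErrorHandlingConfigSketch and ErrorHandlingConfig, the builder, and the exception message are hypothetical. Only the setting name onPrometheusNonRetriableError and the values FAIL and DISCARD_AND_CONTINUE come from the issue.

```java
import java.util.Objects;

public class ErrorHandlingConfigSketch {

    /** The two behaviors named in the issue. */
    enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

    /** Hypothetical holder for the setting; not the connector's real class. */
    static final class ErrorHandlingConfig {
        private final OnErrorBehavior onPrometheusNonRetriableError;

        private ErrorHandlingConfig(OnErrorBehavior behavior) {
            this.onPrometheusNonRetriableError = behavior;
        }

        OnErrorBehavior getOnPrometheusNonRetriableError() {
            return onPrometheusNonRetriableError;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder {
            // Change 1: the default becomes DISCARD_AND_CONTINUE instead of FAIL.
            private OnErrorBehavior onPrometheusNonRetriableError =
                    OnErrorBehavior.DISCARD_AND_CONTINUE;

            Builder onPrometheusNonRetriableError(OnErrorBehavior behavior) {
                Objects.requireNonNull(behavior, "behavior");
                // Change 2: FAIL is rejected, because a restart from checkpoint
                // replays writes that Prometheus rejects as non-retriable again.
                if (behavior == OnErrorBehavior.FAIL) {
                    throw new IllegalArgumentException(
                            "FAIL is not supported for onPrometheusNonRetriableError");
                }
                this.onPrometheusNonRetriableError = behavior;
                return this;
            }

            ErrorHandlingConfig build() {
                return new ErrorHandlingConfig(onPrometheusNonRetriableError);
            }
        }
    }

    public static void main(String[] args) {
        // No explicit setting: the new default applies.
        ErrorHandlingConfig config = ErrorHandlingConfig.builder().build();
        System.out.println(config.getOnPrometheusNonRetriableError());

        // Explicitly asking for FAIL is refused at configuration time.
        try {
            ErrorHandlingConfig.builder()
                    .onPrometheusNonRetriableError(OnErrorBehavior.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting FAIL when the configuration is built, rather than at write time, surfaces the problem before the job starts and leaves the enum value in place in case this error later gets a different handling.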



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
