featzhang created FLINK-39225:
---------------------------------
Summary: Add retry with default value fallback for triton
inference failures
Key: FLINK-39225
URL: https://issues.apache.org/jira/browse/FLINK-39225
Project: Flink
Issue Type: Sub-task
Components: Table SQL / Runtime
Reporter: featzhang
Adds a retry mechanism with default-value fallback for Triton model inference
failures, enabling robust error handling and downstream filtering.
h2. Brief change log
h3. 1. New Configuration Options (TritonOptions.java)
* {{max-retries}}: Maximum retry attempts (default: 0)
* {{retry-backoff}}: Initial backoff duration with exponential strategy
(default: 100ms)
* {{default-value}}: Fallback value when all retries fail
h3. 2. Retry Logic (TritonInferenceModelFunction.java)
* Implements exponential backoff retry strategy
* Retries on network errors and 5xx server errors (503, 504)
* Fails immediately on 4xx client errors (configuration issues)
* Detailed logging for each retry attempt
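The retry rules listed above can be sketched in plain Java. This is a minimal illustration, not the code in TritonInferenceModelFunction.java: the class name {{TritonRetrySketch}}, the {{HttpStatusException}} type, and the method names are hypothetical, and it assumes that every 5xx status is retryable and that the backoff doubles on each retry.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from the Flink code base.
final class TritonRetrySketch {

    /** Hypothetical exception carrying the HTTP status of a failed inference call. */
    static final class HttpStatusException extends Exception {
        final int status;

        HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Backoff before the retry with 0-based index retryIndex: initial * 2^retryIndex. */
    static Duration backoffFor(Duration initial, int retryIndex) {
        return initial.multipliedBy(1L << retryIndex);
    }

    /** Assumption: network errors and any 5xx are retried; everything else (e.g. 4xx) is not. */
    static boolean isRetryable(Exception e) {
        if (e instanceof IOException) {
            return true;
        }
        if (e instanceof HttpStatusException) {
            int status = ((HttpStatusException) e).status;
            return status >= 500 && status <= 599;
        }
        return false;
    }

    /** Runs the call once, then retries up to maxRetries times on retryable failures. */
    static <T> T callWithRetry(Callable<T> call, int maxRetries, Duration initialBackoff)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Non-retryable error or retries exhausted: surface the failure.
                if (!isRetryable(e) || attempt >= maxRetries) {
                    throw e;
                }
                Duration wait = backoffFor(initialBackoff, attempt);
                System.err.printf("Attempt %d failed (%s); retrying in %d ms%n",
                        attempt + 1, e.getMessage(), wait.toMillis());
                Thread.sleep(wait.toMillis());
            }
        }
    }
}
```

Under these assumptions, {{max-retries}} = 3 and {{retry-backoff}} = 100ms give at most four calls in total, with waits of 100ms, 200ms, and 400ms between them.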
h3. 3. Default Value Fallback
* Returns configured default value after exhausting all retries
* Supports all output types: STRING, numeric, ARRAY
* Enables downstream view-based routing for success/failure cases
* Backward compatible: throws an exception if no default value is configured
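The fallback behaviour described above might look roughly like the following sketch. The class and method names are hypothetical, and the inference call is assumed to already include any retries; how the real implementation converts the configured string into STRING, numeric, or ARRAY outputs is not shown here.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical sketch; not the actual Flink implementation.
final class DefaultValueFallbackSketch {

    /**
     * Returns the inference result, or the configured default if the call fails.
     * Without a default, the original exception is rethrown (backward compatible).
     */
    static <T> T inferOrDefault(Callable<T> inference, Optional<T> defaultValue)
            throws Exception {
        try {
            return inference.call();
        } catch (Exception e) {
            if (defaultValue.isPresent()) {
                return defaultValue.get();
            }
            throw e;
        }
    }
}
```

With a default of 'FAILED', a failing call yields the string 'FAILED' instead of an exception, which is what allows downstream queries to filter on that value.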
h3. 4. AbstractTritonModelFunction.java
* Added fields and getters for retry configuration
h2. Use Cases
*Scenario*: After N consecutive failures, return a default value that
downstream can use to route records to success/failure paths.
*Example Configuration*:
CREATE MODEL my_triton_model
WITH (
  'provider' = 'triton',
  'endpoint' = 'http://triton:8000/v2/models',
  'model-name' = 'my-model',
  'max-retries' = '3',          -- Retry up to 3 times
  'retry-backoff' = '100ms',    -- 100ms, 200ms, 400ms backoff
  'default-value' = 'FAILED'    -- Return 'FAILED' on all failures
);
*Downstream Processing*:
-- Route based on prediction result
INSERT INTO success_table
SELECT * FROM predictions WHERE result != 'FAILED';

INSERT INTO failure_table
SELECT * FROM predictions WHERE result = 'FAILED';