Hi Kui.Yuan, This is a great idea to expose the timeout method of AsyncFunction to the user-facing UDF. I have a couple comments/questions about the design:
1) The FLIP mentions lookup joins in particular, but the correlate query codepath also exists: https://cwiki.apache.org/confluence/display/FLINK/FLIP-498%3A+AsyncTableFunction+for+async+table+function+support . Is this intended to be supported as well? I think it might require updating AsyncCorrelateRunner. > The `CompletableFuture` must be completed synchronously; 2) User code could always set a "soft" timeout, and use some of the remaining time budget for a subsequent async call to handle the soft timeout, if necessary. So this seems like a reasonable limitation, even if they needed async support. This is similar to the more general case of a recoverable error scenario where you might want to handle the Nth attempt at a request differently, but that's probably beyond this FLIP's scope. Thanks, Alan On Tue, May 26, 2026 at 2:21 AM Kui Yuan <[email protected]> wrote: > Hi Lincoln, > > Thanks a lot for the great questions! I'll incorporate the relevant > clarifications into the FLIP. Let me share some quick thoughts here: > > Q1 (interaction with retry): The timeout reuses the existing flow in > AsyncWaitOperator. The configured timeout is the total budget measured > from the first asyncInvoke. When the async-lookup timer expires, any > pending delayed retry is cancelled and timeout(...) is invoked exactly > once. The retry result/exception predicates are not consulted on the > timeout path. > > Q2 ("CompletableFuture must be completed synchronously"): This is a > runtime guarantee, not just a javadoc convention. The codegen'd code > checks whether the future is completed; if it isn't, an error is > raised and the job fails fast. > > Q3 (failure modes of the user timeout method): > > If the timeout method itself throws, we surface the exception and fail > the job fast — no silent swallowing. > If timeout blocks synchronously and never returns, it would block the > mailbox thread and effectively freeze the task. This is admittedly a > serious situation, and we don't have a mechanism for it today. Adding > one would require a separate timeout-on-timeout, which could break the > single-threaded mailbox assumption. So for now the simpler approach is > to document the contract in the javadoc — timeout must be > non-blocking. Open to suggestions if you have a better idea here. > > Q4 (var-args matching): For var-args, the signatures must mirror each > other. If eval is eval(CompletableFuture<T>, String...), then timeout > must be timeout(CompletableFuture<T>, String...). Any mismatch is > rejected at startup with a clear error and I'll add unit tests to > cover this. > > Q5 (AsyncScalarFunction): This was an intentional scoping choice. The > use cases driving the custom-timeout requirement today are mostly LLM > calls, where the output is rarely a single scalar — so users tend to > reach for AsyncTableFunction instead of AsyncScalarFunction (the > built-in AsyncPredictFunction also extends AsyncTableFunction). Until > we hear a concrete user need on the scalar side, supporting timeout > only on AsyncTableFunction feels sufficient. Happy to revisit if/when > such a need surfaces. > > Thanks again for the careful review! > > Best, > Kui.Yuan > > > > Lincoln Lee <[email protected]> 于2026年5月26日周二 15:16写道: > > > > Thanks for the FLIP, nice addition! A few questions before the vote: > > > > 1. Interaction with the retry strategy: the FLIP doesn't say where the > new > > timeout hook sits in that flow, worth spelling out explicitly. > > > > 2. "CompletableFuture must be completed synchronously." Is this a javadoc > > convention or a runtime guarantee? If it's only a convention, it would > help > > to > > document what happens when users break it. If it's enforced, a brief note > > would > > be useful. > > > > 3. Failure modes of the user timeout method: two cases aren't covered: > > The method itself throws, do we swallow and degrade the Exception, or > fail > > the job? > > The method never completes (bug or hang), does the operator stall > > indefinitely? > > > > 4. On the public interface: since timeout's parameter list mirrors eval, > > how is > > a var-args eval (e.g. eval(CompletableFuture<T>, String...)) expected to > be > > matched? > > > > 5. How has AsyncScalarFunction been considered here? It shares the same > > timeout-prone remote-call pattern, so it seems natural to extend the same > > mechanism there as well, is that in scope, a follow-up, or intentionally > > left out? > > > > > > Best, > > Lincoln Lee > > > > > > Gen Luo <[email protected]> 于2026年5月26日周二 11:02写道: > > > > > Thanks for driving this proposal forward! This addresses a real pain > point > > > we've been hearing about for a while. > > > > > > Many of our users rely on AsyncTableFunction or Lookup Join to > implement > > > custom external service calls and data fetching, typically for RAG or > LLM > > > inference scenarios. Due to the inherent instability of these external > > > services, timeouts occur occasionally, and users want to apply fallback > > > strategies (e.g., falling back to a local rule-based model) rather than > > > failing the entire job. However, this hasn't been achievable so far — > the > > > hard-coded TimeoutException behavior introduces stability risks, > forcing > > > users to keep increasing the timeout value to absurd levels and work > around > > > the issue in various hacky ways. Worse, each user tends to hit this > pitfall > > > independently before realizing the limitation. > > > > > > Adding a timeout interface not only addresses this pain point, but also > > > aligns the API contract between AsyncTableFunction and AsyncFunction, > > > avoiding unnecessary confusion for users. > > > > > > Big +1 from our side — looking forward to seeing this land. > > > > > > On Mon, May 25, 2026 at 7:53 PM Xia Sun <[email protected]> wrote: > > > > > > > Hi Kui.Yuan, > > > > > > > > Thanks for driving this! > > > > > > > > In our production practice, the asynchronous I/O capability of > > > > AsyncTableFunction has shown excellent performance in > > > > batch LLM inference scenarios. We urgently need a custom timeout UDF > > > > for this use case. It would help us handle inference requests that > > > > time out—especially long-context requests—more precisely, and avoid > > > > excessive retries that could otherwise block downstream data. > > > > > > > > +1 to this proposal. > > > > > > > > Best, > > > > > > > > Xia > > > > > > > > Kui Yuan <[email protected]> 于2026年5月22日周五 11:21写道: > > > > > > > > > Hi All, > > > > > > > > > > I'd like to open a discussion for FLIP-580: AsyncTableFunction > supports > > > > > user-defined timeout handling logic [1]. > > > > > > > > > > An increasing number of users are leveraging AsyncTableFunction to > > > invoke > > > > > remote inference clusters. Such invocations are essentially remote > > > > > inference requests, which are far more prone to timeouts than > regular > > > I/O > > > > > operations. Users expect to be able to define custom handling logic > > > when > > > > a > > > > > timeout occurs — for example, falling back to default data or > > > > accumulating > > > > > failure statistics — rather than having a TimeoutException thrown > > > > directly > > > > > and causing the entire job to fail. > > > > > > > > > > > > > > > This FLIP proposal allow users to define custom timeout handling > logic > > > > > inside AsyncTableFunction. > > > > > > > > > > I've already discussed the implementation details with @Luogen > offline, > > > > and > > > > > there's a POC attached [2]. > > > > > > > > > > > > > > > Looking forward to your feedback. > > > > > > > > > > Bests, > > > > > Kui.Yuan > > > > > > > > > > [1]: > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-580*3A*AsyncTableFunction*supports*user-defined*timeout*handling*logic__;JSsrKysrKw!!Ayb5sqE7!uTEOgL-aoWOoxNwvQHWKbSUTZgVn_AjMT8-6Xc7Ub9Aj0UccuvazQdHffMt3DnSo-WJu42YJ04xcyKPDAYb8Sg$ > > > > > > > > > > [2]: > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://github.com/yuchengxin/flink/commit/5a46cd05c48e41a582271dcb9d9842e330871a0b__;!!Ayb5sqE7!uTEOgL-aoWOoxNwvQHWKbSUTZgVn_AjMT8-6Xc7Ub9Aj0UccuvazQdHffMt3DnSo-WJu42YJ04xcyKO3xGyfxQ$ > > > > > > > > > > > > >
