Hi, Eleanore

1. AFAIK, only the job itself can "pause". For example, the step that queries the external system could block (and so pause processing) while the external system is down.
2. If you use the DataStream API, you could try "iterate" and send the failed messages back to be retried.

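To make point 1 a bit more concrete, here is a minimal plain-Java sketch of a lookup that blocks and retries with exponential backoff until the external system answers. It is only an illustration under assumptions: `BlockingRetry`, `callUntilAvailable` and `ExternalSystemDownException` are hypothetical names, not Flink APIs, and the lookup itself is simulated. If a call like this ran inside the validation operator, the operator would stop consuming while it waits, which should build backpressure towards the Kafka source instead of routing the message to the error topic.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class BlockingRetry {

    /** Hypothetical exception: thrown by the lookup when the external system is unreachable. */
    static final class ExternalSystemDownException extends Exception {
        ExternalSystemDownException(String message) {
            super(message);
        }
    }

    /**
     * Calls the lookup until it succeeds. Only ExternalSystemDownException is retried;
     * any other exception propagates to the caller. The wait doubles after each failed
     * attempt and is capped at maxBackoffMs.
     */
    static <T> T callUntilAvailable(Callable<T> lookup, long initialBackoffMs, long maxBackoffMs)
            throws Exception {
        long backoffMs = initialBackoffMs;
        while (true) {
            try {
                return lookup.call();
            } catch (ExternalSystemDownException e) {
                // Blocking here is what "pauses" processing of further messages.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated external system: down for the first two calls, then available.
        AtomicInteger attempts = new AtomicInteger();
        String metadata = callUntilAvailable(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new ExternalSystemDownException("external system is down");
            }
            return "metadata";
        }, 10, 100);
        System.out.println(metadata + " after " + attempts.get() + " attempts");
    }
}
```

One trade-off to keep in mind: while the operator is blocked it probably cannot make progress on checkpoints either, so a long outage may lead to checkpoint timeouts. Capping the total wait and falling back to the error topic after that would be a reasonable variation.
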
Best,
Guowei

On Mon, Nov 30, 2020 at 1:01 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi experts,
>
> Here is my use case, it's a flink stateless streaming job for message
> validation.
> 1. read from a kafka topic
> 2. perform validation of message, which requires query external system
>    2a. the metadata from the external system will be cached in memory
>        for 15minutes
>    2b. there is another stream that will send updates to update the
>        cache if metadata changed within 15 minutes
> 3. if message is valid, publish to valid topic
> 4. if message is invalid, publish to error topic
> 5. if the external system is down, the message is marked as invalid with
>    different error code, and published to the same error topic.
>
> Ask:
> For those messages that failed due to external system failures, it
> requires manual replay of those messages.
>
> Is there a way to pause the job if there is an external system failure,
> and resume once the external system is online?
>
> Or are there any other suggestions to allow auto retry such error?
>
> Thanks a lot!
> Eleanore
>