Hi Guowei and Arvid,

Thanks for the suggestions. I wonder whether it would make sense, and whether it is possible, for the operator to emit a side-output message telling the source to 'pause', and to feed that same side output back to the source as a side input, so that the source pauses and resumes based on it?

Thanks a lot!
Eleanore

On Sun, Nov 29, 2020 at 11:33 PM Arvid Heise <ar...@ververica.com> wrote:

> Hi Eleanore,
>
> if the external system is down, you could simply fail the job after a
> given timeout (for example, using asyncIO). Then the job would restart
> using the restarting policies.
>
> If your state is rather small (and thus recovery time okay), you would
> pretty much get your desired behavior. The job would stop to make progress
> until eventually the external system is responding again.
>
> On Mon, Nov 30, 2020 at 7:39 AM Guowei Ma <guowei....@gmail.com> wrote:
>
>> Hi, Eleanore
>>
>> 1. AFAIK I think only the job could "pause" itself. For example the
>> "query" external system could pause when the external system is down.
>> 2. Maybe you could try the "iterate" and send the failed message back to
>> retry if you use the DataStream api.
>>
>> Best,
>> Guowei
>>
>>
>> On Mon, Nov 30, 2020 at 1:01 PM Eleanore Jin <eleanore....@gmail.com>
>> wrote:
>>
>>> Hi experts,
>>>
>>> Here is my use case, it's a flink stateless streaming job for message
>>> validation.
>>> 1. read from a kafka topic
>>> 2. perform validation of message, which requires query external system
>>>        2a. the metadata from the external system will be cached in
>>> memory for 15minutes
>>>        2b. there is another stream that will send updates to update the
>>> cache if metadata changed within 15 minutes
>>> 3. if message is valid, publish to valid topic
>>> 4. if message is invalid, publish to error topic
>>> 5. if the external system is down, the message is marked as invalid with
>>> different error code, and published to the same error topic.
>>>
>>> Ask:
>>> For those messages that failed due to external system failures, it
>>> requires manual replay of those messages.
>>>
>>> Is there a way to pause the job if there is an external system failure,
>>> and resume once the external system is online?
>>>
>>> Or are there any other suggestions to allow auto retry such error?
>>>
>>> Thanks a lot!
>>> Eleanore
>
> --
> Arvid Heise | Senior Java Developer
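
For reference, here is a minimal sketch of Guowei's first point quoted above, letting the operator that queries the external system "pause" itself. It assumes the lookup is called synchronously from the validation operator and that any runtime exception means the external system is unavailable. `BlockingRetry` and `callWithRetry` are hypothetical names for illustration, not Flink APIs. While the call blocks, the operator stops consuming input, so backpressure should slow the Kafka source until the lookup succeeds; a long block may also delay checkpoints.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Flink): retry a lookup with capped exponential backoff. */
final class BlockingRetry {

    private BlockingRetry() {}

    /**
     * Calls lookup until it returns without throwing. Sleeps between attempts,
     * doubling the delay each time up to maxDelayMillis. Blocks the calling thread.
     */
    static <T> T callWithRetry(Supplier<T> lookup, long initialDelayMillis, long maxDelayMillis) {
        long delay = initialDelayMillis;
        while (true) {
            try {
                return lookup.get();
            } catch (RuntimeException e) {
                // Assumption: any runtime failure means the external system is down.
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag so a cancellation is not swallowed.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated external system that fails twice before answering.
        int[] attempts = {0};
        String metadata = callWithRetry(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("external system down");
            }
            return "metadata";
        }, 10, 100);
        System.out.println(metadata + " after " + attempts[0] + " attempts");
    }
}
```

In a job, the validation function would wrap its metadata query in `callWithRetry(...)` instead of marking the message invalid, so no manual replay is needed for outages. The trade-off against failing the job after a timeout, as Arvid suggests, is that nothing restarts, but the task thread stays blocked for the duration of the outage.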