Hi Igal, Thank you for your response, understood the strategies.
Best, Hady On Wed, Nov 3, 2021 at 9:06 PM Igal Shilman <i...@apache.org> wrote: > Hello Hady, > Glad to see that you are testing StateFun! > > Regarding that exception, I think that this is not the root cause. The > root cause is as you wrote that the StateFun job failed because it wasn't > able to deliver a message to a remote function in the given time frame. > If you look at the logs you most likely see a > StateFulFunctionInvocationException. > > However the Flink Job should recover if checkpointing is enabled or a > restart strategy is set in your flink-conf.yaml. > You can refer the default flink-conf.yaml[1] that we ship in the > official Docker image[2] > > One more thing that I'd like to point out is that if the remote functions > are not able to keep up with the stream of request, > you have few strategies to deal with that: > > * increase the total call timeout per invocation [3] > * reduce the maxNumberOfBatch requests to a lower value [4] > * reduce the total number of asynchronous in-flight requests: set the > following key in your flink-conf.yaml to something lower than 1024. Try and > experiment with different values. > statefun.async.max-per-task: 256 # for example 256, 128,... > * enable autoscaling of your remote containers (if your > infrastructure allows) > > [1] > https://github.com/apache/flink-statefun/blob/56c2f036a6831af885eb15539bc7962bb730b060/tools/docker/flink-distribution-template/conf/flink-conf.yaml > [2] https://hub.docker.com/r/apache/flink-statefun > [3] > https://nightlies.apache.org/flink/flink-statefun-docs-master/docs/modules/http-endpoint/#asynchronous-http-transport-beta > [4] > https://github.com/apache/flink-statefun/blob/master/statefun-e2e-tests/statefun-smoke-e2e-js/src/test/resources/remote-module/module.yaml#L20 > > Good luck, > Igal. > > On Wed, Nov 3, 2021 at 10:59 AM Hady Januar Willi <hady.wi...@gojek.com> > wrote: > >> Hi everyone, >> >> When testing Flink statefun, the job eventually throws the following >> exception after failing to reach the endpoint or if the endpoint fails >> after the exponentially increasing delay. >> >> java.util.concurrent.RejectedExecutionException: >> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException: >> Mailbox is in state CLOSED, but is required to be in state OPEN for put >> operations. >> >> How do I recover from this state? >> >> Thank you. >> >> Regards, >> Hady >> >