Hi,
I am running some benchmarks using StateFun and have run into a problem with backpressure and slow checkpoints whose cause I can't figure out. I was hoping someone might have an idea of what is causing it.

My setup is the following: I am running the Shopping Cart application from the StateFun playground. The job is submitted as an uber jar to an existing Flink cluster with 3 TaskManagers and 1 JobManager. The functions are served using the Undertow example from the documentation, and I am using Kafka ingresses and egresses. My workload is only 1000 events/s. Everything runs in separate GCP VMs.

The issue is very long checkpoints, which I assume are caused by a backpressured ingress, which in turn is caused by the function dispatcher operator not being able to keep up with the workload. The only thing that has helped so far is increasing the parallelism of the job, but it feels like there is still some other bottleneck causing the issues. I have seen other benchmarks reach much higher throughput than 1000 events/s without more CPU or memory resources than I am using.

Any ideas about possible bottlenecks, or ways to track them down, are greatly appreciated.

Best Regards,
Christopher Gustafson