Hi Team,

We have a session cluster running on Kubernetes in which multiple stateless jobs run fine. We observed that once we submit a stateful job (state size per checkpoint is 1 GB) to the same session cluster, the other jobs are impacted: the stateful job starts to utilise more memory and CPU, and the pod is eventually terminated.

To mitigate this issue and provide better resource isolation, we have created multiple session clusters: we launch a high-throughput (stateful) job in one cluster and group the low-throughput jobs in another. This seems to work fine, but managing it will become painful once we start creating more session clusters for high-throughput jobs (10+ jobs), because we will no longer have a single Flink endpoint to submit jobs to (as we have on YARN, where we submit directly to the ResourceManager).

Can you please provide some input on how we should handle this better in Kubernetes?

Regards,
Vinay Patil