Re: OperatorStateFromBackend can't complete initialisation because of high number of savepoint files reads

2024-10-15 Thread Gabor Somogyi
> Could you please let us know if you see anything wrong when using
> `execution.checkpointing.snapshot-compression: true` since for us this
> seems to have solved the multiple S3 reads issue.

When something is working it's never wrong. The question is why it has been resolved. Are you still ha…

Re: OperatorStateFromBackend can't complete initialisation because of high number of savepoint files reads

2024-10-15 Thread William Wallace
Thank you for the recommendation and the help. Could you please let us know if you see anything wrong when using `execution.checkpointing.snapshot-compression: true`, since for us this seems to have solved the multiple S3 reads issue. In debug we see: `in.delegate = ClosingFSDataInputStream(org.apa…`
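For reference, the setting discussed in this message is a regular Flink configuration option. A minimal sketch of enabling it in the cluster configuration, assuming defaults for all other checkpointing options:

```yaml
# flink-conf.yaml (config.yaml in newer Flink releases) — sketch only
# Compresses keyed/operator state written into checkpoints and savepoints.
execution.checkpointing.snapshot-compression: true
```

If you configure jobs programmatically via the DataStream API, the equivalent setter is `ExecutionConfig#setUseSnapshotCompression(true)`.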

Re: OperatorStateFromBackend can't complete initialisation because of high number of savepoint files reads

2024-10-15 Thread Gabor Somogyi
My recommendation is to cherry-pick this PR [1] on top of your Flink distro when possible. Additionally, turn off state compression. These should do the trick...

[1] https://github.com/apache/flink/pull/25509

G

On Tue, Oct 15, 2024 at 1:03 PM William Wallace wrote:
> Thank you Gabor for your r…
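The "turn off state compression" part of the recommendation maps to the same configuration key mentioned earlier in the thread. A minimal sketch, assuming you are running a build that already carries the fix from PR 25509:

```yaml
# flink-conf.yaml — sketch, not authoritative
# Disable snapshot compression, since the thread notes that compressed
# state remains problematic even with the patched distribution.
execution.checkpointing.snapshot-compression: false
```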

Re: OperatorStateFromBackend can't complete initialisation because of high number of savepoint files reads

2024-10-15 Thread William Wallace
Thank you Gabor for your reply. I'm sharing below more findings for both uncompressed and compressed state in the hope it helps. I'm looking forward to your thoughts.

1. uncompressed state - observe the `stateHandle=RelativeFileStateHandle`
```
org.apache.flink.runtime.state.restore.FullSnapsho…
```

Re: Flink job can't complete initialisation because of millions of savepoint file reads

2024-10-15 Thread Gabor Somogyi
Hi Alex,

Please see my comment here [1].

[1] https://lists.apache.org/thread/h5mv6ld4l2g4hsjszfdos9f365nh7ctf

BR, G

On Mon, Sep 2, 2024 at 11:02 AM Alex K. wrote:
> We have an issue where a savepoint file containing Kafka topic partitions
> offsets is requested millions of times from AWS S3.

Re: OperatorStateFromBackend can't complete initialisation because of high number of savepoint files reads

2024-10-15 Thread Gabor Somogyi
Hi William,

It's a bit of an old question, but I think now we know why this is happening. Please see [1] for further details. It's an important requirement to use uncompressed state, because even with the fix compressed state is still problematic. We've already tested the PR with load, but if you can repo…