Using EBS as checkpoint storage doesn't work in a distributed environment if you need to move state between TaskManagers (e.g., for rescaling or non-local recovery). You'd need something along the lines of read-write multi-attach plus a carefully planned volume layout. That won't be easy to set up, and I'm not aware of anyone doing it.
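
For reference, the combination discussed in this thread (S3 as primary checkpoint storage, EBS only for task-local recovery and the RocksDB working directory) could look roughly like the fragment below. This is only a sketch: the bucket name and mount paths are made up, and the keys are the ones I know from Flink 1.17-era configuration, so please check them against the docs for your version.

```yaml
# Sketch only: bucket name and /mnt/ebs paths are hypothetical placeholders.
state.backend: rocksdb
state.backend.incremental: true

# Primary (durable, shared) checkpoint storage on S3.
state.checkpoints.dir: s3://my-flink-bucket/checkpoints

# Task-local recovery: keep a secondary local copy of state on the EBS mount.
state.backend.local-recovery: true
taskmanager.state.local.root-dirs: /mnt/ebs/local-recovery

# RocksDB working directory on the same EBS-backed mount.
state.backend.rocksdb.localdir: /mnt/ebs/rocksdb
```

With this layout the EBS volume never has to be shared between pods. If a TaskManager comes back without its local copy, it restores from S3.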
Best,
D.

On Wed, Jul 19, 2023 at 11:10 AM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Thanks for sharing the information.
>
> I also observed the same: S3 (primary checkpoint storage) + EBS (task-local
> recovery) performs better than EBS as primary checkpoint storage.
>
> On Tue, Jul 18, 2023 at 12:21 PM Konstantin Knauf <kna...@apache.org>
> wrote:
>
>> Hi Prabhu,
>>
>> this should be possible, but it is quite expensive in comparison to AWS S3,
>> and you have to remount the EBS volumes to the new TaskManagers in case of
>> a failure, which takes some non-trivial time and slows down recovery. So,
>> overall I don't think it's preferable compared to S3.
>>
>> We do use EBS volumes, though, for the local RocksDB working directory.
>> We don't remount them on failure right now, though, due to the additional
>> latency that this introduces.
>>
>> Cheers,
>>
>> Konstantin
>>
>> On Wed, Jul 12, 2023 at 6:55 PM Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are investigating the feasibility of setting up an Elastic Block
>>> Store (EBS) as checkpoint storage by mounting a volume (a shared local file
>>> system path) to the JobManager and all the TaskManager pods. I would like
>>> to hear any feedback on this approach if anyone has already tried it.
>>>
>>> Thanks,
>>> Prabhu Joseph
>>
>> --
>> https://twitter.com/snntrable
>> https://github.com/knaufk