Using EBS as checkpoint storage doesn't work in a distributed environment
if you need to move the state between TMs (e.g., for rescaling and
non-local recovery). You'd need something along the lines of RW
multi-attach and would have to set up the volumes in a smart way, which won't
be easy. I'm not aware
Thanks for sharing the information.
I observed the same: S3 (primary checkpoint storage) + EBS (task-local
recovery) performs better than EBS as primary checkpoint storage.
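
For reference, here is a minimal sketch of what such a setup might look like
in the Flink configuration file. The bucket name and mount path are
hypothetical placeholders, and the option names are the ones used by Flink
releases around the time of this thread, so they may differ in newer versions.
It also assumes that one of the Flink S3 filesystem plugins is installed.

```yaml
# Primary checkpoint storage on S3 (bucket name is a hypothetical placeholder)
state.checkpoints.dir: s3://my-flink-bucket/checkpoints

# Keep a secondary, local copy of the state for task-local recovery
state.backend.local-recovery: true

# Directory on the EBS-backed volume mounted into each TaskManager pod
# (hypothetical path; adjust to the actual mount point)
taskmanager.state.local.root-dirs: /mnt/ebs/flink-local-recovery
```

With this layout S3 remains the source of truth for recovery, and the local
copy on EBS is only an optimization for tasks that restart on the same
TaskManager.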
On Tue, Jul 18, 2023 at 12:21 PM Konstantin Knauf wrote:
> Hi Prabhu,
>
> this should be possible, but is quite exp
Hi Prabhu,
this should be possible, but it is quite expensive compared to AWS S3, and
you have to remount the EBS volumes to the new TaskManagers in case of a
failure. That takes a non-trivial amount of time, which slows down recovery.
So, overall I don't think it's preferable to S3.
We do use
Hi,
We are investigating the feasibility of setting up an Elastic Block Store
(EBS) as checkpoint storage by mounting a volume (a shared local file
system path) to the JobManager and all the TaskManager pods. I would like to
hear feedback on this approach from anyone who has already tried it.
Thanks,
Prabhu