Using EBS as checkpoint storage doesn't work in a distributed environment
if you need to move state between TaskManagers (TMs), e.g., for rescaling
or non-local recovery. You would need something along the lines of
read-write Multi-Attach plus a carefully planned volume layout. That won't
be easy to set up, and I'm not aware of anyone doing it.
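
For reference, the alternative discussed further down in this thread (S3 as
primary checkpoint storage, EBS only for the RocksDB working directory and
task-local recovery) could be sketched roughly as below. This is only a
sketch: the bucket name and mount paths are hypothetical placeholders, the
option names are the Flink 1.x ones as far as I know (please check them
against the docs for your version), and the s3:// path assumes an S3
filesystem plugin is installed. The small helper just renders the options as
flink-conf.yaml-style lines.

```python
# Sketch only: bucket name and mount paths below are hypothetical placeholders.
FLINK_CONF = {
    # Primary (durable) checkpoint storage on S3, reachable from every TM.
    "state.checkpoints.dir": "s3://my-bucket/flink/checkpoints",
    "state.backend": "rocksdb",
    "state.backend.incremental": "true",
    # RocksDB working directory on a locally attached EBS volume.
    "state.backend.rocksdb.localdir": "/mnt/ebs/rocksdb",
    # Task-local recovery: keep a secondary copy of the state on the local volume.
    "state.backend.local-recovery": "true",
    "taskmanager.state.local.root-dirs": "/mnt/ebs/local-recovery",
}


def render_flink_conf(conf):
    """Render a dict of options as flat `key: value` lines."""
    return "\n".join(f"{key}: {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render_flink_conf(FLINK_CONF))
```

With this layout each EBS volume only needs to be attached to a single TM,
and a TM that loses its volume can still restore from S3.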


On Wed, Jul 19, 2023 at 11:10 AM Prabhu Joseph wrote:

> Thanks for sharing the information.
> I observed the same: S3 (primary checkpoint storage) + EBS (task-local
> recovery) performs better than EBS as primary checkpoint storage.
>
> On Tue, Jul 18, 2023 at 12:21 PM Konstantin Knauf wrote:
>> Hi Prabhu,
>> This should be possible, but it is quite expensive compared to AWS S3,
>> and in case of a failure you have to remount the EBS volumes on the new
>> TaskManagers. That takes a non-trivial amount of time and slows down
>> recovery. So, overall, I don't think it's preferable to S3.
>> We do use EBS volumes, though, for the local RocksDB working directory.
>> Right now we don't remount them on failure, because of the additional
>> latency that remounting introduces.
>> Cheers,
>> Konstantin
>> On Wed, Jul 12, 2023 at 18:55, Prabhu Joseph wrote:
>>> Hi,
>>> We are investigating the feasibility of using Amazon Elastic Block
>>> Store (EBS) as checkpoint storage by mounting a volume (as a shared local
>>> file system path) on the JobManager and all TaskManager pods. I would
>>> appreciate feedback on this approach from anyone who has already tried it.
>>> Thanks,
>>> Prabhu Joseph
