Re: Elastic Block Store as checkpoint storage

David Morávek Thu, 20 Jul 2023 00:35:22 -0700

Using EBS as checkpoint storage doesn't work in a distributed environment
if you need to move state between TMs (e.g., for rescaling or non-local
recovery). You'd need something along the lines of read-write multi-attach,
with the volumes set up in a smart way. That won't be easy to set up, and
I'm not aware of anyone doing it.
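
For reference, here is a minimal sketch of the alternative mentioned further
down the thread: S3 as primary checkpoint storage, with EBS used only for
task-local recovery and the RocksDB working directory. It only renders
flink-conf.yaml entries. The bucket name and mount paths are hypothetical
placeholders, and the keys are my assumption of what applies to your Flink
version, so please check them against the docs for your release.

```python
# Sketch only: renders flink-conf.yaml entries for "S3 primary + EBS local".
# CHECKPOINT_DIR and EBS_MOUNT are hypothetical placeholders, not values
# taken from this thread.
CHECKPOINT_DIR = "s3://my-flink-bucket/checkpoints"
EBS_MOUNT = "/mnt/ebs"


def build_flink_conf(checkpoint_dir: str, ebs_mount: str) -> dict:
    """Return the config entries as an ordered mapping of key to value."""
    return {
        # RocksDB keeps its working state on local disk (the EBS volume here).
        "state.backend": "rocksdb",
        "state.backend.rocksdb.localdir": f"{ebs_mount}/rocksdb",
        # Primary (durable, shared) checkpoint storage lives on S3.
        "state.checkpoints.dir": checkpoint_dir,
        # Secondary local copy of state, used only when a task restarts
        # on a TM that still has the copy.
        "state.backend.local-recovery": "true",
        "taskmanager.state.local.root-dirs": f"{ebs_mount}/local-recovery",
    }


def render(conf: dict) -> str:
    """Render the mapping as flat 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render(build_flink_conf(CHECKPOINT_DIR, EBS_MOUNT)))
```

With this layout the EBS volume never has to be visible to more than one TM,
so the multi-attach problem above doesn't arise.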


Best,
D.

On Wed, Jul 19, 2023 at 11:10 AM Prabhu Joseph <[email protected]>
wrote:

> Thanks for sharing the information.
>
> I observed the same: S3 (primary checkpoint storage) + EBS (task-local
> recovery) performs better than EBS as primary checkpoint storage.
>
>
>
> On Tue, Jul 18, 2023 at 12:21 PM Konstantin Knauf <[email protected]>
> wrote:
>
>> Hi Prabhu,
>>
>> this should be possible, but it is quite expensive compared to AWS S3,
>> and in case of a failure you have to remount the EBS volumes to the new
>> TaskManagers, which takes non-trivial time and slows down recovery. So,
>> overall, I don't think it's preferable to S3.
>>
>> We do use EBS volumes, though, for the local RocksDB working directory.
>> Right now we don't remount them on failure because of the additional
>> latency that would introduce.
>>
>> Cheers,
>>
>> Konstantin
>>
>> On Wed, Jul 12, 2023 at 6:55 PM Prabhu Joseph <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> We are investigating the feasibility of using Elastic Block Store (EBS)
>>> as checkpoint storage by mounting a volume (as a shared local file
>>> system path) on the JobManager and all the TaskManager pods. I'd like to
>>> hear feedback on this approach from anyone who has already tried it.
>>>
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>
>>
>> --
>> https://twitter.com/snntrable
>> https://github.com/knaufk
>>
>
