Thanks for your reply, Xintong Song. Could you please elaborate on the following:

> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots

Just to clarify, any way to manage the memory efficiently while capping its
usage will break the isolation. It's just a matter of whether it's managed
memory or not. Do you see any reason why unmanaged memory can be shared, but
managed memory cannot?

> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)

If I understand correctly, managed memory can already be wasted, because it
is divided evenly between the slots regardless of whether a particular slot
actually contains any of its consumers. And in general, even if every slot
has RocksDB / Python, equal consumption is not guaranteed. So the current
proposal would rather fix this property.

> In addition, it further complicates configuration / computation logics of
> managed memory.

I think having multiple options overriding each other increases the
complexity for the user. As for the computation, I think it's desirable to
let Flink do it rather than the users.

Both approaches need some help from the TM for:
- storing the shared resources (a static field in a class might be too
  dangerous: if the backend is loaded by the user class loader, the memory
  will leak silently);
- reading the configuration.

(Rough sketches of the proposed shared/exclusive split and of the
cross-instance cache sharing are appended after the quoted thread below.)

Regards,
Roman

On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com> wrote:

> I like the idea of sharing RocksDB memory across slots. However, I'm quite
> concerned by the current proposed approach.
>
> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots
> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)
> In addition, it further complicates configuration / computation logics of
> managed memory.
>
> As an alternative, I'd suggest introducing a variant of RocksDBStateBackend
> that shares memory across slots and does not use managed memory. This
> basically means the shared memory is not considered part of managed memory.
> Users of this new feature would need to configure how much memory the
> variant state backend should use, and probably also a larger
> framework-off-heap / jvm-overhead memory. The latter might require a bit of
> extra user effort compared to the current approach, but would avoid
> significant complexity in the managed memory configuration and calculation
> logics, which affect a broader group of users.
>
> Best,
>
> Xintong
>
>
> On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> > Hi John, Yun,
> >
> > Thank you for your feedback.
> >
> > @John
> >
> > > It seems like operators would either choose isolation for the
> > > cluster's jobs or they would want to share the memory between jobs.
> > > I'm not sure I see the motivation to reserve only part of the memory
> > > for sharing and allowing jobs to choose whether they will share or
> > > be isolated.
> >
> > I see two related questions here:
> >
> > 1) Whether to allow mixed workloads within the same cluster.
> > I agree that most likely all the jobs will have the same "sharing"
> > requirement.
> > So we can drop "state.backend.memory.share-scope" from the proposal.
> >
> > 2) Whether to allow different memory consumers to use shared or
> > exclusive memory.
> > Currently, only RocksDB is proposed to use shared memory. For Python,
> > it's non-trivial because it is job-specific.
> > So we have to partition managed memory into shared/exclusive and
> > therefore can NOT replace "taskmanager.memory.managed.shared-fraction"
> > with some boolean flag.
> >
> > I think your question was about (1); I just wanted to clarify why the
> > shared-fraction is needed.
> >
> > @Yun
> >
> > > I am just curious whether this could really bring benefits to our
> > > users with such complex configuration logic.
> >
> > I agree, and configuration complexity seems to be a common concern.
> > I hope that removing "state.backend.memory.share-scope" (as proposed
> > above) reduces the complexity.
> > Please share any ideas of how to simplify it further.
> >
> > > Could you share some real experimental results?
> >
> > I did an experiment to verify that the approach is feasible, i.e. that
> > multiple jobs can share the same memory / block cache.
> > But I guess that's not what you mean here? Do you have any experiments
> > in mind?
> >
> > > BTW, as discussed before, I am not sure whether different lifecycles
> > > of RocksDB state backends would affect the memory usage of the block
> > > cache & write buffer manager in RocksDB.
> > > Currently, all instances start and are destroyed nearly
> > > simultaneously; this would change after we introduce this feature,
> > > with jobs being scheduled at different times.
> >
> > IIUC, the concern is that closing a RocksDB instance might close the
> > BlockCache.
> > I checked that manually, and it seems to work as expected.
> > Closing the cache in that case would also contradict the sharing
> > concept as described in the documentation [1].
> >
> > [1] https://github.com/facebook/rocksdb/wiki/Block-Cache
> >
> > Regards,
> > Roman
> >
> >
> > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> >
> > > Hi Roman,
> > > Thanks for the proposal; it allows the state backend to make better
> > > use of memory.
> > >
> > > After reading the ticket, I'm curious about some points:
> > >
> > > 1. Is shared memory only for the state backend? If both
> > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > "state.backend.rocksdb.memory.managed: false" are set at the same
> > > time, will the shared memory be wasted?
> > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> > > memory and will compete with each other" in the example; how is the
> > > size of the unmanaged part calculated?
> > > 3. For fine-grained resource management, the control of cpuCores and
> > > taskHeapMemory can still work, right? And I am a little worried that
> > > too many memory-related configuration options are complicated for
> > > users to understand.
> > >
> > > Regards,
> > > Yanfei
> > >
> > > On Tue, Nov 8, 2022 at 23:22, Roman Khachatryan <ro...@apache.org>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to discuss sharing RocksDB memory across slots as proposed
> > > > in FLINK-29928 [1].
> > > >
> > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > - share these objects (the block cache and write buffer manager)
> > > >   among RocksDB instances of the same slot
> > > > - bound the total memory usage of all RocksDB instances of a TM
> > > >
> > > > However, the memory is divided between the slots equally (unless
> > > > fine-grained resource control is used). This is sub-optimal if some
> > > > slots contain more memory-intensive tasks than others.
> > > > Using fine-grained resource control is also often not an option,
> > > > because the workload might not be known in advance.
> > > >
> > > > The proposal is to widen the scope of this sharing to the TM, so
> > > > that memory can be shared across all RocksDB instances of that TM.
> > > > That would reduce the overall memory consumption at the expense of
> > > > resource isolation.
> > > >
> > > > Please see FLINK-29928 [1] for more details.
> > > >
> > > > Looking forward to feedback on that proposal.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29928
> > > > [2] https://issues.apache.org/jira/browse/FLINK-7289
> > > >
> > > > Regards,
> > > > Roman
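
A rough illustration of the shared/exclusive split discussed above. The
option name "taskmanager.memory.managed.shared-fraction" comes from the
FLINK-29928 proposal; the concrete numbers and the exact rounding/semantics
below are assumptions for illustration only, not settled configuration
logic:

public class ManagedMemorySplitSketch {

    public static void main(String[] args) {
        // taskmanager.memory.managed.size = 1gb (existing option)
        long managedMemoryBytes = 1024L * 1024 * 1024;
        // taskmanager.memory.managed.shared-fraction = 0.25 (proposed option, value assumed)
        double sharedFraction = 0.25;
        // taskmanager.numberOfTaskSlots = 4 (existing option)
        int numberOfSlots = 4;

        // One TM-wide pool, usable by RocksDB instances from any slot.
        long sharedBytes = (long) (managedMemoryBytes * sharedFraction);

        // The remainder keeps today's behaviour: divided evenly between the slots.
        long exclusivePerSlotBytes = (managedMemoryBytes - sharedBytes) / numberOfSlots;

        System.out.printf("shared (TM-wide): %d MiB%n", sharedBytes >> 20);
        System.out.printf("exclusive per slot: %d MiB%n", exclusivePerSlotBytes >> 20);
    }
}

With these numbers, 256 MiB would back a single TM-wide pool for all RocksDB
instances, and each of the 4 slots would keep 192 MiB of exclusive managed
memory for its other consumers.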
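
The cache sharing that the proposal relies on (and that FLINK-7289 already
uses within a slot) can be reproduced with plain RocksJava, independently of
Flink. Below is a minimal sketch, with arbitrary local paths and sizes,
showing two RocksDB instances backed by one shared LRUCache /
WriteBufferManager pair, and that closing one instance leaves the shared
cache usable by the other:

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

public class SharedBlockCacheSketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // One block cache + write buffer manager, shared by every instance below.
        try (LRUCache sharedCache = new LRUCache(64 * 1024 * 1024);
             WriteBufferManager sharedWbm = new WriteBufferManager(32 * 1024 * 1024, sharedCache)) {

            RocksDB db1 = open("/tmp/shared-cache-db1", sharedCache, sharedWbm);
            RocksDB db2 = open("/tmp/shared-cache-db2", sharedCache, sharedWbm);

            db1.put("k1".getBytes(), "v1".getBytes());
            db2.put("k2".getBytes(), "v2".getBytes());

            // Closing one instance should not free the shared cache ...
            db1.close();

            // ... so the second instance keeps working against the same cache.
            db2.put("k3".getBytes(), "v3".getBytes());
            db2.close();
        }
    }

    private static RocksDB open(String path, LRUCache cache, WriteBufferManager wbm)
            throws RocksDBException {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig().setBlockCache(cache);
        // Options are leaked for brevity; a real program should close them after the DB.
        Options options = new Options()
                .setCreateIfMissing(true)
                .setWriteBufferManager(wbm)
                .setTableFormatConfig(tableConfig);
        return RocksDB.open(options, path);
    }
}

Flink's actual wiring differs (the shared objects would be created and
accounted for by the memory manager), but the RocksDB objects being shared
are of the same kinds: a Cache and a WriteBufferManager.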