I like the idea of sharing RocksDB memory across slots. However, I'm quite
concerned about the currently proposed approach.

The proposed changes break several good properties that we designed managed
memory for:
1. It is isolated across slots.
2. It should never be wasted (unless nothing in the job needs managed
memory).
In addition, they further complicate the configuration and computation logic
of managed memory.

As an alternative, I'd suggest introducing a variant of
RocksDBStateBackend that shares memory across slots and does not use
managed memory. This basically means the shared memory is not accounted as
part of managed memory. Users of this new feature would need to configure
how much memory the variant state backend should use, and probably also a
larger framework off-heap / JVM overhead memory. The latter might require a
bit of extra user effort compared to the current approach, but would avoid
significant complexity in the managed memory configuration and calculation
logic, which affects a broader set of users.
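
For illustration only, a rough sketch of what such a setup might look like
(the variant backend's memory option name is made up and does not exist
today; the framework off-heap / JVM overhead options are the existing ones;
all sizes are arbitrary):

    import org.apache.flink.configuration.Configuration;

    Configuration conf = new Configuration();
    // hypothetical option: memory budget of the shared, non-managed RocksDB variant
    conf.setString("state.backend.rocksdb.memory.tm-shared-size", "1g");
    // enlarge framework off-heap / JVM overhead to cover the extra native memory
    conf.setString("taskmanager.memory.framework.off-heap.size", "1280m");
    conf.setString("taskmanager.memory.jvm-overhead.max", "2g");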


Best,

Xintong



On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org> wrote:

> Hi John, Yun,
>
> Thank you for your feedback
>
> @John
>
> > It seems like operators would either choose isolation for the cluster’s
> > jobs or they would want to share the memory between jobs.
> > I’m not sure I see the motivation to reserve only part of the memory for
> > sharing and allowing jobs to choose whether they will share or be
> > isolated.
>
> I see two related questions here:
>
> 1) Whether to allow mixed workloads within the same cluster.
> I agree that most likely all the jobs will have the same "sharing"
> requirement.
> So we can drop "state.backend.memory.share-scope" from the proposal.
>
> 2) Whether to allow different memory consumers to use shared or exclusive
> memory.
> Currently, only RocksDB is proposed to use shared memory. For Python, it's
> non-trivial because it is job-specific.
> So we have to partition managed memory into shared/exclusive and therefore
> can NOT replace "taskmanager.memory.managed.shared-fraction" with some
> boolean flag.
>
> I think your question was about (1), just wanted to clarify why the
> shared-fraction is needed.
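>
> To make the shared/exclusive split concrete, this is roughly how I picture
> it (the numbers are made up, and the semantics of the proposed option are
> assumed):
>
>     // assumed sizes, illustrating the proposed shared-fraction semantics
>     long managedSize = 4L * 1024 * 1024 * 1024;   // taskmanager.memory.managed.size = 4 GiB
>     double sharedFraction = 0.25;                 // taskmanager.memory.managed.shared-fraction (proposed)
>     long shared = (long) (managedSize * sharedFraction); // 1 GiB shared across all slots (e.g. RocksDB)
>     long exclusive = managedSize - shared;        // 3 GiB divided between slots as today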
>
> @Yun
>
> > I am just curious whether this could really bring benefits to our users
> > with such complex configuration logic.
> I agree, and configuration complexity seems a common concern.
> I hope that removing "state.backend.memory.share-scope" (as proposed above)
> reduces the complexity.
> Please share any ideas of how to simplify it further.
>
> > Could you share some real experimental results?
> I did an experiment to verify that the approach is feasible,
> i.e. multiple jobs can share the same memory / block cache.
> But I guess that's not what you mean here? Do you have any experiments in
> mind?
>
> > BTW, as discussed before, I am not sure whether different lifecycles of
> > RocksDB state backends would affect the memory usage of the block cache &
> > write buffer manager in RocksDB.
> > Currently, all instances start and are destroyed nearly simultaneously;
> > this would change after we introduce this feature, with jobs scheduled at
> > different times.
> IIUC, the concern is that closing a RocksDB instance might close the
> BlockCache.
> I checked that manually, and it seems to work as expected (the shared cache
> survives).
> Closing the cache in that case would also contradict the sharing concept,
> as described in the documentation [1].
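>
> Roughly, a sketch of such a check with the RocksJava API (simplified;
> paths and sizes are placeholders, not what was actually used):
>
>     import org.rocksdb.*;
>
>     public class SharedCacheCheck {
>         public static void main(String[] args) throws RocksDBException {
>             RocksDB.loadLibrary();
>             try (Cache cache = new LRUCache(256 << 20);
>                  WriteBufferManager wbm = new WriteBufferManager(64 << 20, cache);
>                  Options options = new Options()
>                          .setCreateIfMissing(true)
>                          .setWriteBufferManager(wbm)
>                          .setTableFormatConfig(
>                                  new BlockBasedTableConfig().setBlockCache(cache))) {
>                 RocksDB db1 = RocksDB.open(options, "/tmp/db1");
>                 RocksDB db2 = RocksDB.open(options, "/tmp/db2");
>                 db1.put("k".getBytes(), "v".getBytes());
>                 db1.close();                             // closing one instance ...
>                 db2.put("k".getBytes(), "v".getBytes()); // ... the shared cache still works here
>                 db2.close();
>             }
>         }
>     }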
>
> [1]
> https://github.com/facebook/rocksdb/wiki/Block-Cache
>
> Regards,
> Roman
>
>
> On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
>
> > Hi Roman,
> > Thanks for the proposal; this allows the state backend to make better use
> > of memory.
> >
> > After reading the ticket, I'm curious about some points:
> >
> > 1. Is shared-memory only for the state backend? If both
> > "taskmanager.memory.managed.shared-fraction: >0" and
> > "state.backend.rocksdb.memory.managed: false" are set at the same time,
> > will the shared-memory be wasted?
> > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> > memory and will compete with each other" in the example; how is the size
> > of the unmanaged part calculated?
> > 3. For fine-grained resource management, the control of cpuCores and
> > taskHeapMemory can still work, right? And I am a little worried that too
> > many memory-related configuration options are complicated for users to
> > understand.
> >
> > Regards,
> > Yanfei
> >
> > Roman Khachatryan <ro...@apache.org> 于2022年11月8日周二 23:22写道:
> >
> > > Hi everyone,
> > >
> > > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > > FLINK-29928 [1].
> > >
> > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > - share the underlying memory objects (block cache, write buffer
> > >   manager) among RocksDB instances of the same slot
> > > - bound the total memory used by all RocksDB instances of a TM
> > >
> > > However, the memory is divided between the slots equally (unless using
> > > fine-grained resource control). This is sub-optimal if some slots
> > > contain more memory-intensive tasks than the others.
> > > Using fine-grained resource control is also often not an option because
> > > the workload might not be known in advance.
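> > >
> > > For example (assumed numbers, just to illustrate the equal split):
> > >
> > >     long managedPerTm = 3L << 30;                 // 3 GiB managed memory in the TM
> > >     int numberOfSlots = 3;
> > >     long perSlot = managedPerTm / numberOfSlots;  // 1 GiB per slot, regardless of workload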
> > >
> > > The proposal is to widen the scope of sharing memory to the TM, so that
> > > it can be shared across all RocksDB instances of that TM. That would
> > > reduce the overall memory consumption at the cost of resource isolation.
> > >
> > > Please see FLINK-29928 [1] for more details.
> > >
> > > Looking forward to feedback on that proposal.
> > >
> > > [1]
> > > https://issues.apache.org/jira/browse/FLINK-29928
> > > [2]
> > > https://issues.apache.org/jira/browse/FLINK-7289
> > >
> > > Regards,
> > > Roman
> > >
> >
>
