Thanks for your reply, Xintong Song. Could you please elaborate on the following:

> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots

Just to clarify, any way to manage the memory efficiently while capping its
usage will break the isolation. It's just a matter of whether it's managed
memory or not. Do you see any reason why unmanaged memory can be shared, but
managed memory cannot?

> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)

If I understand correctly, managed memory can already be wasted, because it
is divided evenly between the slots regardless of whether a particular slot
actually contains any of its consumers. And in general, even if every slot
has RocksDB / Python, equal consumption is not guaranteed. So the current
proposal would rather fix this property.

> In addition, it further complicates configuration / computation logics of
> managed memory.

I think having multiple options overriding each other increases the
complexity for the user. As for the computation, I think it's desirable to
let Flink do it rather than the users.

Both approaches need some help from the TM for:
- storing the shared resources (a static field in a class might be too
  dangerous: if the backend is loaded by the user class loader, the memory
  will leak silently);
- reading the configuration.

(Rough sketches of the proposed shared/exclusive split and of the
cross-instance cache sharing are appended after the quoted thread below.)

Regards,
Roman

On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com> wrote:

> I like the idea of sharing RocksDB memory across slots. However, I'm quite
> concerned by the current proposed approach.
>
> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots
> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)
> In addition, it further complicates configuration / computation logics of
> managed memory.
>
> As an alternative, I'd suggest introducing a variant of RocksDBStateBackend
> that shares memory across slots and does not use managed memory. This
> basically means the shared memory is not considered part of managed memory.
> Users of this new feature would need to configure how much memory the
> variant state backend should use, and probably also a larger
> framework-off-heap / jvm-overhead memory. The latter might require a bit of
> extra user effort compared to the current approach, but would avoid
> significant complexity in the managed memory configuration and calculation
> logics, which affect a broader group of users.
>
> Best,
>
> Xintong
>
>
> On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> > Hi John, Yun,
> >
> > Thank you for your feedback.
> >
> > @John
> >
> > > It seems like operators would either choose isolation for the
> > > cluster's jobs or they would want to share the memory between jobs.
> > > I'm not sure I see the motivation to reserve only part of the memory
> > > for sharing and allowing jobs to choose whether they will share or
> > > be isolated.
> >
> > I see two related questions here:
> >
> > 1) Whether to allow mixed workloads within the same cluster.
> > I agree that most likely all the jobs will have the same "sharing"
> > requirement.
> > So we can drop "state.backend.memory.share-scope" from the proposal.
> >
> > 2) Whether to allow different memory consumers to use shared or
> > exclusive memory.
> > Currently, only RocksDB is proposed to use shared memory. For Python,
> > it's non-trivial because it is job-specific.
> > So we have to partition managed memory into shared/exclusive and
> > therefore can NOT replace "taskmanager.memory.managed.shared-fraction"
> > with some boolean flag.
> >
> > I think your question was about (1); I just wanted to clarify why the
> > shared-fraction is needed.
> >
> > @Yun
> >
> > > I am just curious whether this could really bring benefits to our
> > > users with such complex configuration logic.
> >
> > I agree, and configuration complexity seems to be a common concern.
> > I hope that removing "state.backend.memory.share-scope" (as proposed
> > above) reduces the complexity.
> > Please share any ideas of how to simplify it further.
> >
> > > Could you share some real experimental results?
> >
> > I did an experiment to verify that the approach is feasible, i.e. that
> > multiple jobs can share the same memory / block cache.
> > But I guess that's not what you mean here? Do you have any experiments
> > in mind?
> >
> > > BTW, as discussed before, I am not sure whether different lifecycles
> > > of RocksDB state backends would affect the memory usage of the block
> > > cache & write buffer manager in RocksDB.
> > > Currently, all instances start and are destroyed nearly
> > > simultaneously; this would change after we introduce this feature,
> > > with jobs being scheduled at different times.
> >
> > IIUC, the concern is that closing a RocksDB instance might close the
> > BlockCache.
> > I checked that manually, and it seems to work as expected.
> > Closing the cache in that case would also contradict the sharing
> > concept as described in the documentation [1].
> >
> > [1] https://github.com/facebook/rocksdb/wiki/Block-Cache
> >
> > Regards,
> > Roman
> >
> >
> > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> >
> > > Hi Roman,
> > > Thanks for the proposal; it allows the state backend to make better
> > > use of memory.
> > >
> > > After reading the ticket, I'm curious about some points:
> > >
> > > 1. Is shared memory only for the state backend? If both
> > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > "state.backend.rocksdb.memory.managed: false" are set at the same
> > > time, will the shared memory be wasted?
> > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> > > memory and will compete with each other" in the example; how is the
> > > size of the unmanaged part calculated?
> > > 3. For fine-grained resource management, the control of cpuCores and
> > > taskHeapMemory can still work, right? And I am a little worried that
> > > too many memory-related configuration options are complicated for
> > > users to understand.
> > >
> > > Regards,
> > > Yanfei
> > >
> > > On Tue, Nov 8, 2022 at 23:22, Roman Khachatryan <ro...@apache.org>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to discuss sharing RocksDB memory across slots as proposed
> > > > in FLINK-29928 [1].
> > > >
> > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > - share these objects (the block cache and write buffer manager)
> > > >   among RocksDB instances of the same slot
> > > > - bound the total memory usage of all RocksDB instances of a TM
> > > >
> > > > However, the memory is divided between the slots equally (unless
> > > > fine-grained resource control is used). This is sub-optimal if some
> > > > slots contain more memory-intensive tasks than others.
> > > > Using fine-grained resource control is also often not an option,
> > > > because the workload might not be known in advance.
> > > >
> > > > The proposal is to widen the scope of this sharing to the TM, so
> > > > that memory can be shared across all RocksDB instances of that TM.
> > > > That would reduce the overall memory consumption at the expense of
> > > > resource isolation.
> > > >
> > > > Please see FLINK-29928 [1] for more details.
> > > >
> > > > Looking forward to feedback on that proposal.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29928
> > > > [2] https://issues.apache.org/jira/browse/FLINK-7289
> > > >
> > > > Regards,
> > > > Roman
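
A rough illustration of the shared/exclusive split discussed above. The
option name "taskmanager.memory.managed.shared-fraction" comes from the
FLINK-29928 proposal; the concrete numbers and the exact rounding/semantics
below are assumptions for illustration only, not settled configuration
logic:

public class ManagedMemorySplitSketch {

    public static void main(String[] args) {
        // taskmanager.memory.managed.size = 1gb (existing option)
        long managedMemoryBytes = 1024L * 1024 * 1024;
        // taskmanager.memory.managed.shared-fraction = 0.25 (proposed option, value assumed)
        double sharedFraction = 0.25;
        // taskmanager.numberOfTaskSlots = 4 (existing option)
        int numberOfSlots = 4;

        // One TM-wide pool, usable by RocksDB instances from any slot.
        long sharedBytes = (long) (managedMemoryBytes * sharedFraction);

        // The remainder keeps today's behaviour: divided evenly between the slots.
        long exclusivePerSlotBytes = (managedMemoryBytes - sharedBytes) / numberOfSlots;

        System.out.printf("shared (TM-wide): %d MiB%n", sharedBytes >> 20);
        System.out.printf("exclusive per slot: %d MiB%n", exclusivePerSlotBytes >> 20);
    }
}

With these numbers, 256 MiB would back a single TM-wide pool for all RocksDB
instances, and each of the 4 slots would keep 192 MiB of exclusive managed
memory for its other consumers.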
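
The cache sharing that the proposal relies on (and that FLINK-7289 already
uses within a slot) can be reproduced with plain RocksJava, independently of
Flink. Below is a minimal sketch, with arbitrary local paths and sizes,
showing two RocksDB instances backed by one shared LRUCache /
WriteBufferManager pair, and that closing one instance leaves the shared
cache usable by the other:

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

public class SharedBlockCacheSketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // One block cache + write buffer manager, shared by every instance below.
        try (LRUCache sharedCache = new LRUCache(64 * 1024 * 1024);
             WriteBufferManager sharedWbm = new WriteBufferManager(32 * 1024 * 1024, sharedCache)) {

            RocksDB db1 = open("/tmp/shared-cache-db1", sharedCache, sharedWbm);
            RocksDB db2 = open("/tmp/shared-cache-db2", sharedCache, sharedWbm);

            db1.put("k1".getBytes(), "v1".getBytes());
            db2.put("k2".getBytes(), "v2".getBytes());

            // Closing one instance should not free the shared cache ...
            db1.close();

            // ... so the second instance keeps working against the same cache.
            db2.put("k3".getBytes(), "v3".getBytes());
            db2.close();
        }
    }

    private static RocksDB open(String path, LRUCache cache, WriteBufferManager wbm)
            throws RocksDBException {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig().setBlockCache(cache);
        // Options are leaked for brevity; a real program should close them after the DB.
        Options options = new Options()
                .setCreateIfMissing(true)
                .setWriteBufferManager(wbm)
                .setTableFormatConfig(tableConfig);
        return RocksDB.open(options, path);
    }
}

Flink's actual wiring differs (the shared objects would be created and
accounted for by the memory manager), but the RocksDB objects being shared
are of the same kinds: a Cache and a WriteBufferManager.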