I agree that the current calculation logic is already complicated. I just think that not using managed memory complicates the memory model even further.
But as I mentioned earlier, both approaches have their pros and cons, so I'll update the proposal to use unmanaged memory. Thanks!

Regards,
Roman

On Thu, Nov 17, 2022 at 3:25 AM Xintong Song <tonysong...@gmail.com> wrote:
> Agreed that the documentation regarding managed memory could be improved.
>
> Just to clarify, breaking isolation is just one of the concerns. I think my biggest concern is the extra complexity in managed memory calculations. Many users have reached out to me, online or offline, asking about the managed memory calculation. Even with my help, it's not easy for all users to understand, which gives me the impression that the current calculation logic is already quite complicated. That's why I'd be hesitant to further complicate it. Treating the shared rocksdb memory as something independent from managed memory would help in that sense.
>
> > - there's one more memory type, in addition to the 8 existing types [2]
> > - its size needs to be calculated manually
>
> Not necessarily. We may consider it as part of the framework off-heap memory. And we can still have some automatic calculations, e.g., allowing it to use up to a certain fraction of the framework off-heap memory.
>
> > - flink code needs to take this into account and "adjust" the weights
>
> We already have Memory/FsStateBackend that does not use managed memory. To exclude the state backend from the managed memory calculations, you just need to return `false` from `StateBackend#useManagedMemory`. That's why I suggest a variant of the rocksdb state backend, where you can reuse most of the original rocksdb state backend code.
>
> Best,
>
> Xintong
>
> On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan <ro...@apache.org> wrote:
> >
> > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for the users to understand compared to "part of the managed memory is isolated and others are not".
> >
> > It looks like users cannot infer that managed memory has the isolation property: neither the documentation [1] nor the FLIP mentions it. I guess this is because isolation is not its most important property (see below).
> >
> > An explicit option would be self-documenting and would let users know what memory is shared and what isn't.
> >
> > From my perspective, the most important property is the shared budget, which avoids:
> > 1) OOM errors when there are too many consumers (i.e. tasks)
> > 2) manual calculation of memory for each type of consumer
> >
> > Both these properties are desirable for shared and non-shared memory, as is not wasting memory if there is no consumer, as you described.
> >
> > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > - there's one more memory type, in addition to the 8 existing types [2]
> > - its size needs to be calculated manually
> > - flink code needs to take this into account and "adjust" the weights
> >
> > Having said that, I'd be fine with either approach as both seem to have pros and cons.
> >
> > What do you think?
> >
> > [1]
> > https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > [2]
> > https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> >
> > Regards,
> > Roman
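For illustration, here is a minimal sketch of the variant state backend suggested above. It is not code from the proposal; the class name is hypothetical, and it assumes the existing `EmbeddedRocksDBStateBackend` can simply be extended. It only shows the managed-memory opt-out via `StateBackend#useManagedMemory`; the actual cross-slot sharing logic (and where the shared memory is then accounted) is the part still under discussion.

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;

    // Hypothetical variant: keeps the regular RocksDB behaviour but excludes itself
    // from the managed memory calculations, so its (shared) memory would have to be
    // accounted for elsewhere, e.g. under framework off-heap memory.
    public class TMSharedRocksDBStateBackend extends EmbeddedRocksDBStateBackend {

        @Override
        public boolean useManagedMemory() {
            // Returning false removes this backend from the managed memory
            // fraction/weight calculations, like Memory/FsStateBackend.
            return false;
        }
    }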
> > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com> wrote:
> > >
> > > Concerning isolation, I think ideally we want everything to be isolated between jobs running in the same cluster (i.e., slots in the same TM). Unfortunately, this is impractical.
> > > - Heap / Off-heap memory are directly allocated / deallocated through the JVM / OS. Flink does not have a good way to cap their usages per slot.
> > > - Network memory does not have the good property of managed memory, that a job can adapt to any given amount of managed memory (with a very small min limitation). We are trying to improve network memory towards that direction, and once achieved it can be isolated as well.
> > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for the users to understand compared to "part of the managed memory is isolated and others are not".
> > >
> > > By waste, I meant reserving a certain amount of memory that is only used by certain use cases that do not always exist. This is exactly what we want to avoid with managed memory in FLIP-49 [1]. We used to have managed memory used only for batch operators, and a containerized cut-off memory (something similar to framework.off-heap) for the rocksdb state backend. The problem was that, if the user does not change the configuration when switching between streaming / batch jobs, there would always be some memory (managed or cut-off) wasted. Similarly, introducing a shared managed memory zone means reserving one more dedicated part of memory that can get wasted in many cases. This is probably a necessary price for this new feature, but let's not break the concept / properties of managed memory for it.
> > >
> > > In your proposal, the fraction for the shared managed memory is 0 by default. That means that to enable the rocksdb memory sharing, users need to manually increase the fraction anyway. Thus, having the memory-sharing rocksdb use managed memory or off-heap memory does not make a significant difference for users of the new feature. I'd think of this as "extra operational overhead for users of a certain new feature" vs. "significant learning cost and potential behavior change for pretty much all users". I'd be fine with having some shortcuts to simplify the configuration on the user side for this new feature, but not with invading the managed memory.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > >
> > > On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <khachatryan.ro...@gmail.com> wrote:
> > > >
> > > > Thanks for your reply Xintong Song,
> > > >
> > > > Could you please elaborate on the following:
> > > >
> > > > > The proposed changes break several good properties that we designed for managed memory.
> > > > > 1. It's isolated across slots
> > > >
> > > > Just to clarify, any way to manage the memory efficiently while capping its usage will break the isolation. It's just a matter of whether it's managed memory or not. Do you see any reasons why unmanaged memory can be shared, and managed memory can not?
> > > >
> > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > >
> > > > If I understand correctly, the managed memory can already be wasted, because it is divided evenly between slots, regardless of the existence of its consumers in a particular slot. And in general, even if every slot has RocksDB / python, equal consumption is not guaranteed. So the current proposal would rather fix this property.
> > > >
> > > > > In addition, it further complicates configuration / computation logics of managed memory.
> > > >
> > > > I think having multiple options overriding each other increases the complexity for the user. As for the computation, I think it's desirable to let Flink do it rather than users.
> > > >
> > > > Both approaches need some help from the TM for:
> > > > - storing the shared resources (a static field in a class might be too dangerous, because if the backend is loaded by the user class loader, then memory will leak silently)
> > > > - reading the configuration
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > I like the idea of sharing RocksDB memory across slots. However, I'm quite concerned by the currently proposed approach.
> > > > >
> > > > > The proposed changes break several good properties that we designed for managed memory:
> > > > > 1. It's isolated across slots
> > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > > > In addition, it further complicates the configuration / computation logics of managed memory.
> > > > >
> > > > > As an alternative, I'd suggest introducing a variant of RocksDBStateBackend that shares memory across slots and does not use managed memory. This basically means the shared memory is not considered part of managed memory. For users of this new feature, they would need to configure how much memory the variant state backend should use, and probably also a larger framework-off-heap / jvm-overhead memory. The latter might require a bit of extra user effort compared to the current approach, but would avoid significant complexity in the managed memory configuration and calculation logics, which affects broader users.
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
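To make the "extra user effort" of this alternative concrete, here is a rough sketch of what such a configuration might look like, expressed with Flink's `Configuration` API. `taskmanager.memory.framework.off-heap.size` and `taskmanager.memory.jvm-overhead.max` are existing Flink options; the last key is purely hypothetical, since the variant backend and its options do not exist yet.

    import org.apache.flink.configuration.Configuration;

    public class UnmanagedSharedRocksDBConfigSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Existing option: enlarge framework off-heap memory so it can host the
            // shared, unmanaged RocksDB memory (the default is only 128mb).
            conf.setString("taskmanager.memory.framework.off-heap.size", "768m");
            // Existing option: possibly leave more JVM overhead headroom as well.
            conf.setString("taskmanager.memory.jvm-overhead.max", "1g");
            // Hypothetical option of the variant backend: how much memory it may
            // share across all of its instances in the TM.
            conf.setString("state.backend.rocksdb.sharing.fixed-memory-size", "512m");
            return conf;
        }
    }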
> > > > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org> wrote:
> > > > > >
> > > > > > Hi John, Yun,
> > > > > >
> > > > > > Thank you for your feedback
> > > > > >
> > > > > > @John
> > > > > >
> > > > > > > It seems like operators would either choose isolation for the cluster’s jobs or they would want to share the memory between jobs. I’m not sure I see the motivation to reserve only part of the memory for sharing and allowing jobs to choose whether they will share or be isolated.
> > > > > >
> > > > > > I see two related questions here:
> > > > > >
> > > > > > 1) Whether to allow mixed workloads within the same cluster.
> > > > > > I agree that most likely all the jobs will have the same "sharing" requirement. So we can drop "state.backend.memory.share-scope" from the proposal.
> > > > > >
> > > > > > 2) Whether to allow different memory consumers to use shared or exclusive memory.
> > > > > > Currently, only RocksDB is proposed to use shared memory. For python, it's non-trivial because it is job-specific. So we have to partition managed memory into shared/exclusive and therefore can NOT replace "taskmanager.memory.managed.shared-fraction" with some boolean flag.
> > > > > >
> > > > > > I think your question was about (1), just wanted to clarify why the shared-fraction is needed.
> > > > > >
> > > > > > @Yun
> > > > > >
> > > > > > > I am just curious whether this could really bring benefits to our users with such complex configuration logic.
> > > > > >
> > > > > > I agree, and configuration complexity seems to be a common concern. I hope that removing "state.backend.memory.share-scope" (as proposed above) reduces the complexity. Please share any ideas of how to simplify it further.
> > > > > >
> > > > > > > Could you share some real experimental results?
> > > > > >
> > > > > > I did an experiment to verify that the approach is feasible, i.e. that multiple jobs can share the same memory/block cache. But I guess that's not what you mean here? Do you have any experiments in mind?
> > > > > >
> > > > > > > BTW, as talked before, I am not sure whether different lifecycles of RocksDB state-backends would affect the memory usage of block cache & write buffer manager in RocksDB. Currently, all instances would start and destroy nearly simultaneously; this would change after we introduce this feature, with jobs running at different times.
> > > > > >
> > > > > > IIUC, the concern is that closing a RocksDB instance might close the BlockCache. I checked that manually, and it seems to work as expected (the shared cache stays usable). Closing it would also contradict the sharing concept described in the documentation [1].
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/facebook/rocksdb/wiki/Block-Cache
> > > > > >
> > > > > > Regards,
> > > > > > Roman
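The manual check mentioned above (closing one RocksDB instance while another still uses the shared block cache) can be reproduced with plain RocksJava, independently of Flink. A self-contained sketch, with arbitrary paths and sizes:

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Cache;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.WriteBufferManager;

    public class SharedBlockCacheCheck {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            // One block cache and one write buffer manager, shared by both instances.
            try (Cache cache = new LRUCache(256 * 1024 * 1024);
                 WriteBufferManager wbm = new WriteBufferManager(64 * 1024 * 1024, cache);
                 Options options1 = newOptions(cache, wbm);
                 Options options2 = newOptions(cache, wbm);
                 RocksDB db1 = RocksDB.open(options1, "/tmp/shared-cache-db1");
                 RocksDB db2 = RocksDB.open(options2, "/tmp/shared-cache-db2")) {

                db1.put("k1".getBytes(), "v1".getBytes());
                db2.put("k2".getBytes(), "v2".getBytes());

                // Close the first instance; the shared cache must stay usable.
                db1.close();
                db2.put("k3".getBytes(), "v3".getBytes());
                System.out.println(new String(db2.get("k2".getBytes())));
            }
        }

        private static Options newOptions(Cache cache, WriteBufferManager wbm) {
            return new Options()
                    .setCreateIfMissing(true)
                    .setWriteBufferManager(wbm)
                    .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache));
        }
    }

The expectation, in line with the sharing concept described in [1], is that db2 keeps working after db1 is closed.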
> > > > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Roman,
> > > > > > > Thanks for the proposal, this allows the state backend to make better use of memory.
> > > > > > >
> > > > > > > After reading the ticket, I'm curious about some points:
> > > > > > >
> > > > > > > 1. Is shared memory only for the state backend? If both "taskmanager.memory.managed.shared-fraction: >0" and "state.backend.rocksdb.memory.managed: false" are set at the same time, will the shared memory be wasted?
> > > > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory and will compete with each other" in the example; how is the memory size of the unmanaged part calculated?
> > > > > > > 3. For fine-grained resource management, the control of cpuCores and taskHeapMemory can still work, right? And I am a little worried that too many memory-related configuration options are complicated for users to understand.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Yanfei
> > > > > > >
> > > > > > > On Tue, Nov 8, 2022 at 23:22 Roman Khachatryan <ro...@apache.org> wrote:
> > > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I'd like to discuss sharing RocksDB memory across slots as proposed in FLINK-29928 [1].
> > > > > > > >
> > > > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > > > - share RocksDB memory objects (block cache, write buffer manager) among the RocksDB instances of the same slot
> > > > > > > > - bound the total memory usage by all RocksDB instances of a TM
> > > > > > > >
> > > > > > > > However, the memory is divided between the slots equally (unless using fine-grained resource control). This is sub-optimal if some slots contain more memory-intensive tasks than others. Using fine-grained resource control is also often not an option, because the workload might not be known in advance.
> > > > > > > >
> > > > > > > > The proposal is to widen the scope of memory sharing to the TM, so that memory can be shared across all RocksDB instances of that TM. That would reduce the overall memory consumption at the cost of resource isolation.
> > > > > > > >
> > > > > > > > Please see FLINK-29928 [1] for more details.
> > > > > > > >
> > > > > > > > Looking forward to feedback on that proposal.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > > > [2]
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-7289
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Roman
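For comparison with the sketch of the unmanaged alternative above, here is a similar sketch of how enabling the proposed TM-wide sharing might look if it stays within managed memory. Only option names that appear in this thread are used: "state.backend.rocksdb.memory.managed" and "taskmanager.memory.managed.size" already exist, while "taskmanager.memory.managed.shared-fraction" is proposal-stage (default 0 per the discussion above), so its final name and semantics may change.

    import org.apache.flink.configuration.Configuration;

    public class SharedManagedMemoryConfigSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Existing option: keep RocksDB on managed memory so it can take part in sharing.
            conf.setString("state.backend.rocksdb.memory.managed", "true");
            // Existing option: total managed memory of the TaskManager.
            conf.setString("taskmanager.memory.managed.size", "2g");
            // Proposed option (default 0, i.e. sharing disabled): fraction of managed
            // memory carved out as a TM-wide pool shared by all RocksDB instances.
            conf.setString("taskmanager.memory.managed.shared-fraction", "0.25");
            return conf;
        }
    }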