Thanks, Roman~
Best,
Xintong

On Thu, Nov 17, 2022 at 10:56 PM Khachatryan Roman <khachatryan.ro...@gmail.com> wrote:

> I agree, the current calculation logic is already complicated. I just think that not using managed memory complicates the memory model even further.
>
> But as I mentioned earlier, both approaches have their pros and cons, so I'll update the proposal to use unmanaged memory.
>
> Thanks!
>
> Regards,
> Roman
>
> On Thu, Nov 17, 2022 at 3:25 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> > Agreed that the documentation regarding managed memory could be improved.
> >
> > Just to clarify, breaking isolation is just one of the concerns. I think my biggest concern is the extra complexity in managed memory calculations. Many users have reached out to me, online or offline, asking about the managed memory calculation. Even with my help, it's not easy for all users to understand, which gives me the impression that the current calculation logic is already quite complicated. That's why I'd be hesitant to complicate it further. Treating the shared RocksDB memory as something independent from managed memory would help in that sense.
> >
> > > - there's one more memory type, in addition to the 8 existing types [2]
> > > - its size needs to be calculated manually
> >
> > Not necessarily. We may consider it part of the framework off-heap memory. And we can still have some automatic calculations, e.g., allowing it to use up to a certain fraction of the framework off-heap memory.
> >
> > > - flink code needs to take this into account and "adjust" the weights
> >
> > We already have Memory/FsStateBackend, which does not use managed memory. To exclude a state backend from the managed memory calculations, you just need to return `false` from `StateBackend#useManagedMemory`. That's why I suggest a variant of the rocksdb state backend, where you can reuse most of the original rocksdb state backend code.
> >
> > Best,
> > Xintong
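
For illustration, a minimal sketch of the state backend variant suggested above might look like the following. The class name is hypothetical, and it assumes such a variant can simply extend EmbeddedRocksDBStateBackend; the actual design would be up to a FLIP.

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;

    /** Hypothetical RocksDB state backend variant that opts out of managed memory. */
    public class SharedMemoryRocksDBStateBackend extends EmbeddedRocksDBStateBackend {

        @Override
        public boolean useManagedMemory() {
            // Excludes this backend from the managed-memory weight calculation;
            // its shared cache would then have to be budgeted elsewhere,
            // e.g. in framework off-heap memory.
            return false;
        }
    }
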
> > On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan <ro...@apache.org> wrote:
> >
> > > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for the users to understand compared to "part of the managed memory is isolated and others are not".
> > >
> > > It looks like users cannot infer that managed memory has the isolation property: neither the documentation [1] nor the FLIP mentions it. I guess this is because isolation is not its most important property (see below).
> > >
> > > An explicit option would be self-documenting and would let users know which memory is shared and which isn't.
> > >
> > > From my perspective, the most important property is the shared budget, which avoids:
> > > 1) OOM errors when there are too many consumers (i.e. tasks)
> > > 2) manual calculation of memory for each type of consumer
> > >
> > > Both of these properties are desirable for shared and non-shared memory, as is not wasting memory if there is no consumer, as you described.
> > >
> > > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > > - there's one more memory type, in addition to the 8 existing types [2]
> > > - its size needs to be calculated manually
> > > - flink code needs to take this into account and "adjust" the weights
> > >
> > > Having said that, I'd be fine with either approach, as both seem to have pros and cons.
> > >
> > > What do you think?
> > >
> > > [1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > > [2] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> > >
> > > Regards,
> > > Roman
> > >
> > > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com> wrote:
> > >
> > > > Concerning isolation, I think ideally we want everything to be isolated between jobs running in the same cluster (i.e., slots in the same TM). Unfortunately, this is impractical.
> > > > - Heap / off-heap memory is directly allocated / deallocated through the JVM / OS. Flink does not have a good way to cap its usage per slot.
> > > > - Network memory does not have the good property of managed memory that a job can adapt to any given amount of it (above a very small minimum). We are trying to improve network memory in that direction, and once that is achieved it can be isolated as well.
> > > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for users to understand compared to "part of the managed memory is isolated and others are not".
> > > >
> > > > By waste, I meant reserving a certain amount of memory that is only used by certain use cases that do not always exist. This is exactly what we want to avoid with managed memory in FLIP-49 [1]. We used to have managed memory that was only used for batch operators, and a containerized cut-off memory (something similar to framework off-heap) for the rocksdb state backend. The problem was that, if the user did not change the configuration when switching between streaming / batch jobs, there would always be some memory (managed or cut-off) wasted. Similarly, introducing a shared managed memory zone means reserving one more dedicated part of memory that can get wasted in many cases. This is probably a necessary price for this new feature, but let's not break the concept / properties of managed memory for it.
> > > >
> > > > In your proposal, the fraction for the shared managed memory is 0 by default. That means that to enable the rocksdb memory sharing, users need to manually increase the fraction anyway. Thus, having the memory-sharing rocksdb use managed memory or off-heap memory does not make a significant difference for users of the new feature. I'd think of this as "extra operational overhead for users of a certain new feature" vs. "significant learning cost and potential behavior change for pretty much all users". I'd be fine with having some shortcuts to simplify the configuration on the user side for this new feature, but not with invading the managed memory.
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
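
To make the trade-off concrete, enabling the proposed sharing would roughly look like the following on the user side. This is only a sketch: taskmanager.memory.managed.shared-fraction is the key proposed in FLINK-29928, not an existing option, whereas taskmanager.memory.managed.fraction already exists.

    import org.apache.flink.configuration.Configuration;

    public class SharedFractionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Proposed key from FLINK-29928 (default 0): nothing is shared until
            // the user explicitly raises the fraction.
            conf.setString("taskmanager.memory.managed.shared-fraction", "0.3");
            // Existing option: fraction of Flink memory used as managed memory.
            conf.setString("taskmanager.memory.managed.fraction", "0.4");
        }
    }
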
> > > > On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <khachatryan.ro...@gmail.com> wrote:
> > > >
> > > > > Thanks for your reply, Xintong Song.
> > > > >
> > > > > Could you please elaborate on the following:
> > > > >
> > > > > > The proposed changes break several good properties that we designed for managed memory.
> > > > > > 1. It's isolated across slots
> > > > >
> > > > > Just to clarify, any way of managing the memory efficiently while capping its usage will break the isolation. It's just a matter of whether it's managed memory or not. Do you see any reason why unmanaged memory can be shared but managed memory cannot?
> > > > >
> > > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > > >
> > > > > If I understand correctly, managed memory can already be wasted, because it is divided evenly between slots regardless of whether its consumers exist in a particular slot. And in general, even if every slot has RocksDB / Python, equal consumption is not guaranteed. So this property would rather be fixed by the current proposal.
> > > > >
> > > > > > In addition, it further complicates configuration / computation logics of managed memory.
> > > > >
> > > > > I think having multiple options overriding each other increases the complexity for the user. As for the computation, I think it's desirable to let Flink do it rather than the users.
> > > > >
> > > > > Both approaches need some help from the TM for:
> > > > > - storing the shared resources (a static field in a class might be too dangerous, because if the backend is loaded by the user class loader then memory will leak silently)
> > > > > - reading the configuration
> > > > >
> > > > > Regards,
> > > > > Roman
> > > > >
> > > > > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > > I like the idea of sharing RocksDB memory across slots. However, I'm quite concerned by the currently proposed approach.
> > > > > >
> > > > > > The proposed changes break several good properties that we designed for managed memory.
> > > > > > 1. It's isolated across slots
> > > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > > > > In addition, it further complicates the configuration / computation logics of managed memory.
> > > > > >
> > > > > > As an alternative, I'd suggest introducing a variant of RocksDBStateBackend that shares memory across slots and does not use managed memory. This basically means the shared memory is not considered part of managed memory. Users of this new feature would need to configure how much memory the variant state backend should use, and probably also a larger framework off-heap / JVM overhead memory. The latter might require a bit of extra user effort compared to the current approach, but it would avoid significant complexity in the managed memory configuration and calculation logics, which affects a broader set of users.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
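
As a sketch of what this alternative could mean on the user side: state.backend.rocksdb.shared-memory.size is a purely hypothetical key used here for illustration, whereas taskmanager.memory.framework.off-heap.size is an existing option that would need to be enlarged accordingly.

    import org.apache.flink.configuration.Configuration;

    public class UnmanagedSharingSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical option (illustration only): how much memory all RocksDB
            // instances of the TM may share outside of managed memory.
            conf.setString("state.backend.rocksdb.shared-memory.size", "1g");
            // Existing option: enlarge framework off-heap memory to account for the
            // shared cache living outside managed memory.
            conf.setString("taskmanager.memory.framework.off-heap.size", "1280m");
        }
    }
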
> > > > > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org> wrote:
> > > > > >
> > > > > > > Hi John, Yun,
> > > > > > >
> > > > > > > Thank you for your feedback.
> > > > > > >
> > > > > > > @John
> > > > > > >
> > > > > > > > It seems like operators would either choose isolation for the cluster's jobs or they would want to share the memory between jobs. I'm not sure I see the motivation to reserve only part of the memory for sharing and allowing jobs to choose whether they will share or be isolated.
> > > > > > >
> > > > > > > I see two related questions here:
> > > > > > >
> > > > > > > 1) Whether to allow mixed workloads within the same cluster. I agree that most likely all the jobs will have the same "sharing" requirement, so we can drop "state.backend.memory.share-scope" from the proposal.
> > > > > > >
> > > > > > > 2) Whether to allow different memory consumers to use shared or exclusive memory. Currently, only RocksDB is proposed to use shared memory. For Python, it's non-trivial because it is job-specific. So we have to partition managed memory into shared/exclusive, and therefore we can NOT replace "taskmanager.memory.managed.shared-fraction" with some boolean flag.
> > > > > > >
> > > > > > > I think your question was about (1); I just wanted to clarify why the shared-fraction is needed.
> > > > > > >
> > > > > > > @Yun
> > > > > > >
> > > > > > > > I am just curious whether this could really bring benefits to our users with such complex configuration logic.
> > > > > > >
> > > > > > > I agree, and configuration complexity seems to be a common concern. I hope that removing "state.backend.memory.share-scope" (as proposed above) reduces the complexity. Please share any ideas on how to simplify it further.
> > > > > > >
> > > > > > > > Could you share some real experimental results?
> > > > > > >
> > > > > > > I did an experiment to verify that the approach is feasible, i.e. that multiple jobs can share the same memory / block cache. But I guess that's not what you mean here? Do you have any experiments in mind?
> > > > > > >
> > > > > > > > BTW, as talked before, I am not sure whether different lifecycles of RocksDB state backends would affect the memory usage of the block cache & write buffer manager in RocksDB.
> > > > > > > > Currently, all instances would start and destroy nearly simultaneously; this would change after we introduce this feature, with jobs running at different scheduling times.
> > > > > > >
> > > > > > > IIUC, the concern is that closing a RocksDB instance might close the BlockCache. I checked that manually, and it seems to work as expected. And I think such behavior would contradict the sharing concept as described in the documentation [1].
> > > > > > >
> > > > > > > [1] https://github.com/facebook/rocksdb/wiki/Block-Cache
> > > > > > >
> > > > > > > Regards,
> > > > > > > Roman
> > > > > > >
> > > > > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Roman,
> > > > > > > >
> > > > > > > > Thanks for the proposal; this allows the state backend to make better use of memory.
> > > > > > > >
> > > > > > > > After reading the ticket, I'm curious about some points:
> > > > > > > >
> > > > > > > > 1. Is the shared memory only for the state backend? If both "taskmanager.memory.managed.shared-fraction: >0" and "state.backend.rocksdb.memory.managed: false" are set at the same time, will the shared memory be wasted?
> > > > > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory and will compete with each other" in the example; how is the size of the unmanaged part calculated?
> > > > > > > > 3. For fine-grained resource management, the control of cpuCores and taskHeapMemory can still work, right? And I am a little worried that having too many memory-related configuration options is complicated for users to understand.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Yanfei
> > > > > > > >
> > > > > > > > On Tue, Nov 8, 2022 at 23:22, Roman Khachatryan <ro...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I'd like to discuss sharing RocksDB memory across slots as proposed in FLINK-29928 [1].
> > > > > > > > >
> > > > > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > > > > - share these objects (the RocksDB block cache and write buffer manager) among RocksDB instances of the same slot
> > > > > > > > > - bound the total memory usage of all RocksDB instances of a TM
> > > > > > > > >
> > > > > > > > > However, the memory is divided between the slots equally (unless fine-grained resource control is used). This is sub-optimal if some slots contain more memory-intensive tasks than others. Using fine-grained resource control is also often not an option, because the workload might not be known in advance.
> > > > > > > > >
> > > > > > > > > The proposal is to widen the scope of sharing to the TM, so that memory can be shared across all RocksDB instances of that TM. That would reduce the overall memory consumption in exchange for resource isolation.
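
For context, here is a minimal RocksJava sketch (plain RocksDB API, not Flink code) of the underlying mechanism referenced here: two database instances share one block cache and one write buffer manager, so their combined memory usage stays within a single budget.

    import org.rocksdb.*;

    public class SharedCacheSketch {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();

            // One cache and one write buffer manager shared by both instances.
            Cache cache = new LRUCache(256 * 1024 * 1024);
            WriteBufferManager wbm = new WriteBufferManager(64 * 1024 * 1024, cache);

            try (RocksDB db1 = open("/tmp/db1", cache, wbm);
                 RocksDB db2 = open("/tmp/db2", cache, wbm)) {
                db1.put("key".getBytes(), "value".getBytes());
                db2.put("key".getBytes(), "value".getBytes());
            }
        }

        private static RocksDB open(String path, Cache cache, WriteBufferManager wbm)
                throws RocksDBException {
            Options options = new Options()
                    .setCreateIfMissing(true)
                    .setWriteBufferManager(wbm)
                    .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache));
            return RocksDB.open(options, path);
        }
    }
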
> > > > > > > > > Please see FLINK-29928 [1] for more details.
> > > > > > > > >
> > > > > > > > > Looking forward to feedback on that proposal.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-7289
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Roman