I agree, the current calculation logic is already complicated.
I just think that not using managed memory complicates the memory model
even further.

But as I mentioned earlier, both approaches have their pros and cons, so
I'll update the proposal to use unmanaged memory.

Thanks!

Regards,
Roman


On Thu, Nov 17, 2022 at 3:25 AM Xintong Song <tonysong...@gmail.com> wrote:

> Agreed that the documentation regarding managed memory could be improved.
>
> Just to clarify, breaking isolation is just one of the concerns. I think my
> biggest concern is the extra complexity in managed memory calculations.
> Many users have reached out to me, online or offline, asking about the
> managed memory calculation. Even with my help, it's not easy for all users to
> understand, which gives me the impression that the current calculation logic
> is already quite complicated. That's why I'd be hesitant to further
> complicate it. Treating the shared rocksdb memory as something independent
> of managed memory would help in that sense.
>
> > - there's one more memory type, in addition to 8 existing types [2]
> > - its size needs to be calculated manually
>
> Not necessarily. We may consider it as part of the framework.off-heap
> memory. And we can still have some automatic calculations, e.g., allowing
> it to use up to a certain fraction of the framework off-heap memory.
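> For instance (numbers purely illustrative): with 1 GiB of framework
> off-heap memory and a fraction of 0.25, the shared RocksDB memory would be
> capped at 256 MiB, without touching the managed memory calculations.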
>
> > - flink code needs to take this into account and "adjust" the weights
>
> We already have Memory/FsStateBackend, which do not use managed memory. To
> exclude a state backend from the managed memory calculations, you just
> need to return `false` from `StateBackend#useManagedMemory`. That's why I
> suggest a variant of the rocksdb state backend, where you can reuse most of
> the original rocksdb state backend code.
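>
> For illustration, a minimal sketch of such a variant (the class name is
> made up here; it assumes EmbeddedRocksDBStateBackend can be subclassed and
> its useManagedMemory() overridden, and it omits all of the actual
> shared-memory wiring):
>
>     import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
>
>     // Hypothetical variant that opts out of managed memory, so its shared
>     // memory is accounted for outside the managed memory calculations.
>     public class SharedMemoryRocksDBStateBackend extends EmbeddedRocksDBStateBackend {
>
>         @Override
>         public boolean useManagedMemory() {
>             // Same effect as Memory/FsStateBackend: this backend is excluded
>             // from the managed memory fraction calculations.
>             return false;
>         }
>     }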
>
> Best,
>
> Xintong
>
>
>
> On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> > > I think not being able to isolate all kinds of memory does not mean we
> > > should give up the isolation on all kinds of memory. And I believe
> > > "managed memory is isolated and others are not" is much easier for the
> > > users to understand compared to "part of the managed memory is isolated
> > > and others are not".
> >
> > It looks like the users cannot infer that managed memory has the isolation
> > property: neither the documentation [1] nor the FLIP mentions that. I guess
> > this is because isolation is not its most important property (see below).
> >
> > An explicit option would be self-documenting and would let the users know
> > what memory is shared and what isn't.
> >
> > From my perspective, the most important property is the shared budget,
> > which avoids:
> > 1) OOM errors when there are too many consumers (i.e. tasks)
> > 2) manual calculation of memory for each type of consumer
> >
> > Both of these are desirable for shared and non-shared memory alike, as is
> > not wasting memory when there is no consumer, as you described.
> >
> > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > - there's one more memory type, in addition to 8 existing types [2]
> > - its size needs to be calculated manually
> > - flink code needs to take this into account and "adjust" the weights
> >
> > Having said that, I'd be fine with either approach as both seem to have
> > pros and cons.
> >
> > What do you think?
> >
> > [1]
> > https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > [2]
> > https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> >
> > Regards,
> > Roman
> >
> >
> > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Concerning isolation, I think ideally we want everything to be isolated
> > > between jobs running in the same cluster (i.e., slots in the same TM).
> > > Unfortunately, this is impractical.
> > > - Heap / off-heap memory is directly allocated / deallocated through the
> > > JVM / OS. Flink does not have a good way to cap its usage per slot.
> > > - Network memory does not have the nice property of managed memory, that
> > > a job can adapt to any given amount of it (above a very small minimum).
> > > We are trying to improve network memory in that direction, and once that
> > > is achieved it can be isolated as well.
> > > I think not being able to isolate all kinds of memory does not mean we
> > > should give up the isolation on all kinds of memory. And I believe
> > > "managed memory is isolated and others are not" is much easier for the
> > > users to understand compared to "part of the managed memory is isolated
> > > and others are not".
> > >
> > > By waste, I meant reserving a certain amount of memory that is only used
> > > by certain use cases that do not always exist. This is exactly what we
> > > want to avoid with managed memory in FLIP-49 [1]. We used to have managed
> > > memory only used for batch operators, and a containerized-cut-off memory
> > > (something similar to framework.off-heap) for the rocksdb state backend.
> > > The problem was that, if the user does not change the configuration when
> > > switching between streaming / batch jobs, there would always be some
> > > memory (managed or cut-off) wasted. Similarly, introducing a shared
> > > managed memory zone means reserving one more dedicated part of memory
> > > that can get wasted in many cases. This is probably a necessary price for
> > > this new feature, but let's not break the concept / properties of managed
> > > memory for it.
> > >
> > > In your proposal, the fraction for the shared managed memory is 0 by
> > > default. That means that to enable the rocksdb memory sharing, users need
> > > to manually increase the fraction anyway. Thus, having the memory-sharing
> > > rocksdb use managed memory or off-heap memory does not make a significant
> > > difference for the users of the new feature. I'd think of this as "extra
> > > operational overhead for users of a certain new feature" vs. "significant
> > > learning cost and potential behavior change for pretty much all users".
> > > I'd be fine with having some shortcuts to simplify the configuration on
> > > the user side for this new feature, but not with invading the managed
> > > memory.
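> > > (For illustration only, with made-up numbers and my reading of the
> > > proposed taskmanager.memory.managed.shared-fraction: with 4 GiB of
> > > managed memory and a shared-fraction of 0.25, 1 GiB would form the
> > > TM-wide shared pool and the remaining 3 GiB would be divided among the
> > > slots as today.)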
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > >
> > > On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
> > > khachatryan.ro...@gmail.com> wrote:
> > >
> > > > Thanks for your reply, Xintong Song,
> > > >
> > > > Could you please elaborate on the following:
> > > >
> > > > > The proposed changes break several good properties that we designed
> > > > > for managed memory.
> > > > > 1. It's isolated across slots
> > > > Just to clarify, any way to manage the memory efficiently while capping
> > > > its usage will break the isolation. It's just a matter of whether it's
> > > > managed memory or not.
> > > > Do you see any reasons why unmanaged memory can be shared, and managed
> > > > memory can not?
> > > >
> > > > > 2. It should never be wasted (unless there's nothing in the job that
> > > > > needs managed memory)
> > > > If I understand correctly, the managed memory can already be wasted,
> > > > because it is divided evenly between the slots regardless of whether a
> > > > particular slot has any consumers of it.
> > > > And in general, even if every slot has RocksDB / python, their
> > > > consumption is not guaranteed to be equal.
> > > > So the current proposal would rather fix this property than break it.
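> > > > (For example, with 1 GiB of managed memory and 4 slots, each slot gets
> > > > 256 MiB; a slot that runs only stateless tasks still holds its share,
> > > > which nothing in that slot can use.)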
> > > >
> > > > > In addition, it further complicates the configuration / computation
> > > > > logic of managed memory.
> > > > I think having multiple options overriding each other increases the
> > > > complexity for the user. As for the computation, I think it's desirable
> > > > to let Flink do it rather than the users.
> > > >
> > > > Both approaches need some help from the TM for:
> > > > - storing the shared resources (a static field in a class might be too
> > > > dangerous, because if the backend is loaded by the user class loader,
> > > > the memory will leak silently).
> > > > - reading the configuration
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > >
> > > > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com>
> > > > wrote:
> > > >
> > > > > I like the idea of sharing RocksDB memory across slots. However, I'm
> > > > > quite concerned by the currently proposed approach.
> > > > >
> > > > > The proposed changes break several good properties that we designed
> > > > > for managed memory.
> > > > > 1. It's isolated across slots
> > > > > 2. It should never be wasted (unless there's nothing in the job that
> > > > > needs managed memory)
> > > > > In addition, it further complicates the configuration / computation
> > > > > logic of managed memory.
> > > > >
> > > > > As an alternative, I'd suggest introducing a variant of
> > > > > RocksDBStateBackend that shares memory across slots and does not use
> > > > > managed memory. This basically means the shared memory is not
> > > > > considered part of managed memory. Users of this new feature would
> > > > > need to configure how much memory the variant state backend should
> > > > > use, and probably also a larger framework-off-heap / jvm-overhead
> > > > > memory. The latter might require a bit of extra user effort compared
> > > > > to the current approach, but would avoid significant complexity in
> > > > > the managed memory configuration and calculation logic, which affects
> > > > > a broader set of users.
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi John, Yun,
> > > > > >
> > > > > > Thank you for your feedback
> > > > > >
> > > > > > @John
> > > > > >
> > > > > > > It seems like operators would either choose isolation for the
> > > > > > > cluster’s jobs or they would want to share the memory between jobs.
> > > > > > > I’m not sure I see the motivation to reserve only part of the
> > > > > > > memory for sharing and allowing jobs to choose whether they will
> > > > > > > share or be isolated.
> > > > > >
> > > > > > I see two related questions here:
> > > > > >
> > > > > > 1) Whether to allow mixed workloads within the same cluster.
> > > > > > I agree that most likely all the jobs will have the same "sharing"
> > > > > > requirement. So we can drop "state.backend.memory.share-scope" from
> > > > > > the proposal.
> > > > > >
> > > > > > 2) Whether to allow different memory consumers to use shared or
> > > > > > exclusive memory.
> > > > > > Currently, only RocksDB is proposed to use shared memory. For
> > > > > > python, it's non-trivial because it is job-specific.
> > > > > > So we have to partition managed memory into shared/exclusive and
> > > > > > therefore can NOT replace "taskmanager.memory.managed.shared-fraction"
> > > > > > with some boolean flag.
> > > > > >
> > > > > > I think your question was about (1); I just wanted to clarify why
> > > > > > the shared-fraction is needed.
> > > > > >
> > > > > > @Yun
> > > > > >
> > > > > > > I am just curious whether this could really bring benefits to our
> > > > > > > users with such complex configuration logic.
> > > > > > I agree, and configuration complexity seems to be a common concern.
> > > > > > I hope that removing "state.backend.memory.share-scope" (as proposed
> > > > > > above) reduces the complexity.
> > > > > > Please share any ideas on how to simplify it further.
> > > > > >
> > > > > > > Could you share some real experimental results?
> > > > > > I did an experiment to verify that the approach is feasible,
> > > > > > i.e. multiple jobs can share the same memory/block cache.
> > > > > > But I guess that's not what you mean here? Do you have any
> > > > > > experiments in mind?
> > > > > >
> > > > > > > BTW, as discussed before, I am not sure whether the different
> > > > > > > lifecycles of RocksDB state backends would affect the memory usage
> > > > > > > of the block cache & write buffer manager in RocksDB.
> > > > > > > Currently, all instances start and are destroyed nearly
> > > > > > > simultaneously; this would change after we introduce this feature,
> > > > > > > with jobs running at different scheduling times.
> > > > > > IIUC, the concern is that closing a RocksDB instance might close the
> > > > > > shared BlockCache.
> > > > > > I checked that manually and it seems to work as expected: the shared
> > > > > > cache stays usable. Closing it would also contradict the sharing
> > > > > > concept, as described in the documentation [1].
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/facebook/rocksdb/wiki/Block-Cache
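> > > > > >
> > > > > > For reference, here is a minimal standalone RocksJava sketch of what
> > > > > > that manual check looks like (plain RocksJava API, not the Flink
> > > > > > wiring; paths and cache size are arbitrary):
> > > > > >
> > > > > >     import org.rocksdb.*;
> > > > > >
> > > > > >     public class SharedCacheCheck {
> > > > > >         public static void main(String[] args) throws RocksDBException {
> > > > > >             RocksDB.loadLibrary();
> > > > > >             // One LRUCache instance shared by two independent RocksDB instances.
> > > > > >             Cache sharedCache = new LRUCache(64 * 1024 * 1024);
> > > > > >
> > > > > >             Options opts1 = new Options()
> > > > > >                 .setCreateIfMissing(true)
> > > > > >                 .setTableFormatConfig(
> > > > > >                     new BlockBasedTableConfig().setBlockCache(sharedCache));
> > > > > >             Options opts2 = new Options()
> > > > > >                 .setCreateIfMissing(true)
> > > > > >                 .setTableFormatConfig(
> > > > > >                     new BlockBasedTableConfig().setBlockCache(sharedCache));
> > > > > >
> > > > > >             RocksDB db1 = RocksDB.open(opts1, "/tmp/shared-cache-db1");
> > > > > >             RocksDB db2 = RocksDB.open(opts2, "/tmp/shared-cache-db2");
> > > > > >             db1.put("k".getBytes(), "v1".getBytes());
> > > > > >             db2.put("k".getBytes(), "v2".getBytes());
> > > > > >
> > > > > >             // Closing one instance must not invalidate the shared cache.
> > > > > >             db1.close();
> > > > > >             System.out.println(new String(db2.get("k".getBytes())));
> > > > > >             db2.close();
> > > > > >         }
> > > > > >     }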
> > > > > >
> > > > > > Regards,
> > > > > > Roman
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Roman,
> > > > > > > Thanks for the proposal; this allows the state backend to make
> > > > > > > better use of memory.
> > > > > > >
> > > > > > > After reading the ticket, I'm curious about some points:
> > > > > > >
> > > > > > > 1. Is the shared memory only for the state backend? If both
> > > > > > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > > > > > "state.backend.rocksdb.memory.managed: false" are set at the same
> > > > > > > time, will the shared memory be wasted?
> > > > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of
> > > > > > > unmanaged memory and will compete with each other" in the example;
> > > > > > > how is the size of the unmanaged part calculated?
> > > > > > > 3. For fine-grained resource management, the control of cpuCores
> > > > > > > and taskHeapMemory can still work, right? And I am a little
> > > > > > > worried that too many memory-related configuration options are
> > > > > > > complicated for users to understand.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Yanfei
> > > > > > >
> > > > > > > Roman Khachatryan <ro...@apache.org> wrote on Tue, Nov 8, 2022 at 23:22:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I'd like to discuss sharing RocksDB memory across slots as
> > > > > > > > proposed in FLINK-29928 [1].
> > > > > > > >
> > > > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > > > - share the RocksDB memory objects (block cache, write buffer
> > > > > > > > manager) among RocksDB instances of the same slot
> > > > > > > > - bound the total memory usage of all RocksDB instances of a TM
> > > > > > > >
> > > > > > > > However, the memory is divided between the slots equally (unless
> > > > > > > > fine-grained resource control is used). This is sub-optimal if
> > > > > > > > some slots contain more memory-intensive tasks than others.
> > > > > > > > Using fine-grained resource control is also often not an option,
> > > > > > > > because the workload might not be known in advance.
> > > > > > > >
> > > > > > > > The proposal is to widen the scope of memory sharing to the TM,
> > > > > > > > so that it can be shared across all RocksDB instances of that TM.
> > > > > > > > That would reduce the overall memory consumption at the cost of
> > > > > > > > resource isolation.
> > > > > > > >
> > > > > > > > Please see FLINK-29928 [1] for more details.
> > > > > > > >
> > > > > > > > Looking forward to feedback on that proposal.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > > > [2]
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-7289
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Roman
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
