Thanks, Roman~
Best,
Xintong

On Thu, Nov 17, 2022 at 10:56 PM Khachatryan Roman <khachatryan.ro...@gmail.com> wrote:

> I agree, the current calculation logic is already complicated. I just think that not using managed memory complicates the memory model even further.
>
> But as I mentioned earlier, both approaches have their pros and cons, so I'll update the proposal to use unmanaged memory.
>
> Thanks!
>
> Regards,
> Roman
>
> On Thu, Nov 17, 2022 at 3:25 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> > Agreed that the documentation regarding managed memory could be improved.
> >
> > Just to clarify, breaking isolation is just one of the concerns. I think my biggest concern is the extra complexity in managed memory calculations. Many users have reached out to me, online or offline, asking about the managed memory calculation. Even with my help, it's not easy for all users to understand, which gives me the impression that the current calculation logic is already quite complicated. That's why I'd be hesitant to complicate it further. Treating the shared RocksDB memory as something independent from managed memory would help in that sense.
> >
> > > - there's one more memory type, in addition to the 8 existing types [2]
> > > - its size needs to be calculated manually
> >
> > Not necessarily. We may consider it part of the framework off-heap memory. And we can still have some automatic calculations, e.g., allowing it to use up to a certain fraction of the framework off-heap memory.
> >
> > > - flink code needs to take this into account and "adjust" the weights
> >
> > We already have Memory/FsStateBackend, which does not use managed memory. To exclude a state backend from the managed memory calculations, you just need to return `false` from `StateBackend#useManagedMemory`. That's why I suggest a variant of the rocksdb state backend, where you can reuse most of the original rocksdb state backend code.
> >
> > Best,
> > Xintong
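
For illustration, a minimal sketch of the state backend variant suggested above might look like the following. The class name is hypothetical, and it assumes such a variant can simply extend EmbeddedRocksDBStateBackend; the actual design would be up to a FLIP.

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;

    /** Hypothetical RocksDB state backend variant that opts out of managed memory. */
    public class SharedMemoryRocksDBStateBackend extends EmbeddedRocksDBStateBackend {

        @Override
        public boolean useManagedMemory() {
            // Excludes this backend from the managed-memory weight calculation;
            // its shared cache would then have to be budgeted elsewhere,
            // e.g. in framework off-heap memory.
            return false;
        }
    }
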
> > On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan <ro...@apache.org> wrote:
> >
> > > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for the users to understand compared to "part of the managed memory is isolated and others are not".
> > >
> > > It looks like users cannot infer that managed memory has the isolation property: neither the documentation [1] nor the FLIP mentions it. I guess this is because isolation is not its most important property (see below).
> > >
> > > An explicit option would be self-documenting and would let users know which memory is shared and which isn't.
> > >
> > > From my perspective, the most important property is the shared budget, which avoids:
> > > 1) OOM errors when there are too many consumers (i.e. tasks)
> > > 2) manual calculation of memory for each type of consumer
> > >
> > > Both of these properties are desirable for shared and non-shared memory, as is not wasting memory if there is no consumer, as you described.
> > >
> > > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > > - there's one more memory type, in addition to the 8 existing types [2]
> > > - its size needs to be calculated manually
> > > - flink code needs to take this into account and "adjust" the weights
> > >
> > > Having said that, I'd be fine with either approach, as both seem to have pros and cons.
> > >
> > > What do you think?
> > >
> > > [1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > > [2] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> > >
> > > Regards,
> > > Roman
> > >
> > > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com> wrote:
> > >
> > > > Concerning isolation, I think ideally we want everything to be isolated between jobs running in the same cluster (i.e., slots in the same TM). Unfortunately, this is impractical.
> > > > - Heap / off-heap memory is directly allocated / deallocated through the JVM / OS. Flink does not have a good way to cap its usage per slot.
> > > > - Network memory does not have the good property of managed memory that a job can adapt to any given amount of it (above a very small minimum). We are trying to improve network memory in that direction, and once that is achieved it can be isolated as well.
> > > > I think not being able to isolate all kinds of memory does not mean we should give up the isolation on all kinds of memory. And I believe "managed memory is isolated and others are not" is much easier for users to understand compared to "part of the managed memory is isolated and others are not".
> > > >
> > > > By waste, I meant reserving a certain amount of memory that is only used by certain use cases that do not always exist. This is exactly what we want to avoid with managed memory in FLIP-49 [1]. We used to have managed memory that was only used for batch operators, and a containerized cut-off memory (something similar to framework off-heap) for the rocksdb state backend. The problem was that, if the user did not change the configuration when switching between streaming / batch jobs, there would always be some memory (managed or cut-off) wasted. Similarly, introducing a shared managed memory zone means reserving one more dedicated part of memory that can get wasted in many cases. This is probably a necessary price for this new feature, but let's not break the concept / properties of managed memory for it.
> > > >
> > > > In your proposal, the fraction for the shared managed memory is 0 by default. That means that to enable the rocksdb memory sharing, users need to manually increase the fraction anyway. Thus, having the memory-sharing rocksdb use managed memory or off-heap memory does not make a significant difference for users of the new feature. I'd think of this as "extra operational overhead for users of a certain new feature" vs. "significant learning cost and potential behavior change for pretty much all users". I'd be fine with having some shortcuts to simplify the configuration on the user side for this new feature, but not with invading the managed memory.
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
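
To make the trade-off concrete, enabling the proposed sharing would roughly look like the following on the user side. This is only a sketch: taskmanager.memory.managed.shared-fraction is the key proposed in FLINK-29928, not an existing option, whereas taskmanager.memory.managed.fraction already exists.

    import org.apache.flink.configuration.Configuration;

    public class SharedFractionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Proposed key from FLINK-29928 (default 0): nothing is shared until
            // the user explicitly raises the fraction.
            conf.setString("taskmanager.memory.managed.shared-fraction", "0.3");
            // Existing option: fraction of Flink memory used as managed memory.
            conf.setString("taskmanager.memory.managed.fraction", "0.4");
        }
    }
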
> > > > On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <khachatryan.ro...@gmail.com> wrote:
> > > >
> > > > > Thanks for your reply, Xintong Song.
> > > > >
> > > > > Could you please elaborate on the following:
> > > > >
> > > > > > The proposed changes break several good properties that we designed for managed memory.
> > > > > > 1. It's isolated across slots
> > > > >
> > > > > Just to clarify, any way of managing the memory efficiently while capping its usage will break the isolation. It's just a matter of whether it's managed memory or not. Do you see any reason why unmanaged memory can be shared but managed memory cannot?
> > > > >
> > > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > > >
> > > > > If I understand correctly, managed memory can already be wasted, because it is divided evenly between slots regardless of whether its consumers exist in a particular slot. And in general, even if every slot has RocksDB / Python, equal consumption is not guaranteed. So this property would rather be fixed by the current proposal.
> > > > >
> > > > > > In addition, it further complicates configuration / computation logics of managed memory.
> > > > >
> > > > > I think having multiple options overriding each other increases the complexity for the user. As for the computation, I think it's desirable to let Flink do it rather than the users.
> > > > >
> > > > > Both approaches need some help from the TM for:
> > > > > - storing the shared resources (a static field in a class might be too dangerous, because if the backend is loaded by the user class loader then memory will leak silently)
> > > > > - reading the configuration
> > > > >
> > > > > Regards,
> > > > > Roman
> > > > >
> > > > > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > > I like the idea of sharing RocksDB memory across slots. However, I'm quite concerned by the currently proposed approach.
> > > > > >
> > > > > > The proposed changes break several good properties that we designed for managed memory.
> > > > > > 1. It's isolated across slots
> > > > > > 2. It should never be wasted (unless there's nothing in the job that needs managed memory)
> > > > > > In addition, it further complicates the configuration / computation logics of managed memory.
> > > > > >
> > > > > > As an alternative, I'd suggest introducing a variant of RocksDBStateBackend that shares memory across slots and does not use managed memory. This basically means the shared memory is not considered part of managed memory. Users of this new feature would need to configure how much memory the variant state backend should use, and probably also a larger framework off-heap / JVM overhead memory. The latter might require a bit of extra user effort compared to the current approach, but it would avoid significant complexity in the managed memory configuration and calculation logics, which affects a broader set of users.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
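
As a sketch of what this alternative could mean on the user side: state.backend.rocksdb.shared-memory.size is a purely hypothetical key used here for illustration, whereas taskmanager.memory.framework.off-heap.size is an existing option that would need to be enlarged accordingly.

    import org.apache.flink.configuration.Configuration;

    public class UnmanagedSharingSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical option (illustration only): how much memory all RocksDB
            // instances of the TM may share outside of managed memory.
            conf.setString("state.backend.rocksdb.shared-memory.size", "1g");
            // Existing option: enlarge framework off-heap memory to account for the
            // shared cache living outside managed memory.
            conf.setString("taskmanager.memory.framework.off-heap.size", "1280m");
        }
    }
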
> > > > > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org> wrote:
> > > > > >
> > > > > > > Hi John, Yun,
> > > > > > >
> > > > > > > Thank you for your feedback.
> > > > > > >
> > > > > > > @John
> > > > > > >
> > > > > > > > It seems like operators would either choose isolation for the cluster's jobs or they would want to share the memory between jobs. I'm not sure I see the motivation to reserve only part of the memory for sharing and allowing jobs to choose whether they will share or be isolated.
> > > > > > >
> > > > > > > I see two related questions here:
> > > > > > >
> > > > > > > 1) Whether to allow mixed workloads within the same cluster. I agree that most likely all the jobs will have the same "sharing" requirement, so we can drop "state.backend.memory.share-scope" from the proposal.
> > > > > > >
> > > > > > > 2) Whether to allow different memory consumers to use shared or exclusive memory. Currently, only RocksDB is proposed to use shared memory. For Python, it's non-trivial because it is job-specific. So we have to partition managed memory into shared/exclusive, and therefore we can NOT replace "taskmanager.memory.managed.shared-fraction" with some boolean flag.
> > > > > > >
> > > > > > > I think your question was about (1); I just wanted to clarify why the shared-fraction is needed.
> > > > > > >
> > > > > > > @Yun
> > > > > > >
> > > > > > > > I am just curious whether this could really bring benefits to our users with such complex configuration logic.
> > > > > > >
> > > > > > > I agree, and configuration complexity seems to be a common concern. I hope that removing "state.backend.memory.share-scope" (as proposed above) reduces the complexity. Please share any ideas on how to simplify it further.
> > > > > > >
> > > > > > > > Could you share some real experimental results?
> > > > > > >
> > > > > > > I did an experiment to verify that the approach is feasible, i.e. that multiple jobs can share the same memory / block cache. But I guess that's not what you mean here? Do you have any experiments in mind?
> > > > > > >
> > > > > > > > BTW, as talked before, I am not sure whether different lifecycles of RocksDB state backends would affect the memory usage of the block cache & write buffer manager in RocksDB.
> > > > > > > > Currently, all instances would start and destroy nearly simultaneously; this would change after we introduce this feature, with jobs running at different scheduling times.
> > > > > > >
> > > > > > > IIUC, the concern is that closing a RocksDB instance might close the BlockCache. I checked that manually, and it seems to work as expected. And I think such behavior would contradict the sharing concept as described in the documentation [1].
> > > > > > >
> > > > > > > [1] https://github.com/facebook/rocksdb/wiki/Block-Cache
> > > > > > >
> > > > > > > Regards,
> > > > > > > Roman
> > > > > > >
> > > > > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Roman,
> > > > > > > >
> > > > > > > > Thanks for the proposal; this allows the state backend to make better use of memory.
> > > > > > > >
> > > > > > > > After reading the ticket, I'm curious about some points:
> > > > > > > >
> > > > > > > > 1. Is the shared memory only for the state backend? If both "taskmanager.memory.managed.shared-fraction: >0" and "state.backend.rocksdb.memory.managed: false" are set at the same time, will the shared memory be wasted?
> > > > > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory and will compete with each other" in the example; how is the size of the unmanaged part calculated?
> > > > > > > > 3. For fine-grained resource management, the control of cpuCores and taskHeapMemory can still work, right? And I am a little worried that having too many memory-related configuration options is complicated for users to understand.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Yanfei
> > > > > > > >
> > > > > > > > On Tue, Nov 8, 2022 at 23:22, Roman Khachatryan <ro...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I'd like to discuss sharing RocksDB memory across slots as proposed in FLINK-29928 [1].
> > > > > > > > >
> > > > > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > > > > - share these objects (the RocksDB block cache and write buffer manager) among RocksDB instances of the same slot
> > > > > > > > > - bound the total memory usage of all RocksDB instances of a TM
> > > > > > > > >
> > > > > > > > > However, the memory is divided between the slots equally (unless fine-grained resource control is used). This is sub-optimal if some slots contain more memory-intensive tasks than others. Using fine-grained resource control is also often not an option, because the workload might not be known in advance.
> > > > > > > > >
> > > > > > > > > The proposal is to widen the scope of sharing to the TM, so that memory can be shared across all RocksDB instances of that TM. That would reduce the overall memory consumption in exchange for resource isolation.
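
For context, here is a minimal RocksJava sketch (plain RocksDB API, not Flink code) of the underlying mechanism referenced here: two database instances share one block cache and one write buffer manager, so their combined memory usage stays within a single budget.

    import org.rocksdb.*;

    public class SharedCacheSketch {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();

            // One cache and one write buffer manager shared by both instances.
            Cache cache = new LRUCache(256 * 1024 * 1024);
            WriteBufferManager wbm = new WriteBufferManager(64 * 1024 * 1024, cache);

            try (RocksDB db1 = open("/tmp/db1", cache, wbm);
                 RocksDB db2 = open("/tmp/db2", cache, wbm)) {
                db1.put("key".getBytes(), "value".getBytes());
                db2.put("key".getBytes(), "value".getBytes());
            }
        }

        private static RocksDB open(String path, Cache cache, WriteBufferManager wbm)
                throws RocksDBException {
            Options options = new Options()
                    .setCreateIfMissing(true)
                    .setWriteBufferManager(wbm)
                    .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache));
            return RocksDB.open(options, path);
        }
    }
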
> > > > > > > > > Please see FLINK-29928 [1] for more details.
> > > > > > > > >
> > > > > > > > > Looking forward to feedback on that proposal.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-7289
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Roman