Thanks for the explanations @Stephan, and for the feedback @Jingsong @Andrey
@Till. I'm also ok with flink.size in flink-conf.yaml.

And if I understand correctly, we have also reached consensus on having the
configuration key 'taskmanager.memory.network.*' (with the deprecated key
'taskmanager.network.memory.*')?
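
For illustration, a sketch of how the rename with a deprecated fallback key
could be declared, assuming the strongly typed ConfigOptions API discussed
later in this thread; the default value here is made up:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class NetworkMemoryOptions {
        // New key; configs that still use the 1.9 key keep working
        // through the deprecated-key fallback.
        public static final ConfigOption<MemorySize> NETWORK_MEMORY_MIN =
                ConfigOptions.key("taskmanager.memory.network.min")
                        .memoryType()
                        .defaultValue(MemorySize.parse("64mb")) // illustrative
                        .withDeprecatedKeys("taskmanager.network.memory.min")
                        .withDescription("Min size of memory used for network buffers.");
    }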


Thank you~

Xintong Song



On Tue, Jan 7, 2020 at 7:10 PM Till Rohrmann <trohrm...@apache.org> wrote:

> I'm also slightly in favour of flink.size instead of process.size. If
> Xintong is ok with this, then we should update the flink-conf.yaml in the
> repo.
>
> Cheers,
> Till
>
> On Mon, Jan 6, 2020 at 4:09 PM Andrey Zagrebin <azagre...@apache.org>
> wrote:
>
> > Thank you for the further explanation Stephan and the feedback Jingsong,
> >
> > I see the point now.
> > ‘taskmanager.memory.flink.size’ is indeed simpler to understand for
> > newcomers in the default config because it includes fewer types of memory
> > to consider.
> > Most of the time beginners will think about heap size and maybe state size
> > to tweak in the try-out jobs.
> > ‘taskmanager.memory.flink.size’ is better scoped for those types of memory.
> >
> > The process memory option forces users to understand the total memory
> > consumption, which is more important for the later steps and will probably
> > require reading the docs in more depth anyway.
> > I agree that if we decide for ‘flink.size’, it is worth adding a pointer to
> > ‘process.size' in its comment as an alternative.
> >
> > I am ok with ‘flink.size’ in the default config.
> >
> > Best,
> > Andrey
> >
> > On Mon, Dec 30, 2019 at 5:13 AM Jingsong Li <jingsongl...@gmail.com>
> > wrote:
> >
> > > Thank you for your wonderful discussion.
> > >
> > > +1 for setting "taskmanager.memory.flink.size" in the default config.
> > > Maybe we can mention and explain "taskmanager.memory.process.size" in the
> > > comments.
> > >
> > > Most of the time,
> > > - "trying out" users are less in-depth users and use standalone mode.
> > > - Production users use active setups.
> > >
> > > The default config is very important for "trying out".
> > > When trying out Flink, a novice Java user wants to configure how much
> > > memory the standalone process should use in total. In fact, it's hard for
> > > them to be aware of the JVM overhead, and there is no need to force them
> > > to know about it.
> > >
> > > > I don't think it is realistic that users set the process memory to
> > > > the full machine memory. There is a lot else on the machine as well in
> > > > most cases.
> > >
> > > +1. We often run Flink in a less clean environment, such as an
> > > environment that also runs HDFS. Whether we are testing or in production,
> > > we will not use all the memory of the machine; for example, we always
> > > leave some memory for the page cache.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > "taskmanager.memory.flink.size" in the default config has a few
> > > advantages.
> > > >
> > > >  - The value in the default config needs to be suitable for "trying
> > > > out" Flink, for a good "getting started" experience.
> > > >
> > > >   - For trying out Flink, standalone is the most common entry point
> > > > (except for running in the IDE).
> > > >
> > > >   - In a standalone setup, from the total process memory, we subtract
> > > > quite a bit before we arrive at the usable memory. We also subtract
> > > > managed memory from the heap now. I fear we might end up with a heap so
> > > > small that it makes for a bad "getting started" experience. (A rough
> > > > arithmetic sketch follows this list.)
> > > >
> > > >   - I don't think it is realistic that users set the process memory to
> > > > the full machine memory. There is a lot else on the machine as well in
> > > > most cases.
> > > >
> > > >   - In the JVM world, users are used to configuring the heap size and
> > > > know that there is additional memory overhead. The
> > > > "taskmanager.memory.flink.size" option fits well with that mindset.
> > > >
> > > >   - Once you start to think about the total process memory of Yarn
> > > > containers, you are already past the getting-started phase and into the
> > > > tuning phase.
> > > >
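
A rough arithmetic sketch of the derivation above (in Java, matching the rest
of the thread); every size and fraction here is an illustrative assumption,
not necessarily Flink's actual default:

    // Standalone: how much heap is left after all the subtractions.
    long processMb   = 1728;                      // total process size (assumed)
    long metaspaceMb = 96;                        // JVM metaspace (assumed)
    long overheadMb  = (long) (processMb * 0.1);  // JVM overhead fraction (assumed)
    long flinkMb     = processMb - metaspaceMb - overheadMb;  // "Flink memory": 1460
    long networkMb   = (long) (flinkMb * 0.1);    // network buffers (assumed fraction)
    long managedMb   = (long) (flinkMb * 0.4);    // managed memory (assumed fraction)
    long heapMb      = flinkMb - networkMb - managedMb;       // ~730 mb left for heap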
> > > >
> > > > On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <azagre...@apache.org>
> > > wrote:
> > > >
> > > > > Thanks for the summary, Xintong! It makes sense to me.
> > > > >
> > > > > > How about putting "taskmanager.memory.flink.size" in the
> > > > > > configuration? Then a newly downloaded Flink behaves similarly to
> > > > > > the previous standalone setups.
> > > > > > If someone upgrades the binaries, but re-uses their old
> > > > > > configuration, then they get the compatibility as discussed
> > > > > > previously. We used that approach previously with the fine-grained
> > > > > > failover recovery.
> > > > >
> > > > >
> > > > >
> > > > > > I'm trying to understand why "taskmanager.memory.flink.size"
> > > > > > rather than "taskmanager.memory.process.size" in the default
> > > > > > flink-conf.yaml. Or put it another way, why do we want a newly
> > > > > > downloaded Flink to behave similarly to previous standalone setups
> > > > > > rather than previous active mode setups? Is there any special
> > > > > > reason that I overlooked, which makes backwards compatibility for
> > > > > > standalone setups more important than for active setups?
> > > > > > IMO, "taskmanager.memory.process.size" is easier for the new
> > comers.
> > > > For
> > > > > > standalone setups, users can simply configure it to their
> machines'
> > > > > > available memory size, without needing to worry about leaving
> > enough
> > > > > space
> > > > > > for JVM overehead / metaspace. For containerized setups, it's
> more
> > > > > > predictable how many memory the containers / Flink could use,
> which
> > > is
> > > > > more
> > > > > > friendly for users to manage their resource quota.
> > > > > > Therefore, unless there is anything I overlooked, I'm in favor of
> > > > putting
> > > > > > "taskmanager.memory.process.size" rather than
> > > > > > "taskmanager.memory.flink.size" in the default configuration.
> > > > >
> > > > >
> > > > > I agree that having "taskmanager.memory.process.size" in default
> > config
> > > > > should be easier to understand and tweak for the new users because
> it
> > > is
> > > > > just what they are ready to spend for Flink application.
> > > > > The problem is when users upgrade Flink and use the new default
> > > > > configuration then the behaviour can change:
> > > > > - either if we put process memory then Flink memory shrinks and the
> > new
> > > > > default option contradicts their previous understanding
> > > > > - or if we put Flink memory then larger container is requested.
> > > > > The shrinking of memory sounds more implicit and worse. The
> increased
> > > > > container request will just fail in the worst case and the memory
> > setup
> > > > can
> > > > > be revisited.
> > > > > We could increase the default "taskmanager.memory.process.size" to
> > > better
> > > > > align it with the previous default setup for standalone
> > > > > but this would not remove the possible confusion problem for the
> old
> > > > users,
> > > > > on the other hand the option is new and we can add a comment how to
> > > > migrate
> > > > > from the old one.
> > > > >
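
A tiny numeric illustration of the two cases above; all numbers are
assumptions for the example:

    long legacyMb    = 1024; // legacy taskmanager.heap.size
    long metaspaceMb = 96;   // assumed cut-off
    long overheadMb  = 102;  // assumed cut-off

    // (a) The default config ships process.size = 1024m: metaspace and
    //     overhead are now subtracted from the same total, so the memory
    //     actually available to Flink shrinks to ~826m.
    long flinkIfProcess = legacyMb - metaspaceMb - overheadMb;

    // (b) The default config ships flink.size = 1024m: metaspace and
    //     overhead are added on top, so a larger container (~1222m) is
    //     requested from Yarn than before.
    long containerIfFlink = legacyMb + metaspaceMb + overheadMb;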
> > > > > All in all, I now also tend to have "taskmanager.memory.process.size"
> > > > > in the default config, unless there are more reasons for having less
> > > > > confusion for the old standalone users.
> > > > >
> > > > > Best,
> > > > > Andrey
> > > > >
> > > > > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <
> tonysong...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > It seems we have already reached consensus on most of the issues.
> > > > > > Thanks everyone for the good discussion.
> > > > > >
> > > > > > While there are still open questions under discussion, I'd like to
> > > > > > summarize the discussion so far, and list the action items that we
> > > > > > already have consensus on. In this way, we can already start working
> > > > > > on these items while discussing the remaining open questions. Please
> > > > > > correct me if the summary does not reflect your argument.
> > > > > >
> > > > > > *Action items that we have already reached consensus on:*
> > > > > >
> > > > > >    - Map "taskmanager.heap.size" to
> "taskmanager.memory.flink.size"
> > > for
> > > > > >    standalone setups, and "taskmanager.memory.process.size" for
> > > active
> > > > > setups.
> > > > > >    (should be mentioned in release notes)
> > > > > >    - If not explicitly configured, MiniCluster should have fixed
> > > > > >    default network and managed memory sizes. (Should be mentioned in
> > > > > >    the docs & release notes.)
> > > > > >    - Change the memory config options' type from String to
> > > > > >    MemorySize
> > > > > >    - Change the config option key "taskmanager.memory.total-flink.size"
> > > > > >    to "taskmanager.memory.flink.size"
> > > > > >    - Change the config option key "taskmanager.memory.total-process.size"
> > > > > >    to "taskmanager.memory.process.size"
> > > > > >    - Update descriptions for "taskmanager.memory.framework.off-heap.size"
> > > > > >    and "taskmanager.memory.task.off-heap.size" to explicitly state that
> > > > > >       - both direct and native memory are accounted for
> > > > > >       - the value is fully counted into MaxDirectMemorySize
> > > > > >    - Update descriptions for
> > > > > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove "I/O
> > > > > >    direct memory" and explicitly state that it is not counted into
> > > > > >    MaxDirectMemorySize
> > > > > >    - Print MemorySize with a proper unit. (Non-blocker for 1.10.)
> > > > > >
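
A sketch of the first action item, assuming a hypothetical helper; the
method name and its placement in TaskExecutorResourceUtils are made up for
illustration:

    import org.apache.flink.configuration.Configuration;

    public final class LegacyHeapSizeMapping {
        // Hypothetical: map the deprecated "taskmanager.heap.size" onto the
        // new option that preserves the old behaviour of the respective setup.
        static void mapLegacyHeapSize(Configuration conf, boolean activeSetup) {
            String legacy = conf.getString("taskmanager.heap.size", null);
            String newKey = activeSetup
                    ? "taskmanager.memory.process.size"  // Yarn / Mesos containers
                    : "taskmanager.memory.flink.size";   // standalone scripts
            if (legacy != null && !conf.containsKey(newKey)) {
                conf.setString(newKey, legacy);
            }
        }
    }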
> > > > > >
> > > > > > *Questions that are still open:*
> > > > > >
> > > > > >    - Which config option do we put in the default
> flink-conf.yaml?
> > > > > >       - "taskmanager.memory.flink.size"
> > > > > >       - "taskmanager.memory.process.size"
> > > > > >       - Deprecated "taskmanager.heap.size"
> > > > > >    - What are the proper keys for network / shuffle memory
> > > > > >       - "taskmanager.memory.shuffle.*"
> > > > > >       - "taskmanager.memory.network.*"
> > > > > >       - "taskmanager.network.memory.*"
> > > > > >
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <
> > tonysong...@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >>> How about putting "taskmanager.memory.flink.size" in the
> > > > > >>> configuration? Then a newly downloaded Flink behaves similarly to
> > > > > >>> the previous standalone setups.
> > > > > >>> If someone upgrades the binaries, but re-uses their old
> > > > > >>> configuration, then they get the compatibility as discussed
> > > > > >>> previously. We used that approach previously with the fine-grained
> > > > > >>> failover recovery.
> > > > > >>
> > > > > >>
> > > > > >> I'm trying to understand why "taskmanager.memory.flink.size"
> > > > > >> rather than "taskmanager.memory.process.size" in the default
> > > > > >> flink-conf.yaml. Or put it another way, why do we want a newly
> > > > > >> downloaded Flink to behave similarly to previous standalone setups
> > > > > >> rather than previous active mode setups? Is there any special
> > > > > >> reason that I overlooked, which makes backwards compatibility for
> > > > > >> standalone setups more important than for active setups?
> > > > > >>
> > > > > >> IMO, "taskmanager.memory.process.size" is easier for the new
> > comers.
> > > > For
> > > > > >> standalone setups, users can simply configure it to their
> > machines'
> > > > > >> available memory size, without needing to worry about leaving
> > enough
> > > > > space
> > > > > >> for JVM overehead / metaspace. For containerized setups, it's
> more
> > > > > >> predictable how many memory the containers / Flink could use,
> > which
> > > is
> > > > > more
> > > > > >> friendly for users to manage their resource quota.
> > > > > >>
> > > > > >> Therefore, unless there is anything I overlooked, I'm in favor of
> > > > > >> putting "taskmanager.memory.process.size" rather than
> > > > > >> "taskmanager.memory.flink.size" in the default configuration.
> > > > > >>
> > > > > >>
> > > > > >> Thank you~
> > > > > >>
> > > > > >> Xintong Song
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <
> > > azagre...@apache.org
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > > > >>>> configuration? Then a newly downloaded Flink behaves similarly
> > > > > >>>> to the previous standalone setups.
> > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > > >>>> configuration, then they get the compatibility as discussed
> > > > > >>>> previously. We used that approach previously with the
> > > > > >>>> fine-grained failover recovery.
> > > > > >>>
> > > > > >>> +1, this sounds like a good compromise.
> > > > > >>>
> > > > > >>> +1 to not have more options for off-heap memory, that would get
> > > > > >>>> confusing fast. We can keep the name "off-heap" for both task
> > and
> > > > > >>>> framework, as long as they mean the same thing: native plus
> > > direct,
> > > > > and
> > > > > >>>> fully counted into MaxDirectMemory. I would suggest to update
> > the
> > > > > config
> > > > > >>>> descriptions then to reflect that.
> > > > > >>>>
> > > > > >>> True, this should be explained in the config descriptions.
> > > > > >>>
> > > > > >>> looks good to me
> > > > > >>>
> > > > > >>> From a user's perspective I believe
> "taskmanager.memory.network"
> > > > would
> > > > > >>>> be easier to understand as not everyone knows exactly what the
> > > > shuffle
> > > > > >>>> service is. I see the point that it would be a bit imprecise
> as
> > we
> > > > > can have
> > > > > >>>> different shuffle implementations but I would go with the ease
> > of
> > > > > >>>> use/understanding here. Moreover, I think that we won't have
> > many
> > > > > different
> > > > > >>>> shuffle service implementations in the foreseeable future.
> > > > > >>>
> > > > > >>> I agree that if we cannot find any other convincing names for
> the
> > > > > >>> options, we should keep what we already have and change it if
> the
> > > > > >>> alternative is convincing enough.
> > > > > >>> The question is also whether we still want to rename it because
> > it
> > > > was
> > > > > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > > > > "taskmanager.*memory*.network.*"
> > > > > >>> is more aligned with other new memory option names.
> > > > > >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
> > > > > >>>
> > > > > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org
> >
> > > > wrote:
> > > > > >>>
> > > > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > > > >>>> configuration? Then a newly downloaded Flink behaves similarly
> > > > > >>>> to the previous standalone setups.
> > > > > >>>>
> > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > > >>>> configuration, then they get the compatibility as discussed
> > > > > >>>> previously. We used that approach previously with the
> > > > > >>>> fine-grained failover recovery.
> > > > > >>>>
> > > > > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <
> > > tonysong...@gmail.com
> > > > >
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> +1 to not have more options for off-heap memory, that would
> get
> > > > > >>>>>> confusing fast. We can keep the name "off-heap" for both
> task
> > > and
> > > > > >>>>>> framework, as long as they mean the same thing: native plus
> > > > direct,
> > > > > and
> > > > > >>>>>> fully counted into MaxDirectMemory. I would suggest to
> update
> > > the
> > > > > config
> > > > > >>>>>> descriptions then to reflect that.
> > > > > >>>>>>
> > > > > >>>>> True, this should be explained in the config descriptions.
> > > > > >>>>>
> > > > > >>>>> Which configuration option will be set in Flink's default
> > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> behaviour
> > > it
> > > > > would
> > > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > > option.
> > > > > If we
> > > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> > it
> > > > > could also
> > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > >>>>>>
> > > > > >>>>> Good point. Currently, we have
> > > > > >>>>> "taskmanager.memory.total-process.size". In order to preserve
> > the
> > > > > previous
> > > > > >>>>> behavior, we need to have "taskmanager.heap.size" so it can
> be
> > > > > mapped to
> > > > > >>>>> different new options in standalone / active setups.
> > > > > >>>>> I think we can have the deprecated "taskmanager.heap.size" in
> > the
> > > > > >>>>> default flink-conf.yaml, and also have the
> > > > > >>>>> new "taskmanager.memory.total-process.size" in a commented
> > line.
> > > We
> > > > > can
> > > > > >>>>> explain how the deprecated config option behaves differently
> in
> > > the
> > > > > >>>>> comments, so that user can switch to the new config options
> if
> > > they
> > > > > want to.
> > > > > >>>>>
> > > > > >>>>> Thank you~
> > > > > >>>>>
> > > > > >>>>> Xintong Song
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> > > > trohrm...@apache.org>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Thanks for the feedback and good discussion everyone. I left
> > > some
> > > > > >>>>>> comments inline.
> > > > > >>>>>>
> > > > > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <
> > se...@apache.org>
> > > > > >>>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> +1 to not have more options for off-heap memory, that would
> > get
> > > > > >>>>>>> confusing fast. We can keep the name "off-heap" for both
> task
> > > and
> > > > > >>>>>>> framework, as long as they mean the same thing: native plus
> > > > > direct, and
> > > > > >>>>>>> fully counted into MaxDirectMemory. I would suggest to
> update
> > > the
> > > > > config
> > > > > >>>>>>> descriptions then to reflect that.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > > > > tonysong...@gmail.com>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Regarding the framework/task direct/native memory options, I
> > > > > >>>>>>>> tend to think about it differently. I'm in favor of keeping
> > > > > >>>>>>>> the "*.off-heap.size" config option keys.
> > > > > >>>>>>>>
> > > > > >>>>>>>>    - It's not necessary IMO to expose the different concepts
> > > > > >>>>>>>>    of direct / native memory to the users.
> > > > > >>>>>>>>    - I would avoid introducing more options for native
> > memory
> > > if
> > > > > >>>>>>>>    possible. Taking fine grained resource management and
> > > dynamic
> > > > > slot
> > > > > >>>>>>>>    allocation into consideration, that also means
> introduce
> > > more
> > > > > fields into
> > > > > >>>>>>>>    ResourceSpec / ResourceProfile.
> > > > > >>>>>>>>    - My gut feeling is that having a relatively loose
> > > > > >>>>>>>>    MaxDirectMemory should not be a big problem (see the
> > > > > >>>>>>>>    sketch after this list):
> > > > > >>>>>>>>       - In most cases, the task / framework off-heap memory
> > > > > >>>>>>>>       should be mainly (if not all) direct memory, so the
> > > > > >>>>>>>>       difference between the derived MaxDirectMemory and the
> > > > > >>>>>>>>       ideal direct memory limit should not be too large.
> > > > > >>>>>>>>       - We do not have a good way to know the exact size
> > > > > >>>>>>>>       needed for JVM overhead / metaspace and framework / task
> > > > > >>>>>>>>       off-heap memory, and thus have to conservatively reserve
> > > > > >>>>>>>>       slightly more memory than what is actually needed. Such
> > > > > >>>>>>>>       reserved but unused memory can cover the small
> > > > > >>>>>>>>       MaxDirectMemory error.
> > > > > >>>>>>>>       - MaxDirectMemory is not the only way to trigger a full
> > > > > >>>>>>>>       GC. We still have heap activities that can also trigger
> > > > > >>>>>>>>       GC.
> > > > > >>>>>>>>
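
A sketch of the conservative accounting described above: everything that may
contain direct allocations is summed into the -XX:MaxDirectMemorySize value.
All sizes are illustrative assumptions:

    import org.apache.flink.configuration.MemorySize;

    public class MaxDirectMemoryExample {
        public static void main(String[] args) {
            MemorySize frameworkOffHeap = MemorySize.parse("128mb"); // framework off-heap (assumed)
            MemorySize taskOffHeap      = MemorySize.parse("64mb");  // task off-heap, maybe partly native
            MemorySize network          = MemorySize.parse("128mb"); // network buffers are direct memory
            // Conservative: count all of it against the direct memory limit.
            long maxDirect = frameworkOffHeap.getBytes()
                    + taskOffHeap.getBytes()
                    + network.getBytes();
            System.out.println("-XX:MaxDirectMemorySize=" + maxDirect);
        }
    }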
> > > > > >>>>>>>> Regarding the memory type config options, I've looked into
> > the
> > > > > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> > > > > complicated to
> > > > > >>>>>>>> change the config options to use memory type, and I can
> > handle
> > > > it
> > > > > maybe
> > > > > >>>>>>>> during your vacations.
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> Also agree with improving MemorySize logging and parsing.
> > > > > >>>>>>>> This should not be a blocker that has to be done in 1.10. I
> > > > > >>>>>>>> would say we finish the other work (testability, documentation
> > > > > >>>>>>>> and the items discussed in this thread) first, and get to this
> > > > > >>>>>>>> only if we have time.
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thank you~
> > > > > >>>>>>>>
> > > > > >>>>>>>> Xintong Song
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > > > > >>>>>>>> azagrebin.apa...@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hi Stephan and Xintong,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > > > replaced
> > > > > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > > > > different meaning in
> > > > > >>>>>>>>>> standalone setups. The old option did not subtract
> > metaspace
> > > > > and other
> > > > > >>>>>>>>>> overhead, while the new option does. That means that
> with
> > > the
> > > > > default
> > > > > >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > > > (independent of
> > > > > >>>>>>>>>> managed memory going off heap).
> > > > > >>>>>>>>>>     I am wondering if we could interpret
> > > > > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> > line
> > > > > with the old
> > > > > >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > > >>>>>>>>>>     The effect would be that the container size on
> > > Yarn/Mesos
> > > > > >>>>>>>>>> increases, because from
> > > "taskmanager.memory.total-flink.size",
> > > > > we need to
> > > > > >>>>>>>>>> add overhead and metaspace to reach the total process
> > size,
> > > > > rather than
> > > > > >>>>>>>>>> cutting off memory. But if we want, we could even adjust
> > for
> > > > > that in the
> > > > > >>>>>>>>>> active resource manager, getting full backwards
> > > compatibility
> > > > > on that part.
> > > > > >>>>>>>>>>     Curious to hear more thoughts there.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think the problem here is that the
> > > > > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > standalone setups
> > > > > >>>>>>>>> and active yarn / mesos setups, and such different
> > > calculation
> > > > > logics and
> > > > > >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > > > Therefore, I'm
> > > > > >>>>>>>>> not in favor of treating
> > > "taskmanager.memory.total-flink.size"
> > > > > differently
> > > > > >>>>>>>>> for standalone and active setups.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think what we really want is probably to map
> > > > > >>>>>>>>> "taskmanager.heap.size" to different new config options in
> > > > > >>>>>>>>> different setups. How about we mark "taskmanager.heap.size"
> > > > > >>>>>>>>> as a deprecated key for neither
> > > > > >>>>>>>>> "taskmanager.memory.total-process.size" nor
> > > > > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it
> > > > > >>>>>>>>> (if explicitly configured) in the startup scripts / active
> > > > > >>>>>>>>> resource managers, and set the value to
> > > > > >>>>>>>>> "taskmanager.memory.total-flink.size" in the scripts and
> > > > > >>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > > > > >>>>>>>>> managers (if the new config options are not configured). We
> > > > > >>>>>>>>> can provide util methods in TaskExecutorResourceUtils for
> > > > > >>>>>>>>> such conversions, to keep all the configuration logic in one
> > > > > >>>>>>>>> place.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I agree that the problem is that the legacy option
> > > > > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > > > > >>>>>>>>> standalone/container setups.
> > > > > >>>>>>>>> We had it initially falling back to
> > > > > >>>>>>>>> 'taskmanager.memory.total-flink.size’, but I changed that to
> > > > > >>>>>>>>> align it with the container cut-off. Now I see that it
> > > > > >>>>>>>>> changes the standalone setup then.
> > > > > >>>>>>>>> +1 for supporting its backwards compatibility differently for
> > > > > >>>>>>>>> standalone/container setups.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>> Which configuration option will be set in Flink's default
> > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> behaviour
> > > it
> > > > > would
> > > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > > option.
> > > > > If we
> > > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> > it
> > > > > could also
> > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > pools
> > > > as
> > > > > a
> > > > > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > > wondering if it is
> > > > > >>>>>>>>>> the right approach here.
> > > > > >>>>>>>>>>     For example: I started a relatively large JVM (large
> > > heap
> > > > > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > system
> > > > > tries to
> > > > > >>>>>>>>>> reserve an additional 6GB for managed memory which is
> more
> > > > than
> > > > > there is
> > > > > >>>>>>>>>> memory left. When you see the error that no memory could
> > be
> > > > > allocated, you
> > > > > >>>>>>>>>> need to understand the magic of how this is derived.
> > > > > >>>>>>>>>>     I am trying to think about this from the perspective
> > of
> > > > > using
> > > > > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> > > > > >>>>>>>>>>     When starting Flink out of a running process, we
> > cannot
> > > > > >>>>>>>>>> assume that we are the only users of that process and
> that
> > > we
> > > > > can mold the
> > > > > >>>>>>>>>> process to our demands. I think a fix value for managed
> > > memory
> > > > > and network
> > > > > >>>>>>>>>> memory would feel more natural in such a setup than a
> > > > mechanism
> > > > > that is
> > > > > >>>>>>>>>> tailored towards exclusive use of the process.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Also +1 for that, if the user has not specified any main
> > > > > >>>>>>>>> options to derive memory from. We should also log this fixing
> > > > > >>>>>>>>> of managed / shuffle memory.
> > > > > >>>>>>>>> And just noticed: we could also sanity-check the framework
> > > > > >>>>>>>>> heap and, if explicitly configured, the task heap against the
> > > > > >>>>>>>>> available JVM heap, and at least log inconsistencies.
> > > > > >>>>>>>>>
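
One way the fixed MiniCluster sizes could look from the user side; the
concrete keys and values are assumptions based on the options discussed in
this thread:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.minicluster.MiniCluster;
    import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

    public class MiniClusterMemoryExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Fixed, predictable pools for an embedded "Flink as a library"
            // setup instead of sizes derived from the surrounding JVM heap.
            conf.setString("taskmanager.memory.network.min", "64mb");
            conf.setString("taskmanager.memory.network.max", "64mb");
            conf.setString("taskmanager.memory.managed.size", "128mb");

            MiniCluster miniCluster = new MiniCluster(
                    new MiniClusterConfiguration.Builder()
                            .setConfiguration(conf)
                            .setNumTaskManagers(1)
                            .setNumSlotsPerTaskManager(2)
                            .build());
            miniCluster.start();
            // ... submit test jobs, then:
            miniCluster.close();
        }
    }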
> > > > > >>>>>>>>>   - Some off-heap memory goes into direct memory, some
> does
> > > > not.
> > > > > >>>>>>>>>> This confused me a bit. For example
> > > > > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> > into
> > > > > >>>>>>>>>> MaxDirectMemory while
> > > "taskmanager.memory.task.off-heap.size"
> > > > > is counted as
> > > > > >>>>>>>>>> native memory. Maybe we should rename the keys to
> reflect
> > > > that.
> > > > > There is no
> > > > > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > > > >>>>>>>>> accounted in the max direct memory size limit. The
> > confusion
> > > > > probably comes
> > > > > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > explicitly mentioned
> > > > > >>>>>>>>> that in its description while
> > > > > "taskmanager.memory.task.off-heap.size"
> > > > > >>>>>>>>> didn't. That's actually because the framework off-heap
> > memory
> > > > is
> > > > > introduced
> > > > > >>>>>>>>> later in a separate commit. We should fix that.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> For framework / task off-heap memory, we do not
> > > > > >>>>>>>>> differentiate direct / native memory usage. That means the
> > > > > >>>>>>>>> configured value for these two options could be a mixture of
> > > > > >>>>>>>>> direct / native memory. Since we do not know the portion of
> > > > > >>>>>>>>> direct memory out of the configured value, we have to
> > > > > >>>>>>>>> conservatively account it all into the max direct memory
> > > > > >>>>>>>>> size limit.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> size
> > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > MaxDirectMemory?
> > > > > It is a
> > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> (if
> > I
> > > > > recall
> > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> allocation
> > > (to
> > > > > free heap
> > > > > >>>>>>>>> structures that then in turn release direct memory). If
> the
> > > > > limit is
> > > > > >>>>>>>>> anyways too high (because we also count native memory in
> > > there)
> > > > > such that
> > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> do
> > we
> > > > > set it then?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I also always thought about it as providing more of a safety
> > > > > >>>>>>>>> net for direct allocations, but the GC aspect looks more
> > > > > >>>>>>>>> important.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 for fixing the docs for
> > > > > >>>>>>>>> ‘taskmanager.memory.task.off-heap.size’ and renaming it to
> > > > > >>>>>>>>> ‘direct', as this is what really happens, if we want to
> > > > > >>>>>>>>> support the direct limit more exactly than now.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I also think that it is hard to separate direct / native
> > > memory
> > > > > >>>>>>>>> unless we introduce even more options.
> > > > > >>>>>>>>> If a user wants to keep the direct limit tight to a certain
> > > > > >>>>>>>>> value but also use native memory outside of it,
> > > > > >>>>>>>>> she would have to increase something else, like the JVM
> > > > > >>>>>>>>> overhead, to account for it; there is no better way.
> > > > > >>>>>>>>> Having more options to account for native memory outside of
> > > > > >>>>>>>>> the direct limit complicates things, but can be introduced
> > > > > >>>>>>>>> later if needed.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - What do you think about renaming
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > > > > "taskmanager.memory.flink.size"
> > > > > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > > > > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > > > > "taskmanager.memory.jvm.total"). I
> > > > > >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> > > > > "total-") without
> > > > > >>>>>>>>>> loss of expressiveness.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 on this.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 as well. Also an option:
> > > > > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > > > > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > > > > >>>>>>>>> although it can also be mentioned in the docs that we mean
> > > > > >>>>>>>>> the JVM process.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>> I'd be in favour of Stephan's proposal for the config keys
> as
> > > > > shorter
> > > > > >>>>>> is usually better and they are still descriptive enough.
> > Between
> > > > > >>>>>> "taskmanager.memory.process.size" and
> > > > > "taskmanager.memory.jvm.total" I
> > > > > >>>>>> would slightly favour the first variant.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle
> > is
> > > > > usually
> > > > > >>>>>>>>>> understood as a redistribution (random, or maybe by hash
> > of
> > > > > key). As an
> > > > > >>>>>>>>>> example, there are many discussions about "shuffle join
> > > versus
> > > > > broadcast
> > > > > >>>>>>>>>> join", where "shuffle" is the synonym for
> > "re-partitioning".
> > > > We
> > > > > use that
> > > > > >>>>>>>>>> memory however for all network operations, like forward
> > > pipes,
> > > > > broadcasts,
> > > > > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> > name
> > > > > "*.shuffle.*"
> > > > > >>>>>>>>>> confusing, I am wondering if users would find that as
> > well.
> > > So
> > > > > throwing in
> > > > > >>>>>>>>>> the suggestion to call the options
> > > > > "taskmanager.memory.network.*".
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +0 on this one. I'm ok with
> "taskmanager.memory.network.*".
> > > On
> > > > > the
> > > > > >>>>>>>>> other hand, one can also argue that this part of memory
> is
> > > used
> > > > > by
> > > > > >>>>>>>>> ShuffleEnvironment, and the key
> > > "taskmanager.memory.shuffle.*"
> > > > > points more
> > > > > >>>>>>>>> directly to the shuffle service components.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be
> a
> > > bit
> > > > > >>>>>>>>> incorrect, because it is doing not only shuffles as well.
> > The
> > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> > not
> > > > too
> > > > > critical.
> > > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > > adjust
> > > > > it, better
> > > > > >>>>>>>>> earlier than later.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> This is also a bit of a controversial topic for me. Indeed,
> > > > > >>>>>>>>> we have always used ’network’ for this concept of task data
> > > > > >>>>>>>>> shuffling over the network, and this can confuse existing
> > > > > >>>>>>>>> users.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On the other hand, for new users and in the long term,
> > > > > >>>>>>>>> ’network’ can mislead them into concluding that all network
> > > > > >>>>>>>>> memory is managed by this option.
> > > > > >>>>>>>>> Also, other types of shuffle might not directly deal with the
> > > > > >>>>>>>>> network at all.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> By calling it shuffle, we were somewhat biased by
> > > > > >>>>>>>>> understanding it in terms of map/reduce. This is rather an
> > > > > >>>>>>>>> inter-task data exchange. Maybe then
> > > > > >>>>>>>>> 'taskmanager.memory.shuffle.communication.*’ or
> > > > > >>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>  From a user's perspective I believe
> > > "taskmanager.memory.network"
> > > > > >>>>>> would be easier to understand as not everyone knows exactly
> > what
> > > > the
> > > > > >>>>>> shuffle service is. I see the point that it would be a bit
> > > > > imprecise as we
> > > > > >>>>>> can have different shuffle implementations but I would go
> with
> > > the
> > > > > ease of
> > > > > >>>>>> use/understanding here. Moreover, I think that we won't have
> > > many
> > > > > different
> > > > > >>>>>> shuffle service implementations in the foreseeable future.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>   - The descriptions for the
> > > > "taskmanager.memory.jvm-overhead.*"
> > > > > >>>>>>>>>> keys say that it also accounts for I/O direct memory,
> but
> > > the
> > > > > parameter is
> > > > > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> True. Since we already have framework off-heap memory
> > > > > >>>>>>>>> accounted for ad hoc direct memory usages, accounting all of
> > > > > >>>>>>>>> the jvm-overhead into the max direct memory limit as well
> > > > > >>>>>>>>> seems unnecessary. I would suggest removing "I/O direct
> > > > > >>>>>>>>> memory" from the description, and explicitly mentioning that
> > > > > >>>>>>>>> this option does not account for direct memory and will not
> > > > > >>>>>>>>> be counted into the max direct memory limit. WDYT?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> the
> > > new
> > > > > >>>>>>>>>> configuration options. For example, directly use
> > MemorySize
> > > > > typed options.
> > > > > >>>>>>>>>> That makes validation automatic and saves us from
> breaking
> > > the
> > > > > options
> > > > > >>>>>>>>>> later.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I would appreciate it.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Also a small side note: we now use MemorySize extensively in
> > > > > >>>>>>>>> logs, but it is not always readable, as its string
> > > > > >>>>>>>>> representation is currently only in bytes and does not reduce
> > > > > >>>>>>>>> to kb/mb/etc for large byte values. We could have at least
> > > > > >>>>>>>>> some .prettyPrint function to use in logs. And
> > > > > >>>>>>>>> .fromMegabytes/etc factory methods would improve code
> > > > > >>>>>>>>> readability compared to .parse(int + “m”).
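
A sketch of the readability helpers proposed here; prettyPrint and
fromMegabytes-style factories are the suggested additions, not necessarily
existing MemorySize API at the time of this thread:

    // Hypothetical factory, instead of MemorySize.parse(128 + "m"):
    //   MemorySize size = MemorySize.ofMebiBytes(128);

    // Hypothetical pretty-printer that reduces big byte values to kb/mb/gb:
    static String prettyPrint(long bytes) {
        String[] units = {"bytes", "kb", "mb", "gb", "tb"};
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format("%.1f %s", value, units[unit]);
    }
    // prettyPrint(1073741824L) -> "1.0 gb"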
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Best,
> > > > > >>>>>>>>> Andrey
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org
> >
> > > > wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Hi Xintong!
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Please find my answers inline:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option)
> is
> > > > > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size"
> which
> > > has
> > > > > a different
> > > > > >>>>>>>>>>> meaning in standalone setups. The old option did not
> > > subtract
> > > > > metaspace and
> > > > > >>>>>>>>>>> other overhead, while the new option does. That means
> > that
> > > > > with the default
> > > > > >>>>>>>>>>> config, standalone clusters get quite a bit less
> memory.
> > > > > (independent of
> > > > > >>>>>>>>>>> managed memory going off heap).
> > > > > >>>>>>>>>>>     I am wondering if we could interpret
> > > > > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> > > line
> > > > > with the old
> > > > > >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > > >>>>>>>>>>>     The effect would be that the container size on
> > > Yarn/Mesos
> > > > > >>>>>>>>>>> increases, because from
> > > > "taskmanager.memory.total-flink.size",
> > > > > we need to
> > > > > >>>>>>>>>>> add overhead and metaspace to reach the total process
> > size,
> > > > > rather than
> > > > > >>>>>>>>>>> cutting off memory. But if we want, we could even
> adjust
> > > for
> > > > > that in the
> > > > > >>>>>>>>>>> active resource manager, getting full backwards
> > > compatibility
> > > > > on that part.
> > > > > >>>>>>>>>>>     Curious to hear more thoughts there.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Yes*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> I think the problem here is that the
> > > > > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > standalone setups
> > > > > >>>>>>>>>> and active yarn / mesos setups, and such different
> > > calculation
> > > > > logics and
> > > > > >>>>>>>>>> behaviors are exactly what we want to avoid with
> FLIP-49.
> > > > > Therefore, I'm
> > > > > >>>>>>>>>> not in favor of treating
> > > "taskmanager.memory.total-flink.size"
> > > > > differently
> > > > > >>>>>>>>>> for standalone and active setups.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think what we really want is probably
> > > > > >>>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > > > options
> > > > > in
> > > > > >>>>>>>>>> different setups. How about we mark
> > "taskmanager.heap.size"
> > > as
> > > > > deprecated
> > > > > >>>>>>>>>> key for neither of
> "taskmanager.memory.total-process.size"
> > > and
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse
> > it
> > > > (if
> > > > > explicitly
> > > > > >>>>>>>>>> configured) in startup scripts / active resource
> managers,
> > > and
> > > > > set the
> > > > > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > > scripts
> > > > > and
> > > > > >>>>>>>>>> "taskmanager.memory.total-process.size" in active
> resource
> > > > > managers (if the
> > > > > >>>>>>>>>> new config options are not configured). We can provide
> > util
> > > > > methods in
> > > > > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep
> > all
> > > > the
> > > > > >>>>>>>>>> configuration logics at one place.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > > > > >>>>>>>>> description was not very clear), so +1 for that
> mechanism*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > pools
> > > > as
> > > > > a
> > > > > >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > > wondering if it is
> > > > > >>>>>>>>>>> the right approach here.
> > > > > >>>>>>>>>>>     For example: I started a relatively large JVM
> (large
> > > heap
> > > > > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > > system
> > > > > tries to
> > > > > >>>>>>>>>>> reserve an additional 6GB for managed memory which is
> > more
> > > > > than there is
> > > > > >>>>>>>>>>> memory left. When you see the error that no memory
> could
> > be
> > > > > allocated, you
> > > > > >>>>>>>>>>> need to understand the magic of how this is derived.
> > > > > >>>>>>>>>>>     I am trying to think about this from the
> perspective
> > of
> > > > > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is
> > close
> > > > to.
> > > > > >>>>>>>>>>>     When starting Flink out of a running process, we
> > cannot
> > > > > >>>>>>>>>>> assume that we are the only users of that process and
> > that
> > > we
> > > > > can mold the
> > > > > >>>>>>>>>>> process to our demands. I think a fix value for managed
> > > > memory
> > > > > and network
> > > > > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > > > > mechanism that is
> > > > > >>>>>>>>>>> tailored towards exclusive use of the process.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think
> here.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some
> > does
> > > > not.
> > > > > >>>>>>>>>>> This confused me a bit. For example
> > > > > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> > > into
> > > > > >>>>>>>>>>> MaxDirectMemory while
> > > "taskmanager.memory.task.off-heap.size"
> > > > > is counted as
> > > > > >>>>>>>>>>> native memory. Maybe we should rename the keys to
> reflect
> > > > > that. There is no
> > > > > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is
> also
> > > > > >>>>>>>>>> accounted in the max direct memory size limit. The
> > confusion
> > > > > probably comes
> > > > > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > explicitly mentioned
> > > > > >>>>>>>>>> that in its description while
> > > > > "taskmanager.memory.task.off-heap.size"
> > > > > >>>>>>>>>> didn't. That's actually because the framework off-heap
> > > memory
> > > > > is introduced
> > > > > >>>>>>>>>> later in a separate commit. We should fix that.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> For framework / task off-heap memory, we do not
> > > > > >>>>>>>>>> differentiate direct / native memory usage. That means the
> > > > > >>>>>>>>>> configured value for these two options could be a mixture of
> > > > > >>>>>>>>>> direct / native memory. Since we do not know the portion of
> > > > > >>>>>>>>>> direct memory out of the configured value, we have to
> > > > > >>>>>>>>>> conservatively account it all into the max direct memory
> > > > > >>>>>>>>>> size limit.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> size
> > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > MaxDirectMemory?
> > > > > It is a
> > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> (if
> > I
> > > > > recall
> > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> allocation
> > > (to
> > > > > free heap
> > > > > >>>>>>>>> structures that then in turn release direct memory). If
> the
> > > > > limit is
> > > > > >>>>>>>>> anyways too high (because we also count native memory in
> > > there)
> > > > > such that
> > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> do
> > we
> > > > > set it then?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge,
> shuffle
> > is
> > > > > usually
> > > > > >>>>>>>>>>> understood as a redistribution (random, or maybe by
> hash
> > of
> > > > > key). As an
> > > > > >>>>>>>>>>> example, there are many discussions about "shuffle join
> > > > versus
> > > > > broadcast
> > > > > >>>>>>>>>>> join", where "shuffle" is the synonym for
> > > "re-partitioning".
> > > > > We use that
> > > > > >>>>>>>>>>> memory however for all network operations, like forward
> > > > pipes,
> > > > > broadcasts,
> > > > > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> > > name
> > > > > "*.shuffle.*"
> > > > > >>>>>>>>>>> confusing, I am wondering if users would find that as
> > well.
> > > > So
> > > > > throwing in
> > > > > >>>>>>>>>>> the suggestion to call the options
> > > > > "taskmanager.memory.network.*".
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +0 on this one. I'm ok with
> > "taskmanager.memory.network.*".
> > > On
> > > > > >>>>>>>>>> the other hand, one can also argue that this part of
> > memory
> > > is
> > > > > used by
> > > > > >>>>>>>>>> ShuffleEnvironment, and the key
> > > "taskmanager.memory.shuffle.*"
> > > > > points more
> > > > > >>>>>>>>>> directly to the shuffle service components.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be
> a
> > > bit
> > > > > >>>>>>>>> incorrect, because it is doing not only shuffles as well.
> > The
> > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> > not
> > > > too
> > > > > critical.
> > > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > > adjust
> > > > > it, better
> > > > > >>>>>>>>> earlier than later.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>    - The descriptions for the
> > > > "taskmanager.memory.jvm-overhead.*"
> > > > > >>>>>>>>> keys say that it also accounts for I/O direct memory, but
> > the
> > > > > parameter is
> > > > > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> True. Since we already have framework off-heap memory
> > > > accounted
> > > > > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > > jvm-overhead
> > > > > also into
> > > > > >>>>>>>>>> max direct memory limit seems not necessary. I would
> > suggest
> > > > to
> > > > > remove "I/O
> > > > > >>>>>>>>>> direct memory" from the description, and explicitly
> > mention
> > > > > that this
> > > > > >>>>>>>>>> option does not account for direct memory and will not
> be
> > > > > accounted into
> > > > > >>>>>>>>>> max direct memory limit. WDYT?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Sounds good. *
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> the
> > > new
> > > > > >>>>>>>>>>> configuration options. For example, directly use
> > MemorySize
> > > > > typed options.
> > > > > >>>>>>>>>>> That makes validation automatic and saves us from
> > breaking
> > > > the
> > > > > options
> > > > > >>>>>>>>>>> later.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Best,
> > > > > >>>>>>>>> Stephan
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>
