Hi Stephan,

Thank you for kicking off this discussion and for the suggestions.

  - "taskmanager.memory.size" (old main config option) is replaced by
> "taskmanager.memory.total-process.size" which has a different meaning in
> standalone setups. The old option did not subtract metaspace and other
> overhead, while the new option does. That means that with the default
> config, standalone clusters get quite a bit less memory. (independent of
> managed memory going off heap).
>     I am wondering if we could interpret "taskmanager.memory.size" as the
> deprecated key for "taskmanager.memory.total-flink.size". That would be in
> line with the old mechanism (assuming managed memory is set to off heap).
>     The effect would be that the container size on Yarn/Mesos increases,
> because from "taskmanager.memory.total-flink.size", we need to add overhead
> and metaspace to reach the total process size, rather than cutting off
> memory. But if we want, we could even adjust for that in the active
> resource manager, getting full backwards compatibility on that part.
>     Curious to hear more thoughts there.


I believe you mean "taskmanager.heap.size".

I think the problem here is that the legacy "taskmanager.heap.size" was
used differently in standalone setups and in active Yarn / Mesos setups,
and such diverging calculation logic and behavior is exactly what we want
to avoid with FLIP-49. Therefore, I'm not in favor of treating
"taskmanager.memory.total-flink.size" differently for standalone and active
setups.

I think what we really want is to map "taskmanager.heap.size" to
different new config options in different setups. How about we
mark "taskmanager.heap.size" as a deprecated key for neither
"taskmanager.memory.total-process.size" nor
"taskmanager.memory.total-flink.size"? Instead, we parse it (if explicitly
configured) in the startup scripts / active resource managers, and set its
value to "taskmanager.memory.total-flink.size" in the scripts and to
"taskmanager.memory.total-process.size" in the active resource managers (if
the new config options are not configured). We can provide util methods in
TaskExecutorResourceUtils for such conversions, to keep all the
configuration logic in one place.
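
Just to make that concrete, a rough sketch (using
org.apache.flink.configuration.Configuration) of what such a helper in
TaskExecutorResourceUtils could look like; the method name and exact
fallback behavior are purely illustrative, not the actual implementation:

    // Hypothetical helper, only to illustrate the proposed mapping of the
    // legacy key; name and behavior are illustrative, not the real code.
    public static void fallBackToLegacyHeapSize(Configuration conf, String newKey) {
        String legacyValue = conf.getString("taskmanager.heap.size", null);
        if (legacyValue != null && conf.getString(newKey, null) == null) {
            conf.setString(newKey, legacyValue);
        }
    }

    // Startup scripts:  fallBackToLegacyHeapSize(conf, "taskmanager.memory.total-flink.size");
    // Active RMs:       fallBackToLegacyHeapSize(conf, "taskmanager.memory.total-process.size");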

  - Mini Cluster tries to imitate exact ratio of memory pools as a
> standalone setup. I get the idea behind that, but I am wondering if it is
> the right approach here.
>     For example: I started a relatively large JVM (large heap size of 10
> GB) as a test. With the current logic, the system tries to reserve an
> additional 6GB for managed memory which is more than there is memory left.
> When you see the error that no memory could be allocated, you need to
> understand the magic of how this is derived.
>     I am trying to think about this from the perspective of using "Flink
> as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we
> are the only users of that process and that we can mold the process to our
> demands. I think a fixed value for managed memory and network memory would
> feel more natural in such a setup than a mechanism that is tailored towards
> exclusive use of the process.


+1 on having fixed values for managed / shuffle memory.
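
For example (sizes and key names here are only illustrative of the idea,
not a proposal for the actual defaults), an embedded / "Flink as a Library"
setup could simply pin both pools:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.minicluster.MiniCluster;
    import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

    // Fixed, illustrative sizes instead of fraction-based derivation.
    Configuration conf = new Configuration();
    conf.setString("taskmanager.memory.managed.size", "128m");
    conf.setString("taskmanager.memory.shuffle.min", "64m");
    conf.setString("taskmanager.memory.shuffle.max", "64m");

    MiniCluster miniCluster = new MiniCluster(
            new MiniClusterConfiguration.Builder()
                    .setConfiguration(conf)
                    .build());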

  - Some off-heap memory goes into direct memory, some does not. This
> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
> is counted into MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
> we should rename the keys to reflect that. There is no one "off heap"
> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
> and "taskmanager.memory.framework.direct: XXXmb" instead?


I believe "taskmanager.memory.task.off-heap.size" is also accounted in the
max direct memory size limit. The confusion probably comes from that
"taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
its description while "taskmanager.memory.task.off-heap.size" didn't.
That's actually because the framework off-heap memory is introduced later
in a separate commit. We should fix that.

For framework / task off-heap memory, we do not differentiate between
direct and native memory usage. That means the configured value for these
two options could be a mixture of direct and native memory. Since we do not
know what portion of the configured value is direct memory, we have
to conservatively account all of it into the max direct memory size limit.
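
Roughly (purely illustrative numbers, not the actual derivation code in
TaskExecutorResourceUtils), the conservative accounting looks like this:

    // The direct vs. native split inside the two off-heap pools is unknown,
    // so both are counted in full towards the JVM direct memory limit.
    long frameworkOffHeapBytes = 128L << 20;  // taskmanager.memory.framework.off-heap.size
    long taskOffHeapBytes      =  64L << 20;  // taskmanager.memory.task.off-heap.size
    long shuffleMemoryBytes    =  64L << 20;  // shuffle buffers are direct memory as well
    long maxDirectMemoryBytes  = frameworkOffHeapBytes + taskOffHeapBytes + shuffleMemoryBytes;
    String jvmArg = "-XX:MaxDirectMemorySize=" + maxDirectMemoryBytes;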

  - What do you think about renaming "taskmanager.memory.total-flink.size"
> to "taskmanager.memory.flink.size" and
> "taskmanager.memory.total-process.size" to
> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
> think these keys may be a bit less clumsy (dropping the "total-") without
> loss of expressiveness.


+1 on this.

  - The network memory keys are now called "taskmanager.memory.shuffle.*".
> To my knowledge, shuffle is usually understood as a redistribution (random,
> or maybe by hash of key). As an example, there are many discussions about
> "shuffle join versus broadcast join", where "shuffle" is the synonym for
> "re-partitioning". We use that memory however for all network operations,
> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
> find that as well. So throwing in the suggestion to call the options
> "taskmanager.memory.network.*".


+0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
hand, one can also argue that this part of memory is used by
ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
directly to the shuffle service components.

  - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
> that it also accounts for I/O direct memory, but the parameter is not
> counted into the MaxDirectMemory parameter.


True. Since the framework off-heap memory already accounts for ad-hoc
direct memory usage, also counting all of the JVM overhead into the max
direct memory limit does not seem necessary. I would suggest removing "I/O
direct memory" from the description, and explicitly mentioning that this
option does not account for direct memory and will not be counted into the
max direct memory limit. WDYT?
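
To illustrate with made-up numbers: the JVM overhead would still count
towards the total process size, just not towards the direct memory limit:

    // Illustrative sizes only.
    long totalFlinkBytes   = 1024L << 20;  // taskmanager.memory.total-flink.size
    long metaspaceBytes    =   96L << 20;  // JVM metaspace
    long jvmOverheadBytes  =  192L << 20;  // taskmanager.memory.jvm-overhead.*
    long totalProcessBytes = totalFlinkBytes + metaspaceBytes + jvmOverheadBytes;
    // -XX:MaxDirectMemorySize stays unaffected by jvmOverheadBytes.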

  - Can make the new ConfigOptions strongly typed with the new
> configuration options. For example, directly use MemorySize typed options.
> That makes validation automatic and saves us from breaking the options
> later.

+1. I wasn't aware of the new memory-type config options.
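
For reference, something along these lines (the key, default, and
description here are just made up for illustration):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    // MemorySize-typed option: parsing and validation of "128m", "1g", etc.
    // happen in the configuration framework rather than in our own code.
    public static final ConfigOption<MemorySize> FRAMEWORK_OFF_HEAP_MEMORY =
            ConfigOptions.key("taskmanager.memory.framework.off-heap.size")
                    .memoryType()
                    .defaultValue(MemorySize.parse("128m"))
                    .withDescription(
                            "Framework off-heap memory reserved for the TaskExecutor, "
                                    + "counted into the max direct memory limit.");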

Thank you~

Xintong Song



On Thu, Dec 19, 2019 at 6:03 PM Stephan Ewen <se...@apache.org> wrote:

> Hi all!
>
> I looked into using Flink with the new memory configurations and played
> around a bit.
> All in all, I think we made a really good step there in removing implicit
> assumptions and rather making the configuration of memory sizes more
> explicit. My experience was mostly very positive.
>
> There were a few issues I stumbled across that were a bit un-intuitive,
> where we could think about some alternatives.
>
> ==========
> Main points
> ==========
>
>   - "taskmanager.memory.size" (old main config option) is replaced by
> "taskmanager.memory.total-process.size" which has a different meaning in
> standalone setups. The old option did not subtract metaspace and other
> overhead, while the new option does. That means that with the default
> config, standalone clusters get quite a bit less memory. (independent of
> managed memory going off heap).
>
>     I am wondering if we could interpret "taskmanager.memory.size" as the
> deprecated key for "taskmanager.memory.total-flink.size". That would be in
> line with the old mechanism (assuming managed memory is set to off heap).
>
>     The effect would be that the container size on Yarn/Mesos increases,
> because from "taskmanager.memory.total-flink.size", we need to add overhead
> and metaspace to reach the total process size, rather than cutting off
> memory. But if we want, we could even adjust for that in the active
> resource manager, getting full backwards compatibility on that part.
>
>     Curious to hear more thoughts there.
>
>   - Mini Cluster tries to imitate exact ratio of memory pools as a
> standalone setup. I get the idea behind that, but I am wondering if it is
> the right approach here.
>
>     For example: I started a relatively large JVM (large heap size of 10
> GB) as a test. With the current logic, the system tries to reserve an
> additional 6GB for managed memory which is more than there is memory left.
> When you see the error that no memory could be allocated, you need to
> understand the magic of how this is derived.
>
>     I am trying to think about this from the perspective of using "Flink
> as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we
> are the only users of that process and that we can mold the process to our
> demands. I think a fixed value for managed memory and network memory would
> feel more natural in such a setup than a mechanism that is tailored towards
> exclusive use of the process.
>
>
> ===========================
> Config key names and descriptions
> ===========================
>
>   - Some off-heap memory goes into direct memory, some does not. This
> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
> is counted into MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
> we should rename the keys to reflect that. There is no one "off heap"
> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
> and "taskmanager.memory.framework.direct: XXXmb" instead?
>
>   - What do you think about renaming "taskmanager.memory.total-flink.size"
> to "taskmanager.memory.flink.size" and
> "taskmanager.memory.total-process.size" to
> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
> think these keys may be a bit less clumsy (dropping the "total-") without
> loss of expressiveness.
>
>   - The network memory keys are now called "taskmanager.memory.shuffle.*".
> To my knowledge, shuffle is usually understood as a redistribution (random,
> or maybe by hash of key). As an example, there are many discussions about
> "shuffle join versus broadcast join", where "shuffle" is the synonym for
> "re-partitioning". We use that memory however for all network operations,
> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
> find that as well. So throwing in the suggestion to call the options
> "taskmanager.memory.network.*".
>
>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
> that it also accounts for I/O direct memory, but the parameter is not
> counted into the MaxDirectMemory parameter.
>
>
> Some minor comments
> ==================
>
>   - Can make the new ConfigOptions strongly typed with the new
> configuration options. For example, directly use MemorySize typed options.
> That makes validation automatic and saves us from breaking the options
> later.
>
>
> Best,
> Stephan
>
>
