Hi all!

I looked into using Flink with the new memory configurations and played
around a bit.
All in all, I think we made a really good step there in removing implicit
assumptions and rather making the configuration of memory sizes more
explicit. My experience was mostly very positive.

There were a few issues I stumbled across that were a bit un-intuitive,
where we could think about some alternatives.

==========
Main points
==========

  - "taskmanager.memory.size" (old main config option) is replaced by
"taskmanager.memory.total-process.size" which has a different meaning in
standalone setups. The old option did not subtract metaspace and other
overhead, while the new option does. That means that with the default
config, standalone clusters get quite a bit less memory. (independent of
managed memory going off heap).

    I am wondering if we could interpret "taskmanager.memory.size" as the
deprecated key for "taskmanager.memory.total-flink.size". That would be in
line with the old mechanism (assuming managed memory is set to off heap).

    The effect would be that the container size on Yarn/Mesos increases,
because from "taskmanager.memory.total-flink.size", we need to add overhead
and metaspace to reach the total process size, rather than cutting off
memory. But if we want, we could even adjust for that in the active
resource manager, getting full backwards compatibility on that part.

    Curious to hear more thoughts there.

  - Mini Cluster tries to imitate exact ratio of memory pools as a
standalone setup. I get the idea behind that, but I am wondering if it is
the right approach here.

    For example: I started a relatively large JVM (large heap size of 10
GB) as a test. With the current logic, the system tries to reserve an
additional 6GB for managed memory which is more than there is memory left.
When you see the error that no memory could be allocated, you need to
understand the magic of how this is derived.

    I am trying to think about this from the perspective of using "Flink as
a Library", which the MiniCluster is close to.
    When starting Flink out of a running process, we cannot assume that we
are the only users of that process and that we can mold the process to our
demands. I think a fix value for managed memory and network memory would
feel more natural in such a setup than a mechanism that is tailored towards
exclusive use of the process.


===========================
Config key names and descriptions
===========================

  - Some off-heap memory goes into direct memory, some does not. This
confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
is counted into MaxDirectMemory while
"taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
we should rename the keys to reflect that. There is no one "off heap"
memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
and "taskmanager.memory.framework.direct: XXXmb" instead?

  - What do you think about renaming "taskmanager.memory.total-flink.size"
to "taskmanager.memory.flink.size" and
"taskmanager.memory.total-process.size" to
"taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
think these keys may be a bit less clumsy (dropping the "total-") without
loss of expressiveness.

  - The network memory keys are now called "taskmanager.memory.shuffle.*".
To my knowledge, shuffle is usually understood as a redistribution (random,
or maybe by hash of key). As an example, there are many discussions about
"shuffle join versus broadcast join", where "shuffle" is the synonym for
"re-partitioning". We use that memory however for all network operations,
like forward pipes, broadcasts, receiver-side buffering on checkpoints,
etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
find that as well. So throwing in the suggestion to call the options
"taskmanager.memory.network.*".

  - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
that it also accounts for I/O direct memory, but the parameter is not
counted into the MaxDirectMemory parameter.


Some minor comments
==================

  - Can make the new ConfigOptions strongly typed with the new
configuration options. For example, directly use MemorySize typed options.
That makes validation automatic and saves us from breaking the options
later.


Best,
Stephan

Reply via email to