Hi Xintong, thanks for addressing the comments and adding a more detailed implementation plan. I have a couple of comments concerning the implementation plan:
- The name `TaskExecutorSpecifics` is not really descriptive. Choosing a different name could help here.
- I'm not sure whether I would pass the memory configuration to the TaskExecutor via environment variables. I think it would be better to write it into the configuration one uses to start the TM process.
- If possible, I would exclude the memory reservation from this FLIP and add it as part of a dedicated FLIP.
- If possible, I would also exclude changes to the network stack from this FLIP. Maybe we can simply say that the direct memory needed by the network stack is the framework direct memory requirement. Changing how the memory is allocated can happen in a second step. This would keep the scope of this FLIP smaller.

Cheers,
Till

On Thu, Aug 22, 2019 at 2:51 PM Xintong Song <tonysong...@gmail.com> wrote:

Hi everyone,

I just updated the FLIP document on the wiki [1], with the following changes.

- Removed the open question regarding MemorySegment allocation. As discussed, we exclude this topic from the scope of this FLIP.
- Updated the content about the JVM direct memory parameter according to recent discussions, and moved the other options to "Rejected Alternatives" for the moment.
- Added implementation steps.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Mon, Aug 19, 2019 at 7:16 PM Stephan Ewen <se...@apache.org> wrote:

@Xintong: Concerning "wait for memory users before task dispose and memory release": I agree, that's how it should be. Let's try it out.

@Xintong @Jingsong: Concerning "JVM does not wait for GC when allocating direct memory buffer": There seems to be pretty elaborate logic to free buffers when allocating new ones. See
http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643

@Till: Maybe. If we assume that the JVM default works (like going with option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I think it should be okay to set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" even if we use RocksDB. That is a big "if", though; I honestly have no idea :D It would be good to understand this, though, because it affects option (2) and option (1.2).

On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the inputs, Jingsong.

Let me try to summarize your points. Please correct me if I'm wrong.

- Memory consumers should always avoid returning memory segments to the memory manager while there are still un-cleaned structures / threads that may use the memory. Otherwise, it would cause serious problems by having multiple consumers trying to use the same memory segment.
- The JVM does not wait for GC when allocating direct memory buffers. Therefore, even if we set a proper max direct memory size limit, we may still encounter a direct memory OOM if the GC cleans memory slower than direct memory is allocated (see the sketch below for an illustration).

Am I understanding this correctly?

Thank you~

Xintong Song
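For illustration, a minimal, self-contained experiment for the second point above (a sketch of mine, not from the FLIP). Run it with a small limit such as -XX:MaxDirectMemorySize=64m; whether it survives depends entirely on whether the GC reclaims the unreachable buffers faster than new ones are allocated.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectMemoryPressure {
    public static void main(String[] args) {
        List<ByteBuffer> retained = new ArrayList<>();
        try {
            for (int i = 0; ; i++) {
                // Allocate 1 MB of direct memory per iteration.
                ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
                if (i % 4 == 0) {
                    retained.add(buf); // a quarter of the buffers stay reachable
                }
                // The remaining buffers become garbage immediately, but their
                // native memory is only reclaimed once the GC runs their
                // cleaners. If allocation outpaces reclamation, the JVM throws
                // an OutOfMemoryError ("Direct buffer memory") even though most
                // of the reserved memory is already unreachable.
            }
        } catch (OutOfMemoryError e) {
            System.err.println("Direct memory exhausted after retaining "
                    + retained.size() + " MB: " + e.getMessage());
        }
    }
}
```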
On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <lzljs3620...@aliyun.com.invalid> wrote:

Hi Stephan:

About option 2: if additional threads are not cleanly shut down before we exit the task: in the current case of memory reuse, the task has already freed up the memory it uses. If this memory is then used by other tasks while asynchronous threads of the exited task may still be writing to it, there will be concurrency safety problems, and even errors in the user's computation results.

So I think this is a serious and intolerable bug. No matter which option we choose, it should be avoided.

About direct memory cleaned by GC: I don't think it is a good idea. I've encountered many situations where GC kicked in too late, causing a DirectMemory OOM. Releasing and allocating DirectMemory depends on the type of user job, which is often beyond our control.

Best,
Jingsong Lee

------------------------------------------------------------------
From: Stephan Ewen <se...@apache.org>
Send Time: Mon, Aug 19, 2019, 15:56
To: dev <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

My main concern with option 2 (manually release memory) is that segfaults in the JVM set off all sorts of alarms on user ends. So we need to guarantee that this never happens.

The trickiness is in tasks that use data structures / algorithms with additional threads, like hash table spill/read and sorting threads. We need to ensure that these cleanly shut down before we can exit the task. I am not sure that we have that guaranteed already; that's why option 1.1 seemed simpler to me.

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the comments, Stephan. Summarizing it this way really makes things easier to understand.

I'm in favor of option 2, at least for the moment. I think it is not that difficult to keep the memory manager segfault safe, as long as we always de-allocate the memory segment when it is released by the memory consumers. Only if a memory consumer continues using the buffer of a memory segment after releasing it do we want the job to fail, so that we detect the memory leak early.

For option 1.2, I don't think this is a good idea. Not only may the assumption (regular GC is enough to clean direct buffers) not always hold, it also makes it harder to find problems in cases of memory overuse. E.g., say the user configures some direct memory for the user libraries. If the libraries actually use more direct memory than configured, which cannot be cleaned by GC because it is still in use, this may lead to overuse of the total container memory. In that case, if the usage never hits the JVM default max direct memory limit, we get no direct memory OOM, and it becomes very hard to understand which part of the configuration needs to be updated.
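As an aside on diagnosing such cases: the JDK exposes direct buffer usage through BufferPoolMXBean, which can help attribute container-level overuse to direct buffers even when no direct OOM is thrown. A minimal diagnostic sketch (illustrative only, not part of the FLIP):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectPoolStats {
    public static void main(String[] args) {
        // The JDK exposes "direct" and "mapped" buffer pools via JMX. The
        // "direct" pool reflects memory accounted against
        // -XX:MaxDirectMemorySize; memory obtained via Unsafe.allocateMemory()
        // does NOT show up here.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```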
For option 1.1, there is a similar problem as with 1.2 if the excess direct memory usage does not reach the max direct memory limit specified by the dedicated parameter. I think it is slightly better than 1.2, though, only because we can tune the parameter.

Thank you~

Xintong Song

On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <se...@apache.org> wrote:

About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize it a bit differently:

We have the following two options:

(1) We let MemorySegments be de-allocated by the GC. That makes it segfault safe. But then we need a way to trigger GC in case de-allocation and re-allocation of a bunch of segments happens quickly, which is often the case during batch scheduling or task restart.
- The "-XX:MaxDirectMemorySize" parameter (option 1.1) is one way to do this.
- Another way could be dedicated bookkeeping in the MemoryManager (option 1.2), so that this threshold is a number independent of the "-XX:MaxDirectMemorySize" parameter.

(2) We manually allocate and de-allocate the memory for the MemorySegments (option 2). That way we need not worry about triggering GC via some threshold or bookkeeping, but it is harder to prevent segfaults. We need to be very careful about when we release the memory segments (only in the cleanup phase of the main thread). (A sketch of this follows at the end of this mail.)

If we go with option 1.1, we probably need to set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" and have "direct_memory" as a separate reserved memory pool. Because if we just set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + jvm_overhead", then there will be times when that entire memory is allocated by direct buffers and we have nothing left for the JVM overhead. So we either need a way to compensate for that (again, some safety-margin cutoff value) or we will exceed the container memory.

If we go with option 1.2, we need to be aware that it takes elaborate logic to push recycling of direct buffers without always triggering a full GC.

My first guess is that the options will be easiest to do in the following order:

- Option 1.1 with a dedicated direct_memory parameter, as discussed above. We would need to find a way to set the direct_memory parameter by default. We could start with 64 MB and see how it goes in practice. One danger I see is that setting this too low can cause a bunch of additional GCs compared to before (we need to watch this carefully).
- Option 2. It is actually quite simple to implement; we could try out how segfault safe we are at the moment.
- Option 1.2: We would not touch the "-XX:MaxDirectMemorySize" parameter at all and assume that all the direct memory allocations that the JVM and Netty do are infrequent enough to be cleaned up fast enough through regular GC. I am not sure if that is a valid assumption, though.

Best,
Stephan
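To make option (2) concrete, here is a minimal sketch of manual allocation and release with sun.misc.Unsafe. It is a hypothetical illustration, not Flink's actual MemorySegment code; the comments mark where the segfault risk comes from.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class UnsafeSegment {
    private static final Unsafe UNSAFE = getUnsafe();

    private final long address;
    private final long size;
    private boolean freed;

    UnsafeSegment(long size) {
        this.size = size;
        // Native allocation: not tracked by -XX:MaxDirectMemorySize and
        // never reclaimed by the GC. The owner must call free() exactly once.
        this.address = UNSAFE.allocateMemory(size);
    }

    long getLong(long offset) {
        // If free() already ran and another consumer reuses the address,
        // a raw read here is exactly the kind of access that can corrupt
        // data or segfault the whole JVM; hence the guard.
        if (freed || offset < 0 || offset + 8 > size) {
            throw new IllegalStateException("segment freed or out of bounds");
        }
        return UNSAFE.getLong(address + offset);
    }

    void free() {
        // Must only be called after all users (including spilling and
        // sorting threads) have cleanly shut down.
        if (!freed) {
            freed = true;
            UNSAFE.freeMemory(address);
        }
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new Error("Unsafe unavailable", e);
        }
    }
}
```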
On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for sharing your opinion, Till.

I'm also in favor of alternative 2. I was wondering whether we could avoid using Unsafe.allocate() for off-heap managed memory and network memory with alternative 3. But after giving it a second thought, I think even for alternative 3 using direct memory for off-heap managed memory could cause problems.

Hi Yang,

Regarding your concern, I think what is proposed in this FLIP is to have both off-heap managed memory and network memory allocated through Unsafe.allocate(), which means they are practically native memory and not limited by the JVM max direct memory. The only parts of memory limited by the JVM max direct memory are task off-heap memory and JVM overhead, which are exactly what alternative 2 suggests to set the JVM max direct memory to.

Thank you~

Xintong Song

On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for the clarification Xintong. I understand the two alternatives now.

I would be in favour of option 2 because it makes things explicit. If we don't limit the direct memory, I fear that we might end up in a similar situation as we are currently in: The user might see that her process gets killed by the OS and does not know why this is the case. Consequently, she tries to decrease the process memory size (similar to increasing the cutoff ratio) in order to accommodate for the extra direct memory. Even worse, she tries to decrease memory budgets which are not fully used and hence won't change the overall memory consumption.

Cheers,
Till

On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <tonysong...@gmail.com> wrote:

Let me explain this with a concrete example, Till.

Let's say we have the following scenario:
Total Process Memory: 1GB
JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and Network Memory): 800MB

For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
For alternative 3, we set -XX:MaxDirectMemorySize to a very large value, let's say 1TB.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead does not exceed 200MB, then alternative 2 and alternative 3 have the same utility. Setting a larger -XX:MaxDirectMemorySize will not reduce the sizes of the other memory pools.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead can exceed 200MB, then:

- Alternative 2 suffers from frequent OOMs. To avoid that, the only thing the user can do is modify the configuration and increase JVM Direct Memory (Task Off-Heap Memory + JVM Overhead). Let's say the user increases JVM Direct Memory to 250MB; this reduces the total size of the other memory pools to 750MB, given that the total process memory remains 1GB.
- For alternative 3, there is no chance of a direct OOM. There are chances of exceeding the total process memory limit, but given that the process may not use up all the reserved native memory (Off-Heap Managed Memory, Network Memory, JVM Metaspace), if the actual direct memory usage is slightly above yet very close to 200MB, the user probably does not need to change the configuration.

Therefore, I think from the user's perspective, a feasible configuration for alternative 2 may lead to lower resource utilization compared to alternative 3.

Thank you~

Xintong Song
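Restating the arithmetic of this example in code form (values in MB, with 1 GB taken as 1000 MB to match the round numbers used above; a sketch, not Flink code):

```java
public class AlternativeBudgets {
    static long otherPools(long totalProcessMb, long maxDirectMb) {
        // Whatever is granted to -XX:MaxDirectMemorySize is carved out of
        // the fixed total process budget.
        return totalProcessMb - maxDirectMb;
    }

    public static void main(String[] args) {
        long total = 1000;
        // Alternative 2, initial config: direct memory capped at 200 MB.
        System.out.println(otherPools(total, 200));  // 800
        // Alternative 2 after direct OOMs force a bump to 250 MB: the other
        // pools shrink even if the extra 50 MB is rarely used.
        System.out.println(otherPools(total, 250));  // 750
        // Alternative 3: the cap is huge, so the other pools keep their
        // 800 MB, but overuse now surfaces as a container kill instead of
        // a direct OOM.
    }
}
```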
On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <trohrm...@apache.org> wrote:

I guess you have to help me understand the difference between alternatives 2 and 3 wrt. memory under-utilization, Xintong.

- Alternative 2: set -XX:MaxDirectMemorySize to Task Off-Heap Memory and JVM Overhead. Then there is the risk that this size is too low, resulting in a lot of garbage collection and potentially an OOM.
- Alternative 3: set -XX:MaxDirectMemorySize to something larger than alternative 2. This would of course reduce the sizes of the other memory types.

How would alternative 2 now result in an under-utilization of memory compared to alternative 3? If alternative 3 strictly sets a higher max direct memory size and we use only little of it, then I would expect that alternative 3 results in memory under-utilization.

Cheers,
Till

On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <danrtsey...@gmail.com> wrote:

Hi Xintong, Till,

Native and Direct Memory

My point is about setting a very large max direct memory size when we do not differentiate direct and native memory. If the direct memory, including user direct memory and framework direct memory, could be calculated correctly, then I am in favor of setting direct memory to a fixed value.

Memory Calculation

I agree with Xintong. For Yarn and K8s, we need to check the memory configurations in the client to avoid submitting successfully and then failing in the Flink master.

Best,
Yang

On Tue, Aug 13, 2019 at 10:07 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for replying, Till.

About MemorySegment, I think you are right that we should not include this issue in the scope of this FLIP. This FLIP should concentrate on how to configure memory pools for TaskExecutors, with minimum involvement in how memory consumers use them.

About direct memory, I think alternative 3 may not have the same over-reservation issue that alternative 2 does, but at the cost of risking memory overuse at the container level, which is not good. My point is that both "Task Off-Heap Memory" and "JVM Overhead" are not easy to configure.
For alternative 2, users might configure them higher than actually needed, just to avoid getting a direct OOM. For alternative 3, users do not get direct OOMs, so they may not configure the two options aggressively high. But the consequence is the risk that the overall container memory usage exceeds the budget.

Thank you~

Xintong Song

On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for proposing this FLIP Xintong.

All in all I think it already looks quite good. Concerning the first open question about allocating memory segments, I was wondering whether this is strictly necessary to do in the context of this FLIP or whether it could be done as a follow-up. Without knowing all the details, I would be concerned that we would widen the scope of this FLIP too much, because we would have to touch all the existing call sites of the MemoryManager where we allocate memory segments (this should mainly be batch operators). The addition of the memory reservation call to the MemoryManager should not be affected by this, and I would hope that this is the only point of interaction a streaming job would have with the MemoryManager.
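For illustration, such a reservation call could look roughly like the following. The names are invented placeholders, not the interface the FLIP actually specifies:

```java
// Hypothetical sketch of a reservation-style interface on the MemoryManager.
// A consumer such as a state backend reserves a budget without receiving
// MemorySegments, and returns it when done.
public interface MemoryReservationSketch {

    /** Reserves {@code size} bytes of managed memory for the given owner. */
    void reserveMemory(Object owner, long size) throws MemoryReservationException;

    /** Returns {@code size} bytes previously reserved by the owner. */
    void releaseMemory(Object owner, long size);

    class MemoryReservationException extends Exception {
        public MemoryReservationException(String message) {
            super(message);
        }
    }
}
```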
Concerning the second open question about setting or not setting a max direct memory limit, I would also be interested in why Yang Wang thinks leaving it open would be best. My concern would be that we end up in a similar situation as we are now with the RocksDBStateBackend: if the different memory pools are not clearly separated and can spill over into a different pool, then it is quite hard to understand what exactly causes a process to get killed for using too much memory. This could then easily lead to a situation similar to what we have with the cutoff-ratio. So why not set a sane default value for max direct memory and give the user an option to increase it if they run into an OOM?

@Xintong, how would alternative 2 lead to lower memory utilization than alternative 3, where we set the direct memory to a higher value?

Cheers,
Till

On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the feedback, Yang.

Regarding your comments:

*Native and Direct Memory*
I think setting a very large max direct memory size definitely has some good sides. E.g., we do not need to worry about direct OOMs, and we don't even need to allocate managed / network memory with Unsafe.allocate(). However, there are also some downsides of doing this.

- One thing I can think of is that if a task executor container is killed due to overusing memory, it could be hard for us to know which part of the memory is overused.
- Another downside is that the JVM never triggers GC due to reaching the max direct memory limit, because the limit is too high to be reached. That means we kind of rely on heap memory pressure to trigger GC and release direct memory. That could be a problem in cases where we have high direct memory usage but not enough heap activity to trigger the GC (the sketch below illustrates this dependency).
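A minimal way to observe that dependency (an illustration, not from the thread): drop all references to a batch of direct buffers while keeping the heap quiet, and watch the "direct" buffer pool. The native memory typically stays reserved until something, here an explicit System.gc(), actually runs the buffers' cleaners.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class GcDependencyDemo {
    public static void main(String[] args) throws InterruptedException {
        BufferPoolMXBean direct = ManagementFactory
                .getPlatformMXBeans(BufferPoolMXBean.class).stream()
                .filter(p -> "direct".equals(p.getName()))
                .findFirst().orElseThrow(IllegalStateException::new);

        // Allocate 64 MB of direct buffers and immediately drop the references.
        for (int i = 0; i < 64; i++) {
            ByteBuffer.allocateDirect(1 << 20);
        }
        System.out.println("after dropping refs: " + direct.getMemoryUsed());

        // With no heap pressure, nothing prompts the collector to run the
        // buffers' cleaners, so the native memory usually remains reserved...
        Thread.sleep(1000);
        System.out.println("after waiting:       " + direct.getMemoryUsed());

        // ...until a GC actually happens (System.gc() is only a hint, but it
        // typically suffices here unless -XX:+DisableExplicitGC is set).
        System.gc();
        Thread.sleep(1000);
        System.out.println("after System.gc():   " + direct.getMemoryUsed());
    }
}
```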
Maybe you can share your reasons for preferring a very large value, in case there is anything else I have overlooked.

*Memory Calculation*
If there is any conflict between multiple configuration values that the user explicitly specified, I think we should throw an error. I think doing the checking on the client side is a good idea, so that on Yarn / K8s we can discover the problem before submitting the Flink cluster, which is always a good thing. But we cannot rely on client-side checking alone, because in a standalone cluster the TaskManagers on different machines may have different configurations, and the client does not see those. What do you think?

Thank you~

Xintong Song

On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <danrtsey...@gmail.com> wrote:

Hi Xintong,

Thanks for your detailed proposal. Once all the memory configuration options are introduced, it will be much more powerful for controlling Flink's memory usage. I just have a few questions about it.

- Native and Direct Memory

We do not differentiate user direct memory and native memory. They are all included in task off-heap memory. Right? So I don't think we could set -XX:MaxDirectMemorySize properly. I prefer leaving it at a very large value.

- Memory Calculation

If the sum of the fine-grained memory pools (network memory, managed memory, etc.) is larger than the total process memory, how do we deal with this situation? Do we need to check the memory configuration in the client?
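A minimal sketch of such a client-side check (names invented for illustration; the FLIP does not prescribe this code):

```java
// Hypothetical client-side validation: fail fast when explicitly configured
// memory pools cannot fit into the configured total process memory.
public final class MemoryConfigCheck {

    static void checkPoolsFitIntoTotal(long totalProcessMemoryMb, long... explicitPoolsMb) {
        long sum = 0;
        for (long poolMb : explicitPoolsMb) {
            sum += poolMb;
        }
        if (sum > totalProcessMemoryMb) {
            throw new IllegalArgumentException(
                    "Sum of explicitly configured memory pools (" + sum
                            + " MB) exceeds the total process memory ("
                            + totalProcessMemoryMb + " MB).");
        }
    }

    public static void main(String[] args) {
        // Fits: 800 MB of pools within a 1000 MB process budget.
        checkPoolsFitIntoTotal(1000, 300, 400, 100);
        try {
            // 1100 MB of pools cannot fit into 1000 MB.
            checkPoolsFitIntoTotal(1000, 500, 400, 200);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```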
On Wed, Aug 7, 2019 at 10:14 PM Xintong Song <tonysong...@gmail.com> wrote:

Hi everyone,

We would like to start a discussion thread on "FLIP-49: Unified Memory Configuration for TaskExecutors" [1], where we describe how to improve TaskExecutor memory configuration. The FLIP document is mostly based on an early design, "Memory Management and Configuration Reloaded" [2] by Stephan, with updates from follow-up discussions both online and offline.

This FLIP addresses several shortcomings of the current (Flink 1.9) TaskExecutor memory configuration:

- Different configuration for Streaming and Batch.
- Complex and difficult configuration of RocksDB in Streaming.
- Complicated, uncertain and hard to understand.

Key changes to solve these problems can be summarized as follows:

- Extend the memory manager to also account for memory usage by state backends.
- Modify how TaskExecutor memory is partitioned into accounted individual memory reservations and pools.
- Simplify the memory configuration options and calculation logic.

Please find more details in the FLIP wiki document [1].

(Please note that the early design doc [2] is out of sync, and it is appreciated to have the discussion in this mailing list thread.)

Looking forward to your feedback.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
[2] https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing