One note on the Environment Variables and Configuration discussion. My understanding is that passed ENV variables are added to the configuration in the "GlobalConfiguration.loadConfig()" method (or similar). To all the code inside Flink it then looks as if the data was in the config to start with; the scripts that compute the values can simply pass them to the process without actually needing to write a file.
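As a rough sketch of that idea (hypothetical helper code, not Flink's actual implementation; the class and method names are made up for illustration), the config loader could translate prefixed environment variables into config keys like this:

```java
import java.util.HashMap;
import java.util.Map;

public class EnvConfigLoader {

    /**
     * Hypothetical sketch: copy every environment variable prefixed with
     * "flink_" into the configuration, mapping underscores to dots, so that
     * "flink_taskmanager_memory_size=2g" becomes "taskmanager.memory.size: 2g".
     */
    static Map<String, String> enrichWithEnvironment(Map<String, String> env) {
        Map<String, String> config = new HashMap<>();
        for (Map.Entry<String, String> entry : env.entrySet()) {
            if (entry.getKey().startsWith("flink_")) {
                // Strip the prefix and turn the remainder into a config key.
                String key = entry.getKey()
                        .substring("flink_".length())
                        .replace('_', '.');
                config.put(key, entry.getValue());
            }
        }
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("flink_taskmanager_memory_size", "2g");
        env.put("JAVA_HOME", "/usr/lib/jvm/java-8");
        // Only the "flink_"-prefixed variable is picked up:
        // prints {taskmanager.memory.size=2g}
        System.out.println(enrichWithEnvironment(env));
    }
}
```

The startup scripts would then only need to export such variables before launching the process, with no temporary config file involved.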
For example the "GlobalConfiguration.loadConfig()" method would take any ENV variable prefixed with "flink" and add it as a config key: "flink_taskmanager_memory_size=2g" would become "taskmanager.memory.size: 2g".

On Tue, Aug 27, 2019 at 4:05 PM Xintong Song <tonysong...@gmail.com> wrote:

> Thanks for the comments, Till.
>
> I've also seen your comments on the wiki page, but let's keep the
> discussion here.
>
> - Regarding 'TaskExecutorSpecifics', what do you think about naming it
> 'TaskExecutorResourceSpecifics'?
> - Regarding passing memory configurations into task executors, I'm in
> favor of doing it via environment variables rather than configuration
> files, for the following two reasons.
>   - It is easier to keep the memory options, once calculated, unchanged
>   with environment variables than with configuration files.
>   - I'm not sure whether we should write the configuration in startup
>   scripts. Writing changes into the configuration files when running the
>   startup scripts does not sound right to me. Alternatively, we could
>   make a copy of the configuration files per Flink cluster, have the
>   task executor load from the copy, and clean up the copy after the
>   cluster is shut down, which is complicated. (I think this is also what
>   Stephan means in his comment on the wiki page?)
> - Regarding reserving memory, I think this change should be included in
> this FLIP. A big part of the motivation of this FLIP is to unify memory
> configuration for streaming / batch and make it easy to configure
> RocksDB memory. If we don't support memory reservation, then streaming
> jobs cannot use managed memory (neither on-heap nor off-heap), which
> makes this FLIP incomplete.
> - Regarding network memory, I think you are right. We probably don't
> need to change the network stack from using direct memory to using
> unsafe native memory. Network memory size is deterministic, cannot be
> reserved the way managed memory is, and cannot be overused. I think it
> also works if we simply keep using direct memory for network and
> include it in the JVM max direct memory size.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Aug 27, 2019 at 8:12 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Hi Xintong,
> >
> > thanks for addressing the comments and adding a more detailed
> > implementation plan. I have a couple of comments concerning the
> > implementation plan:
> >
> > - The name `TaskExecutorSpecifics` is not really descriptive. Choosing
> > a different name could help here.
> > - I'm not sure whether I would pass the memory configuration to the
> > TaskExecutor via environment variables. I think it would be better to
> > write it into the configuration one uses to start the TM process.
> > - If possible, I would exclude the memory reservation from this FLIP
> > and add this as part of a dedicated FLIP.
> > - If possible, then I would exclude changes to the network stack from
> > this FLIP. Maybe we can simply say that the direct memory needed by
> > the network stack is the framework direct memory requirement. Changing
> > how the memory is allocated can happen in a second step. This would
> > keep the scope of this FLIP smaller.
> >
> > Cheers,
> > Till
> >
> > On Thu, Aug 22, 2019 at 2:51 PM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I just updated the FLIP document on wiki [1], with the following
> > > changes.
> > >
> > > - Removed the open question regarding MemorySegment allocation. As
> > > discussed, we exclude this topic from the scope of this FLIP.
> > > - Updated the content about the JVM direct memory parameter
> > > according to recent discussions, and moved the other options to
> > > "Rejected Alternatives" for the moment.
> > > - Added implementation steps.
> > > > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > [1] > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > > On Mon, Aug 19, 2019 at 7:16 PM Stephan Ewen <se...@apache.org> wrote: > > > > > > > @Xintong: Concerning "wait for memory users before task dispose and > > > memory > > > > release": I agree, that's how it should be. Let's try it out. > > > > > > > > @Xintong @Jingsong: Concerning " JVM does not wait for GC when > > allocating > > > > direct memory buffer": There seems to be pretty elaborate logic to > free > > > > buffers when allocating new ones. See > > > > > > > > > > > > > > http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643 > > > > > > > > @Till: Maybe. If we assume that the JVM default works (like going > with > > > > option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I > > think > > > it > > > > should be okay to set "-XX:MaxDirectMemorySize" to > > > > "off_heap_managed_memory + direct_memory" even if we use RocksDB. > That > > > is a > > > > big if, though, I honestly have no idea :D Would be good to > understand > > > > this, though, because this would affect option (2) and option (1.2). > > > > > > > > On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <tonysong...@gmail.com> > > > > wrote: > > > > > > > > > Thanks for the inputs, Jingsong. > > > > > > > > > > Let me try to summarize your points. Please correct me if I'm > wrong. > > > > > > > > > > - Memory consumers should always avoid returning memory segments > > to > > > > > memory manager while there are still un-cleaned structures / > > threads > > > > > that > > > > > may use the memory. Otherwise, it would cause serious problems > by > > > > having > > > > > multiple consumers trying to use the same memory segment. > > > > > - JVM does not wait for GC when allocating direct memory buffer. 
> > > > > Therefore even if we set a proper max direct memory size limit, we
> > > > > may still encounter direct memory OOM if the GC cleans memory more
> > > > > slowly than the direct memory is allocated.
> > > > >
> > > > > Am I understanding this correctly?
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <lzljs3620...@aliyun.com
> > > > > .invalid> wrote:
> > > > >
> > > > > > Hi Stephan:
> > > > > >
> > > > > > About option 2:
> > > > > >
> > > > > > If additional threads are not cleanly shut down before we exit
> > > > > > the task: in the current case of memory reuse, the task has freed
> > > > > > up the memory it uses. If this memory is used by other tasks
> > > > > > while asynchronous threads of the exited task may still be
> > > > > > writing, there will be concurrency safety problems, and even
> > > > > > errors in user computing results.
> > > > > >
> > > > > > So I think this is a serious and intolerable bug; no matter what
> > > > > > the option is, it should be avoided.
> > > > > >
> > > > > > About direct memory cleaned by GC:
> > > > > > I don't think it is a good idea; I've encountered many situations
> > > > > > where GC was too late and caused DirectMemory OOM. Releasing and
> > > > > > allocating DirectMemory depends on the type of user job, which is
> > > > > > often beyond our control.
> > > > > > Best,
> > > > > > Jingsong Lee
> > > > > >
> > > > > > ------------------------------------------------------------------
> > > > > > From:Stephan Ewen <se...@apache.org>
> > > > > > Send Time:Aug 19, 2019 (Monday) 15:56
> > > > > > To:dev <dev@flink.apache.org>
> > > > > > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > > > > > TaskExecutors
> > > > > >
> > > > > > My main concern with option 2 (manually release memory) is that
> > > > > > segfaults in the JVM send off all sorts of alarms on user ends.
> > > > > > So we need to guarantee that this never happens.
> > > > > >
> > > > > > The trickiness is in tasks that use data structures / algorithms
> > > > > > with additional threads, like hash table spill/read and sorting
> > > > > > threads. We need to ensure that these cleanly shut down before we
> > > > > > can exit the task.
> > > > > > I am not sure that we have that guaranteed already, that's why
> > > > > > option 1.1 seemed simpler to me.
> > > > > >
> > > > > > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <tonysong...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the comments, Stephan. Summarizing it this way
> > > > > > > really makes things easier to understand.
> > > > > > >
> > > > > > > I'm in favor of option 2, at least for the moment. I think it
> > > > > > > is not that difficult to keep the memory manager segfault safe,
> > > > > > > as long as we always de-allocate the memory segment when it is
> > > > > > > released from the memory consumers. Problems only arise if a
> > > > > > > memory consumer continues using the buffer of a memory segment
> > > > > > > after releasing it, in which case we do want the job to fail so
> > > > > > > that we detect the memory leak early.
> > > > > > >
> > > > > > > For option 1.2, I don't think this is a good idea.
> > > > > > > Not only may the assumption (that regular GC is enough to clean
> > > > > > > direct buffers) not always be true, it also makes it harder to
> > > > > > > find problems in cases of memory overuse. E.g., say the user
> > > > > > > configured some direct memory for the user libraries. If the
> > > > > > > libraries actually use more direct memory than configured,
> > > > > > > which cannot be cleaned by GC because it is still in use, this
> > > > > > > may lead to overuse of the total container memory. In that
> > > > > > > case, if we didn't touch the JVM default max direct memory
> > > > > > > limit, we cannot get a direct memory OOM and it will become
> > > > > > > super hard to understand which part of the configuration needs
> > > > > > > to be updated.
> > > > > > >
> > > > > > > Option 1.1 has a similar problem as 1.2 if the exceeding direct
> > > > > > > memory does not reach the max direct memory limit specified by
> > > > > > > the dedicated parameter. I think it is slightly better than
> > > > > > > 1.2, only because we can tune the parameter.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <se...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me
> > > > > > > > summarize it a bit differently:
> > > > > > > >
> > > > > > > > We have the following two options:
> > > > > > > >
> > > > > > > > (1) We let MemorySegments be de-allocated by the GC. That
> > > > > > > > makes it segfault safe.
But then we need a way to trigger GC in case > > de-allocation > > > > and > > > > > > > > re-allocation of a bunch of segments happens quickly, which > is > > > > often > > > > > > the > > > > > > > > case during batch scheduling or task restart. > > > > > > > > - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to > do > > > > this > > > > > > > > - Another way could be to have a dedicated bookkeeping in > the > > > > > > > > MemoryManager (option 1.2), so that this is a number > > independent > > > of > > > > > the > > > > > > > > "-XX:MaxDirectMemorySize" parameter. > > > > > > > > > > > > > > > > (2) We manually allocate and de-allocate the memory for the > > > > > > > MemorySegments > > > > > > > > (option 2). That way we need not worry about triggering GC by > > > some > > > > > > > > threshold or bookkeeping, but it is harder to prevent > > segfaults. > > > We > > > > > > need > > > > > > > to > > > > > > > > be very careful about when we release the memory segments > (only > > > in > > > > > the > > > > > > > > cleanup phase of the main thread). > > > > > > > > > > > > > > > > If we go with option 1.1, we probably need to set > > > > > > > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + > > > > > direct_memory" > > > > > > > and > > > > > > > > have "direct_memory" as a separate reserved memory pool. > > Because > > > if > > > > > we > > > > > > > just > > > > > > > > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + > > > > > > > jvm_overhead", > > > > > > > > then there will be times when that entire memory is allocated > > by > > > > > direct > > > > > > > > buffers and we have nothing left for the JVM overhead. So we > > > either > > > > > > need > > > > > > > a > > > > > > > > way to compensate for that (again some safety margin cutoff > > > value) > > > > or > > > > > > we > > > > > > > > will exceed container memory. 
> > > > > > > > If we go with option 1.2, we need to be aware that it takes
> > > > > > > > elaborate logic to push recycling of direct buffers without
> > > > > > > > always triggering a full GC.
> > > > > > > >
> > > > > > > > My first guess is that the options will be easiest to do in
> > > > > > > > the following order:
> > > > > > > >
> > > > > > > > - Option 1.1 with a dedicated direct_memory parameter, as
> > > > > > > > discussed above. We would need to find a way to set the
> > > > > > > > direct_memory parameter by default. We could start with 64 MB
> > > > > > > > and see how it goes in practice. One danger I see is that
> > > > > > > > setting this too low can cause a bunch of additional GCs
> > > > > > > > compared to before (we need to watch this carefully).
> > > > > > > >
> > > > > > > > - Option 2. It is actually quite simple to implement, we
> > > > > > > > could try how segfault safe we are at the moment.
> > > > > > > >
> > > > > > > > - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > > > > > > > parameter at all and assume that all the direct memory
> > > > > > > > allocations that the JVM and Netty do are infrequent enough
> > > > > > > > to be cleaned up fast enough through regular GC. I am not
> > > > > > > > sure if that is a valid assumption, though.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Stephan
> > > > > > > >
> > > > > > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <tonysong...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for sharing your opinion Till.
> > > > > > > > >
> > > > > > > > > I'm also in favor of alternative 2.
> > > > > > > > > I was wondering whether we can avoid using
> > > > > > > > > Unsafe.allocate() for off-heap managed memory and network
> > > > > > > > > memory with alternative 3. But after giving it a second
> > > > > > > > > thought, I think even for alternative 3, using direct
> > > > > > > > > memory for off-heap managed memory could cause problems.
> > > > > > > > >
> > > > > > > > > Hi Yang,
> > > > > > > > >
> > > > > > > > > Regarding your concern, what is proposed in this FLIP is to
> > > > > > > > > have both off-heap managed memory and network memory
> > > > > > > > > allocated through Unsafe.allocate(), which means they are
> > > > > > > > > practically native memory and not limited by the JVM max
> > > > > > > > > direct memory. The only parts of memory limited by the JVM
> > > > > > > > > max direct memory are task off-heap memory and JVM
> > > > > > > > > overhead, which is exactly what alternative 2 suggests
> > > > > > > > > setting the JVM max direct memory to.
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <trohrm...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the clarification Xintong. I understand the
> > > > > > > > > > two alternatives now.
> > > > > > > > > >
> > > > > > > > > > I would be in favour of option 2 because it makes things
> > > > > > > > > > explicit.
> > > > > > If > > > > > > > > we > > > > > > > > > > don't limit the direct memory, I fear that we might end > up > > > in a > > > > > > > similar > > > > > > > > > > situation as we are currently in: The user might see that > > her > > > > > > process > > > > > > > > > gets > > > > > > > > > > killed by the OS and does not know why this is the case. > > > > > > > Consequently, > > > > > > > > > she > > > > > > > > > > tries to decrease the process memory size (similar to > > > > increasing > > > > > > the > > > > > > > > > cutoff > > > > > > > > > > ratio) in order to accommodate for the extra direct > memory. > > > > Even > > > > > > > worse, > > > > > > > > > she > > > > > > > > > > tries to decrease memory budgets which are not fully used > > and > > > > > hence > > > > > > > > won't > > > > > > > > > > change the overall memory consumption. > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > Till > > > > > > > > > > > > > > > > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song < > > > > > > tonysong...@gmail.com > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Let me explain this with a concrete example Till. > > > > > > > > > > > > > > > > > > > > > > Let's say we have the following scenario. > > > > > > > > > > > > > > > > > > > > > > Total Process Memory: 1GB > > > > > > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM > Overhead): > > > > 200MB > > > > > > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap > > > > Managed > > > > > > > Memory > > > > > > > > > and > > > > > > > > > > > Network Memory): 800MB > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to > > 200MB. > > > > > > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a > > very > > > > > large > > > > > > > > > value, > > > > > > > > > > > let's say 1TB. 
> > > > > > > > > > > If the actual direct memory usage of Task Off-Heap
> > > > > > > > > > > Memory and JVM Overhead does not exceed 200MB, then
> > > > > > > > > > > alternative 2 and alternative 3 should have the same
> > > > > > > > > > > utility. Setting a larger -XX:MaxDirectMemorySize will
> > > > > > > > > > > not reduce the sizes of the other memory pools.
> > > > > > > > > > >
> > > > > > > > > > > If the actual direct memory usage of Task Off-Heap
> > > > > > > > > > > Memory and JVM Overhead potentially exceeds 200MB, then
> > > > > > > > > > >
> > > > > > > > > > > - Alternative 2 suffers from frequent OOM. To avoid
> > > > > > > > > > > that, the only thing the user can do is to modify the
> > > > > > > > > > > configuration and increase JVM Direct Memory (Task
> > > > > > > > > > > Off-Heap Memory + JVM Overhead). Let's say the user
> > > > > > > > > > > increases JVM Direct Memory to 250MB; this will reduce
> > > > > > > > > > > the total size of the other memory pools to 750MB,
> > > > > > > > > > > given that the total process memory remains 1GB.
> > > > > > > > > > > - For alternative 3, there is no chance of direct OOM.
> > > > > > > > > > > There are chances of exceeding the total process
> > > > > > > > > > > memory limit, but given that the process may not use up
> > > > > > > > > > > all the reserved native memory (Off-Heap Managed
> > > > > > > > > > > Memory, Network Memory, JVM Metaspace), if the actual
> > > > > > > > > > > direct memory usage is slightly above yet very close to
> > > > > > > > > > > 200MB, the user probably does not need to change the
> > > > > > > > > > > configurations.
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I think from the user's perspective, a
> > > > > > > > > > > feasible configuration for alternative 2 may lead to
> > > > > > > > > > > lower resource utilization compared to alternative 3.
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <trohrm...@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I guess you have to help me understand the difference
> > > > > > > > > > > > between alternatives 2 and 3 wrt memory
> > > > > > > > > > > > under-utilization, Xintong.
> > > > > > > > > > > >
> > > > > > > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task
> > > > > > > > > > > > Off-Heap Memory and JVM Overhead. Then there is the
> > > > > > > > > > > > risk that this size is too low, resulting in a lot of
> > > > > > > > > > > > garbage collection and potentially an OOM.
> > > > > > > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to
> > > > > > > > > > > > something larger than alternative 2. This would of
> > > > > > > > > > > > course reduce the sizes of the other memory types.
> > > > > > > > > > > >
> > > > > > > > > > > > How would alternative 2 now result in an
> > > > > > > > > > > > under-utilization of memory compared to alternative
> > > > > > > > > > > > 3? If alternative 3 strictly sets a higher max direct
> > > > > > > > > > > > memory size and we use only little, then I would
> > > > > > > > > > > > expect that alternative 3 results in memory
> > > > > > > > > > > > under-utilization.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Till
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <danrtsey...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Xintong, Till,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Native and Direct Memory
> > > > > > > > > > > > >
> > > > > > > > > > > > > My point is setting a very large max direct memory
> > > > > > > > > > > > > size when we do not differentiate direct and native
> > > > > > > > > > > > > memory. If the direct memory, including user direct
> > > > > > > > > > > > > memory and framework direct memory, could be
> > > > > > > > > > > > > calculated correctly, then I am in favor of setting
> > > > > > > > > > > > > direct memory to a fixed value.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Memory Calculation
> > > > > > > > > > > > >
> > > > > > > > > > > > > I agree with Xintong.
> > > > > > > > > > > > > For Yarn and K8s, we need to check the memory
> > > > > > > > > > > > > configurations in the client to avoid submitting
> > > > > > > > > > > > > successfully and then failing in the Flink master.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Yang
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song <tonysong...@gmail.com> wrote on Tue,
> > > > > > > > > > > > > Aug 13, 2019 at 22:07:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > About MemorySegment, I think you are right that
> > > > > > > > > > > > > > we should not include this issue in the scope of
> > > > > > > > > > > > > > this FLIP. This FLIP should concentrate on how to
> > > > > > > > > > > > > > configure memory pools for TaskExecutors, with
> > > > > > > > > > > > > > minimal involvement in how memory consumers use
> > > > > > > > > > > > > > them.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > About direct memory, I think alternative 3 may
> > > > > > > > > > > > > > not have the same over-reservation issue that
> > > > > > > > > > > > > > alternative 2 does, but at the cost of risking
> > > > > > > > > > > > > > memory overuse at the container level, which is
> > > > > > > > > > > > > > not good. My point is that both "Task Off-Heap
> > > > > > > > > > > > > > Memory" and "JVM Overhead" are not easy to
> > > > > > > > > > > > > > configure.
> > > > > > > > > > > > > > For alternative 2, users might configure them
> > > > > > > > > > > > > > higher than what is actually needed, just to
> > > > > > > > > > > > > > avoid getting a direct OOM. For alternative 3,
> > > > > > > > > > > > > > users do not get direct OOM, so they may not
> > > > > > > > > > > > > > configure the two options aggressively high. But
> > > > > > > > > > > > > > the consequence is the risk that overall
> > > > > > > > > > > > > > container memory usage exceeds the budget.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <trohrm...@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All in all I think it already looks quite good.
> > > > > > > > > > > > > > > Concerning the first open question about
> > > > > > > > > > > > > > > allocating memory segments, I was wondering
> > > > > > > > > > > > > > > whether this is strictly necessary to do in the
> > > > > > > > > > > > > > > context of this FLIP or whether this could be
> > > > > > > > > > > > > > > done as a follow up?
Without knowing all > > > details, > > > > I > > > > > > > would > > > > > > > > be > > > > > > > > > > > > > concerned > > > > > > > > > > > > > > > that we would widen the scope of this FLIP too > > much > > > > > > because > > > > > > > > we > > > > > > > > > > > would > > > > > > > > > > > > > have > > > > > > > > > > > > > > > to touch all the existing call sites of the > > > > > MemoryManager > > > > > > > > where > > > > > > > > > > we > > > > > > > > > > > > > > allocate > > > > > > > > > > > > > > > memory segments (this should mainly be batch > > > > > operators). > > > > > > > The > > > > > > > > > > > addition > > > > > > > > > > > > > of > > > > > > > > > > > > > > > the memory reservation call to the > MemoryManager > > > > should > > > > > > not > > > > > > > > be > > > > > > > > > > > > affected > > > > > > > > > > > > > > by > > > > > > > > > > > > > > > this and I would hope that this is the only > point > > > of > > > > > > > > > interaction > > > > > > > > > > a > > > > > > > > > > > > > > > streaming job would have with the > MemoryManager. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Concerning the second open question about > setting > > > or > > > > > not > > > > > > > > > setting > > > > > > > > > > a > > > > > > > > > > > > max > > > > > > > > > > > > > > > direct memory limit, I would also be interested > > why > > > > > Yang > > > > > > > Wang > > > > > > > > > > > thinks > > > > > > > > > > > > > > > leaving it open would be best. My concern about > > > this > > > > > > would > > > > > > > be > > > > > > > > > > that > > > > > > > > > > > we > > > > > > > > > > > > > > would > > > > > > > > > > > > > > > be in a similar situation as we are now with > the > > > > > > > > > > > RocksDBStateBackend. 
> > > > > > > > > > > > > > > If the different memory pools are not clearly
> > > > > > > > > > > > > > > separated and can spill over to a different
> > > > > > > > > > > > > > > pool, then it is quite hard to understand what
> > > > > > > > > > > > > > > exactly causes a process to get killed for
> > > > > > > > > > > > > > > using too much memory. This could then easily
> > > > > > > > > > > > > > > lead to a situation similar to what we have
> > > > > > > > > > > > > > > with the cutoff-ratio. So why not set a sane
> > > > > > > > > > > > > > > default value for max direct memory and give
> > > > > > > > > > > > > > > the user an option to increase it if he runs
> > > > > > > > > > > > > > > into an OOM?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Xintong, how would alternative 2 lead to lower
> > > > > > > > > > > > > > > memory utilization than alternative 3, where we
> > > > > > > > > > > > > > > set the direct memory to a higher value?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Till
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <tonysong...@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > > > > > I think setting a very large max direct
> > > > > > > > > > > > > > > > memory size definitely has some good sides.
> > > > > > > > > > > > > > > > E.g., we do not worry about direct OOM, and
> > > > > > > > > > > > > > > > we don't even need to allocate managed /
> > > > > > > > > > > > > > > > network memory with Unsafe.allocate().
> > > > > > > > > > > > > > > > However, there are also some down sides of
> > > > > > > > > > > > > > > > doing this.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - One thing I can think of is that if a task
> > > > > > > > > > > > > > > > executor container is killed due to overusing
> > > > > > > > > > > > > > > > memory, it could be hard for us to know which
> > > > > > > > > > > > > > > > part of the memory is overused.
> > > > > > > > > > > > > > > > - Another down side is that the JVM never
> > > > > > > > > > > > > > > > triggers GC due to reaching the max direct
> > > > > > > > > > > > > > > > memory limit, because the limit is too high
> > > > > > > > > > > > > > > > to be reached. That means we kind of rely on
> > > > > > > > > > > > > > > > heap memory to trigger GC and release direct
> > > > > > > > > > > > > > > > memory.
That could be a problem in cases > > where > > > > we > > > > > > have > > > > > > > > > more > > > > > > > > > > > > direct > > > > > > > > > > > > > > > > memory > > > > > > > > > > > > > > > > usage but not enough heap activity to > > trigger > > > > the > > > > > > GC. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Maybe you can share your reasons for > preferring > > > > > > setting a > > > > > > > > > very > > > > > > > > > > > > large > > > > > > > > > > > > > > > value, > > > > > > > > > > > > > > > > if there are anything else I overlooked. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *Memory Calculation* > > > > > > > > > > > > > > > > If there is any conflict between multiple > > > > > configuration > > > > > > > > that > > > > > > > > > > user > > > > > > > > > > > > > > > > explicitly specified, I think we should throw > > an > > > > > error. > > > > > > > > > > > > > > > > I think doing checking on the client side is > a > > > good > > > > > > idea, > > > > > > > > so > > > > > > > > > > that > > > > > > > > > > > > on > > > > > > > > > > > > > > > Yarn / > > > > > > > > > > > > > > > > K8s we can discover the problem before > > submitting > > > > the > > > > > > > Flink > > > > > > > > > > > > cluster, > > > > > > > > > > > > > > > which > > > > > > > > > > > > > > > > is always a good thing. > > > > > > > > > > > > > > > > But we can not only rely on the client side > > > > checking, > > > > > > > > because > > > > > > > > > > for > > > > > > > > > > > > > > > > standalone cluster TaskManagers on different > > > > machines > > > > > > may > > > > > > > > > have > > > > > > > > > > > > > > different > > > > > > > > > > > > > > > > configurations and the client does see that. > > > > > > > > > > > > > > > > What do you think? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang < > > > > > > > > > > danrtsey...@gmail.com> > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi xintong, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for your detailed proposal. After > all > > > the > > > > > > memory > > > > > > > > > > > > > configuration > > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > > introduced, it will be more powerful to > > control > > > > the > > > > > > > flink > > > > > > > > > > > memory > > > > > > > > > > > > > > > usage. I > > > > > > > > > > > > > > > > > just have few questions about it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Native and Direct Memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We do not differentiate user direct memory > > and > > > > > native > > > > > > > > > memory. > > > > > > > > > > > > They > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > included in task off-heap memory. Right? > So i > > > > don’t > > > > > > > think > > > > > > > > > we > > > > > > > > > > > > could > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > set > > > > > > > > > > > > > > > > > the -XX:MaxDirectMemorySize properly. I > > prefer > > > > > > leaving > > > > > > > > it a > > > > > > > > > > > very > > > > > > > > > > > > > > large > > > > > > > > > > > > > > > > > value. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Memory Calculation > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If the sum of and fine-grained > memory(network > > > > > memory, > > > > > > > > > managed > > > > > > > > > > > > > memory, > > > > > > > > > > > > > > > > etc.) > > > > > > > > > > > > > > > > > is larger than total process memory, how do > > we > > > > deal > > > > > > > with > > > > > > > > > this > > > > > > > > > > > > > > > situation? > > > > > > > > > > > > > > > > Do > > > > > > > > > > > > > > > > > we need to check the memory configuration > in > > > > > client? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song <tonysong...@gmail.com> > > > > 于2019年8月7日周三 > > > > > > > > > 下午10:14写道: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We would like to start a discussion > thread > > on > > > > > > > "FLIP-49: > > > > > > > > > > > Unified > > > > > > > > > > > > > > > Memory > > > > > > > > > > > > > > > > > > Configuration for TaskExecutors"[1], > where > > we > > > > > > > describe > > > > > > > > > how > > > > > > > > > > to > > > > > > > > > > > > > > improve > > > > > > > > > > > > > > > > > > TaskExecutor memory configurations. The > > FLIP > > > > > > document > > > > > > > > is > > > > > > > > > > > mostly > > > > > > > > > > > > > > based > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > an > > > > > > > > > > > > > > > > > > early design "Memory Management and > > > > Configuration > > > > > > > > > > > Reloaded"[2] > > > > > > > > > > > > by > > > > > > > > > > > > > > > > > Stephan, > > > > > > > > > > > > > > > > > > with updates from follow-up discussions > > both > > > > > online > > > > > > > and > > > > > > > > > > > > offline. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This FLIP addresses several shortcomings > of > > > > > current > > > > > > > > > (Flink > > > > > > > > > > > 1.9) > > > > > > > > > > > > > > > > > > TaskExecutor memory configuration. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Different configuration for > Streaming > > > and > > > > > > Batch. > > > > > > > > > > > > > > > > > > - Complex and difficult configuration > of > > > > > RocksDB > > > > > > > in > > > > > > > > > > > > Streaming. > > > > > > > > > > > > > > > > > > - Complicated, uncertain and hard to > > > > > understand. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Key changes to solve the problems can be > > > > > summarized > > > > > > > as > > > > > > > > > > > follows. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Extend memory manager to also > account > > > for > > > > > > memory > > > > > > > > > usage > > > > > > > > > > > by > > > > > > > > > > > > > > state > > > > > > > > > > > > > > > > > > backends. > > > > > > > > > > > > > > > > > > - Modify how TaskExecutor memory is > > > > > partitioned > > > > > > > > > > accounted > > > > > > > > > > > > > > > individual > > > > > > > > > > > > > > > > > > memory reservations and pools. > > > > > > > > > > > > > > > > > > - Simplify memory configuration > options > > > and > > > > > > > > > calculations > > > > > > > > > > > > > logics. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please find more details in the FLIP wiki > > > > > document > > > > > > > [1]. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (Please note that the early design doc > [2] > > is > > > > out > > > > > > of > > > > > > > > > sync, > > > > > > > > > > > and > > > > > > > > > > > > it > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > > > appreciated to have the discussion in > this > > > > > mailing > > > > > > > list > > > > > > > > > > > > thread.) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to your feedbacks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
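The environment-variable mapping described at the top of this thread (a variable like flink_taskmanager_memory_size=2g becoming the config entry taskmanager.memory.size: 2g) could look roughly like the sketch below. The class and method names are assumptions for illustration only, not the actual GlobalConfiguration code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of folding "flink_"-prefixed ENV variables into the
// configuration, as discussed in this thread. Hypothetical names, not Flink code.
public class EnvConfigMapping {

    // Converts e.g. "flink_taskmanager_memory_size" -> "taskmanager.memory.size".
    static String toConfigKey(String envName) {
        return envName.substring("flink_".length()).replace('_', '.');
    }

    // Merges all qualifying ENV variables into a copy of an existing config map,
    // so that downstream code sees the values as ordinary config keys.
    static Map<String, String> mergeEnv(Map<String, String> config, Map<String, String> env) {
        Map<String, String> merged = new HashMap<>(config);
        for (Map.Entry<String, String> e : env.entrySet()) {
            if (e.getKey().startsWith("flink_")) {
                merged.put(toConfigKey(e.getKey()), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("flink_taskmanager_memory_size", "2g");
        env.put("PATH", "/usr/bin"); // ignored: no "flink_" prefix

        Map<String, String> config = mergeEnv(new HashMap<>(), env);
        System.out.println(config.get("taskmanager.memory.size")); // prints "2g"
    }
}
```

With such a mapping, the startup scripts can pass computed values to the process without writing a configuration file, and code inside Flink sees the data as if it had been in the config all along.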
On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for sharing your opinion Till.

I'm also in favor of alternative 2. I was wondering whether we can avoid using Unsafe.allocate() for off-heap managed memory and network memory with alternative 3. But after giving it a second thought, I think even for alternative 3, using direct memory for off-heap managed memory could cause problems.

Hi Yang,

Regarding your concern, what is proposed in this FLIP is to have both off-heap managed memory and network memory allocated through Unsafe.allocate(), which means they are practically native memory and not limited by the JVM max direct memory. The only parts of memory limited by the JVM max direct memory are task off-heap memory and JVM overhead, which is exactly what alternative 2 suggests setting the JVM max direct memory to.

Thank you~

Xintong Song

On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for the clarification Xintong. I understand the two alternatives now.

I would be in favour of option 2 because it makes things explicit. If we don't limit the direct memory, I fear that we might end up in a situation similar to the one we are currently in: the user might see that her process gets killed by the OS and not know why. Consequently, she tries to decrease the process memory size (similar to increasing the cutoff ratio) in order to accommodate for the extra direct memory. Even worse, she tries to decrease memory budgets which are not fully used and hence won't change the overall memory consumption.

Cheers,
Till

On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <tonysong...@gmail.com> wrote:

Let me explain this with a concrete example, Till.

Let's say we have the following scenario:

Total Process Memory: 1GB
JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and Network Memory): 800MB

For alternative 2, we set -XX:MaxDirectMemorySize to 200MB. For alternative 3, we set -XX:MaxDirectMemorySize to a very large value, let's say 1TB.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead does not exceed 200MB, then alternatives 2 and 3 have the same utility. Setting a larger -XX:MaxDirectMemorySize will not reduce the sizes of the other memory pools.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead can potentially exceed 200MB, then:

- Alternative 2 suffers from frequent OOMs. To avoid that, the only thing the user can do is modify the configuration and increase JVM Direct Memory (Task Off-Heap Memory + JVM Overhead). Say the user increases JVM Direct Memory to 250MB; this reduces the total size of the other memory pools to 750MB, given that the total process memory remains 1GB.
- For alternative 3, there is no chance of a direct OOM. There are chances of exceeding the total process memory limit, but given that the process may not use up all the reserved native memory (Off-Heap Managed Memory, Network Memory, JVM Metaspace), if the actual direct memory usage is slightly above yet very close to 200MB, the user probably does not need to change the configuration.

Therefore, I think from the user's perspective, a feasible configuration for alternative 2 may lead to lower resource utilization compared to alternative 3.

Thank you~

Xintong Song

On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <trohrm...@apache.org> wrote:

I guess you have to help me understand the difference between alternatives 2 and 3 with respect to memory under-utilization, Xintong.

- Alternative 2: set -XX:MaxDirectMemorySize to Task Off-Heap Memory plus JVM Overhead. Then there is the risk that this size is too low, resulting in a lot of garbage collection and potentially an OOM.
- Alternative 3: set -XX:MaxDirectMemorySize to something larger than in alternative 2. This would of course reduce the sizes of the other memory types.

How would alternative 2 now result in an under-utilization of memory compared to alternative 3? If alternative 3 strictly sets a higher max direct memory size and we use only little of it, then I would expect alternative 3 to result in memory under-utilization.

Cheers,
Till

On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <danrtsey...@gmail.com> wrote:

Hi Xintong, Till,

*Native and Direct Memory*
My point is about setting a very large max direct memory size when we do not differentiate direct and native memory. If the direct memory, including user direct memory and framework direct memory, could be calculated correctly, then I am in favor of setting the direct memory to a fixed value.

*Memory Calculation*
I agree with Xintong. For Yarn and K8s, we need to check the memory configurations in the client to avoid submitting successfully and then failing in the Flink master.

Best,
Yang

On Tue, Aug 13, 2019 at 10:07 PM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for replying, Till.

About MemorySegment, I think you are right that we should not include this issue in the scope of this FLIP. This FLIP should concentrate on how to configure memory pools for TaskExecutors, with minimum involvement in how memory consumers use them.

About direct memory, I think alternative 3 may not have the same over-reservation issue that alternative 2 does, but at the cost of a risk of over-using memory at the container level, which is not good. My point is that both "Task Off-Heap Memory" and "JVM Overhead" are not easy to configure. For alternative 2, users might configure them higher than actually needed, just to avoid getting a direct OOM. For alternative 3, users do not get direct OOMs, so they may not configure the two options aggressively high. But the consequence is a risk that the overall container memory usage exceeds the budget.

Thank you~

Xintong Song

On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for proposing this FLIP Xintong.

All in all I think it already looks quite good. Concerning the first open question about allocating memory segments, I was wondering whether this is strictly necessary to do in the context of this FLIP or whether it could be done as a follow-up. Without knowing all the details, I would be concerned that we would widen the scope of this FLIP too much, because we would have to touch all the existing call sites of the MemoryManager where we allocate memory segments (this should mainly be batch operators). The addition of the memory reservation call to the MemoryManager should not be affected by this, and I would hope that this is the only point of interaction a streaming job would have with the MemoryManager.

Concerning the second open question about setting or not setting a max direct memory limit, I would also be interested in why Yang Wang thinks leaving it open would be best. My concern is that we would be in a similar situation as we are now with the RocksDBStateBackend.
If the different memory pools are not clearly separated and can spill over into a different pool, then it is quite hard to understand what exactly causes a process to get killed for using too much memory. This could then easily lead to a situation similar to the one we have with the cutoff ratio. So why not set a sane default value for the max direct memory and give the user an option to increase it if he runs into an OOM?

@Xintong, how would alternative 2 lead to lower memory utilization than alternative 3, where we set the direct memory to a higher value?

Cheers,
Till

On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the feedback, Yang.

Regarding your comments:

*Native and Direct Memory*
I think setting a very large max direct memory size definitely has some good sides. E.g., we do not need to worry about direct OOMs, and we don't even need to allocate managed / network memory with Unsafe.allocate(). However, there are also some downsides to doing this.

- One thing I can think of is that if a task executor container is killed due to over-using memory, it could be hard for us to know which part of the memory is overused.
- Another downside is that the JVM never triggers GC due to reaching the max direct memory limit, because the limit is too high to be reached. That means we more or less rely on heap memory activity to trigger GC and release direct memory. That could be a problem in cases where we have more direct memory usage but not enough heap activity to trigger the GC.

Maybe you can share your reasons for preferring a very large value, in case there is anything else I overlooked.

*Memory Calculation*
If there is any conflict between multiple configurations that the user explicitly specified, I think we should throw an error.
I think doing the checking on the client side is a good idea, so that on Yarn / K8s we can discover the problem before submitting the Flink cluster, which is always a good thing. But we cannot rely only on client-side checking, because for a standalone cluster, TaskManagers on different machines may have different configurations that the client does not see.
What do you think?
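The client-side check discussed here could be sketched as below: verify that the explicitly configured fine-grained pools fit into the total process memory before submitting the cluster. The class, method, and parameter names are illustrative assumptions, not actual Flink code.

```java
// Hypothetical sketch of a client-side sanity check over memory
// configuration, as discussed in this thread. Not actual Flink code.
public class MemoryConfigCheck {

    // Throws if the sum of the fine-grained pools exceeds the total process
    // memory, so that misconfiguration is caught before cluster submission.
    static void checkMemoryConfig(
            long totalProcessMemory,
            long networkMemory,
            long managedMemory,
            long taskHeapMemory,
            long jvmMetaspace,
            long jvmOverhead) {
        long sum = networkMemory + managedMemory + taskHeapMemory + jvmMetaspace + jvmOverhead;
        if (sum > totalProcessMemory) {
            throw new IllegalArgumentException(
                    "Sum of configured memory pools (" + sum
                            + " bytes) exceeds total process memory ("
                            + totalProcessMemory + " bytes).");
        }
    }

    public static void main(String[] args) {
        long mb = 1 << 20;
        // Fits: pools sum to 800MB within a 1GB process.
        checkMemoryConfig(1024 * mb, 128 * mb, 300 * mb, 256 * mb, 96 * mb, 20 * mb);
        // Rejected: pools sum to well over the 1GB total.
        try {
            checkMemoryConfig(1024 * mb, 512 * mb, 512 * mb, 256 * mb, 96 * mb, 20 * mb);
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected inconsistent configuration");
        }
    }
}
```

As the thread notes, a check like this can only be a first line of defense: in standalone clusters, each TaskManager would still need to validate its own configuration, since the client does not see per-machine settings.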
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang < > > > > > > > > > > danrtsey...@gmail.com> > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi xintong, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for your detailed proposal. After > all > > > the > > > > > > memory > > > > > > > > > > > > > configuration > > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > > introduced, it will be more powerful to > > control > > > > the > > > > > > > flink > > > > > > > > > > > memory > > > > > > > > > > > > > > > usage. I > > > > > > > > > > > > > > > > > just have few questions about it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Native and Direct Memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We do not differentiate user direct memory > > and > > > > > native > > > > > > > > > memory. > > > > > > > > > > > > They > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > included in task off-heap memory. Right? > So i > > > > don’t > > > > > > > think > > > > > > > > > we > > > > > > > > > > > > could > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > set > > > > > > > > > > > > > > > > > the -XX:MaxDirectMemorySize properly. I > > prefer > > > > > > leaving > > > > > > > > it a > > > > > > > > > > > very > > > > > > > > > > > > > > large > > > > > > > > > > > > > > > > > value. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Memory Calculation > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If the sum of and fine-grained > memory(network > > > > > memory, > > > > > > > > > managed > > > > > > > > > > > > > memory, > > > > > > > > > > > > > > > > etc.) > > > > > > > > > > > > > > > > > is larger than total process memory, how do > > we > > > > deal > > > > > > > with > > > > > > > > > this > > > > > > > > > > > > > > > situation? > > > > > > > > > > > > > > > > Do > > > > > > > > > > > > > > > > > we need to check the memory configuration > in > > > > > client? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song <tonysong...@gmail.com> > > > > 于2019年8月7日周三 > > > > > > > > > 下午10:14写道: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We would like to start a discussion > thread > > on > > > > > > > "FLIP-49: > > > > > > > > > > > Unified > > > > > > > > > > > > > > > Memory > > > > > > > > > > > > > > > > > > Configuration for TaskExecutors"[1], > where > > we > > > > > > > describe > > > > > > > > > how > > > > > > > > > > to > > > > > > > > > > > > > > improve > > > > > > > > > > > > > > > > > > TaskExecutor memory configurations. The > > FLIP > > > > > > document > > > > > > > > is > > > > > > > > > > > mostly > > > > > > > > > > > > > > based > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > an > > > > > > > > > > > > > > > > > > early design "Memory Management and > > > > Configuration > > > > > > > > > > > Reloaded"[2] > > > > > > > > > > > > by > > > > > > > > > > > > > > > > > Stephan, > > > > > > > > > > > > > > > > > > with updates from follow-up discussions > > both > > > > > online > > > > > > > and > > > > > > > > > > > > offline. 
> > >
> > > This FLIP addresses several shortcomings of the current (Flink 1.9)
> > > TaskExecutor memory configuration:
> > >
> > > - Different configuration for Streaming and Batch.
> > > - Complex and difficult configuration of RocksDB in Streaming.
> > > - Complicated, uncertain and hard to understand.
> > >
> > > Key changes to solve the problems can be summarized as follows:
> > >
> > > - Extend the memory manager to also account for memory usage by
> > >   state backends.
> > > - Modify how TaskExecutor memory is partitioned into individual
> > >   memory reservations and pools.
> > > - Simplify memory configuration options and calculation logic.
> > >
> > > Please find more details in the FLIP wiki document [1].
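As a rough illustration of what "simplified memory configuration options" could look like in practice, here is a hedged flink-conf.yaml fragment; the exact key names are an assumption about how such options might be spelled, not something confirmed by this thread:

```
# Hypothetical flink-conf.yaml fragment illustrating simplified memory
# options; the key names are assumptions, not confirmed by this thread.
taskmanager.memory.process.size: 2g          # total process memory budget
taskmanager.memory.managed.size: 512m        # managed memory (state backends, batch operators)
taskmanager.memory.task.off-heap.size: 256m  # user direct + native memory
```

The intent of the FLIP, as described above, is that a user sets one total budget and only overrides the fine-grained components when needed.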
> > >
> > > (Please note that the early design doc [2] is out of sync, and it is
> > > appreciated to have the discussion in this mailing list thread.)
> > >
> > > Looking forward to your feedback.
> > >
> > > Thank you~
> > > Xintong Song
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > [2]
> > > https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing