+1 for allowing streaming operators to use managed memory. Memory use in streaming jobs needs some hierarchy, and the bottom layer is undoubtedly the current StateBackend. Letting streaming operators use managed memory freely would unify the memory management model and give operators more room to work with.

Xintong's proposal looks good to me.

+1 to splitting `DATAPROC` into `STATE_BACKEND` and `OPERATOR`.

Best,
Jingsong

On Tue, Jan 5, 2021 at 12:33 PM Jark Wu <imj...@gmail.com> wrote:

> +1 to Xintong's proposal!
>
> Best,
> Jark
>
> On Tue, 5 Jan 2021 at 12:13, Xintong Song <tonysong...@gmail.com> wrote:
>
> > +1 for allowing streaming operators to use managed memory.
> >
> > As for the consumer names, I'm afraid using `DATAPROC` for both streaming
> > ops and state backends will not work. Currently, RocksDB state backend uses
> > a shared piece of memory for all the states within that slot. It's not the
> > operator's decision how much memory it uses for the states.
> >
> > I would suggest the following. (IIUC, the same as what Jark proposed)
> > * `OPERATOR` for both streaming and batch operators
> > * `STATE_BACKEND` for state backends
> > * `PYTHON` for Python processes
> > * `DATAPROC` as a legacy key for state backend or batch operators if
> >   `STATE_BACKEND` or `OPERATOR` are not specified.
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Tue, Jan 5, 2021 at 11:23 AM Jark Wu <imj...@gmail.com> wrote:
> >
> > > Hi Aljoscha,
> > >
> > > I think we may need to divide `DATAPROC` into `OPERATOR` and
> > > `STATE_BACKEND`, because they have different scope (slot vs. operator).
> > > But @Xintong Song <tonysong...@gmail.com> may have more insights on it.
> > >
> > > Best,
> > > Jark
> > >
> > > On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek <aljos...@apache.org> wrote:
> > >
> > >> I agree, we should allow streaming operators to use managed memory for
> > >> other use cases.
> > >>
> > >> Do you think we need an additional "consumer" setting or that they would
> > >> just use `DATAPROC` and decide by themselves what to use the memory for?
> > >>
> > >> Best,
> > >> Aljoscha
> > >>
> > >> On 2020/12/22 17:14, Jark Wu wrote:
> > >> >Hi all,
> > >> >
> > >> >I found that currently the managed memory can only be used in 3 workloads
> > >> >[1]:
> > >> >- state backends for streaming jobs
> > >> >- sorting, hash tables for batch jobs
> > >> >- Python UDFs
> > >> >
> > >> >And the configuration option `taskmanager.memory.managed.consumer-weights`
> > >> >only allows values: PYTHON and DATAPROC (state in streaming or algorithms
> > >> >in batch).
> > >> >I'm confused why it doesn't allow streaming operators to use managed memory
> > >> >for purposes other than state backends.
> > >> >
> > >> >The background is that we are planning to use some batch algorithms
> > >> >(sorting & bytes hash table) to improve the performance of streaming SQL
> > >> >operators, especially for the mini-batch operators.
> > >> >Currently, the mini-batch operators are buffering input records and
> > >> >accumulators in heap (i.e. Java HashMap) which is not efficient and there
> > >> >are potential risks of full GC and OOM.
> > >> >With the managed memory, we can fully use the memory to buffer more data
> > >> >without worrying about OOM and improve the performance a lot.
> > >> >
> > >> >What do you think about allowing streaming operators to use managed memory
> > >> >and exposing it in configuration?
> > >> >
> > >> >Best,
> > >> >Jark
> > >> >
> > >> >[1]:
> > >> >https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory

--
Best, Jingsong Lee
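
To make the fallback rule from Xintong's list concrete, here is a minimal sketch of how `DATAPROC` could stand in for `STATE_BACKEND` or `OPERATOR` when those are not specified. This is not Flink code: the function names (`parse_weights`, `resolve_weights`, `split_managed_memory`) are hypothetical, the weight values are only illustrative, and the proportional split among the consumers present in a slot is my assumption about how the weights would be applied.

```python
from typing import Dict, Iterable

LEGACY_KEY = "DATAPROC"
# Consumers that fall back to the legacy key, as proposed in this thread.
LEGACY_FALLBACK = ("STATE_BACKEND", "OPERATOR")


def parse_weights(value: str) -> Dict[str, int]:
    """Parse a 'NAME:weight,NAME:weight' string into a dict (hypothetical helper)."""
    weights: Dict[str, int] = {}
    for item in value.split(","):
        name, _, weight = item.strip().partition(":")
        weights[name.strip().upper()] = int(weight)
    return weights


def resolve_weights(configured: Dict[str, int]) -> Dict[str, int]:
    """Apply the legacy rule: DATAPROC is used for STATE_BACKEND / OPERATOR
    only when the respective key is not set explicitly."""
    resolved = {k: v for k, v in configured.items() if k != LEGACY_KEY}
    legacy = configured.get(LEGACY_KEY)
    if legacy is not None:
        for name in LEGACY_FALLBACK:
            resolved.setdefault(name, legacy)
    return resolved


def split_managed_memory(
    total_bytes: int, weights: Dict[str, int], active: Iterable[str]
) -> Dict[str, int]:
    """Assumption: memory is shared only among the consumers actually present
    in a slot, proportionally to their weights."""
    present = {n: weights[n] for n in active if weights.get(n, 0) > 0}
    total_weight = sum(present.values())
    if total_weight == 0:
        return {}
    return {n: total_bytes * w // total_weight for n, w in present.items()}


if __name__ == "__main__":
    # Old-style config with only the legacy key and PYTHON.
    weights = resolve_weights(parse_weights("DATAPROC:70,PYTHON:30"))
    print(weights)
    # A slot that runs a streaming operator and a Python process, no state backend.
    print(split_managed_memory(1000, weights, ["OPERATOR", "PYTHON"]))
```

With this rule an existing configuration that only sets `DATAPROC` keeps working, while a job that sets `OPERATOR` or `STATE_BACKEND` explicitly overrides the legacy value for that consumer only.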