+1 to Xintong's proposal!

Best,
Jark
On Tue, 5 Jan 2021 at 12:13, Xintong Song <tonysong...@gmail.com> wrote:

> +1 for allowing streaming operators to use managed memory.
>
> As for the consumer names, I'm afraid using `DATAPROC` for both streaming
> ops and state backends will not work. Currently, RocksDB state backend uses
> a shared piece of memory for all the states within that slot. It's not the
> operator's decision how much memory it uses for the states.
>
> I would suggest the following. (IIUC, the same as what Jark proposed)
> * `OPERATOR` for both streaming and batch operators
> * `STATE_BACKEND` for state backends
> * `PYTHON` for python processes
> * `DATAPROC` as a legacy key for state backend or batch operators if
>   `STATE_BACKEND` or `OPERATOR` are not specified.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Jan 5, 2021 at 11:23 AM Jark Wu <imj...@gmail.com> wrote:
>
> > Hi Aljoscha,
> >
> > I think we may need to divide `DATAPROC` into `OPERATOR` and
> > `STATE_BACKEND`, because they have different scopes (slot vs. operator).
> > But @Xintong Song <tonysong...@gmail.com> may have more insights on it.
> >
> > Best,
> > Jark
> >
> >
> > On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> >> I agree, we should allow streaming operators to use managed memory for
> >> other use cases.
> >>
> >> Do you think we need an additional "consumer" setting or that they would
> >> just use `DATAPROC` and decide by themselves what to use the memory for?
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 2020/12/22 17:14, Jark Wu wrote:
> >> >Hi all,
> >> >
> >> >I found that currently the managed memory can only be used in 3 workloads
> >> >[1]:
> >> >- state backends for streaming jobs
> >> >- sorting, hash tables for batch jobs
> >> >- python UDFs
> >> >
> >> >And the configuration option `taskmanager.memory.managed.consumer-weights`
> >> >only allows values: PYTHON and DATAPROC (state in streaming or algorithms
> >> >in batch).
> >> >I'm confused why it doesn't allow streaming operators to use managed memory
> >> >for purposes other than state backends.
> >> >
> >> >The background is that we are planning to use some batch algorithms
> >> >(sorting & bytes hash table) to improve the performance of streaming SQL
> >> >operators, especially for the mini-batch operators.
> >> >Currently, the mini-batch operators are buffering input records and
> >> >accumulators in heap (i.e. Java HashMap) which is not efficient and there
> >> >are potential risks of full GC and OOM.
> >> >With the managed memory, we can fully use the memory to buffer more data
> >> >without worrying about OOM and improve the performance a lot.
> >> >
> >> >What do you think about allowing streaming operators to use managed memory
> >> >and exposing it in configuration?
> >> >
> >> >Best,
> >> >Jark
> >> >
> >> >[1]: https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory
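
For illustration only, the legacy-key fallback in Xintong's suggestion quoted above could be sketched roughly as below. This is not Flink's implementation: the function names (`parse_weights`, `resolve_weights`, `fractions`), the `KEY:weight,KEY:weight` string format, and the rule that a consumer's share is its weight divided by the sum of weights of the consumers actually used in the slot are all assumptions made for the sketch.

```python
from typing import Dict, Set

# Consumer keys taken from the thread; everything else here is hypothetical.
LEGACY_KEY = "DATAPROC"
NEW_KEYS = ("OPERATOR", "STATE_BACKEND")


def parse_weights(raw: str) -> Dict[str, int]:
    """Parse an assumed 'KEY:weight,KEY:weight' string into a dict."""
    weights: Dict[str, int] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, value = item.split(":", 1)
        weights[key.strip().upper()] = int(value)
    return weights


def resolve_weights(configured: Dict[str, int]) -> Dict[str, int]:
    """Apply the proposed fallback: DATAPROC stands in for OPERATOR or
    STATE_BACKEND only when that key is not specified explicitly."""
    resolved = {k: v for k, v in configured.items() if k != LEGACY_KEY}
    legacy = configured.get(LEGACY_KEY)
    if legacy is not None:
        for key in NEW_KEYS:
            resolved.setdefault(key, legacy)
    return resolved


def fractions(resolved: Dict[str, int], used: Set[str]) -> Dict[str, float]:
    """Split managed memory among the consumers actually used in a slot,
    proportionally to their weights (assumed rule)."""
    total = sum(resolved.get(k, 0) for k in used)
    if total == 0:
        return {k: 0.0 for k in used}
    return {k: resolved.get(k, 0) / total for k in used}


if __name__ == "__main__":
    resolved = resolve_weights(parse_weights("DATAPROC:70,PYTHON:30"))
    print(resolved)
    print(fractions(resolved, {"STATE_BACKEND", "PYTHON"}))
```

With this rule an existing configuration that only sets `DATAPROC` and `PYTHON` keeps working, while an explicit `OPERATOR` or `STATE_BACKEND` entry takes precedence over the legacy key.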