Re: [DISCUSS] Allow streaming operators to use managed memory

Jark Wu Mon, 04 Jan 2021 20:33:23 -0800

+1 to Xintong's proposal!

Best,
Jark


On Tue, 5 Jan 2021 at 12:13, Xintong Song wrote:

> +1 for allowing streaming operators to use managed memory.
>
> As for the consumer names, I'm afraid using `DATAPROC` for both streaming
> ops and state backends will not work. Currently, RocksDB state backend uses
> a shared piece of memory for all the states within that slot. It's not the
> operator's decision how much memory it uses for the states.
>
> I would suggest the following. (IIUC, the same as what Jark proposed)
> * `OPERATOR` for both streaming and batch operators
> * `STATE_BACKEND` for state backends
> * `PYTHON` for python processes
> * `DATAPROC` as a legacy key for state backend or batch operators if
> `STATE_BACKEND` or `OPERATOR` are not specified.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Jan 5, 2021 at 11:23 AM Jark Wu wrote:
>
> > Hi Aljoscha,
> >
> > I think we may need to divide `DATAPROC` into `OPERATOR` and
> > `STATE_BACKEND`, because they have different scope (slot vs. operator).
> > But @Xintong Song may have more insights on it.
> >
> > Best,
> > Jark
> >
> >
> > On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek wrote:
> >
> >> I agree, we should allow streaming operators to use managed memory for
> >> other use cases.
> >>
> >> Do you think we need an additional "consumer" setting or that they would
> >> just use `DATAPROC` and decide by themselves what to use the memory for?
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 2020/12/22 17:14, Jark Wu wrote:
> >> >Hi all,
> >> >
> >> >I found that currently managed memory can only be used in 3 workloads
> >> >[1]:
> >> >- state backends for streaming jobs
> >> >- sorting, hash tables for batch jobs
> >> >- python UDFs
> >> >
> >> >And the configuration option
> >> >`taskmanager.memory.managed.consumer-weights` only allows the values
> >> >PYTHON and DATAPROC (state in streaming or algorithms in batch).
> >> >I'm confused why it doesn't allow streaming operators to use managed
> >> >memory for purposes other than state backends.
> >> >
> >> >The background is that we are planning to use some batch algorithms
> >> >(sorting & bytes hash table) to improve the performance of streaming
> >> >SQL operators, especially the mini-batch operators.
> >> >Currently, the mini-batch operators buffer input records and
> >> >accumulators on the heap (i.e. in a Java HashMap), which is not
> >> >efficient and carries potential risks of full GC and OOM.
> >> >With managed memory, we can buffer more data without worrying about
> >> >OOM and improve the performance a lot.
> >> >
> >> >What do you think about allowing streaming operators to use managed
> >> >memory and exposing it in configuration?
> >> >
> >> >Best,
> >> >Jark
> >> >
> >> >[1]:
> >> >https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory
> >>
> >
>
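
Note for archive readers: the key scheme proposed in this thread (`OPERATOR`, `STATE_BACKEND`, `PYTHON`, with `DATAPROC` as a legacy fallback for the first two) can be sketched as a small lookup. This is an illustration in Python only, not Flink's implementation. The function names `parse_weights` and `resolve_weight`, the comma-separated `NAME:weight` string format, and the choice to return 0 for a consumer with no configured weight are all assumptions made for the example.

```python
# Hypothetical sketch of the proposed consumer-weight resolution.
# Names and the "NAME:weight,NAME:weight" format are assumptions.

LEGACY_KEY = "DATAPROC"
# Per the proposal, only these two consumers fall back to the legacy key.
FALLBACK_TO_LEGACY = ("OPERATOR", "STATE_BACKEND")


def parse_weights(raw):
    """Parse a string such as 'DATAPROC:70,PYTHON:30' into a dict."""
    weights = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing comma
        name, _, value = entry.partition(":")
        weights[name.strip()] = int(value)
    return weights


def resolve_weight(weights, consumer):
    """Return the weight for a consumer, honoring the legacy fallback."""
    if consumer in weights:
        return weights[consumer]
    if consumer in FALLBACK_TO_LEGACY and LEGACY_KEY in weights:
        return weights[LEGACY_KEY]
    return 0  # assumption: an unconfigured consumer gets no share
```

Under these assumptions, an old-style configuration of `DATAPROC:70,PYTHON:30` resolves `OPERATOR` and `STATE_BACKEND` to 70 each, while an explicit `OPERATOR` entry takes precedence over the legacy key.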
