Thanks all for the discussion.

I have created an issue FLINK-20860 [1] to support this.

In conclusion, we will extend the configuration
`taskmanager.memory.managed.consumer-weights` with 2 more consumer
kinds: OPERATOR and STATE_BACKEND. The available consumer kinds will be:

* `OPERATOR` for both streaming and batch operators
* `STATE_BACKEND` for state backends
* `PYTHON` for python processes
* `DATAPROC` as a legacy key for state backend or batch operators if
`STATE_BACKEND` or `OPERATOR` are not specified.

The previous default value is DATAPROC:70,PYTHON:30; the new default value
will be OPERATOR:70,STATE_BACKEND:70,PYTHON:30.

The default weights for OPERATOR and STATE_BACKEND will have the same value,
to align with the previous behavior.
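
To make the weight semantics concrete, here is a rough Python sketch (not
Flink code; all function names are made up for illustration). It assumes, as
I understand the current behavior, that managed memory is shared only among
the consumer kinds a slot actually uses, in proportion to their weights, and
that `DATAPROC` only fills in for `OPERATOR` / `STATE_BACKEND` when they are
not specified.

```python
def parse_consumer_weights(value):
    """Parse 'KIND:weight,KIND:weight' into a dict of int weights."""
    weights = {}
    for entry in value.split(","):
        kind, weight = entry.strip().split(":")
        weights[kind.strip()] = int(weight)
    return weights


def resolve_legacy(weights):
    """Use DATAPROC as a fallback for OPERATOR / STATE_BACKEND if absent.

    Assumption: explicitly configured kinds win over the legacy key.
    """
    resolved = {k: w for k, w in weights.items() if k != "DATAPROC"}
    if "DATAPROC" in weights:
        for kind in ("OPERATOR", "STATE_BACKEND"):
            resolved.setdefault(kind, weights["DATAPROC"])
    return resolved


def managed_memory_fractions(weights, used_kinds):
    """Split managed memory among the consumer kinds a slot actually uses."""
    active = {k: w for k, w in weights.items() if k in used_kinds}
    total = sum(active.values())
    if total == 0:
        return {}
    return {k: w / total for k, w in active.items()}


new_default = resolve_legacy(
    parse_consumer_weights("OPERATOR:70,STATE_BACKEND:70,PYTHON:30"))
old_default = resolve_legacy(
    parse_consumer_weights("DATAPROC:70,PYTHON:30"))

# A streaming job with a state backend and Python UDFs, but no operator
# using managed memory, is split the same way under both defaults.
print(managed_memory_fractions(new_default, {"STATE_BACKEND", "PYTHON"}))
print(managed_memory_fractions(old_default, {"STATE_BACKEND", "PYTHON"}))
```

Under these assumptions, the old and new defaults resolve to the same
weights, so existing jobs should see the same split as before.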

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-20860

On Tue, 5 Jan 2021 at 18:35, Xintong Song <tonysong...@gmail.com> wrote:

> >
> > Would the default weight for OPERATOR and STATE_BACKEND be the same
> value?
> >
>
> I would say yes, to align with previous behaviors.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Jan 5, 2021 at 5:51 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > +1 for Jark's and Xintong's proposal.
> >
> > Would the default weight for OPERATOR and STATE_BACKEND be the same
> value?
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 5, 2021 at 6:39 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
> >
> > > +1 for allowing streaming operators to use managed memory.
> > >
> > > The memory use of streams requires some hierarchy, and the bottom layer is
> > > undoubtedly the current StateBackend.
> > > Letting the stream operators freely use the managed memory will unify the
> > > memory management model and give the operators free space.
> > >
> > > Xintong's proposal looks good to me. +1 to split `DATAPROC` into
> > > `STATE_BACKEND` and `OPERATOR`.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jan 5, 2021 at 12:33 PM Jark Wu <imj...@gmail.com> wrote:
> > >
> > > > +1 to Xintong's proposal!
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Tue, 5 Jan 2021 at 12:13, Xintong Song <tonysong...@gmail.com>
> > wrote:
> > > >
> > > > > +1 for allowing streaming operators to use managed memory.
> > > > >
> > > > > As for the consumer names, I'm afraid using `DATAPROC` for both
> > > streaming
> > > > > ops and state backends will not work. Currently, RocksDB state
> > backend
> > > > uses
> > > > > a shared piece of memory for all the states within that slot. It's
> > not
> > > > the
> > > > > operator's decision how much memory it uses for the states.
> > > > >
> > > > > I would suggest the following. (IIUC, the same as what Jark
> proposed)
> > > > > * `OPERATOR` for both streaming and batch operators
> > > > > * `STATE_BACKEND` for state backends
> > > > > * `PYTHON` for python processes
> > > > > * `DATAPROC` as a legacy key for state backend or batch operators
> if
> > > > > `STATE_BACKEND` or `OPERATOR` are not specified.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jan 5, 2021 at 11:23 AM Jark Wu <imj...@gmail.com> wrote:
> > > > >
> > > > > > Hi Aljoscha,
> > > > > >
> > > > > > I think we may need to divide `DATAPROC` into `OPERATOR` and
> > > > > > `STATE_BACKEND`, because they have different scope (slot vs.
> > > operator).
> > > > > > But @Xintong Song <tonysong...@gmail.com> may have more insights
> > on
> > > > it.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > >
> > > > > > On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > >> I agree, we should allow streaming operators to use managed
> memory
> > > for
> > > > > >> other use cases.
> > > > > >>
> > > > > >> Do you think we need an additional "consumer" setting or that
> they
> > > > would
> > > > > >> just use `DATAPROC` and decide by themselves what to use the
> > memory
> > > > for?
> > > > > >>
> > > > > >> Best,
> > > > > >> Aljoscha
> > > > > >>
> > > > > >> On 2020/12/22 17:14, Jark Wu wrote:
> > > > > >> >Hi all,
> > > > > >> >
> > > > > >> >I found that currently the managed memory can only be used in 3
> > > > > workloads
> > > > > >> >[1]:
> > > > > >> >- state backends for streaming jobs
> > > > > >> >- sorting, hash tables for batch jobs
> > > > > >> >- python UDFs
> > > > > >> >
> > > > > >> >And the configuration option
> > > > > >> `taskmanager.memory.managed.consumer-weights`
> > > > > >> >only allows values: PYTHON and DATAPROC (state in streaming or
> > > > > algorithms
> > > > > >> >in batch).
> > > > > >> >I'm confused why it doesn't allow streaming operators to use
> > > managed
> > > > > >> memory
> > > > > >> >for purposes other than state backends.
> > > > > >> >
> > > > > >> >The background is that we are planning to use some batch
> > algorithms
> > > > > >> >(sorting & bytes hash table) to improve the performance of
> > > streaming
> > > > > SQL
> > > > > >> >operators, especially for the mini-batch operators.
> > > > > >> >Currently, the mini-batch operators are buffering input records
> > and
> > > > > >> >accumulators in heap (i.e. Java HashMap) which is not efficient
> > and
> > > > > there
> > > > > >> >are potential risks of full GC and OOM.
> > > > > >> >With the managed memory, we can fully use the memory to buffer
> > more
> > > > > data
> > > > > >> >without worrying about OOM and improve the performance a lot.
> > > > > >> >
> > > > > >> >What do you think about allowing streaming operators to use
> > managed
> > > > > >> memory
> > > > > >> >and exposing it in configuration.
> > > > > >> >
> > > > > >> >Best,
> > > > > >> >Jark
> > > > > >> >
> > > > > >> >[1]:
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>
