Hi,

>> Do you mean to ignore it while processing records, but keep using
`maxBuffersPerChannel` when calculating the availability of the output?


Yes, that's correct.


>> Would it be a big issue if we changed it to check if at least "overdraft number of buffers are available", where "overdraft number" is configurable, instead of the currently hardcoded value of "1"?


These things solve different problems (at least as I see it). The current hardcoded "1" says that we switch from "available" to "unavailable" when one more buffer is left (actually a little less than one buffer, since we write the last piece of data to this last buffer). The overdraft feature doesn't change this logic: we still want to switch to "unavailable" in the same way, but if we are already "unavailable" and we want more buffers, then we can take "overdraft number" more. So we cannot avoid this hardcoded "1", since we need it to understand when we should switch to "unavailable".
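
To pin down the idea, here is a rough sketch (field and method names are illustrative, not the actual LocalBufferPool code):

```
// Sketch only: names are made up for illustration.
boolean shouldBeAvailable() {
    // The hardcoded "1": the pool stays "available" while at least
    // one more ordinary buffer can be handed out.
    return numberOfRequestedBuffers + 1 <= currentPoolSize
            && noSubpartitionReachedMaxBuffersPerChannel();
}

MemorySegment requestMemorySegment() {
    if (numberOfRequestedBuffers < currentPoolSize) {
        return requestOrdinaryBuffer();
    }
    // Already "unavailable": while still processing the current record,
    // allow taking up to "overdraft number" extra buffers on top.
    if (numberOfOverdraftBuffers < maxOverdraftBuffers) {
        return requestOverdraftBuffer(); // may fail if the global pool is empty
    }
    return null; // caller has to wait for availability
}
```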


-- About "reserve" vs "overdraft"

As Fanrui mentioned above, perhaps the best overdraft size will be equal to the parallelism. Also, the user can set any value they want. So even if the parallelism is small (~5) but the user's flatmap produces a lot of data, the user can set 10 or even more, which almost doubles the max buffers and makes it impossible to reserve them. At least we need to figure out how to protect against such cases (a limit for the overdraft?). So actually it looks even more difficult than increasing the maximum buffers.

I want to emphasize that overdraft buffers are a soft configuration, which means a subtask takes as many buffers as the global pool has available (maybe zero), but no more than the configured value. It is also important to notice that perhaps not many subtasks in a TaskManager will be using this feature, so we don't actually need a lot of available buffers for every subtask. (Here I mean that if we have only one window/flatmap operator and many other operators, then one TaskManager will have many ordinary subtasks which don't actually need the overdraft and several subtasks that need this feature.) But in the case of reservation, we would reserve some buffers for all operators even if they don't really need them.
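
In other words, something like this (a hypothetical helper, just to pin down the semantics):

```
// "Soft" overdraft: it never blocks, never exceeds the configured limit,
// and only takes what the global NetworkBufferPool can spare right now.
@Nullable
MemorySegment requestOverdraftBuffer() {
    if (numberOfOverdraftBuffers >= maxOverdraftBuffersPerPartition) {
        return null; // configured limit reached
    }
    // Assumed non-blocking request; returns null if the global pool is empty.
    MemorySegment segment = networkBufferPool.requestMemorySegment();
    if (segment != null) {
        numberOfOverdraftBuffers++;
    }
    return segment;
}
```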


-- Legacy source problem

If we still want to change the max buffers, then it is a problem for LegacySources (since every source subtask will always use this overdraft). But right now, I think we can force the overdraft to 0 buffers for legacy subtasks in the configuration during execution (if it is not too late to change the configuration at that point).


On 03.05.2022 14:11, rui fan wrote:
Hi

Thanks for Martijn Visser's and Piotrek's feedback. I agree with ignoring the legacy source; otherwise it will affect our design. Users should use the new Source API as much as possible.

Hi Piotrek, we may still need to discuss whether the overdraft/reserve/spare should use extra buffers or buffers from within (exclusive + floating buffers)? They have some differences.

If it uses extra buffers:
1. The LocalBufferPool will be available when (usedBuffers + 1 <= currentPoolSize) and all subpartitions don't reach the maxBuffersPerChannel.

If it uses the buffers in (exclusive + floating buffers):
1. The LocalBufferPool will be available when (usedBuffers + overdraftBuffers <= currentPoolSize) and all subpartitions don't reach the maxBuffersPerChannel.
2. For low-parallelism jobs, if overdraftBuffers is large (>8), the usable buffers will be few. That is, the LocalBufferPool will easily become unavailable. For throughput, if users turn up the overdraft buffers, they need to turn up the exclusive or floating buffers as well. It also affects the InputChannel, and it is unfriendly to users.

So I prefer the overdraft to use extra buffers.
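
To make the difference concrete, a sketch of the availability predicate in both variants (illustrative names):

```
// Variant A: overdraft uses extra buffers beyond the pool size.
// Availability keeps today's semantics; the overdraft only kicks in
// after the pool has already become unavailable.
boolean availableWithExtraBuffers =
        usedBuffers + 1 <= currentPoolSize
                && !maxBuffersPerChannelReached();

// Variant B: overdraft is carved out of (exclusive + floating) buffers.
// The pool goes unavailable "overdraftBuffers" early, so the effectively
// usable pool shrinks, which hurts low-parallelism jobs.
boolean availableWithinPool =
        usedBuffers + overdraftBuffers <= currentPoolSize
                && !maxBuffersPerChannelReached();
```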


BTW, for the watermark, the number of buffers it needs is numberOfSubpartitions. So if overdraftBuffers = numberOfSubpartitions, the watermark won't block in requestMemory. But it has 2 problems:
1. It needs more overdraft buffers. If the overdraft uses (exclusive + floating buffers), there will be fewer buffers available. Throughput may be affected.
2. The numberOfSubpartitions is different for each Task. So if users want to cover the watermark using this feature, they don't know how to set the overdraftBuffers reasonably. And if the parallelism is changed, users still need to change the overdraftBuffers. It is unfriendly to users.

So I propose we support overdraftBuffers=-1. It means we will automatically set overdraftBuffers=numberOfSubpartitions in the constructor of LocalBufferPool.
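
Roughly like this in the constructor (a sketch of the proposed rule, not a final API):

```
// overdraftBuffers == -1 means "auto": one overdraft buffer per
// subpartition, enough for a broadcast such as a watermark.
this.maxOverdraftBuffers =
        (configuredOverdraftBuffers == -1)
                ? numberOfSubpartitions
                : configuredOverdraftBuffers;
```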

Please correct me if I'm wrong.

Thanks
fanrui

On Tue, May 3, 2022 at 4:54 PM Piotr Nowojski <pnowoj...@apache.org> wrote:

Hi fanrui,

> Do you mean don't add the extra buffers? We just use (exclusive buffers *
> parallelism + floating buffers)? The LocalBufferPool will be available when
> (usedBuffers + overdraftBuffers <= exclusiveBuffers * parallelism + floatingBuffers)
> and all subpartitions don't reach the maxBuffersPerChannel, right?
I'm not sure. Definitely we would need to adjust the minimum number of required buffers, just as we did when we were implementing the non-blocking outputs and adding the availability logic to LocalBufferPool. Back then we added "+ 1" to the minimum number of buffers. Currently this logic is located in NettyShuffleUtils#getMinMaxNetworkBuffersPerResultPartition:

int min = isSortShuffle ? sortShuffleMinBuffers : numSubpartitions + 1;
For performance reasons, we always require at least one buffer per sub-partition; otherwise performance falls drastically. Now if we require 5 overdraft buffers for the output to be available, we need to have them on top of that "one buffer per sub-partition". So the logic should be changed to:

int min = isSortShuffle ? sortShuffleMinBuffers : numSubpartitions + numOverdraftBuffers;

Regarding increasing the number of max buffers, I'm not sure; it should be fine as long as "overdraft << max number of buffers", because all buffers on the outputs are shared across all sub-partitions. If we have 5 overdraft buffers and a parallelism of 100, it doesn't matter in the grand scheme of things whether we make the output available when at least one single buffer is available or when at least 5 buffers are available out of ~200 (100 * 2 + 8). So the effects of increasing the overdraft from 1 to, for example, 5 should be negligible. For small parallelism, like 5, increasing the overdraft from 1 to 5 still increases the total number of buffers by only about 25%. So maybe we can keep the max as it is?

If so, maybe we should change the name from "overdraft" to "buffer reserve" or "spare buffers"? And document it as "number of buffers kept in reserve in case of flatMap/firing timers/huge records"?

What do you think, Fanrui, Anton?

Re LegacySources: I agree we can kind of ignore them in the new features, as long as we don't break the existing deployments too much.

Best,
Piotrek

On Tue, 3 May 2022 at 09:20, Martijn Visser <mart...@ververica.com> wrote:

Hi everyone,

Just wanted to chip in on the discussion of legacy sources: IMHO, we should not focus too much on improving/adding capabilities for legacy sources. We want to persuade and push users to use the new Source API. Yes, this means that there's work required by the end users to port any custom source to the new interface. The benefits of the new Source API should outweigh this. Anything that we build to support multiple interfaces means adding more complexity and more possibilities for bugs. Let's try to make our lives a little bit easier.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser


On Tue, 3 May 2022 at 07:50, rui fan <1996fan...@gmail.com> wrote:

Hi Piotrek

> Do you mean to ignore it while processing records, but keep using
> `maxBuffersPerChannel` when calculating the availability of the output?
I think yes, and I'd ask Anton Kalashnikov to help double-check.

> +1 for just having this as a separate configuration. Is it a big problem
> that legacy sources would be ignoring it? Note that we already have
> effectively hardcoded a single overdraft buffer.
> `LocalBufferPool#checkAvailability` checks if there is a single buffer
> available and this works the same for all tasks (including legacy source
> tasks). Would it be a big issue if we changed it to check if at least
> "overdraft number of buffers are available", where "overdraft number" is
> configurable, instead of the currently hardcoded value of "1"?
Do you mean don't add the extra buffers? We just use (exclusive buffers * parallelism + floating buffers)? The LocalBufferPool will be available when (usedBuffers + overdraftBuffers <= exclusiveBuffers * parallelism + floatingBuffers) and all subpartitions don't reach the maxBuffersPerChannel, right?

If yes, I think it can solve the problem of the legacy source. There may be some impact: if overdraftBuffers is large and only one buffer is used to process a single record, the full (exclusive buffers * parallelism + floating buffers) cannot be used. It may only be possible to use (exclusive buffers * parallelism + floating buffers - overdraft buffers + 1). For throughput, if the user turns up the overdraft buffers, they need to turn up the exclusive or floating buffers as well. And it also affects the InputChannel.

If not, I don't think it can solve the problem of the legacy source. The legacy source doesn't check isAvailable; if there are extra buffers, the legacy source will use them up until it blocks in requestMemory.
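
Schematically, the legacy source's run loop never consults availability (a pseudo-structure for illustration, not an actual Flink class):

```
// The loop emits as fast as it can and never awaits
// recordWriter.getAvailableFuture(), so any extra buffers would be
// drained until requestMemory finally blocks.
while (running) {
    T record = fetchNextRecord(); // hypothetical fetch
    synchronized (checkpointLock) {
        sourceContext.collect(record); // writes output buffers unconditionally
    }
}
```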


Thanks
fanrui

On Tue, May 3, 2022 at 3:39 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

Hi,

+1 for the general proposal from my side. It would be a nice workaround for the flatMap, WindowOperator and large-record issues with unaligned checkpoints.

> The first task is about ignoring max buffers per channel. This means
> if we request a memory segment from LocalBufferPool and the
> maxBuffersPerChannel is reached for this channel, we just ignore that
> and continue to allocate buffers while LocalBufferPool has them (it is
> actually not an overdraft).
actually not a overdraft).
Do you mean to ignore it while processing records, but keep using `maxBuffersPerChannel` when calculating the availability of the output?
> The second task is about the real overdraft. I am pretty convinced now
> that we, unfortunately, need a configuration for limiting the overdraft
> number (because it is not ok if one subtask allocates all buffers of one
> TaskManager, considering that several different jobs can be submitted on
> this TaskManager). So the idea is to have
> maxOverdraftBuffersPerPartition (technically speaking, per LocalBufferPool).
> In this case, when the limit of buffers in LocalBufferPool is reached,
> LocalBufferPool can additionally request up to
> maxOverdraftBuffersPerPartition buffers from NetworkBufferPool.
+1 for just having this as a separate configuration. Is it a big problem that legacy sources would be ignoring it? Note that we already have effectively hardcoded a single overdraft buffer: `LocalBufferPool#checkAvailability` checks if there is a single buffer available, and this works the same for all tasks (including legacy source tasks). Would it be a big issue if we changed it to check if at least "overdraft number of buffers are available", where "overdraft number" is configurable, instead of the currently hardcoded value of "1"?

Best,
Piotrek

On Fri, 29 Apr 2022 at 17:04, rui fan <1996fan...@gmail.com> wrote:

Let me add some information about the LegacySource.

If we want to disable the overdraft buffer for the LegacySource, could we add an enableOverdraft flag in LocalBufferPool? The default value is false. If getAvailableFuture is called, we change enableOverdraft to true. It indicates whether availability is checked somewhere else.
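
A minimal sketch of that flag (illustrative names; assumes the pool tracks availability via an internal helper):

```
// Inside LocalBufferPool (sketch): the overdraft stays off until some
// caller actually polls availability, which legacy sources never do.
private boolean enableOverdraft = false;

@Override
public CompletableFuture<?> getAvailableFuture() {
    // Whoever asks for availability is assumed to respect it,
    // so enabling the overdraft for this pool is safe.
    enableOverdraft = true;
    return availabilityHelper.getAvailableFuture();
}
```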

I don't think it is elegant, but it's safe. Please correct me if I'm wrong.
Thanks
fanrui

On Fri, Apr 29, 2022 at 10:23 PM rui fan <1996fan...@gmail.com>
wrote:
Hi,

Thanks for your quick response.

For questions 1/2/3, we think they are clear. We just need to discuss the default value in the PR.

For the legacy source, you are right, it's difficult to do a general implementation. Currently, we implement ensureRecordWriterIsAvailable() in SourceFunction.SourceContext and call it in our common LegacySources, e.g. FlinkKafkaConsumer. Over 90% of our Flink jobs consume Kafka, so fixing FlinkKafkaConsumer solved most of our problems.

Core code:
```
public void ensureRecordWriterIsAvailable() {
    if (recordWriter == null
            || !configuration.getBoolean(
                    ExecutionCheckpointingOptions.ENABLE_UNALIGNED, false)
            || recordWriter.isAvailable()) {
        return;
    }

    CompletableFuture<?> resumeFuture = recordWriter.getAvailableFuture();
    try {
        resumeFuture.get();
    } catch (Throwable ignored) {
    }
}
```

The LegacySource calls sourceContext.ensureRecordWriterIsAvailable() before entering synchronized (checkpointLock) and collecting records. Please let me know if there is a better solution.

Thanks
fanrui

On Fri, Apr 29, 2022 at 9:45 PM Anton Kalashnikov <
kaa....@yandex.com>
wrote:

Hi.

-- 1. Do you mean split this into two JIRAs or two PRs or two commits in a PR?

Perhaps a separate ticket will be better, since this task has fewer questions, but we should find a solution for the LegacySource first.

--  2. For the first task, if the flink user disables the Unaligned Checkpoint, do we ignore max buffers per channel? Because the overdraft isn't useful for the Aligned Checkpoint, it still needs to wait for the downstream Task to consume.

I think that the logic should be the same for AC and UC. As I understand it, the overdraft is maybe not really helpful for AC, but it doesn't make it worse either.

   3. For the second task
--      - The default value of maxOverdraftBuffersPerPartition may also need to be discussed.

I think it should be a pretty small value or even 0, since it is a kind of optimization and users should understand what they are doing (especially if we implement the first task).

--      - If the user disables the Unaligned Checkpoint, can we set the maxOverdraftBuffersPerPartition=0? Because the overdraft isn't useful for the Aligned Checkpoint.

The same answer as above: if the overdraft doesn't cause degradation for the Aligned Checkpoint, I don't think we should make a difference between AC and UC.

     4. For the legacy source
--      - If enabling the Unaligned Checkpoint, it uses up to maxOverdraftBuffersPerPartition buffers.
        - If disabling the UC, it doesn't use the overdraft buffer.
        - Do you think it's ok?

Ideally, I don't want to use the overdraft for the LegacySource at all, since it can lead to undesirable results, especially if the limit is high. At least, as I understand it, it will always work in overdraft mode and it will borrow maxOverdraftBuffersPerPartition buffers from the global pool, which can lead to degradation of other subtasks on the same TaskManager.

--      - Actually, we added the checkAvailable logic for LegacySource in our internal version. It works well.

I don't really understand how that is possible for the general case, considering that each user has their own implementation of the LegacySourceOperator.
--   5. For the benchmark, do you have any suggestions? I submitted the PR [1].

I haven't looked at it yet, but I'll try to do it soon.


On 29.04.2022 14:14, rui fan wrote:
Hi,

Thanks for your feedback. I have several questions.

   1. Do you mean split this into two JIRAs or two PRs or two commits in a PR?
   2. For the first task, if the flink user disables the Unaligned Checkpoint, do we ignore max buffers per channel? Because the overdraft isn't useful for the Aligned Checkpoint, it still needs to wait for the downstream Task to consume.
   3. For the second task
      - The default value of maxOverdraftBuffersPerPartition may also need to be discussed.
      - If the user disables the Unaligned Checkpoint, can we set the maxOverdraftBuffersPerPartition=0? Because the overdraft isn't useful for the Aligned Checkpoint.
   4. For the legacy source
      - If enabling the Unaligned Checkpoint, it uses up to maxOverdraftBuffersPerPartition buffers.
      - If disabling the UC, it doesn't use the overdraft buffer.
      - Do you think it's ok?
      - Actually, we added the checkAvailable logic for LegacySource in our internal version. It works well.
   5. For the benchmark, do you have any suggestions? I submitted the PR [1].

[1] https://github.com/apache/flink-benchmarks/pull/54

Thanks
fanrui

On Fri, Apr 29, 2022 at 7:41 PM Anton Kalashnikov <kaa....@yandex.com> wrote:

Hi,

We discussed it a little with Dawid Wysakowicz. Here are some conclusions:

First of all, let's split this into two tasks.

The first task is about ignoring max buffers per channel. This means that if we request a memory segment from LocalBufferPool and the maxBuffersPerChannel is reached for this channel, we just ignore that and continue to allocate buffers while LocalBufferPool has them (it is actually not an overdraft).

The second task is about the real overdraft. I am pretty convinced now that we, unfortunately, need a configuration for limiting the overdraft number (because it is not ok if one subtask allocates all buffers of one TaskManager, considering that several different jobs can be submitted on this TaskManager). So the idea is to have maxOverdraftBuffersPerPartition (technically speaking, per LocalBufferPool). In this case, when the limit of buffers in LocalBufferPool is reached, LocalBufferPool can additionally request up to maxOverdraftBuffersPerPartition buffers from NetworkBufferPool.


But it is still not clear how to handle the LegacySource, since it actually works as an unlimited flatmap and it will always work in overdraft mode, which is not the target. So we still need to think about that.


On 29.04.2022 11:11, rui fan wrote:
Hi Anton Kalashnikov,

I think you agree that we should limit the maximum number of overdraft segments that each LocalBufferPool can apply for, right?

I prefer to hard-code the maxOverdraftBuffers so that we don't add a new configuration. And I hope to hear more from the community.

Best wishes
fanrui

On Thu, Apr 28, 2022 at 12:39 PM rui fan <1996fan...@gmail.com> wrote:

Hi Anton Kalashnikov,

Thanks for your very clear reply, I think you are totally right. The 'maxBuffersNumber - buffersInUseNumber' can be used as the overdraft buffer, so it won't need a new buffer configuration. Flink users can turn up the maxBuffersNumber to control the overdraft buffer size.
Also, I'd like to add some information. For safety, we should limit the maximum number of overdraft segments that each LocalBufferPool can apply for.

Why do we limit it?
Some operators don't check `recordWriter.isAvailable` while processing records, such as the LegacySource. I have mentioned it in FLINK-26759 [1]. I'm not sure if there are other cases.

If we don't add the limitation, the LegacySource will use up all the remaining memory in the NetworkBufferPool when the backpressure is severe.
How to limit it?
I prefer to hard-code `maxOverdraftBuffers=numberOfSubpartitions` in the constructor of LocalBufferPool. The maxOverdraftBuffers is just for safety, and it should be enough for most flink jobs. Or we can set `maxOverdraftBuffers=Math.max(numberOfSubpartitions, 10)` to handle some jobs of low parallelism.

Also, if the user doesn't enable the Unaligned Checkpoint, we can set maxOverdraftBuffers=0 in the constructor of LocalBufferPool, because the overdraft isn't useful for the Aligned Checkpoint.

Please correct me if I'm wrong. Thanks a lot.

[1] https://issues.apache.org/jira/browse/FLINK-26759

Best wishes
fanrui

On Thu, Apr 28, 2022 at 12:29 AM Anton Kalashnikov <kaa....@yandex.com> wrote:

Hi fanrui,

Thanks for creating the FLIP.

In general, I think the overdraft is a good idea and it should help in the cases described above. Here are my thoughts about the configuration:

Please correct me if I am wrong, but as I understand it, right now we have the following calculation.

maxBuffersNumber (per TaskManager) = Network memory (calculated via taskmanager.memory.network.fraction, taskmanager.memory.network.min, taskmanager.memory.network.max and the total memory size) / taskmanager.memory.segment-size.

requiredBuffersNumber (per TaskManager) = (exclusive buffers * parallelism + floating buffers) * number of subtasks in the TaskManager

buffersInUseNumber = the real number of buffers which are used at the current moment (always <= requiredBuffersNumber)
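
For intuition, a back-of-the-envelope example with made-up numbers (the defaults of 2 exclusive and 8 floating buffers, 32 KiB segments):

```
// Made-up numbers, arithmetic only: where the "leftover" buffers come from.
long networkMemoryBytes = 128L << 20;                       // 128 MiB network memory
int  segmentSize        = 32 << 10;                         // 32 KiB segment size
long maxBuffersNumber   = networkMemoryBytes / segmentSize; // = 4096 buffers

int exclusive = 2, floating = 8, parallelism = 100, subtasksInTM = 4;
long requiredBuffersNumber =
        (long) (exclusive * parallelism + floating) * subtasksInTM; // = 832

long leftover = maxBuffersNumber - requiredBuffersNumber;   // = 3264 spare buffers
```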

Ideally, requiredBuffersNumber should be equal to maxBuffersNumber, which allows Flink to work predictably. But if requiredBuffersNumber > maxBuffersNumber, it is sometimes also fine (but not good), since not all required buffers are really mandatory (e.g. it is ok if Flink cannot allocate the floating buffers).

But if maxBuffersNumber > requiredBuffersNumber, as I understand, Flink just never uses these leftover buffers (maxBuffersNumber - requiredBuffersNumber), which I propose to use. (We can actually use even the difference 'requiredBuffersNumber - buffersInUseNumber', since one TaskManager can contain several operators, including 'window', which can temporarily borrow buffers from the global pool.)

My proposal, more specifically (it relates only to requesting buffers while processing a single record; switching to unavailability between records should stay the same as it is now):

* If one more buffer is requested but maxBuffersPerChannel is reached, then just ignore this limitation and allocate this buffer from any place (from LocalBufferPool if it still has something, otherwise from NetworkBufferPool)

* If LocalBufferPool exceeds its limit, then temporarily allocate it from NetworkBufferPool while it has something to allocate


Maybe I missed something and this solution won't work, but I like it since, on the one hand, it works from scratch without any configuration, and on the other hand, it can be configured by changing the proportion of maxBuffersNumber and requiredBuffersNumber.

The last thing I want to say is that I don't really want to implement a new configuration, since even now it is not clear how to correctly configure the network buffers with the existing configuration, and I don't want to complicate it, especially if it is possible to resolve the problem automatically (as described above).


So is my understanding of network memory/buffers correct?
--

Best regards,
Anton Kalashnikov

On 27.04.2022 07:46, rui fan wrote:
Hi everyone,

Unaligned Checkpoint (FLIP-76 [1]) is a major feature of Flink. It effectively solves the problem of checkpoint timeouts or slow checkpoints when backpressure is severe.

We found that UC (Unaligned Checkpoint) does not work well when the backpressure is severe and multiple output buffers are required to process a single record. FLINK-14396 [2] also mentioned this issue before. So we propose the overdraft buffer to solve it.

I created FLINK-26762 [3] and FLIP-227 [4] to detail the overdraft buffer mechanism. After discussing with Anton Kalashnikov, there are still some points to discuss:

     * There are already a lot of buffer-related configurations. Do we need to add a new configuration for the overdraft buffer?
     * Where should the overdraft buffer take its memory from?
     * If the overdraft buffer uses the memory remaining in the NetworkBufferPool, no new configuration needs to be added.
     * If adding a new configuration:
         o Should we set the overdraft-memory-size at the TM level or the Task level?
         o Or set overdraft-buffers to indicate the number of memory segments that can be overdrawn.
         o What is the default value? How to set sensible defaults?
Currently, I implemented a POC [5] and verified it using flink-benchmarks [6]. The POC sets overdraft-buffers at the Task level, and the default value is 10. That is: each LocalBufferPool can overdraw up to 10 memory segments.

Looking forward to your feedback!

Thanks,
fanrui

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2] https://issues.apache.org/jira/browse/FLINK-14396
[3] https://issues.apache.org/jira/browse/FLINK-26762
[4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
[5] https://github.com/1996fanrui/flink/commit/c7559d94767de97c24ea8c540878832138c8e8fe
[6] https://github.com/apache/flink-benchmarks/pull/54
--

Best regards,
Anton Kalashnikov


--

Best regards,
Anton Kalashnikov


--

Best regards,
Anton Kalashnikov
