Hi, I'm not sure. Maybe 5 will be fine? Anton, Dawid, what do you think?
Can you create a parent ticket for the whole FLIP to group all of the issues together? Also, the FLIP should be officially voted on first. Best, Piotrek On Fri, 6 May 2022 at 09:08, rui fan <1996fan...@gmail.com> wrote: > Hi Anton, Piotrek and Dawid, > > Thanks for your help. > > I created FLINK-27522[1] as the first task. And I will finish it asap. > > @Piotrek, for the default value, do you think it should be less > than 5? What do you think about 3? Actually, I think 5 isn't big. > Whether it's 1, 3 or 5 doesn't matter much; the focus is on > reasonably resolving the deadlock problems. Or I can push the second > task forward first and we discuss the default value in the PR. > > For the legacySource, I got your idea. And I propose we create > the third task to handle it, because it is independent and exists for > compatibility with the old API. What do you think? I updated > the third task on FLIP-227[2]. > > If all is ok, I will create a JIRA for the third task and add it to > FLIP-227. And I will develop them in order, from the first task to the > third task. > > Thanks again for your help. > > [1] https://issues.apache.org/jira/browse/FLINK-27522 > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer > > Thanks > fanrui > > On Fri, May 6, 2022 at 3:50 AM Piotr Nowojski <pnowoj...@apache.org> > wrote: > > > Hi fanrui, > > > > > How to identify legacySource? > > > > Legacy sources always use the SourceStreamTask class, and > > SourceStreamTask is used only for legacy sources. But I'm not sure how to > > enable/disable that. Adding a `disableOverdraft()` call in SourceStreamTask > > would be better compared to relying on the `getAvailableFuture()` call > > (isn't it used for the back pressure metric anyway?). Ideally we should > > enable/disable it in the constructors, but that might be tricky. 
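The constructor idea above could look roughly like the following minimal model. This is a sketch only, assuming simplified stand-ins: `PoolModel`, `SourceStreamTaskModel` and the proposed `disableOverdraft()` are hypothetical names, not existing Flink classes or methods.

```java
// Minimal model of the idea discussed above: legacy sources always run
// through SourceStreamTask, so its constructor could switch the overdraft
// off for its output buffer pool. All names here are hypothetical stand-ins
// for Flink's SourceStreamTask / LocalBufferPool, not the real classes.
public class LegacySourceOverdraftSketch {

    /** Stand-in for LocalBufferPool with the proposed overdraft setting. */
    static final class PoolModel {
        private int maxOverdraftBuffers;

        PoolModel(int maxOverdraftBuffers) {
            this.maxOverdraftBuffers = maxOverdraftBuffers;
        }

        /** The proposed disableOverdraft() call. */
        void disableOverdraft() {
            this.maxOverdraftBuffers = 0;
        }

        int getMaxOverdraftBuffers() {
            return maxOverdraftBuffers;
        }
    }

    /** Stand-in for SourceStreamTask: used only by legacy sources. */
    static final class SourceStreamTaskModel {
        SourceStreamTaskModel(PoolModel outputPool) {
            // Disable in the constructor, rather than inferring "legacy-ness"
            // from whether getAvailableFuture() is ever called.
            outputPool.disableOverdraft();
        }
    }

    public static void main(String[] args) {
        PoolModel pool = new PoolModel(5);
        new SourceStreamTaskModel(pool);
        System.out.println("overdraft for legacy source task: " + pool.getMaxOverdraftBuffers());
    }
}
```

The point of the sketch is only that the decision happens once, at task construction time, instead of being inferred later from runtime calls.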
> > > > > I prefer it to be between 5 and 10 > > > > I would vote for a smaller value because of FLINK-13203 > > > > Piotrek > > > > > > > > On Thu, 5 May 2022 at 11:49, rui fan <1996fan...@gmail.com> wrote: > > > >> Hi, > >> > >> Thanks a lot for your discussion. > >> > >> After several discussions, I think it's clear now. I updated the > >> "Proposed Changes" of FLIP-227[1]. If I have missed something, > >> please help to add it to the FLIP, or add it in the mail > >> and I can add it to the FLIP. If everything is OK, I will create a > >> new JIRA for the first task, and use FLINK-26762[2] as the > >> second task. > >> > >> About the legacy source, do we set maxOverdraftBuffersPerGate=0 > >> directly? How do we identify a legacySource? Or could we add > >> an overdraftEnabled flag in LocalBufferPool? The default value > >> is false. If the getAvailableFuture is called, change > >> overdraftEnabled=true. > >> It indicates whether isAvailable is checked elsewhere. > >> It might be more general, and it can cover more cases. > >> > >> Also, I think the default value of 'max-overdraft-buffers-per-gate' > >> needs to be confirmed. I prefer it to be between 5 and 10. What > >> do you think? > >> > >> [1] > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer > >> [2] https://issues.apache.org/jira/browse/FLINK-26762 > >> > >> Thanks > >> fanrui > >> > >> On Thu, May 5, 2022 at 4:41 PM Piotr Nowojski <pnowoj...@apache.org> > >> wrote: > >> > >>> Hi again, > >>> > >>> After sleeping on this, if both versions (reserve and overdraft) have > >>> the same complexity, I would also prefer the overdraft. > >>> > >>> > `Integer.MAX_VALUE` as the default value was my idea as well but now, as > >>> > Dawid mentioned, I think it is dangerous since it is too implicit for > >>> > the user, and if the user submits one more job to the same TaskManager > >>> > >>> As I mentioned, it's not only an issue with multiple jobs. 
The same > >>> problem can happen with different subtasks from the same job, > potentially > >>> leading to the FLINK-13203 deadlock [1]. With FLINK-13203 fixed, I > would be > >>> in favour of Integer.MAX_VALUE as the default value, but as it is, I > >>> think we should indeed play on the safe side and limit it. > >>> > >>> > I still don't understand how the "reserve" implementation should be > >>> limited. > >>> > I mean if we have X buffers in total and the user sets the overdraft > equal > >>> > to X, we obviously can not reserve all buffers, but how many are we > >>> > allowed to reserve? Should it be a different configuration like > >>> > percentageForReservedBuffers? > >>> > >>> The reserve could be defined as a percentage, or as a fixed number of > >>> buffers. But yes, in normal operation a subtask would not use the > reserve: > >>> if numberOfAvailableBuffers < reserve, the output would not be > available. > >>> Only in the flatMap/timers/huge records case could the reserve be used. > >>> > >>> > 1. If the total buffers of LocalBufferPool <= the reserve buffers, > >>> will LocalBufferPool never be available? Can't process data? > >>> > >>> Of course we would need to make sure that never happens. So the reserve > >>> should be < total buffer size. > >>> > >>> > 2. If the overdraft buffer uses the extra buffers, when the downstream > >>> > task inputBuffer is insufficient, it should fail to start the job, > and > >>> then > >>> > restart? When the InputBuffer is initialized, it will apply for > enough > >>> > buffers, right? > >>> > >>> The failover when the downstream can not allocate buffers is already > >>> implemented in FLINK-14872 [2]. There is a timeout for how long the task > is > >>> waiting for buffer allocation. However this doesn't prevent many > >>> (potentially infinitely many) deadlock/restart cycles. IMO the proper > >>> solution for [1] would be 2b described in the ticket: > >>> > >>> > 2b. Assign extra buffers only once all of the tasks are RUNNING. 
This > >>> is a simplified version of 2a, without tracking the tasks > sink-to-source. > >>> > >>> But that's a pre-existing problem and I don't think we have to solve it > >>> before implementing the overdraft. I think we would need to solve it only > >>> before setting Integer.MAX_VALUE as the default for the overdraft. > Maybe I > >>> would hesitate to set the overdraft to anything more than a couple of > >>> buffers by default for the same reason. > >>> > >>> > Actually, I totally agree that we don't need a lot of buffers for > >>> overdraft > >>> > >>> and > >>> > >>> > Also I agree we ignore the overdraftBuffers=numberOfSubpartitions. > >>> > When we finish this feature and after users use it, if users report > >>> > this issue we can discuss it again. > >>> > >>> +1 > >>> > >>> Piotrek > >>> > >>> [1] https://issues.apache.org/jira/browse/FLINK-13203 > >>> [2] https://issues.apache.org/jira/browse/FLINK-14872 > >>> > >>> On Thu, 5 May 2022 at 05:52, rui fan <1996fan...@gmail.com> wrote: > >>> > >>>> Hi everyone, > >>>> > >>>> I still have some questions. > >>>> > >>>> 1. If the total buffers of LocalBufferPool <= the reserve buffers, > will > >>>> LocalBufferPool never be available? Can't process data? > >>>> 2. If the overdraft buffer uses the extra buffers, when the downstream > >>>> task inputBuffer is insufficient, it should fail to start the job, and > >>>> then > >>>> restart? When the InputBuffer is initialized, it will apply for enough > >>>> buffers, right? > >>>> > >>>> Also I agree we ignore the overdraftBuffers=numberOfSubpartitions. > >>>> When we finish this feature and after users use it, if users report > >>>> this issue we can discuss it again. > >>>> > >>>> Thanks > >>>> fanrui > >>>> > >>>> On Wed, May 4, 2022 at 5:29 PM Dawid Wysakowicz < > dwysakow...@apache.org> > >>>> wrote: > >>>> > >>>>> Hey all, > >>>>> > >>>>> I have not replied in the thread yet, but I was following the > >>>>> discussion. 
> >>>>> > >>>>> Personally, I like Fanrui's and Anton's idea. As far as I understand > >>>>> it > >>>>> the idea to distinguish between inside flatMap & outside would be > >>>>> fairly > >>>>> simple, but maybe slightly indirect. The checkAvailability would > >>>>> remain > >>>>> unchanged and it is checked always between separate invocations of > the > >>>>> UDF. Therefore the overdraft buffers would not apply there. However > >>>>> once > >>>>> the pool says it is available, it means it has at least an initial > >>>>> buffer. So any additional request without checking for availability > >>>>> can > >>>>> be considered to be inside of processing a single record. This does > >>>>> not > >>>>> hold just for the LegacySource as I don't think it actually checks > for > >>>>> the availability of buffers in the LocalBufferPool. > >>>>> > >>>>> In the offline chat with Anton, we also discussed if we need a limit > >>>>> of > >>>>> the number of buffers we could overdraft (or in other words if the > >>>>> limit > >>>>> should be equal to Integer.MAX_VALUE), but personally I'd prefer to > >>>>> stay > >>>>> on the safe side and have it limited. The pool of network buffers is > >>>>> shared for the entire TaskManager, so it means it can be shared even > >>>>> across tasks of separate jobs. However, I might be just unnecessarily > >>>>> cautious here. > >>>>> > >>>>> Best, > >>>>> > >>>>> Dawid > >>>>> > >>>>> On 04/05/2022 10:54, Piotr Nowojski wrote: > >>>>> > Hi, > >>>>> > > >>>>> > Thanks for the answers. > >>>>> > > >>>>> >> we may still need to discuss whether the > >>>>> >> overdraft/reserve/spare should use extra buffers or buffers > >>>>> >> in (exclusive + floating buffers)? > >>>>> > and > >>>>> > > >>>>> >> These things resolve the different problems (at least as I see > >>>>> that). 
> >>>>> >> The current hardcoded "1" says that we switch "availability" to > >>>>> >> "unavailability" when one more buffer is left (actually a little less > >>>>> >> than one buffer, since we write the last piece of data to this last > >>>>> >> buffer). The overdraft feature doesn't change this logic; we still > >>>>> want > >>>>> >> to switch to "unavailability" in such a way, but if we are already in > >>>>> >> "unavailability" and we want more buffers then we can take > >>>>> "overdraft > >>>>> >> number" more. So we can not avoid this hardcoded "1" since we need > >>>>> to > >>>>> >> understand when we should switch to "unavailability" > >>>>> > Ok, I see. So it seems to me that both of you have in mind keeping > >>>>> the > >>>>> > buffer pools as they are right now, but if we are in the middle of > >>>>> > processing a record, we can request extra overdraft buffers on top of > >>>>> > those? This is a different way to implement the overdraft from what I was > >>>>> > thinking. I was thinking about something like keeping the > >>>>> "overdraft", or > >>>>> > more precisely a buffer "reserve", in the buffer pool. I think my > >>>>> version > >>>>> > would be easier to implement, because it is just fiddling with the > >>>>> min/max > >>>>> > buffer calculation and slightly modified `checkAvailability()` > >>>>> logic. > >>>>> > > >>>>> > On the other hand, what you have in mind would better utilise the > >>>>> available > >>>>> > memory, right? It would require more code changes (how would we know > when > >>>>> > we are allowed to request the overdraft?). However, in this case, I > >>>>> would > >>>>> > be tempted to set the number of overdraft buffers by default to > >>>>> > `Integer.MAX_VALUE`, and let the system request as many buffers as > >>>>> > necessary. 
The only downside that I can think of (apart from higher > >>>>> > complexity) would be a higher chance of hitting a known/unsolved > >>>>> deadlock [1] > >>>>> > in a scenario: > >>>>> > - downstream task hasn't yet started > >>>>> > - upstream task requests overdraft and uses all available memory > >>>>> segments > >>>>> > from the global pool > >>>>> > - upstream task is blocked, because downstream task hasn't started > >>>>> yet and > >>>>> > can not consume any data > >>>>> > - downstream task tries to start, but can not, as there are no > >>>>> available > >>>>> > buffers > >>>>> > > >>>>> >> BTW, for watermark, the number of buffers it needs is > >>>>> >> numberOfSubpartitions. So if > overdraftBuffers=numberOfSubpartitions, > >>>>> >> the watermark won't block in requestMemory. > >>>>> > and > >>>>> > > >>>>> >> the best overdraft size will be equal to parallelism. > >>>>> > That's a lot of buffers. I don't think we need that many for > >>>>> broadcasting > >>>>> > watermarks. Watermarks are small, and remember that every > >>>>> subpartition has > >>>>> > some partially filled/empty WIP buffer, so the vast majority of > >>>>> > subpartitions will not need to request a new buffer. > >>>>> > > >>>>> > Best, > >>>>> > Piotrek > >>>>> > > >>>>> > [1] https://issues.apache.org/jira/browse/FLINK-13203 > >>>>> > > >>>>> > On Tue, 3 May 2022 at 17:15, Anton Kalashnikov <kaa....@yandex.com> > >>>>> wrote: > >>>>> > > >>>>> >> Hi, > >>>>> >> > >>>>> >> > >>>>> >> >> Do you mean to ignore it while processing records, but keep > >>>>> using > >>>>> >> `maxBuffersPerChannel` when calculating the availability of the > >>>>> output? > >>>>> >> > >>>>> >> > >>>>> >> Yes, it is correct. > >>>>> >> > >>>>> >> > >>>>> >> >> Would it be a big issue if we changed it to check if at least > >>>>> >> "overdraft number of buffers are available", where "overdraft > >>>>> number" is > >>>>> >> configurable, instead of the currently hardcoded value of "1"? 
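The availability semantics being debated here can be sketched with a small model. This is a toy model of the behaviour discussed in the thread, not Flink's actual `LocalBufferPool`: the between-records check keeps the existing hardcoded "1", while requests made in the middle of processing a record may draw up to a configurable number of overdraft buffers.

```java
// Toy model of the overdraft semantics discussed in this thread; it is NOT
// Flink's LocalBufferPool. Between records the pool is "available" only if a
// regular buffer is free (the existing hardcoded "1"); while processing a
// single record, up to maxOverdraftBuffers extra segments may be drawn.
public class OverdraftPoolModel {
    private final int currentPoolSize;     // regular buffers owned by the pool
    private final int maxOverdraftBuffers; // configurable overdraft limit
    private int usedBuffers;

    public OverdraftPoolModel(int currentPoolSize, int maxOverdraftBuffers) {
        this.currentPoolSize = currentPoolSize;
        this.maxOverdraftBuffers = maxOverdraftBuffers;
    }

    /** Check used between records: unchanged by the overdraft feature. */
    public boolean isAvailable() {
        return usedBuffers + 1 <= currentPoolSize;
    }

    /** Request made while already processing a record: may overdraft. */
    public boolean tryRequestBufferDuringRecord() {
        if (usedBuffers < currentPoolSize + maxOverdraftBuffers) {
            usedBuffers++;
            return true;
        }
        return false; // the real task would block here instead
    }

    public static void main(String[] args) {
        OverdraftPoolModel pool = new OverdraftPoolModel(2, 3);
        System.out.println("available between records: " + pool.isAvailable());
    }
}
```

In this model a subtask that is not mid-record never touches the overdraft, which is the property both sides of the discussion want to preserve.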
> >>>>> >> > >>>>> >> > >>>>> >> These things resolve the different problems (at least as I see > >>>>> that). > >>>>> >> The current hardcoded "1" says that we switch "availability" to > >>>>> >> "unavailability" when one more buffer is left (actually a little less > >>>>> >> than one buffer, since we write the last piece of data to this last > >>>>> >> buffer). The overdraft feature doesn't change this logic; we still > >>>>> want > >>>>> >> to switch to "unavailability" in such a way, but if we are already > in > >>>>> >> "unavailability" and we want more buffers then we can take > >>>>> "overdraft > >>>>> >> number" more. So we can not avoid this hardcoded "1" since we need > >>>>> to > >>>>> >> understand when we should switch to "unavailability" > >>>>> >> > >>>>> >> > >>>>> >> -- About "reserve" vs "overdraft" > >>>>> >> > >>>>> >> As Fanrui mentioned above, perhaps, the best overdraft size will > be > >>>>> >> equal to parallelism. Also, the user can set any value he wants. > So > >>>>> even > >>>>> >> if parallelism is small (~5) but the user's flatmap produces a lot > of > >>>>> >> data, the user can set 10 or even more. That would almost double the > max > >>>>> >> buffers, and it will be impossible to reserve. At least we need to > >>>>> figure > >>>>> >> out how to protect from such cases (the limit for an overdraft?). > So > >>>>> >> actually it looks even more difficult than increasing the maximum > >>>>> buffers. > >>>>> >> > >>>>> >> I want to emphasize that overdraft buffers are a soft configuration, > >>>>> which > >>>>> >> means a subtask takes as many buffers as the global buffer pool has > >>>>> >> available (maybe zero) but less than this configured value. 
It is > >>>>> also > >>>>> >> important to notice that perhaps not many subtasks in a TaskManager > >>>>> will > >>>>> >> be using this feature, so we don't actually need a lot of available > >>>>> >> buffers for every subtask (here, I mean that if we have only one > >>>>> >> window/flatmap operator and many other operators, then one > >>>>> TaskManager > >>>>> >> will have many ordinary subtasks which don't actually need > >>>>> overdraft and > >>>>> >> several subtasks that need this feature). But in case of > >>>>> reservation, > >>>>> >> we will reserve some buffers for all operators even if they don't > >>>>> really > >>>>> >> need it. > >>>>> >> > >>>>> >> > >>>>> >> -- Legacy source problem > >>>>> >> > >>>>> >> If we still want to change max buffers then it is a problem for > >>>>> >> LegacySources (since every subtask of a source will always use these > >>>>> >> overdraft buffers). But right now, I think that we can force 0 > >>>>> overdraft > >>>>> >> buffers for legacy subtasks in the configuration during execution (if > it > >>>>> is > >>>>> >> not too late to change the configuration at this point). > >>>>> >> > >>>>> >> > >>>>> >> On 03.05.2022 14:11, rui fan wrote: > >>>>> >>> Hi > >>>>> >>> > >>>>> >>> Thanks for Martijn Visser and Piotrek's feedback. I agree with > >>>>> >>> ignoring the legacy source; otherwise it will affect our design. Users > should > >>>>> >>> use the new Source API as much as possible. > >>>>> >>> > >>>>> >>> Hi Piotrek, we may still need to discuss whether the > >>>>> >>> overdraft/reserve/spare should use extra buffers or buffers > >>>>> >>> in (exclusive + floating buffers)? They have some differences. > >>>>> >>> > >>>>> >>> If it uses extra buffers: > >>>>> >>> 1. The LocalBufferPool will be available when (usedBuffers + 1 > >>>>> >>> <= currentPoolSize) and all subpartitions don't reach the > >>>>> >>> maxBuffersPerChannel. > >>>>> >>> > >>>>> >>> If it uses the buffers in (exclusive + floating buffers): > >>>>> >>> 1. 
The LocalBufferPool will be available when (usedBuffers + > >>>>> >>> overdraftBuffers <= currentPoolSize) and all subpartitions > >>>>> >>> don't reach the maxBuffersPerChannel. > >>>>> >>> 2. For low parallelism jobs, if overdraftBuffers is large (>8), > the > >>>>> >>> usedBuffers will be small. That is, the LocalBufferPool will > >>>>> >>> easily become unavailable. For throughput, if users turn up the > >>>>> >>> overdraft buffers, they need to turn up exclusive or floating > >>>>> >>> buffers. It also affects the InputChannel, and it's unfriendly > >>>>> >>> to users. > >>>>> >>> > >>>>> >>> So I prefer the overdraft to use extra buffers. > >>>>> >>> > >>>>> >>> > >>>>> >>> BTW, for watermark, the number of buffers it needs is > >>>>> >>> numberOfSubpartitions. So if > >>>>> overdraftBuffers=numberOfSubpartitions, > >>>>> >>> the watermark won't block in requestMemory. But it has > >>>>> >>> 2 problems: > >>>>> >>> 1. It needs more overdraft buffers. If the overdraft uses > >>>>> >>> (exclusive + floating buffers), there will be fewer buffers > >>>>> >>> available. Throughput may be affected. > >>>>> >>> 2. The numberOfSubpartitions is different for each Task. > >>>>> >>> So if users want to cover watermarks using this feature, > >>>>> >>> they don't know how to set the overdraftBuffers more > >>>>> >>> reasonably. And if the parallelism is changed, users still > >>>>> >>> need to change overdraftBuffers. It is unfriendly to users. > >>>>> >>> > >>>>> >>> So I propose we support overdraftBuffers=-1. It means > >>>>> >>> we will automatically set overdraftBuffers=numberOfSubpartitions > >>>>> >>> in the constructor of LocalBufferPool. > >>>>> >>> > >>>>> >>> Please correct me if I'm wrong. > >>>>> >>> > >>>>> >>> Thanks > >>>>> >>> fanrui > >>>>> >>> > >>>>> >>> On Tue, May 3, 2022 at 4:54 PM Piotr Nowojski < > >>>>> pnowoj...@apache.org> > >>>>> >> wrote: > >>>>> >>>> Hi fanrui, > >>>>> >>>> > >>>>> >>>>> Do you mean don't add the extra buffers? 
We just use (exclusive > >>>>> >> buffers * > >>>>> >>>>> parallelism + floating buffers)? The LocalBufferPool will be > >>>>> available > >>>>> >>>> when > >>>>> >>>>> (usedBuffers+overdraftBuffers <= > >>>>> >>>> exclusiveBuffers*parallelism+floatingBuffers) > >>>>> >>>>> and all subpartitions don't reach the maxBuffersPerChannel, > >>>>> right? > >>>>> >>>> I'm not sure. Definitely we would need to adjust the minimum > >>>>> number of > >>>>> >> the > >>>>> >>>> required buffers, just as we did when we were implementing the > non > >>>>> >> blocking > >>>>> >>>> outputs and adding availability logic to LocalBufferPool. Back > >>>>> then we > >>>>> >>>> added "+ 1" to the minimum number of buffers. Currently this > >>>>> logic is > >>>>> >>>> located > >>>>> NettyShuffleUtils#getMinMaxNetworkBuffersPerResultPartition: > >>>>> >>>> > >>>>> >>>>> int min = isSortShuffle ? sortShuffleMinBuffers : > >>>>> numSubpartitions + 1; > >>>>> >>>> For performance reasons, we always require at least one buffer > per > >>>>> >>>> sub-partition. Otherwise performance falls drastically. Now if > we > >>>>> >> require 5 > >>>>> >>>> overdraft buffers for output to be available, we need to have > >>>>> them on > >>>>> >> top > >>>>> >>>> of those "one buffer per sub-partition". So the logic should be > >>>>> changed > >>>>> >> to: > >>>>> >>>>> int min = isSortShuffle ? sortShuffleMinBuffers : > >>>>> numSubpartitions + > >>>>> >>>> numOverdraftBuffers; > >>>>> >>>> > >>>>> >>>> Regarding increasing the number of max buffers I'm not sure. As > >>>>> long as > >>>>> >>>> "overdraft << max number of buffers", because all buffers on the > >>>>> outputs > >>>>> >>>> are shared across all sub-partitions. 
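The adjusted minimum-buffer calculation quoted above can be written out as a small sketch. This is a hedged illustration modeled on the quoted `NettyShuffleUtils#getMinMaxNetworkBuffersPerResultPartition` line; the method name, the illustrative constant, and the simplified signature are stand-ins, not Flink's real code.

```java
// Sketch of the adjusted minimum calculation discussed above (modeled on the
// quoted NettyShuffleUtils#getMinMaxNetworkBuffersPerResultPartition line;
// names and the constant are simplified stand-ins, not Flink's real API).
public class MinBuffersSketch {
    static final int SORT_SHUFFLE_MIN_BUFFERS = 64; // illustrative value only

    static int minBuffersPerResultPartition(
            boolean isSortShuffle, int numSubpartitions, int numOverdraftBuffers) {
        // One buffer per sub-partition for throughput, plus the overdraft
        // buffers on top (previously the hardcoded "+ 1").
        return isSortShuffle
                ? SORT_SHUFFLE_MIN_BUFFERS
                : numSubpartitions + numOverdraftBuffers;
    }

    public static void main(String[] args) {
        // e.g. parallelism 100 with 5 overdraft buffers
        System.out.println(minBuffersPerResultPartition(false, 100, 5));
    }
}
```

With the old logic the second argument of the ternary was `numSubpartitions + 1`; the sketch just generalizes that "+ 1" to the configured overdraft count.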
If we have 5 overdraft > >>>> buffers, > >> and > >>>> parallelism of 100, it doesn't matter in the grand scheme of > >>>> things if > >> we > >>>> make the output available if at least one single buffer is > >>>> available or > >> at > >>>> least 5 buffers are available out of ~200 (100 * 2 + 8). So > >>>> the effects of > >>>> increasing the overdraft from 1 to, for example, 5 should be > >>>> negligible. > >> For > >>>> small parallelism, like 5, increasing the overdraft from 1 to 5 > still > >> increases > >>>> the overdraft by only about 25%. So maybe we can keep the max as > >>>> it is? > >>>> > >>>> If so, maybe we should change the name from "overdraft" to > "buffer > >> reserve" > >>>> or "spare buffers"? And document it as "number of buffers kept > in > >> reserve > >>>> in case of flatMap/firing timers/huge records"? > >>>> > >>>> What do you think, Fanrui, Anton? > >>>> > >>>> Re LegacySources. I agree we can kind of ignore them in the new > >> features, > >>>> as long as we don't break the existing deployments too much. > >>>> > >>>> Best, > >>>> Piotrek > >>>> > >>>> On Tue, 3 May 2022 at 09:20, Martijn Visser <mart...@ververica.com> > >> wrote: > >>>>> Hi everyone, > >>>>> > >>>>> Just wanted to chip in on the discussion of legacy sources: > >>>>> IMHO, we > >>>> should > >>>>> not focus too much on improving/adding capabilities for legacy > >>>>> sources. > >>>> We > >>>>> want to persuade and push users to use the new Source API. Yes, > >>>>> this > >>>> means > >>>>> that there's work required by the end users to port any custom > >>>>> source > >> to > >>>>> the new interface. The benefits of the new Source API should > >>>>> outweigh > >>>> this. 
> >>>>> >>>>> Anything that we build to support multiple interfaces means > >>>>> adding more > >>>>> >>>>> complexity and more possibilities for bugs. Let's try to make > our > >>>>> >> lives a > >>>>> >>>>> little bit easier. > >>>>> >>>>> > >>>>> >>>>> Best regards, > >>>>> >>>>> > >>>>> >>>>> Martijn Visser > >>>>> >>>>> https://twitter.com/MartijnVisser82 > >>>>> >>>>> https://github.com/MartijnVisser > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> On Tue, 3 May 2022 at 07:50, rui fan <1996fan...@gmail.com> > >>>>> wrote: > >>>>> >>>>> > >>>>> >>>>>> Hi Piotrek > >>>>> >>>>>> > >>>>> >>>>>>> Do you mean to ignore it while processing records, but keep > >>>>> using > >>>>> >>>>>>> `maxBuffersPerChannel` when calculating the availability of > the > >>>>> >>>> output? > >>>>> >>>>>> I think yes, and please Anton Kalashnikov to help double > check. > >>>>> >>>>>> > >>>>> >>>>>>> +1 for just having this as a separate configuration. Is it a > >>>>> big > >>>>> >>>>> problem > >>>>> >>>>>>> that legacy sources would be ignoring it? Note that we > already > >>>>> have > >>>>> >>>>>>> effectively hardcoded a single overdraft buffer. > >>>>> >>>>>>> `LocalBufferPool#checkAvailability` checks if there is a > single > >>>>> >>>> buffer > >>>>> >>>>>>> available and this works the same for all tasks (including > >>>>> legacy > >>>>> >>>>> source > >>>>> >>>>>>> tasks). Would it be a big issue if we changed it to check if > >>>>> at least > >>>>> >>>>>>> "overdraft number of buffers are available", where "overdraft > >>>>> number" > >>>>> >>>>> is > >>>>> >>>>>>> configurable, instead of the currently hardcoded value of > "1"? > >>>>> >>>>>> Do you mean don't add the extra buffers? We just use > (exclusive > >>>>> >>>> buffers * > >>>>> >>>>>> parallelism + floating buffers)? 
The LocalBufferPool will be > >>>>> available > >>>>> when > >>>>>> (usedBuffers+overdraftBuffers <= > >>>>>> exclusiveBuffers*parallelism+floatingBuffers) > >>>>>> and all subpartitions don't reach the maxBuffersPerChannel, > >>>>> right? > >>>>>> > >>>>>> If yes, I think it can solve the legacy source problem. > >>>>> There may > >> be > >>>>>> some impact. If overdraftBuffers is large and only one buffer > >>>>> is used > >>>> to > >>>>>> process a single record, exclusive buffers*parallelism + > >>>>> floating > >>>> buffers > >>>>>> cannot all be used. It may only be possible to use (exclusive > >>>>> buffers * > >>>>>> parallelism > >>>>>> + floating buffers - overdraft buffers + 1). For throughput, > if > >>>>> the overdraft buffers are turned > up, > >>>>> the > >>>>>> Flink user needs to turn up exclusive > or > >>>> floating > >>>>>> buffers. And it also affects the InputChannel. > >>>>>> > >>>>>> If not, I don't think it can solve the legacy source problem. > >>>>> The > >>>>> legacy > >>>>>> source doesn't check isAvailable. If there are extra > buffers, > >>>>> the legacy > >>>>>> source > >>>>>> will use them up until it blocks in requestMemory. > >>>>>> > >>>>>> > >>>>>> Thanks > >>>>>> fanrui > >>>>>> > >>>>>> On Tue, May 3, 2022 at 3:39 AM Piotr Nowojski < > >>>>> pnowoj...@apache.org> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi, > >>>>>>> > >>>>>>> +1 for the general proposal from my side. It would be a nice > >>>> workaround > >>>>>>> for the flatMap, WindowOperator and large-record issues with > >>>> unaligned > >>>>>>> checkpoints. > >>>>>>> > >>>>>>>> The first task is about ignoring max buffers per channel. 
This > >>>> means > >>>>> if > >>>>>>>> we request a memory segment from LocalBufferPool and the > >>>>>>>> maxBuffersPerChannel is reached for this channel, we just > >>>>> ignore > >>>> that > >>>>>>>> and continue to allocate buffers while the LocalBufferPool has > >>>>>>>> them (it is > >>>>>>>> actually not an overdraft). > >>>>>>> Do you mean to ignore it while processing records, but keep > >>>>> using > >>>>>>> `maxBuffersPerChannel` when calculating the availability of > the > >>>> output? > >>>>>>>> The second task is about the real overdraft. I am pretty > convinced > >>>>> now > >>>>>>>> that we, unfortunately, need a configuration option to limit the > >>>>> overdraft > >>>>>>>> number (because it is not ok if one subtask allocates all > buffers of > >>>>> one > >>>>>>>> TaskManager considering that several different jobs can be > >>>> submitted > >>>>> on > >>>>>>>> this TaskManager). So the idea is to have > >>>>>>>> maxOverdraftBuffersPerPartition (technically speaking, per > >>>>>> LocalBufferPool). > >>>>>>>> In this case, when the limit of buffers in the LocalBufferPool is > >>>> reached, > >>>>>>>> the LocalBufferPool can request additionally from the > NetworkBufferPool up > >>>> to > >>>>>>>> maxOverdraftBuffersPerPartition buffers. > >>>>>>> +1 for just having this as a separate configuration. Is it a > >>>>> big > >>>>> problem > >>>>>>> that legacy sources would be ignoring it? Note that we > already > >>>>> have > >>>>>>> effectively hardcoded a single overdraft buffer. > >>>>>>> `LocalBufferPool#checkAvailability` checks if there is a > single > >>>> buffer > >>>>>>> available and this works the same for all tasks (including > >>>>> legacy > >>>>> source > >>>>>>> tasks). 
Would it be a big issue if we changed it to check if > >>>>> at least > >>>>>>> "overdraft number of buffers are available", where "overdraft > >>>>> number" > >>>>> is > >>>>>>> configurable, instead of the currently hardcoded value of > "1"? > >>>>>>> > >>>>>>> Best, > >>>>>>> Piotrek > >>>>>>> > >>>>>>> On Fri, 29 Apr 2022 at 17:04, rui fan <1996fan...@gmail.com> > >>>>> wrote: > >>>>>>> > >>>>>>>> Let me add some information about the LegacySource. > >>>>>>>> > >>>>>>>> If we want to disable the overdraft buffer for the LegacySource, > >>>>>>>> could we add an enableOverdraft flag in LocalBufferPool? > >>>>>>>> The default value is false. If the getAvailableFuture is > >>>>> called, > >>>>>>>> change enableOverdraft=true. It indicates whether > >>>>>>>> isAvailable is checked elsewhere. > >>>>>>>> > >>>>>>>> I don't think it is elegant, but it's safe. Please correct > me > >>>>> if > >>>> I'm > >>>>>>> wrong. > >>>>>>>> Thanks > >>>>>>>> fanrui > >>>>>>>> > >>>>>>>> On Fri, Apr 29, 2022 at 10:23 PM rui fan < > >>>>> 1996fan...@gmail.com> > >>>>> wrote: > >>>>>>>>> Hi, > >>>>>>>>> > >>>>>>>>> Thanks for your quick response. > >>>>>>>>> > >>>>>>>>> For question 1/2/3, we think they are clear. We just need > to > >>>>> discuss > >>>>>>> the > >>>>>>>>> default value in the PR. > >>>>>>>>> > >>>>>>>>> For the legacy source, you are right. A general > >>>>>>>>> implementation is difficult. > >>>>>>>>> Currently, we implement ensureRecordWriterIsAvailable() in > >>>>>>>>> SourceFunction.SourceContext. And call it in our common > >>>>> LegacySource, > >>>>>>>>> e.g. FlinkKafkaConsumer. 
Over 90% of our Flink jobs consume > >>>> Kafka, > so > >>>>>>>>> fixing FlinkKafkaConsumer solved most of our problems. > >>>>>>>>> > >>>>>>>>> Core code: > >>>>>>>>> ```
> >>>>>>>>> public void ensureRecordWriterIsAvailable() {
> >>>>>>>>>     if (recordWriter == null
> >>>>>>>>>             || !configuration.getBoolean(ExecutionCheckpointingOptions.ENABLE_UNALIGNED, false)
> >>>>>>>>>             || recordWriter.isAvailable()) {
> >>>>>>>>>         return;
> >>>>>>>>>     }
> >>>>>>>>>
> >>>>>>>>>     CompletableFuture<?> resumeFuture = recordWriter.getAvailableFuture();
> >>>>>>>>>     try {
> >>>>>>>>>         resumeFuture.get();
> >>>>>>>>>     } catch (Throwable ignored) {
> >>>>>>>>>     }
> >>>>>>>>> }
> >>>>>>>>> ```
> >>>>>>>>> LegacySource calls sourceContext.ensureRecordWriterIsAvailable() > >>>>>>>>> before entering synchronized (checkpointLock) and collecting records. > >>>>>>>>> Please let me know if there is a better solution. > >>>>>>>>> > >>>>>>>>> Thanks > >>>>>>>>> fanrui > >>>>>>>>> > >>>>>>>>> On Fri, Apr 29, 2022 at 9:45 PM Anton Kalashnikov < > >>>>>> kaa....@yandex.com> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi. > >>>>>>>>>> > >>>>>>>>>> -- 1. Do you mean split this into two JIRAs or two PRs or > >>>>> two > >>>>>> commits > >>>>>>>> in a > >>>>>>>>>> PR? > >>>>>>>>>> > >>>>>>>>>> Perhaps a separate ticket will be better since this > >>>>> task has > >>>>>> fewer > >>>>>>>>>> questions, but we should find a solution for LegacySource > >>>>> first. > >>>>>>>>>> > >>>>>>>>>> -- 2. For the first task, if the Flink user disables the > >>>>> Unaligned > >>>>>>>>>> Checkpoint, do we ignore max buffers per channel? 
Because the overdraft isn't useful for the Aligned Checkpoint; it still
needs to wait for the downstream task to consume.

I think the logic should be the same for AC and UC. As I understand it,
the overdraft may not really help for AC, but it doesn't make things
worse either.

3. For the second task

-- The default value of maxOverdraftBuffersPerPartition may also need to
be discussed.

I think it should be a pretty small value, or even 0, since it is a kind
of optimization and users should understand what they are doing
(especially if we implement the first task).

-- If the user disables the Unaligned Checkpoint, can we set
maxOverdraftBuffersPerPartition=0? Because the overdraft isn't useful for
the Aligned Checkpoint.

The same answer as above: if the overdraft doesn't cause a degradation
for the Aligned Checkpoint, I don't think we should differentiate between
AC and UC.

4. For the legacy source

-- If the Unaligned Checkpoint is enabled, it uses up to
maxOverdraftBuffersPerPartition buffers.
-- If the UC is disabled, it doesn't use the overdraft buffer. Do you
think that's ok?

Ideally, I don't want to use the overdraft for LegacySource at all, since
it can lead to undesirable results, especially if the limit is high. At
least as I understand it, it would always work in overdraft mode and
would borrow maxOverdraftBuffersPerPartition buffers from the global
pool, which can degrade other subtasks on the same TaskManager.

-- Actually, we added the checkAvailable logic for LegacySource in our
internal version. It works well.

I don't really understand how that is possible in the general case,
considering that each user has their own implementation of the legacy
source operator.

-- 5. For the benchmark, do you have any suggestions? I submitted the PR
[1].

I haven't looked at it yet, but I'll try to do so soon.

On 29.04.2022 14:14, rui fan wrote:

Hi,

Thanks for your feedback. I have several questions.

1. Do you mean splitting this into two JIRAs, two PRs, or two commits in
one PR?
2.
For the first task, if the Flink user disables the Unaligned Checkpoint,
do we ignore max buffers per channel? Because the overdraft isn't useful
for the Aligned Checkpoint; it still needs to wait for the downstream
task to consume.
3. For the second task
   - The default value of maxOverdraftBuffersPerPartition may also need
     to be discussed.
   - If the user disables the Unaligned Checkpoint, can we set
     maxOverdraftBuffersPerPartition=0? Because the overdraft isn't
     useful for the Aligned Checkpoint.
4. For the legacy source
   - If the Unaligned Checkpoint is enabled, it uses up to
     maxOverdraftBuffersPerPartition buffers.
   - If the UC is disabled, it doesn't use the overdraft buffer.
   - Do you think it's ok?
   - Actually, we added the checkAvailable logic for LegacySource in our
     internal version. It works well.
5. For the benchmark, do you have any suggestions? I submitted the PR
[1].
[1] https://github.com/apache/flink-benchmarks/pull/54

Thanks
fanrui

On Fri, Apr 29, 2022 at 7:41 PM Anton Kalashnikov <kaa....@yandex.com>
wrote:

Hi,

We discussed this a little with Dawid Wysakowicz. Here are some
conclusions:

First of all, let's split this into two tasks.

The first task is about ignoring max buffers per channel. This means that
if we request a memory segment from the LocalBufferPool and
maxBuffersPerChannel is reached for this channel, we just ignore that and
continue to allocate buffers while the LocalBufferPool has them (this is
actually not an overdraft).

The second task is about the real overdraft. I am pretty convinced now
that we unfortunately need a configuration option to limit the overdraft
number (because it is not ok if one subtask allocates all buffers of a
TaskManager, considering that several different jobs can be submitted to
the same TaskManager). So the idea is to have
maxOverdraftBuffersPerPartition (technically speaking, per
LocalBufferPool).
In this case, when the buffer limit of the LocalBufferPool is reached,
the LocalBufferPool can additionally request up to
maxOverdraftBuffersPerPartition buffers from the NetworkBufferPool.

But it is still not clear how to handle LegacySource, since it
effectively works as an unbounded flatmap and would always run in
overdraft mode, which is not the goal. So we still need to think about
that.

On 29.04.2022 11:11, rui fan wrote:

Hi Anton Kalashnikov,

I think you agree that we should limit the maximum number of overdraft
segments that each LocalBufferPool can request, right?

I prefer to hard-code maxOverdraftBuffers in order not to add a new
configuration option. And I hope to hear more from the community.

Best wishes
fanrui

On Thu, Apr 28, 2022 at 12:39 PM rui fan <1996fan...@gmail.com> wrote:

Hi Anton Kalashnikov,

Thanks for your very clear reply; I think you are totally right.
The 'maxBuffersNumber - buffersInUseNumber' can be used as the overdraft
buffer, so it won't need a new buffer configuration. Flink users can turn
up maxBuffersNumber to control the overdraft buffer size.

Also, I'd like to add some information. For safety, we should limit the
maximum number of overdraft segments that each LocalBufferPool can
request.

Why do we limit it?
Some operators don't check `recordWriter.isAvailable` while processing
records, such as LegacySource. I mentioned this in FLINK-26759 [1]. I'm
not sure if there are other cases.

If we don't add the limitation, the LegacySource will use up all the
remaining memory in the NetworkBufferPool when the backpressure is
severe.

How to limit it?
I prefer to hard-code `maxOverdraftBuffers=numberOfSubpartitions` in the
constructor of LocalBufferPool. The maxOverdraftBuffers is just for
safety, and it should be enough for most Flink jobs. Or we can set
`maxOverdraftBuffers=Math.max(numberOfSubpartitions, 10)` to handle jobs
of low parallelism.
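The proposed hard-coded limit can be sketched in a couple of lines. The class and method names here are made up for illustration; only the `Math.max(numberOfSubpartitions, 10)` formula comes from the mail above:

```java
// Sketch of the proposed per-LocalBufferPool overdraft cap: at least 10
// segments, so low-parallelism jobs also get a useful overdraft.
// Illustrative only; not Flink's actual implementation.
public class OverdraftLimit {

    static int maxOverdraftBuffers(int numberOfSubpartitions) {
        return Math.max(numberOfSubpartitions, 10);
    }

    public static void main(String[] args) {
        System.out.println(maxOverdraftBuffers(4));   // low parallelism: floor of 10
        System.out.println(maxOverdraftBuffers(128)); // high parallelism: 128
    }
}
```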
Also, if the user doesn't enable the Unaligned Checkpoint, we can set
maxOverdraftBuffers=0 in the constructor of LocalBufferPool, because the
overdraft isn't useful for the Aligned Checkpoint.

Please correct me if I'm wrong. Thanks a lot.

[1] https://issues.apache.org/jira/browse/FLINK-26759

Best wishes
fanrui

On Thu, Apr 28, 2022 at 12:29 AM Anton Kalashnikov <kaa....@yandex.com>
wrote:

Hi fanrui,

Thanks for creating the FLIP.

In general, I think the overdraft is a good idea and it should help in
the cases described above. Here are my thoughts about configuration.
Please correct me if I am wrong, but as I understand it, right now we
have the following calculation:

maxBuffersNumber (per TaskManager) = Network memory (calculated via
taskmanager.memory.network.fraction, taskmanager.memory.network.min,
taskmanager.memory.network.max and the total memory size) /
taskmanager.memory.segment-size
requiredBuffersNumber (per TaskManager) = (exclusive buffers *
parallelism + floating buffers) * number of subtasks in the TaskManager

buffersInUseNumber = the real number of buffers in use at the current
moment (always <= requiredBuffersNumber)

Ideally, requiredBuffersNumber should be equal to maxBuffersNumber, which
lets Flink work predictably. But if requiredBuffersNumber >
maxBuffersNumber, sometimes that is also fine (though not good), since
not all required buffers are really mandatory (e.g. it is ok if Flink
cannot allocate the floating buffers).

But if maxBuffersNumber > requiredBuffersNumber, as I understand it,
Flink just never uses these leftover buffers (maxBuffersNumber -
requiredBuffersNumber), which is what I propose to use. (We could
actually even use the difference 'requiredBuffersNumber -
buffersInUseNumber', since one TaskManager can contain several operators,
including 'window', which can temporarily borrow buffers from the global
pool.)
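As a concrete illustration of the accounting above, here is a small sketch. The formulas mirror the mail; the method names, the 32 KiB segment size, and the sample numbers are assumptions for illustration, not values from a real deployment:

```java
// Worked example of the buffer accounting described above.
public class BufferMath {

    // maxBuffersNumber = network memory / taskmanager.memory.segment-size
    static long maxBuffersNumber(long networkMemoryBytes, long segmentSizeBytes) {
        return networkMemoryBytes / segmentSizeBytes;
    }

    // requiredBuffersNumber =
    //   (exclusive buffers per channel * channels + floating buffers) * subtasks
    static long requiredBuffersNumber(
            int exclusivePerChannel, int channels, int floatingPerGate, int subtasks) {
        return (long) (exclusivePerChannel * channels + floatingPerGate) * subtasks;
    }

    public static void main(String[] args) {
        long max = maxBuffersNumber(128L * 1024 * 1024, 32 * 1024); // 128 MiB / 32 KiB
        long required = requiredBuffersNumber(2, 100, 8, 4);
        // The leftover (max - required) is what the proposal would let the
        // overdraft use.
        System.out.println(max + " " + required + " " + (max - required));
    }
}
```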
My proposal, more specifically (it relates only to requesting buffers
while processing a single record; switching to unavailability between
records should stay the same as it is now):

* If one more buffer is requested but maxBuffersPerChannel is reached,
then just ignore this limitation and allocate the buffer from any place
(from the LocalBufferPool if it still has something, otherwise from the
NetworkBufferPool).

* If the LocalBufferPool exceeds its limit, then temporarily allocate
from the NetworkBufferPool while it has something to allocate.

Maybe I missed something and this solution won't work, but I like it
since, on the one hand, it works out of the box without any
configuration, and on the other hand, it can be configured by changing
the proportion of maxBuffersNumber and requiredBuffersNumber.
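The two-step allocation described above can be modeled with a tiny sketch. Every name here is illustrative (the real LocalBufferPool logic is considerably more involved); it only demonstrates the "serve locally first, then overdraw from a shared pool up to a cap" idea:

```java
// Minimal model of the proposal: serve a request from the local pool first;
// once the local limit is hit, temporarily "overdraw" from a shared
// (network) pool, bounded by an overdraft limit. Illustrative names only.
public class OverdraftPoolSketch {
    private int localAvailable;     // buffers left in this LocalBufferPool
    private int globalAvailable;    // buffers left in the NetworkBufferPool
    private final int maxOverdraft; // cap on borrowed buffers per pool
    private int overdraftInUse;

    OverdraftPoolSketch(int local, int global, int maxOverdraft) {
        this.localAvailable = local;
        this.globalAvailable = global;
        this.maxOverdraft = maxOverdraft;
    }

    /** Returns true if a buffer was handed out, false if the caller must wait. */
    boolean requestBuffer() {
        if (localAvailable > 0) { // normal path: local pool still has buffers
            localAvailable--;
            return true;
        }
        // Local limit reached: borrow from the global pool, up to maxOverdraft.
        if (overdraftInUse < maxOverdraft && globalAvailable > 0) {
            overdraftInUse++;
            globalAvailable--;
            return true;
        }
        return false;
    }
}
```

A pool with one local buffer and an overdraft cap of 2 hands out three buffers and then forces the caller to wait, which is exactly the bounded-overdraft behaviour argued for above.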
The last thing I want to say is that I don't really want to implement a
new configuration option, since even now it is not clear how to correctly
configure network buffers with the existing options, and I don't want to
complicate it further, especially if the problem can be resolved
automatically (as described above).

So is my understanding of network memory/buffers correct?

--
Best regards,
Anton Kalashnikov

On 27.04.2022 07:46, rui fan wrote:

Hi everyone,

Unaligned Checkpoint (FLIP-76 [1]) is a major feature of Flink. It
effectively solves the problem of checkpoint timeouts or slow checkpoints
when backpressure is severe.

We found that UC (Unaligned Checkpoint) does not work well when the
backpressure is severe and multiple output buffers are required to
process a single record. FLINK-14396 [2] also mentioned this issue
before. So we propose the overdraft buffer to solve it.
I created FLINK-26762 [3] and FLIP-227 [4] to detail the overdraft buffer
mechanism. After discussing with Anton Kalashnikov, there are still some
points to discuss:

* There are already a lot of buffer-related configuration options. Do we
  need to add a new one for the overdraft buffer?
* Where should the overdraft buffer take its memory from?
  * If the overdraft buffer uses the memory remaining in the
    NetworkBufferPool, no new configuration option needs to be added.
  * If adding a new configuration option:
    * Should we set the overdraft-memory-size at the TM level or the
      Task level?
    * Or set overdraft-buffers to indicate the number of memory segments
      that can be overdrawn?
    * What is the default value? How do we choose sensible defaults?

Currently, I have implemented a POC [5] and verified it using
flink-benchmarks [6]. The POC sets overdraft-buffers at the Task level,
with a default value of 10. That is: each LocalBufferPool can overdraw
up to 10 memory segments.

Looking forward to your feedback!
Thanks,
fanrui

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2] https://issues.apache.org/jira/browse/FLINK-14396
[3] https://issues.apache.org/jira/browse/FLINK-26762
[4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
[5] https://github.com/1996fanrui/flink/commit/c7559d94767de97c24ea8c540878832138c8e8fe
[6] https://github.com/apache/flink-benchmarks/pull/54

--
Best regards,
Anton Kalashnikov