Re: [DISCUSS] FLIP-227: Support overdraft buffer

rui fan Fri, 06 May 2022 01:58:17 -0700

Hi

I created the FLINK-27530[1] as the parent ticket. And I
updated it to FLIP.


[1] https://issues.apache.org/jira/browse/FLINK-27530

Thanks
fanrui

On Fri, May 6, 2022 at 4:27 PM Piotr Nowojski <[email protected]> wrote:

> Hi,
>
> I'm not sure. Maybe 5 will be fine? Anton, Dawid, what do you think?
>
> Can you create a parent ticket for the whole FLIP to group all of the
> issues together?
>
> Also FLIP should be officially voted first.
>
> Best,
> Piotrek
>
> pt., 6 maj 2022 o 09:08 rui fan <[email protected]> napisał(a):
>
> > Hi Anton, Piotrek and Dawid,
> >
> > Thanks for your help.
> >
> > I created FLINK-27522[1] as the first task. And I will finish it asap.
> >
> > @Piotrek, for the default value, do you think it should be less
> > than 5? What do you think about 3? Actually, I think 5 isn't big.
> > It's 1 or 3 or 5 that doesn't matter much, the focus is on
> > reasonably resolving deadlock problems. Or I push the second
> > task to move forward first and we discuss the default value in PR.
> >
> > For the legacySource, I got your idea. And I propose we create
> > the third task to handle it. Because it is independent and for
> > compatibility with the old API. What do you think? I updated
> > the third task on FLIP-227[2].
> >
> > If all is ok, I will create a JIRA for the third Task and add it to
> > FLIP-227. And I will develop them from the first task to the
> > third task.
> >
> > Thanks again for your help.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-27522
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
> >
> > Thanks
> > fanrui
> >
> > On Fri, May 6, 2022 at 3:50 AM Piotr Nowojski <[email protected]>
> > wrote:
> >
> > > Hi fanrui,
> > >
> > > > How to identify legacySource?
> > >
> > > legacy sources are always using the SourceStreamTask class and
> > > SourceStreamTask is used only for legacy sources. But I'm not sure how
> to
> > > enable/disable that. Adding `disableOverdraft()` call in
> SourceStreamTask
> > > would be better compared to relying on the `getAvailableFuture()` call
> > > (isn't it used for back pressure metric anyway?). Ideally we should
> > > enable/disable it in the constructors, but that might be tricky.
> > >
> > > > I prefer it to be between 5 and 10
> > >
> > > I would vote for a smaller value because of FLINK-13203
> > >
> > > Piotrek
> > >
> > >
> > >
> > > czw., 5 maj 2022 o 11:49 rui fan <[email protected]> napisał(a):
> > >
> > >> Hi,
> > >>
> > >> Thanks a lot for your discussion.
> > >>
> > >> After several discussions, I think it's clear now. I updated the
> > >> "Proposed Changes" of FLIP-227[1]. If I have something
> > >> missing, please help to add it to FLIP, or add it in the mail
> > >> and I can add it to FLIP. If everything is OK, I will create a
> > >> new JIRA for the first task, and use FLINK-26762[2] as the
> > >> second task.
> > >>
> > >> About the legacy source, do we set maxOverdraftBuffersPerGate=0
> > >> directly? How to identify legacySource? Or could we add
> > >> the overdraftEnabled in LocalBufferPool? The default value
> > >> is false. If the getAvailableFuture is called, change
> > >> overdraftEnabled=true.
> > >> It indicates whether there are checks isAvailable elsewhere.
> > >> It might be more general, it can cover more cases.
> > >>
> > >> Also, I think the default value of 'max-overdraft-buffers-per-gate'
> > >> needs to be confirmed. I prefer it to be between 5 and 10. How
> > >> do you think?
> > >>
> > >> [1]
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
> > >> [2] https://issues.apache.org/jira/browse/FLINK-26762
> > >>
> > >> Thanks
> > >> fanrui
> > >>
> > >> On Thu, May 5, 2022 at 4:41 PM Piotr Nowojski <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi again,
> > >>>
> > >>> After sleeping over this, if both versions (reserve and overdraft)
> have
> > >>> the same complexity, I would also prefer the overdraft.
> > >>>
> > >>> > `Integer.MAX_VALUE` as default value was my idea as well but now,
> as
> > >>> > Dawid mentioned, I think it is dangerous since it is too implicit
> for
> > >>> > the user and if the user submits one more job for the same
> TaskManger
> > >>>
> > >>> As I mentioned, it's not only an issue with multiple jobs. The same
> > >>> problem can happen with different subtasks from the same job,
> > potentially
> > >>> leading to the FLINK-13203 deadlock [1]. With FLINK-13203 fixed, I
> > would be
> > >>> in favour of Integer.MAX_VALUE to be the default value, but as it
> is, I
> > >>> think we should indeed play on the safe side and limit it.
> > >>>
> > >>> > I still don't understand how should be limited "reserve"
> > >>> implementation.
> > >>> > I mean if we have X buffers in total and the user sets overdraft
> > equal
> > >>> > to X we obviously can not reserve all buffers, but how many we are
> > >>> > allowed to reserve? Should it be a different configuration like
> > >>> > percentegeForReservedBuffers?
> > >>>
> > >>> The reserve could be defined as percentage, or as a fixed number of
> > >>> buffers. But yes. In normal operation subtask would not use the
> > reserve, as
> > >>> if numberOfAvailableBuffers < reserve, the output would be not
> > available.
> > >>> Only in the flatMap/timers/huge records case the reserve could be
> used.
> > >>>
> > >>> > 1. If the total buffers of LocalBufferPool <= the reserve buffers,
> > >>> will LocalBufferPool never be available? Can't process data?
> > >>>
> > >>> Of course we would need to make sure that never happens. So the
> reserve
> > >>> should be < total buffer size.
> > >>>
> > >>> > 2. If the overdraft buffer use the extra buffers, when the
> downstream
> > >>> > task inputBuffer is insufficient, it should fail to start the job,
> > and
> > >>> then
> > >>> > restart? When the InputBuffer is initialized, it will apply for
> > enough
> > >>> > buffers, right?
> > >>>
> > >>> The failover if downstream can not allocate buffers is already
> > >>> implemented FLINK-14872 [2]. There is a timeout for how long the task
> > is
> > >>> waiting for buffer allocation. However this doesn't prevent many
> > >>> (potentially infinitely many) deadlock/restarts cycles. IMO the
> propper
> > >>> solution for [1] would be 2b described in the ticket:
> > >>>
> > >>> > 2b. Assign extra buffers only once all of the tasks are RUNNING.
> This
> > >>> is a simplified version of 2a, without tracking the tasks
> > sink-to-source.
> > >>>
> > >>> But that's a pre-existing problem and I don't think we have to solve
> it
> > >>> before implementing overdraft. I think we would need to solve it only
> > >>> before setting Integer.MAX_VALUE as the default for the overdraft.
> > Maybe I
> > >>> would hesitate setting the overdraft to anything more then a couple
> of
> > >>> buffers by default for the same reason.
> > >>>
> > >>> > Actually, I totally agree that we don't need a lot of buffers for
> > >>> overdraft
> > >>>
> > >>> and
> > >>>
> > >>> > Also I agree we ignore the overdraftBuffers=numberOfSubpartitions.
> > >>> > When we finish this feature and after users use it, if users
> feedback
> > >>> > this issue we can discuss again.
> > >>>
> > >>> +1
> > >>>
> > >>> Piotrek
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/FLINK-13203
> > >>> [2] https://issues.apache.org/jira/browse/FLINK-14872
> > >>>
> > >>> czw., 5 maj 2022 o 05:52 rui fan <[email protected]> napisał(a):
> > >>>
> > >>>> Hi everyone,
> > >>>>
> > >>>> I still have some questions.
> > >>>>
> > >>>> 1. If the total buffers of LocalBufferPool <= the reserve buffers,
> > will
> > >>>> LocalBufferPool never be available? Can't process data?
> > >>>> 2. If the overdraft buffer use the extra buffers, when the
> downstream
> > >>>> task inputBuffer is insufficient, it should fail to start the job,
> and
> > >>>> then
> > >>>> restart? When the InputBuffer is initialized, it will apply for
> enough
> > >>>> buffers, right?
> > >>>>
> > >>>> Also I agree we ignore the overdraftBuffers=numberOfSubpartitions.
> > >>>> When we finish this feature and after users use it, if users
> feedback
> > >>>> this issue we can discuss again.
> > >>>>
> > >>>> Thanks
> > >>>> fanrui
> > >>>>
> > >>>> On Wed, May 4, 2022 at 5:29 PM Dawid Wysakowicz <
> > [email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hey all,
> > >>>>>
> > >>>>> I have not replied in the thread yet, but I was following the
> > >>>>> discussion.
> > >>>>>
> > >>>>> Personally, I like Fanrui's and Anton's idea. As far as I
> understand
> > >>>>> it
> > >>>>> the idea to distinguish between inside flatMap & outside would be
> > >>>>> fairly
> > >>>>> simple, but maybe slightly indirect. The checkAvailability would
> > >>>>> remain
> > >>>>> unchanged and it is checked always between separate invocations of
> > the
> > >>>>> UDF. Therefore the overdraft buffers would not apply there. However
> > >>>>> once
> > >>>>> the pool says it is available, it means it has at least an initial
> > >>>>> buffer. So any additional request without checking for availability
> > >>>>> can
> > >>>>> be considered to be inside of processing a single record. This does
> > >>>>> not
> > >>>>> hold just for the LegacySource as I don't think it actually checks
> > for
> > >>>>> the availability of buffers in the LocalBufferPool.
> > >>>>>
> > >>>>> In the offline chat with Anton, we also discussed if we need a
> limit
> > >>>>> of
> > >>>>> the number of buffers we could overdraft (or in other words if the
> > >>>>> limit
> > >>>>> should be equal to Integer.MAX_VALUE), but personally I'd prefer to
> > >>>>> stay
> > >>>>> on the safe side and have it limited. The pool of network buffers
> is
> > >>>>> shared for the entire TaskManager, so it means it can be shared
> even
> > >>>>> across tasks of separate jobs. However, I might be just
> unnecessarily
> > >>>>> cautious here.
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Dawid
> > >>>>>
> > >>>>> On 04/05/2022 10:54, Piotr Nowojski wrote:
> > >>>>> > Hi,
> > >>>>> >
> > >>>>> > Thanks for the answers.
> > >>>>> >
> > >>>>> >> we may still need to discuss whether the
> > >>>>> >> overdraft/reserve/spare should use extra buffers or buffers
> > >>>>> >> in (exclusive + floating buffers)?
> > >>>>> > and
> > >>>>> >
> > >>>>> >> These things resolve the different problems (at least as I see
> > >>>>> that).
> > >>>>> >> The current hardcoded "1"  says that we switch "availability" to
> > >>>>> >> "unavailability" when one more buffer is left(actually a little
> > less
> > >>>>> >> than one buffer since we write the last piece of data to this
> last
> > >>>>> >> buffer). The overdraft feature doesn't change this logic we
> still
> > >>>>> want
> > >>>>> >> to switch to "unavailability" in such a way but if we are
> already
> > in
> > >>>>> >> "unavailability" and we want more buffers then we can take
> > >>>>> "overdraft
> > >>>>> >> number" more. So we can not avoid this hardcoded "1" since we
> need
> > >>>>> to
> > >>>>> >> understand when we should switch to "unavailability"
> > >>>>> > Ok, I see. So it seems to me that both of you have in mind to
> keep
> > >>>>> the
> > >>>>> > buffer pools as they are right now, but if we are in the middle
> of
> > >>>>> > processing a record, we can request extra overdraft buffers on
> top
> > of
> > >>>>> > those? This is another way to implement the overdraft to what I
> was
> > >>>>> > thinking. I was thinking about something like keeping the
> > >>>>> "overdraft" or
> > >>>>> > more precisely buffer "reserve" in the buffer pool. I think my
> > >>>>> version
> > >>>>> > would be easier to implement, because it is just fiddling with
> > >>>>> min/max
> > >>>>> > buffers calculation and slightly modified `checkAvailability()`
> > >>>>> logic.
> > >>>>> >
> > >>>>> > On the other hand  what you have in mind would better utilise the
> > >>>>> available
> > >>>>> > memory, right? It would require more code changes (how would we
> > know
> > >>>>> when
> > >>>>> > we are allowed to request the overdraft?). However, in this
> case, I
> > >>>>> would
> > >>>>> > be tempted to set the number of overdraft buffers by default to
> > >>>>> > `Integer.MAX_VALUE`, and let the system request as many buffers
> as
> > >>>>> > necessary. The only downside that I can think of (apart of higher
> > >>>>> > complexity) would be higher chance of hitting a known/unsolved
> > >>>>> deadlock [1]
> > >>>>> > in a scenario:
> > >>>>> > - downstream task hasn't yet started
> > >>>>> > - upstream task requests overdraft and uses all available memory
> > >>>>> segments
> > >>>>> > from the global pool
> > >>>>> > - upstream task is blocked, because downstream task hasn't
> started
> > >>>>> yet and
> > >>>>> > can not consume any data
> > >>>>> > - downstream task tries to start, but can not, as there are no
> > >>>>> available
> > >>>>> > buffers
> > >>>>> >
> > >>>>> >> BTW, for watermark, the number of buffers it needs is
> > >>>>> >> numberOfSubpartitions. So if
> > overdraftBuffers=numberOfSubpartitions,
> > >>>>> >> the watermark won't block in requestMemory.
> > >>>>> > and
> > >>>>> >
> > >>>>> >> the best overdraft size will be equal to parallelism.
> > >>>>> > That's a lot of buffers. I don't think we need that many for
> > >>>>> broadcasting
> > >>>>> > watermarks. Watermarks are small, and remember that every
> > >>>>> subpartition has
> > >>>>> > some partially filled/empty WIP buffer, so the vast majority of
> > >>>>> > subpartitions will not need to request a new buffer.
> > >>>>> >
> > >>>>> > Best,
> > >>>>> > Piotrek
> > >>>>> >
> > >>>>> > [1] https://issues.apache.org/jira/browse/FLINK-13203
> > >>>>> >
> > >>>>> > wt., 3 maj 2022 o 17:15 Anton Kalashnikov <[email protected]>
> > >>>>> napisał(a):
> > >>>>> >
> > >>>>> >> Hi,
> > >>>>> >>
> > >>>>> >>
> > >>>>> >>   >> Do you mean to ignore it while processing records, but keep
> > >>>>> using
> > >>>>> >> `maxBuffersPerChannel` when calculating the availability of the
> > >>>>> output?
> > >>>>> >>
> > >>>>> >>
> > >>>>> >> Yes, it is correct.
> > >>>>> >>
> > >>>>> >>
> > >>>>> >>   >> Would it be a big issue if we changed it to check if at
> least
> > >>>>> >> "overdraft number of buffers are available", where "overdraft
> > >>>>> number" is
> > >>>>> >> configurable, instead of the currently hardcoded value of "1"?
> > >>>>> >>
> > >>>>> >>
> > >>>>> >> These things resolve the different problems (at least as I see
> > >>>>> that).
> > >>>>> >> The current hardcoded "1"  says that we switch "availability" to
> > >>>>> >> "unavailability" when one more buffer is left(actually a little
> > less
> > >>>>> >> than one buffer since we write the last piece of data to this
> last
> > >>>>> >> buffer). The overdraft feature doesn't change this logic we
> still
> > >>>>> want
> > >>>>> >> to switch to "unavailability" in such a way but if we are
> already
> > in
> > >>>>> >> "unavailability" and we want more buffers then we can take
> > >>>>> "overdraft
> > >>>>> >> number" more. So we can not avoid this hardcoded "1" since we
> need
> > >>>>> to
> > >>>>> >> understand when we should switch to "unavailability"
> > >>>>> >>
> > >>>>> >>
> > >>>>> >> -- About "reserve" vs "overdraft"
> > >>>>> >>
> > >>>>> >> As Fanrui mentioned above, perhaps, the best overdraft size will
> > be
> > >>>>> >> equal to parallelism. Also, the user can set any value he wants.
> > So
> > >>>>> even
> > >>>>> >> if parallelism is small(~5) but the user's flatmap produces a
> lot
> > of
> > >>>>> >> data, the user can set 10 or even more. Which almost double the
> > max
> > >>>>> >> buffers and it will be impossible to reserve. At least we need
> to
> > >>>>> figure
> > >>>>> >> out how to protect from such cases (the limit for an
> overdraft?).
> > So
> > >>>>> >> actually it looks even more difficult than increasing the
> maximum
> > >>>>> buffers.
> > >>>>> >>
> > >>>>> >> I want to emphasize that overdraft buffers are soft
> configuration
> > >>>>> which
> > >>>>> >> means it takes as many buffers as the global buffers pool has
> > >>>>> >> available(maybe zero) but less than this configured value. It is
> > >>>>> also
> > >>>>> >> important to notice that perhaps, not many subtasks in
> TaskManager
> > >>>>> will
> > >>>>> >> be using this feature so we don't actually need a lot of
> available
> > >>>>> >> buffers for every subtask(Here, I mean that if we have only one
> > >>>>> >> window/flatmap operator and many other operators, then one
> > >>>>> TaskManager
> > >>>>> >> will have many ordinary subtasks which don't actually need
> > >>>>> overdraft and
> > >>>>> >> several subtasks that needs this feature). But in case of
> > >>>>> reservation,
> > >>>>> >> we will reserve some buffers for all operators even if they
> don't
> > >>>>> really
> > >>>>> >> need it.
> > >>>>> >>
> > >>>>> >>
> > >>>>> >> -- Legacy source problem
> > >>>>> >>
> > >>>>> >> If we still want to change max buffers then it is problem for
> > >>>>> >> LegacySources(since every subtask of source will always use
> these
> > >>>>> >> overdraft). But right now, I think that we can force to set 0
> > >>>>> overdraft
> > >>>>> >> buffers for legacy subtasks in configuration during execution(if
> > it
> > >>>>> is
> > >>>>> >> not too late for changing configuration in this place).
> > >>>>> >>
> > >>>>> >>
> > >>>>> >> 03.05.2022 14:11, rui fan пишет:
> > >>>>> >>> Hi
> > >>>>> >>>
> > >>>>> >>> Thanks for Martijn Visser and Piotrek's feedback.  I agree with
> > >>>>> >>> ignoring the legacy source, it will affect our design. User
> > should
> > >>>>> >>> use the new Source Api as much as possible.
> > >>>>> >>>
> > >>>>> >>> Hi Piotrek, we may still need to discuss whether the
> > >>>>> >>> overdraft/reserve/spare should use extra buffers or buffers
> > >>>>> >>> in (exclusive + floating buffers)? They have some differences.
> > >>>>> >>>
> > >>>>> >>> If it uses extra buffers:
> > >>>>> >>> 1.The LocalBufferPool will be available when (usedBuffers + 1
> > >>>>> >>>    <= currentPoolSize) and all subpartitions don't reach the
> > >>>>> >>> maxBuffersPerChannel.
> > >>>>> >>>
> > >>>>> >>> If it uses the buffers in (exclusive + floating buffers):
> > >>>>> >>> 1. The LocalBufferPool will be available when (usedBuffers +
> > >>>>> >>> overdraftBuffers <= currentPoolSize) and all subpartitions
> > >>>>> >>> don't reach the maxBuffersPerChannel.
> > >>>>> >>> 2. For low parallelism jobs, if overdraftBuffers is large(>8),
> > the
> > >>>>> >>> usedBuffers will be small. That is the LocalBufferPool will be
> > >>>>> >>> easily unavailable. For throughput, if users turn up the
> > >>>>> >>> overdraft buffers, they need to turn up exclusive or floating
> > >>>>> >>> buffers. It also affects the InputChannel, and it's is
> unfriendly
> > >>>>> >>> to users.
> > >>>>> >>>
> > >>>>> >>> So I prefer the overdraft to use extra buffers.
> > >>>>> >>>
> > >>>>> >>>
> > >>>>> >>> BTW, for watermark, the number of buffers it needs is
> > >>>>> >>> numberOfSubpartitions. So if
> > >>>>> overdraftBuffers=numberOfSubpartitions,
> > >>>>> >>> the watermark won't block in requestMemory. But it has
> > >>>>> >>> 2 problems:
> > >>>>> >>> 1. It needs more overdraft buffers. If the overdraft uses
> > >>>>> >>> (exclusive + floating buffers),  there will be fewer buffers
> > >>>>> >>> available. Throughput may be affected.
> > >>>>> >>> 2. The numberOfSubpartitions is different for each Task.
> > >>>>> >>> So if users want to cover watermark using this feature,
> > >>>>> >>> they don't know how to set the overdraftBuffers more r
> > >>>>> >>> easonably. And if the parallelism is changed, users still
> > >>>>> >>> need to change overdraftBuffers. It is unfriendly to users.
> > >>>>> >>>
> > >>>>> >>> So I propose we support overdraftBuffers=-1, It means
> > >>>>> >>> we will automatically set
> overdraftBuffers=numberOfSubpartitions
> > >>>>> >>> in the Constructor of LocalBufferPool.
> > >>>>> >>>
> > >>>>> >>> Please correct me if I'm wrong.
> > >>>>> >>>
> > >>>>> >>> Thanks
> > >>>>> >>> fanrui
> > >>>>> >>>
> > >>>>> >>> On Tue, May 3, 2022 at 4:54 PM Piotr Nowojski <
> > >>>>> [email protected]>
> > >>>>> >> wrote:
> > >>>>> >>>> Hi fanrui,
> > >>>>> >>>>
> > >>>>> >>>>> Do you mean don't add the extra buffers? We just use
> (exclusive
> > >>>>> >> buffers *
> > >>>>> >>>>> parallelism + floating buffers)? The LocalBufferPool will be
> > >>>>> available
> > >>>>> >>>> when
> > >>>>> >>>>> (usedBuffers+overdraftBuffers <=
> > >>>>> >>>> exclusiveBuffers*parallelism+floatingBuffers)
> > >>>>> >>>>> and all subpartitions don't reach the maxBuffersPerChannel,
> > >>>>> right?
> > >>>>> >>>> I'm not sure. Definitely we would need to adjust the minimum
> > >>>>> number of
> > >>>>> >> the
> > >>>>> >>>> required buffers, just as we did when we were implementing the
> > non
> > >>>>> >> blocking
> > >>>>> >>>> outputs and adding availability logic to LocalBufferPool. Back
> > >>>>> then we
> > >>>>> >>>> added "+ 1" to the minimum number of buffers. Currently this
> > >>>>> logic is
> > >>>>> >>>> located
> > >>>>> NettyShuffleUtils#getMinMaxNetworkBuffersPerResultPartition:
> > >>>>> >>>>
> > >>>>> >>>>> int min = isSortShuffle ? sortShuffleMinBuffers :
> > >>>>> numSubpartitions + 1;
> > >>>>> >>>> For performance reasons, we always require at least one buffer
> > per
> > >>>>> >>>> sub-partition. Otherwise performance falls drastically. Now if
> > we
> > >>>>> >> require 5
> > >>>>> >>>> overdraft buffers for output to be available, we need to have
> > >>>>> them on
> > >>>>> >> top
> > >>>>> >>>> of those "one buffer per sub-partition". So the logic should
> be
> > >>>>> changed
> > >>>>> >> to:
> > >>>>> >>>>> int min = isSortShuffle ? sortShuffleMinBuffers :
> > >>>>> numSubpartitions +
> > >>>>> >>>> numOverdraftBuffers;
> > >>>>> >>>>
> > >>>>> >>>> Regarding increasing the number of max buffers I'm not sure.
> As
> > >>>>> long as
> > >>>>> >>>> "overdraft << max number of buffers", because all buffers on
> the
> > >>>>> outputs
> > >>>>> >>>> are shared across all sub-partitions. If we have 5 overdraft
> > >>>>> buffers,
> > >>>>> >> and
> > >>>>> >>>> parallelism of 100, it doesn't matter in the grand scheme of
> > >>>>> things if
> > >>>>> >> we
> > >>>>> >>>> make the output available if at least one single buffer is
> > >>>>> available or
> > >>>>> >> at
> > >>>>> >>>> least 5 buffers are available out of ~200 (100 * 2 + 8). So
> > >>>>> effects of
> > >>>>> >>>> increasing the overdraft from 1 to for example 5 should be
> > >>>>> negligible.
> > >>>>> >> For
> > >>>>> >>>> small parallelism, like 5, increasing overdraft from 1 to 5
> > still
> > >>>>> >> increases
> > >>>>> >>>> the overdraft by only about 25%. So maybe we can keep the max
> as
> > >>>>> it is?
> > >>>>> >>>>
> > >>>>> >>>> If so, maybe we should change the name from "overdraft" to
> > "buffer
> > >>>>> >> reserve"
> > >>>>> >>>> or "spare buffers"? And document it as "number of buffers kept
> > in
> > >>>>> >> reserve
> > >>>>> >>>> in case of flatMap/firing timers/huge records"?
> > >>>>> >>>>
> > >>>>> >>>> What do you think Fenrui, Anton?
> > >>>>> >>>>
> > >>>>> >>>> Re LegacySources. I agree we can kind of ignore them in the
> new
> > >>>>> >> features,
> > >>>>> >>>> as long as we don't brake the existing deployments too much.
> > >>>>> >>>>
> > >>>>> >>>> Best,
> > >>>>> >>>> Piotrek
> > >>>>> >>>>
> > >>>>> >>>> wt., 3 maj 2022 o 09:20 Martijn Visser <[email protected]
> >
> > >>>>> >> napisał(a):
> > >>>>> >>>>> Hi everyone,
> > >>>>> >>>>>
> > >>>>> >>>>> Just wanted to chip in on the discussion of legacy sources:
> > >>>>> IMHO, we
> > >>>>> >>>> should
> > >>>>> >>>>> not focus too much on improving/adding capabilities for
> legacy
> > >>>>> sources.
> > >>>>> >>>> We
> > >>>>> >>>>> want to persuade and push users to use the new Source API.
> Yes,
> > >>>>> this
> > >>>>> >>>> means
> > >>>>> >>>>> that there's work required by the end users to port any
> custom
> > >>>>> source
> > >>>>> >> to
> > >>>>> >>>>> the new interface. The benefits of the new Source API should
> > >>>>> outweigh
> > >>>>> >>>> this.
> > >>>>> >>>>> Anything that we build to support multiple interfaces means
> > >>>>> adding more
> > >>>>> >>>>> complexity and more possibilities for bugs. Let's try to make
> > our
> > >>>>> >> lives a
> > >>>>> >>>>> little bit easier.
> > >>>>> >>>>>
> > >>>>> >>>>> Best regards,
> > >>>>> >>>>>
> > >>>>> >>>>> Martijn Visser
> > >>>>> >>>>> https://twitter.com/MartijnVisser82
> > >>>>> >>>>> https://github.com/MartijnVisser
> > >>>>> >>>>>
> > >>>>> >>>>>
> > >>>>> >>>>> On Tue, 3 May 2022 at 07:50, rui fan <[email protected]>
> > >>>>> wrote:
> > >>>>> >>>>>
> > >>>>> >>>>>> Hi Piotrek
> > >>>>> >>>>>>
> > >>>>> >>>>>>> Do you mean to ignore it while processing records, but keep
> > >>>>> using
> > >>>>> >>>>>>> `maxBuffersPerChannel` when calculating the availability of
> > the
> > >>>>> >>>> output?
> > >>>>> >>>>>> I think yes, and please Anton Kalashnikov to help double
> > check.
> > >>>>> >>>>>>
> > >>>>> >>>>>>> +1 for just having this as a separate configuration. Is it
> a
> > >>>>> big
> > >>>>> >>>>> problem
> > >>>>> >>>>>>> that legacy sources would be ignoring it? Note that we
> > already
> > >>>>> have
> > >>>>> >>>>>>> effectively hardcoded a single overdraft buffer.
> > >>>>> >>>>>>> `LocalBufferPool#checkAvailability` checks if there is a
> > single
> > >>>>> >>>> buffer
> > >>>>> >>>>>>> available and this works the same for all tasks (including
> > >>>>> legacy
> > >>>>> >>>>> source
> > >>>>> >>>>>>> tasks). Would it be a big issue if we changed it to check
> if
> > >>>>> at least
> > >>>>> >>>>>>> "overdraft number of buffers are available", where
> "overdraft
> > >>>>> number"
> > >>>>> >>>>> is
> > >>>>> >>>>>>> configurable, instead of the currently hardcoded value of
> > "1"?
> > >>>>> >>>>>> Do you mean don't add the extra buffers? We just use
> > (exclusive
> > >>>>> >>>> buffers *
> > >>>>> >>>>>> parallelism + floating buffers)? The LocalBufferPool will be
> > >>>>> available
> > >>>>> >>>>> when
> > >>>>> >>>>>> (usedBuffers+overdraftBuffers <=
> > >>>>> >>>>>> exclusiveBuffers*parallelism+floatingBuffers)
> > >>>>> >>>>>> and all subpartitions don't reach the maxBuffersPerChannel,
> > >>>>> right?
> > >>>>> >>>>>>
> > >>>>> >>>>>> If yes, I think it can solve the problem of legacy source.
> > >>>>> There may
> > >>>>> >> be
> > >>>>> >>>>>> some impact. If overdraftBuffers is large and only one
> buffer
> > >>>>> is used
> > >>>>> >>>> to
> > >>>>> >>>>>> process a single record, exclusive buffers*parallelism +
> > >>>>> floating
> > >>>>> >>>> buffers
> > >>>>> >>>>>> cannot be used. It may only be possible to use (exclusive
> > >>>>> buffers *
> > >>>>> >>>>>> parallelism
> > >>>>> >>>>>> + floating buffers - overdraft buffers + 1). For throughput,
> > if
> > >>>>> turn
> > >>>>> >> up
> > >>>>> >>>>> the
> > >>>>> >>>>>> overdraft buffers, the flink user needs to turn up exclusive
> > or
> > >>>>> >>>> floating
> > >>>>> >>>>>> buffers. And it also affects the InputChannel.
> > >>>>> >>>>>>
> > >>>>> >>>>>> If not, I don't think it can solve the problem of legacy
> > >>>>> source. The
> > >>>>> >>>>> legacy
> > >>>>> >>>>>> source don't check isAvailable, If there are the extra
> > buffers,
> > >>>>> legacy
> > >>>>> >>>>>> source
> > >>>>> >>>>>> will use them up until block in requestMemory.
> > >>>>> >>>>>>
> > >>>>> >>>>>>
> > >>>>> >>>>>> Thanks
> > >>>>> >>>>>> fanrui
> > >>>>> >>>>>>
> > >>>>> >>>>>> On Tue, May 3, 2022 at 3:39 AM Piotr Nowojski <
> > >>>>> [email protected]>
> > >>>>> >>>>>> wrote:
> > >>>>> >>>>>>
> > >>>>> >>>>>>> Hi,
> > >>>>> >>>>>>>
> > >>>>> >>>>>>> +1 for the general proposal from my side. It would be a
> nice
> > >>>>> >>>> workaround
> > >>>>> >>>>>>> flatMaps, WindowOperators and large records issues with
> > >>>>> unaligned
> > >>>>> >>>>>>> checkpoints.
> > >>>>> >>>>>>>
> > >>>>> >>>>>>>> The first task is about ignoring max buffers per channel.
> > This
> > >>>>> >>>> means
> > >>>>> >>>>> if
> > >>>>> >>>>>>>> we request a memory segment from LocalBufferPool and the
> > >>>>> >>>>>>>> maxBuffersPerChannel is reached for this channel, we just
> > >>>>> ignore
> > >>>>> >>>> that
> > >>>>> >>>>>>>> and continue to allocate buffer while LocalBufferPool has
> > >>>>> it(it is
> > >>>>> >>>>>>>> actually not a overdraft).
> > >>>>> >>>>>>> Do you mean to ignore it while processing records, but keep
> > >>>>> using
> > >>>>> >>>>>>> `maxBuffersPerChannel` when calculating the availability of
> > the
> > >>>>> >>>> output?
> > >>>>> >>>>>>>> The second task is about the real overdraft. I am pretty
> > >>>>> convinced
> > >>>>> >>>>> now
> > >>>>> >>>>>>>> that we, unfortunately, need configuration for limitation
> of
> > >>>>> >>>>> overdraft
> > >>>>> >>>>>>>> number(because it is not ok if one subtask allocates all
> > >>>>> buffers of
> > >>>>> >>>>> one
> > >>>>> >>>>>>>> TaskManager considering that several different jobs can be
> > >>>>> >>>> submitted
> > >>>>> >>>>> on
> > >>>>> >>>>>>>> this TaskManager). So idea is to have
> > >>>>> >>>>>>>> maxOverdraftBuffersPerPartition(technically to say per
> > >>>>> >>>>>> LocalBufferPool).
> > >>>>> >>>>>>>> In this case, when a limit of buffers in LocalBufferPool
> is
> > >>>>> >>>> reached,
> > >>>>> >>>>>>>> LocalBufferPool can request additionally from
> > >>>>> NetworkBufferPool up
> > >>>>> >>>> to
> > >>>>> >>>>>>>> maxOverdraftBuffersPerPartition buffers.
> > >>>>> >>>>>>> +1 for just having this as a separate configuration. Is it
> a
> > >>>>> big
> > >>>>> >>>>> problem
> > >>>>> >>>>>>> that legacy sources would be ignoring it? Note that we
> > already
> > >>>>> have
> > >>>>> >>>>>>> effectively hardcoded a single overdraft buffer.
> > >>>>> >>>>>>> `LocalBufferPool#checkAvailability` checks if there is a
> > single
> > >>>>> >>>> buffer
> > >>>>> >>>>>>> available and this works the same for all tasks (including
> > >>>>> legacy
> > >>>>> >>>>> source
> > >>>>> >>>>>>> tasks). Would it be a big issue if we changed it to check
> if
> > >>>>> at least
> > >>>>> >>>>>>> "overdraft number of buffers are available", where
> "overdraft
> > >>>>> number"
> > >>>>> >>>>> is
> > >>>>> >>>>>>> configurable, instead of the currently hardcoded value of
> > "1"?
> > >>>>> >>>>>>>
> > >>>>> >>>>>>> Best,
> > >>>>> >>>>>>> Piotrek
> > >>>>> >>>>>>>
> > >>>>> >>>>>>> pt., 29 kwi 2022 o 17:04 rui fan <[email protected]>
> > >>>>> napisał(a):
> > >>>>> >>>>>>>
> > >>>>> >>>>>>>> Let me add some information about the LegacySource.
> > >>>>> >>>>>>>>
> > >>>>> >>>>>>>> If we want to disable the overdraft buffer for
> LegacySource.
> > >>>>> >>>>>>>> Could we add the enableOverdraft in LocalBufferPool?
> > >>>>> >>>>>>>> The default value is false. If the getAvailableFuture is
> > >>>>> called,
> > >>>>> >>>>>>>> change enableOverdraft=true. It indicates whether there
> are
> > >>>>> >>>>>>>> checks isAvailable elsewhere.
> > >>>>> >>>>>>>>
> > >>>>> >>>>>>>> I don't think it is elegant, but it's safe. Please correct
> > me
> > >>>>> if
> > >>>>> >>>> I'm
> > >>>>> >>>>>>> wrong.
> > >>>>> >>>>>>>> Thanks
> > >>>>> >>>>>>>> fanrui
> > >>>>> >>>>>>>>
> > >>>>> >>>>>>>> On Fri, Apr 29, 2022 at 10:23 PM rui fan <
> > >>>>> [email protected]>
> > >>>>> >>>>> wrote:
> > >>>>> >>>>>>>>> Hi,
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> Thanks for your quick response.
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> For question 1/2/3, we think they are clear. We just need
> > to
> > >>>>> >>>>> discuss
> > >>>>> >>>>>>> the
> > >>>>> >>>>>>>>> default value in PR.
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> For the legacy source, you are right. It's difficult for
> > >>>>> general
> > >>>>> >>>>>>>>> implementation.
> > >>>>> >>>>>>>>> Currently, we implement ensureRecordWriterIsAvailable()
> in
> > >>>>> >>>>>>>>> SourceFunction.SourceContext. And call it in our common
> > >>>>> >>>>> LegacySource,
> > >>>>> >>>>>>>>> e.g: FlinkKafkaConsumer. Over 90% of our Flink jobs
> consume
> > >>>>> >>>> kafka,
> > >>>>> >>>>> so
> > >>>>> >>>>>>>>> fixing FlinkKafkaConsumer solved most of our problems.
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> Core code:
> > >>>>> >>>>>>>>> ```
> > >>>>> >>>>>>>>> public void ensureRecordWriterIsAvailable() {
> > >>>>> >>>>>>>>>        if (recordWriter == null
> > >>>>> >>>>>>>>>             ||
> > >>>>> >>>>>>>>>
> > >>>>> >>
> > >>>>>
> > !configuration.getBoolean(ExecutionCheckpointingOptions.ENABLE_UNALIGNED,
> > >>>>> >>>>>>>>> false)
> > >>>>> >>>>>>>>>             || recordWriter.isAvailable()) {
> > >>>>> >>>>>>>>>             return;
> > >>>>> >>>>>>>>>        }
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>>        CompletableFuture<?> resumeFuture =
> > >>>>> >>>>>>>> recordWriter.getAvailableFuture();
> > >>>>> >>>>>>>>>        try {
> > >>>>> >>>>>>>>>             resumeFuture.get();
> > >>>>> >>>>>>>>>        } catch (Throwable ignored) {
> > >>>>> >>>>>>>>>        }
> > >>>>> >>>>>>>>> }
> > >>>>> >>>>>>>>> ```
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> LegacySource calls
> > >>>>> sourceContext.ensureRecordWriterIsAvailable()
> > >>>>> >>>>>>>>> before synchronized (checkpointLock) and collects
> records.
> > >>>>> >>>>>>>>> Please let me know if there is a better solution.
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> Thanks
> > >>>>> >>>>>>>>> fanrui
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>> On Fri, Apr 29, 2022 at 9:45 PM Anton Kalashnikov <
> > >>>>> >>>>>> [email protected]>
> > >>>>> >>>>>>>>> wrote:
> > >>>>> >>>>>>>>>
> > >>>>> >>>>>>>>>> Hi.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> -- 1. Do you mean split this into two JIRAs or two PRs
> or
> > >>>>> two
> > >>>>> >>>>>> commits
> > >>>>> >>>>>>>> in a
> > >>>>> >>>>>>>>>>       PR?
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> Perhaps, the separated ticket will be better since this
> > >>>>> task has
> > >>>>> >>>>>> fewer
> > >>>>> >>>>>>>>>> questions but we should find a solution for LegacySource
> > >>>>> first.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> --  2. For the first task, if the flink user disables
> the
> > >>>>> >>>>> Unaligned
> > >>>>> >>>>>>>>>>       Checkpoint, do we ignore max buffers per channel?
> > >>>>> Because
> > >>>>> >>>> the
> > >>>>> >>>>>>>>>> overdraft
> > >>>>> >>>>>>>>>>       isn't useful for the Aligned Checkpoint, it still
> > >>>>> needs to
> > >>>>> >>>>> wait
> > >>>>> >>>>>>> for
> > >>>>> >>>>>>>>>>       downstream Task to consume.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> I think that the logic should be the same for AC and UC.
> > As
> > >>>>> I
> > >>>>> >>>>>>>> understand,
> > >>>>> >>>>>>>>>> the overdraft maybe is not really helpful for AC but it
> > >>>>> doesn't
> > >>>>> >>>>> make
> > >>>>> >>>>>>> it
> > >>>>> >>>>>>>>>> worse as well.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>>     3. For the second task
> > >>>>> >>>>>>>>>> --      - The default value of
> > >>>>> maxOverdraftBuffersPerPartition
> > >>>>> >>>> may
> > >>>>> >>>>>>> also
> > >>>>> >>>>>>>>>> need
> > >>>>> >>>>>>>>>>          to be discussed.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> I think it should be a pretty small value or even 0
> since
> > it
> > >>>>> >>>> kind
> > >>>>> >>>>> of
> > >>>>> >>>>>>>>>> optimization and user should understand what they
> > >>>>> do(especially
> > >>>>> >>>> if
> > >>>>> >>>>>> we
> > >>>>> >>>>>>>>>> implement the first task).
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> --      - If the user disables the Unaligned Checkpoint,
> > >>>>> can we
> > >>>>> >>>>> set
> > >>>>> >>>>>>> the
> > >>>>> >>>>>>>>>>          maxOverdraftBuffersPerPartition=0? Because the
> > >>>>> overdraft
> > >>>>> >>>>>> isn't
> > >>>>> >>>>>>>>>> useful for
> > >>>>> >>>>>>>>>>          the Aligned Checkpoint.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> The same answer that above, if the overdraft doesn't
> make
> > >>>>> >>>>>> degradation
> > >>>>> >>>>>>>> for
> > >>>>> >>>>>>>>>> the Aligned Checkpoint I don't think that we should make
> > >>>>> >>>>> difference
> > >>>>> >>>>>>>> between
> > >>>>> >>>>>>>>>> AC and UC.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>>       4. For the legacy source
> > >>>>> >>>>>>>>>> --      - If enabling the Unaligned Checkpoint, it uses
> up
> > >>>>> to
> > >>>>> >>>>>>>>>>          maxOverdraftBuffersPerPartition buffers.
> > >>>>> >>>>>>>>>>          - If disabling the UC, it doesn't use the
> > overdraft
> > >>>>> >>>> buffer.
> > >>>>> >>>>>>>>>>          - Do you think it's ok?
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> Ideally, I don't want to use overdraft for LegacySource
> at
> > >>>>> all
> > >>>>> >>>>> since
> > >>>>> >>>>>>> it
> > >>>>> >>>>>>>>>> can lead to undesirable results especially if the limit
> is
> > >>>>> high.
> > >>>>> >>>>> At
> > >>>>> >>>>>>>> least,
> > >>>>> >>>>>>>>>> as I understand, it will always work in overdraft mode
> and
> > >>>>> it
> > >>>>> >>>> will
> > >>>>> >>>>>>>> borrow
> > >>>>> >>>>>>>>>> maxOverdraftBuffersPerPartition buffers from the global
> > pool
> > >>>>> >>>> which
> > >>>>> >>>>>> can
> > >>>>> >>>>>>>> lead
> > >>>>> >>>>>>>>>> to degradation of other subtasks on the same
> TaskManager.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> --      - Actually, we added the checkAvailable logic
> for
> > >>>>> >>>>>> LegacySource
> > >>>>> >>>>>>>> in
> > >>>>> >>>>>>>>>> our
> > >>>>> >>>>>>>>>>          internal version. It works well.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> I don't really understand how it is possible for general
> > >>>>> case
> > >>>>> >>>>>>>> considering
> > >>>>> >>>>>>>>>> that each user has their own implementation of
> > >>>>> >>>>> LegacySourceOperator
> > >>>>> >>>>>>>>>> --   5. For the benchmark, do you have any suggestions?
> I
> > >>>>> >>>>> submitted
> > >>>>> >>>>>>> the
> > >>>>> >>>>>>>> PR
> > >>>>> >>>>>>>>>>       [1].
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> I haven't looked at it yet, but I'll try to do it soon.
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> 29.04.2022 14:14, rui fan пишет:
> > >>>>> >>>>>>>>>>> Hi,
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>> Thanks for your feedback. I have a servel of questions.
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>>       1. Do you mean split this into two JIRAs or two
> PRs
> > >>>>> or two
> > >>>>> >>>>>>> commits
> > >>>>> >>>>>>>>>> in a
> > >>>>> >>>>>>>>>>>       PR?
> > >>>>> >>>>>>>>>>>       2. For the first task, if the flink user disables
> > the
> > >>>>> >>>>>> Unaligned
> > >>>>> >>>>>>>>>>>       Checkpoint, do we ignore max buffers per channel?
> > >>>>> Because
> > >>>>> >>>>> the
> > >>>>> >>>>>>>>>> overdraft
> > >>>>> >>>>>>>>>>>       isn't useful for the Aligned Checkpoint, it still
> > >>>>> needs to
> > >>>>> >>>>>> wait
> > >>>>> >>>>>>>> for
> > >>>>> >>>>>>>>>>>       downstream Task to consume.
> > >>>>> >>>>>>>>>>>       3. For the second task
> > >>>>> >>>>>>>>>>>          - The default value of
> > >>>>> maxOverdraftBuffersPerPartition
> > >>>>> >>>>> may
> > >>>>> >>>>>>> also
> > >>>>> >>>>>>>>>> need
> > >>>>> >>>>>>>>>>>          to be discussed.
> > >>>>> >>>>>>>>>>>          - If the user disables the Unaligned
> Checkpoint,
> > >>>>> can we
> > >>>>> >>>>> set
> > >>>>> >>>>>>> the
> > >>>>> >>>>>>>>>>>          maxOverdraftBuffersPerPartition=0? Because the
> > >>>>> >>>> overdraft
> > >>>>> >>>>>>> isn't
> > >>>>> >>>>>>>>>> useful for
> > >>>>> >>>>>>>>>>>          the Aligned Checkpoint.
> > >>>>> >>>>>>>>>>>       4. For the legacy source
> > >>>>> >>>>>>>>>>>          - If enabling the Unaligned Checkpoint, it
> uses
> > >>>>> up to
> > >>>>> >>>>>>>>>>>          maxOverdraftBuffersPerPartition buffers.
> > >>>>> >>>>>>>>>>>          - If disabling the UC, it doesn't use the
> > >>>>> overdraft
> > >>>>> >>>>> buffer.
> > >>>>> >>>>>>>>>>>          - Do you think it's ok?
> > >>>>> >>>>>>>>>>>          - Actually, we added the checkAvailable logic
> > for
> > >>>>> >>>>>>> LegacySource
> > >>>>> >>>>>>>>>> in our
> > >>>>> >>>>>>>>>>>          internal version. It works well.
> > >>>>> >>>>>>>>>>>       5. For the benchmark, do you have any
> suggestions?
> > I
> > >>>>> >>>>> submitted
> > >>>>> >>>>>>> the
> > >>>>> >>>>>>>>>> PR
> > >>>>> >>>>>>>>>>>       [1].
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>> [1] https://github.com/apache/flink-benchmarks/pull/54
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>> Thanks
> > >>>>> >>>>>>>>>>> fanrui
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>> On Fri, Apr 29, 2022 at 7:41 PM Anton Kalashnikov <
> > >>>>> >>>>>>> [email protected]
> > >>>>> >>>>>>>>>>> wrote:
> > >>>>> >>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> Hi,
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> We discuss about it a little with Dawid Wysakowicz.
> Here
> > >>>>> is
> > >>>>> >>>>> some
> > >>>>> >>>>>>>>>>>> conclusion:
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> First of all, let's split this into two tasks.
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> The first task is about ignoring max buffers per
> > channel.
> > >>>>> >>>> This
> > >>>>> >>>>>>> means
> > >>>>> >>>>>>>> if
> > >>>>> >>>>>>>>>>>> we request a memory segment from LocalBufferPool and
> the
> > >>>>> >>>>>>>>>>>> maxBuffersPerChannel is reached for this channel, we
> > just
> > >>>>> >>>>> ignore
> > >>>>> >>>>>>> that
> > >>>>> >>>>>>>>>>>> and continue to allocate buffer while LocalBufferPool
> > has
> > >>>>> >>>> it(it
> > >>>>> >>>>>> is
> > >>>>> >>>>>>>>>>>> actually not a overdraft).
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> The second task is about the real overdraft. I am
> pretty
> > >>>>> >>>>>> convinced
> > >>>>> >>>>>>>> now
> > >>>>> >>>>>>>>>>>> that we, unfortunately, need configuration for
> > limitation
> > >>>>> of
> > >>>>> >>>>>>>> overdraft
> > >>>>> >>>>>>>>>>>> number(because it is not ok if one subtask allocates
> all
> > >>>>> >>>>> buffers
> > >>>>> >>>>>> of
> > >>>>> >>>>>>>> one
> > >>>>> >>>>>>>>>>>> TaskManager considering that several different jobs
> can
> > be
> > >>>>> >>>>>>> submitted
> > >>>>> >>>>>>>> on
> > >>>>> >>>>>>>>>>>> this TaskManager). So idea is to have
> > >>>>> >>>>>>>>>>>> maxOverdraftBuffersPerPartition(technically to say per
> > >>>>> >>>>>>>>>> LocalBufferPool).
> > >>>>> >>>>>>>>>>>> In this case, when a limit of buffers in
> LocalBufferPool
> > >>>>> is
> > >>>>> >>>>>>> reached,
> > >>>>> >>>>>>>>>>>> LocalBufferPool can request additionally from
> > >>>>> >>>> NetworkBufferPool
> > >>>>> >>>>>> up
> > >>>>> >>>>>>> to
> > >>>>> >>>>>>>>>>>> maxOverdraftBuffersPerPartition buffers.
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> But it is still not clear how to handle LegacySource
> > >>>>> since it
> > >>>>> >>>>>>>> actually
> > >>>>> >>>>>>>>>>>> works as unlimited flatmap and it will always work in
> > >>>>> >>>> overdraft
> > >>>>> >>>>>>> mode
> > >>>>> >>>>>>>>>>>> which is not a target. So we still need to think about
> > >>>>> that.
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>      29.04.2022 11:11, rui fan пишет:
> > >>>>> >>>>>>>>>>>>> Hi Anton Kalashnikov,
> > >>>>> >>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>> I think you agree with we should limit the maximum
> > >>>>> number of
> > >>>>> >>>>>>>> overdraft
> > >>>>> >>>>>>>>>>>>> segments that each LocalBufferPool can apply for,
> > right?
> > >>>>> >>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>> I prefer to hard code the maxOverdraftBuffers due to
> > >>>>> don't
> > >>>>> >>>> add
> > >>>>> >>>>>> the
> > >>>>> >>>>>>>> new
> > >>>>> >>>>>>>>>>>>> configuration. And I hope to hear more from the
> > >>>>> community.
> > >>>>> >>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>> Best wishes
> > >>>>> >>>>>>>>>>>>> fanrui
> > >>>>> >>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>> On Thu, Apr 28, 2022 at 12:39 PM rui fan <
> > >>>>> >>>>> [email protected]>
> > >>>>> >>>>>>>>>> wrote:
> > >>>>> >>>>>>>>>>>>>> Hi Anton Kalashnikov,
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> Thanks for your very clear reply, I think you are
> > >>>>> totally
> > >>>>> >>>>>> right.
> > >>>>> >>>>>>>>>>>>>> The 'maxBuffersNumber - buffersInUseNumber' can be
> > used
> > >>>>> as
> > >>>>> >>>>> the
> > >>>>> >>>>>>>>>>>>>> overdraft buffer, it won't need the new buffer
> > >>>>> >>>>>>> configuration.Flink
> > >>>>> >>>>>>>>>> users
> > >>>>> >>>>>>>>>>>>>> can turn up the maxBuffersNumber to control the
> > >>>>> overdraft
> > >>>>> >>>>>> buffer
> > >>>>> >>>>>>>>>> size.
> > >>>>> >>>>>>>>>>>>>> Also, I‘d like to add some information. For safety,
> we
> > >>>>> >>>> should
> > >>>>> >>>>>>> limit
> > >>>>> >>>>>>>>>> the
> > >>>>> >>>>>>>>>>>>>> maximum number of overdraft segments that each
> > >>>>> >>>>> LocalBufferPool
> > >>>>> >>>>>>>>>>>>>> can apply for.
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> Why do we limit it?
> > >>>>> >>>>>>>>>>>>>> Some operators don't check the
> > >>>>> `recordWriter.isAvailable`
> > >>>>> >>>>>> during
> > >>>>> >>>>>>>>>>>>>> processing records, such as LegacySource. I have
> > >>>>> mentioned
> > >>>>> >>>> it
> > >>>>> >>>>>> in
> > >>>>> >>>>>>>>>>>>>> FLINK-26759 [1]. I'm not sure if there are other
> > cases.
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> If don't add the limitation, the LegacySource will
> use
> > >>>>> up
> > >>>>> >>>> all
> > >>>>> >>>>>>>>>> remaining
> > >>>>> >>>>>>>>>>>>>> memory in the NetworkBufferPool when the
> backpressure
> > is
> > >>>>> >>>>>> severe.
> > >>>>> >>>>>>>>>>>>>> How to limit it?
> > >>>>> >>>>>>>>>>>>>> I prefer to hard code the
> > >>>>> >>>>>>>> `maxOverdraftBuffers=numberOfSubpartitions`
> > >>>>> >>>>>>>>>>>>>> in the constructor of LocalBufferPool. The
> > >>>>> >>>>> maxOverdraftBuffers
> > >>>>> >>>>>> is
> > >>>>> >>>>>>>>>> just
> > >>>>> >>>>>>>>>>>>>> for safety, and it should be enough for most flink
> > >>>>> jobs. Or
> > >>>>> >>>>> we
> > >>>>> >>>>>>> can
> > >>>>> >>>>>>>>>> set
> > >>>>> >>>>>>>>>>>>>> `maxOverdraftBuffers=Math.max(numberOfSubpartitions,
> > >>>>> 10)`
> > >>>>> >>>> to
> > >>>>> >>>>>>> handle
> > >>>>> >>>>>>>>>>>>>> some jobs of low parallelism.
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> Also if user don't enable the Unaligned Checkpoint,
> we
> > >>>>> can
> > >>>>> >>>>> set
> > >>>>> >>>>>>>>>>>>>> maxOverdraftBuffers=0 in the constructor of
> > >>>>> >>>> LocalBufferPool.
> > >>>>> >>>>>>>> Because
> > >>>>> >>>>>>>>>>>>>> the overdraft isn't useful for the Aligned
> Checkpoint.
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> Please correct me if I'm wrong. Thanks a lot.
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> [1]
> https://issues.apache.org/jira/browse/FLINK-26759
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> Best wishes
> > >>>>> >>>>>>>>>>>>>> fanrui
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 12:29 AM Anton Kalashnikov <
> > >>>>> >>>>>>>>>> [email protected]>
> > >>>>> >>>>>>>>>>>>>> wrote:
> > >>>>> >>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> Hi fanrui,
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> Thanks for creating the FLIP.
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> In general, I think the overdraft is good idea and
> it
> > >>>>> >>>> should
> > >>>>> >>>>>>> help
> > >>>>> >>>>>>>> in
> > >>>>> >>>>>>>>>>>>>>> described above cases. Here are my thoughts about
> > >>>>> >>>>>> configuration:
> > >>>>> >>>>>>>>>>>>>>> Please, correct me if I am wrong but as I
> understand
> > >>>>> right
> > >>>>> >>>>> now
> > >>>>> >>>>>>> we
> > >>>>> >>>>>>>>>> have
> > >>>>> >>>>>>>>>>>>>>> following calculation.
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> maxBuffersNumber(per TaskManager) = Network
> > >>>>> >>>>> memory(calculated
> > >>>>> >>>>>>> via
> > >>>>> >>>>>>>>>>>>>>> taskmanager.memory.network.fraction,
> > >>>>> >>>>>>>> taskmanager.memory.network.min,
> > >>>>> >>>>>>>>>>>>>>> taskmanager.memory.network.max and total memory
> > size) /
> > >>>>> >>>>>>>>>>>>>>> taskmanager.memory.segment-size.
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> requiredBuffersNumber(per TaskManager) = (exclusive
> > >>>>> >>>> buffers
> > >>>>> >>>>> *
> > >>>>> >>>>>>>>>>>>>>> parallelism + floating buffers) * subtasks number
> in
> > >>>>> >>>>>> TaskManager
> > >>>>> >>>>>>>>>>>>>>> buffersInUseNumber = real number of buffers which
> > used
> > >>>>> at
> > >>>>> >>>>>>> current
> > >>>>> >>>>>>>>>>>>>>> moment(always <= requiredBuffersNumber)
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> Ideally requiredBuffersNumber should be equal to
> > >>>>> >>>>>>> maxBuffersNumber
> > >>>>> >>>>>>>>>> which
> > >>>>> >>>>>>>>>>>>>>> allows Flink work predictibly. But if
> > >>>>> >>>> requiredBuffersNumber
> > >>>>> >>>>>>>>>>>>>>> maxBuffersNumber sometimes it is also fine(but not
> > >>>>> good)
> > >>>>> >>>>> since
> > >>>>> >>>>>>> not
> > >>>>> >>>>>>>>>> all
> > >>>>> >>>>>>>>>>>>>>> required buffers really mandatory(e.g. it is ok if
> > >>>>> Flink
> > >>>>> >>>> can
> > >>>>> >>>>>> not
> > >>>>> >>>>>>>>>>>>>>> allocate floating buffers)
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> But if maxBuffersNumber > requiredBuffersNumber,
> as I
> > >>>>> >>>>>> understand
> > >>>>> >>>>>>>>>> Flink
> > >>>>> >>>>>>>>>>>>>>> just never use these leftovers
> > >>>>> buffers(maxBuffersNumber -
> > >>>>> >>>>>>>>>>>>>>> requiredBuffersNumber). Which I propose to use. (
> we
> > >>>>> can
> > >>>>> >>>>>> actualy
> > >>>>> >>>>>>>> use
> > >>>>> >>>>>>>>>>>>>>> even difference 'requiredBuffersNumber -
> > >>>>> >>>> buffersInUseNumber'
> > >>>>> >>>>>>> since
> > >>>>> >>>>>>>>>> if
> > >>>>> >>>>>>>>>>>>>>> one TaskManager contains several operators
> including
> > >>>>> >>>>> 'window'
> > >>>>> >>>>>>>> which
> > >>>>> >>>>>>>>>> can
> > >>>>> >>>>>>>>>>>>>>> temporally borrow buffers from the global pool).
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> My proposal, more specificaly(it relates only to
> > >>>>> >>>> requesting
> > >>>>> >>>>>>>> buffers
> > >>>>> >>>>>>>>>>>>>>> during processing single record while switching to
> > >>>>> >>>>>> unavalability
> > >>>>> >>>>>>>>>>>> between
> > >>>>> >>>>>>>>>>>>>>> records should be the same as we have it now):
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> * If one more buffer requested but
> > maxBuffersPerChannel
> > >>>>> >>>>>> reached,
> > >>>>> >>>>>>>>>> then
> > >>>>> >>>>>>>>>>>>>>> just ignore this limitation and allocate this
> buffers
> > >>>>> from
> > >>>>> >>>>> any
> > >>>>> >>>>>>>>>>>>>>> place(from LocalBufferPool if it has something yet
> > >>>>> >>>> otherwise
> > >>>>> >>>>>>> from
> > >>>>> >>>>>>>>>>>>>>> NetworkBufferPool)
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> * If LocalBufferPool exceeds limit, then temporally
> > >>>>> >>>> allocate
> > >>>>> >>>>>> it
> > >>>>> >>>>>>>> from
> > >>>>> >>>>>>>>>>>>>>> NetworkBufferPool while it has something to
> allocate
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> Maybe I missed something and this solution won't
> > work,
> > >>>>> >>>> but I
> > >>>>> >>>>>>> like
> > >>>>> >>>>>>>> it
> > >>>>> >>>>>>>>>>>>>>> since on the one hand, it work from the scratch
> > without
> > >>>>> >>>> any
> > >>>>> >>>>>>>>>>>>>>> configuration, on the other hand, it can be
> > >>>>> configuration
> > >>>>> >>>> by
> > >>>>> >>>>>>>>>> changing
> > >>>>> >>>>>>>>>>>>>>> proportion of maxBuffersNumber and
> > >>>>> requiredBuffersNumber.
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> The last thing that I want to say, I don't really
> > want
> > >>>>> to
> > >>>>> >>>>>>>> implement
> > >>>>> >>>>>>>>>> new
> > >>>>> >>>>>>>>>>>>>>> configuration since even now it is not clear how to
> > >>>>> >>>>> correctly
> > >>>>> >>>>>>>>>> configure
> > >>>>> >>>>>>>>>>>>>>> network buffers with existing configuration and I
> > don't
> > >>>>> >>>> want
> > >>>>> >>>>>> to
> > >>>>> >>>>>>>>>>>>>>> complicate it, especially if it will be possible to
> > >>>>> >>>> resolve
> > >>>>> >>>>>> the
> > >>>>> >>>>>>>>>> problem
> > >>>>> >>>>>>>>>>>>>>> automatically(as described above).
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> So is my understanding about network memory/buffers
> > >>>>> >>>> correct?
> > >>>>> >>>>>>>>>>>>>>> --
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> Best regards,
> > >>>>> >>>>>>>>>>>>>>> Anton Kalashnikov
> > >>>>> >>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>> 27.04.2022 07:46, rui fan пишет:
> > >>>>> >>>>>>>>>>>>>>>> Hi everyone,
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> Unaligned Checkpoint (FLIP-76 [1]) is a major
> > feature
> > >>>>> of
> > >>>>> >>>>>> Flink.
> > >>>>> >>>>>>>> It
> > >>>>> >>>>>>>>>>>>>>>> effectively solves the problem of checkpoint
> timeout
> > >>>>> or
> > >>>>> >>>>> slow
> > >>>>> >>>>>>>>>>>>>>>> checkpoint when backpressure is severe.
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> We found that UC(Unaligned Checkpoint) does not
> work
> > >>>>> well
> > >>>>> >>>>>> when
> > >>>>> >>>>>>>> the
> > >>>>> >>>>>>>>>>>>>>>> back pressure is severe and multiple output
> buffers
> > >>>>> are
> > >>>>> >>>>>>> required
> > >>>>> >>>>>>>> to
> > >>>>> >>>>>>>>>>>>>>>> process a single record. FLINK-14396 [2] also
> > >>>>> mentioned
> > >>>>> >>>>> this
> > >>>>> >>>>>>>> issue
> > >>>>> >>>>>>>>>>>>>>>> before. So we propose the overdraft buffer to
> solve
> > >>>>> it.
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> I created FLINK-26762[3] and FLIP-227[4] to detail
> > the
> > >>>>> >>>>>>> overdraft
> > >>>>> >>>>>>>>>>>>>>>> buffer mechanism. After discussing with Anton
> > >>>>> >>>> Kalashnikov,
> > >>>>> >>>>>>> there
> > >>>>> >>>>>>>>>> are
> > >>>>> >>>>>>>>>>>>>>>> still some points to discuss:
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>>       * There are already a lot of buffer-related
> > >>>>> >>>>>> configurations.
> > >>>>> >>>>>>>> Do
> > >>>>> >>>>>>>>>> we
> > >>>>> >>>>>>>>>>>>>>>>         need to add a new configuration for the
> > >>>>> overdraft
> > >>>>> >>>>>> buffer?
> > >>>>> >>>>>>>>>>>>>>>>       * Where should the overdraft buffer use
> > memory?
> > >>>>> >>>>>>>>>>>>>>>>       * If the overdraft-buffer uses the memory
> > >>>>> remaining
> > >>>>> >>>> in
> > >>>>> >>>>>> the
> > >>>>> >>>>>>>>>>>>>>>>         NetworkBufferPool, no new configuration
> > needs
> > >>>>> to be
> > >>>>> >>>>>>> added.
> > >>>>> >>>>>>>>>>>>>>>>       * If adding a new configuration:
> > >>>>> >>>>>>>>>>>>>>>>           o Should we set the
> overdraft-memory-size
> > >>>>> at the
> > >>>>> >>>> TM
> > >>>>> >>>>>>> level
> > >>>>> >>>>>>>>>> or
> > >>>>> >>>>>>>>>>>> the
> > >>>>> >>>>>>>>>>>>>>>>             Task level?
> > >>>>> >>>>>>>>>>>>>>>>           o Or set overdraft-buffers to indicate
> the
> > >>>>> number
> > >>>>> >>>>> of
> > >>>>> >>>>>>>>>>>>>>>>             memory-segments that can be overdrawn.
> > >>>>> >>>>>>>>>>>>>>>>           o What is the default value? How to set
> > >>>>> sensible
> > >>>>> >>>>>>>> defaults?
> > >>>>> >>>>>>>>>>>>>>>> Currently, I implemented a POC [5] and verified it
> > >>>>> using
> > >>>>> >>>>>>>>>>>>>>>> flink-benchmarks [6]. The POC sets
> overdraft-buffers
> > >>>>> at
> > >>>>> >>>>> Task
> > >>>>> >>>>>>>> level,
> > >>>>> >>>>>>>>>>>>>>>> and default value is 10. That is: each
> > LocalBufferPool
> > >>>>> >>>> can
> > >>>>> >>>>>>>>>> overdraw up
> > >>>>> >>>>>>>>>>>>>>>> to 10 memory-segments.
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> Looking forward to your feedback!
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> Thanks,
> > >>>>> >>>>>>>>>>>>>>>> fanrui
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>>>>> [1]
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>
> > >>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> > >>>>> >>>>>>>>>>>>>>>> [2]
> > https://issues.apache.org/jira/browse/FLINK-14396
> > >>>>> >>>>>>>>>>>>>>>> [3]
> > https://issues.apache.org/jira/browse/FLINK-26762
> > >>>>> >>>>>>>>>>>>>>>> [4]
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>
> > >>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
> > >>>>> >>>>>>>>>>>>>>>> [5]
> > >>>>> >>>>>>>>>>>>>>>>
> > >>>>> >>
> > >>>>>
> >
> https://github.com/1996fanrui/flink/commit/c7559d94767de97c24ea8c540878832138c8e8fe
> > >>>>> >>>>>>>>>>>>>>>> [6]
> > >>>>> https://github.com/apache/flink-benchmarks/pull/54
> > >>>>> >>>>>>>>>>>> --
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>> Best regards,
> > >>>>> >>>>>>>>>>>> Anton Kalashnikov
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>>>>
> > >>>>> >>>>>>>>>> --
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>> Best regards,
> > >>>>> >>>>>>>>>> Anton Kalashnikov
> > >>>>> >>>>>>>>>>
> > >>>>> >>>>>>>>>>
> > >>>>> >> --
> > >>>>> >>
> > >>>>> >> Best regards,
> > >>>>> >> Anton Kalashnikov
> > >>>>> >>
> > >>>>> >>
> > >>>>>
> > >>>>
> >
>

Re: [DISCUSS] FLIP-227: Support overdraft buffer

Reply via email to