Piotr Nowojski created FLINK-24694:
--
Summary: Translate "Checkpointing under backpressure" page into
Chinese
Key: FLINK-24694
URL: https://issues.apache.org/jira/browse/FLINK-24694
Project: Flink
>> buffers, we could first log and process these buffers, and if they do not
>> have buffers, we can still process the buffers from the channels that have
>> seen barriers. Therefore, it seems prioritizing
> >> may get "drained" during the timeout, as pointed out by Stephan. With
> >> such a timeout, we would very likely not need to snapshot the input
> >> buffers, which would be very similar to the current aligned checkpoint
> >> mechanism.
> >>
> >> Best,
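The timeout idea quoted above (try to align first, and only snapshot the in-flight buffers if the channels do not drain in time) can be sketched roughly as follows. This is a minimal, hypothetical model, not Flink's actual implementation; the function name, the `drain_one` callback, and the step budget standing in for a wall-clock timeout are all assumptions made for illustration.

```python
# Hypothetical sketch of "aligned checkpoint with a timeout": keep draining
# in-flight buffers, and only if alignment does not finish within the budget
# do the leftover buffers become part of the checkpoint (unaligned fallback).

def align_with_timeout(channels, drain_one, budget):
    """Try to drain all channels within `budget` drain steps.

    Returns ("aligned", []) if the buffers drained in time (no buffer
    snapshot needed), or ("unaligned", leftovers) where the leftover
    in-flight buffers must be written into the checkpoint.
    """
    for _ in range(budget):
        if all(not ch for ch in channels):
            return "aligned", []
        drain_one(channels)  # process one more in-flight buffer while waiting
    if all(not ch for ch in channels):
        return "aligned", []
    # Timeout hit: remaining in-flight data is snapshotted instead of drained.
    return "unaligned", [b for ch in channels for b in ch]
```

For example, with two input channels holding buffers `["a", "b"]` and `["c"]`, a budget of 5 steps drains everything ("aligned"), while a budget of 2 leaves `"c"` to be snapshotted ("unaligned").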
>> implement to make the task able to continue processing buffers as soon as
>> possible.
>>
>> Thanks for the further explanation of the requirements for speeding up
>> checkpoints in the backpressure scenario. Making the savepoint finish
>> quickly and then tuning the settings to avoid backpressure is really a
>> practical case. I think this s
---
> From:zhijiang
> Send Time:2019 Aug. 15 (Thu.) 02:22
> To:dev
> Subject:Re: Checkpointing under backpressure
>
> > For the checkpoint to complete, any buffer that
> > arrived prior to the barrier would need to be part of the checkpointed
> state.
>
> Yes, I agree
Subject:Re: Checkpointing under backpressure
> For the checkpoint to complete, any buffer that
> arrived prior to the barrier would need to be part of the checkpointed state.
Yes, I agree.
> So wouldn't it be important to finish persisting these buffers as fast as
> possible by pri
> >>> snapshot isolation. Also, the original snapshotting gives a lot of
> >>> potential for Flink to make proper transactional commits externally.
ss issue also applies to the
savepoint use case. We would need to be able to take a savepoint fast in
order to roll forward a fix that can alleviate the backpressure (like
changing parallelism or making a different configuration change).
>
> Best,
> Zhijiang
>
ign. E.g. the sink commit delay might not be covered by the unaligned
solution.
Best,
Zhijiang
------
From:Stephan Ewen
Send Time:2019 Aug. 14 (Wed.) 17:43
To:dev
Subject:Re: Checkpointing under backpressure
Quick note: The current implementation is
Align ->
> >>>>>>>
> >>>>>>> Paris:
> >>>>>>>
> >>>>>>> Thanks for the explanation Paris. I’m starting to understand this
> >>>>>>> more and I like the idea of snapshot
>>>>> Another thing is that from the wiki description I understood that the
>>>>> initial checkpointing is not initialised by any checkpoint barrier, but
>>>>> by an independent call/message from the Observer. I haven’t played with
>>>>> this idea a lot, but I had some discussion with Nico and it s
ata)
>>>>>> b) for any input channel for which it hasn’t yet received a checkpoint
>>>>>> barrier, the data is being added to the checkpoint
>>>>>> c) once a channel (for example I1) receives a checkpoint barrier, the
>>>>>> Task blocks reads from that channel (?)
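The a)/b)/c) steps quoted above can be sketched as a small state machine. This is a hedged illustration only: the class and method names are hypothetical and do not correspond to Flink's internal APIs.

```python
# Hypothetical sketch of the a)/b)/c) steps: data from channels that have not
# yet delivered the barrier is logged into the checkpoint (step b); once a
# channel delivers its barrier, reads from it are blocked (step c).

class ChannelSnapshotter:
    def __init__(self, channels):
        self.pending = set(channels)  # channels whose barrier has not arrived yet
        self.blocked = set()          # channels already aligned (step c)
        self.logged = []              # in-flight records added to the checkpoint (step b)

    def on_record(self, channel, record):
        if channel in self.blocked:
            raise RuntimeError("reads from an aligned channel are blocked")
        if channel in self.pending:
            # Step b: this record becomes part of the checkpointed state.
            self.logged.append((channel, record))

    def on_barrier(self, channel):
        # Step c: stop reading from this channel until the snapshot completes.
        self.pending.discard(channel)
        self.blocked.add(channel)
        return not self.pending  # True once barriers arrived on all channels
```

With two channels I1 and I2, a record arriving on I2 after I1's barrier is still logged into the checkpoint, because I2's own barrier has not been seen yet.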
> >>>> slowest Tasks. Right?
> >>>>
> >>>> Couple of intriguing thoughts are:
> >>>> 3. checkpoint barriers overtaking the output buffers
> >>>> 4. can we keep processing some data (in order to not waste CPU cycles)
> >>>> after we have taken the snapshot of the
>>>>> intriguing. But probably there is also a benefit to not continuing to
>>>>> read I1, since that could speed up retrieval from I2. Also, if the user
>>>>> code is the cause of backpressure, this would avoid pumping more data
>>>>> into the pro
n a fixed interval with less IO overhead.
>
> Best,
> Yun
>
> --
> From:Piotr Nowojski
> Send Time:2019 Aug. 14 (Wed.) 18:38
> To:Paris Carbone
> Cc:dev ; zhijiang ; Nico Kruber
> Subject:Re: Checkpointing under backpressure
>
y similar to the way of overtaking we proposed before.
> >>>>
> >>>> There are some tiny differences:
> >>>> The way of overtaking might need to snapshot all the input/output
> >>>> queues. Chandy-Lamport seems to only need to snapshot (n-1) input
(Wed.) 18:38
To:Paris Carbone
Cc:dev ; zhijiang ; Nico Kruber
Subject:Re: Checkpointing under backpressure
Hi again,
Zhu Zhu let me think about this more. Maybe, as Paris is writing, we do not
need to block any channels at all, at least assuming credit-based flow
control.
Regarding what should
he state size a bit. But normally
>>>> there should be fewer buffers for the first input channel with a barrier.
>>>> The output barrier still follows the regular data stream in
>>>> Chandy-Lamport, the same way as in current Flink. For the overtaking way,
>>>> we need to pay
> channel, so Chandy-Lamport could benefit well. But for the case of all
>>> balanced heavy-load input channels, I mean the first arrived barrier might
>>> still take much time, so the overtaking way could still fit well to speed
>>> up the checkpoint.
>>> A
on the downstream side.
> >> In the backpressure caused by data skew, the first barrier in an almost
> >> empty input channel should arrive much earlier than from the last
> >> heavy-load input channel, so Chandy-Lamport could benefit well. But for
> >> the case of all
jiang
--
From:Thomas Weise
Send Time:2019 Aug. 14 (Wed.) 06:00
To:dev ; zhijiang
Cc:Paris Carbone
Subject:Re: Checkpointing under backpressure
Great discussion! I'm excited that this is already under consideration! Are
there any JIRAs or other traces of discussion to follow?
Paris, if I
proposed suggestion is helpful on my side, especially considering
some implementation details.
Best,
Zhijiang
--
From:Paris Carbone
Send Time:2019 Aug. 13 (Tue.) 14:03
To:dev
Cc:zhijiang
Subject:Re: Checkpointing under backpressure

yes! It’s quite similar I think. Though mind that the devil is in the
details, i.e., the temporal order in which actions are taken.
To clarify, let us say you have a task T with two input channels, I1 and I2.
The Chandy-Lamport execution flow is the following:
1) T receives a barrier from I1 and...
2) ..
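The execution flow above is cut short in this snippet, so here is a hedged sketch of the classic Chandy-Lamport flow for a task T with inputs I1 and I2 as described in the thread: on the first barrier, T snapshots its local state immediately and starts logging in-flight records on the other channels; a later barrier on a channel ends logging for that channel. The class and method names are illustrative, not Flink's API.

```python
# Minimal sketch of the Chandy-Lamport flow: no channel is ever blocked;
# in-flight data between the first and last barrier is logged instead.

class Task:
    def __init__(self, channels):
        self.channels = set(channels)
        self.recording = set()     # channels whose in-flight data is being logged
        self.channel_log = {}      # logged in-flight records, per channel
        self.state_snapshot = None
        self.state = []            # stand-in for the operator's local state

    def on_barrier(self, channel):
        if self.state_snapshot is None:
            # 1) First barrier: snapshot the local state immediately...
            self.state_snapshot = list(self.state)
            # ...and start logging in-flight data on all *other* channels.
            self.recording = self.channels - {channel}
            self.channel_log = {c: [] for c in self.recording}
        else:
            # Later barrier: this channel's in-flight data is now complete.
            self.recording.discard(channel)

    def on_record(self, channel, record):
        self.state.append(record)  # keep processing; nothing is blocked
        if channel in self.recording:
            self.channel_log[channel].append(record)
```

Note the contrast with aligned checkpoints: records arriving on I1 after its barrier are still processed, they are simply not part of the snapshot.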
Thanks for the input. Regarding the Chandy-Lamport snapshots: don’t you still
have to wait for the “checkpoint barrier” to arrive in order to know when you
have already received all possible messages from the upstream tasks/operators?
So instead of processing the “in flight” messages (as the Flin
Interesting problem! Thanks for bringing it up Thomas.
Ignore/correct me if I am wrong, but I believe Chandy-Lamport snapshots [1]
would help solve this problem more elegantly without sacrificing
correctness.
- They do not need alignment, only (async) logging for in-flight records
between th
Hi Thomas,
As Zhijiang has responded, we are now in the process of discussing how to
address this issue, and one of the solutions that we are discussing is exactly
what you are proposing: checkpoint barriers overtaking the in-flight data and
making the in-flight data part of the checkpoint.
If eve
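The "barriers overtaking in-flight data" idea described above can be illustrated with a toy output queue. This is a simplified model under stated assumptions, not Flink's network stack: the queue, the `inject_barrier` method, and the returned snapshot list are all hypothetical.

```python
# Toy model of a barrier overtaking queued output buffers: the barrier jumps
# to the front of the queue, and the overtaken buffers are returned so they
# can be persisted as part of the checkpoint. Downstream still receives the
# data afterwards, so no records are lost.

from collections import deque

class OutputQueue:
    def __init__(self):
        self.buffers = deque()

    def emit(self, buffer):
        self.buffers.append(buffer)

    def inject_barrier(self, barrier):
        in_flight = list(self.buffers)   # these become checkpoint state
        self.buffers.clear()
        self.buffers.append(barrier)     # barrier overtakes the buffered data
        self.buffers.extend(in_flight)   # data still flows downstream after it
        return in_flight
```

Because the barrier no longer waits behind a backpressured queue, checkpoint latency stops depending on how long the in-flight data takes to drain; the cost moves to persisting that data instead.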
Hi Thomas,
Thanks for raising this concern. The barrier alignment takes a long time in
the backpressure case, which can cause several problems:
1. Checkpoint timeout, as you mentioned.
2. The recovery cost is high after a failover, because much data needs to be
replayed.
3. The delay for commit-based s
Hi,
One of the major operational difficulties we observe with Flink is
checkpoint timeouts under backpressure. I'm looking for both confirmation
of my understanding of the current behavior as well as pointers for future
improvement work:
Prior to the introduction of credit-based flow control in the