Re: [DISCUSS] FLIP-325: Support configuring end-to-end allowed latency

Piotr Nowojski Wed, 05 Jul 2023 10:54:42 -0700

Hi,

Thanks for this proposal, this is a very much needed thing that should be
addressed in Flink.


I think there is one thing that hasn't been discussed neither here nor in
FLIP-309. Given that we have
three dimensions:
- e2e latency/checkpointing interval
- enabling some kind of batching/buffering on the operator level
- how much resources we want to allocate to the job

How do we want Flink to adjust itself between those three? For example:
a) Should we assume that given Job has a fixed amount of assigned resources
and make it paramount that
  Flink doesn't exceed those available resources? So in case of
backpressure, we
  should extend checkpointing intervals, emit records less frequently and
in batches.
b) Or should we assume that the amount of resources is flexible (up to a
point?), and the desired e2e latency
  is the paramount aspect? So in case of backpressure, we should still
adhere to the configured e2e latency,
  and wait for the user or autoscaler to scale up the job?

In case of a), I think the concept of "isProcessingBacklog" is not needed,
we could steer the behaviour only
using the backpressure information.

On the other hand, in case of b), "isProcessingBacklog" information might
be helpful, to let Flink know that
we can safely decrease the e2e latency/checkpoint interval even if there is
no backpressure, to use fewer
resources (and let the autoscaler scale down the job).

Do we want to have both, or only one of those? Do a) and b) complement one
another? If job is backpressured,
we should follow a) and expose to autoscaler/users information "Hey! I'm
barely keeping up! I need more resources!".
While, when there is no backpressure and latency doesn't matter
(isProcessingBacklog=true), we can limit the resource
usage.

And a couple of more concrete remarks about the current proposal.

1.

> I think the goal is to allow users to specify an end-to-end latency
budget for the job.

I fully agree with this, but in that case, why are you proposing to add
`execution.flush.interval`? That's
yet another parameter that would affect e2e latency, without actually
defining it. We already have things
like: execution.checkpointing.interval, execution.buffer-timeout. I'm
pretty sure very few Flink users would be
able to configure or understand all of them.

I think we should simplify configuration and try to define
"execution.end-to-end-latency" so the runtime
could derive other things from this new configuration.

2. How do you envision `#flush()` and `#snapshotState()` to be connected?
So far, `#snapshotState()`
was considered as a kind of `#flush()` signal. Do we need both? Shouldn't
`#flush()` be implicitly or
explicitly attached to the `#snapshotState()` call?

3. What about unaligned checkpoints if we have separate `#flush()`
event/signal?

4. How should this be working in at-least-once mode (especially sources
that are configured to be working
in at-least-once mode)?.

5. How is this FLIP connected with FLI-327? I think they are trying to
achieve basically the same thing:
optimise when data should be flushed/committed to balance between
throughput and latency.

6.

> Add RecordAttributesBuilder and RecordAttributes that extends
StreamElement to provide operator with essential
> information about the records they receive, such as whether the records
are already stale due to backlog.

Passing along `RecordAttribute` for every `StreamElement` would be an
extremely inefficient solution.

If at all, this should be a marker propagated through the JobGraph vie
Events or sent from JM to TMs via an RPC
that would mark "backlog processing started/ended". Note that Events might
be costly, as they need to be
broadcasted. So with a job having 5 keyBy exchanges and parallelism of
1000, the number of events sent is
~4 000 000, while the number of RPCs would be only 5000.

In case we want to only check for the backpressure, we don't need any extra
signal. Operators/subtasks can
get that information very easily from the TMs runtime.

Best,
Piotrek

czw., 29 cze 2023 o 17:19 Dong Lin <[email protected]> napisał(a):

> Hi Shammon,
>
> Thanks for your comments. Please see my reply inline.
>
> On Thu, Jun 29, 2023 at 6:01 PM Shammon FY <[email protected]> wrote:
>
> > Hi Dong and Yunfeng,
> >
> > Thanks for bringing up this discussion.
> >
> > As described in the FLIP, the differences between `end-to-end latency`
> and
> > `table.exec.mini-batch.allow-latency` are: "It allows users to specify
> the
> > end-to-end latency, whereas table.exec.mini-batch.allow-latency applies
> to
> > each operator. If there are N operators on the path from source to sink,
> > the end-to-end latency could be up to
> table.exec.mini-batch.allow-latency *
> > N".
> >
> > If I understand correctly, `table.exec.mini-batch.allow-latency` is also
> > applied to the end-to-end latency for a job, maybe @Jack Wu can give more
> > information.
> >
>
> Based on what I can tell from the doc/code and offline discussion, I
> believe table.exec.mini-batch.allow-latency is not applied to the
> end-to-end latency for a job.
>
> It is mentioned here
> <
> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/
> >
> that
> table.exec.mini-batch.allow-latency is "the maximum latency can be used for
> MiniBatch to buffer input records". I think we should have mentioned that
> the config is applied to the end-to-end latency in this doc if it is indeed
> the case.
>
>
> > So, from my perspective, and please correct me if I'm misunderstand, the
> > targets of this FLIP may include the following:
> >
> > 1. Support a mechanism like  `mini-batch` in SQL for `DataStream`, which
> > will collect data in the operator and flush data when it receives a
> `flush`
> > event, in the FLIP it is `FlushEvent`.
> >
>
> I think the goal is to allow users to specify an end-to-end latency budget
> for the job. IMO it is quite different from the `mini-batch` in SQL.
>
>
> >
> > 2. Support dynamic `latency` according to the progress of job, such as
> > snapshot stage and after that.
> >
> > To do that, I have some questions:
> >
> > 1. I didn't understand the purpose of public interface
> `RecordAttributes`.
> > I think `FlushEvent` in the FLIP is enough, and different
> > `DynamicFlushStrategy` can be added to generate flush events to address
> > different needs, such as a static interval similar to mini-batch in SQL
> or
> > collect the information `isProcessingBacklog` and metrics to generate
> > `FlushEvent` which is described in your FLIP? If hudi sink needs the
> > `isBacklog` flag, the hudi `SplitEnumerator` can create an operator event
> > and send it to hudi source reader.
> >
>
> Suppose we only have FlushEvent, then operators (e.g. Hudi Sink) will not
> know they can buffer data in the following scenario:
>
> - execution.allowed-latency is not configured and use the default value
> null.
> - The job is reading from HybridSource and HybridSource says
> isBacklog=true.
>
> Also note that Hudi Sink might not be the only operators that can benefit
> from knowing isBacklog=true. Other sinks and aggregation operators (e.g.
> CoGroup) can also increase throughput by buffering/sorting records when
> there is backlog. So it seems simpler to pass RecordAttributes to these
> operators than asking every operator developer to create operator event
> etc.
>
>
> >
> > 2. How is this new mechanism unified with SQL's mini-batch mechanism? As
> > far as I am concerned, SQL implements mini-batch mechanism based on
> > watermark, I think it is very unreasonable to have two different
> > implementation in SQL and DataStream.
> >
>
> I think we can deprecate table.exec.mini-batch.allow-latency later
> once execution.allowed-latency is ready for production usage. This is
> mentioned in the "Compatibility, Deprecation, and Migration Plan" section.
>
> If there is a config that supports user specifying the e2e latency, it is
> probably reasonable for this config to work for both DataStream and SQL.
>
>
> > 3. I notice that the `CheckpointCoordinator` will generate `FlushEvent`,
> > which information about `FlushEvent` will be stored in
> >
>
> CheckpointCoordinator might need to send FlushEvent before triggering
> checkpoint in order to deal with the two-phase commit sinks. The algorithm
> is specified in the "Proposed Changes" section.
>
>
> > `Checkpoint`? What is the alignment strategy for FlushEvent in the
> > operator? The operator will flush the data when it receives all
> > `FlushEvent` from upstream with the same ID or do flush for each
> > `FlushEvent`? Can you give more detailed proposal about that? We also
> have
> > a demand for this piece, thanks
> >
>
> After an operator has received a FlushEvent:
> - If the ID of the received FlushEvent is larger than the largest ID this
> operator has received, then flush() is triggered for this operator and the
> operator should broadcast FlushEvent to downstream operators.
> - Otherwise, this FlushEvent is ignored.
>
> This behavior is specified in the Java doc of the FlushEvent.
>
> Can you see if this answers your questions?
>
> Best,
> Dong
>
>
> >
> >
> > Best,
> > Shammon FY
> >
> >
> >
> > On Thu, Jun 29, 2023 at 4:35 PM Martijn Visser <[email protected]
> >
> > wrote:
> >
> >> Hi Dong and Yunfeng,
> >>
> >> Thanks for the FLIP. What's not clear for me is what's the expected
> >> behaviour when the allowed latency can't be met, for whatever reason.
> >> Given that we're talking about an "allowed latency", it implies that
> >> something has gone wrong and should fail? Isn't this more a minimum
> >> latency that you're proposing?
> >>
> >> There's also the part about the Hudi Sink processing records
> >> immediately upon arrival. Given that the SinkV2 API provides the
> >> ability for custom post and pre-commit topologies [1], specifically
> >> targeted to avoid generating multiple small files, why isn't that
> >> sufficient for the Hudi Sink? It would be great to see that added
> >> under Rejected Alternatives if this is indeed not sufficient.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
> >>
> >> On Sun, Jun 25, 2023 at 4:25 AM Yunfeng Zhou
> >> <[email protected]> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Dong(cc'ed) and I are opening this thread to discuss our proposal to
> >> > support configuring end-to-end allowed latency for Flink jobs, which
> >> > has been documented in FLIP-325
> >> > <
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-325%3A+Support+configuring+end-to-end+allowed+latency
> >> >.
> >> >
> >> > By configuring the latency requirement for a Flink job, users would be
> >> > able to optimize the throughput and overhead of the job while still
> >> > acceptably increasing latency. This approach is particularly useful
> >> > when dealing with records that do not require immediate processing and
> >> > emission upon arrival.
> >> >
> >> > Please refer to the FLIP document for more details about the proposed
> >> > design and implementation. We welcome any feedback and opinions on
> >> > this proposal.
> >> >
> >> > Best regards.
> >> >
> >> > Dong and Yunfeng
> >>
> >
>

Re: [DISCUSS] FLIP-325: Support configuring end-to-end allowed latency

Reply via email to