Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Piotr Nowojski Tue, 07 Nov 2023 02:27:57 -0800

Hi Rui,

Thanks for the comments!


> 1. I see the trace just supports Span? Does it support trace events?
> I'm not sure whether tracing events is reasonable for TraceReporter.
> If it supports, flink can report checkpoint and checkpoint path
> proactively.
> Currently, checkpoint lists or the latest checkpoint can only be fetched
> by external components or platforms. And report is more timely and
> efficient than fetch.

No, the `TraceReporter` that I'm introducing currently supports only
single-span traces. Neither standalone events nor events inside spans are
supported. This is just for the sake of simplicity and to test out the
basic functionality, but I think those missing features should be added
at some point in the future.

About structured logging (basically events?), I vaguely remember some
discussions about that. It might be a much larger topic, so I would prefer
to leave it out of the scope of this FLIP.

> 2. This FLIP just monitors the checkpoint and task recovery, right?

Yes, it only adds single-span traces for checkpointing and
recovery/initialization: one span for the whole job per
recovery/initialization process, and one span per checkpoint.
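
To illustrate what "one span per checkpoint" could look like, here is a
rough sketch. All the names in it (`SimpleSpan`, `SimpleTraceReporter`,
`CollectingTraceReporter`, `CheckpointTraceSketch`, the "checkpointing"
scope and the "checkpointId" attribute) are hypothetical stand-ins made up
for this example, not the interfaces proposed in the FLIP. It only shows
the shape of the idea: a flat span with a start, an end and some
attributes, with no child spans and no events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a single-span trace: no parent, no children, no events.
final class SimpleSpan {
    final String scope;
    final String name;
    final long startTsMillis;
    final long endTsMillis;
    final Map<String, Object> attributes;

    SimpleSpan(String scope, String name, long startTsMillis, long endTsMillis,
               Map<String, Object> attributes) {
        this.scope = scope;
        this.name = name;
        this.startTsMillis = startTsMillis;
        this.endTsMillis = endTsMillis;
        this.attributes = attributes;
    }

    long durationMillis() {
        return endTsMillis - startTsMillis;
    }
}

// Hypothetical reporter interface: it is handed each finished span exactly once.
interface SimpleTraceReporter {
    void notifyOfAddedSpan(SimpleSpan span);
}

// Toy reporter that keeps spans in memory; a real one would export them somewhere.
final class CollectingTraceReporter implements SimpleTraceReporter {
    final List<SimpleSpan> spans = new ArrayList<>();

    @Override
    public void notifyOfAddedSpan(SimpleSpan span) {
        spans.add(span);
    }
}

final class CheckpointTraceSketch {
    // Builds one span covering a whole checkpoint of the whole job.
    static SimpleSpan checkpointSpan(long checkpointId, long startTsMillis, long endTsMillis) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("checkpointId", checkpointId); // assumed attribute name
        return new SimpleSpan("checkpointing", "Checkpoint", startTsMillis, endTsMillis, attributes);
    }
}
```

With something like this, each completed checkpoint would result in a
single `notifyOfAddedSpan` call, and a recovery/initialization would be
reported the same way with a different span name.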

> Could we add more operations in this FLIP? In our production, we
> added a lot of trace reporters for job starts and scheduler operation.
> They are useful if some jobs start slowly, because they will affect
> the job availability. For example:
> - From JobManager process is started to JobGraph is created
> - From JobGraph is created to JobMaster is created
> - From JobMaster is created to job is running
> - From start request tm from yarn or kubernetes to all tms are ready
> - etc

I think those could indeed be useful. If you would like to contribute them
in the future, I would be happy to review the FLIP for them :)

> Of course, this FLIP doesn't include them is fine for me. The first version
> only initializes the interface and common operations, and we can add
> more operations in the future

Yes, that's exactly my thinking :)

Best,
Piotrek

On Tue, 7 Nov 2023 at 10:05, Rui Fan <[email protected]> wrote:

> Hi Piotr,
>
> Thanks for driving this proposal! The trace reporter is useful to
> check a lot of duration monitors inside of Flink.
>
> I have some questions about this proposal:
>
> 1. I see the trace just supports Span? Does it support trace events?
> I'm not sure whether tracing events is reasonable for TraceReporter.
> If it supports, flink can report checkpoint and checkpoint path
> proactively.
> Currently, checkpoint lists or the latest checkpoint can only be fetched
> by external components or platforms. And report is more timely and
> efficient than fetch.
>
> 2. This FLIP just monitors the checkpoint and task recovery, right?
> Could we add more operations in this FLIP? In our production, we
> added a lot of trace reporters for job starts and scheduler operation.
> They are useful if some jobs start slowly, because they will affect
> the job availability. For example:
> - From JobManager process is started to JobGraph is created
> - From JobGraph is created to JobMaster is created
> - From JobMaster is created to job is running
> - From start request tm from yarn or kubernetes to all tms are ready
> - etc
>
> Of course, this FLIP doesn't include them is fine for me. The first version
> only initializes the interface and common operations, and we can add
> more operations in the future.
>
> Best,
> Rui
>
> On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <[email protected]>
> wrote:
>
> > Hi all!
> >
> > I would like to start a discussion on FLIP-384: Introduce TraceReporter and
> > use it to create checkpointing and recovery traces [1].
> >
> > This proposal intends to improve observability of Flink's Checkpointing and
> > Recovery/Initialization operations, by adding support for reporting traces
> > from Flink. In the future, reporting traces can be of course used for other
> > use cases and also by users.
> >
> > There are also two other follow up FLIPS, FLIP-385 [2] and FLIP-386 [3],
> > which expand the basic functionality introduced in FLIP-384 [1].
> >
> > Please let me know what you think!
> >
> > Best,
> > Piotr Nowojski
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> >
>
