Hi Piotr,

Thanks for driving this proposal! The trace reporter will be useful for
monitoring the duration of many operations inside Flink.

I have some questions about this proposal:

1. As far as I can see, the trace reporter only supports Spans. Does it
also support trace events? I'm not sure whether events are a reasonable
fit for TraceReporter. If they were supported, Flink could proactively
report each checkpoint and its path. Currently, the checkpoint list or
the latest checkpoint can only be fetched by external components or
platforms, and reporting is more timely and efficient than fetching.
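
To make the idea concrete, here is a rough sketch of what such an event
could look like. Every name in it (TraceEvent, EventReporter,
checkpointCompleted, the attribute keys) is hypothetical and not taken
from the FLIP or from any existing Flink API, and the path is a made-up
example value:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only: none of these types exist in Flink or in FLIP-384. */
public class CheckpointEventSketch {

    /** A point-in-time event: a name, one timestamp, and string attributes. */
    public static final class TraceEvent {
        public final String name;
        public final long timestampMs;
        public final Map<String, String> attributes;

        TraceEvent(String name, long timestampMs, Map<String, String> attributes) {
            this.name = name;
            this.timestampMs = timestampMs;
            // Defensive copy so that a reported event cannot be modified afterwards.
            this.attributes = Collections.unmodifiableMap(new LinkedHashMap<>(attributes));
        }
    }

    /** Assumed push-style reporter: Flink would call this, nobody has to poll. */
    public interface EventReporter {
        void report(TraceEvent event);
    }

    /** Builds the event that could be emitted when a checkpoint completes. */
    public static TraceEvent checkpointCompleted(long checkpointId, String path, long timestampMs) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("checkpointId", Long.toString(checkpointId));
        attrs.put("checkpointPath", path);
        return new TraceEvent("CheckpointCompleted", timestampMs, attrs);
    }

    public static void main(String[] args) {
        List<TraceEvent> received = new ArrayList<>();
        EventReporter reporter = received::add; // stand-in for an external platform
        reporter.report(
                checkpointCompleted(42L, "hdfs:///checkpoints/chk-42", System.currentTimeMillis()));
        System.out.println(received.get(0).name + " " + received.get(0).attributes);
    }
}
```

Unlike a Span, such an event has no duration, which is why I am unsure
whether it belongs in TraceReporter or in a separate interface.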

2. This FLIP only covers checkpointing and task recovery, right?
Could we add more operations to this FLIP? In our production setup, we
added a lot of trace reporting for job startup and scheduler operations.
It is useful when a job starts slowly, because slow startup affects
job availability. For example:
- From the JobManager process starting to the JobGraph being created
- From the JobGraph being created to the JobMaster being created
- From the JobMaster being created to the job running
- From the first TM request to YARN or Kubernetes to all TMs being ready
- etc.
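
The phases above can be derived from a handful of timestamps. Below is a
minimal sketch, assuming we record one timestamp per milestone and emit
one span per consecutive pair. The class, the milestone names, and the
timestamps are all made up for illustration; this is not the Span API
from the FLIP:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only: not part of FLIP-384 or of any Flink API. */
public class StartupPhaseSpans {

    /** A minimal span: a named interval with start and end timestamps in milliseconds. */
    public static final class PhaseSpan {
        public final String name;
        public final long startMs;
        public final long endMs;

        PhaseSpan(String name, long startMs, long endMs) {
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        public long durationMs() {
            return endMs - startMs;
        }
    }

    /** Turns milestones (in insertion order) into one span per consecutive pair. */
    public static List<PhaseSpan> fromMilestones(LinkedHashMap<String, Long> milestones) {
        List<PhaseSpan> spans = new ArrayList<>();
        String prevName = null;
        long prevTs = 0L;
        for (Map.Entry<String, Long> e : milestones.entrySet()) {
            if (prevName != null) {
                spans.add(new PhaseSpan(prevName + " -> " + e.getKey(), prevTs, e.getValue()));
            }
            prevName = e.getKey();
            prevTs = e.getValue();
        }
        return spans;
    }

    public static void main(String[] args) {
        // Illustrative timestamps, not measurements.
        LinkedHashMap<String, Long> m = new LinkedHashMap<>();
        m.put("JobManagerStarted", 0L);
        m.put("JobGraphCreated", 1_200L);
        m.put("JobMasterCreated", 1_500L);
        m.put("JobRunning", 9_000L);
        for (PhaseSpan s : fromMilestones(m)) {
            System.out.println(s.name + ": " + s.durationMs() + " ms");
        }
    }
}
```

The TM allocation phase overlaps with the others, so it would need its
own pair of timestamps rather than a place in this chain.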

Of course, it is fine with me if this FLIP doesn't include them. The
first version can just introduce the interface and the common
operations, and we can add more operations in the future.

Best,
Rui

On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi all!
>
> I would like to start a discussion on FLIP-384: Introduce TraceReporter and
> use it to create checkpointing and recovery traces [1].
>
> This proposal intends to improve observability of Flink's Checkpointing and
> Recovery/Initialization operations, by adding support for reporting traces
> from Flink. In the future, reporting traces can be of course used for other
> use cases and also by users.
>
> There are also two other follow up FLIPS, FLIP-385 [2] and FLIP-386 [3],
> which expand the basic functionality introduced in FLIP-384 [1].
>
> Please let me know what you think!
>
> Best,
> Piotr Nowojski
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> [3]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
>
