Hi Piotr,

Thanks for driving this proposal! I strongly agree that the existing metric APIs are not suitable for monitoring restore/checkpoint behavior.
I think the TM-level recovery/checkpointing traces will be necessary in the future. In our production environment we sometimes see very long job recovery times (30+ minutes) caused by heavy disk traffic on a few subtasks. A TM-level recovery trace is very helpful for troubleshooting such issues.

Best,
Jinzhong

On Wed, Nov 8, 2023 at 5:09 PM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi Zakelly,
>
> Thanks for the comments. The quick answer to both of your questions is
> that this should probably be left as future work. For more detailed
> answers please take a look below :)
>
> > Does it mean the inclusion and subdivision relationships of spans
> > defined by "parent_id" are not supported? I think it is a very
> > necessary feature for the trace.
>
> Yes, exactly, that is the current limitation. This could be solved one
> way or another in the future.
>
> Support for reporting multi-span traces all at once would be a
> relatively easy follow-up - for example, `CheckpointStatsTracker`,
> running on the JM, could upon checkpoint completion create the whole
> structure of parent spans in one place, to have for example one span
> per subtask.
>
> However, if we would like to create true distributed traces, with spans
> reported from many different components, potentially both on the JM and
> the TMs, the problem is a bit deeper. The issue in that case is how to
> actually fill out `parent_id` and `trace_id`. Passing some context
> entity around as a Java object would be unfeasible - it would require
> too many changes in too many places. I think the only realistic way to
> do it would be to have a deterministic generator of `parent_id` and
> `trace_id` values.
>
> For example, we could create the parent trace/span of the checkpoint on
> the JM and set those ids to something like `jobId#attemptId#checkpointId`.
> Each subtask could then re-generate those ids, and the subtask's
> checkpoint span would have an id of
> `jobId#attemptId#checkpointId#subTaskId`. Note that this is just an
> example, as distributed spans for checkpointing most likely do not make
> sense anyway - we can generate them much more easily on the JM.
>
> > In addition to checkpoint and recovery, I believe the trace would also
> > be valuable for performance tuning. If Flink can trace and visualize
> > the time cost of each operator and stage for a sampled record, users
> > would be able to easily determine the end-to-end latency and identify
> > performance issues for optimization. Looking forward to seeing these
> > in the future.
>
> I'm not sure I understand the proposal - I don't see how traces could be
> used for this purpose. Traces are a good fit for one-off events (like
> checkpointing, recovery, etc.), not for continuous monitoring (like
> processing records) - that's what metrics are for. Creating a trace and
> span(s) per record would be prohibitively expensive.
>
> Unless you mean batch/bounded jobs? Then yes, we could create a bounded
> job trace, with spans for every stage/task/subtask.
>
> Best,
> Piotrek
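A minimal sketch of the deterministic id scheme Piotr describes in the email above; all class and method names here are made up for illustration and are not part of the FLIP:

    public final class CheckpointSpanIds {

        // Hypothetical helper: derives a trace id that the JM and every subtask
        // can compute independently, so no context object has to be passed around.
        public static String traceId(String jobId, int attemptId, long checkpointId) {
            return jobId + "#" + attemptId + "#" + checkpointId;
        }

        // Hypothetical helper: a subtask re-generates its own span id locally by
        // appending its index to the shared checkpoint trace id.
        public static String subtaskSpanId(
                String jobId, int attemptId, long checkpointId, int subtaskIndex) {
            return traceId(jobId, attemptId, checkpointId) + "#" + subtaskIndex;
        }
    }

Both sides derive the same ids from information they already have, so no trace context object has to be passed between the JM and the TMs.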
>
> śr., 8 lis 2023 o 05:30 Zakelly Lan <zakelly....@gmail.com> napisał(a):
>
> > Hi Piotr,
> >
> > Happy to see the trace! Thanks for this proposal.
> >
> > One minor question: it is mentioned in the interface of Span:
> >
> > > Currently we don't support traces with multiple spans. Each span is
> > > self-contained and represents things like a checkpoint or recovery.
> >
> > Does it mean the inclusion and subdivision relationships of spans
> > defined by "parent_id" are not supported? I think it is a very
> > necessary feature for the trace.
> >
> > In addition to checkpoint and recovery, I believe the trace would also
> > be valuable for performance tuning. If Flink can trace and visualize
> > the time cost of each operator and stage for a sampled record, users
> > would be able to easily determine the end-to-end latency and identify
> > performance issues for optimization. Looking forward to seeing these
> > in the future.
> >
> > Best,
> > Zakelly
> >
> > On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski <pnowoj...@apache.org>
> > wrote:
> >
> > > Hi Rui,
> > >
> > > Thanks for the comments!
> > >
> > > > 1. I see the trace just supports Span? Does it support trace
> > > > events? I'm not sure whether tracing events is reasonable for
> > > > TraceReporter. If it supports, Flink can report checkpoints and
> > > > checkpoint paths proactively. Currently, checkpoint lists or the
> > > > latest checkpoint can only be fetched by external components or
> > > > platforms. And reporting is more timely and efficient than fetching.
> > >
> > > No, currently the `TraceReporter` that I'm introducing supports only
> > > single-span traces, so neither events on their own nor events inside
> > > spans are supported. This is done just for the sake of simplicity, to
> > > test out the basic functionality. But I think those currently missing
> > > features should be added at some point in the future.
> > >
> > > About structured logging (basically events?), I vaguely remember some
> > > discussions about that. It might be a much larger topic, so I would
> > > prefer to leave it out of the scope of this FLIP.
> > >
> > > > 2. This FLIP just monitors the checkpoint and task recovery, right?
> > >
> > > Yes, it only adds single-span traces for checkpointing and
> > > recovery/initialization - one span per whole job, per either
> > > recovery/initialization process or per each checkpoint.
> > >
> > > > Could we add more operations in this FLIP? In our production, we
> > > > added a lot of trace reporters for job starts and scheduler
> > > > operations. They are useful if some jobs start slowly, because that
> > > > affects the job availability. For example:
> > > > - From JobManager process is started to JobGraph is created
> > > > - From JobGraph is created to JobMaster is created
> > > > - From JobMaster is created to job is running
> > > > - From the request for TMs from YARN or Kubernetes to all TMs being ready
> > > > - etc.
> > >
> > > I think those could indeed be useful. If you would like to contribute
> > > them in the future, I would be happy to review the FLIP for it :)
> > >
> > > > Of course, it is fine for me if this FLIP doesn't include them. The
> > > > first version only initializes the interface and common operations,
> > > > and we can add more operations in the future.
> > >
> > > Yes, that's exactly my thinking :)
> > >
> > > Best,
> > > Piotrek
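As a rough illustration of the single, self-contained span per checkpoint or recovery that Piotr describes above; this uses a made-up stand-in class, not the Span interface actually proposed in FLIP-384:

    import java.util.HashMap;
    import java.util.Map;

    public final class SelfContainedSpanExample {

        // Made-up stand-in for a single, self-contained span: scope, name,
        // start/end timestamps and a flat map of attributes - no parent_id.
        record SimpleSpan(
                String scope, String name, long startTsMillis, long endTsMillis,
                Map<String, Object> attributes) {}

        // One span for the whole job's recovery/initialization, created in a
        // single place after the fact. Scope, name and attribute key are
        // illustrative.
        static SimpleSpan jobInitializationSpan(
                long startTsMillis, long endTsMillis, long restoredStateBytes) {
            Map<String, Object> attributes = new HashMap<>();
            attributes.put("restoredStateBytes", restoredStateBytes);
            return new SimpleSpan(
                    "org.apache.flink.runtime.checkpoint",
                    "JobInitialization",
                    startTsMillis,
                    endTsMillis,
                    attributes);
        }
    }

Because such a span is reported once, from a single place, nothing in it needs a parent_id - which is the limitation Zakelly asks about above.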
> > >
> > > wt., 7 lis 2023 o 10:05 Rui Fan <1996fan...@gmail.com> napisał(a):
> > >
> > > > Hi Piotr,
> > > >
> > > > Thanks for driving this proposal! The trace reporter is useful for
> > > > checking a lot of duration monitors inside of Flink.
> > > >
> > > > I have some questions about this proposal:
> > > >
> > > > 1. I see the trace just supports Span? Does it support trace events?
> > > > I'm not sure whether tracing events is reasonable for TraceReporter.
> > > > If it supports, Flink can report checkpoints and checkpoint paths
> > > > proactively. Currently, checkpoint lists or the latest checkpoint can
> > > > only be fetched by external components or platforms. And reporting is
> > > > more timely and efficient than fetching.
> > > >
> > > > 2. This FLIP just monitors the checkpoint and task recovery, right?
> > > > Could we add more operations in this FLIP? In our production, we
> > > > added a lot of trace reporters for job starts and scheduler
> > > > operations. They are useful if some jobs start slowly, because that
> > > > affects the job availability. For example:
> > > > - From JobManager process is started to JobGraph is created
> > > > - From JobGraph is created to JobMaster is created
> > > > - From JobMaster is created to job is running
> > > > - From the request for TMs from YARN or Kubernetes to all TMs being ready
> > > > - etc.
> > > >
> > > > Of course, it is fine for me if this FLIP doesn't include them. The
> > > > first version only initializes the interface and common operations,
> > > > and we can add more operations in the future.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <pnowoj...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi all!
> > > > >
> > > > > I would like to start a discussion on FLIP-384: Introduce
> > > > > TraceReporter and use it to create checkpointing and recovery
> > > > > traces [1].
> > > > >
> > > > > This proposal intends to improve the observability of Flink's
> > > > > checkpointing and recovery/initialization operations by adding
> > > > > support for reporting traces from Flink. In the future, reporting
> > > > > traces can of course also be used for other use cases and by users.
> > > > >
> > > > > There are also two follow-up FLIPs, FLIP-385 [2] and FLIP-386 [3],
> > > > > which expand the basic functionality introduced in FLIP-384 [1].
> > > > >
> > > > > Please let me know what you think!
> > > > >
> > > > > Best,
> > > > > Piotr Nowojski
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > > > [2]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > > > > [3]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
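For readers wondering what a reporter such as the OpenTelemetryTraceReporter from FLIP-385 [2] would roughly have to do, here is a hedged sketch of mapping an already-finished checkpoint/recovery span onto the OpenTelemetry Java API. Only the OpenTelemetry calls themselves are real; how the reporter obtains its data, and the attribute name, are assumptions:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import java.util.concurrent.TimeUnit;

    public final class OtelSpanMappingSketch {

        // Sketch: convert a finished Flink-side span (start/end already known)
        // into an OpenTelemetry span. Attribute names are illustrative.
        static void report(String name, long startTsMillis, long endTsMillis, long checkpointId) {
            Tracer tracer = GlobalOpenTelemetry.get().getTracer("flink");
            Span span = tracer.spanBuilder(name)
                    .setStartTimestamp(startTsMillis, TimeUnit.MILLISECONDS)
                    .startSpan();
            span.setAttribute("checkpointId", checkpointId);
            // End the span with the recorded end timestamp rather than "now".
            span.end(endTsMillis, TimeUnit.MILLISECONDS);
        }
    }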