Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Hangxiang Yu Wed, 08 Nov 2023 20:28:09 -0800

Hi, Piotr.
Thanks for the proposal.
Just as we discussed in FLINK-23411, +1 for supporting trace/span to
monitor metrics like checkpoint and recovery.


We could also do many things based on this mechanism:
1. more fine-grained metrics about checkpoint and recovery. For example,
some stage info about unaligned checkpoint or generic increment checkpoint
(changelog) which is a bit difficult to report in the current metric system.
2. users could also report their own-defined operator metrics to their own
distributed tracing system which may be traced together with other jobs or
systems.

Look forward to this feature!


On Thu, Nov 9, 2023 at 11:40 AM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Piotr,
>
> Thanks for your detailed explanation! I could see the challenge of
> implementing traces with multiple spans and agree to put it in the future
> work. I personally prefer the idea of generating multi span traces for
> checkpoints on the JM only.
>
> > I'm not sure if I understand the proposal - I don't know how traces could
> > be used for this purpose?
> > Traces are perfect for one of events (like checkpointing, recovery, etc),
> > not for continuous monitoring
> > (like processing records). That's what metrics are. Creating trace and
> > span(s) per each record would
> > be prohibitively expensive.
>
> My original thought was to show how much time a sampled record is processed
> within each operator in stream processing. By saying 'sampled', I mean we
> won't generate a trace for every record due to the high cost involved.
> Instead, we could only trace ONE record from source when the user requests
> it (via REST API or Web UI) or when triggered periodically at a very low
> frequency. However after re-thinking my idea, I realized it's hard to
> define the whole lifecycle of a record since it is transformed into
> different forms among operators. We could discuss this in future after the
> basic trace is implemented in Flink.
>
> > Unless you mean in batch/bounded jobs? Then yes, we could create a
> bounded
> > job trace, with spans
> > for every stage/task/subtask.
>
> Oh yes, batch jobs could definitely leverage the trace.
>
> Best,
> Zakelly
>
>
> On Wed, Nov 8, 2023 at 9:18 PM Jinzhong Li <lijinzhong2...@gmail.com>
> wrote:
>
> > Hi Piotr,
> >
> > Thanks for driving this proposal!   I strongly agree that the existing
> > metric APIs are not suitable for monitoring restore/checkpoint behavior!
> >
> > I think the TM-level recovery/checkpointing traces are necessary in the
> > future. In our production environment, we sometimes encounter that job
> > recovery time is very long (30min+), due to several subTask heavy disk
> > traffic. The TM-level recovery trace is helpful for troubleshooting such
> > issues.
> >
> > Best
> > Jinzhong
> >
> > On Wed, Nov 8, 2023 at 5:09 PM Piotr Nowojski <pnowoj...@apache.org>
> > wrote:
> >
> > > Hi Zakelly,
> > >
> > > Thanks for the comments. Quick answer for both of your questions would
> be
> > > that it probably should be
> > > left as a future work. For more detailed answers please take a look
> below
> > > :)
> > >
> > > > Does it mean the inclusion and subdivision relationships of spans
> > defined
> > > > by "parent_id" are not supported? I think it is a very necessary
> > feature
> > > > for the trace.
> > >
> > > Yes exactly, that is the current limitation. This could be solved
> somehow
> > > one way or another in the future.
> > >
> > > Support for reporting multi span traces all at once - for example
> > > `CheckpointStatsTracker` running JM,
> > > could upon checkpoint completion create in one place the whole
> structure
> > of
> > > parent spans, to have for
> > > example one span per each subtask. This would be a relatively easy
> follow
> > > up.
> > >
> > > However, if we would like to create true distributed traces, with spans
> > > reported from many different
> > > components, potentially both on JM and TM, the problem is a bit deeper.
> > The
> > > issue in that case is how
> > > to actually fill out `parrent_id` and `trace_id`? Passing some context
> > > entity as a java object would be
> > > unfeasible. That would require too many changes in too many places. I
> > think
> > > the only realistic way
> > > to do it, would be to have a deterministic generator of `parten_id` and
> > > `trace_id` values.
> > >
> > > For example we could create the parent trace/span of the checkpoint on
> > JM,
> > > and set those ids to
> > > something like: `jobId#attemptId#checkpointId`. Each subtask then could
> > > re-generate those ids
> > > and subtasks' checkpoint span would have an id of
> > > `jobId#attemptId#checkpointId#subTaskId`.
> > > Note that this is just an example, as most likely distributed spans for
> > > checkpointing do not make
> > > sense, as we can generate them much easier on the JM anyway.
> > >
> > > > In addition to checkpoint and recovery, I believe the trace would
> also
> > be
> > > > valuable for performance tuning. If Flink can trace and visualize the
> > > time
> > > > cost of each operator and stage for a sampled record, users would be
> > able
> > > > to easily determine the end-to-end latency and identify performance
> > > issues
> > > > for optimization. Looking forward to seeing these in the future.
> > >
> > > I'm not sure if I understand the proposal - I don't know how traces
> could
> > > be used for this purpose?
> > > Traces are perfect for one of events (like checkpointing, recovery,
> etc),
> > > not for continuous monitoring
> > > (like processing records). That's what metrics are. Creating trace and
> > > span(s) per each record would
> > > be prohibitively expensive.
> > >
> > > Unless you mean in batch/bounded jobs? Then yes, we could create a
> > bounded
> > > job trace, with spans
> > > for every stage/task/subtask.
> > >
> > > Best,
> > > Piotrek
> > >
> > >
> > > śr., 8 lis 2023 o 05:30 Zakelly Lan <zakelly....@gmail.com>
> napisał(a):
> > >
> > > > Hi Piotr,
> > > >
> > > > Happy to see the trace! Thanks for this proposal.
> > > >
> > > > One minor question: It is mentioned in the interface of Span:
> > > >
> > > > Currently we don't support traces with multiple spans. Each span is
> > > > > self-contained and represents things like a checkpoint or recovery.
> > > >
> > > >
> > > > Does it mean the inclusion and subdivision relationships of spans
> > defined
> > > > by "parent_id" are not supported? I think it is a very necessary
> > feature
> > > > for the trace.
> > > >
> > > > In addition to checkpoint and recovery, I believe the trace would
> also
> > be
> > > > valuable for performance tuning. If Flink can trace and visualize the
> > > time
> > > > cost of each operator and stage for a sampled record, users would be
> > able
> > > > to easily determine the end-to-end latency and identify performance
> > > issues
> > > > for optimization. Looking forward to seeing these in the future.
> > > >
> > > > Best,
> > > > Zakelly
> > > >
> > > >
> > > > On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski <pnowoj...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Rui,
> > > > >
> > > > > Thanks for the comments!
> > > > >
> > > > > > 1. I see the trace just supports Span? Does it support trace
> > events?
> > > > > > I'm not sure whether tracing events is reasonable for
> > TraceReporter.
> > > > > > If it supports, flink can report checkpoint and checkpoint path
> > > > > proactively.
> > > > > > Currently, checkpoint lists or the latest checkpoint can only be
> > > > fetched
> > > > > > by external components or platforms. And report is more timely
> and
> > > > > > efficient than fetch.
> > > > >
> > > > > No, currently the `TraceReporter` that I'm introducing supports
> only
> > > > single
> > > > > span traces.
> > > > > So currently neither events on their own, nor events inside spans
> are
> > > not
> > > > > supported.
> > > > > This is done just for the sake of simplicity, and test out the
> basic
> > > > > functionality. But I think,
> > > > > those currently missing features should be added at some point in
> > > > > the future.
> > > > >
> > > > > About structured logging (basically events?) I vaguely remember
> some
> > > > > discussions about
> > > > > that. It might be a much larger topic, so I would prefer to leave
> it
> > > out
> > > > of
> > > > > the scope of this
> > > > > FLIP.
> > > > >
> > > > > > 2. This FLIP just monitors the checkpoint and task recovery,
> right?
> > > > >
> > > > > Yes, it only adds single span traces for checkpointing and
> > > > > recovery/initialisation - one
> > > > > span per whole job per either recovery/initialization process or
> per
> > > each
> > > > > checkpoint.
> > > > >
> > > > > > Could we add more operations in this FLIP? In our production, we
> > > > > > added a lot of trace reporters for job starts and scheduler
> > > operation.
> > > > > > They are useful if some jobs start slowly, because they will
> affect
> > > > > > the job availability. For example:
> > > > > > - From JobManager process is started to JobGraph is created
> > > > > > - From JobGraph is created to JobMaster is created
> > > > > > - From JobMaster is created to job is running
> > > > > > - From start request tm from yarn or kubernetes to all tms are
> > ready
> > > > > > - etc
> > > > >
> > > > > I think those could be indeed useful. If you would like to
> contribute
> > > > them
> > > > > in the future,
> > > > > I would be happy to review the FLIP for it :)
> > > > >
> > > > > > Of course, this FLIP doesn't include them is fine for me. The
> first
> > > > > version
> > > > > > only initializes the interface and common operations, and we can
> > add
> > > > > > more operations in the future
> > > > >
> > > > > Yes, that's exactly my thinking :)
> > > > >
> > > > > Best,
> > > > > Piotrek
> > > > >
> > > > > wt., 7 lis 2023 o 10:05 Rui Fan <1996fan...@gmail.com> napisał(a):
> > > > >
> > > > > > Hi Piotr,
> > > > > >
> > > > > > Thanks for driving this proposal! The trace reporter is useful to
> > > > > > check a lot of duration monitors inside of Flink.
> > > > > >
> > > > > > I have some questions about this proposal:
> > > > > >
> > > > > > 1. I see the trace just supports Span? Does it support trace
> > events?
> > > > > > I'm not sure whether tracing events is reasonable for
> > TraceReporter.
> > > > > > If it supports, flink can report checkpoint and checkpoint path
> > > > > > proactively.
> > > > > > Currently, checkpoint lists or the latest checkpoint can only be
> > > > fetched
> > > > > > by external components or platforms. And report is more timely
> and
> > > > > > efficient than fetch.
> > > > > >
> > > > > > 2. This FLIP just monitors the checkpoint and task recovery,
> right?
> > > > > > Could we add more operations in this FLIP? In our production, we
> > > > > > added a lot of trace reporters for job starts and scheduler
> > > operation.
> > > > > > They are useful if some jobs start slowly, because they will
> affect
> > > > > > the job availability. For example:
> > > > > > - From JobManager process is started to JobGraph is created
> > > > > > - From JobGraph is created to JobMaster is created
> > > > > > - From JobMaster is created to job is running
> > > > > > - From start request tm from yarn or kubernetes to all tms are
> > ready
> > > > > > - etc
> > > > > >
> > > > > > Of course, this FLIP doesn't include them is fine for me. The
> first
> > > > > version
> > > > > > only initializes the interface and common operations, and we can
> > add
> > > > > > more operations in the future.
> > > > > >
> > > > > > Best,
> > > > > > Rui
> > > > > >
> > > > > > On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <
> > pnowoj...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all!
> > > > > > >
> > > > > > > I would like to start a discussion on FLIP-384: Introduce
> > > > TraceReporter
> > > > > > and
> > > > > > > use it to create checkpointing and recovery traces [1].
> > > > > > >
> > > > > > > This proposal intends to improve observability of Flink's
> > > > Checkpointing
> > > > > > and
> > > > > > > Recovery/Initialization operations, by adding support for
> > reporting
> > > > > > traces
> > > > > > > from Flink. In the future, reporting traces can be of course
> used
> > > for
> > > > > > other
> > > > > > > use cases and also by users.
> > > > > > >
> > > > > > > There are also two other follow up FLIPS, FLIP-385 [2] and
> > FLIP-386
> > > > > [3],
> > > > > > > which expand the basic functionality introduced in FLIP-384
> [1].
> > > > > > >
> > > > > > > Please let me know what you think!
> > > > > > >
> > > > > > > Best,
> > > > > > > Piotr Nowojski
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > > > > > [2]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > > > > > > [3]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Hangxiang.

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Reply via email to