Re: [DISCUSS] FLIP-361: Improve GC Metrics

Xintong Song Wed, 06 Sep 2023 01:08:59 -0700

>
> Just so I understand correctly, do you suggest adding a metric for
> delta(Time) / delta(Count) since the last reporting ?
> <Collector>.TimePerGc or <Collector>.AverageTime would make sense.
> AverageTime may be a bit nicer :)
>


Yes, that's what I mean.

My only concern is how useful this will be in reality. If there are only
> (or several) long pauses then the msPerSec metrics will show it already,
> and if there is a single long pause that may not be shown at all if there
> are several shorter pauses as well with this metric.


Let's say we measure this for every minute and see a 900 msPerSec (which
means 54s within the minute are spent on GC). This may come from a single
GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs with
less time each. As the default heartbeat timeout is 50s, the former means
there's likely a heartbeat timeout due to the GC pause, while the latter
means otherwise.


Mathematically, it is possible that there's 1 long pause together with
several short pauses within the same measurement period, making the long
pause not observable with AverageTime. However, from my experience, such a
pattern is not normal in reality. My observation is that GCs happen at a
similar time usually take a similar length of time. Admittedly, this is not
a hard guarantee.


Best,

Xintong



On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra <[email protected]> wrote:

> Matt Wang,
>
> I think the currently exposed info is all that is available through
> GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
> granular way of reporting the per collector metrics, that would require a
> new mechanism and may be a breaking change.
>
> We basically want to simply extend the current reporting here with the rate
> metrics and the total metrics.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <[email protected]> wrote:
>
> > Hi Gyula,
> >
> > +1 for this proposal.
> >
> > Do we need to add a metric to record the count of different
> > collectors? Now there is only a total count. For example,
> > for G1, there is no way to distinguish whether it is the
> > young generation or the old generation.
> >
> >
> >
> > --
> >
> > Best,
> > Matt Wang
> >
> >
> > ---- Replied Message ----
> > | From | Gyula Fóra<[email protected]> |
> > | Date | 09/6/2023 15:03 |
> > | To | <[email protected]> |
> > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > Thanks Xintong!
> >
> > Just so I understand correctly, do you suggest adding a metric for
> > delta(Time) / delta(Count) since the last reporting ?
> > <Collector>.TimePerGc or <Collector>.AverageTime would make sense.
> > AverageTime may be a bit nicer :)
> >
> > My only concern is how useful this will be in reality. If there are only
> > (or several) long pauses then the msPerSec metrics will show it already,
> > and if there is a single long pause that may not be shown at all if there
> > are several shorter pauses as well with this metric.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <[email protected]>
> wrote:
> >
> > Thanks for bringing this up, Gyula.
> >
> > The proposed changes make sense to me. +1 for them.
> >
> > In addition to the proposed changes, I wonder if we should also add
> > something like timePerGc? This would help understand whether there are
> long
> > pauses, due to GC STW, that may lead to rpc unresponsiveness and
> heartbeat
> > timeouts. Ideally, we'd like to understand the max pause time per STW in
> a
> > recent time window. However, I don't see an easy way to separate the
> pause
> > time of each STW. Deriving the overall time per GC from the existing
> > metrics (time-increment / count-increment) seems to be a good
> alternative.
> > WDYT?
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <[email protected]> wrote:
> >
> > Thanks for the clarification!
> >
> > By default the meterview measures for 1 minute sounds good to me!
> >
> > +1 for this proposal.
> >
> > Best,
> > Rui
> >
> > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <[email protected]> wrote:
> >
> > Thanks for the feedback Rui,
> >
> > The rates would be computed using the MeterView class (like for any
> > other
> > rate metric), just because we report the value per second it doesn't
> > mean
> > that we measure in a second granularity.
> > By default the meterview measures for 1 minute and then we calculate
> > the
> > per second rates, but we can increase the timespan if necessary.
> >
> > So I don't think we run into this problem in practice and we can keep
> > the
> > metric aligned with other time rate metrics like busyTimeMsPerSec etc.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <[email protected]> wrote:
> >
> > Hi Gyula,
> >
> > +1 for this proposal. The current GC metric is really unfriendly.
> >
> > I have a concern with your proposed rate metric: the rate is
> > perSecond
> > instead of per minute. I'm unsure whether it's suitable for GC
> > metric.
> >
> > There are two reasons why I suspect perSecond may not be well
> > compatible with GC metric:
> >
> > 1. GCs are usually infrequent and may only occur for a small number
> > of time periods within a minute.
> >
> > Metrics are collected periodically, for example, reported every
> > minute.
> > If the result reported by the GC metric is 1s/perSecond, it does not
> > mean that the GC of the TM is serious, because there may be no GC
> > in the remaining 59s.
> >
> > On the contrary, the GC metric reports 0s/perSecond, which does not
> > mean that the GC of the TM is not serious, and the GC may be very
> > serious in the remaining 59s.
> >
> > 2. Stop-the-world may cause the metric to fail(delay) to report
> >
> > The TM will stop the world during GC, especially full GC. It means
> > the metric cannot be collected or reported during full GC.
> >
> > So the collected GC metric may never be 1s/perSecond. This metric
> > may always be good because the metric will only be reported when
> > the GC is not severe.
> >
> >
> > If these concerns make sense, how about updating the GC rate
> > at minute level?
> >
> > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > this Gauge every second, it is:
> > GC Total.Time of current time - GC total time of one miunte ago.
> >
> > Best,
> > Rui
> >
> > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <[email protected]>
> > wrote:
> >
> > Hi Gyula,
> >
> > +1 The proposed changes make sense and are in line with what is
> > available for other metrics, e.g. number of records processed.
> >
> > -Max
> >
> > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <[email protected]>
> > wrote:
> >
> > Hi Devs,
> >
> > I would like to start a discussion on FLIP-361: Improve GC
> > Metrics
> > [1].
> >
> > The current Flink GC metrics [2] are not very useful for
> > monitoring
> > purposes as they require post processing logic that is also
> > dependent
> > on
> > the current runtime environment.
> >
> > Problems:
> > - Total time is not very relevant for long running applications,
> > only
> > the
> > rate of change (msPerSec)
> > - In most cases it's best to simply aggregate the time/count
> > across
> > the
> > different GabrageCollectors, however the specific collectors are
> > dependent
> > on the current Java runtime
> >
> > We propose to improve the current situation by:
> > - Exposing rate metrics per GarbageCollector
> > - Exposing aggregated Total time/count/rate metrics
> >
> > These new metrics are all derived from the existing ones with
> > minimal
> > overhead.
> >
> > Looking forward to your feedback.
> >
> > Cheers,
> > Gyula
> >
> > [1]
> >
> >
> >
> >
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > [2]
> >
> >
> >
> >
> >
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> >
> >
> >
> >
> >
> >
>

Re: [DISCUSS] FLIP-361: Improve GC Metrics

Reply via email to