Re: [DISCUSS] FLIP-361: Improve GC Metrics

Gyula Fóra Wed, 06 Sep 2023 01:01:14 -0700

Matt Wang,

I think the currently exposed info is all that is available through
GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
granular way of reporting the per collector metrics, that would require a
new mechanism and may be a breaking change.


We basically want to simply extend the current reporting here with the rate
metrics and the total metrics.

Gyula

On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <[email protected]> wrote:

> Hi Gyula,
>
> +1 for this proposal.
>
> Do we need to add a metric to record the count of different
> collectors? Now there is only a total count. For example,
> for G1, there is no way to distinguish whether it is the
> young generation or the old generation.
>
>
>
> --
>
> Best,
> Matt Wang
>
>
> ---- Replied Message ----
> | From | Gyula Fóra<[email protected]> |
> | Date | 09/6/2023 15:03 |
> | To | <[email protected]> |
> | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> Thanks Xintong!
>
> Just so I understand correctly, do you suggest adding a metric for
> delta(Time) / delta(Count) since the last reporting ?
> <Collector>.TimePerGc or <Collector>.AverageTime would make sense.
> AverageTime may be a bit nicer :)
>
> My only concern is how useful this will be in reality. If there are only
> (or several) long pauses then the msPerSec metrics will show it already,
> and if there is a single long pause that may not be shown at all if there
> are several shorter pauses as well with this metric.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <[email protected]> wrote:
>
> Thanks for bringing this up, Gyula.
>
> The proposed changes make sense to me. +1 for them.
>
> In addition to the proposed changes, I wonder if we should also add
> something like timePerGc? This would help understand whether there are long
> pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
> timeouts. Ideally, we'd like to understand the max pause time per STW in a
> recent time window. However, I don't see an easy way to separate the pause
> time of each STW. Deriving the overall time per GC from the existing
> metrics (time-increment / count-increment) seems to be a good alternative.
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <[email protected]> wrote:
>
> Thanks for the clarification!
>
> By default the meterview measures for 1 minute sounds good to me!
>
> +1 for this proposal.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <[email protected]> wrote:
>
> Thanks for the feedback Rui,
>
> The rates would be computed using the MeterView class (like for any
> other
> rate metric), just because we report the value per second it doesn't
> mean
> that we measure in a second granularity.
> By default the meterview measures for 1 minute and then we calculate
> the
> per second rates, but we can increase the timespan if necessary.
>
> So I don't think we run into this problem in practice and we can keep
> the
> metric aligned with other time rate metrics like busyTimeMsPerSec etc.
>
> Cheers,
> Gyula
>
> On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <[email protected]> wrote:
>
> Hi Gyula,
>
> +1 for this proposal. The current GC metric is really unfriendly.
>
> I have a concern with your proposed rate metric: the rate is
> perSecond
> instead of per minute. I'm unsure whether it's suitable for GC
> metric.
>
> There are two reasons why I suspect perSecond may not be well
> compatible with GC metric:
>
> 1. GCs are usually infrequent and may only occur for a small number
> of time periods within a minute.
>
> Metrics are collected periodically, for example, reported every
> minute.
> If the result reported by the GC metric is 1s/perSecond, it does not
> mean that the GC of the TM is serious, because there may be no GC
> in the remaining 59s.
>
> On the contrary, the GC metric reports 0s/perSecond, which does not
> mean that the GC of the TM is not serious, and the GC may be very
> serious in the remaining 59s.
>
> 2. Stop-the-world may cause the metric to fail(delay) to report
>
> The TM will stop the world during GC, especially full GC. It means
> the metric cannot be collected or reported during full GC.
>
> So the collected GC metric may never be 1s/perSecond. This metric
> may always be good because the metric will only be reported when
> the GC is not severe.
>
>
> If these concerns make sense, how about updating the GC rate
> at minute level?
>
> We can define the type to Gauge for TimeMsPerMiunte, and updating
> this Gauge every second, it is:
> GC Total.Time of current time - GC total time of one miunte ago.
>
> Best,
> Rui
>
> On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <[email protected]>
> wrote:
>
> Hi Gyula,
>
> +1 The proposed changes make sense and are in line with what is
> available for other metrics, e.g. number of records processed.
>
> -Max
>
> On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <[email protected]>
> wrote:
>
> Hi Devs,
>
> I would like to start a discussion on FLIP-361: Improve GC
> Metrics
> [1].
>
> The current Flink GC metrics [2] are not very useful for
> monitoring
> purposes as they require post processing logic that is also
> dependent
> on
> the current runtime environment.
>
> Problems:
> - Total time is not very relevant for long running applications,
> only
> the
> rate of change (msPerSec)
> - In most cases it's best to simply aggregate the time/count
> across
> the
> different GabrageCollectors, however the specific collectors are
> dependent
> on the current Java runtime
>
> We propose to improve the current situation by:
> - Exposing rate metrics per GarbageCollector
> - Exposing aggregated Total time/count/rate metrics
>
> These new metrics are all derived from the existing ones with
> minimal
> overhead.
>
> Looking forward to your feedback.
>
> Cheers,
> Gyula
>
> [1]
>
>
>
>
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> [2]
>
>
>
>
>
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
>
>
>
>
>
>

Re: [DISCUSS] FLIP-361: Improve GC Metrics

Reply via email to