Matt Wang, I think the currently exposed info is all that is available through GarbageCollectorMXBean. This FLIP does not aim to introduce a new more granular way of reporting the per collector metrics, that would require a new mechanism and may be a breaking change.
We basically want to simply extend the current reporting here with the rate metrics and the total metrics. Gyula On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <wang...@163.com> wrote: > Hi Gyula, > > +1 for this proposal. > > Do we need to add a metric to record the count of different > collectors? Now there is only a total count. For example, > for G1, there is no way to distinguish whether it is the > young generation or the old generation. > > > > -- > > Best, > Matt Wang > > > ---- Replied Message ---- > | From | Gyula Fóra<gyula.f...@gmail.com> | > | Date | 09/6/2023 15:03 | > | To | <dev@flink.apache.org> | > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics | > Thanks Xintong! > > Just so I understand correctly, do you suggest adding a metric for > delta(Time) / delta(Count) since the last reporting ? > <Collector>.TimePerGc or <Collector>.AverageTime would make sense. > AverageTime may be a bit nicer :) > > My only concern is how useful this will be in reality. If there are only > (or several) long pauses then the msPerSec metrics will show it already, > and if there is a single long pause that may not be shown at all if there > are several shorter pauses as well with this metric. > > Gyula > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <tonysong...@gmail.com> wrote: > > Thanks for bringing this up, Gyula. > > The proposed changes make sense to me. +1 for them. > > In addition to the proposed changes, I wonder if we should also add > something like timePerGc? This would help understand whether there are long > pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat > timeouts. Ideally, we'd like to understand the max pause time per STW in a > recent time window. However, I don't see an easy way to separate the pause > time of each STW. Deriving the overall time per GC from the existing > metrics (time-increment / count-increment) seems to be a good alternative. > WDYT? > > Best, > > Xintong > > > > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote: > > Thanks for the clarification! > > By default the meterview measures for 1 minute sounds good to me! > > +1 for this proposal. > > Best, > Rui > > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <gyula.f...@gmail.com> wrote: > > Thanks for the feedback Rui, > > The rates would be computed using the MeterView class (like for any > other > rate metric), just because we report the value per second it doesn't > mean > that we measure in a second granularity. > By default the meterview measures for 1 minute and then we calculate > the > per second rates, but we can increase the timespan if necessary. > > So I don't think we run into this problem in practice and we can keep > the > metric aligned with other time rate metrics like busyTimeMsPerSec etc. > > Cheers, > Gyula > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote: > > Hi Gyula, > > +1 for this proposal. The current GC metric is really unfriendly. > > I have a concern with your proposed rate metric: the rate is > perSecond > instead of per minute. I'm unsure whether it's suitable for GC > metric. > > There are two reasons why I suspect perSecond may not be well > compatible with GC metric: > > 1. GCs are usually infrequent and may only occur for a small number > of time periods within a minute. > > Metrics are collected periodically, for example, reported every > minute. > If the result reported by the GC metric is 1s/perSecond, it does not > mean that the GC of the TM is serious, because there may be no GC > in the remaining 59s. > > On the contrary, the GC metric reports 0s/perSecond, which does not > mean that the GC of the TM is not serious, and the GC may be very > serious in the remaining 59s. > > 2. Stop-the-world may cause the metric to fail(delay) to report > > The TM will stop the world during GC, especially full GC. It means > the metric cannot be collected or reported during full GC. > > So the collected GC metric may never be 1s/perSecond. This metric > may always be good because the metric will only be reported when > the GC is not severe. > > > If these concerns make sense, how about updating the GC rate > at minute level? > > We can define the type to Gauge for TimeMsPerMiunte, and updating > this Gauge every second, it is: > GC Total.Time of current time - GC total time of one miunte ago. > > Best, > Rui > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <m...@apache.org> > wrote: > > Hi Gyula, > > +1 The proposed changes make sense and are in line with what is > available for other metrics, e.g. number of records processed. > > -Max > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <gyula.f...@gmail.com> > wrote: > > Hi Devs, > > I would like to start a discussion on FLIP-361: Improve GC > Metrics > [1]. > > The current Flink GC metrics [2] are not very useful for > monitoring > purposes as they require post processing logic that is also > dependent > on > the current runtime environment. > > Problems: > - Total time is not very relevant for long running applications, > only > the > rate of change (msPerSec) > - In most cases it's best to simply aggregate the time/count > across > the > different GabrageCollectors, however the specific collectors are > dependent > on the current Java runtime > > We propose to improve the current situation by: > - Exposing rate metrics per GarbageCollector > - Exposing aggregated Total time/count/rate metrics > > These new metrics are all derived from the existing ones with > minimal > overhead. > > Looking forward to your feedback. > > Cheers, > Gyula > > [1] > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics > [2] > > > > > > > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection > > > > > >