sodonnel commented on PR #7140: URL: https://github.com/apache/ozone/pull/7140#issuecomment-2340407654
> Hi @jianghuazhu thanks for working on this. This will be useful for container reconciliation in [HDDS-11345](https://issues.apache.org/jira/browse/HDDS-11345) as well. I think we still have to work out the differences in how metrics are tracked in between datanode commands that go through the replication supervisor and those that are handled outside of it. For example in `ReplicateContainerCommandHandler`: > > * `totalTime` and `invocationCount` are never assigned. Their related metrics will always be 0. > > * There is no analogous metric for execution times in the supervisor to replace these. > > * Queued count is pulled from the supervisor and propagated to the command handler metrics. > > * The supervisor does not expose this metric directly, even though it tracks it. > > > Basically the way this change is implemented I think it is confusing whether to look for a metric in the replication supervisor or the command handler. IMO all datanode commands should properly fill in a core common set of metrics, and the replication supervisor metrics should extend that with additional data for replication specific commands. I had a look, and these problems of consistency are pre-existing. I think this PR is an improvement, but there is scope for further changes to bring in the total time and invocation count to the supervisor metrics. The way the code is structured, the ReplicationContainer command handler simply passes the task to the supervisor, so it cannot track the total time of execution. It could in theory pull the metric from the supervisor on demand, if the supervisor tracked it. In some respects, replication commands are different to other commands on the DN, so it does somewhat make sense that they have their own metrics from the supervisor. In summary, I am happy with this change and will give it a +1, but it would make sense to open a Jira and ideally a further PR to add totalTime to the supervisor metrics per command type. That said, I am not really sure how useful that is - all you can get from it is an average, and containers are all different sizes, so its not a very useful metric to analyse the replication performance. @errose28 - what do you think about proceeding with committing this and having followup work to make things more consistent? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
