Re: [PR] HDDS-11376. Improve ReplicationSupervisor to record replication metrics [ozone]

via GitHub Tue, 10 Sep 2024 04:30:36 -0700


sodonnel commented on PR #7140:
URL: https://github.com/apache/ozone/pull/7140#issuecomment-2340407654


   > Hi @jianghuazhu thanks for working on this. This will be useful for 
container reconciliation in 
[HDDS-11345](https://issues.apache.org/jira/browse/HDDS-11345) as well. I think 
we still have to work out the differences in how metrics are tracked in between 
datanode commands that go through the replication supervisor and those that are 
handled outside of it. For example in `ReplicateContainerCommandHandler`:
   > 
   >     * `totalTime` and `invocationCount` are never assigned. Their related 
metrics will always be 0.
   >       
   >       * There is no analogous metric for execution times in the supervisor 
to replace these.
   > 
   >     * Queued count is pulled from the supervisor and propagated to the 
command handler metrics.
   >       
   >       * The supervisor does not expose this metric directly, even though 
it tracks it.
   > 
   > 
   > Basically the way this change is implemented I think it is confusing 
whether to look for a metric in the replication supervisor or the command 
handler. IMO all datanode commands should properly fill in a core common set of 
metrics, and the replication supervisor metrics should extend that with 
additional data for replication specific commands.
   
   I had a look, and these problems of consistency are pre-existing. I think 
this PR is an improvement, but there is scope for further changes to bring in 
the total time and invocation count to the supervisor metrics.
   
   The way the code is structured, the ReplicationContainer command handler 
simply passes the task to the supervisor, so it cannot track the total time of 
execution. It could in theory pull the metric from the supervisor on demand, if 
the supervisor tracked it.
   
   In some respects, replication commands are different to other commands on 
the DN, so it does somewhat make sense that they have their own metrics from 
the supervisor.
   
   In summary, I am happy with this change and will give it a +1, but it would 
make sense to open a Jira and ideally a further PR to add totalTime to the 
supervisor metrics per command type. That said, I am not really sure how useful 
that is - all you can get from it is an average, and containers are all 
different sizes, so its not a very useful metric to analyse the replication 
performance.
   
   @errose28 - what do you think about proceeding with committing this and 
having followup work to make things more consistent?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-11376. Improve ReplicationSupervisor to record replication metrics [ozone]

Reply via email to