Hi Ozone Devs,

I am currently working on HDDS-7364 <https://issues.apache.org/jira/browse/HDDS-7364> to get Ozone's container scanner to a point where it can be enabled by default. The container scanner will check container block data and metadata in the background to identify corruption, mark containers unhealthy, and notify SCM so a healthy replica can be copied.

One of the subtasks is HDDS-8062 <https://issues.apache.org/jira/browse/HDDS-8062>, which is to provide a way to track why containers were marked unhealthy and to persist that information so it can be referenced later. Datanode application logs can roll too frequently for this purpose, so I propose adding a new log to the datanode to track container replica state transitions. This log would provide useful debugging insight not just for the scanner, but for any other replica-related issues that may originate on the datanodes.

The design doc is attached to HDDS-8062 <https://issues.apache.org/jira/browse/HDDS-8062> and is also available directly here <https://issues.apache.org/jira/secure/attachment/13058801/container_log_v1.pdf>. This will add a new log and new debugging capabilities to Ozone, so please review and provide any feedback on this thread or the Jira.

Thanks,
Ethan