[jira] [Comment Edited] (HDDS-12207) Unify output of `ozone debug replicas verify` checks

Ethan Rose (Jira) Mon, 24 Mar 2025 08:48:30 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-12207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937254#comment-17937254
 ]


Ethan Rose edited comment on HDDS-12207 at 3/24/25 3:47 PM:
------------------------------------------------------------

Thanks for looking at this [~sarvekshayr]. I took a stab at some ideas for 
output too and came up with something like this:
{code}
{
    "keys": [
        {
            // Split volume and bucket for easier post-processing. This also matches key list output format.
            "volumeName": "vol",
            "bucketName": "bucket",
            "name": "HISTORY.md",
            // Set to false if any of the replica's checks failed, or any block had no replicas found.
            "pass": false,
            "blocks": [
                {
                    "containerID": 1,
                    "blockID": 123,
                    "replicas": [
                        {
                            "datanode": {
                                "uuid": "123-456",
                                "hostname": "dn1"
                            },
                            "checks": [
                                {
                                    "type": "checksums",
                                    "pass": false,
                                    "failures": [
                                        {
                                            // The block was found in this container replica.
                                            "present": true,
                                            // Comes from the checksum exception thrown out of the block input stream. May also be block not found.
                                            "message": "Inconsistent read for chunk=123 len=10 bytesRead=5"
                                        }
                                    ]
                                },
                                {
                                    "type": "block existence",
                                    "pass": false,
                                    "failures": [
                                        {
                                            // It's possible that the getBlock call failed for a different reason other than the block being missing. We can write that here.
                                            "message": ""
                                        }
                                    ]
                                },
                                {
                                    "type": "container states",
                                    "pass": false,
                                    "failures": [
                                        {
                                            // This check works on both SCM and the replicas, so scm state would end up duplicated among each replica's output in this layout.
                                            // SCM states of DELETING or DELETED would trigger a failure. Missing containers would already have an empty replica list as described above.
                                            "scmState": "CLOSED",
                                            "present": true,
                                            // Use the datanodes' readContainer API instead of SCM's getContainer API for the most up to date info.
                                            // UNHEALTHY would currently be the only replica state to count as a failure.
                                            "replicaState": "UNHEALTHY"
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
    // Populated at the end to quickly see if there were any failures. False here because the key above failed.
    "pass": false
}
{code}
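
As a rough illustration of how the {{pass}} fields in the layout above could be 
rolled up, here is a minimal Python sketch. It assumes a key fails when any 
block has an empty replica list or any replica has a failing check, and that 
the top-level {{pass}} is true only when every key passes. The function names 
are hypothetical and are not part of the Ozone code base.

```python
from typing import Any


def replica_passes(replica: dict[str, Any]) -> bool:
    """A replica passes only if every check that ran against it passed."""
    return all(check["pass"] for check in replica["checks"])


def key_passes(key: dict[str, Any]) -> bool:
    """A key fails if any block has no replicas or any replica check failed."""
    for block in key["blocks"]:
        replicas = block["replicas"]
        if not replicas:
            # No replicas were found for this block.
            return False
        if not all(replica_passes(replica) for replica in replicas):
            return False
    return True


def summarize(report: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the report with per-key and top-level "pass" set."""
    keys = [{**key, "pass": key_passes(key)} for key in report["keys"]]
    # Assumption: the top-level flag is true only when every key passed.
    return {"keys": keys, "pass": all(key["pass"] for key in keys)}
```

Computing the top-level flag last means it can be derived while streaming 
keys, which matches the "populated at the end" note in the example.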

We can make the output print minimal information by default, and use additional 
flags to add information. For example, instead of requiring 
{{--failures-only}} to print only the failing keys, we can print only failures 
by default. Passing {{--all}} would print results for all keys, both passing 
and failing. The extra {{failures}} information would be omitted unless 
{{--verbose}} is passed.
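
The flag behavior described above can be sketched as a filter over a finished 
report. This is only an illustration in Python: {{filter_report}} and its 
{{show_all}} and {{verbose}} parameters are hypothetical stand-ins for the 
proposed {{--all}} and {{--verbose}} flags, and the report is assumed to follow 
the JSON layout proposed in this comment.

```python
import copy
from typing import Any


def filter_report(report: dict[str, Any],
                  show_all: bool = False,
                  verbose: bool = False) -> dict[str, Any]:
    """Trim a report according to the proposed defaults.

    By default only failing keys are kept and the per-check "failures"
    details are dropped. show_all keeps passing keys as well, and verbose
    keeps the "failures" details.
    """
    result = copy.deepcopy(report)  # Leave the caller's report untouched.
    if not show_all:
        result["keys"] = [key for key in result["keys"] if not key["pass"]]
    if not verbose:
        for key in result["keys"]:
            for block in key["blocks"]:
                for replica in block["replicas"]:
                    for check in replica["checks"]:
                        check.pop("failures", None)
    return result
```

Calling {{filter_report(report)}} would give the minimal default view, while 
{{filter_report(report, show_all=True, verbose=True)}} would return everything.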


> Unify output of `ozone debug replicas verify` checks
> ----------------------------------------------------
>
>                 Key: HDDS-12207
>                 URL: https://issues.apache.org/jira/browse/HDDS-12207
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Sarveksha Yeshavantha Raju
>            Priority: Major
>
> Make {{ozone debug replicas verify}} output JSON information about each key 
> and the checks that were run on it. This could optionally be streamed to 
> stdout or broken up into multiple files as specified by the user. As new 
> checks are added, their results will be included in the same JSON objects. We 
> can also add an option to skip output for keys that passed all the specified 
> checks.



