[
https://issues.apache.org/jira/browse/HDDS-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789019#comment-17789019
]
Christos Bisias commented on HDDS-9645:
---------------------------------------
{quote}you started your docker cluster with 3 DNs only initially ?
{quote}
[~deveshsingh] Sorry, that was a typo. I started with 5 datanodes. I will
update the description.
* Before decommissioning any nodes
** Replicas are on 3 datanodes
{code:java}
bash-4.2$ ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
bash-4.2$ ozone admin container list
{
"state" : "OPEN",
"stateEnterTime" : "2023-11-23T08:16:38.013Z",
"replicationConfig" : {
"replicationFactor" : "THREE",
"replicationType" : "RATIS"
},
"usedBytes" : 0,
"numberOfKeys" : 0,
"lastUsed" : "2023-11-23T08:17:02.803381Z",
"owner" : "omServiceIdDefault",
"containerID" : 1,
"deleteTransactionId" : 0,
"sequenceId" : 2,
"open" : true,
"deleted" : false
}
bash-4.2$ ozone admin container info 1
Container id: 1
Pipeline id: 0ef12a4b-76be-477e-9597-979f58b10607
Write PipelineId: 0ef12a4b-76be-477e-9597-979f58b10607
Write Pipeline State: OPEN
Container State: OPEN
Datanodes: [2da88b54-806f-4efe-9d04-bad55bfc342c/ozone-datanode-2.ozone_default,
c515b485-ab4f-4c45-8852-dfdabe0d28cb/ozone-datanode-3.ozone_default,
ba8a1675-087e-447c-9679-10daff9d9621/ozone-datanode-4.ozone_default]
Replicas: [State: OPEN; ReplicaIndex: 0; Origin:
ba8a1675-087e-447c-9679-10daff9d9621; Location:
ba8a1675-087e-447c-9679-10daff9d9621/ozone-datanode-4.ozone_default,
State: OPEN; ReplicaIndex: 0; Origin: 2da88b54-806f-4efe-9d04-bad55bfc342c;
Location: 2da88b54-806f-4efe-9d04-bad55bfc342c/ozone-datanode-2.ozone_default,
State: OPEN; ReplicaIndex: 0; Origin: c515b485-ab4f-4c45-8852-dfdabe0d28cb;
Location: c515b485-ab4f-4c45-8852-dfdabe0d28cb/ozone-datanode-3.ozone_default]
{code}
* After decommissioning 2 datanodes
** Replicas are on 5 datanodes
{code:java}
bash-4.2$ ozone admin container info 1
Container id: 1
Pipeline id: 5f92191a-5298-408b-9929-d2667f76dd67
Write PipelineId: 0ef12a4b-76be-477e-9597-979f58b10607
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [ba8a1675-087e-447c-9679-10daff9d9621/ozone-datanode-4.ozone_default,
d7951798-d5d0-4e89-b12d-23c4340e865c/ozone-datanode-1.ozone_default,
2da88b54-806f-4efe-9d04-bad55bfc342c/ozone-datanode-2.ozone_default,
2dd51a66-b60f-4072-a58f-eaa246a95dff/ozone-datanode-5.ozone_default,
c515b485-ab4f-4c45-8852-dfdabe0d28cb/ozone-datanode-3.ozone_default]
Replicas: [State: CLOSED; ReplicaIndex: 0; Origin:
ba8a1675-087e-447c-9679-10daff9d9621; Location:
ba8a1675-087e-447c-9679-10daff9d9621/ozone-datanode-4.ozone_default,
State: CLOSED; ReplicaIndex: 0; Origin: ba8a1675-087e-447c-9679-10daff9d9621;
Location: d7951798-d5d0-4e89-b12d-23c4340e865c/ozone-datanode-1.ozone_default,
State: CLOSED; ReplicaIndex: 0; Origin: 2da88b54-806f-4efe-9d04-bad55bfc342c;
Location: 2da88b54-806f-4efe-9d04-bad55bfc342c/ozone-datanode-2.ozone_default,
State: CLOSED; ReplicaIndex: 0; Origin: c515b485-ab4f-4c45-8852-dfdabe0d28cb;
Location: 2dd51a66-b60f-4072-a58f-eaa246a95dff/ozone-datanode-5.ozone_default,
State: CLOSED; ReplicaIndex: 0; Origin: c515b485-ab4f-4c45-8852-dfdabe0d28cb;
Location: c515b485-ab4f-4c45-8852-dfdabe0d28cb/ozone-datanode-3.ozone_default]
{code}
Recon counts the replicas on the decommissioned datanodes as well.
!image-2023-11-23-10-24-47-240.png|width=558,height=350!
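To illustrate the behaviour I would expect, here is a minimal sketch of classifying a container by counting only the replicas on in-service datanodes. This is not the actual Recon or SCM code: `ReplicaCountSketch`, `Replica`, `NodeState` and `Health` are hypothetical names introduced for illustration, and the node states are simplified. The output above does not say which two datanodes were decommissioned, so marking ozone-datanode-3 and ozone-datanode-4 as decommissioned is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class ReplicaCountSketch {

    // Hypothetical, simplified node states (not Ozone's actual enum).
    enum NodeState { IN_SERVICE, DECOMMISSIONED, IN_MAINTENANCE }

    enum Health { UNDER_REPLICATED, HEALTHY, OVER_REPLICATED }

    // Hypothetical replica holder: the hosting datanode and that node's state.
    static final class Replica {
        final String datanode;
        final NodeState nodeState;

        Replica(String datanode, NodeState nodeState) {
            this.datanode = datanode;
            this.nodeState = nodeState;
        }
    }

    // Counts only the replicas whose datanode is still in service.
    static long inServiceCount(List<Replica> replicas) {
        return replicas.stream()
                .filter(r -> r.nodeState == NodeState.IN_SERVICE)
                .count();
    }

    // Compares the in-service replica count with the expected replication factor.
    static Health classify(List<Replica> replicas, int expected) {
        long available = inServiceCount(replicas);
        if (available < expected) {
            return Health.UNDER_REPLICATED;
        }
        if (available > expected) {
            return Health.OVER_REPLICATED;
        }
        return Health.HEALTHY;
    }

    // Five replicas as in the output above; which two nodes are
    // decommissioned is an assumption made for this example.
    static List<Replica> sampleReplicas() {
        return Arrays.asList(
                new Replica("ozone-datanode-1", NodeState.IN_SERVICE),
                new Replica("ozone-datanode-2", NodeState.IN_SERVICE),
                new Replica("ozone-datanode-3", NodeState.DECOMMISSIONED),
                new Replica("ozone-datanode-4", NodeState.DECOMMISSIONED),
                new Replica("ozone-datanode-5", NodeState.IN_SERVICE));
    }

    public static void main(String[] args) {
        List<Replica> replicas = sampleReplicas();
        System.out.println("total replicas: " + replicas.size());
        System.out.println("in-service replicas: " + inServiceCount(replicas));
        System.out.println("health: " + classify(replicas, 3));
    }
}
```

With an expected replication factor of 3, the sample classifies as HEALTHY because only the three in-service replicas are counted. Counting all five replicas would classify it as over-replicated, which is what Recon shows in the screenshot.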
> Recon doesn't exclude out-of-service nodes when checking for healthy
> containers
> -------------------------------------------------------------------------------
>
> Key: HDDS-9645
> URL: https://issues.apache.org/jira/browse/HDDS-9645
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Recon
> Reporter: Christos Bisias
> Assignee: Christos Bisias
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2023-11-07-17-47-14-250.png,
> image-2023-11-23-10-24-47-240.png
>
>
> When SCM checks for over-replication or under-replication, it doesn’t count
> replicas on datanodes that are decommissioned or in maintenance, although it
> does consider these datanodes when testing for mis-replication.
> Recon counts replicas on decommissioned and in-maintenance datanodes in all
> of the above cases.
> Recon should exclude these datanodes
> * to be consistent with SCM
> * because replicas belonging to out-of-service nodes are not actually
> available
>
> To reproduce the issue
> * Go to /hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone
> * Edit *docker-config* and add these two configs to decommission datanodes
> **
> {code:java}
> OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
> OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
> {code}
>
> * Start the docker env, create a key with replication RATIS 3
> **
> {code:java}
> > docker-compose up --scale datanode=5 -d
> > docker-compose exec scm bash
> bash-4.2$ ozone sh volume create /vol1
> bash-4.2$ ozone sh bucket create /vol1/bucket1
> bash-4.2$ ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
> {code}
> * Decommission 2 of the 3 datanodes that hold the container replicas
> **
> {code:java}
> bash-4.2$ ozone admin container info 1
> # Pick 2 of the 3 datanodes listed for the container
> bash-4.2$ ozone admin scm roles
> # Copy the SCM IP
> bash-4.2$ ozone admin datanode list
> # Copy the IPs of the chosen datanodes
> bash-4.2$ ozone admin datanode decommission -id=scmservice
> --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
> Started decommissioning datanode(s):
> 172.23.0.8/ozone-datanode-2.ozone_default
> bash-4.2$ ozone admin datanode decommission -id=scmservice
> --scm=172.23.0.2:9894 172.23.0.11/ozone-datanode-1.ozone_default
> Started decommissioning datanode(s):
> 172.23.0.11/ozone-datanode-1.ozone_default
> {code}
> * After the nodes have been successfully decommissioned
> ** SCM container report
> ***
> {code:java}
> bash-4.2$ ozone admin container report
> Container Summary Report generated at 2023-11-07T15:37:38Z
> ==========================================================
> Container State Summary
> =======================
> OPEN: 0
> CLOSING: 0
> QUASI_CLOSED: 0
> CLOSED: 1
> DELETING: 0
> DELETED: 0
> RECOVERING: 0
> Container Health Summary
> ========================
> UNDER_REPLICATED: 0
> MIS_REPLICATED: 0
> OVER_REPLICATED: 0
> MISSING: 0
> UNHEALTHY: 0
> EMPTY: 0
> OPEN_UNHEALTHY: 0
> QUASI_CLOSED_STUCK: 0
> {code}
> *
> ** Recon container page
> *** The container appears as over-replicated indefinitely
> *** Two replicas have been created on new datanodes, but Recon reports that
> 3 replicas are expected while 5 exist, because it also counts the replicas on
> the out-of-service nodes
> !image-2023-11-07-17-47-14-250.png|width=387,height=238!