[ https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz-wo Sze resolved HDDS-7103.
------------------------------
    Resolution: Workaround

This problem has been worked around by HDDS-9192. Resolving ...

> Ratis log storage directories unchecked causing unhandled exception on datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis log storage is configured across multiple disks and a
> corruption leaves the same directory present on more than one of them, Ratis
> throws an unhandled exception. The exception prevents the datanode from
> creating pipelines. The datanode stays up, so the failure is visible only in
> the datanode logs.
>
> The error can be reproduced on an Ozone cluster with the configuration
> property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume
> locations, i.e. _dn1,dn2_, with the same directory present on both disks. On
> datanode start, the error is logged while bringing up the XceiverServerRatis.
>
> Snippet of the logged error:
> {code:java}
> ozone-datanode-1 | 2022-08-03 22:05:54 INFO XceiverServerRatis:481 - Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1 | 2022-08-03 22:05:54 WARN EndpointStateMachine:236 - Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1 | java.io.IOException: More than one directories found for 01a173a0-6bd2-478a-8598-05df3a6f318a: [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1 |     at org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1 |     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1 |     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
>
> This jira is filed to track the issue and to resolve it. The issue was
> identified and discussed in an earlier PR for the hdds volume diskchecker, PR
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from that PR was to omit the affected directories and continue.
> This was to be done in one of two ways:
> i.) with a checker that runs before the XceiverServerRatis starts; if such a
> checker already exists in the current Ozone, the open question is how to
> configure it to resolve this issue.
> ii.) modify the Ratis code to remove the affected directories and continue
> instead of throwing an unhandled IOException, see
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
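As a rough illustration of option i.), a pre-start check could scan the configured storage locations and report any directory name that appears under more than one of them, so the caller can log and skip those entries rather than hit the IOException later. The sketch below is only that: the class name `DuplicateRatisDirChecker` and the method `findDuplicates` are hypothetical and are not part of Ozone or Ratis, and it assumes the duplicated directories sit directly under each storage location, as in the logged paths above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical pre-start check; not an existing Ozone or Ratis class. */
public class DuplicateRatisDirChecker {

  /**
   * Returns every subdirectory name that exists under more than one
   * storage root, mapped to all of the locations where it was found.
   */
  public static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing root is skipped here rather than treated as fatal
      }
      // Only direct child directories are considered (assumption: one level deep).
      try (DirectoryStream<Path> children =
               Files.newDirectoryStream(root, Files::isDirectory)) {
        for (Path child : children) {
          String name = child.getFileName().toString();
          byName.computeIfAbsent(name, k -> new ArrayList<>()).add(child);
        }
      }
    }
    // Keep only names seen in at least two roots.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }

  public static void main(String[] args) throws IOException {
    List<Path> roots = new ArrayList<>();
    for (String arg : args) {
      roots.add(Paths.get(arg));
    }
    findDuplicates(roots).forEach((name, paths) ->
        System.err.println("WARN duplicate storage directory " + name + ": " + paths));
  }
}
```

Run against the two locations from the report (for example `/mydata/dn1 /mydata/dn2`), this would list the directory found on both disks. What to do with a reported duplicate (skip it, pick one copy, or fail the volume) is a policy decision this sketch leaves open.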