[ 
https://issues.apache.org/jira/browse/HDDS-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841382#comment-17841382
 ] 

Hemant Kumar commented on HDDS-10738:
-------------------------------------

I couldn't find any other ERROR or problem between 2024-04-22 10:43:06 and 
2024-04-22 13:16:20, other than the leadership change and the failure to 
notify the follower.

{code}
...
2024-04-22 10:43:44,706 INFO 
[om131@group-2BC026ED99AC->om132-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
 om131@group-2BC026ED99AC->om132-GrpcLogAppender: send 
om131->om132#0-t22,notify:(t:7, i:3424334)
2024-04-22 10:43:44,706 WARN 
[grpc-default-executor-11]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om131@group-2BC026ED99AC->om132-InstallSnapshotResponseHandler: Failed 
InstallSnapshot: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
2024-04-22 10:43:44,707 WARN 
[grpc-default-executor-12]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om131@group-2BC026ED99AC->om132-AppendLogResponseHandler: Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-04-22 10:43:44,707 WARN 
[grpc-default-executor-8]-org.apache.ratis.grpc.server.GrpcLogAppender: 
om131@group-2BC026ED99AC->om132-AppendLogResponseHandler: Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2024-04-22 10:43:44,707 INFO 
[om131@group-2BC026ED99AC->om132-GrpcLogAppender-LogAppenderDaemon]-org.apache.ratis.grpc.server.GrpcLogAppender:
 om131@group-2BC026ED99AC->om132-GrpcLogAppender: followerNextIndex = 0 but 
logStartIndex = 3424334, notify follower to install snapshot-(t:7, i:3424334)
...
{code}


In theory, sending notifications to a follower should not have any performance 
impact on the current leader node, but that does not seem to be true in this 
case.

After some time, the follower finally sends requests to get the tarball, 
starting around 2024-04-22 10:47:31:
{code}
...
2024-04-22 04:14:25,645 INFO 
[qtp1421547398-16116]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 10:47:31,842 INFO 
[qtp560715723-671567]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 10:57:36,898 INFO 
[qtp560715723-672419]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:07:41,992 INFO 
[qtp560715723-672879]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
...
{code}
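
The spacing between the GET requests in the log above can be checked with a small script. This is only a sketch for reading the timestamps: the list below is copied by hand from the three log lines that fall inside the time window under investigation, and the helper name gaps_in_seconds is made up for illustration. It does not say why the follower repeats the request (for example, a retry interval versus a download that keeps failing); it only quantifies the gap.

```python
from datetime import datetime

# Timestamps copied from the DBCheckpointServlet log lines above
# (only the entries inside the 10:43:06 - 13:16:20 window).
request_times = [
    "2024-04-22 10:47:31,842",
    "2024-04-22 10:57:36,898",
    "2024-04-22 11:07:41,992",
]


def gaps_in_seconds(timestamps):
    """Return the gap in seconds between consecutive log timestamps."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in timestamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(request_times)
for gap in gaps:
    print(f"{gap:.3f} s")
```

Both gaps come out at a little over ten minutes, so the follower appears to be re-requesting the checkpoint at a fairly regular interval rather than at random.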



> Unable to load snapshot exception encountered in a LR setup
> -----------------------------------------------------------
>
>                 Key: HDDS-10738
>                 URL: https://issues.apache.org/jira/browse/HDDS-10738
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Snapshot
>            Reporter: Jyotirmoy Sinha
>            Assignee: Hemant Kumar
>            Priority: Major
>              Labels: ozone-snapshot
>
> Scenario :
>  * Generate data over parallel threads over various volume/buckets
>  * Perform parallel snapshot create/delete/list operations over above buckets
>  * Perform parallel snapdiff operations over each bucket
>  * Perform parallel read operations of snapshot contents
>  * Introduce OM and cluster restarts in between along with DN decommissioning 
> and balancer restarts.
> Observation - When multiple threads are running snapshot operations, the 
> snapshot path contents are not accessible even after 20 mins.
> Snapshot creation log (OM) -
> {code:java}
> 2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread - 
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: 
> Created snapshot: 'snap1713809817' with snapshotId: 
> '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6' 
> {code}
> OM Error stacktrace - 
> {code:java}
> 2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint 
> directory: 60 didn't get created in 
> /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a
>  secs.
> 2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve 
> snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
> TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load 
> snapshot. Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
>         at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call 
> Call#2 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.17.207.24:57514
> java.lang.IllegalStateException: TIMEOUT 
> org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. 
> Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying {code}
> The above error occurs repeatedly for multiple snapshots, mostly during 
> parallel snapshot operations across various volumes/buckets, not during 
> serial operations.



