[
https://issues.apache.org/jira/browse/HDDS-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved HDDS-12469.
-------------------------------
Fix Version/s: 1.4.2
Resolution: Fixed
The pull request is now merged. Thanks, [~sumitagrawl]!
> fail fast for write block stuck
> -------------------------------
>
> Key: HDDS-12469
> URL: https://issues.apache.org/jira/browse/HDDS-12469
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.4.2
>
> Attachments: 8022_review.patch
>
>
> On a follower, ContainerStateMachine's write() returns a future that performs
> the actual block/chunk write.
> As part of that write, it
> * creates the container if it does not exist
> * writes the block chunk to disk
>
> Under a disk-full / low-disk condition, processing the write chunk takes a
> very long time and appears stuck.
> The JMX metrics for the DNs show the time taken (ns) on the order of
> 10^14, 10^13, ..., that is, 100k seconds, 10k seconds, ..., which shows the
> process is really stuck and unable to recover.
>
> {code:java}
> jmxnode1_p1: "WriteStateMachineDataNsAvgTime" : 1.0438595905348E14
> jmxnode2_p2: "WriteStateMachineDataNsAvgTime" : 2.2966696397828832E13
> jmxnode2_p3: "WriteStateMachineDataNsAvgTime" : 1.4061009948751E13
> jmxnode3_p4: "WriteStateMachineDataNsAvgTime" : 1.0024869351741E13
> ... {code}
>
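To make the magnitudes above concrete, here is a minimal sketch that converts the reported nanosecond averages into seconds and whole hours. The class and method names (StuckWriteMetrics, fromNanos) are hypothetical and are not part of Ozone; only the four sample values come from the JMX output quoted above.

```java
import java.time.Duration;

// Hypothetical helper, not Ozone code: converts a JMX nanosecond average
// into a Duration so the magnitude is easier to read.
final class StuckWriteMetrics {

    // The metric is reported as a double; truncate to whole nanoseconds.
    static Duration fromNanos(double avgNanos) {
        return Duration.ofNanos((long) avgNanos);
    }

    public static void main(String[] args) {
        // WriteStateMachineDataNsAvgTime samples quoted in the issue.
        double[] samples = {
            1.0438595905348E14,
            2.2966696397828832E13,
            1.4061009948751E13,
            1.0024869351741E13
        };
        for (double ns : samples) {
            Duration d = fromNanos(ns);
            System.out.printf("%.3e ns = %d s (%d whole hours)%n",
                ns, d.getSeconds(), d.toHours());
        }
    }
}
```

The largest sample works out to roughly 104k seconds, more than a day for a single averaged write, which matches the "100k seconds" estimate in the description.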
> This might be caused by failed volumes; a few volume disks were later
> observed to have issues.
>
> The Ratis logs show that it keeps tracking the task and prints a
> TimeoutIOException for it every 10 seconds.
> {code:java}
> org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
> WriteLog:115: (t:1, i:115), STATEMACHINELOGENTRY, cmdType: WriteChunk
> traceID: "" containerID: 18446516 datanodeUuid:
> "2834c106-e999-4013-9934-a165fdbe41cf" pipelineID:
> "f1efe128-22fe-4762-a248-7aebcaa07dff"
> ...
> ...{code}
> Considering the above scenario, we need to
> * mark the pipeline unhealthy if the time taken crosses a certain threshold
> (for example, 10 minutes as the maximum time for a 256MB write, or less) and
> trigger pipeline closure
> * stop and fail the current task, and avoid accepting further Raft logs
>
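The two requirements above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the merged fix: FailFastWriteGuard, checkStuck, and accept are hypothetical names, the 10 minute threshold is the upper bound suggested in the description, and pipeline closure itself is reduced to a health flag.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical guard, not Ozone code: fails a write that runs past a
// threshold and rejects further writes once that has happened.
final class FailFastWriteGuard {
    private final long thresholdNanos;
    private volatile boolean healthy = true;

    FailFastWriteGuard(long thresholdNanos) {
        this.thresholdNanos = thresholdNanos;
    }

    boolean isHealthy() {
        return healthy;
    }

    // Called periodically. If the write is still pending past the threshold,
    // mark the guard unhealthy and fail the future. Completing the future
    // only notifies waiters; it does not interrupt the underlying disk I/O.
    <T> boolean checkStuck(CompletableFuture<T> write, long startNanos, long nowNanos) {
        if (write.isDone() || nowNanos - startNanos <= thresholdNanos) {
            return false;
        }
        healthy = false; // a real implementation would trigger pipeline closure here
        write.completeExceptionally(new TimeoutException("write exceeded threshold"));
        return true;
    }

    // Called before accepting a new write: reject it once unhealthy.
    <T> CompletableFuture<T> accept(CompletableFuture<T> write) {
        if (healthy) {
            return write;
        }
        CompletableFuture<T> rejected = new CompletableFuture<>();
        rejected.completeExceptionally(new IllegalStateException("pipeline marked unhealthy"));
        return rejected;
    }

    public static void main(String[] args) {
        FailFastWriteGuard guard = new FailFastWriteGuard(TimeUnit.MINUTES.toNanos(10));
        CompletableFuture<Void> stuckWrite = new CompletableFuture<>();
        // Simulate a check 11 minutes after the write started.
        boolean failed = guard.checkStuck(stuckWrite, 0L, TimeUnit.MINUTES.toNanos(11));
        System.out.println("failed=" + failed + ", healthy=" + guard.isHealthy());
    }
}
```

Passing the start and current timestamps explicitly keeps the check deterministic; in a running datanode they would presumably come from System.nanoTime().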