[jira] [Resolved] (HDDS-12469) fail fast for write block stuck

Tsz-wo Sze (Jira) Thu, 13 Mar 2025 09:22:15 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tsz-wo Sze resolved HDDS-12469.
-------------------------------
    Fix Version/s: 1.4.2
       Resolution: Fixed

The pull request is now merged.  Thanks, [~sumitagrawl]!

> fail fast for write block stuck
> -------------------------------
>
>                 Key: HDDS-12469
>                 URL: https://issues.apache.org/jira/browse/HDDS-12469
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.4.2
>
>         Attachments: 8022_review.patch
>
>
> In a follower, ContainerStateMachine's write() returns a future that 
> actually performs the block/chunk write.
> As part of this write, it
>  * can create the container if it does not exist
>  * writes the block chunk to disk
>  
> Under a disk-full / low-disk condition, it takes a very long time to 
> process the write chunk and appears stuck.
> JMX metrics from the DNs show time taken (ns) on the order of 10^14, 
> 10^13, ..., that is, 100k seconds, 10k seconds, ..., which indicates the 
> process is really stuck and unable to recover.
>  
> {code:java}
> jmxnode1_p1:    "WriteStateMachineDataNsAvgTime" : 1.0438595905348E14
> jmxnode2_p2:    "WriteStateMachineDataNsAvgTime" : 2.2966696397828832E13
> jmxnode2_p3:    "WriteStateMachineDataNsAvgTime" : 1.4061009948751E13
> jmxnode3_p4:    "WriteStateMachineDataNsAvgTime" : 1.0024869351741E13
> ... {code}
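
To make the magnitudes concrete, the reported averages can be converted from nanoseconds to seconds and hours. The following is a minimal sketch; the class and method names are hypothetical, and only the four metric values are taken from the JMX output above. The first value works out to roughly 104,000 seconds, or about 29 hours.

```java
// Hypothetical helper, not part of Ozone: converts the
// WriteStateMachineDataNsAvgTime values quoted above into seconds and hours.
class StuckWriteMetrics {
    static double nsToSeconds(double ns) {
        return ns / 1e9;
    }

    static double nsToHours(double ns) {
        return nsToSeconds(ns) / 3600.0;
    }

    public static void main(String[] args) {
        // Values copied from the JMX excerpt in the issue description.
        double[] avgNs = {
            1.0438595905348E14,
            2.2966696397828832E13,
            1.4061009948751E13,
            1.0024869351741E13
        };
        for (double ns : avgNs) {
            System.out.printf("%.3e ns = %.1f s = %.2f h%n",
                ns, nsToSeconds(ns), nsToHours(ns));
        }
    }
}
```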
>  
> This might be caused by a failed volume; it was later observed that a few 
> volume disks had issues.
>  
> According to the Ratis logs, Ratis keeps tracking the task and prints a 
> TimeoutIOException for it every 10 seconds.
> {code:java}
> org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: 
> WriteLog:115: (t:1, i:115), STATEMACHINELOGENTRY, cmdType: WriteChunk 
> traceID: "" containerID: 18446516 datanodeUuid: 
> "2834c106-e999-4013-9934-a165fdbe41cf" pipelineID: 
> "f1efe128-22fe-4762-a248-7aebcaa07dff" 
> ...
> ...{code}
> Considering the above scenario:
>  * The pipeline needs to be marked unhealthy, and pipeline closure 
> triggered, if the time taken crosses a certain threshold (for example, 
> 10 minutes as the maximum time for a 256MB write, or less).
>  * The current task needs to be stopped and failed, and further raft logs 
> must not be accepted.
>  
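
The two points above can be illustrated with a generic fail-fast wrapper around a write future. This is only a sketch under stated assumptions, not the merged fix: `FailFastWriter`, `submit`, and `isHealthy` are hypothetical names, the code does not use any Ozone or Ratis API, and it assumes Java 9 or later for `CompletableFuture.orTimeout`. A write that exceeds the threshold fails with a `TimeoutException`, the writer is marked unhealthy, and later submissions are rejected immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from Ozone or Ratis.
class FailFastWriter {
    private final AtomicBoolean healthy = new AtomicBoolean(true);
    private final long timeoutMillis;

    FailFastWriter(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    boolean isHealthy() {
        return healthy.get();
    }

    // Wraps a pending write so that it fails once the threshold is crossed.
    CompletableFuture<Void> submit(CompletableFuture<Void> write) {
        if (!healthy.get()) {
            // Stand-in for "avoid accepting further raft logs".
            CompletableFuture<Void> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(
                new IllegalStateException("writer unhealthy; write rejected"));
            return rejected;
        }
        return write
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .whenComplete((ignored, error) -> {
                if (isTimeout(error)) {
                    // Stand-in for "mark pipeline unhealthy, trigger closure".
                    healthy.set(false);
                }
            });
    }

    private static boolean isTimeout(Throwable error) {
        if (error instanceof CompletionException) {
            error = error.getCause();
        }
        return error instanceof TimeoutException;
    }

    public static void main(String[] args) {
        FailFastWriter writer = new FailFastWriter(100);
        // A future that never completes simulates a write stuck on a bad disk.
        CompletableFuture<Void> stuck = new CompletableFuture<>();
        try {
            writer.submit(stuck).join();
        } catch (CompletionException e) {
            System.out.println("write failed: " + e.getCause());
        }
        System.out.println("healthy after timeout: " + writer.isHealthy());
    }
}
```

Note that `orTimeout` only completes the future exceptionally; it does not interrupt a thread that is blocked in disk I/O, so actually stopping the stuck task would need a separate cancellation mechanism.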



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

