[ https://issues.apache.org/jira/browse/HDDS-12481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937021#comment-17937021 ]
Sadanand Shenoy edited comment on HDDS-12481 at 3/20/25 6:44 AM:
-----------------------------------------------------------------

{code:java}
[~]$ ulimit -n
1024
{code}
The OM logs have too many exceptions complaining of the open file limit. I think the default limit is not enough for this workload.
{code:java}
Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data/db.checkpoints/om.db_checkpoint_1741024400279; status : IOError; message : While open a file for random read: /var/lib/hadoop-ozone/om/data/db.checkpoints/om.db_checkpoint_1741024400279/168584.sst: Too many open files
{code}
Somehow the active RocksDB got closed/crashed. It could be due to the same reason, i.e. it hit the open file limit. Since the active DB got closed, *createOmSnapshotCheckpoint*, which accesses the active DB to clear the DeletedTable, wasn't successful. This is part of the doubleBuffer flush, and any operation that fails there causes the OM to terminate:
{code:java}
2025-03-03 09:53:30,921 ERROR [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread OMDoubleBufferFlushThread when handling OMRequest: cmdType: CreateSnapshot
{code}
I think the issue should be alleviated by increasing the open file limit.
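To confirm this before bumping the limit, one can compare the OM process's descriptor usage against its effective limit. A minimal sketch, assuming a Linux /proc filesystem and that the OM JVM command line contains "OzoneManager" (the exact main class can differ by deployment):
{code:java}
# Find the OM pid (the match pattern is an assumption about the JVM command line)
OM_PID=$(pgrep -f OzoneManager)

# Effective limit of the running process (can differ from the login shell's ulimit)
grep "open files" /proc/${OM_PID}/limits

# Number of file descriptors currently held by the OM
ls /proc/${OM_PID}/fd | wc -l
{code}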
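And a sketch of raising the limit persistently; the user name, service unit name, and value below are illustrative assumptions that depend on how the OM is deployed:
{code:java}
# /etc/security/limits.conf entry for the user that runs the OM
# (user name and value are illustrative)
hadoop  soft  nofile  64000
hadoop  hard  nofile  64000

# If the OM runs under systemd, limits.conf is ignored and the unit
# needs its own override (unit name is hypothetical):
#   /etc/systemd/system/ozone-om.service.d/override.conf
#     [Service]
#     LimitNOFILE=64000
# followed by: systemctl daemon-reload && systemctl restart ozone-om
{code}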
> OM down due to OMDoubleBuffer error
> -----------------------------------
>
>                 Key: HDDS-12481
>                 URL: https://issues.apache.org/jira/browse/HDDS-12481
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Snapshot
>            Reporter: Jyotirmoy Sinha
>            Priority: Major
>
> Scenario -
> # 150 Volumes
> # 4 buckets per volume of combinations -
> ## ratis - fso
> ## ratis - obs
> ## ec - fso
> ## ec - obs
> # 7 snapshots -
> ## 4 backup snapshots
> ## 3 replication snapshots
> # Total number of snapshots = 150 * 4 * 7 = 4200
> # 500K keys per snapshot (key scale to be configurable)
> # Repeat steps 3-6 for continuous intervals
> OM Error stacktrace -
> {code:java}
> 2025-03-03 09:53:30,921 INFO [grpc-default-executor-644]-org.apache.ratis.grpc.server.GrpcLogAppender: om122@group-9E840F57059A->om121-InstallSnapshotResponseHandler: InstallSnapshot in progress.
> 2025-03-03 09:53:30,921 ERROR [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread OMDoubleBufferFlushThread when handling OMRequest: cmdType: CreateSnapshot
> traceID: ""
> success: true
> status: OK
> CreateSnapshotResponse {
>   snapshotInfo {
>     snapshotID {
>       mostSigBits: 7137775835294879303
>       leastSigBits: -8902643940451942791
>     }
>     name: "snap1741024398"
>     volumeName: "bofavol-38"
>     bucketName: "bofabuck-38-ratfso"
>     snapshotStatus: SNAPSHOT_ACTIVE
>     creationTime: 1741024409564
>     deletionTime: 18446744073709551615
>     pathPreviousSnapshotID {
>       mostSigBits: -1057436070549500687
>       leastSigBits: -5683722903965903247
>     }
>     globalPreviousSnapshotID {
>       mostSigBits: 8988231063147725793
>       leastSigBits: -5103262545165094617
>     }
>     snapshotPath: "bofavol-38/bofabuck-38-ratfso"
>     checkpointDir: "-630e794d-fd1e-4e47-8473-74811f93fe79"
>     dbTxSequenceNumber: 323480877
>     deepClean: true
>     sstFiltered: false
>   }
> } java.io.IOException: Rocks Database is closed
> 	at org.apache.hadoop.hdds.utils.db.RocksDatabase.acquire(RocksDatabase.java:439)
> 	at org.apache.hadoop.hdds.utils.db.RocksDatabase.newIterator(RocksDatabase.java:777)
> 	at org.apache.hadoop.hdds.utils.db.RDBTable.iterator(RDBTable.java:232)
> 	at org.apache.hadoop.hdds.utils.db.TypedTable.iterator(TypedTable.java:418)
> 	at org.apache.hadoop.hdds.utils.db.TypedTable.iterator(TypedTable.java:55)
> 	at org.apache.hadoop.ozone.om.OmSnapshotManager.deleteKeysFromDelKeyTableInSnapshotScope(OmSnapshotManager.java:558)
> 	at org.apache.hadoop.ozone.om.OmSnapshotManager.createOmSnapshotCheckpoint(OmSnapshotManager.java:437)
> 	at org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotCreateResponse.addToDBBatch(OMSnapshotCreateResponse.java:81)
> 	at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:383)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:221)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatch(OzoneManagerDoubleBuffer.java:382)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushBatch(OzoneManagerDoubleBuffer.java:325)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushCurrentBuffer(OzoneManagerDoubleBuffer.java:298)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:263)
> 	at java.lang.Thread.run(Thread.java:748){code}