[ https://issues.apache.org/jira/browse/HDDS-12481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937021#comment-17937021 ]
Sadanand Shenoy edited comment on HDDS-12481 at 3/20/25 6:44 AM:
-----------------------------------------------------------------

{code:java}
[~]$ ulimit -n
1024
{code}
The OM logs have too many exceptions complaining of the open file limit. I think the default limit is not enough for this workload.
{code:java}
Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data/db.checkpoints/om.db_checkpoint_1741024400279; status : IOError; message : While open a file for random read: /var/lib/hadoop-ozone/om/data/db.checkpoints/om.db_checkpoint_1741024400279/168584.sst: Too many open files
{code}
Somehow the active RocksDB got closed/crashed. It could be due to the same reason, i.e. it hit the open file limit. Since the active DB got closed, *createOmSnapshotCheckpoint*, which accesses the active DB to clear the DeletedTable, wasn't successful. This is part of the doubleBuffer flush, and any operation that fails there causes the OM to terminate:
{code:java}
2025-03-03 09:53:30,921 ERROR [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread OMDoubleBufferFlushThread when handling OMRequest: cmdType: CreateSnapshot
{code}
I think the issue should be alleviated by increasing the open file limit.
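To confirm this before bumping the limit, one can compare the OM process's descriptor usage against its effective limit. A minimal sketch, assuming a Linux /proc filesystem and that the OM JVM command line contains "OzoneManager" (the exact main class can differ by deployment):
{code:java}
# Find the OM pid (the match pattern is an assumption about the JVM command line)
OM_PID=$(pgrep -f OzoneManager)

# Effective limit of the running process (can differ from the login shell's ulimit)
grep "open files" /proc/${OM_PID}/limits

# Number of file descriptors currently held by the OM
ls /proc/${OM_PID}/fd | wc -l
{code}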
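And a sketch of raising the limit persistently; the user name, service unit name, and value below are illustrative assumptions that depend on how the OM is deployed:
{code:java}
# /etc/security/limits.conf entry for the user that runs the OM
# (user name and value are illustrative)
hadoop  soft  nofile  64000
hadoop  hard  nofile  64000

# If the OM runs under systemd, limits.conf is ignored and the unit
# needs its own override (unit name is hypothetical):
#   /etc/systemd/system/ozone-om.service.d/override.conf
#     [Service]
#     LimitNOFILE=64000
# followed by: systemctl daemon-reload && systemctl restart ozone-om
{code}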
> OM down due to OMDoubleBuffer error
> -----------------------------------
>
>                 Key: HDDS-12481
>                 URL: https://issues.apache.org/jira/browse/HDDS-12481
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Snapshot
>            Reporter: Jyotirmoy Sinha
>            Priority: Major
>
> Scenario -
> # 150 Volumes
> # 4 buckets per volume of combinations -
> ## ratis - fso
> ## ratis - obs
> ## ec - fso
> ## ec - obs
> # 7 snapshots -
> ## 4 backup snapshots
> ## 3 replication snapshots
> # Total number of snapshots = 150 * 4 * 7 = 4200
> # 500K keys per snapshot (key scale to be configurable)
> # Repeat steps 3-6 for continuous intervals
> OM Error stacktrace -
> {code:java}
> 2025-03-03 09:53:30,921 INFO [grpc-default-executor-644]-org.apache.ratis.grpc.server.GrpcLogAppender: om122@group-9E840F57059A->om121-InstallSnapshotResponseHandler: InstallSnapshot in progress.
> 2025-03-03 09:53:30,921 ERROR [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread OMDoubleBufferFlushThread when handling OMRequest: cmdType: CreateSnapshot
> traceID: ""
> success: true
> status: OK
> CreateSnapshotResponse {
>   snapshotInfo {
>     snapshotID {
>       mostSigBits: 7137775835294879303
>       leastSigBits: -8902643940451942791
>     }
>     name: "snap1741024398"
>     volumeName: "bofavol-38"
>     bucketName: "bofabuck-38-ratfso"
>     snapshotStatus: SNAPSHOT_ACTIVE
>     creationTime: 1741024409564
>     deletionTime: 18446744073709551615
>     pathPreviousSnapshotID {
>       mostSigBits: -1057436070549500687
>       leastSigBits: -5683722903965903247
>     }
>     globalPreviousSnapshotID {
>       mostSigBits: 8988231063147725793
>       leastSigBits: -5103262545165094617
>     }
>     snapshotPath: "bofavol-38/bofabuck-38-ratfso"
>     checkpointDir: "-630e794d-fd1e-4e47-8473-74811f93fe79"
>     dbTxSequenceNumber: 323480877
>     deepClean: true
>     sstFiltered: false
>   }
> } java.io.IOException: Rocks Database is closed
> 	at org.apache.hadoop.hdds.utils.db.RocksDatabase.acquire(RocksDatabase.java:439)
> 	at org.apache.hadoop.hdds.utils.db.RocksDatabase.newIterator(RocksDatabase.java:777)
> 	at org.apache.hadoop.hdds.utils.db.RDBTable.iterator(RDBTable.java:232)
> 	at org.apache.hadoop.hdds.utils.db.TypedTable.iterator(TypedTable.java:418)
> 	at org.apache.hadoop.hdds.utils.db.TypedTable.iterator(TypedTable.java:55)
> 	at org.apache.hadoop.ozone.om.OmSnapshotManager.deleteKeysFromDelKeyTableInSnapshotScope(OmSnapshotManager.java:558)
> 	at org.apache.hadoop.ozone.om.OmSnapshotManager.createOmSnapshotCheckpoint(OmSnapshotManager.java:437)
> 	at org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotCreateResponse.addToDBBatch(OMSnapshotCreateResponse.java:81)
> 	at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:383)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:221)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatch(OzoneManagerDoubleBuffer.java:382)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushBatch(OzoneManagerDoubleBuffer.java:325)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushCurrentBuffer(OzoneManagerDoubleBuffer.java:298)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:263)
> 	at java.lang.Thread.run(Thread.java:748){code}