Elek, Marton created HDDS-296:
---------------------------------

             Summary: OMMetadataManagerLock is hold by getPendingDeletionKeys for a full table scan
                 Key: HDDS-296
                 URL: https://issues.apache.org/jira/browse/HDDS-296
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
            Reporter: Elek, Marton
             Fix For: 0.2.1
We identified the problem during freon tests on real clusters. I first saw it on a Kubernetes-based pseudo cluster (50 datanodes, 1 freon). After a while the rate of key allocation slowed down (see the attached image). I could also reproduce the problem with a local cluster (I used the hadoop-dist/target/compose/ozoneperf setup): after the first 1 million keys, key creation almost stopped.

With the help of [~nandakumar131] we identified that the problem is the lock in the ozone manager. (We profiled the OM with VisualVM and found that the code was locked for an extremely long time; we also checked the rocksdb/rpc metrics from Prometheus, and everything else worked well.) [~nandakumar131] suggested using an instrumented lock in the OMMetadataManager. With a custom build we identified that the problem is that the deletion service holds the OMMetadataManager lock for a full range scan. For 1 million keys it took about 10 seconds (on my local developer machine + SSD):

{code}
ozoneManager_1 | 2018-07-25 12:45:03 WARN OMMetadataManager:143 - Lock held time above threshold: lock identifier: OMMetadataManagerLock lockHeldTimeMs=2648 ms. Suppressed 0 lock warnings.
The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
ozoneManager_1 | org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
ozoneManager_1 | org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
ozoneManager_1 | org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
ozoneManager_1 | org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
ozoneManager_1 | org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:506)
ozoneManager_1 | org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:98)
ozoneManager_1 | org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:85)
ozoneManager_1 | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1 | java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
ozoneManager_1 | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1 | java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
ozoneManager_1 | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1 | java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
ozoneManager_1 | java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
ozoneManager_1 | java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
ozoneManager_1 | java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
ozoneManager_1 | java.lang.Thread.run(Thread.java:748)
{code}

I checked it with the DeletionService disabled and everything worked well. The deletion service should be improved so that it works without long-term locking.
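For illustration, here is a minimal, self-contained sketch of the lock-instrumentation idea that produced the warning above. This is not the actual org.apache.hadoop.util.InstrumentedLock API; the class and method names (HeldTimeLock, warn callback) are hypothetical, and the real Hadoop class also supports warning suppression and nanosecond timing:

{code}
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

/** Illustrative wrapper: warns when the lock was held longer than a threshold. */
public class HeldTimeLock {
    private final Lock delegate = new ReentrantLock();
    private final long thresholdMs;
    private final Consumer<String> warn;
    private long lockedAtMs;

    public HeldTimeLock(long thresholdMs, Consumer<String> warn) {
        this.thresholdMs = thresholdMs;
        this.warn = warn;
    }

    public void lock() {
        delegate.lock();
        lockedAtMs = System.currentTimeMillis();
    }

    public void unlock() {
        long heldMs = System.currentTimeMillis() - lockedAtMs;
        delegate.unlock();
        if (heldMs > thresholdMs) {
            // Mirrors the shape of the warning in the log above.
            warn.accept("Lock held time above threshold: lockHeldTimeMs=" + heldMs + " ms.");
        }
    }
}
{code}

A wrapper like this is how the long scan was spotted: the warning fires on unlock, together with the stack trace of the holder.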
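One way to avoid the long-term locking is to scan in bounded batches, re-acquiring the lock per batch instead of holding it across the whole range scan. The sketch below is not the actual fix in KeyManagerImpl/KeyDeletingService; it uses a TreeMap as a stand-in for the RocksDB-backed deleted-key table, and all names (BatchedDeletionScanner, markDeleted) are hypothetical:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative batched scan: the lock is held only for one batch at a time. */
public class BatchedDeletionScanner {
    private final NavigableMap<String, String> deletedTable = new TreeMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void markDeleted(String key, String value) {
        lock.writeLock().lock();
        try {
            deletedTable.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Collects pending-deletion keys batch by batch instead of one full scan. */
    public List<String> getPendingDeletionKeys(int batchSize) {
        List<String> result = new ArrayList<>();
        String from = null;  // last key seen; resume after it in the next batch
        while (true) {
            lock.readLock().lock();
            try {
                NavigableMap<String, String> tail =
                    (from == null) ? deletedTable : deletedTable.tailMap(from, false);
                int n = 0;
                for (String key : tail.keySet()) {
                    if (n++ == batchSize) {
                        break;  // batch full; release the lock before continuing
                    }
                    result.add(key);
                    from = key;
                }
                if (n <= batchSize) {
                    return result;  // fewer keys than a full batch: table exhausted
                }
            } finally {
                lock.readLock().unlock();
            }
        }
    }
}
{code}

The trade-off is that key writers can interleave between batches, so the result is not a single consistent snapshot; for collecting deletion candidates that is usually acceptable, since missed keys are picked up on the next run.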
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)