Elek, Marton created HDDS-296:
---------------------------------

             Summary: OMMetadataManagerLock is held by getPendingDeletionKeys for a full table scan
                 Key: HDDS-296
                 URL: https://issues.apache.org/jira/browse/HDDS-296
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
            Reporter: Elek, Marton
             Fix For: 0.2.1


We identified the problem during freon tests on real clusters. First I saw it on a Kubernetes-based pseudo cluster (50 datanodes, 1 freon). After a while the rate of key allocation slowed down (see the attached image).

I could also reproduce the problem with a local cluster (I used the hadoop-dist/target/compose/ozoneperf setup). After the first 1 million keys, key creation almost stopped.

With the help of [~nandakumar131] we identified that the problem is the lock in the Ozone Manager. (We profiled the OM with VisualVM and found that the lock was held for an extremely long time; we also checked the rocksdb/rpc metrics from Prometheus and everything else was working well.)

[~nandakumar131] suggested using an InstrumentedLock in the OMMetadataManager. With a custom build we identified that the problem is that the deletion service holds the OMMetadataManager lock for a full range scan. For 1 million keys the scan took about 10 seconds (on my local developer machine with an SSD):

{code}
ozoneManager_1  | 2018-07-25 12:45:03 WARN  OMMetadataManager:143 - Lock held time above threshold: lock identifier: OMMetadataManagerLock lockHeldTimeMs=2648 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
ozoneManager_1  | org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
ozoneManager_1  | org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
ozoneManager_1  | org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
ozoneManager_1  | org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
ozoneManager_1  | org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:506)
ozoneManager_1  | org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:98)
ozoneManager_1  | org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:85)
ozoneManager_1  | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1  | java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
ozoneManager_1  | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1  | java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
ozoneManager_1  | java.util.concurrent.FutureTask.run(FutureTask.java:266)
ozoneManager_1  | java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
ozoneManager_1  | java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
ozoneManager_1  | java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
ozoneManager_1  | java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
ozoneManager_1  | java.lang.Thread.run(Thread.java:748)
{code}
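
For context, the instrumented-lock approach is simply to measure how long the lock is held and print a warning with the stack trace when the hold time crosses a threshold (that is what produced the warning above). A minimal, self-contained sketch of the idea; this is illustrative only and not Hadoop's actual InstrumentedLock/InstrumentedReadLock implementation:

{code}
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal sketch of the instrumented-lock idea: time how long the lock is
 * held and warn when the hold time exceeds a threshold. Illustrative only;
 * Hadoop's InstrumentedLock adds suppression counters, rate limiting and
 * proper logging.
 */
public class SimpleInstrumentedLock {

  private final ReentrantLock lock = new ReentrantLock();
  private final String name;
  private final long warnThresholdMs;
  private long acquiredAtMs; // read/written only while the lock is held

  public SimpleInstrumentedLock(String name, long warnThresholdMs) {
    this.name = name;
    this.warnThresholdMs = warnThresholdMs;
  }

  public void lock() {
    lock.lock();
    acquiredAtMs = System.currentTimeMillis();
  }

  public void unlock() {
    // Capture the hold time before releasing the lock.
    long heldMs = System.currentTimeMillis() - acquiredAtMs;
    lock.unlock();
    if (heldMs > warnThresholdMs) {
      System.err.println("Lock held time above threshold: " + name
          + " lockHeldTimeMs=" + heldMs + " ms");
      Thread.dumpStack();
    }
  }
}
{code}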

I checked it with the DeletionService disabled and it worked well.

The deletion service should be improved so that it works without holding the lock for long periods.
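
One possible direction (a sketch only; the class, field and method names below are hypothetical, not the actual KeyManagerImpl/KeyDeletingService code): cap each pass at a bounded number of keys and release the lock between batches, so key allocation is never blocked behind a full table scan.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical sketch of a batched deletion scan. A single background task
 * (as the KeyDeletingService is) calls nextBatch() repeatedly; the shared
 * lock is held for one bounded batch at a time instead of a full table scan.
 */
public class BatchedDeletionScan {

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> deletedKeyTable = new ArrayList<>();
  private int resumePosition = 0; // owned by the single scanning thread

  /** Returns at most maxKeys pending-deletion keys per lock acquisition. */
  public List<String> nextBatch(int maxKeys) {
    List<String> batch = new ArrayList<>(maxKeys);
    lock.readLock().lock();
    try {
      while (resumePosition < deletedKeyTable.size()
          && batch.size() < maxKeys) {
        batch.add(deletedKeyTable.get(resumePosition++));
      }
    } finally {
      lock.readLock().unlock();
    }
    return batch;
  }
}
{code}

In the real code the table is a RocksDB key range, so the scan would more likely resume from the last processed key rather than an index, but the locking pattern is the point: short, bounded hold times instead of one scan-length hold.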



