Elek, Marton created HDDS-296:
---------------------------------
Summary: OMMetadataManagerLock is held by getPendingDeletionKeys
for a full table scan
Key: HDDS-296
URL: https://issues.apache.org/jira/browse/HDDS-296
Project: Hadoop Distributed Data Store
Issue Type: Bug
Reporter: Elek, Marton
Fix For: 0.2.1
We identified the problem during freon tests on real clusters. I first saw it
on a kubernetes-based pseudo cluster (50 datanodes, 1 freon). After a while the
rate of key allocation slowed down (see the attached image).
I could also reproduce the problem with a local cluster (I used the
hadoop-dist/target/compose/ozoneperf setup). After the first 1 million keys,
key creation almost stops.
With the help of [~nandakumar131] we identified that the problem is the lock in
the ozone manager. (We profiled the OM with VisualVM and found that the code was
locked for an extremely long time; we also checked the rocksdb/rpc metrics from
Prometheus, and everything else worked well.)
[~nandakumar131] suggested using an instrumented lock in the OMMetadataManager.
With a custom build we identified that the problem is that the deletion service
holds the OMMetadataManager lock for a full range scan. For 1 million keys this
took about 10 seconds (on my local developer machine with an SSD):
{code}
2018-07-25 12:45:03 WARN OMMetadataManager:143 - Lock held time above threshold: lock identifier: OMMetadataManagerLock lockHeldTimeMs=2648 ms. Suppressed 0 lock warnings. The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:506)
org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:98)
org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:85)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
{code}
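The instrumented-lock approach that produced the warning above can be sketched roughly as follows. This is a simplified illustration of the idea (measure hold time on unlock, warn above a threshold), not Hadoop's actual InstrumentedLock implementation; the class and method names here are hypothetical:

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: a lock wrapper that records when the write lock was acquired
// and logs a warning if it was held longer than a threshold.
public class HoldTimeLock {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final long warnThresholdMs;
  private volatile long acquiredAtMs;

  public HoldTimeLock(long warnThresholdMs) {
    this.warnThresholdMs = warnThresholdMs;
  }

  public void writeLock() {
    lock.writeLock().lock();
    acquiredAtMs = System.currentTimeMillis();
  }

  // Returns the hold time so callers can inspect it.
  public long writeUnlock() {
    long heldMs = System.currentTimeMillis() - acquiredAtMs;
    lock.writeLock().unlock();
    if (heldMs > warnThresholdMs) {
      System.err.println("Lock held time above threshold: " + heldMs + " ms");
    }
    return heldMs;
  }

  public static void main(String[] args) throws InterruptedException {
    HoldTimeLock l = new HoldTimeLock(50);
    l.writeLock();
    Thread.sleep(100);           // simulate a long critical section
    long held = l.writeUnlock(); // warns, since the lock was held > 50 ms
    System.out.println("held at least 100 ms: " + (held >= 100));
  }
}
{code}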
I also checked with the DeletionService disabled, and key creation worked well.
The deletion service should be improved so that it works without long-term locking.
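One possible direction is to collect the pending-deletion keys in bounded batches, re-acquiring and releasing the lock per batch so writers are not starved for the duration of a full table scan. A minimal sketch of that batching pattern, where a TreeMap stands in for the real metadata store (the actual OMMetadataManager/RocksDB API differs):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: scan a sorted key table in fixed-size batches, releasing the
// read lock between batches; resumption uses the last key seen.
public class BatchedScan {
  static final int BATCH = 1000;

  static List<String> pendingDeletionKeys(NavigableMap<String, byte[]> table,
                                          ReentrantReadWriteLock lock) {
    List<String> result = new ArrayList<>();
    String resumeAfter = null;
    while (true) {
      lock.readLock().lock();
      try {
        NavigableMap<String, byte[]> view = (resumeAfter == null)
            ? table : table.tailMap(resumeAfter, false);
        int n = 0;
        for (String key : view.keySet()) {
          result.add(key);
          resumeAfter = key;
          if (++n == BATCH) {
            break;
          }
        }
        if (n < BATCH) {
          return result;              // reached the end of the table
        }
      } finally {
        lock.readLock().unlock();     // lock held only for one batch
      }
    }
  }

  public static void main(String[] args) {
    NavigableMap<String, byte[]> table = new TreeMap<>();
    for (int i = 0; i < 2500; i++) {
      table.put(String.format("key%06d", i), new byte[0]);
    }
    // 2500 keys scanned across three lock acquisitions (1000 + 1000 + 500)
    System.out.println(pendingDeletionKeys(table, new ReentrantReadWriteLock()).size());
  }
}
{code}

With a real RocksDB-backed table, the resume point would be a byte[] start key passed to the iterator seek; the important property is only that the lock is not held across the whole scan.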
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]