ZanderXu commented on code in PR #6737: URL: https://github.com/apache/hadoop/pull/6737#discussion_r1568168688
########## hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/NamenodeFGL.md: ########## @@ -0,0 +1,201 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +HDFS Namenode Fine-grained Locking +================================== + +<!-- MACRO{toc|fromDepth=0|toDepth=3} --> + +Overview +-------- + +HDFS relies on a single master, the Namenode (NN), as its metadata center. +From an architectural point of view, a few elements make NN the bottleneck of an HDFS cluster: +* NN keeps the entire namespace in memory (directory tree, blocks, Datanode related info, etc.) +* Read requests (`getListing`, `getFileInfo`, `getBlockLocations`) are served from memory. +Write requests (`mkdir`, `create`, `addBlock`, `complete`) update the memory state and write a journal transaction into QJM. +Both types of requests need a locking mechanism to ensure data consistency and correctness. +* All requests are funneled into NN and have to go through the global FS lock. +Each write operation acquires this lock in write mode and holds it until that operation is executed. +This lock mode prevents concurrent execution of write operations even if they involve different branches of the directory tree. + +NN fine-grained locking (FGL) implementation aims to alleviate this bottleneck by allowing concurrency of disjoint write operations. + +JIRA: [HDFS-17366](https://issues.apache.org/jira/browse/HDFS-17366) + +Design +------ +In theory, fully independent operations can be processed concurrently, such as operations involving different subdirectory trees. +As such, NN can split the global lock into the full path lock, just using the full path lock to protect a special subdirectory tree. + +### RPC Categorization + +Roughly, RPC operations handled by NN can be divided into 8 main categories + +| Category | Operations | +|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Involving namespace tree | `mkdir`, `create` (without overwrite), `getFileInfo` (without locations), `getListing` (without locations), `setOwner`, `setPermission`, `getStoragePolicy`, `setStoragePolicy`, `rename`, `isFileClosed`, `getFileLinkInfo`, `setTimes`, `modifyAclEntries`, `removeAclEntries`, `setAcl`, `getAcl`, `setXAttr`, `getXAttrs`, `listXAttrs`, `removeXAttr`, `checkAccess`, `getErasureCodingPolicy`, `unsetErasureCodingPolicy`, `getQuotaUsage`, `getPreferredBlockSize` | +| Involving only blocks | `reportBadBlocks`, `updateBlockForPipeline`, `updatePipeline` | +| Involving only DNs | `registerDatanode`, `setBalancerBandwidth`, `sendHeartbeat` | +| Involving both namespace tree & blocks | `getBlockLocation`, `create` (with overwrite), `append`, `setReplication`, `abandonBlock`, `addBlock`, `getAdditionalDatanode`, `complete`, `concat`, `truncate`, `delete`, `getListing` (with locations), `getFileInfo` (with locations), `recoverLease`, `listCorruptFileBlocks`, `fsync`, `commitBlockSynchronization`, `RedundancyMonitor`, `processMisReplicatedBlocks` | +| Involving both DNs & blocks | `getBlocks`, `errorReport` | +| Involving namespace tree, DNs & blocks | `blockReport`, `blockReceivedAndDeleted`, `HeartbeatManager`, `Decommission` | +| Requiring locking the entire namespace | `rollEditLog`, `startCommonService`, `startActiveService`, `saveNamespace`, `rollEdits`, `EditLogTailer`, `rollingUpgrade` | +| Requiring no locking | `getServerDefaults`, `getStats` | + +For operations involving namespace tree, fully independent operations can be handled by NN concurrently. Almost all of them use the full path as a parameter, e.g. `create`, `mkdirs`, `getFileInfo`, etc. So we can use a full path lock to make them thread-safe. + +For operations involving blocks, one block belongs to one and only one `INodeFile`, so NN can use the namespace tree to make these operations thread-safe. + +For operations involving DNs, NN needs a separate DN lock because DNs operate separately from the namespace tree. + +For operations requiring the entire namespace locked, the global lock can be used to make these operations thread-safe. In general, these operations have low frequency and thus low impact despite the global locking. + +### Full Path Lock + +Used to protect operations involving namespace tree. + +All of these operations receive a path or INodeID as a parameter and can further be divided into 3 main subcategories: +1. Parameters contain only one path (`create`, `mkdir`) +2. Parameters contain multiple paths (`rename`, `concat`) +3. Parameters contain INodeID (`addBlock`, `complete`) + +For type 1, NN acquires a full path lock according to its semantics. Take `setPermission("/a/b/c/f.txt")` for example, the set of locks to acquire are ReadLock("/"), ReadLock("a"), ReadLock("b"), ReadLock("c") and WriteLock("f.txt"). Different lock patterns are explained in a later section. + +For type 2, NN acquires full path locks in a predefined order, such as the lexicographic order, to avoid deadlocks. + +For type 3, NN acquires a full path lock by in the following fashion: +- Unsafely obtains full path recursively +- Acquires the full path lock according to the lock mode +- Rechecks whether the last node of the full path is equal to the INodeID given + - If not, that means that the `INodeFile` might have been renamed or concatenated, need to retry + - If the max retry attempts have been reached, throw a `RetryException` to client to let client retry + +### `INodeFile` Lock + +Used to protect operations involving blocks. + +One block belongs to one and only one `INodeFile`, so NN can use the INodeFile lock to make operations thread-safe. Normally, there is no need to acquire the full path lock since changing the namespace tree structure does not affect the block. + +`concat` might change the `INodeFile` a block belongs to. Since both block related operations and `concat` need to acquire the `INodeFile` write lock, only one of them can be processed at a time. + +### DN Lock + +Used to protect operations involving DNs. + +NN uses a `DatanodeDescriptor` object to store the information of one DN and uses `DatanodeManager` to manage all DNs in the memory. `DatanodeDescriptor` uses `DatanodeStorageInfo` to store the information of one storage device on one DN. + +DNs have nothing to do with the namespace tree, so NN uses a separate DN lock for these operations. Since DNs are independent of one another, NN can assign a lock to each DN. + +### Global Lock + +Used for operations requiring the entire namespace locked. + +There are some operations that need to lock the entire namespace, e.g. safe mode related operations, HA service related, etc. NN uses the global lock to make these operations thread-safe. Outside of these infrequent operations that require the global write lock, all other operations have to acquire the global read lock. + +### Lock Order + +As mentioned above, there are the global lock, DN lock, and full path lock. NN acquires locks in this specific order to avoid deadlocks. + +Locks are to be acquired in this order: +- Global > DN > Full path +- Global > DN > Last `INodeFile` + +Possible lock combinations are as follows: +- Global write lock +- Global read lock > Full path lock +- Global read lock > DN read/write lock +- Global read lock > DN read/write lock > Read/Write lock of last `INodeFile` +- Global read lock > DN read/write lock > Full path lock + +### Lock Pools + +NN allocates locks as needed to the INodes used by active threads, and deletes them after the locks are no longer in use. Locks for commonly accessed `INode`s like the root are cached. + +NN uses an `INodeLockPool` to manage these locks. The lock pool: Review Comment: There is improvement to fix the hot iNodes, right? We can just mention it. ########## hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/NamenodeFGL.md: ########## @@ -0,0 +1,201 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +HDFS Namenode Fine-grained Locking +================================== + +<!-- MACRO{toc|fromDepth=0|toDepth=3} --> + +Overview +-------- + +HDFS relies on a single master, the Namenode (NN), as its metadata center. +From an architectural point of view, a few elements make NN the bottleneck of an HDFS cluster: +* NN keeps the entire namespace in memory (directory tree, blocks, Datanode related info, etc.) +* Read requests (`getListing`, `getFileInfo`, `getBlockLocations`) are served from memory. +Write requests (`mkdir`, `create`, `addBlock`, `complete`) update the memory state and write a journal transaction into QJM. +Both types of requests need a locking mechanism to ensure data consistency and correctness. +* All requests are funneled into NN and have to go through the global FS lock. +Each write operation acquires this lock in write mode and holds it until that operation is executed. +This lock mode prevents concurrent execution of write operations even if they involve different branches of the directory tree. + +NN fine-grained locking (FGL) implementation aims to alleviate this bottleneck by allowing concurrency of disjoint write operations. + +JIRA: [HDFS-17366](https://issues.apache.org/jira/browse/HDFS-17366) + +Design +------ +In theory, fully independent operations can be processed concurrently, such as operations involving different subdirectory trees. +As such, NN can split the global lock into the full path lock, just using the full path lock to protect a special subdirectory tree. + +### RPC Categorization + +Roughly, RPC operations handled by NN can be divided into 8 main categories + +| Category | Operations | +|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Involving namespace tree | `mkdir`, `create` (without overwrite), `getFileInfo` (without locations), `getListing` (without locations), `setOwner`, `setPermission`, `getStoragePolicy`, `setStoragePolicy`, `rename`, `isFileClosed`, `getFileLinkInfo`, `setTimes`, `modifyAclEntries`, `removeAclEntries`, `setAcl`, `getAcl`, `setXAttr`, `getXAttrs`, `listXAttrs`, `removeXAttr`, `checkAccess`, `getErasureCodingPolicy`, `unsetErasureCodingPolicy`, `getQuotaUsage`, `getPreferredBlockSize` | +| Involving only blocks | `reportBadBlocks`, `updateBlockForPipeline`, `updatePipeline` | +| Involving only DNs | `registerDatanode`, `setBalancerBandwidth`, `sendHeartbeat` | +| Involving both namespace tree & blocks | `getBlockLocation`, `create` (with overwrite), `append`, `setReplication`, `abandonBlock`, `addBlock`, `getAdditionalDatanode`, `complete`, `concat`, `truncate`, `delete`, `getListing` (with locations), `getFileInfo` (with locations), `recoverLease`, `listCorruptFileBlocks`, `fsync`, `commitBlockSynchronization`, `RedundancyMonitor`, `processMisReplicatedBlocks` | +| Involving both DNs & blocks | `getBlocks`, `errorReport` | +| Involving namespace tree, DNs & blocks | `blockReport`, `blockReceivedAndDeleted`, `HeartbeatManager`, `Decommission` | +| Requiring locking the entire namespace | `rollEditLog`, `startCommonService`, `startActiveService`, `saveNamespace`, `rollEdits`, `EditLogTailer`, `rollingUpgrade` | +| Requiring no locking | `getServerDefaults`, `getStats` | + +For operations involving namespace tree, fully independent operations can be handled by NN concurrently. Almost all of them use the full path as a parameter, e.g. `create`, `mkdirs`, `getFileInfo`, etc. So we can use a full path lock to make them thread-safe. + +For operations involving blocks, one block belongs to one and only one `INodeFile`, so NN can use the namespace tree to make these operations thread-safe. + +For operations involving DNs, NN needs a separate DN lock because DNs operate separately from the namespace tree. + +For operations requiring the entire namespace locked, the global lock can be used to make these operations thread-safe. In general, these operations have low frequency and thus low impact despite the global locking. + +### Full Path Lock + +Used to protect operations involving namespace tree. + +All of these operations receive a path or INodeID as a parameter and can further be divided into 3 main subcategories: +1. Parameters contain only one path (`create`, `mkdir`) +2. Parameters contain multiple paths (`rename`, `concat`) +3. Parameters contain INodeID (`addBlock`, `complete`) + +For type 1, NN acquires a full path lock according to its semantics. Take `setPermission("/a/b/c/f.txt")` for example, the set of locks to acquire are ReadLock("/"), ReadLock("a"), ReadLock("b"), ReadLock("c") and WriteLock("f.txt"). Different lock patterns are explained in a later section. + +For type 2, NN acquires full path locks in a predefined order, such as the lexicographic order, to avoid deadlocks. + +For type 3, NN acquires a full path lock by in the following fashion: +- Unsafely obtains full path recursively +- Acquires the full path lock according to the lock mode +- Rechecks whether the last node of the full path is equal to the INodeID given + - If not, that means that the `INodeFile` might have been renamed or concatenated, need to retry + - If the max retry attempts have been reached, throw a `RetryException` to client to let client retry + +### `INodeFile` Lock + +Used to protect operations involving blocks. + +One block belongs to one and only one `INodeFile`, so NN can use the INodeFile lock to make operations thread-safe. Normally, there is no need to acquire the full path lock since changing the namespace tree structure does not affect the block. + +`concat` might change the `INodeFile` a block belongs to. Since both block related operations and `concat` need to acquire the `INodeFile` write lock, only one of them can be processed at a time. + +### DN Lock + +Used to protect operations involving DNs. + +NN uses a `DatanodeDescriptor` object to store the information of one DN and uses `DatanodeManager` to manage all DNs in the memory. `DatanodeDescriptor` uses `DatanodeStorageInfo` to store the information of one storage device on one DN. + +DNs have nothing to do with the namespace tree, so NN uses a separate DN lock for these operations. Since DNs are independent of one another, NN can assign a lock to each DN. + +### Global Lock + +Used for operations requiring the entire namespace locked. + +There are some operations that need to lock the entire namespace, e.g. safe mode related operations, HA service related, etc. NN uses the global lock to make these operations thread-safe. Outside of these infrequent operations that require the global write lock, all other operations have to acquire the global read lock. + +### Lock Order + +As mentioned above, there are the global lock, DN lock, and full path lock. NN acquires locks in this specific order to avoid deadlocks. + +Locks are to be acquired in this order: +- Global > DN > Full path +- Global > DN > Last `INodeFile` + +Possible lock combinations are as follows: +- Global write lock +- Global read lock > Full path lock +- Global read lock > DN read/write lock +- Global read lock > DN read/write lock > Read/Write lock of last `INodeFile` +- Global read lock > DN read/write lock > Full path lock + +### Lock Pools + +NN allocates locks as needed to the INodes used by active threads, and deletes them after the locks are no longer in use. Locks for commonly accessed `INode`s like the root are cached. + +NN uses an `INodeLockPool` to manage these locks. The lock pool: +- Returns a closeable lock for an INode based on the lock type, +- Removes this lock if it is no longer used by any threads. + +Similar to `INodeLockPool`, a `DNLockPool` is used to manage the locks for DNs. Unlike `INodeLockPool`, `DNLockPool` keeps all locks in memory due to the comparatively lower number of locks. + +### Lock Modes + +Operations related to namespace tree have different semantics and may involve the modification or access of different INodes, for example: `getBlockLocation` only accesses the last iNodeFile, `delete` modifies both the parent and the last INode, `mkdir` may modify multiple ancestor INodes. + +Four lock modes (plus no locking): +- LOCK_READ + - This lock mode acquires the read locks for all INodes down the path. + - Example operations: `getBlockLocation`, `getFileInfo`. +- LOCK_WRITE + - This lock mode acquires the write lock for the last INode and the read locks for all ancestor INodes in the full path. + - Example operations: `setPermission`, `setReplication`. +- LOCK_PARENT + - This lock mode acquires the write lock for the last two INodes and the read locks for all remaining ancestor INodes in the full path. + - Example operations: `rename`, `delete`, `create` (when the parent directory exists). +- LOCK_ANCESTOR + - This lock mode acquires the write lock for the last existing INode and the read locks for all remaining ancestor INodes in the full path. + - Example operations: `mkdir`, `create` (when the parent directory doesn't exist). +- NONE + - This lock mode does not acquire any locks for the given path. + +Roadmap +------- + +#### Step 1: Split the global lock into FSLock and BMLock + +Split the global lock into two global locks, FSLock and BMLock. +- FSLock for operations that relate to namespace tree. +- BMLock for operations related to blocks and/or operations related to DNs. +- Both FSLock and BMLock for HA related operations. +After this step, FGL contains global FSLock and global BMLock. + +No big logic changes in this step. The original logic with the global lock retains. This step aims to make the lock mode configurable. Review Comment: Here we can give some examples for developer and end-users: 1. How to enable or disable FGL through configurations? 2. What kind of lock is needed when coding? ########## hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/NamenodeFGL.md: ########## @@ -0,0 +1,201 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +HDFS Namenode Fine-grained Locking +================================== + +<!-- MACRO{toc|fromDepth=0|toDepth=3} --> + +Overview +-------- + +HDFS relies on a single master, the Namenode (NN), as its metadata center. +From an architectural point of view, a few elements make NN the bottleneck of an HDFS cluster: +* NN keeps the entire namespace in memory (directory tree, blocks, Datanode related info, etc.) +* Read requests (`getListing`, `getFileInfo`, `getBlockLocations`) are served from memory. +Write requests (`mkdir`, `create`, `addBlock`, `complete`) update the memory state and write a journal transaction into QJM. +Both types of requests need a locking mechanism to ensure data consistency and correctness. +* All requests are funneled into NN and have to go through the global FS lock. +Each write operation acquires this lock in write mode and holds it until that operation is executed. +This lock mode prevents concurrent execution of write operations even if they involve different branches of the directory tree. + +NN fine-grained locking (FGL) implementation aims to alleviate this bottleneck by allowing concurrency of disjoint write operations. + +JIRA: [HDFS-17366](https://issues.apache.org/jira/browse/HDFS-17366) + +Design +------ +In theory, fully independent operations can be processed concurrently, such as operations involving different subdirectory trees. +As such, NN can split the global lock into the full path lock, just using the full path lock to protect a special subdirectory tree. + +### RPC Categorization + +Roughly, RPC operations handled by NN can be divided into 8 main categories + +| Category | Operations | +|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Involving namespace tree | `mkdir`, `create` (without overwrite), `getFileInfo` (without locations), `getListing` (without locations), `setOwner`, `setPermission`, `getStoragePolicy`, `setStoragePolicy`, `rename`, `isFileClosed`, `getFileLinkInfo`, `setTimes`, `modifyAclEntries`, `removeAclEntries`, `setAcl`, `getAcl`, `setXAttr`, `getXAttrs`, `listXAttrs`, `removeXAttr`, `checkAccess`, `getErasureCodingPolicy`, `unsetErasureCodingPolicy`, `getQuotaUsage`, `getPreferredBlockSize` | +| Involving only blocks | `reportBadBlocks`, `updateBlockForPipeline`, `updatePipeline` | +| Involving only DNs | `registerDatanode`, `setBalancerBandwidth`, `sendHeartbeat` | +| Involving both namespace tree & blocks | `getBlockLocation`, `create` (with overwrite), `append`, `setReplication`, `abandonBlock`, `addBlock`, `getAdditionalDatanode`, `complete`, `concat`, `truncate`, `delete`, `getListing` (with locations), `getFileInfo` (with locations), `recoverLease`, `listCorruptFileBlocks`, `fsync`, `commitBlockSynchronization`, `RedundancyMonitor`, `processMisReplicatedBlocks` | +| Involving both DNs & blocks | `getBlocks`, `errorReport` | +| Involving namespace tree, DNs & blocks | `blockReport`, `blockReceivedAndDeleted`, `HeartbeatManager`, `Decommission` | +| Requiring locking the entire namespace | `rollEditLog`, `startCommonService`, `startActiveService`, `saveNamespace`, `rollEdits`, `EditLogTailer`, `rollingUpgrade` | +| Requiring no locking | `getServerDefaults`, `getStats` | + +For operations involving namespace tree, fully independent operations can be handled by NN concurrently. Almost all of them use the full path as a parameter, e.g. `create`, `mkdirs`, `getFileInfo`, etc. So we can use a full path lock to make them thread-safe. + +For operations involving blocks, one block belongs to one and only one `INodeFile`, so NN can use the namespace tree to make these operations thread-safe. + +For operations involving DNs, NN needs a separate DN lock because DNs operate separately from the namespace tree. + +For operations requiring the entire namespace locked, the global lock can be used to make these operations thread-safe. In general, these operations have low frequency and thus low impact despite the global locking. + +### Full Path Lock + +Used to protect operations involving namespace tree. + +All of these operations receive a path or INodeID as a parameter and can further be divided into 3 main subcategories: +1. Parameters contain only one path (`create`, `mkdir`) +2. Parameters contain multiple paths (`rename`, `concat`) +3. Parameters contain INodeID (`addBlock`, `complete`) + +For type 1, NN acquires a full path lock according to its semantics. Take `setPermission("/a/b/c/f.txt")` for example, the set of locks to acquire are ReadLock("/"), ReadLock("a"), ReadLock("b"), ReadLock("c") and WriteLock("f.txt"). Different lock patterns are explained in a later section. + +For type 2, NN acquires full path locks in a predefined order, such as the lexicographic order, to avoid deadlocks. + +For type 3, NN acquires a full path lock by in the following fashion: +- Unsafely obtains full path recursively +- Acquires the full path lock according to the lock mode +- Rechecks whether the last node of the full path is equal to the INodeID given + - If not, that means that the `INodeFile` might have been renamed or concatenated, need to retry + - If the max retry attempts have been reached, throw a `RetryException` to client to let client retry + +### `INodeFile` Lock + +Used to protect operations involving blocks. + +One block belongs to one and only one `INodeFile`, so NN can use the INodeFile lock to make operations thread-safe. Normally, there is no need to acquire the full path lock since changing the namespace tree structure does not affect the block. + +`concat` might change the `INodeFile` a block belongs to. Since both block related operations and `concat` need to acquire the `INodeFile` write lock, only one of them can be processed at a time. + +### DN Lock + +Used to protect operations involving DNs. + +NN uses a `DatanodeDescriptor` object to store the information of one DN and uses `DatanodeManager` to manage all DNs in the memory. `DatanodeDescriptor` uses `DatanodeStorageInfo` to store the information of one storage device on one DN. + +DNs have nothing to do with the namespace tree, so NN uses a separate DN lock for these operations. Since DNs are independent of one another, NN can assign a lock to each DN. + +### Global Lock + +Used for operations requiring the entire namespace locked. + +There are some operations that need to lock the entire namespace, e.g. safe mode related operations, HA service related, etc. NN uses the global lock to make these operations thread-safe. Outside of these infrequent operations that require the global write lock, all other operations have to acquire the global read lock. + +### Lock Order + +As mentioned above, there are the global lock, DN lock, and full path lock. NN acquires locks in this specific order to avoid deadlocks. + +Locks are to be acquired in this order: +- Global > DN > Full path +- Global > DN > Last `INodeFile` + +Possible lock combinations are as follows: +- Global write lock +- Global read lock > Full path lock +- Global read lock > DN read/write lock +- Global read lock > DN read/write lock > Read/Write lock of last `INodeFile` +- Global read lock > DN read/write lock > Full path lock + +### Lock Pools + +NN allocates locks as needed to the INodes used by active threads, and deletes them after the locks are no longer in use. Locks for commonly accessed `INode`s like the root are cached. + +NN uses an `INodeLockPool` to manage these locks. The lock pool: +- Returns a closeable lock for an INode based on the lock type, +- Removes this lock if it is no longer used by any threads. + +Similar to `INodeLockPool`, a `DNLockPool` is used to manage the locks for DNs. Unlike `INodeLockPool`, `DNLockPool` keeps all locks in memory due to the comparatively lower number of locks. + +### Lock Modes + +Operations related to namespace tree have different semantics and may involve the modification or access of different INodes, for example: `getBlockLocation` only accesses the last iNodeFile, `delete` modifies both the parent and the last INode, `mkdir` may modify multiple ancestor INodes. + +Four lock modes (plus no locking): +- LOCK_READ + - This lock mode acquires the read locks for all INodes down the path. + - Example operations: `getBlockLocation`, `getFileInfo`. +- LOCK_WRITE + - This lock mode acquires the write lock for the last INode and the read locks for all ancestor INodes in the full path. + - Example operations: `setPermission`, `setReplication`. +- LOCK_PARENT + - This lock mode acquires the write lock for the last two INodes and the read locks for all remaining ancestor INodes in the full path. + - Example operations: `rename`, `delete`, `create` (when the parent directory exists). +- LOCK_ANCESTOR + - This lock mode acquires the write lock for the last existing INode and the read locks for all remaining ancestor INodes in the full path. + - Example operations: `mkdir`, `create` (when the parent directory doesn't exist). +- NONE + - This lock mode does not acquire any locks for the given path. + +Roadmap +------- + +#### Step 1: Split the global lock into FSLock and BMLock + +Split the global lock into two global locks, FSLock and BMLock. +- FSLock for operations that relate to namespace tree. +- BMLock for operations related to blocks and/or operations related to DNs. +- Both FSLock and BMLock for HA related operations. +After this step, FGL contains global FSLock and global BMLock. + +No big logic changes in this step. The original logic with the global lock retains. This step aims to make the lock mode configurable. + +JIRA: [HDFS-17384](https://issues.apache.org/jira/browse/HDFS-17384) + +#### Step 2: Split the global FSLock Review Comment: There are some difficult problems should be fixed in this step. And we have discussed it and got some solutions. - Quota problem - Snapshot problem - StoragePolicy ID problem I will create some tickets to describe these problems and give some solution in the ticket for these problems in milestone2. I don't know if these problems should be mentioned in this document. @ferhui @kokonguyen191 What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
