Re: Discussion about NameNode Fine-grained locking

Ayush Saxena Tue, 31 Dec 2024 04:44:04 -0800

+1,
Thanks, folks, for your efforts on this! I didn't have time to review
everything thoroughly, but my initial pass suggests it looks good, or
at least is safe to merge.
If I find some spare time, I'll test it further and file a ticket
if I encounter any issues.


Good Luck!!!

-Ayush

On Tue, 31 Dec 2024 at 16:39, Hui Fei <[email protected]> wrote:
>
> Thanks Zander for bringing this discussion up again and doing your best to 
> push it forward. It has been a long time since the last discussion.
>
> It’s indeed time. +1 for merging the phase 1 code, based on the following 
> points:
>  - The phase 1 feature has been running at scale within companies for a long 
> time
>  - The long-term plan is clear, and it also addresses some questions raised 
> by the community
>  - The test results on memory and performance for the future features
>
> On Tue, 31 Dec 2024 at 15:36, ZanderXu <[email protected]> wrote:
>>
>> Hi, everyone:
>>
>> Time to Merge FGL Phase I
>>
>> The PR for FGL Phase I is ready for merging! Please take a moment to review 
>> and cast your vote: https://github.com/apache/hadoop/pull/6762.
>>
>> The FGL Phase I has been running successfully in production for over six 
>> months at Shopee and BOSS Zhipin, with no reported performance or stability 
>> issues. It’s now the right time to merge it into the trunk branch, allowing 
>> us to move forward with Phase II.
>>
>> The global lock remains the default lock mode, but users can enable FGL by 
>> configuring 
>> dfs.namenode.lock.model.provider.class=org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.
>>
>> If there are no objections within 7 days, I will propose an official vote.
>>
>> Performance and Memory Usage of Phase I
>>
>> Conclusion:
>>
>> In Phase I, fine-grained locks do not lead to significant performance 
>> improvements.
>>
>> In Phase I, fine-grained locks do not result in additional memory 
>> consumption.
>>
>> Reasons:
>>
>> BM operations heavily depend on FS operations: IBR and BR still acquire the 
>> global lock (FSLock and BMLock).
>>
>> FS operations depend on BM operations: Common operations (create, addBlock, 
>> getBlockLocations) also acquire the global lock (FSLock and BMLock).
>>
>> Phase II will bring significant performance improvements by decoupling FS 
>> and BM dependencies and replacing the global FSLock with a fine-grained 
>> IIPLock.
>>
>> Addressing Common Questions
>>
>> Thank you all for raising meaningful questions!
>>
>> I have rewritten the design document to improve clarity. 
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?usp=sharing
>>
>> Below is a summary of frequently asked questions and answers:
>>
>> Summary of Questions:
>>
>> Question 1: How is the performance of LockPoolManager?
>>
>> Performance Report:
>>
>> Time to acquire a cached lock: 194 ns
>>
>> Time to acquire a non-cached lock: 1044 ns
>>
>> Time to release an in-use lock: 88 ns
>>
>> Time to release an unused lock: 112 ns
>>
>> Overall Performance:
>>
>> QPS: Over 10 million
>>
>> Time to acquire the IIP lock for a path with depth 10:
>>
>> Fully uncached: 10440 ns + 1120 ns (≈ 11.6 μs)
>>
>> Fully cached: 1940 ns + 1120 ns (≈ 3 μs)
>>
>> In global lock scenarios, lock wait times are typically in the millisecond 
>> range. Therefore, the cost of acquiring and releasing fine-grained locks is 
>> negligible by comparison.
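
The depth-10 estimate above is the per-lock timing multiplied by the path depth. A minimal sketch of that arithmetic, assuming one acquire and one release per path component and using the unused-lock release time (112 ns) as the estimate does; the class and method names are illustrative only:

```java
public class IipLockCost {
    // Per-lock timings in nanoseconds, taken from the performance report above.
    static final long ACQUIRE_CACHED_NS = 194;
    static final long ACQUIRE_UNCACHED_NS = 1044;
    static final long RELEASE_UNUSED_NS = 112;

    // Total cost of locking and then unlocking every component of a path,
    // assuming exactly one acquire and one release per component.
    static long roundTripNs(int depth, long acquireNs, long releaseNs) {
        return depth * (acquireNs + releaseNs);
    }

    public static void main(String[] args) {
        int depth = 10;
        System.out.println("uncached: "
            + roundTripNs(depth, ACQUIRE_UNCACHED_NS, RELEASE_UNUSED_NS) + " ns");
        System.out.println("cached:   "
            + roundTripNs(depth, ACQUIRE_CACHED_NS, RELEASE_UNUSED_NS) + " ns");
    }
}
```

Both totals stay in the low microseconds, roughly three orders of magnitude below a millisecond-scale wait on the global lock.
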
>>
>> Question 2: How much memory does the FGL consume?
>>
>> Memory Consumption:
>>
>> A single LockResource contains a read-write lock and a counter, totaling 
>> approximately 200 bytes:
>>
>> LockResource: 24 bytes
>>
>> ReentrantReadWriteLock: 150 bytes
>>
>> AtomicInteger: 16 bytes
>>
>> Memory Usage Estimates:
>>
>> 10-level directory depth, 100 handlers: 1,000 lock resources, approximately 
>> 200 KB
>>
>> 10-level directory depth, 1000 handlers: 10,000 lock resources, 
>> approximately 2 MB
>>
>> 1,000,000 lock resources: approximately 200 MB
>>
>> Conclusion: Memory consumption is negligible.
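
The memory estimates above can be reproduced with a short calculation. This is a sketch only: it assumes the worst case in which every handler holds a lock on every level of a distinct path, and it sums the three listed object sizes (190 bytes, which the estimate rounds up to about 200 bytes). The class and method names are hypothetical.

```java
public class LockMemoryEstimate {
    // Approximate per-object sizes in bytes, from the breakdown above.
    static final long LOCK_RESOURCE_BYTES = 24;
    static final long RW_LOCK_BYTES = 150;
    static final long ATOMIC_INTEGER_BYTES = 16;

    // Size of one LockResource including its lock and its counter.
    static long bytesPerLock() {
        return LOCK_RESOURCE_BYTES + RW_LOCK_BYTES + ATOMIC_INTEGER_BYTES;
    }

    // Assumed worst case: no two handlers share any path component.
    static long worstCaseLocks(int depth, int handlers) {
        return (long) depth * handlers;
    }

    static long worstCaseBytes(int depth, int handlers) {
        return worstCaseLocks(depth, handlers) * bytesPerLock();
    }

    public static void main(String[] args) {
        System.out.println("100 handlers:  " + worstCaseBytes(10, 100) + " bytes");
        System.out.println("1000 handlers: " + worstCaseBytes(10, 1000) + " bytes");
    }
}
```
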
>>
>> Question 3: What happens if no lock is available in the LockPoolManager?
>>
>> If no LockResource is available, there are two options:
>>
>> Return a RetryException, prompting the client to retry later.
>>
>> Temporarily increase the lock entity limit, allocate more locks to meet 
>> client requests, and use an asynchronous thread to recycle locks 
>> periodically.
>>
>> We can provide multiple LockPoolManager implementations for users to choose 
>> from based on production environments.
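
To make the first option concrete, here is a rough sketch of a bounded, reference-counted lock pool. It is not the LockPoolManager from the PR: the class name, the exception type (a stand-in for the RetryException mentioned above), and keying by inode id are all assumptions for illustration. The capacity check is a soft limit, since concurrent callers may briefly overshoot it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BoundedLockPool {
    // One pooled entry: a read-write lock plus a reference counter.
    static final class LockResource {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final AtomicInteger refCount = new AtomicInteger();
    }

    // Hypothetical exception telling the caller to retry later.
    static final class PoolExhaustedException extends RuntimeException {
        PoolExhaustedException(String msg) { super(msg); }
    }

    private final ConcurrentHashMap<Long, LockResource> pool = new ConcurrentHashMap<>();
    private final int capacity;

    BoundedLockPool(int capacity) { this.capacity = capacity; }

    // Returns the entry for this inode id, creating it if absent,
    // and bumps its reference count. Fails when the pool is full.
    LockResource acquire(long inodeId) {
        return pool.compute(inodeId, (id, existing) -> {
            if (existing == null) {
                if (pool.size() >= capacity) {
                    throw new PoolExhaustedException("lock pool full, retry later");
                }
                existing = new LockResource();
            }
            existing.refCount.incrementAndGet();
            return existing;
        });
    }

    // Drops one reference; the entry is removed once nobody uses it.
    void release(long inodeId) {
        pool.computeIfPresent(inodeId, (id, res) ->
            res.refCount.decrementAndGet() == 0 ? null : res);
    }

    int size() { return pool.size(); }
}
```

A caller would take `acquire(id).lock.readLock()` or `writeLock()`, do its work, unlock, and then call `release(id)`. The second option (temporarily raising the limit and recycling asynchronously) would replace the exception with an allocation plus a background cleanup thread.
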
>>
>> Question 4: Regarding the IIPLock lock depth issue, can we consider holding 
>> only the first 3 or 4 levels of directory locks?
>>
>> This approach is not recommended for the following reasons:
>>
>> Cannot maximize concurrency.
>>
>> Limited savings in lock acquisition/release time and memory usage, yielding 
>> insignificant benefits.
>>
>> Question 5: How should attributes like StoragePolicy, ErasureCoding, and 
>> ACL, which can be set on parent or ancestor directory nodes, be handled?
>>
>> ErasureCoding and ACL:
>>
>> When changing node attributes, hold the corresponding INode’s write lock.
>>
>> When using ancestor node attributes, hold the corresponding INode’s read 
>> lock.
>>
>> StoragePolicy:
>>
>> More complex due to its impact on both directory tree operations and Block 
>> operations.
>>
>> To improve performance, commonly used block-related operations (such as 
>> BR/IBR) should not acquire IIPLock
>>
>> Detailed design documentation: 
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.96lztsl4mwfk
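
The rule for ErasureCoding and ACL (write lock on the node being changed, read locks on the ancestors whose attributes are used) can be sketched as follows. This is a simplified illustration, not the IIPLock implementation: it keys locks by path string rather than by INode, never evicts locks, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PathLockSketch {
    final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    // "/a/b" -> ["/", "/a", "/a/b"]: the root followed by each ancestor and the target.
    static List<String> prefixes(String path) {
        List<String> out = new ArrayList<>();
        out.add("/");
        StringBuilder sb = new StringBuilder();
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            sb.append('/').append(part);
            out.add(sb.toString());
        }
        return out;
    }

    // Locks root-first: read locks on every ancestor, and a read or
    // write lock on the last component depending on writeLast.
    Deque<Lock> lockPath(String path, boolean writeLast) {
        List<String> nodes = prefixes(path);
        Deque<Lock> held = new ArrayDeque<>();
        for (int i = 0; i < nodes.size(); i++) {
            ReentrantReadWriteLock rw =
                locks.computeIfAbsent(nodes.get(i), k -> new ReentrantReadWriteLock());
            boolean last = (i == nodes.size() - 1);
            Lock l = (last && writeLast) ? rw.writeLock() : rw.readLock();
            l.lock();
            held.push(l);
        }
        return held;
    }

    // Releases in reverse acquisition order (deepest node first).
    static void unlockAll(Deque<Lock> held) {
        while (!held.isEmpty()) {
            held.pop().unlock();
        }
    }
}
```

Under this scheme, changing an attribute on `/a` takes the write lock on `/a`, which blocks any operation under `/a/...` that needs the read lock on `/a` to consult that attribute.
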
>>
>> Question 6: How should FGL be implemented for the SNAPSHOT feature?
>>
>> Because renaming a SNAPSHOT directory is supported, holding only the write 
>> lock of the SNAPSHOT root directory cannot cover the rename case, so the 
>> thread safety of SNAPSHOT-related operations cannot be guaranteed.
>>
>> It is recommended to use the global FS lock to ensure thread safety.
>>
>> Detailed design documentation: 
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.sm36p6bfcpec
>>
>> Question 7: How should FGL be implemented for the Symlinks feature?
>>
>> The target path of a symlink is stored as a string, and the client makes a 
>> second request to access the target path. So the fine-grained lock project 
>> requires no special handling for resolving symlinks.
>>
>> For the createSymlink RPC, the FGL needs to acquire the IIPLocks for both 
>> target and link paths.
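
When one RPC needs the locks for two paths, acquiring them in a consistent order avoids deadlock between two concurrent calls that name the same paths in opposite roles. The sketch below assumes lexicographic path order and uses one plain lock per path instead of a full IIPLock chain; the ordering rule and all names are assumptions, not taken from the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class OrderedTwoPathLock {
    final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantLock());
    }

    // Runs the action while holding the locks of both paths. The
    // lexicographically smaller path is always locked first, so two
    // callers can never hold one lock each while waiting for the other.
    void withBoth(String pathA, String pathB, Runnable action) {
        String first = pathA.compareTo(pathB) <= 0 ? pathA : pathB;
        String second = first.equals(pathA) ? pathB : pathA;
        ReentrantLock l1 = lockFor(first);
        ReentrantLock l2 = lockFor(second);
        l1.lock();
        try {
            l2.lock();
            try {
                action.run();
            } finally {
                l2.unlock();
            }
        } finally {
            l1.unlock();
        }
    }
}
```
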
>>
>> Question 8: How should FGL be implemented for the reserved feature?
>>
>> The Reserved feature has two usage modes:
>>
>> /.reserved/iNodes/${inode id}
>>
>> /.reserved/raw/${path}
>>
>> INodeId Mode: During the resolvePath phase, obtain the real IIPLock lock via 
>> INodeId.
>>
>> Path Mode: During the resolvePath phase, obtain the real IIPLock lock via 
>> path.
>>
>> Detailed design documentation: 
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.h6rcpzkbpanf
>>
>> Question 9: Why is INodeFileLock used as the FGL for BlockInfo?
>>
>> INodeFile and Block have mutual dependencies:
>>
>> INodeFile depends on Block for state and size.
>>
>> Block depends on INodeFile for state and storage policy.
>>
>> Therefore, using INodeFileLock as the fine-grained lock for BlockInfo is a 
>> reasonable choice.
>>
>> Detailed design documentation: 
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.zesd6omuu3kr
>>
>> Seeking Community Feedback
>>
>> Your questions and concerns are always welcome.
>>
>> We can discuss them in detail on the Slack Channel: 
>> https://app.slack.com/client/T4S1WH2J3/C06UDTBQ2SH
>>
>> Let’s work together to advance the Fine-Grained Lock project. I believe this 
>> initiative will deliver significant performance improvements to the HDFS 
>> community and help reinvigorate its activity.
>>
>> Wishing everyone a Happy New Year 2025!
>>
>>
>> On Wed, 5 Jun 2024 at 16:17, ZanderXu <[email protected]> wrote:
>>>
>>> I plan to hold a meeting on 2024-06-06 from 3:00 PM - 4:00 PM to share the 
>>> FGL's motivations and some concerns in detail in Chinese.
>>>
>>> The doc is : NameNode Fine-Grained Locking Based On Directory Tree (II)
>>>
>>> The meeting URL is: https://sea.zoom.us/j/94168001269
>>>
>>> You are welcome to this meeting.
>>>
>>> On Mon, 6 May 2024 at 23:57, Hui Fei <[email protected]> wrote:
>>>>
>>>> BTW, there is a Slack channel, hdfs-fgl, for this feature. You can join it 
>>>> and discuss more details.
>>>>
>>>> Is it necessary to hold a meeting to discuss this, so that we can push it 
>>>> forward quickly? I agree with ZanderXu that it seems inefficient to 
>>>> discuss details via the email list.
>>>>
>>>>
>>>> On Mon, 6 May 2024 at 23:50, Hui Fei <[email protected]> wrote:
>>>>>
>>>>> Thanks all
>>>>>
>>>>> It seems all concerns are related to stage 2. We can address these and 
>>>>> make it clearer before we start it.
>>>>>
>>>>> From development experience, I think it is reasonable to split the big 
>>>>> feature into several stages. Stage 1 is also independent, and it can 
>>>>> stand as a minor feature that uses the FS and BM locks instead of the 
>>>>> global lock.
>>>>>
>>>>>
>>>>> On Mon, 29 Apr 2024 at 15:17, ZanderXu <[email protected]> wrote:
>>>>>>
>>>>>> Thanks @Ayush Saxena <[email protected]> and @Xiaoqiao He
>>>>>> <[email protected]> for your nice questions.
>>>>>>
>>>>>> Let me summarize your concerns and corresponding solutions:
>>>>>>
>>>>>> *1. Questions about the Snapshot feature*
>>>>>> It's difficult to apply FGL to the Snapshot feature, but we can just use
>>>>>> the global FS write lock to make it thread safe.
>>>>>> So if we can identify whether a path involves the snapshot feature, we
>>>>>> can just use the global FS write lock to protect it.
>>>>>>
>>>>>> You can refer to HDFS-17479
>>>>>> <https://issues.apache.org/jira/browse/HDFS-17479> to get how to identify
>>>>>> it.
>>>>>>
>>>>>> Regarding performance of the operations related to the snapshot features,
>>>>>> we can discuss it in two categories:
>>>>>> Read operations involving snapshots:
>>>>>> The FGL branch uses the global write lock to protect them; the GLOBAL
>>>>>> branch uses the global read lock. It's hard to conclude which version
>>>>>> performs better; it depends on global lock contention.
>>>>>>
>>>>>> Write operations involving snapshots:
>>>>>> Both the FGL and GLOBAL branches use the global write lock to protect
>>>>>> them. It's hard to conclude which version performs better; it also
>>>>>> depends on global lock contention.
>>>>>>
>>>>>> So I think if NameNode load is low, the GLOBAL branch will perform
>>>>>> better than FGL; if NameNode load is high, the FGL branch may perform
>>>>>> better than GLOBAL, which also depends on the ratio of read to write
>>>>>> operations on the SNAPSHOT feature.
>>>>>>
>>>>>> We can do a few things to let end-users choose the better-performing
>>>>>> branch for their workload:
>>>>>> First, we need to make the lock mode selectable, so that end-users
>>>>>> can choose between FGL and GLOBAL.
>>>>>> Second, use the global write lock to make snapshot-related operations
>>>>>> thread safe, as I described in HDFS-17479.
>>>>>>
>>>>>>
>>>>>> *2. Questions about the Symlinks feature*
>>>>>> If a Symlink is related to a snapshot, we can refer to the snapshot
>>>>>> solution; if it is not, I think it's easy to support under FGL.
>>>>>> Only createSymlink involves two paths; FGL just needs to lock them in
>>>>>> order to make this operation thread safe. Other operations are handled
>>>>>> the same as for a normal iNode, right?
>>>>>>
>>>>>> If I missed difficult points, please let me know.
>>>>>>
>>>>>>
>>>>>> *3. Questions about Memory Usage of iNode locks*
>>>>>> I think there are many ways to limit the memory usage of these
>>>>>> iNode locks, such as using a limited-capacity lock pool to cap the
>>>>>> maximum memory usage, or holding iNode locks only for a fixed depth of
>>>>>> directories.
>>>>>>
>>>>>> We can abstract this LockManager first and then support different
>>>>>> implementations of it, so that we can limit the maximum
>>>>>> memory usage of these iNode locks.
>>>>>> FGL can acquire or lease iNode locks through LockManager.
>>>>>>
>>>>>>
>>>>>> *4. Questions about Performance of acquiring and releasing iNode locks*
>>>>>> We can add a benchmark for LockManager to test the performance of
>>>>>> acquiring and releasing uncontended locks.
>>>>>>
>>>>>>
>>>>>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>>>>>> These policies may be set on an ancestor node and used by some child
>>>>>> files.  The set operation for these policies will be protected by the
>>>>>> directory tree, since they are all file-related operations.  Apart from
>>>>>> Quota and StoragePolicy, the use of the other policies, such as ECPolicy
>>>>>> and ACL, will also be protected by the directory tree.
>>>>>>
>>>>>> Quota is a little special since its update operations may not be
>>>>>> protected by the directory tree. We can assign a lock to each
>>>>>> QuotaFeature and use these locks to make update operations thread safe.
>>>>>> You can refer to
>>>>>> HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for
>>>>>> detailed information.
>>>>>>
>>>>>> StoragePolicy is a little special since it is used not only by
>>>>>> file-related operations but also by block-related operations.
>>>>>> ProcessExtraRedundancyBlock uses the storage policy to choose redundancy
>>>>>> replicas, and BlockReconstructionWork uses it to choose target DNs. In
>>>>>> order to maximize the performance improvement, BR and IBR should only
>>>>>> involve the iNodeFile to which the block being processed belongs. These
>>>>>> redundancy blocks can be processed by the Redundancy monitor while
>>>>>> holding the directory tree locks. You can refer to HDFS-17505
>>>>>> <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed
>>>>>> information.
>>>>>>
>>>>>> *6. Performance of the phase 1*
>>>>>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used
>>>>>> for performance testing of phase 1, and I will complete it later.
>>>>>>
>>>>>>
>>>>>> Discussing solutions by email is not efficient. You can create
>>>>>> sub-tasks under HDFS-17366
>>>>>> <https://issues.apache.org/jira/browse/HDFS-17366> to describe your
>>>>>> concerns and I will try to give some answers.
>>>>>>
>>>>>> Thanks @Ayush Saxena <[email protected]>  and @Xiaoqiao He
>>>>>> <[email protected]> again.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <[email protected]> wrote:
>>>>>>
>>>>>> > Thanx Everyone for chasing this, Great to see some momentum around FGL,
>>>>>> > that should be a great improvement.
>>>>>> >
>>>>>> > My comments fall into two broad categories:
>>>>>> > ** About the process:*
>>>>>> > I think in the above mails, there are mentions that phase one is 
>>>>>> > complete
>>>>>> > in a feature branch & we are gonna merge that to trunk. If I am 
>>>>>> > catching it
>>>>>> > right, then you can't hit the merge button like that. To merge a 
>>>>>> > feature
>>>>>> > branch. You need to call for a Vote specific to that branch & it 
>>>>>> > requires 3
>>>>>> > binding votes to merge, unlike any other code change which requires 1. 
>>>>>> > It
>>>>>> > is there in our Bylaws.
>>>>>> >
>>>>>> > So, do follow the process.
>>>>>> >
>>>>>> > ** About the feature itself:* (A very quick look at the doc and the 
>>>>>> > Jira,
>>>>>> > so please take it with a grain of salt)
>>>>>> > * The Google Drive link that you folks shared as part of the first 
>>>>>> > mail. I
>>>>>> > don't have access to that. So, please open up the permissions for that 
>>>>>> > doc
>>>>>> > or share the new link
>>>>>> > * Chasing the design doc present on the Jira
>>>>>> > * I think we only have Phase-1 ready, so can you share some metrics 
>>>>>> > just
>>>>>> > for that? Perf improvements just with splitting the FS & BM Locks
>>>>>> > * The memory implications of Phase-1? I don't think there should be any
>>>>>> > major impact on the memory in case of just phase-1
>>>>>> > * Regarding the snapshot stuff, you mentioned taking lock on the root
>>>>>> > itself? Does just taking the lock on the snapshot root rather than the
>>>>>> > FS root work?
>>>>>> > * Secondly about the usage of Snapshot or Symlinks, I don't think we
>>>>>> > should operate under the assumptions that they aren't widely used or 
>>>>>> > not,
>>>>>> > we might just not know folks who don't use it widely or they are just 
>>>>>> > users
>>>>>> > not the ones contributing. We can just accept for now, that in those 
>>>>>> > cases
>>>>>> > it isn't optimised and we just lock the entire FS space, which it does 
>>>>>> > even
>>>>>> > today, so no regressions there.
>>>>>> > * Regarding memory usage: Do you have some numbers on how much the 
>>>>>> > memory
>>>>>> > footprint increases?
>>>>>> > * Under the Lock Pool: I think you are assuming there would be very few
>>>>>> > inodes where lock would be required at any given time, so there won't 
>>>>>> > be
>>>>>> > too much heap consumption? I think you are compromising on the 
>>>>>> > Horizontal
>>>>>> > Scalability here. I doubt if your assumption doesn't hold true, under 
>>>>>> > heavy
>>>>>> > read load by concurrent clients accessing different inodes, the 
>>>>>> > Namenode
>>>>>> > will start giving memory troubles, that would do more harm than good.
>>>>>> > Anyway Namenode heap is way bigger problem than anything, so we should 
>>>>>> > be
>>>>>> > very careful increasing load over there.
>>>>>> > * For the Locks on the inodes: Do you plan to have locks for each inode?
>>>>>> > Can we somehow limit that to the depth of the tree? Like currently we 
>>>>>> > take
>>>>>> > lock on the root, have a config which makes us take lock at Level-2 or 
>>>>>> > 3
>>>>>> > (configurable), that might fetch some perf benefits and can be used to
>>>>>> > control the memory usage as well?
>>>>>> > * What is the cost of creating these inode locks? If the lock isn't
>>>>>> > already cached it would incur some cost? Do you have some numbers 
>>>>>> > around
>>>>>> > that? Say I disable caching altogether & then let a test load run, what
>>>>>> > does the perf numbers look like in that case
>>>>>> > * I think we need to limit the size of INodeLockPool, we can't let it 
>>>>>> > grow
>>>>>> > infinitely in case of heavy loads and we need to have some auto
>>>>>> > throttling mechanism for it
>>>>>> > * I didn't catch your Storage Policy problem. If I decode it right, the
>>>>>> > problem is like the policy could be set on an ancestor node & the 
>>>>>> > children
>>>>>> > abide by that & this is the problem, if that is the case then isn't 
>>>>>> > that
>>>>>> > the case with ErasureCoding policies or even ACLs or so? Can you 
>>>>>> > elaborate
>>>>>> > a bit on that.
>>>>>> >
>>>>>> >
>>>>>> > Anyway, regarding the Phase-1. If you share (the perf numbers with 
>>>>>> > proper
>>>>>> > details + Impact on memory if any) for just phase 1 & if they are good,
>>>>>> > then if you call for a branch merge vote for Phase-1 FGL, you have my 
>>>>>> > vote,
>>>>>> > however you'll need to sway the rest of the folks on your own :-)
>>>>>> >
>>>>>> > Good Luck, Nice Work Guys!!!
>>>>>> >
>>>>>> > -Ayush
>>>>>> >
>>>>>> >
>>>>>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <[email protected]> 
>>>>>> > wrote:
>>>>>> >
>>>>>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will be
>>>>>> >> a very helpful improvement for the HDFS module in the next journal.
>>>>>> >>
>>>>>> >> 1. If we need any more review bandwidth, I would like to be involved
>>>>>> >> to help review if possible.
>>>>>> >> 2. The design document is still missing detailed descriptions of
>>>>>> >> cases such as snapshot, symbolic link and reserved paths, as mentioned
>>>>>> >> above. I think it will be helpful for newcomers who want to get
>>>>>> >> involved if all corner cases are considered and described.
>>>>>> >> 3. From Slack, we plan to check into trunk at this phase. I am not
>>>>>> >> sure if it is the proper time; following the dev plan in the design
>>>>>> >> document, there are two steps left to finish this feature, right? If
>>>>>> >> so, I think we should postpone checking in until all plans are ready.
>>>>>> >> Considering that there have been many unfinished attempts at this
>>>>>> >> feature in the past, I think postponing the check-in is the safe way.
>>>>>> >> On the other hand, it will involve more rebase cost if you keep a
>>>>>> >> separate dev branch, though I think that is not a difficult thing for
>>>>>> >> you.
>>>>>> >>
>>>>>> >> Good luck and look forward to making that happen soon!
>>>>>> >>
>>>>>> >> Best Regards,
>>>>>> >> - He Xiaoqiao
>>>>>> >>
>>>>>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <[email protected]> wrote:
>>>>>> >> >
>>>>>> >> > Thanks for interest and advice on this.
>>>>>> >> >
>>>>>> >> > Just would like to share some info here
>>>>>> >> >
>>>>>> >> > ZanderXu leads this feature and he has spent a lot of time on it.
>>>>>> >> > He is the main developer in stage 1.  Yuanboliu and Kokonguyen191
>>>>>> >> > also took some tasks. Other developers (slfan1989 haiyang1987
>>>>>> >> > huangzhaobo99 RocMarshal kokonguyen191) helped review PRs. (Forgive
>>>>>> >> > me if I missed someone)
>>>>>> >> >
>>>>>> >> > Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
>>>>>> >> > familiar with this feature. We discussed many details offline.
>>>>>> >> >
>>>>>> >> > More people are welcome to join the development and review
>>>>>> >> > of stages 2 and 3.
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Fri, 26 Apr 2024 at 14:56, Zengqiang XU <[email protected]> wrote:
>>>>>> >> >>
>>>>>> >> >> Thanks Shilun for your response:
>>>>>> >> >>
>>>>>> >> >> 1. This is a big and very useful feature, so it really needs more
>>>>>> >> >> developers to get on board.
>>>>>> >> >> 2. This fine-grained lock has been implemented on internal branches
>>>>>> >> >> and has benefited many companies, such as Meituan, Kuaishou and
>>>>>> >> >> Bytedance.  But it has not been contributed to the community for
>>>>>> >> >> various reasons: there is a big difference between the version of
>>>>>> >> >> the internal branch and the community trunk branch, the internal
>>>>>> >> >> branch may ignore some functions to keep FGL simple, and the
>>>>>> >> >> contribution needs a lot of work and will take a long time. It
>>>>>> >> >> means that this solution has already been proven in their prod
>>>>>> >> >> environments. We have also run it in our prod environment and
>>>>>> >> >> gained benefits, and we are willing to spend a lot of time
>>>>>> >> >> contributing it to the community.
>>>>>> >> >> 3. Regarding the benchmark testing, we don't need to pay too much
>>>>>> >> >> attention to whether the performance is improved by 5 times, 10
>>>>>> >> >> times or 20 times, because there are too many factors that affect
>>>>>> >> >> it.
>>>>>> >> >> 4. As I described above, this solution is already being used by
>>>>>> >> >> many companies. Right now, we just need to think about how to
>>>>>> >> >> implement it with high quality and more comprehensively.
>>>>>> >> >> 5. I firmly believe that all problems can be solved as long as the
>>>>>> >> >> overall solution is right.
>>>>>> >> >> 6. I can spend a lot of time leading the promotion of this entire
>>>>>> >> >> feature and I hope more people can join us in promoting it.
>>>>>> >> >> 7. You are always welcome to raise your concerns.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Thanks Shilun again, I hope you can help review designs and PRs. 
>>>>>> >> >> Thanks
>>>>>> >> >>
>>>>>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <[email protected]> 
>>>>>> >> >> wrote:
>>>>>> >> >>
>>>>>> >> >> > Thank you for your hard work! This is a very meaningful
>>>>>> >> >> > improvement, and from the design document, we can see a
>>>>>> >> >> > significant increase in HDFS read/write throughput.
>>>>>> >> >> >
>>>>>> >> >> > I am happy to see the progress made on HDFS-17384.
>>>>>> >> >> >
>>>>>> >> >> > However, I still have some concerns, which roughly involve the
>>>>>> >> >> > following aspects:
>>>>>> >> >> >
>>>>>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and are
>>>>>> >> >> > familiar with related development details, we still need more
>>>>>> >> >> > community members to review the code to ensure that the relevant
>>>>>> >> >> > upgrades meet expectations.
>>>>>> >> >> >
>>>>>> >> >> > 2. We need more details on benchmarks to ensure that test results
>>>>>> >> >> > can be reproduced and to allow more community members to
>>>>>> >> >> > participate in the testing process.
>>>>>> >> >> >
>>>>>> >> >> > Looking forward to everything going smoothly in the future.
>>>>>> >> >> >
>>>>>> >> >> > Best Regards,
>>>>>> >> >> > - Shilun Fan.
>>>>>> >> >> >
>>>>>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He 
>>>>>> >> >> > <[email protected]>
>>>>>> >> wrote:
>>>>>> >> >> >
>>>>>> >> >> >> cc [email protected].
>>>>>> >> >> >>
>>>>>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <[email protected]>
>>>>>> >> wrote:
>>>>>> >> >> >> >
>>>>>> >> >> >> > Here are some summaries about the first phase:
>>>>>> >> >> >> > 1. There are no big changes in this phase
>>>>>> >> >> >> > 2. This phase just uses FS lock and BM lock to replace the
>>>>>> >> >> >> > original global lock
>>>>>> >> >> >> > 3. It's useful for improving performance, since some operations
>>>>>> >> >> >> > just need to hold the FS lock or BM lock instead of the global
>>>>>> >> >> >> > lock
>>>>>> >> >> >> > 4. This feature is turned off by default; you can enable it by
>>>>>> >> >> >> > setting dfs.namenode.lock.model.provider.class to
>>>>>> >> >> >> > org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
>>>>>> >> >> >> > 5. This phase is very important for the ongoing development of
>>>>>> >> >> >> > the entire FGL
>>>>>> >> >> >> >
>>>>>> >> >> >> > Here I would like to express my special thanks to
>>>>>> >> >> >> > @kokonguyen191 and @yuanboliu for their contributions.  You are
>>>>>> >> >> >> > also welcome to join us and complete it together.
>>>>>> >> >> >> >
>>>>>> >> >> >> >
>>>>>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <[email protected]>
>>>>>> >> wrote:
>>>>>> >> >> >> >
>>>>>> >> >> >> > > Hi everyone
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > All subtasks of the first phase of the FGL have been
>>>>>> >> >> >> > > completed and I plan to merge them into the trunk and start
>>>>>> >> >> >> > > the second phase based on the trunk.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > Here is the PR used to merge the first phase into trunk:
>>>>>> >> >> >> > > https://github.com/apache/hadoop/pull/6762
>>>>>> >> >> >> > > Here is the ticket:
>>>>>> >> >> >> > > https://issues.apache.org/jira/browse/HDFS-17384
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > I hope you can help to review this PR when you are available
>>>>>> >> >> >> > > and give some ideas.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385>
>>>>>> >> >> >> > > is used for the second phase and I have created some subtasks
>>>>>> >> >> >> > > to describe solutions for some problems, such as: snapshot,
>>>>>> >> >> >> > > getListing, quota.
>>>>>> >> >> >> > > You are welcome to join us to complete it together.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > ---------- Forwarded message ---------
>>>>>> >> >> >> > > From: Zengqiang XU <[email protected]>
>>>>>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
>>>>>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
>>>>>> >> >> >> > > To: <[email protected]>
>>>>>> >> >> >> > > Cc: Zengqiang XU <[email protected]>
>>>>>> >> >> >> > >
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > Hi everyone
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > I have started a discussion about NameNode Fine-grained
>>>>>> >> >> >> > > Locking to improve the performance of write operations in
>>>>>> >> >> >> > > NameNode.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > I started this discussion again for several main reasons:
>>>>>> >> >> >> > > 1. We have implemented it and gained nearly 7x performance
>>>>>> >> >> >> > > improvement in our prod environment
>>>>>> >> >> >> > > 2. Many other companies made similar improvements based on
>>>>>> >> >> >> > > their internal branches.
>>>>>> >> >> >> > > 3. This topic has been discussed for a long time, but still
>>>>>> >> >> >> > > without any results.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > I hope we can push this important improvement in the
>>>>>> >> >> >> > > community so that all end-users can enjoy this significant
>>>>>> >> >> >> > > improvement.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > I'd really appreciate it if you could join in and work with
>>>>>> >> >> >> > > me to push this feature forward.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > Thanks very much.
>>>>>> >> >> >> > >
>>>>>> >> >> >> > > Ticket: HDFS-17366
>>>>>> >> >> >> > > <https://issues.apache.org/jira/browse/HDFS-17366>
>>>>>> >> >> >> > > Design: NameNode Fine-grained locking based on directory tree
>>>>>> >> >> >> > > <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> ---------------------------------------------------------------------
>>>>>> >> >> >> To unsubscribe, e-mail: [email protected]
>>>>>> >> >> >> For additional commands, e-mail: [email protected]
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>

