Thank you all.

   - All tags for the subtasks in JIRA, such as Fix Version, Components,
   Labels, Target Version, and Flags, have been updated accordingly.
   - The failed unit tests are unrelated to this PR.
   - With four +1 votes received and no objections, I will proceed to
   initiate the official voting process.


On Mon, 6 Jan 2025 at 12:00, Xiaoqiao He <hexiaoq...@apache.org> wrote:

> Thanks all for your great work and the big step forward here.
>
> Some nit comments before checking in:
> a. Please check whether the failed unit tests are related to this PR at
> https://github.com/apache/hadoop/pull/6762.
> It is better to execute them and get a green result before checking in.
> b. Please set the correct `Fix Version`, `Component/s`, `Labels` and
> `Flags` tags for the subtasks in JIRA.
> Some examples are [1][2].
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
> [1] https://issues.apache.org/jira/browse/HDFS-13891
> [2] https://issues.apache.org/jira/browse/HDFS-17531
>
>
> On Thu, Jan 2, 2025 at 10:54 AM Zhanghaobo <hfutzhan...@163.com> wrote:
>
>> Thanks for your great work! +1 for merging the phase 1 code.
>>
>> My production clusters have been running the phase 1 code for several
>> months, and it looks good.
>>
>> Hope to push this feature forward.
>>
>>
>>
>> 张浩博
>> hfutzhan...@163.com
>>
>> ---- Replied Message ----
>> From: haiyang hu <haiyang87...@gmail.com>
>> Date: 12/31/2024 23:08
>> To: Ayush Saxena <ayush...@gmail.com>
>> Cc: Hui Fei <feihui.u...@gmail.com>, ZanderXu <zande...@apache.org>,
>> Hdfs-dev <hdfs-dev@hadoop.apache.org>, <priv...@hadoop.apache.org>,
>> Xiaoqiao He <hexiaoq...@apache.org>, slfan1989 <slfan1...@apache.org>,
>> <xuzq_zan...@163.com>
>> Subject: Re: Discussion about NameNode Fine-grained locking
>> Thanks for your hard work and for pushing this forward.
>> It looks good. +1 for merging the phase 1 code; I hope we can work together to
>> promote this major HDFS optimization,
>> so that more companies can benefit from it.
>>
>> Thanks everyone~
>>
>> Ayush Saxena <ayush...@gmail.com> wrote on Tue, 31 Dec 2024 at 20:33:
>>
>> +1,
>> Thanx folks for your efforts on this! I didn't have time to review
>> everything thoroughly, but my initial pass suggests it looks good, or at
>> least is safe to merge.
>> If I find some spare time, I'll test it further and submit a ticket or
>> so if I encounter any issues.
>>
>> Good Luck!!!
>>
>> -Ayush
>>
>> On Tue, 31 Dec 2024 at 16:39, Hui Fei <feihui.u...@gmail.com> wrote:
>>
>>
>> Thanks Zander for bringing this discussion up again and trying your best to
>>
>> push it forward. It's been a really long time since the last discussion.
>>
>>
>> It's indeed time. +1 for merging the phase 1 code, based on the following
>>
>> points:
>>
>> - The phase 1 feature has been running at scale within companies for a
>>
>> long time
>>
>> - The long-term plan is clear and also addresses some questions raised
>>
>> by the community
>>
>> - The testing results of future features on memory and performance
>>
>> ZanderXu <zande...@apache.org> wrote on Tue, 31 Dec 2024 at 15:36:
>>
>>
>> Hi, everyone:
>>
>> Time to Merge FGL Phase I
>>
>> The PR for FGL Phase I is ready for merging! Please take a moment to
>>
>> review and cast your vote: https://github.com/apache/hadoop/pull/6762.
>>
>>
>> The FGL Phase I has been running successfully in production for over
>>
>> six months at Shopee and BOSS Zhipin, with no reported performance or
>> stability issues. It’s now the right time to merge it into the trunk
>> branch, allowing us to move forward with Phase II.
>>
>>
>> The global lock remains the default lock mode, but users can enable FGL
>>
>> by configuring
>>
>> dfs.namenode.lock.model.provider.class=org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.
>>
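For reference, the same setting in hdfs-site.xml form (the key and class name are exactly as quoted above; the XML wrapper is just the standard Hadoop configuration format):

```xml
<!-- Enables fine-grained locking; the default (global lock) is used
     when this property is left unset. -->
<property>
  <name>dfs.namenode.lock.model.provider.class</name>
  <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
</property>
```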
>>
>> If there are no objections within 7 days, I will propose an official
>>
>> vote.
>>
>>
>> Performance and Memory Usage of Phase I
>>
>> Conclusion:
>>
>> Fine-grained locks do not lead to significant performance improvements.
>>
>> Fine-grained locks do not result in additional memory consumption.
>>
>> Reasons:
>>
>> BM operations heavily depend on FS operations: IBR and BR still acquire
>>
>> the global lock (FSLock and BMLock).
>>
>>
>> FS operations depend on BM operations: Common operations (create,
>>
>> addBlock, getBlockLocations) also acquire the global lock (FSLock and
>> BMLock).
>>
>>
>> Phase II will bring significant performance improvements by decoupling
>>
>> FS and BM dependencies and replacing the global FSLock with a fine-grained
>> IIPLock.
>>
>>
>> Addressing Common Questions
>>
>> Thank you all for raising meaningful questions!
>>
>> I have rewritten the design document to improve clarity.
>>
>>
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?usp=sharing
>>
>>
>> Below is a summary of frequently asked questions and answers:
>>
>> Summary of Questions:
>>
>> Question 1: How is the performance of LockPoolManager?
>>
>> Performance Report:
>>
>> Time to acquire a cached lock: 194 ns
>>
>> Time to acquire a non-cached lock: 1044 ns
>>
>> Time to release an in-use lock: 88 ns
>>
>> Time to release an unused lock: 112 ns
>>
>> Overall Performance:
>>
>> QPS: Over 10 million
>>
>> Time to acquire the IIP lock for a path with depth 10:
>>
>> Fully uncached: 10440 ns + 1120 ns (≈ 11 μs)
>>
>> Fully cached: 1940 ns + 1120 ns (≈ 3 μs)
>>
>> In global lock scenarios, lock wait times are typically in the
>>
>> millisecond range. Therefore, the cost of acquiring and releasing
>> fine-grained locks can be ignored.
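A quick arithmetic check of the depth-10 figures above (not HDFS code; it just assumes, as the numbers suggest, one lock acquire per path component plus one release of an unused lock per component):

```python
# Per-lock timings from the performance report above, in nanoseconds.
ACQUIRE_NON_CACHED = 1044
ACQUIRE_CACHED = 194
RELEASE_UNUSED = 112

DEPTH = 10  # one lock per component of a depth-10 path

acquire_uncached = DEPTH * ACQUIRE_NON_CACHED  # 10440 ns
acquire_cached = DEPTH * ACQUIRE_CACHED        # 1940 ns
release_all = DEPTH * RELEASE_UNUSED           # 1120 ns

print(acquire_uncached + release_all)  # 11560 ns, about 11 us
print(acquire_cached + release_all)    # 3060 ns, about 3 us
```

Against millisecond-scale waits under a global lock, both totals are indeed negligible.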
>>
>>
>> Question 2: How much memory does the FGL consume?
>>
>> Memory Consumption:
>>
>> A single LockResource contains a read-write lock and a counter,
>>
>> totaling approximately 200 bytes:
>>
>>
>> LockResource: 24 bytes
>>
>> ReentrantReadWriteLock: 150 bytes
>>
>> AtomicInteger: 16 bytes
>>
>> Memory Usage Estimates:
>>
>> 10-level directory depth, 100 handlers
>>
>> 1000 lock resources, approximately 200 KB
>>
>> 10-level directory depth, 1000 handlers
>>
>> 10000 lock resources, approximately 2 MB
>>
>> 1,000,000 lock resources, approximately 200 MB
>>
>> Conclusion: Memory consumption is negligible.
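The estimates above can be reproduced from the per-object sizes (a sketch; the byte counts are the approximate figures quoted, not measured values):

```python
# Approximate sizes quoted above, in bytes.
PER_LOCK = 24 + 150 + 16  # LockResource + ReentrantReadWriteLock + AtomicInteger = 190, ~200

for count in (1_000, 10_000, 1_000_000):
    total_bytes = count * PER_LOCK
    print(f"{count:>9} lock resources -> ~{total_bytes / 1e6:.2f} MB")
# ~0.19 MB (200 KB), ~1.9 MB, ~190 MB -- matching the ~200 KB / 2 MB / 200 MB figures
```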
>>
>> Question 3: What happens if no lock is available in the LockPoolManager?
>>
>> If there are no available LockResources, two solutions are
>>
>> available:
>>
>>
>> Return a RetryException, prompting the client to retry later.
>>
>> Temporarily increase the lock entity limit, allocate more locks to meet
>>
>> client requests, and use an asynchronous thread to recycle locks
>> periodically.
>>
>>
>> We can provide multiple LockPoolManager implementations for users to
>>
>> choose from based on production environments.
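As an illustration of the second solution (temporarily exceed the limit, then reclaim asynchronously), here is a minimal sketch of a ref-counted lock pool. All names are hypothetical; the real LockPoolManager is Java inside the NameNode, and this only shows the shape of the idea:

```python
import threading

class LockPool:
    """Illustrative ref-counted lock pool, not the HDFS implementation."""

    def __init__(self, soft_limit):
        self.soft_limit = soft_limit  # advisory size; allocation may exceed it temporarily
        self._guard = threading.Lock()
        self._entries = {}  # key -> [lock, refcount]

    def acquire(self, key):
        # Hand out the cached lock if present; otherwise allocate, even past
        # the soft limit (solution 2). A stricter pool would instead raise a
        # retry error here (solution 1).
        with self._guard:
            entry = self._entries.setdefault(key, [threading.RLock(), 0])
            entry[1] += 1
            return entry[0]

    def release(self, key):
        with self._guard:
            self._entries[key][1] -= 1

    def sweep(self):
        # Run periodically by a background thread: drop unused entries until
        # the pool is back under its soft limit.
        with self._guard:
            for key in [k for k, (_, rc) in self._entries.items() if rc == 0]:
                if len(self._entries) <= self.soft_limit:
                    break
                del self._entries[key]

pool = LockPool(soft_limit=2)
for key in ("a", "b", "c"):  # the third allocation exceeds the soft limit
    pool.acquire(key)
pool.release("c")
pool.sweep()  # reclaims the unused "c" entry
print(len(pool._entries))  # 2
```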
>>
>>
>> Question 4: Regarding the IIPLock lock depth issue, can we consider
>>
>> holding only the first 3 or 4 levels of directory locks?
>>
>>
>> This approach is not recommended for the following reasons:
>>
>> Cannot maximize concurrency.
>>
>> Limited savings in lock acquisition/release time and memory usage,
>>
>> yielding insignificant benefits.
>>
>>
>> Question 5: How should attributes like StoragePolicy, ErasureCoding,
>>
>> and ACL, which can be set on parent or ancestor directory nodes, be
>> handled?
>>
>>
>> ErasureCoding and ACL:
>>
>> When changing node attributes, hold the corresponding INode’s write
>>
>> lock.
>>
>>
>> When using ancestor node attributes, hold the corresponding INode’s
>>
>> read lock.
>>
>>
>> StoragePolicy:
>>
>> More complex due to its impact on both directory tree operations and
>>
>> Block operations.
>>
>>
>> To improve performance, commonly used block-related operations (such as
>>
>> BR/IBR) should not acquire the IIPLock.
>>
>>
>> Detailed design documentation:
>>
>>
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.96lztsl4mwfk
>>
>>
>> Question 6: How should FGL be implemented for the SNAPSHOT feature?
>>
>> Since the Rename operation on a SNAPSHOT directory is supported,
>>
>> holding only the write lock of the SNAPSHOT root directory cannot cover
>> the
>> rename situation, so the thread safety of SNAPSHOT-related operations
>> cannot be guaranteed.
>>
>>
>> It is recommended to use the global FS lock to ensure thread safety.
>>
>> Detailed design documentation:
>>
>>
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.sm36p6bfcpec
>>
>>
>> Question 7: How should FGL be implemented for the Symlinks feature?
>>
>> The target path of a Symlink is a string, and the client performs a
>>
>> second forward access to the target path, so the fine-grained lock project
>> requires no special handling.
>>
>>
>> For the createSymlink RPC, the FGL needs to acquire the IIPLocks for
>>
>> both target and link paths.
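When one RPC must take locks on two paths, the standard way to avoid deadlock between concurrent createSymlink calls is to acquire them in a consistent (for example, sorted) order. A small sketch of that ordering idea only, with hypothetical names, not the actual INode locking code:

```python
import threading

# One lock per path for illustration; real FGL locks INode components.
path_locks = {
    "/data/target": threading.Lock(),
    "/links/alias": threading.Lock(),
}

def lock_pair(path_a, path_b):
    # Sort so that every caller acquires the two locks in the same order,
    # which rules out the classic A-waits-for-B, B-waits-for-A deadlock.
    first, second = sorted((path_a, path_b))
    path_locks[first].acquire()
    path_locks[second].acquire()
    return first, second

def unlock_pair(first, second):
    # Release in reverse acquisition order.
    path_locks[second].release()
    path_locks[first].release()

held = lock_pair("/links/alias", "/data/target")
print(held)  # ('/data/target', '/links/alias')
unlock_pair(*held)
```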
>>
>>
>> Question 8: How should FGL be implemented for the reserved feature?
>>
>> The Reserved feature has two usage modes:
>>
>> /.reserved/iNodes/${inode id}
>>
>> /.reserved/raw/${path}
>>
>> INodeId Mode: During the resolvePath phase, obtain the real IIPLock
>>
>> via the INodeId.
>>
>>
>> Path Mode: During the resolvePath phase, obtain the real IIPLock
>>
>> via the path.
>>
>>
>> Detailed design documentation:
>>
>>
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.h6rcpzkbpanf
>>
>>
>> Question 9: Why is INodeFileLock used as the FGL for BlockInfo?
>>
>> INodeFile and Block have mutual dependencies:
>>
>> INodeFile depends on Block for state and size.
>>
>> Block depends on INodeFile for state and storage policy.
>>
>> Therefore, using INodeFileLock as the fine-grained lock for BlockInfo
>>
>> is a reasonable choice.
>>
>>
>> Detailed design documentation:
>>
>>
>> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.zesd6omuu3kr
>>
>>
>> Seeking Community Feedback
>>
>> Your questions and concerns are always welcome.
>>
>> We can discuss them in detail on the Slack Channel:
>>
>> https://app.slack.com/client/T4S1WH2J3/C06UDTBQ2SH
>>
>>
>> Let’s work together to advance the Fine-Grained Lock project. I believe
>>
>> this initiative will deliver significant performance improvements to the
>> HDFS community and help reinvigorate its activity.
>>
>>
>> Wishing everyone a Happy New Year 2025!
>>
>>
>> On Wed, 5 Jun 2024 at 16:17, ZanderXu <zande...@apache.org> wrote:
>>
>>
>> I plan to hold a meeting on 2024-06-06 from 3:00 PM to 4:00 PM to share
>>
>> the FGL's motivations and some concerns in detail in Chinese.
>>
>>
>> The doc is: NameNode Fine-Grained Locking Based On Directory Tree (II)
>>
>> The meeting URL is: https://sea.zoom.us/j/94168001269
>>
>> You are welcome to join this meeting.
>>
>> On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:
>>
>>
>> BTW, there is a Slack channel hdfs-fgl for this feature. You can join it
>>
>> and discuss more details there.
>>
>>
>> Is it necessary to hold a meeting to discuss this, so that we can
>>
>> push it forward quickly? Agreed with ZanderXu, it seems inefficient to
>> discuss details via the mailing list.
>>
>>
>>
>> Hui Fei <feihui.u...@gmail.com> wrote on Mon, 6 May 2024 at 23:50:
>>
>>
>> Thanks all
>>
>> Seems all concerns are related to stage 2. We can address these
>>
>> and make them clearer before we start it.
>>
>>
>> From development experience, I think it is reasonable to split a
>>
>> big feature into several stages. And stage 1 is also independent; it
>> can stand on its own as a minor feature that uses FS and BM locks instead
>> of the global lock.
>>
>>
>>
>> ZanderXu <zande...@apache.org> wrote on Mon, 29 Apr 2024 at 15:17:
>>
>>
>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He
>> <hexiaoq...@apache.org> for your nice questions.
>>
>> Let me summarize your concerns and corresponding solutions:
>>
>> *1. Questions about the Snapshot feature*
>> It's difficult to apply the FGL to the Snapshot feature, but we can
>>
>> just use
>>
>> the global FS write lock to make it thread-safe.
>> So if we can identify whether a path involves the snapshot feature, we
>>
>> can just
>>
>> use the global FS write lock to protect it.
>>
>> You can refer to HDFS-17479
>> <https://issues.apache.org/jira/browse/HDFS-17479> to see how to
>>
>> identify
>>
>> it.
>>
>> Regarding performance of the operations related to the snapshot
>>
>> features,
>>
>> we can discuss it in two categories:
>> Read operations involving snapshots:
>> The FGL branch uses the global write lock to protect them, the
>>
>> GLOBAL
>>
>> branch uses the global read lock to protect them. It's hard to
>>
>> conclude
>>
>> which version has better performance, it depends on the global lock
>> competition.
>>
>> Write operations involving snapshots:
>> Both FGL and GLOBAL branch use the global write lock to protect
>>
>> them. It's
>>
>> hard to conclude which version has better performance, it depends
>>
>> on the
>>
>> global lock competition too.
>>
>> So I think if namenode load is low, the GLOBAL branch will have a
>>
>> better
>>
>> performance than FGL; If namenode load is high, the FGL branch may
>>
>> have a
>>
>> better performance than the GLOBAL, which also depends on the ratio
>>
>> of read
>>
>> and write operations on the SNAPSHOT feature.
>>
>> We can do some things to let end-users choose a branch with
>>
>> better
>>
>> performance according to their business:
>> First, we need to make the lock mode selectable, so that
>>
>> end-users
>>
>> can choose to use FGL or GLOBAL.
>> Second, use the global write lock to make operations related to
>>
>> snapshots
>>
>> thread-safe, as I described in HDFS-17479.
>>
>>
>> *2. Questions about the Symlinks feature*
>> If Symlink is related to snapshot, we can refer to the solution of
>>
>> the
>>
>> snapshot; if a Symlink is not related to a snapshot, I think it's easy
>>
>> to fit into
>>
>> the FGL.
>> Only createSymlink involves two paths; FGL just needs to lock them
>>
>> in
>>
>> order to make this operation thread-safe. For other operations, it is
>>
>> the same
>>
>> as for other normal iNodes, right?
>>
>> If I missed difficult points, please let me know.
>>
>>
>> *3. Questions about Memory Usage of iNode locks*
>> I think there are many solutions to limit the memory usage of
>>
>> these
>>
>> iNode locks, such as: using a limited-capacity lock pool to cap the
>> maximum memory usage, holding iNode locks only for a fixed depth of
>> directories, etc.
>>
>> We can just abstract this LockManager first and then support its
>> implementation with different ideas, so that we can limit the
>>
>> maximum
>>
>> memory usage of these iNode locks.
>> FGL can acquire or lease iNode locks through LockManager.
>>
>>
>> *4. Questions about Performance of acquiring and releasing iNode
>>
>> locks*
>>
>> We can add some benchmarks for LockManager to test the performance
>>
>> of
>>
>> acquiring and releasing uncontended locks.
>>
>>
>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>> These policies may be set on an ancestor node and used by some
>>
>> children
>>
>> files.  The set operations for these policies will be protected by
>>
>> the
>>
>> directory tree, since these are all file-related operations.  In
>>
>> addition
>>
>> to Quota and StoragePolicy, the use of other policies will also be
>> protected by the directory tree, such as ECPolicy and ACL.
>>
>> Quota is a little special since its update operations may not be
>>
>> protected
>>
>> by the directory tree; we can assign a lock to each QuotaFeature
>>
>> and use
>>
>> these locks to make updating operations thread-safe. You can refer
>>
>> to
>>
>> HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> to
>>
>> get some
>>
>> detailed information.
>>
>> StoragePolicy is a little special since it is used not only by
>>
>> file-related
>>
>> operations but also block-related operations.
>>
>> ProcessExtraRedundancyBlock
>>
>> uses storage policy to choose redundancy replicas and
>> BlockReconstructionWork uses storage policy to choose target DNs.
>>
>> In order
>>
>> to maximize the performance improvement, BR and IBR should only
>>
>> involve the
>>
>> iNodeFile to which the current processing block belongs. These
>>
>> redundancy
>>
>> blocks can be processed by the Redundancy monitor while holding the
>> directory tree locks. You can refer to HDFS-17505
>> <https://issues.apache.org/jira/browse/HDFS-17505> to get more
>>
>> detailed
>>
>> information.
>>
>> *6. Performance of the phase 1*
>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is
>>
>> used to do
>>
>> some performance testing for phase 1, and I will complete it later.
>>
>>
>> Discussing solutions through mail is not efficient; you can create
>> sub-tasks under HDFS-17366
>> <https://issues.apache.org/jira/browse/HDFS-17366> to describe your
>> concerns, and I will try to give some answers.
>>
>> Thanks @Ayush Saxena <ayush...@gmail.com>  and @Xiaoqiao He
>> <hexiaoq...@apache.org> again.
>>
>>
>>
>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com>
>>
>> wrote:
>>
>>
>> Thanx Everyone for chasing this! Great to see some momentum
>>
>> around FGL,
>>
>> that should be a great improvement.
>>
>> I have comments in two broad categories:
>> ** About the process:*
>> I think in the above mails, there are mentions that phase one is
>>
>> complete
>>
>> in a feature branch & we are gonna merge that to trunk. If I am
>>
>> catching it
>>
>> right, then you can't hit the merge button like that. To merge a
>>
>> feature
>>
>> branch, you need to call for a Vote specific to that branch & it
>>
>> requires 3
>>
>> binding votes to merge, unlike any other code change which
>>
>> requires 1. It
>>
>> is there in our Bylaws.
>>
>> So, do follow the process.
>>
>> ** About the feature itself:* (A very quick look at the doc and
>>
>> the Jira,
>>
>> so please take it with a grain of salt)
>> * The Google Drive link that you folks shared as part of the
>>
>> first mail. I
>>
>> don't have access to that. So, please open up the permissions for
>>
>> that doc
>>
>> or share the new link
>> * Chasing the design doc present on the Jira
>> * I think we only have Phase-1 ready, so can you share some
>>
>> metrics just
>>
>> for that? Perf improvements just with splitting the FS & BM Locks
>> * The memory implications of Phase-1? I don't think there should
>>
>> be any
>>
>> major impact on the memory in case of just phase-1
>> * Regarding the snapshot stuff, you mentioned taking lock on the
>>
>> root
>>
>> itself? Does just taking lock on the snapshot root rather than
>>
>> the FS root
>>
>> works?
>> * Secondly about the usage of Snapshot or Symlinks, I don't think
>>
>> we
>>
>> should operate under the assumptions that they aren't widely used
>>
>> or not,
>>
>> we might just not know folks who don't use it widely or they are
>>
>> just users
>>
>> not the ones contributing. We can just accept for now, that in
>>
>> those cases
>>
>> it isn't optimised and we just lock the entire FS space, which it
>>
>> does even
>>
>> today, so no regressions there.
>> * Regarding memory usage: Do you have some numbers on how much
>>
>> the memory
>>
>> footprint increases?
>> * Under the Lock Pool: I think you are assuming there would be
>>
>> very few
>>
>> inodes where lock would be required at any given time, so there
>>
>> won't be
>>
>> too much heap consumption? I think you are compromising on the
>>
>> Horizontal
>>
>> Scalability here. I doubt if your assumption doesn't hold true,
>>
>> under heavy
>>
>> read load by concurrent clients accessing different inodes, the
>>
>> Namenode
>>
>> will start giving memory troubles, that would do more harm than
>>
>> good.
>>
>> Anyway Namenode heap is way bigger problem than anything, so we
>>
>> should be
>>
>> very careful increasing load over there.
>> * For the Locks on the inodes: Do you plan to have locks for each
>>
>> inode?
>>
>> Can we somehow limit that to the depth of the tree? Like
>>
>> currently we take
>>
>> lock on the root, have a config which makes us take lock at
>>
>> Level-2 or 3
>>
>> (configurable), that might fetch some perf benefits and can be
>>
>> used to
>>
>> control the memory usage as well?
>> * What is the cost of creating these inode locks? If the lock
>>
>> isn't
>>
>> already cached it would incur some cost? Do you have some numbers
>>
>> around
>>
>> that? Say I disable caching altogether & then let a test load
>>
>> run, what
>>
>> do the perf numbers look like in that case?
>> * I think we need to limit the size of INodeLockPool, we can't
>>
>> let it grow
>>
>> infinitely in case of heavy loads and we need to have some auto
>> throttling mechanism for it
>> * I didn't catch your Storage Policy problem. If I decode it
>>
>> right, the
>>
>> problem is like the policy could be set on an ancestor node & the
>>
>> children
>>
>> abide by that & this is the problem, if that is the case then
>>
>> isn't that
>>
>> the case with ErasureCoding policies or even ACLs or so? Can you
>>
>> elaborate
>>
>> a bit on that.
>>
>>
>> Anyway, regarding the Phase-1. If you share (the perf numbers
>>
>> with proper
>>
>> details + Impact on memory if any) for just phase 1 & if they are
>>
>> good,
>>
>> then if you call for a branch merge vote for Phase-1 FGL, you
>>
>> have my vote,
>>
>> however you'll need to sway the rest of the folks on your own :-)
>>
>> Good Luck, Nice Work Guys!!!
>>
>> -Ayush
>>
>>
>> On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org>
>>
>> wrote:
>>
>>
>> Thanks ZanderXu and Hui Fei for your work on this feature. It
>>
>> will be
>>
>> a very helpful improvement for the HDFS module in the next
>>
>> journey.
>>
>>
>> 1. If we need any more review bandwidth, I would like to be
>>
>> involved
>>
>> to help review if possible.
>> 2. From the design document there are still missing some detailed
>> descriptions such as snapshot, symbolic link and reserved etc as
>>
>> mentioned
>>
>> above. I think it will be helpful for newbies who want to be
>>
>> involved
>>
>> if all corner
>> cases are considered and described.
>> 3. From slack, we plan to check into the trunk at this phase. I
>>
>> am not
>>
>> sure
>> if it is the proper time; following the dev plan, there are two
>>
>> steps left
>>
>> to
>> finish this feature from the design document, right? If so, I
>>
>> think we
>>
>> should
>> postpone checking in when all plans are ready. Considering that
>>
>> there are
>>
>> many unfinished attempts at this feature in history, I think
>>
>> postponing the
>>
>> check-in
>> will be the safe way; otherwise it will involve more rebase
>>
>> cost if you
>>
>> keep a
>> separate dev branch. However, I think it is not a difficult
>>
>> thing for
>>
>> you.
>>
>> Good luck and look forward to making that happen soon!
>>
>> Best Regards,
>> - He Xiaoqiao
>>
>> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com>
>>
>> wrote:
>>
>>
>> Thanks for interest and advice on this.
>>
>> Just would like to share some info here
>>
>> ZanderXu leads this feature and he has spent a lot of time on
>>
>> it. He is
>>
>> the main developer in stage 1.  Yuanboliu and Kokonguyen191 also
>>
>> took some
>>
>> tasks. Other developers (slfan1989 haiyang1987 huangzhaobo99
>>
>> RocMarshal
>>
>> kokonguyen191) helped review PRs. (Forgive me if I missed
>>
>> someone)
>>
>>
>> Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
>>
>> familiar with this feature. We discussed many details offline.
>>
>>
>> Welcome to more people interested in joining the development
>>
>> and review
>>
>> of the stage 2 and 3.
>>
>>
>>
>> Zengqiang XU <xuzengqiang5...@gmail.com> wrote on Fri, 26 Apr 2024
>>
>> at 14:56:
>>
>>
>> Thanks Shilun for your response:
>>
>> 1. This is a big and very useful feature, so it really needs
>>
>> more
>>
>> developers to get on board.
>> 2. This fine grained lock has been implemented based on
>>
>> internal
>>
>> branches
>>
>> and has gained benefits by many companies, such as: Meituan,
>>
>> Kuaishou,
>>
>> Bytedance, etc.  But it has not been contributed to the
>>
>> community due
>>
>> to
>>
>> various reasons, such as there is a big difference between
>>
>> the version
>>
>> of
>>
>> the internal branch and the community trunk branch, the
>>
>> internal
>>
>> branch may
>>
>> ignore some functions to make FGL clear, and the contribution
>>
>> needs a
>>
>> lot
>>
>> of work and will take a long time. It means that this solution
>>
>> has
>>
>> already
>>
>> been practiced in their prod environment. We have also
>>
>> practiced it in
>>
>> our
>>
>> prod environment and gained benefits, and we are also willing
>>
>> to spend
>>
>> a
>>
>> lot of time contributing to the community.
>> 3. Regarding the benchmark testing, we don't need to pay too much
>>
>> attention to
>>
>> whether the performance is improved by 5 times, 10 times or
>>
>> 20 times,
>>
>> because there are too many factors that affect it.
>> 4. As I described above, this solution is already being
>>
>> practiced by
>>
>> many
>>
>> companies. Right now, we just need to think about how to
>>
>> implement it
>>
>> with
>>
>> high quality and more comprehensively.
>> 5. I firmly believe that all problems can be solved as long
>>
>> as the
>>
>> overall
>>
>> solution is right.
>> 6. I can spend a lot of time leading the promotion of this
>>
>> entire
>>
>> feature
>>
>> and I hope more people can join us in promoting it.
>> 7. You are always welcome to raise your concerns.
>>
>>
>> Thanks Shilun again, I hope you can help review designs and
>>
>> PRs. Thanks
>>
>>
>> On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org>
>>
>> wrote:
>>
>>
>> Thank you for your hard work! This is a very meaningful
>>
>> improvement,
>>
>> and
>>
>> from the design document, we can see a significant increase
>>
>> in HDFS
>>
>> read/write throughput.
>>
>> I am happy to see the progress made on HDFS-17384.
>>
>> However, I still have some concerns, which roughly involve
>>
>> the
>>
>> following
>>
>> aspects:
>>
>> 1. While ZanderXu and Hui Fei have deep expertise in HDFS
>>
>> and are
>>
>> familiar
>>
>> with related development details, we still need more
>>
>> community
>>
>> members to
>>
>> review the code to ensure that the relevant upgrades meet
>>
>> expectations.
>>
>>
>> 2. We need more details on benchmarks to ensure that test
>>
>> results
>>
>> can be
>>
>> reproduced and to allow more community members to
>>
>> participate in the
>>
>> testing
>>
>> process.
>>
>> Looking forward to everything going smoothly in the future.
>>
>> Best Regards,
>> - Shilun Fan.
>>
>> On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <
>>
>> hexiaoq...@apache.org>
>>
>> wrote:
>>
>>
>> cc private@h.a.o.
>>
>> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <
>>
>> zande...@apache.org>
>>
>> wrote:
>>
>>
>> Here are some summaries about the first phase:
>> 1. There are no big changes in this phase
>> 2. This phase just uses FS lock and BM lock to replace
>>
>> the
>>
>> original
>>
>> global
>>
>> lock
>> 3. It's useful to improve the performance, since some
>>
>> operations
>>
>> just
>>
>> need
>>
>> to hold FS lock or BM lock instead of the global lock
>> 4. This feature is turned off by default, you can enable
>>
>> it by
>>
>> setting
>>
>> dfs.namenode.lock.model.provider.class to
>>
>>
>> org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
>>
>> 5. This phase is very important for the ongoing development
>>
>> of the
>>
>> entire
>>
>> FGL
>>
>>
>> Here I would like to express my special thanks to
>>
>> @kokonguyen191
>>
>> and
>>
>> @yuanboliu for their contributions.  And you are also
>>
>> welcome to
>>
>> join us
>>
>> and complete it together.
>>
>>
>> On Wed, 24 Apr 2024 at 14:54, ZanderXu <
>>
>> zande...@apache.org>
>>
>> wrote:
>>
>>
>> Hi everyone
>>
>> All subtasks of the first phase of the FGL have been
>>
>> completed
>>
>> and I
>>
>> plan
>>
>> to merge them into the trunk and start the second
>>
>> phase based
>>
>> on the
>>
>> trunk.
>>
>>
>> Here is the PR that used to merge the first phases
>>
>> into trunk:
>>
>> https://github.com/apache/hadoop/pull/6762
>> Here is the ticket:
>>
>> https://issues.apache.org/jira/browse/HDFS-17384
>>
>>
>> I hope you can help to review this PR when you are
>>
>> available
>>
>> and give
>>
>> some
>>
>> ideas.
>>
>>
>> HDFS-17385 <
>>
>> https://issues.apache.org/jira/browse/HDFS-17385>
>>
>> is
>>
>> used for
>>
>> the second phase and I have created some subtasks to
>>
>> describe
>>
>> solutions for
>>
>> some problems, such as: snapshot, getListing, quota.
>> You are welcome to join us to complete it together.
>>
>>
>> ---------- Forwarded message ---------
>> From: Zengqiang XU <zande...@apache.org>
>> Date: Fri, 2 Feb 2024 at 11:07
>> Subject: Discussion about NameNode Fine-grained locking
>> To: <hdfs-dev@hadoop.apache.org>
>> Cc: Zengqiang XU <xuzengqiang5...@gmail.com>
>>
>>
>> Hi everyone
>>
>> I have started a discussion about NameNode
>>
>> Fine-grained Locking
>>
>> to
>>
>> improve
>>
>> performance of write operations in NameNode.
>>
>> I started this discussion again for several main
>>
>> reasons:
>>
>> 1. We have implemented it and gained nearly 7x
>>
>> performance
>>
>> improvement in
>>
>> our prod environment
>> 2. Many other companies made similar improvements
>>
>> based on their
>>
>> internal
>>
>> branch.
>> 3. This topic has been discussed for a long time, but
>>
>> still
>>
>> without
>>
>> any
>>
>> results.
>>
>> I hope we can push this important improvement in the
>>
>> community
>>
>> so
>>
>> that all
>>
>> end-users can enjoy this significant improvement.
>>
>> I'd really appreciate it if you can join in and work with me
>>
>> to push
>>
>> this
>>
>> feature forward.
>>
>> Thanks very much.
>>
>> Ticket: HDFS-17366 <
>>
>> https://issues.apache.org/jira/browse/HDFS-17366>
>>
>> Design: NameNode Fine-grained locking based on
>>
>> directory tree
>>
>> <
>>
>>
>>
>>
>> https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail:
>>
>> private-unsubscr...@hadoop.apache.org
>>
>> For additional commands, e-mail:
>>
>> private-h...@hadoop.apache.org
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
>> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>>
