Thanks for your hard work and for pushing this forward.
It looks good, +1 for merging the phase 1 code. I hope we can work together to
promote this major HDFS optimization
so that more companies can benefit from it.

Thanks everyone~

Ayush Saxena <ayush...@gmail.com> wrote on Tue, Dec 31, 2024 at 20:33:

> +1,
> Thanx folks for your efforts on this! I didn't have time to review
> everything thoroughly, but my initial pass suggests it looks good, or
> at least is safe to merge.
> If I find some spare time, I'll test it further and submit a ticket or
> so if I encounter any issues.
>
> Good Luck!!!
>
> -Ayush
>
> On Tue, 31 Dec 2024 at 16:39, Hui Fei <feihui.u...@gmail.com> wrote:
> >
> > Thanks Zander for bringing this discussion up again and doing your best to
> push it forward. It's been a long time since the last discussion.
> >
> > It's indeed time. +1 for merging the phase 1 code, based on the following
> points:
> >  - The phase 1 feature has been running at scale within companies for a
> long time
> >  - The long-term plan is clear and addresses the questions raised by the
> community
> >  - The testing results on memory and performance for the future features
> >
> > ZanderXu <zande...@apache.org> wrote on Tue, Dec 31, 2024 at 15:36:
> >>
> >> Hi everyone,
> >>
> >> Time to Merge FGL Phase I
> >>
> >> The PR for FGL Phase I is ready for merging! Please take a moment to
> review and cast your vote: https://github.com/apache/hadoop/pull/6762.
> >>
> >> The FGL Phase I has been running successfully in production for over
> six months at Shopee and BOSS Zhipin, with no reported performance or
> stability issues. It’s now the right time to merge it into the trunk
> branch, allowing us to move forward with Phase II.
> >>
> >> The global lock remains the default lock mode, but users can enable FGL
> by configuring
> dfs.namenode.lock.model.provider.class=org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.
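> >>
> >> A minimal hdfs-site.xml sketch of that setting (only the key and value
> >> come from this thread; the surrounding XML is just the standard Hadoop
> >> configuration format):
> >>
> >>   <property>
> >>     <name>dfs.namenode.lock.model.provider.class</name>
> >>     <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
> >>   </property>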
> >>
> >> If there are no objections within 7 days, I will propose an official
> vote.
> >>
> >> Performance and Memory Usage of Phase I
> >>
> >> Conclusion:
> >>
> >> Fine-grained locks do not lead to significant performance improvements.
> >>
> >> Fine-grained locks do not result in additional memory consumption.
> >>
> >> Reasons:
> >>
> >> BM operations heavily depend on FS operations: IBR and BR still acquire
> the global lock (FSLock and BMLock).
> >>
> >> FS operations depend on BM operations: Common operations (create,
> addBlock, getBlockLocations) also acquire the global lock (FSLock and
> BMLock).
> >>
> >> Phase II will bring significant performance improvements by decoupling
> FS and BM dependencies and replacing the global FSLock with a fine-grained
> IIPLock.
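> >>
> >> To make the phase 1 structure concrete, here is a minimal sketch of the
> >> two-lock idea (illustrative only; the class below is an assumption for
> >> explanation, not the actual FineGrainedFSNamesystemLock from the PR):
> >>
> >>   import java.util.concurrent.locks.ReentrantReadWriteLock;
> >>
> >>   // Phase 1: the single global lock becomes an FS lock plus a BM lock.
> >>   class TwoLevelNamesystemLock {
> >>     private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
> >>     private final ReentrantReadWriteLock bmLock = new ReentrantReadWriteLock(true);
> >>
> >>     // FS-only operations can now run while BM-only work holds bmLock.
> >>     void fsReadLock()   { fsLock.readLock().lock(); }
> >>     void fsReadUnlock() { fsLock.readLock().unlock(); }
> >>
> >>     // Operations spanning both modules (create, addBlock,
> >>     // getBlockLocations, IBR/BR) still take both locks in a fixed order,
> >>     // which is why phase 1 alone shows little speedup for those paths.
> >>     void globalWriteLock()   { fsLock.writeLock().lock(); bmLock.writeLock().lock(); }
> >>     void globalWriteUnlock() { bmLock.writeLock().unlock(); fsLock.writeLock().unlock(); }
> >>   }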
> >>
> >> Addressing Common Questions
> >>
> >> Thank you all for raising meaningful questions!
> >>
> >> I have rewritten the design document to improve clarity.
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?usp=sharing
> >>
> >> Below is a summary of frequently asked questions and answers:
> >>
> >> Summary of Questions:
> >>
> >> Question 1: What is the performance of the LockPoolManager?
> >>
> >> Performance Report:
> >>
> >> Time to acquire a cached lock: 194 ns
> >>
> >> Time to acquire a non-cached lock: 1044 ns
> >>
> >> Time to release an in-use lock: 88 ns
> >>
> >> Time to release an unused lock: 112 ns
> >>
> >> Overall Performance:
> >>
> >> QPS: Over 10 million
> >>
> >> Time to acquire the IIP lock for a path with depth 10:
> >>
> >> Fully uncached: 10440 ns + 1120 ns (≈ 11 μs; ten non-cached acquires at 1044 ns each, plus ten releases at 112 ns each)
> >>
> >> Fully cached: 1940 ns + 1120 ns (≈ 3 μs; ten cached acquires at 194 ns each, plus the same ten releases)
> >>
> >> In global lock scenarios, lock wait times are typically in the
> millisecond range. Therefore, the cost of acquiring and releasing
> fine-grained locks can be ignored.
> >>
> >> Question 2: How much memory does the FGL consume?
> >>
> >> Memory Consumption:
> >>
> >> A single LockResource contains a read-write lock and a counter,
> totaling approximately 200 bytes:
> >>
> >> LockResource: 24 bytes
> >>
> >> ReentrantReadWriteLock: 150 bytes
> >>
> >> AtomicInteger: 16 bytes
> >>
> >> Memory Usage Estimates:
> >>
> >> 10-level directory depth, 100 handlers
> >>
> >> 1000 lock resources, approximately 200 KB
> >>
> >> 10-level directory depth, 1000 handlers
> >>
> >> 10000 lock resources, approximately 2 MB
> >>
> >> 1,000,000 lock resources, approximately 200 MB
> >>
> >> Conclusion: Memory consumption is negligible.
> >>
> >> Question 3: What happens if no lock is available in the LockPoolManager?
> >>
> >> If no LockResources are available, there are two solutions:
> >>
> >> Return a RetryException, prompting the client to retry later.
> >>
> >> Temporarily increase the lock entity limit, allocate more locks to meet
> client requests, and use an asynchronous thread to recycle locks
> periodically.
> >>
> >> We can provide multiple LockPoolManager implementations for users to
> choose from based on their production environments.
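> >>
> >> As a rough illustration of these two strategies (all names and types
> >> below are hypothetical, not the PR's actual API):
> >>
> >>   import java.util.concurrent.ConcurrentHashMap;
> >>   import java.util.concurrent.atomic.AtomicInteger;
> >>   import java.util.concurrent.locks.ReentrantReadWriteLock;
> >>
> >>   class BoundedLockPool {
> >>     static class LockResource {
> >>       final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
> >>       final AtomicInteger refCount = new AtomicInteger();
> >>     }
> >>
> >>     private final ConcurrentHashMap<Long, LockResource> pool = new ConcurrentHashMap<>();
> >>     private volatile int limit; // strategy 2 may raise this temporarily
> >>
> >>     BoundedLockPool(int limit) { this.limit = limit; }
> >>
> >>     LockResource acquire(long inodeId) throws Exception {
> >>       // Soft limit check; strategy 1 fails fast so the client retries later.
> >>       if (pool.size() >= limit && !pool.containsKey(inodeId)) {
> >>         throw new Exception("retry later: lock pool exhausted"); // stand-in for RetryException
> >>       }
> >>       LockResource r = pool.computeIfAbsent(inodeId, k -> new LockResource());
> >>       r.refCount.incrementAndGet();
> >>       return r;
> >>     }
> >>
> >>     void release(long inodeId) {
> >>       LockResource r = pool.get(inodeId);
> >>       if (r != null) {
> >>         r.refCount.decrementAndGet();
> >>         // An asynchronous recycler can evict entries whose refCount is 0;
> >>         // real code must handle the evict-versus-acquire race.
> >>       }
> >>     }
> >>   }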
> >>
> >> Question 4: Regarding the IIPLock lock depth issue, can we consider
> holding only the first 3 or 4 levels of directory locks?
> >>
> >> This approach is not recommended for the following reasons:
> >>
> >> Cannot maximize concurrency.
> >>
> >> Limited savings in lock acquisition/release time and memory usage,
> yielding insignificant benefits.
> >>
> >> Question 5: How should attributes like StoragePolicy, ErasureCoding,
> and ACL, which can be set on parent or ancestor directory nodes, be handled?
> >>
> >> ErasureCoding and ACL:
> >>
> >> When changing node attributes, hold the corresponding INode’s write
> lock.
> >>
> >> When using ancestor node attributes, hold the corresponding INode’s
> read lock.
> >>
> >> StoragePolicy:
> >>
> >> More complex due to its impact on both directory tree operations and
> Block operations.
> >>
> >> To improve performance, commonly used block-related operations (such as
> BR/IBR) should not acquire the IIPLock.
> >>
> >> Detailed design documentation:
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.96lztsl4mwfk
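> >>
> >> A minimal sketch of the rule above (assuming one ReentrantReadWriteLock
> >> per INode, ordered root to leaf in the array; this is illustrative, not
> >> the PR's actual code):
> >>
> >>   import java.util.concurrent.locks.ReentrantReadWriteLock;
> >>
> >>   class IIPLockingSketch {
> >>     // Read-lock every ancestor so inherited attributes (EC policy, ACL)
> >>     // cannot change underneath us; write-lock only the INode being
> >>     // changed. Assumes a non-empty lock array.
> >>     static void lockForSetAttr(ReentrantReadWriteLock[] pathLocks) {
> >>       int last = pathLocks.length - 1;
> >>       for (int i = 0; i < last; i++) {
> >>         pathLocks[i].readLock().lock();
> >>       }
> >>       pathLocks[last].writeLock().lock();
> >>     }
> >>
> >>     static void unlock(ReentrantReadWriteLock[] pathLocks) {
> >>       int last = pathLocks.length - 1;
> >>       pathLocks[last].writeLock().unlock();
> >>       for (int i = last - 1; i >= 0; i--) {
> >>         pathLocks[i].readLock().unlock();
> >>       }
> >>     }
> >>   }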
> >>
> >> Question 6: How should FGL be implemented for the SNAPSHOT feature?
> >>
> >> Since the Rename operation on a SNAPSHOT directory is supported,
> holding only the write lock of the SNAPSHOT root directory cannot cover the
> rename case, so the thread safety of SNAPSHOT-related operations
> cannot be guaranteed.
> >>
> >> It is recommended to use the global FS lock to ensure thread safety.
> >>
> >> Detailed design documentation:
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.sm36p6bfcpec
> >>
> >> Question 7: How should FGL be implemented for the Symlinks feature?
> >>
> >> The target path of a Symlink is a string, and the client performs a
> second forward access to the target path, so the fine-grained lock project
> requires no special handling.
> >>
> >> For the createSymlink RPC, the FGL needs to acquire the IIPLocks for
> both target and link paths.
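> >>
> >> A small sketch of the two-path case (the pool and helper below are
> >> hypothetical; the point is the fixed lock order that avoids deadlock):
> >>
> >>   import java.util.concurrent.ConcurrentHashMap;
> >>   import java.util.concurrent.locks.ReentrantReadWriteLock;
> >>
> >>   class SymlinkLockSketch {
> >>     private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
> >>         new ConcurrentHashMap<>();
> >>
> >>     // Stand-in for acquiring the IIPLock chain of one path.
> >>     private void lockPath(String p) {
> >>       locks.computeIfAbsent(p, k -> new ReentrantReadWriteLock())
> >>           .writeLock().lock();
> >>     }
> >>
> >>     // createSymlink touches two paths; lock them in a fixed
> >>     // (lexicographic) order so concurrent calls cannot deadlock.
> >>     void lockForCreateSymlink(String linkPath, String targetPath) {
> >>       String first  = linkPath.compareTo(targetPath) <= 0 ? linkPath : targetPath;
> >>       String second = first.equals(linkPath) ? targetPath : linkPath;
> >>       lockPath(first);
> >>       lockPath(second);
> >>     }
> >>   }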
> >>
> >> Question 8: How should FGL be implemented for the reserved feature?
> >>
> >> The Reserved feature has two usage modes:
> >>
> >> /.reserved/.inodes/${inode id}
> >>
> >> /.reserved/raw/${path}
> >>
> >> INodeId Mode: During the resolvePath phase, obtain the real IIPLock
> via the INode id.
> >>
> >> Path Mode: During the resolvePath phase, obtain the real IIPLock
> via the path.
> >>
> >> Detailed design documentation:
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.h6rcpzkbpanf
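> >>
> >> A rough sketch of the resolvePath branching described above (the lookup
> >> helper is a hypothetical stand-in for the INodeMap lookup):
> >>
> >>   class ReservedPathSketch {
> >>     static String resolveReserved(String path) {
> >>       final String inodePrefix = "/.reserved/.inodes/";
> >>       final String rawPrefix = "/.reserved/raw/";
> >>       if (path.startsWith(inodePrefix)) {
> >>         long inodeId = Long.parseLong(path.substring(inodePrefix.length()));
> >>         return lookupPathByInodeId(inodeId); // then lock the real IIP
> >>       } else if (path.startsWith(rawPrefix)) {
> >>         return "/" + path.substring(rawPrefix.length()); // then lock by path
> >>       }
> >>       return path;
> >>     }
> >>
> >>     // Hypothetical placeholder for resolving an inode id to its full path.
> >>     static String lookupPathByInodeId(long inodeId) {
> >>       return "/path/of/inode/" + inodeId;
> >>     }
> >>   }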
> >>
> >> Question 9: Why is INodeFileLock used as the FGL for BlockInfo?
> >>
> >> INodeFile and Block have mutual dependencies:
> >>
> >> INodeFile depends on Block for state and size.
> >>
> >> Block depends on INodeFile for state and storage policy.
> >>
> >> Therefore, using INodeFileLock as the fine-grained lock for BlockInfo
> is a reasonable choice.
> >>
> >> Detailed design documentation:
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.zesd6omuu3kr
> >>
> >> Seeking Community Feedback
> >>
> >> Your questions and concerns are always welcome.
> >>
> >> We can discuss them in detail on the Slack Channel:
> https://app.slack.com/client/T4S1WH2J3/C06UDTBQ2SH
> >>
> >> Let’s work together to advance the Fine-Grained Lock project. I believe
> this initiative will deliver significant performance improvements to the
> HDFS community and help reinvigorate its activity.
> >>
> >> Wishing everyone a Happy New Year 2025!
> >>
> >>
> >> On Wed, 5 Jun 2024 at 16:17, ZanderXu <zande...@apache.org> wrote:
> >>>
> >>> I plan to hold a meeting on 2024-06-06 from 3:00 PM to 4:00 PM to share
> the FGL's motivations and some concerns in detail, in Chinese.
> >>>
> >>> The doc is: NameNode Fine-Grained Locking Based On Directory Tree (II)
> >>>
> >>> The meeting URL is: https://sea.zoom.us/j/94168001269
> >>>
> >>> You are welcome to join this meeting.
> >>>
> >>> On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:
> >>>>
> >>>> BTW, there is a Slack channel, hdfs-fgl, for this feature. You can
> join it and discuss more details.
> >>>>
> >>>> Is it necessary to hold a meeting to discuss this, so that we can
> push it forward quickly? I agree with ZanderXu that it seems inefficient to
> discuss details via the mailing list.
> >>>>
> >>>>
> >>>> Hui Fei <feihui.u...@gmail.com> wrote on Mon, May 6, 2024 at 23:50:
> >>>>>
> >>>>> Thanks all
> >>>>>
> >>>>> It seems all concerns are related to stage 2. We can address these
> and make them clearer before we start it.
> >>>>>
> >>>>> From development experience, I think it is reasonable to split a
> big feature into several stages. Stage 1 is also independent, and it can
> stand on its own as a minor feature that uses FS and BM locks instead of the
> global lock.
> >>>>>
> >>>>>
> >>>>> ZanderXu <zande...@apache.org> wrote on Mon, Apr 29, 2024 at 15:17:
> >>>>>>
> >>>>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He
> >>>>>> <hexiaoq...@apache.org> for your nice questions.
> >>>>>>
> >>>>>> Let me summarize your concerns and corresponding solutions:
> >>>>>>
> >>>>>> *1. Questions about the Snapshot feature*
> >>>>>> It's difficult to apply the FGL to the Snapshot feature, but we can
> >>>>>> just use the global FS write lock to make it thread-safe.
> >>>>>> So if we can identify whether a path involves the snapshot feature,
> >>>>>> we can just use the global FS write lock to protect it.
> >>>>>>
> >>>>>> You can refer to HDFS-17479
> >>>>>> <https://issues.apache.org/jira/browse/HDFS-17479> to see how to
> >>>>>> identify it.
> >>>>>>
> >>>>>> Regarding the performance of operations related to the snapshot
> >>>>>> feature, we can discuss it in two categories:
> >>>>>> Read operations involving snapshots:
> >>>>>> The FGL branch uses the global write lock to protect them, while the
> >>>>>> GLOBAL branch uses the global read lock. It's hard to conclude which
> >>>>>> version has better performance; it depends on the global lock
> >>>>>> contention.
> >>>>>>
> >>>>>> Write operations involving snapshots:
> >>>>>> Both the FGL and GLOBAL branches use the global write lock to protect
> >>>>>> them. Again, it's hard to conclude which version has better
> >>>>>> performance; it depends on the global lock contention too.
> >>>>>>
> >>>>>> So I think if the NameNode load is low, the GLOBAL branch will
> >>>>>> perform better than FGL; if the NameNode load is high, the FGL branch
> >>>>>> may perform better than GLOBAL, though this also depends on the ratio
> >>>>>> of read to write operations on the SNAPSHOT feature.
> >>>>>>
> >>>>>> We can do a few things to let end-users choose the branch that
> >>>>>> better fits their business:
> >>>>>> First, we need to make the lock mode selectable, so that end-users
> >>>>>> can choose between FGL and GLOBAL.
> >>>>>> Second, use the global write lock to make operations related to
> >>>>>> snapshots thread-safe, as I described in HDFS-17479.
> >>>>>>
> >>>>>>
> >>>>>> *2. Questions about the Symlinks feature*
> >>>>>> If a Symlink is related to a snapshot, we can refer to the snapshot
> >>>>>> solution; if it is not, I think it can easily meet the FGL.
> >>>>>> Only createSymlink involves two paths; FGL just needs to lock them in
> >>>>>> a fixed order to make this operation thread-safe. For other
> >>>>>> operations, a symlink is the same as any other normal iNode, right?
> >>>>>>
> >>>>>> If I missed any difficult points, please let me know.
> >>>>>>
> >>>>>>
> >>>>>> *3. Questions about Memory Usage of iNode locks*
> >>>>>> I think there are many solutions to limit the memory usage of these
> >>>>>> iNode locks, such as using a limited-capacity lock pool to bound the
> >>>>>> maximum memory usage, holding iNode locks only for a fixed depth of
> >>>>>> directories, etc.
> >>>>>>
> >>>>>> We can abstract this LockManager first and then support
> >>>>>> implementations with different strategies, so that we can limit the
> >>>>>> maximum memory usage of these iNode locks.
> >>>>>> FGL can acquire or lease iNode locks through the LockManager.
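> >>>>>>
> >>>>>> A minimal sketch of such an abstraction (a hypothetical interface,
> >>>>>> not committed code):
> >>>>>>
> >>>>>>   import java.util.concurrent.locks.ReadWriteLock;
> >>>>>>
> >>>>>>   // FGL acquires iNode locks only through this manager, so pooling,
> >>>>>>   // depth limits, or other memory-bounding policies stay pluggable.
> >>>>>>   interface LockManager {
> >>>>>>     ReadWriteLock acquire(long inodeId); // may pool, create, or block
> >>>>>>     void release(long inodeId);          // return the lock for reuse
> >>>>>>     int size();                          // number of live lock entries
> >>>>>>   }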
> >>>>>>
> >>>>>>
> >>>>>> *4. Questions about Performance of acquiring and releasing iNode
> >>>>>> locks*
> >>>>>> We can add some benchmarks for the LockManager to test the
> >>>>>> performance of acquiring and releasing uncontended locks.
> >>>>>>
> >>>>>>
> >>>>>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
> >>>>>> These policies may be set on an ancestor node and used by child
> >>>>>> files. The set operations for these policies will be protected by the
> >>>>>> directory tree, since they are all file-related operations. Apart
> >>>>>> from Quota and StoragePolicy, the use of the other policies, such as
> >>>>>> ECPolicy and ACL, will also be protected by the directory tree.
> >>>>>>
> >>>>>> Quota is a little special since its update operations may not be
> >>>>>> protected by the directory tree; we can assign a lock to each
> >>>>>> QuotaFeature and use these locks to make update operations
> >>>>>> thread-safe. You can refer to HDFS-17473
> >>>>>> <https://issues.apache.org/jira/browse/HDFS-17473> for some detailed
> >>>>>> information.
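> >>>>>>
> >>>>>> A tiny sketch of the per-feature lock idea (the field and method
> >>>>>> names here are made up):
> >>>>>>
> >>>>>>   class QuotaFeatureSketch {
> >>>>>>     private final Object updateLock = new Object(); // one per QuotaFeature
> >>>>>>     private long nsConsumed; // namespace usage
> >>>>>>     private long ssConsumed; // storagespace usage
> >>>>>>
> >>>>>>     // Quota updates may arrive without the directory-tree lock held,
> >>>>>>     // so serialize them on the feature's own lock.
> >>>>>>     void addSpaceConsumed(long nsDelta, long ssDelta) {
> >>>>>>       synchronized (updateLock) {
> >>>>>>         nsConsumed += nsDelta;
> >>>>>>         ssConsumed += ssDelta;
> >>>>>>       }
> >>>>>>     }
> >>>>>>   }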
> >>>>>>
> >>>>>> StoragePolicy is a little special since it is used not only by
> >>>>>> file-related operations but also by block-related operations.
> >>>>>> ProcessExtraRedundancyBlock uses the storage policy to choose
> >>>>>> redundant replicas, and BlockReconstructionWork uses the storage
> >>>>>> policy to choose target DNs. In order to maximize the performance
> >>>>>> improvement, BR and IBR should only involve the iNodeFile to which
> >>>>>> the block currently being processed belongs. Redundant blocks can be
> >>>>>> processed by the Redundancy monitor while holding the directory tree
> >>>>>> locks. You can refer to HDFS-17505
> >>>>>> <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed
> >>>>>> information.
> >>>>>>
> >>>>>> *6. Performance of phase 1*
> >>>>>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used
> >>>>>> for the performance testing of phase 1, and I will complete it later.
> >>>>>>
> >>>>>>
> >>>>>> Discussing solutions through email is not efficient; you can create
> >>>>>> sub-tasks under HDFS-17366
> >>>>>> <https://issues.apache.org/jira/browse/HDFS-17366> to describe your
> >>>>>> concerns, and I will try to give some answers.
> >>>>>>
> >>>>>> Thanks @Ayush Saxena <ayush...@gmail.com>  and @Xiaoqiao He
> >>>>>> <hexiaoq...@apache.org> again.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> > Thanx everyone for chasing this. Great to see some momentum around
> >>>>>> > FGL; that should be a great improvement.
> >>>>>> >
> >>>>>> > I have comments in two broad categories:
> >>>>>> > ** About the process:*
> >>>>>> > I think in the above mails, there are mentions that phase one is
> >>>>>> > complete in a feature branch & we are gonna merge that to trunk. If
> >>>>>> > I am catching it right, then you can't hit the merge button like
> >>>>>> > that. To merge a feature branch, you need to call for a Vote
> >>>>>> > specific to that branch & it requires 3 binding votes to merge,
> >>>>>> > unlike any other code change, which requires 1. It is there in our
> >>>>>> > Bylaws.
> >>>>>> >
> >>>>>> > So, do follow the process.
> >>>>>> >
> >>>>>> > ** About the feature itself:* (A very quick look at the doc and
> the Jira,
> >>>>>> > so please take it with a grain of salt)
> >>>>>> > * The Google Drive link that you folks shared as part of the first
> >>>>>> > mail: I don't have access to it, so please open up the permissions
> >>>>>> > for that doc or share a new link.
> >>>>>> > * Chasing the design doc present on the Jira
> >>>>>> > * I think we only have Phase-1 ready, so can you share some
> metrics just
> >>>>>> > for that? Perf improvements just with splitting the FS & BM Locks
> >>>>>> > * The memory implications of Phase-1? I don't think there should
> be any
> >>>>>> > major impact on the memory in case of just phase-1
> >>>>>> > * Regarding the snapshot stuff, you mentioned taking a lock on the
> >>>>>> > root itself? Does just taking a lock on the snapshot root rather
> >>>>>> > than the FS root work?
> >>>>>> > * Secondly, about the usage of Snapshots or Symlinks: I don't
> >>>>>> > think we should operate under the assumption that they aren't
> >>>>>> > widely used; we might just not know the folks who use them widely,
> >>>>>> > or they may be users rather than contributors. We can accept for
> >>>>>> > now that in those cases it isn't optimised and we just lock the
> >>>>>> > entire FS space, which is what happens even today, so no
> >>>>>> > regressions there.
> >>>>>> > * Regarding memory usage: Do you have some numbers on how much
> the memory
> >>>>>> > footprint increases?
> >>>>>> > * Under the Lock Pool: I think you are assuming there would be
> >>>>>> > very few inodes where a lock would be required at any given time,
> >>>>>> > so there won't be too much heap consumption? I think you are
> >>>>>> > compromising on the Horizontal Scalability here. If your
> >>>>>> > assumption doesn't hold true, then under heavy read load from
> >>>>>> > concurrent clients accessing different inodes, the Namenode will
> >>>>>> > start having memory troubles, which would do more harm than good.
> >>>>>> > Anyway, the Namenode heap is a way bigger problem than anything
> >>>>>> > else, so we should be very careful about increasing load over
> >>>>>> > there.
> >>>>>> > * For the locks on the inodes: Do you plan to have locks for each
> >>>>>> > inode? Can we somehow limit that to the depth of the tree? Like,
> >>>>>> > currently we take a lock on the root; we could have a config which
> >>>>>> > makes us take locks at Level-2 or 3 (configurable). That might
> >>>>>> > fetch some perf benefits and could be used to control the memory
> >>>>>> > usage as well?
> >>>>>> > * What is the cost of creating these inode locks? If the lock
> >>>>>> > isn't already cached, it would incur some cost? Do you have some
> >>>>>> > numbers around that? Say I disable caching altogether & then let a
> >>>>>> > test load run; what do the perf numbers look like in that case?
> >>>>>> > * I think we need to limit the size of the INodeLockPool; we can't
> >>>>>> > let it grow infinitely under heavy load, and we need some
> >>>>>> > auto-throttling mechanism for it.
> >>>>>> > * I didn't catch your Storage Policy problem. If I decode it
> >>>>>> > right, the problem is that the policy could be set on an ancestor
> >>>>>> > node & the children abide by it. If that is the case, isn't that
> >>>>>> > also the case with ErasureCoding policies or even ACLs? Can you
> >>>>>> > elaborate a bit on that?
> >>>>>> >
> >>>>>> >
> >>>>>> > Anyway, regarding Phase-1: if you share the perf numbers with
> >>>>>> > proper details, plus the impact on memory (if any), for just phase
> >>>>>> > 1, & if they are good, then when you call for a branch merge vote
> >>>>>> > for Phase-1 FGL you have my vote; however, you'll need to sway the
> >>>>>> > rest of the folks on your own :-)
> >>>>>> >
> >>>>>> > Good Luck, Nice Work Guys!!!
> >>>>>> >
> >>>>>> > -Ayush
> >>>>>> >
> >>>>>> >
> >>>>>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org>
> wrote:
> >>>>>> >
> >>>>>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It
> >>>>>> >> will be a very helpful improvement for the HDFS module going
> >>>>>> >> forward.
> >>>>>> >>
> >>>>>> >> 1. If any more review bandwidth is needed, I would like to get
> >>>>>> >> involved and help review if possible.
> >>>>>> >> 2. The design document is still missing some detailed
> >>>>>> >> descriptions, such as for snapshots, symbolic links, and reserved
> >>>>>> >> paths, as mentioned above. I think it will be helpful for newbies
> >>>>>> >> who want to get involved if all corner cases are considered and
> >>>>>> >> described.
> >>>>>> >> 3. From Slack, I see we plan to check this into trunk at this
> >>>>>> >> phase. I am not sure if it is the proper time; following the dev
> >>>>>> >> plan in the design document, there are two steps left to finish
> >>>>>> >> this feature, right? If so, I think we should postpone checking in
> >>>>>> >> until all plans are ready. Considering that there have been many
> >>>>>> >> unfinished attempts at this feature in the past, postponing the
> >>>>>> >> check-in will be the safer way. On the other hand, keeping a
> >>>>>> >> separate dev branch will involve more rebase cost; however, I
> >>>>>> >> think that is not a difficult thing for you.
> >>>>>> >>
> >>>>>> >> Good luck, and I look forward to making this happen soon!
> >>>>>> >>
> >>>>>> >> Best Regards,
> >>>>>> >> - He Xiaoqiao
> >>>>>> >>
> >>>>>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com>
> wrote:
> >>>>>> >> >
> >>>>>> >> > Thanks for the interest and advice on this.
> >>>>>> >> >
> >>>>>> >> > Just would like to share some info here
> >>>>>> >> >
> >>>>>> >> > ZanderXu leads this feature and he has spent a lot of time on
> >>>>>> >> > it. He is the main developer in stage 1. Yuanboliu and
> >>>>>> >> > Kokonguyen191 also took on some tasks. Other developers
> >>>>>> >> > (slfan1989, haiyang1987, huangzhaobo99, RocMarshal,
> >>>>>> >> > kokonguyen191) helped review PRs. (Forgive me if I missed
> >>>>>> >> > someone.)
> >>>>>> >> >
> >>>>>> >> > Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
> >>>>>> >> familiar with this feature. We discussed many details offline.
> >>>>>> >> >
> >>>>>> >> > More people interested in joining the development and review of
> >>>>>> >> > stages 2 and 3 are welcome.
> >>>>>> >> >
> >>>>>> >> >
> >>>>>> >> > Zengqiang XU <xuzengqiang5...@gmail.com> wrote on Fri, Apr 26,
> >>>>>> >> > 2024 at 14:56:
> >>>>>> >> >>
> >>>>>> >> >> Thanks Shilun for your response:
> >>>>>> >> >>
> >>>>>> >> >> 1. This is a big and very useful feature, so it really needs
> more
> >>>>>> >> >> developers to get on board.
> >>>>>> >> >> 2. This fine-grained lock has been implemented on internal
> >>>>>> >> >> branches and has brought benefits to many companies, such as
> >>>>>> >> >> Meituan, Kuaishou, and Bytedance. But it has not been
> >>>>>> >> >> contributed to the community for various reasons: there is a
> >>>>>> >> >> big difference between the internal branches and the community
> >>>>>> >> >> trunk branch, the internal branches may drop some functionality
> >>>>>> >> >> to keep FGL clear, and the contribution needs a lot of work and
> >>>>>> >> >> will take a long time. It means that this solution has already
> >>>>>> >> >> been practiced in their prod environments. We have also
> >>>>>> >> >> practiced it in our prod environment and gained benefits, and
> >>>>>> >> >> we are willing to spend a lot of time contributing it to the
> >>>>>> >> >> community.
> >>>>>> >> >> 3. Regarding the benchmark testing, we don't need to focus too
> >>>>>> >> >> much on whether the performance improves by 5, 10, or 20
> >>>>>> >> >> times, because too many factors affect it.
> >>>>>> >> >> 4. As I described above, this solution is already being
> >>>>>> >> >> practiced by many companies. Right now, we just need to think
> >>>>>> >> >> about how to implement it with high quality and more
> >>>>>> >> >> comprehensively.
> >>>>>> >> >> 5. I firmly believe that all problems can be solved as long
> as the
> >>>>>> >> overall
> >>>>>> >> >> solution is right.
> >>>>>> >> >> 6. I can spend a lot of time leading the promotion of this
> entire
> >>>>>> >> feature
> >>>>>> >> >> and I hope more people can join us in promoting it.
> >>>>>> >> >> 7. You are always welcome to raise your concerns.
> >>>>>> >> >>
> >>>>>> >> >>
> >>>>>> >> >> Thanks again, Shilun. I hope you can help review the designs
> >>>>>> >> >> and PRs.
> >>>>>> >> >>
> >>>>>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org>
> wrote:
> >>>>>> >> >>
> >>>>>> >> >> > Thank you for your hard work! This is a very meaningful
> improvement,
> >>>>>> >> and
> >>>>>> >> >> > from the design document, we can see a significant increase
> in HDFS
> >>>>>> >> >> > read/write throughput.
> >>>>>> >> >> >
> >>>>>> >> >> > I am happy to see the progress made on HDFS-17384.
> >>>>>> >> >> >
> >>>>>> >> >> > However, I still have some concerns, which roughly involve
> the
> >>>>>> >> following
> >>>>>> >> >> > aspects:
> >>>>>> >> >> >
> >>>>>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS
> >>>>>> >> >> > and are familiar with the related development details, we
> >>>>>> >> >> > still need more community members to review the code to
> >>>>>> >> >> > ensure that the relevant upgrades meet expectations.
> >>>>>> >> >> >
> >>>>>> >> >> > 2. We need more details on the benchmarks to ensure that
> >>>>>> >> >> > test results can be reproduced and to allow more community
> >>>>>> >> >> > members to participate in the testing process.
> >>>>>> >> >> >
> >>>>>> >> >> > Looking forward to everything going smoothly in the future.
> >>>>>> >> >> >
> >>>>>> >> >> > Best Regards,
> >>>>>> >> >> > - Shilun Fan.
> >>>>>> >> >> >
> >>>>>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <
> hexiaoq...@apache.org>
> >>>>>> >> wrote:
> >>>>>> >> >> >
> >>>>>> >> >> >> cc private@h.a.o.
> >>>>>> >> >> >>
> >>>>>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <
> zande...@apache.org>
> >>>>>> >> wrote:
> >>>>>> >> >> >> >
> >>>>>> >> >> >> > Here are some summaries about the first phase:
> >>>>>> >> >> >> > 1. There are no big changes in this phase
> >>>>>> >> >> >> > 2. This phase just uses FS lock and BM lock to replace
> the
> >>>>>> >> original
> >>>>>> >> >> >> global
> >>>>>> >> >> >> > lock
> >>>>>> >> >> >> > 3. It's useful to improve the performance, since some
> operations
> >>>>>> >> just
> >>>>>> >> >> >> need
> >>>>>> >> >> >> > to hold FS lock or BM lock instead of the global lock
> >>>>>> >> >> >> > 4. This feature is turned off by default, you can enable
> it by
> >>>>>> >> setting
> >>>>>> >> >> >> > dfs.namenode.lock.model.provider.class to
> >>>>>> >> >> >> >
> >>>>>> >>
> org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
> >>>>>> >> >> >> > 5. This phase is very important for the ongoing development
> of the
> >>>>>> >> entire
> >>>>>> >> >> >> FGL
> >>>>>> >> >> >> >
> >>>>>> >> >> >> > Here I would like to express my special thanks to
> @kokonguyen191
> >>>>>> >> and
> >>>>>> >> >> >> > @yuanboliu for their contributions.  And you are also
> welcome to
> >>>>>> >> join us
> >>>>>> >> >> >> > and complete it together.
> >>>>>> >> >> >> >
> >>>>>> >> >> >> >
> >>>>>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <
> zande...@apache.org>
> >>>>>> >> wrote:
> >>>>>> >> >> >> >
> >>>>>> >> >> >> > > Hi everyone
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > All subtasks of the first phase of the FGL have been
> completed
> >>>>>> >> and I
> >>>>>> >> >> >> plan
> >>>>>> >> >> >> > > to merge them into the trunk and start the second
> phase based
> >>>>>> >> on the
> >>>>>> >> >> >> trunk.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > Here is the PR that used to merge the first phases
> into trunk:
> >>>>>> >> >> >> > > https://github.com/apache/hadoop/pull/6762
> >>>>>> >> >> >> > > Here is the ticket:
> >>>>>> >> https://issues.apache.org/jira/browse/HDFS-17384
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > I hope you can help review this PR when you are
> >>>>>> >> >> >> > > available and share some ideas.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > HDFS-17385
> >>>>>> >> >> >> > > <https://issues.apache.org/jira/browse/HDFS-17385> is
> >>>>>> >> >> >> > > used for the second phase, and I have created some
> >>>>>> >> >> >> > > subtasks to describe solutions for some problems, such
> >>>>>> >> >> >> > > as snapshot, getListing, and quota.
> >>>>>> >> >> >> > > You are welcome to join us and complete it together.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > ---------- Forwarded message ---------
> >>>>>> >> >> >> > > From: Zengqiang XU <zande...@apache.org>
> >>>>>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
> >>>>>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
> >>>>>> >> >> >> > > To: <hdfs-dev@hadoop.apache.org>
> >>>>>> >> >> >> > > Cc: Zengqiang XU <xuzengqiang5...@gmail.com>
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > Hi everyone
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > I have started a discussion about NameNode Fine-grained
> >>>>>> >> >> >> > > Locking to improve the performance of write operations
> >>>>>> >> >> >> > > in the NameNode.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > I started this discussion again for several main
> reasons:
> >>>>>> >> >> >> > > 1. We have implemented it and gained nearly 7x
> >>>>>> >> >> >> > > performance improvement in our prod environment
> >>>>>> >> >> >> > > 2. Many other companies have made similar improvements
> >>>>>> >> >> >> > > based on their internal branches.
> >>>>>> >> >> >> > > 3. This topic has been discussed for a long time, but
> >>>>>> >> >> >> > > still without any results.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > I hope we can push this important improvement forward
> >>>>>> >> >> >> > > in the community so that all end-users can enjoy it.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > I'd really appreciate it if you could join in and work
> >>>>>> >> >> >> > > with me to push this feature forward.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > Thanks very much.
> >>>>>> >> >> >> > >
> >>>>>> >> >> >> > > Ticket: HDFS-17366 <
> >>>>>> >> https://issues.apache.org/jira/browse/HDFS-17366>
> >>>>>> >> >> >> > > Design: NameNode Fine-grained locking based on
> directory tree
> >>>>>> >> >> >> > > <
> >>>>>> >> >> >>
> >>>>>> >>
> https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing
> >>>>>> >> >> >> >
> >>>>>> >> >> >> > >
> >>>>>> >> >> >>
> >>>>>> >> >> >>
> >>>>>> >>
> >>>>>> >> >> >>
> >>>>>> >> >> >>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >>
>
>
>
