Thanks Zander for bringing this discussion up again and doing your best to
push it forward. It's been a long time since the last discussion.

It’s indeed time. +1 for merging the phase 1 code, based on the following points:
 - The phase 1 feature has been running at scale within companies for a
long time.
 - The long-term plan is clear, and it addresses questions raised by the
community.
 - Memory and performance testing results are available for the planned features.

ZanderXu <zande...@apache.org> wrote on Tue, Dec 31, 2024 at 15:36:

> Hi, everyone:
> Time to Merge FGL Phase I
>
> The PR for *FGL Phase I* is ready for merging! Please take a moment to
> review and cast your vote: https://github.com/apache/hadoop/pull/6762.
>
> The *FGL Phase I* has been running successfully in production for over
> six months at *Shopee* and *BOSS Zhipin*, with no reported performance or
> stability issues. It’s now the right time to merge it into the trunk
> branch, allowing us to move forward with Phase II.
>
> The global lock remains the default lock mode, but users can enable FGL by
> configuring
> dfs.namenode.lock.model.provider.class=org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.
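
For reference, the setting above would go in hdfs-site.xml like this (a sketch; the property name and value are copied verbatim from this thread):

```xml
<!-- hdfs-site.xml: opt in to FGL; the global lock remains the default -->
<property>
  <name>dfs.namenode.lock.model.provider.class</name>
  <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
</property>
```
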
>
> If there are no objections within 7 days, I will propose an official vote.
> Performance and Memory Usage of Phase I
>
> Conclusion:
>
>    1. Fine-grained locks do not lead to significant performance improvements.
>    2. Fine-grained locks do not result in additional memory consumption.
>
> Reasons:
>
>    - *BM operations heavily depend on FS operations*: IBR and BR still
>      acquire the global lock (FSLock and BMLock).
>    - *FS operations depend on BM operations*: common operations (create,
>      addBlock, getBlockLocations) also acquire the global lock (FSLock and
>      BMLock).
>
> Phase II will bring significant performance improvements by decoupling FS
> and BM dependencies and replacing the global FSLock with a fine-grained
> IIPLock.
>
> Addressing Common Questions
>
> Thank you all for raising meaningful questions!
>
> I have rewritten the design document to improve clarity.
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?usp=sharing
>
> Below is a summary of frequently asked questions and answers:
> Summary of Questions:
>
> *Question 1: How is the performance of LockPoolManager?*
>
>    - *Performance Report*:
>       - Time to acquire a cached lock: 194 ns
>       - Time to acquire a non-cached lock: 1044 ns
>       - Time to release an in-use lock: 88 ns
>       - Time to release an unused lock: 112 ns
>    - *Overall Performance*:
>       - *QPS*: over 10 million
>       - Time to acquire the IIP lock for a path with depth 10:
>          - Fully uncached: 10440 ns + 1120 ns (≈ 11 μs)
>          - Fully cached: 1940 ns + 1120 ns (≈ 3 μs)
>       - In *global lock scenarios*, lock wait times are typically in the
>         millisecond range. Therefore, the cost of acquiring and releasing
>         fine-grained locks is negligible.
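
To make the IIP lock numbers above concrete, here is a minimal sketch of the idea (class and method names are my own illustration, not the actual HDFS-17366 code): read-lock each ancestor along the path and write-lock only the final inode, one lock object per path component, so an operation on /a/b does not block one on /a/c.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of IIP-style path locking: read-lock every ancestor
// component and write-lock only the last component of the path.
public class IipLockSketch {
    private final Map<String, ReentrantReadWriteLock> pool = new HashMap<>();

    private ReentrantReadWriteLock lockFor(String inodePath) {
        return pool.computeIfAbsent(inodePath, k -> new ReentrantReadWriteLock());
    }

    /** Locks all components of an absolute path; returns them for unlocking. */
    public List<ReentrantReadWriteLock> lockForWrite(String path) {
        String[] parts = path.split("/");          // "", "a", "b", "c" for /a/b/c
        List<ReentrantReadWriteLock> held = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int i = 1; i < parts.length; i++) {
            prefix.append('/').append(parts[i]);
            ReentrantReadWriteLock l = lockFor(prefix.toString());
            if (i == parts.length - 1) l.writeLock().lock();   // target inode
            else                       l.readLock().lock();    // ancestor
            held.add(l);
        }
        return held;
    }

    /** Releases in reverse acquisition order. */
    public void unlock(List<ReentrantReadWriteLock> held) {
        for (int i = held.size() - 1; i >= 0; i--) {
            ReentrantReadWriteLock l = held.get(i);
            if (i == held.size() - 1) l.writeLock().unlock();
            else                      l.readLock().unlock();
        }
    }

    public int poolSize() { return pool.size(); }
}
```

With a depth-10 path this performs ten lock acquisitions per operation, which matches the ~10 × 1 μs uncached figure quoted above.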
>
> *Question 2: How much memory does the FGL consume?*
>
>    - *Memory Consumption*:
>       - A single LockResource contains a read-write lock and a counter,
>         totaling approximately 200 bytes:
>          - LockResource: 24 bytes
>          - ReentrantReadWriteLock: 150 bytes
>          - AtomicInteger: 16 bytes
>    - *Memory Usage Estimates*:
>       - 10-level directory depth, 100 handlers:
>          - 1,000 lock resources, approximately 200 KB
>       - 10-level directory depth, 1000 handlers:
>          - 10,000 lock resources, approximately 2 MB
>       - 1,000,000 lock resources: approximately 200 MB
>
> *Conclusion*: Memory consumption is negligible.
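
As a back-of-envelope check of the estimates above (the per-object sizes are the figures quoted in this mail, not something I measured):

```java
// Sanity-check of the memory estimates quoted in this thread:
// LockResource 24 B + ReentrantReadWriteLock 150 B + AtomicInteger 16 B
// ≈ 190 B, rounded in the mail to ~200 B per lock resource.
public class FglMemoryEstimate {
    static final long BYTES_PER_LOCK_RESOURCE = 24 + 150 + 16; // ~190 B

    static long estimateBytes(long lockResources) {
        return lockResources * BYTES_PER_LOCK_RESOURCE;
    }

    public static void main(String[] args) {
        // depth 10 x 100 handlers  -> ~1,000 locks  -> ~200 KB
        // depth 10 x 1000 handlers -> ~10,000 locks -> ~2 MB
        // 1,000,000 locks          -> ~200 MB
        System.out.println(estimateBytes(1_000) + " bytes for 1k locks");
        System.out.println(estimateBytes(10_000) + " bytes for 10k locks");
        System.out.println(estimateBytes(1_000_000) + " bytes for 1M locks");
    }
}
```

Even the worst case (one million live lock resources) stays well under typical NameNode heap headroom.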
> *Question 3: What happens if no lock is available in the LockPoolManager?*
>
> If no LockResources are available, there are two possible solutions:
>
>    1. Return a *RetryException*, prompting the client to retry later.
>    2. Temporarily increase the lock entity limit, allocate more locks to
>       meet client requests, and use an asynchronous thread to recycle locks
>       periodically.
>
> We can provide multiple LockPoolManager implementations for users to
> choose from based on production environments.
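
A minimal sketch of solution 1 above, assuming a capacity-bounded pool (the class, method, and exception names here are illustrative, not the actual LockPoolManager API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical bounded lock pool: when no lock entities are left, fail fast
// with a retriable exception instead of blocking the handler thread.
public class BoundedLockPool {
    public static class RetryException extends RuntimeException {
        RetryException(String msg) { super(msg); }
    }

    private final Semaphore capacity;

    public BoundedLockPool(int maxLocks) {
        this.capacity = new Semaphore(maxLocks);
    }

    /** Hands out a lock, or asks the caller to retry later if exhausted. */
    public ReentrantReadWriteLock acquire() {
        if (!capacity.tryAcquire()) {
            throw new RetryException("lock pool exhausted, retry later");
        }
        return new ReentrantReadWriteLock();
    }

    public void release(ReentrantReadWriteLock lock) {
        capacity.release();
    }
}
```

Solution 2 would instead grow the capacity temporarily and let a background thread shrink it back once idle locks have been recycled.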
> *Question 4: Regarding the IIPLock lock depth issue, can we consider
> holding only the first 3 or 4 levels of directory locks?*
>
> This approach is not recommended for the following reasons:
>
>    1. *Cannot maximize concurrency*.
>    2. *Limited savings in lock acquisition/release time and memory usage*,
>       yielding insignificant benefits.
>
> *Question 5: How should attributes like StoragePolicy, ErasureCoding, and
> ACL, which can be set on parent or ancestor directory nodes, be handled?*
>
>    - *ErasureCoding and ACL*:
>       - When changing a node's attributes, hold the corresponding INode’s
>         write lock.
>       - When using an ancestor node's attributes, hold the corresponding
>         INode’s read lock.
>    - *StoragePolicy*:
>       - More complex, due to its impact on both directory tree operations
>         and Block operations.
>       - To improve performance, commonly used block-related operations
>         (such as BR/IBR) should not acquire the IIPLock.
>       - Detailed design documentation:
>         https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.96lztsl4mwfk
>
> *Question 6: How should FGL be implemented for the SNAPSHOT feature?*
>
>    - Since the Rename operation is supported on SNAPSHOT directories,
>      holding only the write lock of the SNAPSHOT root directory cannot
>      cover the rename case, so the thread safety of SNAPSHOT-related
>      operations cannot be guaranteed.
>    - It is recommended to use the *global FS lock* to ensure thread safety.
>    - Detailed design documentation:
>      https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.sm36p6bfcpec
>
> *Question 7: How should FGL be implemented for the Symlinks feature?*
>
>    - The Target path of a Symlink is a string, and the client performs a
>      second forward access to the Target path, so the fine-grained lock
>      project requires no special handling.
>    - For the createSymlink RPC, the FGL needs to acquire the IIPLocks for
>      both the target and link paths.
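
One way to take locks on both paths of createSymlink without risking deadlock between concurrent callers is to always acquire the two path locks in a fixed order; the sketch below uses lexicographic path order (a hypothetical helper, not the actual HDFS implementation):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical lock-ordering helper for two-path operations such as
// createSymlink: always acquire in canonical (lexicographic) path order so
// two concurrent callers can never hold one lock each and wait forever.
public class SymlinkLockOrder {
    /** Write-locks both path locks in canonical path order. */
    static void lockBoth(String targetPath, ReentrantReadWriteLock targetLock,
                         String linkPath, ReentrantReadWriteLock linkLock) {
        if (targetPath.compareTo(linkPath) <= 0) {
            targetLock.writeLock().lock();
            linkLock.writeLock().lock();
        } else {
            linkLock.writeLock().lock();
            targetLock.writeLock().lock();
        }
    }

    static void unlockBoth(ReentrantReadWriteLock a, ReentrantReadWriteLock b) {
        a.writeLock().unlock();
        b.writeLock().unlock();
    }
}
```
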
>
> *Question 8: How should FGL be implemented for the reserved feature?*
>
> The Reserved feature has two usage modes:
>
>    1. /.reserved/iNodes/${inode id}
>    2. /.reserved/raw/${path}
>
>    - *INodeId Mode*: during the resolvePath phase, obtain the real IIPLock
>      via the INodeId.
>    - *Path Mode*: during the resolvePath phase, obtain the real IIPLock
>      via the path.
>    - Detailed design documentation:
>      https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.h6rcpzkbpanf
>
> *Question 9: Why is INodeFileLock used as the FGL for BlockInfo?*
>
> INodeFile and Block have mutual dependencies:
>
>    - *INodeFile depends on Block* for state and size.
>    - *Block depends on INodeFile* for state and storage policy.
>
> Therefore, using INodeFileLock as the fine-grained lock for BlockInfo is a
> reasonable choice.
>
> Detailed design documentation:
> https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.zesd6omuu3kr
>
> Seeking Community Feedback
>
> Your questions and concerns are always welcome.
>
> We can discuss them in detail on the Slack Channel:
> https://app.slack.com/client/T4S1WH2J3/C06UDTBQ2SH
>
> Let’s work together to advance the Fine-Grained Lock project. I believe
> this initiative will deliver significant performance improvements to the
> HDFS community and help reinvigorate its activity.
>
> Wishing everyone a Happy New Year 2025!
>
> On Wed, 5 Jun 2024 at 16:17, ZanderXu <zande...@apache.org> wrote:
>
>> I plan to hold a meeting on 2024-06-06 from 3:00 PM - 4:00 PM to share
>> the FGL's motivations and some concerns in detail in Chinese.
>>
>> The doc is : NameNode Fine-Grained Locking Based On Directory Tree (II)
>> <https://docs.google.com/document/d/1QGLM67u6tWjj00gOWYqgxHqghb43g4dmH8QcUZtSrYE/edit?usp=sharing>
>>
>> The meeting URL is: https://sea.zoom.us/j/94168001269
>>
>> You are welcome to this meeting.
>>
>> On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:
>>
>>> BTW, there is a Slack channel hdfs-fgl for this feature. You can join it
>>> and discuss more details.
>>>
>>> Is it necessary to hold a meeting to discuss this, so that we can push
>>> it forward quickly? Agreed with ZanderXu that it seems inefficient to
>>> discuss details via the mailing list.
>>>
>>>
>>> Hui Fei <feihui.u...@gmail.com> wrote on Mon, May 6, 2024 at 23:50:
>>>
>>>> Thanks all
>>>>
>>>> It seems all concerns are related to stage 2. We can address these and
>>>> make them clearer before we start it.
>>>>
>>>> From development experience, I think it is reasonable to split a big
>>>> feature into several stages. Stage 1 is also independent and can stand
>>>> on its own as a minor feature that uses FS and BM locks instead of the
>>>> global lock.
>>>>
>>>>
>>>> ZanderXu <zande...@apache.org> wrote on Mon, Apr 29, 2024 at 15:17:
>>>>
>>>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He
>>>>> <hexiaoq...@apache.org> for your nice questions.
>>>>>
>>>>> Let me summarize your concerns and corresponding solutions:
>>>>>
>>>>> *1. Questions about the Snapshot feature*
>>>>> It's difficult to apply the FGL to the Snapshot feature, but we can
>>>>> simply use the global FS write lock to make it thread safe. So if we
>>>>> can identify whether a path involves the snapshot feature, we can just
>>>>> use the global FS write lock to protect it.
>>>>>
>>>>> You can refer to HDFS-17479
>>>>> <https://issues.apache.org/jira/browse/HDFS-17479> to see how to
>>>>> identify it.
>>>>>
>>>>> Regarding the performance of snapshot-related operations, we can
>>>>> discuss them in two categories:
>>>>>
>>>>> Read operations involving snapshots: the FGL branch uses the global
>>>>> write lock to protect them, while the GLOBAL branch uses the global
>>>>> read lock. It's hard to conclude which version has better performance;
>>>>> it depends on global lock contention.
>>>>>
>>>>> Write operations involving snapshots: both the FGL and GLOBAL branches
>>>>> use the global write lock to protect them. Again, it's hard to
>>>>> conclude which version has better performance; it depends on global
>>>>> lock contention too.
>>>>>
>>>>> So I think if NameNode load is low, the GLOBAL branch will perform
>>>>> better than FGL; if NameNode load is high, the FGL branch may perform
>>>>> better than GLOBAL, which also depends on the ratio of read and write
>>>>> operations on the SNAPSHOT feature.
>>>>>
>>>>> We can do some things to let end-users choose the branch that better
>>>>> matches their business:
>>>>> First, make the lock mode selectable, so that end-users can choose
>>>>> either FGL or GLOBAL.
>>>>> Second, use the global write lock to make snapshot-related operations
>>>>> thread safe, as I described in HDFS-17479.
>>>>>
>>>>>
>>>>> *2. Questions about the Symlinks feature*
>>>>> If a Symlink is related to a snapshot, we can refer to the snapshot
>>>>> solution; if not, I think it's easy to fit into the FGL.
>>>>> Only createSymlink involves two paths; the FGL just needs to lock them
>>>>> in a fixed order to make the operation thread safe. For other
>>>>> operations, a symlink is the same as any other normal iNode, right?
>>>>>
>>>>> If I missed difficult points, please let me know.
>>>>>
>>>>>
>>>>> *3. Questions about Memory Usage of iNode locks*
>>>>> I think there are many ways to limit the memory usage of these iNode
>>>>> locks, such as using a limited-capacity lock pool to bound the maximum
>>>>> memory usage, or only holding iNode locks for a fixed depth of
>>>>> directories, etc.
>>>>>
>>>>> We can abstract this LockManager first and then support
>>>>> implementations with different ideas, so that we can limit the maximum
>>>>> memory usage of these iNode locks.
>>>>> FGL can acquire or release iNode locks through the LockManager.
>>>>>
>>>>>
>>>>> *4. Questions about the performance of acquiring and releasing iNode locks*
>>>>> We can add some benchmarks for the LockManager to test the performance
>>>>> of acquiring and releasing uncontended locks.
>>>>>
>>>>>
>>>>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>>>>> These policies may be set on an ancestor node and used by some child
>>>>> files. The set operations for these policies will be protected by the
>>>>> directory tree, since they are all file-related operations. Apart from
>>>>> Quota and StoragePolicy, the use of the other policies will also be
>>>>> protected by the directory tree, such as ECPolicy and ACL.
>>>>>
>>>>> Quota is a little special since its update operations may not be
>>>>> protected by the directory tree; we can assign a lock to each
>>>>> QuotaFeature and use these locks to make update operations thread
>>>>> safe. You can refer to HDFS-17473
>>>>> <https://issues.apache.org/jira/browse/HDFS-17473> for some detailed
>>>>> information.
>>>>>
>>>>> StoragePolicy is a little special since it is used not only by
>>>>> file-related operations but also by block-related operations.
>>>>> ProcessExtraRedundancyBlock uses the storage policy to choose
>>>>> redundant replicas, and BlockReconstructionWork uses the storage
>>>>> policy to choose target DNs. In order to maximize the performance
>>>>> improvement, BR and IBR should only involve the iNodeFile to which the
>>>>> block being processed belongs. These redundant blocks can be processed
>>>>> by the Redundancy monitor while holding the directory tree locks. You
>>>>> can refer to HDFS-17505
>>>>> <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed
>>>>> information.
>>>>>
>>>>> *6. Performance of the phase 1*
>>>>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used
>>>>> to do
>>>>> some performance testing for phase 1, and I will complete it later.
>>>>>
>>>>>
>>>>> Discussing solutions through email is not efficient; you can create
>>>>> sub-tasks under HDFS-17366
>>>>> <https://issues.apache.org/jira/browse/HDFS-17366> to describe your
>>>>> concerns, and I will try to give some answers.
>>>>>
>>>>> Thanks @Ayush Saxena <ayush...@gmail.com>  and @Xiaoqiao He
>>>>> <hexiaoq...@apache.org> again.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com> wrote:
>>>>>
>>>>> > Thanx Everyone for chasing this, Great to see some momentum around
>>>>> FGL,
>>>>> > that should be a great improvement.
>>>>> >
>>>>> > I have two broad categories of comments:
>>>>> > ** About the process:*
>>>>> > I think in the above mails, there are mentions that phase one is
>>>>> complete
>>>>> > in a feature branch & we are gonna merge that to trunk. If I am
>>>>> catching it
>>>>> > right, then you can't hit the merge button like that. To merge a
>>>>> feature
>>>>> > branch, you need to call for a Vote specific to that branch & it
>>>>> requires 3
>>>>> > binding votes to merge, unlike any other code change which requires
>>>>> 1. It
>>>>> > is there in our Bylaws.
>>>>> >
>>>>> > So, do follow the process.
>>>>> >
>>>>> > ** About the feature itself:* (A very quick look at the doc and the
>>>>> Jira,
>>>>> > so please take it with a grain of salt)
>>>>> > * The Google Drive link that you folks shared as part of the first
>>>>> mail. I
>>>>> > don't have access to that. So, please open up the permissions for
>>>>> that doc
>>>>> > or share the new link
>>>>> > * Chasing the design doc present on the Jira
>>>>> > * I think we only have Phase-1 ready, so can you share some metrics
>>>>> just
>>>>> > for that? Perf improvements just with splitting the FS & BM Locks
>>>>> > * The memory implications of Phase-1? I don't think there should be
>>>>> any
>>>>> > major impact on the memory in case of just phase-1
>>>>> > * Regarding the snapshot stuff, you mentioned taking lock on the root
>>>>> > itself? Does just taking lock on the snapshot root rather than the
>>>>> FS root
>>>>> work?
>>>>> > * Secondly about the usage of Snapshot or Symlinks, I don't think we
>>>>> > should operate under the assumptions that they aren't widely used or
>>>>> not,
>>>>> > we might just not know folks who don't use it widely or they are
>>>>> just users
>>>>> > not the ones contributing. We can just accept for now, that in those
>>>>> cases
>>>>> > it isn't optimised and we just lock the entire FS space, which it
>>>>> does even
>>>>> > today, so no regressions there.
>>>>> > * Regarding memory usage: Do you have some numbers on how much the
>>>>> memory
>>>>> > footprint increases?
>>>>> > * Under the Lock Pool: I think you are assuming there would be very
>>>>> few
>>>>> > inodes where lock would be required at any given time, so there
>>>>> won't be
>>>>> > too much heap consumption? I think you are compromising on the
>>>>> Horizontal
>>>>> > Scalability here. If your assumption doesn't hold true, then under
>>>>> heavy
>>>>> > read load by concurrent clients accessing different inodes, the
>>>>> Namenode
>>>>> > will start giving memory trouble, which would do more harm than good.
>>>>> > Anyway Namenode heap is way bigger problem than anything, so we
>>>>> should be
>>>>> > very careful increasing load over there.
>>>>> > * For the Locks on the inodes: Do you plan to have locks for each
>>>>> inode?
>>>>> > Can we somehow limit that to the depth of the tree? Like currently
>>>>> we take
>>>>> > lock on the root, have a config which makes us take lock at Level-2
>>>>> or 3
>>>>> > (configurable), that might fetch some perf benefits and can be used
>>>>> to
>>>>> > control the memory usage as well?
>>>>> > * What is the cost of creating these inode locks? If the lock isn't
>>>>> > already cached it would incur some cost? Do you have some numbers
>>>>> around
>>>>> > that? Say I disable caching altogether & then let a test load run,
>>>>> what
>>>>> > do the perf numbers look like in that case?
>>>>> > * I think we need to limit the size of INodeLockPool, we can't let
>>>>> it grow
>>>>> > infinitely in case of heavy loads and we need to have some auto
>>>>> > throttling mechanism for it
>>>>> > * I didn't catch your Storage Policy problem. If I decode it right,
>>>>> the
>>>>> > problem is like the policy could be set on an ancestor node & the
>>>>> children
>>>>> > abide by that & this is the problem, if that is the case then isn't
>>>>> that
>>>>> > the case with ErasureCoding policies or even ACLs or so? Can you
>>>>> elaborate
>>>>> > a bit on that.
>>>>> >
>>>>> >
>>>>> > Anyway, regarding the Phase-1. If you share (the perf numbers with
>>>>> proper
>>>>> > details + Impact on memory if any) for just phase 1 & if they are
>>>>> good,
>>>>> > then if you call for a branch merge vote for Phase-1 FGL, you have
>>>>> my vote,
>>>>> > however you'll need to sway the rest of the folks on your own :-)
>>>>> >
>>>>> > Good Luck, Nice Work Guys!!!
>>>>> >
>>>>> > -Ayush
>>>>> >
>>>>> >
>>>>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will
>>>>> be
>>>>> >> a very helpful improvement for the HDFS module going forward.
>>>>> >>
>>>>> >> 1. If we need any more review bandwidth, I would like to be involved
>>>>> >> to help review if possible.
>>>>> >> 2. The design document is still missing some detailed descriptions,
>>>>> >> such as snapshots, symbolic links, and reserved paths, as mentioned
>>>>> >> above. I think it will be helpful for newbies who want to get
>>>>> >> involved if all corner cases are considered and described.
>>>>> >> 3. From Slack, I see we plan to check this phase into trunk. I am
>>>>> >> not sure if it is the proper time. Following the dev plan, there
>>>>> >> are two steps left to finish this feature from the design document,
>>>>> >> right? If so, I think we should postpone checking in until all
>>>>> >> plans are ready. Considering that there have been many unfinished
>>>>> >> attempts at this feature in history, I think postponing the
>>>>> >> check-in is the safe way. On the other hand, it will involve more
>>>>> >> rebase cost if you keep a separate dev branch; however, I think
>>>>> >> that is not a difficult thing for you.
>>>>> >>
>>>>> >> Good luck and look forward to making that happen soon!
>>>>> >>
>>>>> >> Best Regards,
>>>>> >> - He Xiaoqiao
>>>>> >>
>>>>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com>
>>>>> wrote:
>>>>> >> >
>>>>> >> > Thanks for interest and advice on this.
>>>>> >> >
>>>>> >> > Just would like to share some info here
>>>>> >> >
>>>>> >> > ZanderXu leads this feature and has spent a lot of time on it. He
>>>>> >> is the main developer in stage 1. Yuanboliu and Kokonguyen191 also
>>>>> >> took some tasks. Other developers (slfan1989, haiyang1987,
>>>>> >> huangzhaobo99, RocMarshal, kokonguyen191) helped review PRs.
>>>>> >> (Forgive me if I missed someone.)
>>>>> >> >
>>>>> >> > Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
>>>>> >> familiar with this feature. We discussed many details offline.
>>>>> >> >
>>>>> >> > Everyone interested in joining the development and review of
>>>>> >> stages 2 and 3 is welcome.
>>>>> >> >
>>>>> >> >
>>>>> >> > Zengqiang XU <xuzengqiang5...@gmail.com> wrote on Fri, Apr 26, 2024 at 14:56:
>>>>> >> >>
>>>>> >> >> Thanks Shilun for your response:
>>>>> >> >>
>>>>> >> >> 1. This is a big and very useful feature, so it really needs more
>>>>> >> >> developers to get on board.
>>>>> >> >> 2. This fine-grained lock has been implemented on internal
>>>>> >> >> branches and has brought benefits to many companies, such as
>>>>> >> >> Meituan, Kuaishou, and Bytedance. But it has not been contributed
>>>>> >> >> to the community for various reasons: there is a big difference
>>>>> >> >> between the internal branches and the community trunk branch, an
>>>>> >> >> internal branch may ignore some functions to keep the FGL clear,
>>>>> >> >> and the contribution needs a lot of work and will take a long
>>>>> >> >> time. This means the solution has already been practiced in
>>>>> >> >> their prod environments. We have also practiced it in our prod
>>>>> >> >> environment and gained benefits, and we are willing to spend a
>>>>> >> >> lot of time contributing it to the community.
>>>>> >> >> 3. Regarding the benchmark testing, we don't need to pay more
>>>>> >> attention to
>>>>> >> >> whether the performance is improved by 5 times, 10 times or 20
>>>>> times,
>>>>> >> >> because there are too many factors that affect it.
>>>>> >> >> 4. As I described above, this solution is already being
>>>>> >> >> practiced by many companies. Right now, we just need to think
>>>>> >> >> about how to implement it with high quality and more
>>>>> >> >> comprehensively.
>>>>> >> >> 5. I firmly believe that all problems can be solved as long as
>>>>> the
>>>>> >> overall
>>>>> >> >> solution is right.
>>>>> >> >> 6. I can spend a lot of time leading the promotion of this entire
>>>>> >> feature
>>>>> >> >> and I hope more people can join us in promoting it.
>>>>> >> >> 7. You are always welcome to raise your concerns.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Thanks Shilun again, I hope you can help review designs and PRs.
>>>>> Thanks
>>>>> >> >>
>>>>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> > Thank you for your hard work! This is a very meaningful
>>>>> improvement,
>>>>> >> and
>>>>> >> >> > from the design document, we can see a significant increase in
>>>>> HDFS
>>>>> >> >> > read/write throughput.
>>>>> >> >> >
>>>>> >> >> > I am happy to see the progress made on HDFS-17384.
>>>>> >> >> >
>>>>> >> >> > However, I still have some concerns, which roughly involve the
>>>>> >> following
>>>>> >> >> > aspects:
>>>>> >> >> >
>>>>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and
>>>>> are
>>>>> >> familiar
>>>>> >> >> > with related development details, we still need more community
>>>>> >> members to
>>>>> >> >> > review the code to ensure that the relevant upgrades meet
>>>>> >> expectations.
>>>>> >> >> >
>>>>> >> >> > 2. We need more details on benchmarks to ensure that test
>>>>> results
>>>>> >> can be
>>>>> >> >> > reproduced and to allow more community members to participate
>>>>> in the
>>>>> >> testing
>>>>> >> >> > process.
>>>>> >> >> >
>>>>> >> >> > Looking forward to everything going smoothly in the future.
>>>>> >> >> >
>>>>> >> >> > Best Regards,
>>>>> >> >> > - Shilun Fan.
>>>>> >> >> >
>>>>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <
>>>>> hexiaoq...@apache.org>
>>>>> >> wrote:
>>>>> >> >> >
>>>>> >> >> >> cc private@h.a.o.
>>>>> >> >> >>
>>>>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <zande...@apache.org
>>>>> >
>>>>> >> wrote:
>>>>> >> >> >> >
>>>>> >> >> >> > Here are some summaries about the first phase:
>>>>> >> >> >> > 1. There are no big changes in this phase
>>>>> >> >> >> > 2. This phase just uses FS lock and BM lock to replace the
>>>>> >> original
>>>>> >> >> >> global
>>>>> >> >> >> > lock
>>>>> >> >> >> > 3. It's useful to improve the performance, since some
>>>>> operations
>>>>> >> just
>>>>> >> >> >> need
>>>>> >> >> >> > to hold FS lock or BM lock instead of the global lock
>>>>> >> >> >> > 4. This feature is turned off by default, you can enable it
>>>>> by
>>>>> >> setting
>>>>> >> >> >> > dfs.namenode.lock.model.provider.class to
>>>>> >> >> >> >
>>>>> >>
>>>>> org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
>>>>> >> >> >> > 5. This phase is very important for the ongoing development of
>>>>> the
>>>>> >> entire
>>>>> >> >> >> FGL
>>>>> >> >> >> >
>>>>> >> >> >> > Here I would like to express my special thanks to
>>>>> @kokonguyen191
>>>>> >> and
>>>>> >> >> >> > @yuanboliu for their contributions.  And you are also
>>>>> welcome to
>>>>> >> join us
>>>>> >> >> >> > and complete it together.
>>>>> >> >> >> >
>>>>> >> >> >> >
>>>>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <zande...@apache.org
>>>>> >
>>>>> >> wrote:
>>>>> >> >> >> >
>>>>> >> >> >> > > Hi everyone
>>>>> >> >> >> > >
>>>>> >> >> >> > > All subtasks of the first phase of the FGL have been
>>>>> completed
>>>>> >> and I
>>>>> >> >> >> plan
>>>>> >> >> >> > > to merge them into the trunk and start the second phase
>>>>> based
>>>>> >> on the
>>>>> >> >> >> trunk.
>>>>> >> >> >> > >
>>>>> >> >> >> > > Here is the PR that used to merge the first phases into
>>>>> trunk:
>>>>> >> >> >> > > https://github.com/apache/hadoop/pull/6762
>>>>> >> >> >> > > Here is the ticket:
>>>>> >> https://issues.apache.org/jira/browse/HDFS-17384
>>>>> >> >> >> > >
>>>>> >> >> >> > > I hope you can help to review this PR when you are
>>>>> available
>>>>> >> and give
>>>>> >> >> >> some
>>>>> >> >> >> > > ideas.
>>>>> >> >> >> > >
>>>>> >> >> >> > >
>>>>> >> >> >> > > HDFS-17385 <
>>>>> https://issues.apache.org/jira/browse/HDFS-17385>
>>>>> >> is
>>>>> >> >> >> used for
>>>>> >> >> >> > > the second phase and I have created some subtasks to
>>>>> describe
>>>>> >> >> >> solutions for
>>>>> >> >> >> > > some problems, such as: snapshot, getListing, quota.
>>>>> >> >> >> > > You are welcome to join us to complete it together.
>>>>> >> >> >> > >
>>>>> >> >> >> > >
>>>>> >> >> >> > > ---------- Forwarded message ---------
>>>>> >> >> >> > > From: Zengqiang XU <zande...@apache.org>
>>>>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
>>>>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
>>>>> >> >> >> > > To: <hdfs-dev@hadoop.apache.org>
>>>>> >> >> >> > > Cc: Zengqiang XU <xuzengqiang5...@gmail.com>
>>>>> >> >> >> > >
>>>>> >> >> >> > >
>>>>> >> >> >> > > Hi everyone
>>>>> >> >> >> > >
>>>>> >> >> >> > > I have started a discussion about NameNode Fine-grained
>>>>> Locking
>>>>> >> to
>>>>> >> >> >> improve
>>>>> >> >> >> > > performance of write operations in NameNode.
>>>>> >> >> >> > >
>>>>> >> >> >> > > I started this discussion again for several main reasons:
>>>>> >> >> >> > > 1. We have implemented it and gained nearly 7x performance
>>>>> >> >> >> improvement in
>>>>> >> >> >> > > our prod environment
>>>>> >> >> >> > > 2. Many other companies made similar improvements based
>>>>> on their
>>>>> >> >> >> internal
>>>>> >> >> >> > > branch.
>>>>> >> >> >> > > 3. This topic has been discussed for a long time, but
>>>>> still
>>>>> >> without
>>>>> >> >> >> any
>>>>> >> >> >> > > results.
>>>>> >> >> >> > >
>>>>> >> >> >> > > I hope we can push this important improvement in the
>>>>> community
>>>>> >> so
>>>>> >> >> >> that all
>>>>> >> >> >> > > end-users can enjoy this significant improvement.
>>>>> >> >> >> > >
>>>>> >> >> >> > > I'd really appreciate you can join in and work with me to
>>>>> push
>>>>> >> this
>>>>> >> >> >> > > feature forward.
>>>>> >> >> >> > >
>>>>> >> >> >> > > Thanks very much.
>>>>> >> >> >> > >
>>>>> >> >> >> > > Ticket: HDFS-17366 <
>>>>> >> https://issues.apache.org/jira/browse/HDFS-17366>
>>>>> >> >> >> > > Design: NameNode Fine-grained locking based on directory
>>>>> tree
>>>>> >> >> >> > > <
>>>>> >> >> >>
>>>>> >>
>>>>> https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing
>>>>> >> >> >> >
>>>>> >> >> >> > >
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >>
>>>>> ---------------------------------------------------------------------
>>>>> >> >> >> To unsubscribe, e-mail: private-unsubscr...@hadoop.apache.org
>>>>> >> >> >> For additional commands, e-mail:
>>>>> private-h...@hadoop.apache.org
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >>
>>>>> >>
>>>>> ---------------------------------------------------------------------
>>>>> >> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
>>>>> >> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>>>>> >>
>>>>> >>
>>>>>
>>>>
