Re: Discussion about NameNode Fine-grained locking

Hui Fei Mon, 06 May 2024 08:57:21 -0700

BTW, there is a Slack channel, hdfs-fgl, for this feature. You can join it
to discuss more details.


Should we hold a meeting to discuss this, so that we can push it forward
quickly? I agree with ZanderXu that discussing details via the mailing
list seems inefficient.


On Mon, May 6, 2024 at 23:50, Hui Fei <[email protected]> wrote:

> Thanks all
>
> Seems all concerns are related to stage 2. We can address these and
> make it clearer before we start it.
>
> From development experience, I think it is reasonable to split the big
> feature into several stages. Stage 1 is also independent and can
> stand as a minor feature that uses the fs and bm locks instead of the
> global lock.
>
>
> On Mon, Apr 29, 2024 at 15:17, ZanderXu <[email protected]> wrote:
>
>> Thanks @Ayush Saxena <[email protected]> and @Xiaoqiao He
>> <[email protected]> for your nice questions.
>>
>> Let me summarize your concerns and corresponding solutions:
>>
>> *1. Questions about the Snapshot feature*
>> It's difficult to apply FGL to the Snapshot feature, but we can just use
>> the global FS write lock to make it thread safe.
>> So if we can identify whether a path involves the snapshot feature, we can
>> just use the global FS write lock to protect it.
>>
>> You can refer to HDFS-17479
>> <https://issues.apache.org/jira/browse/HDFS-17479> to see how to identify
>> it.
>>
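A minimal sketch of the fallback described in point 1, under stated assumptions: the set of snapshottable directories is known up front, and a path "involves" snapshots when it or one of its ancestors is in that set. The class and enum names are hypothetical and are not taken from HDFS-17479 or from any HDFS API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; these names are not HDFS APIs.
final class SnapshotAwareLockChooser {
    enum LockScope { GLOBAL_FS_WRITE, FINE_GRAINED }

    private final Set<String> snapshottableDirs;

    SnapshotAwareLockChooser(Set<String> snapshottableDirs) {
        this.snapshottableDirs = new HashSet<>(snapshottableDirs);
    }

    // Walks from the path up to the root; any snapshottable ancestor
    // (or the path itself) forces the coarse global write lock.
    LockScope choose(String path) {
        String current = path;
        while (true) {
            if (snapshottableDirs.contains(current)) {
                return LockScope.GLOBAL_FS_WRITE;
            }
            int slash = current.lastIndexOf('/');
            if (current.equals("/") || slash < 0) {
                return LockScope.FINE_GRAINED;
            }
            current = (slash == 0) ? "/" : current.substring(0, slash);
        }
    }
}
```

This sketch only looks upward. An operation such as a recursive delete would also need to check descendants, which is left out here.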
>> Regarding performance of the operations related to the snapshot features,
>> we can discuss it in two categories:
>> Read operations involving snapshots:
>> The FGL branch uses the global write lock to protect them, while the GLOBAL
>> branch uses the global read lock. It's hard to conclude which version has
>> better performance; it depends on the global lock contention.
>>
>> Write operations involving snapshots:
>> Both the FGL and GLOBAL branches use the global write lock to protect them.
>> It's hard to conclude which version has better performance; it depends on
>> the global lock contention too.
>>
>> So I think if namenode load is low, the GLOBAL branch will have better
>> performance than FGL; if namenode load is high, the FGL branch may have
>> better performance than GLOBAL, which also depends on the ratio of read
>> and write operations on the snapshot feature.
>>
>> We can do two things to let end users choose the better-performing mode
>> for their workload:
>> First, we need to make the lock mode selectable, so that end users
>> can choose to use FGL or GLOBAL.
>> Second, use the global write lock to make operations related to snapshots
>> thread safe, as I described in HDFS-17479.
>>
>>
>> *2. Questions about the Symlinks feature*
>> If a Symlink is related to a snapshot, we can refer to the solution for
>> snapshots; if a Symlink is not related to a snapshot, I think it's easy to
>> support with FGL.
>> Only createSymlink involves two paths; FGL just needs to lock them in
>> order to make this operation thread safe. For other operations, it is the
>> same as for any other normal iNode, right?
>>
>> If I missed difficult points, please let me know.
>>
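A minimal sketch of the two-path locking mentioned in point 2, assuming that "in order" means a fixed global order (here, lexicographic path order) so that two threads locking the same pair can never wait on each other in a cycle. The class is hypothetical; real FGL code would lock iNodes along the resolved paths rather than plain strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of ordered acquisition; not an HDFS class.
final class OrderedPathLocks {
    private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    // Runs the action while holding the write locks of both paths.
    // Locks are always taken in lexicographic path order, regardless of
    // the argument order, and released in reverse.
    void withBothLocked(String pathA, String pathB, Runnable action) {
        boolean aFirst = pathA.compareTo(pathB) <= 0;
        ReentrantReadWriteLock first = lockFor(aFirst ? pathA : pathB);
        ReentrantReadWriteLock second = lockFor(aFirst ? pathB : pathA);
        first.writeLock().lock();
        try {
            second.writeLock().lock();
            try {
                action.run();
            } finally {
                second.writeLock().unlock();
            }
        } finally {
            first.writeLock().unlock();
        }
    }
}
```

Because the write lock is reentrant, passing the same path twice is harmless in this sketch.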
>>
>> *3. Questions about Memory Usage of iNode locks*
>> I think there are many solutions to limit the memory usage of these
>> iNode locks, such as using a limited-capacity lock pool to cap the
>> maximum memory usage, holding iNode locks only for a fixed depth of
>> directories, etc.
>>
>> We can abstract this as a LockManager first and then support different
>> implementations behind it, so that we can limit the maximum
>> memory usage of these iNode locks.
>> FGL can acquire or lease iNode locks through LockManager.
>>
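One way the LockManager abstraction from point 3 could bound memory is lock striping: a fixed array of locks, with each iNode id hashed onto one slot. This is only a sketch of the "limited-capacity pool" idea and swaps in striping as the technique; the interface and class names are hypothetical and are not the design from the FGL branch.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical abstraction; the real LockManager API is not shown in this thread.
interface InodeLockManager {
    ReentrantReadWriteLock lockFor(long inodeId);
}

// Fixed-capacity implementation: memory use is bounded by `capacity`
// no matter how many iNodes exist.
final class StripedInodeLockManager implements InodeLockManager {
    private final ReentrantReadWriteLock[] stripes;

    StripedInodeLockManager(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        stripes = new ReentrantReadWriteLock[capacity];
        for (int i = 0; i < capacity; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
    }

    @Override
    public ReentrantReadWriteLock lockFor(long inodeId) {
        // floorMod keeps the index non-negative even for negative ids.
        int index = (int) Math.floorMod(inodeId, (long) stripes.length);
        return stripes[index];
    }
}
```

The trade-off is that unrelated iNodes can share a stripe and contend with each other, so the capacity would need tuning against the heap budget.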
>>
>> *4. Questions about Performance of acquiring and releasing iNode locks*
>> We can add a benchmark for LockManager to test the performance of
>> acquiring and releasing uncontended locks.
>>
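As a rough illustration of point 4, the loop below times uncontended acquire/release cycles on a single `ReentrantReadWriteLock`. It is a naive sketch, not the planned LockManager benchmark; the class name is hypothetical, and a proper measurement would use a harness such as JMH.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Naive timing loop for illustration only; results vary by JVM and machine.
final class UncontendedLockBench {
    // Returns the average nanoseconds per write-lock acquire/release pair.
    static double averageNanosPerCycle(int iterations) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        runCycles(lock, iterations); // warm-up pass, result discarded
        long start = System.nanoTime();
        runCycles(lock, iterations);
        return (double) (System.nanoTime() - start) / iterations;
    }

    private static void runCycles(ReentrantReadWriteLock lock, int iterations) {
        for (int i = 0; i < iterations; i++) {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.printf("avg ns per acquire/release: %.1f%n",
            averageNanosPerCycle(1_000_000));
    }
}
```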
>>
>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>> These policies may be set on an ancestor node and used by some child
>> files.  The set operations for these policies will be protected by the
>> directory tree, since they are all file-related operations.  Apart from
>> Quota and StoragePolicy, the use of the other policies, such as ECPolicy
>> and ACL, will also be protected by the directory tree.
>>
>> Quota is a little special since its update operations may not be protected
>> by the directory tree; we can assign a lock to each QuotaFeature and use
>> these locks to make update operations thread safe. You can refer to
>> HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for more
>> detailed information.
>>
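The per-quota lock idea in point 5 can be pictured as a counter that carries its own lock, so usage updates stay consistent even when the caller holds no directory-tree lock. This is a sketch under that assumption only; `LockedQuota` is a hypothetical class, not HDFS's QuotaFeature or the change in HDFS-17473.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a quota object with a dedicated lock.
final class LockedQuota {
    private final ReentrantLock lock = new ReentrantLock();
    private final long limit;
    private long used;

    LockedQuota(long limit) {
        this.limit = limit;
    }

    // Atomically checks the limit and records the usage.
    // Returns false, changing nothing, if the delta would exceed the limit.
    boolean tryConsume(long delta) {
        lock.lock();
        try {
            if (used + delta > limit) {
                return false;
            }
            used += delta;
            return true;
        } finally {
            lock.unlock();
        }
    }

    long used() {
        lock.lock();
        try {
            return used;
        } finally {
            lock.unlock();
        }
    }
}
```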
>> StoragePolicy is a little special since it is used not only by
>> file-related
>> operations but also block-related operations.  ProcessExtraRedundancyBlock
>> uses storage policy to choose redundancy replicas and
>> BlockReconstructionWork uses storage policy to choose target DNs. In order
>> to maximize the performance improvement, BR and IBR should only involve
>> the
>> iNodeFile to which the current processing block belongs. These redundancy
>> blocks can be processed by the Redundancy monitor while holding the
>> directory tree locks. You can refer to HDFS-17505
>> <https://issues.apache.org/jira/browse/HDFS-17505> to get more detailed
>> information.
>>
>> *6. Performance of the phase 1*
>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used to
>> do
>> some performance testing for phase 1, and I will complete it later.
>>
>>
>> Discussing solutions by mail is not efficient; you can create
>> sub-tasks under HDFS-17366
>> <https://issues.apache.org/jira/browse/HDFS-17366> to describe your
>> concerns and I will try to give some answers.
>>
>> Thanks @Ayush Saxena <[email protected]>  and @Xiaoqiao He
>> <[email protected]> again.
>>
>>
>>
>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <[email protected]> wrote:
>>
>> > Thanx Everyone for chasing this, Great to see some momentum around FGL,
>> > that should be a great improvement.
>> >
>> > I have comments in two broad categories:
>> > *About the process:*
>> > I think in the above mails, there are mentions that phase one is
>> > complete in a feature branch & we are gonna merge that to trunk. If I am
>> > catching it right, then you can't hit the merge button like that. To
>> > merge a feature branch, you need to call for a Vote specific to that
>> > branch & it requires 3 binding votes to merge, unlike any other code
>> > change which requires 1. It is there in our Bylaws.
>> >
>> > So, do follow the process.
>> >
>> > *About the feature itself:* (A very quick look at the doc and the
>> > Jira, so please take it with a grain of salt)
>> > * The Google Drive link that you folks shared as part of the first
>> > mail. I don't have access to that. So, please open up the permissions
>> > for that doc or share the new link
>> > * Chasing the design doc present on the Jira
>> > * I think we only have Phase-1 ready, so can you share some metrics just
>> > for that? Perf improvements just with splitting the FS & BM Locks
>> > * The memory implications of Phase-1? I don't think there should be any
>> > major impact on the memory in case of just phase-1
>> > * Regarding the snapshot stuff, you mentioned taking lock on the root
>> > itself? Does just taking lock on the snapshot root rather than the FS
>> > root work?
>> > * Secondly about the usage of Snapshot or Symlinks, I don't think we
>> > should operate under the assumption that they aren't widely used. We
>> > might just not know the folks who use them widely, or they are just
>> > users, not the ones contributing. We can just accept for now that in
>> > those cases it isn't optimised and we just lock the entire FS space,
>> > which it does even today, so no regressions there.
>> > * Regarding memory usage: Do you have some numbers on how much the
>> > memory footprint increases?
>> > * Under the Lock Pool: I think you are assuming there would be very few
>> > inodes where lock would be required at any given time, so there won't be
>> > too much heap consumption? I think you are compromising on the
>> > Horizontal Scalability here. I suspect that if your assumption doesn't
>> > hold true, under heavy read load by concurrent clients accessing
>> > different inodes, the Namenode will start giving memory troubles, and
>> > that would do more harm than good. Anyway Namenode heap is a way bigger
>> > problem than anything, so we should be very careful increasing load
>> > over there.
>> > * For the Locks on the inodes: Do you plan to have locks for each inode?
>> > Can we somehow limit that to the depth of the tree? Like currently we
>> > take lock on the root, have a config which makes us take lock at
>> > Level-2 or 3 (configurable), that might fetch some perf benefits and
>> > can be used to control the memory usage as well?
>> > * What is the cost of creating these inode locks? If the lock isn't
>> > already cached it would incur some cost? Do you have some numbers around
>> > that? Say I disable caching altogether & then let a test load run, what
>> > do the perf numbers look like in that case
>> > * I think we need to limit the size of INodeLockPool, we can't let it
>> > grow infinitely in case of heavy loads and we need to have some auto
>> > throttling mechanism for it
>> > * I didn't catch your Storage Policy problem. If I decode it right, the
>> > problem is like the policy could be set on an ancestor node & the
>> > children abide by that & this is the problem, if that is the case then
>> > isn't that the case with ErasureCoding policies or even ACLs or so? Can
>> > you elaborate a bit on that.
>> >
>> >
>> > Anyway, regarding the Phase-1. If you share (the perf numbers with
>> > proper details + Impact on memory if any) for just phase 1 & if they
>> > are good, then if you call for a branch merge vote for Phase-1 FGL, you
>> > have my vote, however you'll need to sway the rest of the folks on your
>> > own :-)
>> >
>> > Good Luck, Nice Work Guys!!!
>> >
>> > -Ayush
>> >
>> >
>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <[email protected]> wrote:
>> >
>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will be
>> >> a very helpful improvement for the HDFS module in the next journal.
>> >>
>> >> 1. If we need any more review bandwidth, I would like to be involved
>> >> to help review if possible.
>> >> 2. The design document is still missing some detailed descriptions,
>> >> such as snapshot, symbolic link and reserved etc., as mentioned above.
>> >> I think it will be helpful for newcomers who want to be involved if
>> >> all corner cases are considered and described.
>> >> 3. From Slack, we plan to check into the trunk at this phase. I am not
>> >> sure if it is the proper time; following the dev plan, there are two
>> >> steps left to finish this feature according to the design document,
>> >> right? If so, I think we should postpone checking in until all plans
>> >> are ready. Considering that there have been many unfinished attempts
>> >> at this feature in the past, I think postponing the check-in will be
>> >> the safe way. On the other hand, it will involve more rebase cost if
>> >> you keep a separate dev branch; however, I think that is not a
>> >> difficult thing for you.
>> >>
>> >> Good luck and look forward to making that happen soon!
>> >>
>> >> Best Regards,
>> >> - He Xiaoqiao
>> >>
>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <[email protected]> wrote:
>> >> >
>> >> > Thanks for your interest and advice on this.
>> >> >
>> >> > Just would like to share some info here
>> >> >
>> >> > ZanderXu leads this feature and he has spent a lot of time on it. He
>> >> > is the main developer in stage 1.  Yuanboliu and Kokonguyen191 also
>> >> > took some tasks. Other developers (slfan1989 haiyang1987
>> >> > huangzhaobo99 RocMarshal kokonguyen191) helped review PRs. (Forgive
>> >> > me if I missed someone)
>> >> >
>> >> > Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
>> >> > familiar with this feature. We discussed many details offline.
>> >> >
>> >> > More people interested in joining the development and review of
>> >> > stages 2 and 3 are welcome.
>> >> >
>> >> >
>> >> > On Fri, Apr 26, 2024 at 14:56, Zengqiang XU <[email protected]> wrote:
>> >> >>
>> >> >> Thanks Shilun for your response:
>> >> >>
>> >> >> 1. This is a big and very useful feature, so it really needs more
>> >> >> developers to get on board.
>> >> >> 2. This fine-grained lock has been implemented on internal branches
>> >> >> and has benefited many companies, such as Meituan, Kuaishou,
>> >> >> Bytedance, etc.  But it has not been contributed to the community
>> >> >> for various reasons: there is a big difference between the version
>> >> >> of the internal branch and the community trunk branch, the internal
>> >> >> branch may ignore some functions to make FGL clear, and the
>> >> >> contribution needs a lot of work and will take a long time. It means
>> >> >> that this solution has already been practiced in their prod
>> >> >> environments. We have also practiced it in our prod environment and
>> >> >> gained benefits, and we are also willing to spend a lot of time
>> >> >> contributing to the community.
>> >> >> 3. Regarding the benchmark testing, we don't need to pay too much
>> >> >> attention to whether the performance is improved by 5 times, 10
>> >> >> times or 20 times, because there are too many factors that affect it.
>> >> >> 4. As I described above, this solution is already being practiced by
>> >> >> many companies. Right now, we just need to think about how to
>> >> >> implement it with high quality and more comprehensively.
>> >> >> 5. I firmly believe that all problems can be solved as long as the
>> >> >> overall solution is right.
>> >> >> 6. I can spend a lot of time leading the promotion of this entire
>> >> >> feature and I hope more people can join us in promoting it.
>> >> >> 7. You are always welcome to raise your concerns.
>> >> >>
>> >> >>
>> >> >> Thanks Shilun again, I hope you can help review designs and PRs.
>> >> >> Thanks
>> >> >>
>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <[email protected]> wrote:
>> >> >>
>> >> >> > Thank you for your hard work! This is a very meaningful
>> >> >> > improvement, and from the design document, we can see a
>> >> >> > significant increase in HDFS read/write throughput.
>> >> >> >
>> >> >> > I am happy to see the progress made on HDFS-17384.
>> >> >> >
>> >> >> > However, I still have some concerns, which roughly involve the
>> >> >> > following aspects:
>> >> >> >
>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and are
>> >> >> > familiar with related development details, we still need more
>> >> >> > community members to review the code to ensure that the relevant
>> >> >> > upgrades meet expectations.
>> >> >> >
>> >> >> > 2. We need more details on benchmarks to ensure that test results
>> >> >> > can be reproduced and to allow more community members to
>> >> >> > participate in the testing process.
>> >> >> >
>> >> >> > Looking forward to everything going smoothly in the future.
>> >> >> >
>> >> >> > Best Regards,
>> >> >> > - Shilun Fan.
>> >> >> >
>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <[email protected]> wrote:
>> >> >> >
>> >> >> >> cc [email protected].
>> >> >> >>
>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > Here are some summaries about the first phase:
>> >> >> >> > 1. There are no big changes in this phase
>> >> >> >> > 2. This phase just uses the FS lock and BM lock to replace the
>> >> >> >> > original global lock
>> >> >> >> > 3. It's useful for improving performance, since some operations
>> >> >> >> > just need to hold the FS lock or BM lock instead of the global lock
>> >> >> >> > 4. This feature is turned off by default; you can enable it by
>> >> >> >> > setting dfs.namenode.lock.model.provider.class to
>> >> >> >> > org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
>> >> >> >> > 5. This phase is very important for the ongoing development of the
>> >> >> >> > entire FGL
>> >> >> >> >
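For item 4 above, the setting might look like the fragment below. This assumes the property is set in hdfs-site.xml like other NameNode options and that the build includes the phase-1 FGL changes; the mail itself only gives the key and the class name.

```xml
<!-- hdfs-site.xml: opt in to the phase-1 fine-grained lock (off by default) -->
<property>
  <name>dfs.namenode.lock.model.provider.class</name>
  <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
</property>
```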
>> >> >> >> > Here I would like to express my special thanks to @kokonguyen191
>> >> >> >> > and @yuanboliu for their contributions.  And you are also welcome
>> >> >> >> > to join us and complete it together.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > > Hi everyone
>> >> >> >> > >
>> >> >> >> > > All subtasks of the first phase of the FGL have been completed
>> >> >> >> > > and I plan to merge them into the trunk and start the second
>> >> >> >> > > phase based on the trunk.
>> >> >> >> > >
>> >> >> >> > > Here is the PR used to merge the first phase into trunk:
>> >> >> >> > > https://github.com/apache/hadoop/pull/6762
>> >> >> >> > > Here is the ticket: https://issues.apache.org/jira/browse/HDFS-17384
>> >> >> >> > >
>> >> >> >> > > I hope you can help to review this PR when you are available
>> >> >> >> > > and give some ideas.
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385>
>> >> >> >> > > is used for the second phase and I have created some subtasks
>> >> >> >> > > to describe solutions for some problems, such as: snapshot,
>> >> >> >> > > getListing, quota.
>> >> >> >> > > You are welcome to join us to complete it together.
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > ---------- Forwarded message ---------
>> >> >> >> > > From: Zengqiang XU <[email protected]>
>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
>> >> >> >> > > To: <[email protected]>
>> >> >> >> > > Cc: Zengqiang XU <[email protected]>
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > Hi everyone
>> >> >> >> > >
>> >> >> >> > > I have started a discussion about NameNode Fine-grained Locking
>> >> >> >> > > to improve performance of write operations in NameNode.
>> >> >> >> > >
>> >> >> >> > > I started this discussion again for several main reasons:
>> >> >> >> > > 1. We have implemented it and gained nearly 7x performance
>> >> >> >> > > improvement in our prod environment
>> >> >> >> > > 2. Many other companies made similar improvements based on their
>> >> >> >> > > internal branch.
>> >> >> >> > > 3. This topic has been discussed for a long time, but still
>> >> >> >> > > without any results.
>> >> >> >> > >
>> >> >> >> > > I hope we can push this important improvement in the community
>> >> >> >> > > so that all end-users can enjoy this significant improvement.
>> >> >> >> > >
>> >> >> >> > > I'd really appreciate it if you can join in and work with me to
>> >> >> >> > > push this feature forward.
>> >> >> >> > >
>> >> >> >> > > Thanks very much.
>> >> >> >> > >
>> >> >> >> > > Ticket: HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366>
>> >> >> >> > > Design: NameNode Fine-grained locking based on directory tree
>> >> >> >> > > <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>
>> >> >> >> > >
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> >> For additional commands, e-mail: [email protected]
>> >> >> >>
>> >> >> >>
>>
>
