Re: Discussion about NameNode Fine-grained locking

Hui Fei Mon, 06 May 2024 08:50:25 -0700

Thanks all

It seems all the concerns relate to stage 2. We can address them and
clarify the plan before we start it.


From development experience, I think it is reasonable to split the big
feature into several stages. Stage 1 is also independent: it can stand on
its own as a minor feature that uses the FS and BM locks instead of the
global lock.
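
To make the stage 1 idea concrete, here is a minimal Python sketch of
splitting one global lock into an FS lock and a BM lock. It is not the
Hadoop implementation: the class name, the mode strings and the
"FS before BM" acquisition order are all assumptions for illustration.

```python
import threading
from contextlib import contextmanager


class SplitNamesystemLock:
    """Toy model of stage 1: an FS lock and a BM lock replace one global lock."""

    def __init__(self):
        self.fs = threading.Lock()  # guards the directory tree (hypothetical name)
        self.bm = threading.Lock()  # guards block management (hypothetical name)

    @contextmanager
    def hold(self, mode):
        # Assumption: "GLOBAL" means both locks, always taken FS first, then BM,
        # so that two callers can never wait on each other in opposite order.
        wanted = {"FS": [self.fs], "BM": [self.bm], "GLOBAL": [self.fs, self.bm]}[mode]
        for lock in wanted:
            lock.acquire()
        try:
            yield
        finally:
            for lock in reversed(wanted):
                lock.release()


lock = SplitNamesystemLock()
with lock.hold("FS"):
    # A block-only operation is not blocked by a namespace-only operation.
    bm_is_free = lock.bm.acquire(blocking=False)
    if bm_is_free:
        lock.bm.release()
```

The point of the sketch is only that an operation holding the FS lock
leaves the BM lock available, which is where the stage 1 gain would come
from.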


ZanderXu <[email protected]> wrote on Mon, Apr 29, 2024 at 15:17:

> Thanks @Ayush Saxena <[email protected]> and @Xiaoqiao He
> <[email protected]> for your nice questions.
>
> Let me summarize your concerns and corresponding solutions:
>
> *1. Questions about the Snapshot feature*
> It's difficult to apply FGL to the Snapshot feature, but we can just use
> the global FS write lock to make it thread safe.
> So if we can identify whether a path involves the snapshot feature, we can
> just use the global FS write lock to protect it.
>
> You can refer to HDFS-17479
> <https://issues.apache.org/jira/browse/HDFS-17479> to see how to identify
> it.
>
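A rough Python sketch of the fallback described above: decide per path
whether to use fine-grained locks or the global FS write lock. The
detection rule here (a ".snapshot" component, or an ancestor in a set of
snapshottable directories) is an assumption for illustration and is not
taken from HDFS-17479; the function names and mode strings are
hypothetical.

```python
def involves_snapshot(path, snapshottable_dirs):
    """Hypothetical check: the path names a snapshot or sits under a snapshottable dir."""
    parts = [p for p in path.split("/") if p]
    if ".snapshot" in parts:
        return True
    prefix = ""
    for part in parts:
        prefix += "/" + part
        if prefix in snapshottable_dirs:
            return True
    return False


def choose_lock_mode(path, snapshottable_dirs):
    # Fall back to the coarse lock whenever snapshots are involved.
    if involves_snapshot(path, snapshottable_dirs):
        return "GLOBAL_FS_WRITE"
    return "FINE_GRAINED"


mode_for_snapshot_path = choose_lock_mode("/data/a/.snapshot/s1/f", set())
mode_for_plain_path = choose_lock_mode("/tmp/f", {"/data/a"})
```

With this shape, snapshot users keep today's behaviour and only
snapshot-free paths take the fine-grained route.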
> Regarding the performance of operations related to the snapshot feature,
> we can discuss it in two categories:
> Read operations involving snapshots:
> The FGL branch uses the global write lock to protect them, while the GLOBAL
> branch uses the global read lock. It's hard to say which version performs
> better; it depends on global lock contention.
>
> Write operations involving snapshots:
> Both the FGL and GLOBAL branches use the global write lock to protect them.
> Again, it's hard to say which version performs better; it depends on global
> lock contention too.
>
> So I think if namenode load is low, the GLOBAL branch will perform better
> than FGL; if namenode load is high, the FGL branch may perform better than
> GLOBAL, which also depends on the ratio of read and write operations on
> the SNAPSHOT feature.
>
> We can do a few things to let end users choose the mode that performs
> better for their workload:
> First, we need to make the lock mode selectable, so that end users can
> choose FGL or GLOBAL.
> Second, use the global write lock to make snapshot-related operations
> thread safe, as I described in HDFS-17479.
>
>
> *2. Questions about the Symlinks feature*
> If a Symlink is related to a snapshot, we can refer to the snapshot
> solution; if a Symlink is not related to a snapshot, I think it's easy to
> support with FGL.
> Only createSymlink involves two paths; FGL just needs to lock them in a
> fixed order to make this operation thread safe. Other operations are the
> same as for any other normal iNode, right?
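
A small Python sketch of locking two paths in a consistent order, which
is the usual way to keep a two-path operation such as createSymlink from
deadlocking. This is a simplification: it uses one plain lock per path and
sorts by path string, whereas a real directory-tree lock would also cover
ancestors. All names below are hypothetical.

```python
import threading
from contextlib import contextmanager

_locks = {}
_registry_guard = threading.Lock()


def lock_for(path):
    # One lock per path, created on first use (hypothetical registry).
    with _registry_guard:
        return _locks.setdefault(path, threading.Lock())


@contextmanager
def lock_paths(*paths):
    # Sort so that every caller acquires in the same global order,
    # regardless of the order the paths were passed in.
    ordered = sorted(set(paths))
    held = []
    try:
        for p in ordered:
            lock_for(p).acquire()
            held.append(p)
        yield ordered
    finally:
        for p in reversed(held):
            lock_for(p).release()


def create_symlink(link, target, done):
    with lock_paths(link, target) as order:
        done.append(order)


done = []
threads = [
    threading.Thread(target=create_symlink, args=("/b/link", "/a/target", done)),
    threading.Thread(target=create_symlink, args=("/a/target", "/b/link", done)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```

The two threads pass the same pair of paths in opposite order, yet both
acquire "/a/target" before "/b/link", so neither can hold one lock while
waiting for the other.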
>
> If I missed difficult points, please let me know.
>
>
> *3. Questions about Memory Usage of iNode locks*
> I think there are many ways to limit the memory usage of these iNode
> locks, such as using a limited-capacity lock pool to cap the maximum
> memory usage, or holding iNode locks only down to a fixed directory
> depth.
>
> We can abstract a LockManager first and then support different
> implementations behind it, so that we can cap the maximum memory usage of
> these iNode locks.
> FGL can acquire or lease iNode locks through the LockManager.
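
One possible shape for such a LockManager, sketched in Python under stated
assumptions: a fixed-size pool of locks (so memory is capped by the pool
capacity, not by the number of iNodes) combined with an optional depth
limit. The class name, the parameters and the CRC32-based slot choice are
mine, not from the design doc.

```python
import threading
import zlib


class BoundedLockManager:
    """Hypothetical LockManager: a fixed pool of locks shared by all iNodes."""

    def __init__(self, capacity=1024, max_depth=None):
        self._pool = [threading.Lock() for _ in range(capacity)]  # memory is capped here
        self._max_depth = max_depth

    def lock_key(self, path):
        parts = [p for p in path.split("/") if p]
        if self._max_depth is not None:
            parts = parts[: self._max_depth]  # deeper iNodes share the ancestor's key
        return "/" + "/".join(parts)

    def lock_for(self, path):
        key = self.lock_key(path).encode("utf-8")
        return self._pool[zlib.crc32(key) % len(self._pool)]


manager = BoundedLockManager(capacity=8, max_depth=2)
same = manager.lock_for("/user/alice/a/b") is manager.lock_for("/user/alice/x")
```

The trade-off is that unrelated paths can land on the same pool slot,
which costs concurrency but not correctness. A caller that needs several
pool locks at once would also have to deduplicate and order them, since
these locks are not reentrant.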
>
>
> *4. Questions about Performance of acquiring and releasing iNode locks*
> We can add some benchmarks for LockManager to test the performance of
> acquiring and releasing uncontended locks.
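
As an illustration of the measurement only, a Python micro-benchmark for
uncontended acquire/release could look like the sketch below. The numbers
it prints say nothing about JVM locks; a real benchmark for the NameNode
would be written against the Java LockManager. The function name is
hypothetical.

```python
import threading
import time


def ns_per_lock_cycle(lock, iterations=200_000):
    """Average wall-clock nanoseconds for one uncontended acquire + release."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        lock.acquire()
        lock.release()
    return (time.perf_counter_ns() - start) / iterations


plain = ns_per_lock_cycle(threading.Lock())
reentrant = ns_per_lock_cycle(threading.RLock())
print(f"Lock: {plain:.0f} ns/cycle, RLock: {reentrant:.0f} ns/cycle")
```

Running the same loop with the lock cache disabled versus enabled would
be one way to answer the "cost of creating locks" question raised
earlier in the thread.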
>
>
> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
> These policies may be set on an ancestor node and used by some child
> files.  The set operations for these policies will be protected by the
> directory tree, since they are all file-related operations.  Apart from
> Quota and StoragePolicy, the use of the other policies, such as ECPolicy
> and ACL, will also be protected by the directory tree.
>
> Quota is a little special since its update operations may not be protected
> by the directory tree. We can assign a lock to each QuotaFeature and use
> these locks to make update operations thread safe. You can refer to
> HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for more
> detailed information.
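
A minimal Python sketch of the per-QuotaFeature lock idea: each quota
object carries its own lock, and the check plus the update happen under
that lock. The class and field names are hypothetical and this is not the
HDFS-17473 patch.

```python
import threading


class QuotaExceeded(Exception):
    pass


class QuotaFeature:
    """Hypothetical stand-in for a directory's quota state with its own lock."""

    def __init__(self, namespace_quota):
        self.namespace_quota = namespace_quota
        self.namespace_used = 0
        self._lock = threading.Lock()

    def add_usage(self, delta):
        # Check and update under the same lock so concurrent writers
        # cannot both pass the check and overshoot the quota.
        with self._lock:
            if self.namespace_used + delta > self.namespace_quota:
                raise QuotaExceeded(
                    f"would use {self.namespace_used + delta} of {self.namespace_quota}"
                )
            self.namespace_used += delta


quota = QuotaFeature(namespace_quota=4000)


def create_files(count):
    for _ in range(count):
        quota.add_usage(1)


workers = [threading.Thread(target=create_files, args=(1000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Four writers under different subdirectories can then update the same
ancestor quota without holding any directory-tree lock on that ancestor.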
>
> StoragePolicy is a little special since it is used not only by file-related
> operations but also by block-related operations.  ProcessExtraRedundancyBlock
> uses the storage policy to choose redundant replicas and
> BlockReconstructionWork uses the storage policy to choose target DNs. In
> order to maximize the performance improvement, BR and IBR should only
> involve the iNodeFile to which the block being processed belongs. These
> redundant blocks can be processed by the Redundancy monitor while holding
> the directory tree locks. You can refer to HDFS-17505
> <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed
> information.
>
> *6. Performance of phase 1*
> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used to do
> some performance testing for phase 1, and I will complete it later.
>
>
> Discussing solutions over email is not efficient. You can create sub-tasks
> under HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366> to
> describe your concerns and I will try to give some answers.
>
> Thanks @Ayush Saxena <[email protected]>  and @Xiaoqiao He
> <[email protected]> again.
>
>
>
> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <[email protected]> wrote:
>
> > Thanx Everyone for chasing this, Great to see some momentum around FGL,
> > that should be a great improvement.
> >
> > I have comments in two broad categories:
> > ** About the process:*
> > I think in the above mails, there are mentions that phase one is complete
> > in a feature branch & we are gonna merge that to trunk. If I am catching
> > it right, then you can't hit the merge button like that. To merge a
> > feature branch, you need to call for a Vote specific to that branch & it
> > requires 3 binding votes to merge, unlike any other code change which
> > requires 1. It is there in our Bylaws.
> >
> > So, do follow the process.
> >
> > ** About the feature itself:* (A very quick look at the doc and the Jira,
> > so please take it with a grain of salt)
> > * The Google Drive link that you folks shared as part of the first mail:
> > I don't have access to that. So, please open up the permissions for that
> > doc or share the new link
> > * Chasing the design doc present on the Jira
> > * I think we only have Phase-1 ready, so can you share some metrics just
> > for that? Perf improvements just with splitting the FS & BM Locks
> > * The memory implications of Phase-1? I don't think there should be any
> > major impact on the memory in case of just phase-1
> > * Regarding the snapshot stuff, you mentioned taking lock on the root
> > itself? Does just taking lock on the snapshot root rather than the FS
> > root work?
> > * Secondly about the usage of Snapshot or Symlinks, I don't think we
> > should operate under the assumption that they aren't widely used; we
> > might just not know the folks who use them widely, or they are just
> > users, not the ones contributing. We can just accept for now that in
> > those cases it isn't optimised and we just lock the entire FS space,
> > which it does even today, so no regressions there.
> > * Regarding memory usage: Do you have some numbers on how much the memory
> > footprint increases?
> > * Under the Lock Pool: I think you are assuming there would be very few
> > inodes where lock would be required at any given time, so there won't be
> > too much heap consumption? I think you are compromising on the Horizontal
> > Scalability here. I suspect that if your assumption doesn't hold true,
> > under heavy read load by concurrent clients accessing different inodes,
> > the Namenode will start giving memory troubles, and that would do more
> > harm than good. Anyway Namenode heap is way bigger problem than anything,
> > so we should be very careful increasing load over there.
> > * For the Locks on the inodes: Do you plan to have locks for each inode?
> > Can we somehow limit that to the depth of the tree? Like currently we
> > take lock on the root, have a config which makes us take lock at Level-2
> > or 3 (configurable), that might fetch some perf benefits and can be used
> > to control the memory usage as well?
> > * What is the cost of creating these inode locks? If the lock isn't
> > already cached it would incur some cost? Do you have some numbers around
> > that? Say I disable caching altogether & then let a test load run, what
> > do the perf numbers look like in that case?
> > * I think we need to limit the size of INodeLockPool, we can't let it
> > grow infinitely in case of heavy loads and we need to have some auto
> > throttling mechanism for it
> > * I didn't catch your Storage Policy problem. If I decode it right, the
> > problem is like the policy could be set on an ancestor node & the
> > children abide by that & this is the problem, if that is the case then
> > isn't that the case with ErasureCoding policies or even ACLs or so? Can
> > you elaborate a bit on that.
> >
> >
> > Anyway, regarding the Phase-1. If you share (the perf numbers with proper
> > details + Impact on memory if any) for just phase 1 & if they are good,
> > then if you call for a branch merge vote for Phase-1 FGL, you have my
> > vote, however you'll need to sway the rest of the folks on your own :-)
> >
> > Good Luck, Nice Work Guys!!!
> >
> > -Ayush
> >
> >
> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <[email protected]> wrote:
> >
> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will be
> >> a very helpful improvement for the HDFS module in the next journal.
> >>
> >> 1. If we need any more review bandwidth, I would like to be involved
> >> to help review if possible.
> >> 2. The design document is still missing detailed descriptions of
> >> snapshot, symbolic link, reserved, etc., as mentioned above. I think
> >> it will be helpful for newbies who want to be involved if all corner
> >> cases are considered and described.
> >> 3. From Slack, we plan to check into trunk at this phase. I am not sure
> >> if it is the proper time; following the dev plan in the design document,
> >> there are two steps left to finish this feature, right? If so, I think
> >> we should postpone checking in until all plans are ready. Considering
> >> that there have been many unfinished attempts at this feature in the
> >> past, I think postponing the check-in is the safe way. On the other
> >> hand, it will involve more rebase cost if you keep a separate dev
> >> branch; however, I think that is not a difficult thing for you.
> >>
> >> Good luck and look forward to making that happen soon!
> >>
> >> Best Regards,
> >> - He Xiaoqiao
> >>
> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <[email protected]> wrote:
> >> >
> >> > Thanks for the interest and advice on this.
> >> >
> >> > Just would like to share some info here
> >> >
> >> > ZanderXu leads this feature and he has spent a lot of time on it. He
> >> > is the main developer in stage 1.  Yuanboliu and Kokonguyen191 also
> >> > took some tasks. Other developers (slfan1989 haiyang1987 huangzhaobo99
> >> > RocMarshal kokonguyen191) helped review PRs. (Forgive me if I missed
> >> > someone)
> >> >
> >> > Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very
> >> > familiar with this feature. We discussed many details offline.
> >> >
> >> > More people are welcome to join the development and review of stages
> >> > 2 and 3.
> >> >
> >> >
> >> > Zengqiang XU <[email protected]> wrote on Fri, Apr 26, 2024 at 14:56:
> >> >>
> >> >> Thanks Shilun for your response:
> >> >>
> >> >> 1. This is a big and very useful feature, so it really needs more
> >> >> developers to get on board.
> >> >> 2. This fine-grained lock has been implemented on internal branches
> >> >> and has brought benefits to many companies, such as Meituan, Kuaishou,
> >> >> Bytedance, etc.  But it has not been contributed to the community for
> >> >> various reasons: there is a big difference between the version of the
> >> >> internal branch and the community trunk branch, the internal branch
> >> >> may ignore some functions to keep FGL clear, and the contribution
> >> >> needs a lot of work and will take a long time. It means that this
> >> >> solution has already been practiced in their prod environments. We
> >> >> have also practiced it in our prod environment and gained benefits,
> >> >> and we are also willing to spend a lot of time contributing to the
> >> >> community.
> >> >> 3. Regarding the benchmark testing, we don't need to pay too much
> >> >> attention to whether the performance is improved by 5 times, 10 times
> >> >> or 20 times, because there are too many factors that affect it.
> >> >> 4. As I described above, this solution is already being practiced by
> >> >> many companies. Right now, we just need to think about how to
> >> >> implement it with high quality and more comprehensively.
> >> >> 5. I firmly believe that all problems can be solved as long as the
> >> >> overall solution is right.
> >> >> 6. I can spend a lot of time leading the promotion of this entire
> >> >> feature and I hope more people can join us in promoting it.
> >> >> 7. You are always welcome to raise your concerns.
> >> >>
> >> >>
> >> >> Thanks Shilun again, I hope you can help review designs and PRs.
> >> >> Thanks
> >> >>
> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <[email protected]> wrote:
> >> >>
> >> >> > Thank you for your hard work! This is a very meaningful
> >> >> > improvement, and from the design document, we can see a significant
> >> >> > increase in HDFS read/write throughput.
> >> >> >
> >> >> > I am happy to see the progress made on HDFS-17384.
> >> >> >
> >> >> > However, I still have some concerns, which roughly involve the
> >> >> > following aspects:
> >> >> >
> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and are
> >> >> > familiar with related development details, we still need more
> >> >> > community members to review the code to ensure that the relevant
> >> >> > upgrades meet expectations.
> >> >> >
> >> >> > 2. We need more details on benchmarks to ensure that test results
> >> >> > can be reproduced and to allow more community members to participate
> >> >> > in the testing process.
> >> >> >
> >> >> > Looking forward to everything going smoothly in the future.
> >> >> >
> >> >> > Best Regards,
> >> >> > - Shilun Fan.
> >> >> >
> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <[email protected]> wrote:
> >> >> >
> >> >> >> cc [email protected].
> >> >> >>
> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <[email protected]> wrote:
> >> >> >> >
> >> >> >> > Here are some summaries about the first phase:
> >> >> >> > 1. There are no big changes in this phase
> >> >> >> > 2. This phase just uses FS lock and BM lock to replace the
> >> >> >> > original global lock
> >> >> >> > 3. It's useful to improve the performance, since some operations
> >> >> >> > just need to hold FS lock or BM lock instead of the global lock
> >> >> >> > 4. This feature is turned off by default, you can enable it by
> >> >> >> > setting dfs.namenode.lock.model.provider.class to
> >> >> >> > org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock
> >> >> >> > 5. This phase is very important for the ongoing development of
> >> >> >> > the entire FGL
> >> >> >> >
> >> >> >> > Here I would like to express my special thanks to @kokonguyen191
> >> >> >> > and @yuanboliu for their contributions.  And you are also welcome
> >> >> >> > to join us and complete it together.
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <[email protected]> wrote:
> >> >> >> >
> >> >> >> > > Hi everyone
> >> >> >> > >
> >> >> >> > > All subtasks of the first phase of the FGL have been completed
> >> >> >> > > and I plan to merge them into the trunk and start the second
> >> >> >> > > phase based on the trunk.
> >> >> >> > >
> >> >> >> > > Here is the PR used to merge the first phase into trunk:
> >> >> >> > > https://github.com/apache/hadoop/pull/6762
> >> >> >> > > Here is the ticket:
> >> >> >> > > https://issues.apache.org/jira/browse/HDFS-17384
> >> >> >> > >
> >> >> >> > > I hope you can help to review this PR when you are available
> >> >> >> > > and give some ideas.
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385>
> >> >> >> > > is used for the second phase and I have created some subtasks
> >> >> >> > > to describe solutions for some problems, such as: snapshot,
> >> >> >> > > getListing, quota.
> >> >> >> > > You are welcome to join us to complete it together.
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > ---------- Forwarded message ---------
> >> >> >> > > From: Zengqiang XU <[email protected]>
> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
> >> >> >> > > To: <[email protected]>
> >> >> >> > > Cc: Zengqiang XU <[email protected]>
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > Hi everyone
> >> >> >> > >
> >> >> >> > > I have started a discussion about NameNode Fine-grained
> >> >> >> > > Locking to improve performance of write operations in NameNode.
> >> >> >> > >
> >> >> >> > > I started this discussion again for several main reasons:
> >> >> >> > > 1. We have implemented it and gained nearly 7x performance
> >> >> >> > > improvement in our prod environment
> >> >> >> > > 2. Many other companies made similar improvements based on
> >> >> >> > > their internal branch.
> >> >> >> > > 3. This topic has been discussed for a long time, but still
> >> >> >> > > without any results.
> >> >> >> > >
> >> >> >> > > I hope we can push this important improvement in the community
> >> >> >> > > so that all end-users can enjoy this significant improvement.
> >> >> >> > >
> >> >> >> > > I'd really appreciate it if you can join in and work with me to
> >> >> >> > > push this feature forward.
> >> >> >> > >
> >> >> >> > > Thanks very much.
> >> >> >> > >
> >> >> >> > > Ticket: HDFS-17366
> >> >> >> > > <https://issues.apache.org/jira/browse/HDFS-17366>
> >> >> >> > > Design: NameNode Fine-grained locking based on directory tree
> >> >> >> > > <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>
> >> >> >> > >
> >> >> >>
> >> >> >>
> >> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: [email protected]
> >> >> >> For additional commands, e-mail: [email protected]
> >> >> >>
> >> >> >>
>
