Thanks for your great work! +1 for merging the phase 1 code. My production clusters have been running the phase 1 code for several months, and it looks good.
Hope to push this feature forward.

张浩博
hfutzhan...@163.com

---- Replied Message ----
From: haiyang hu <haiyang87...@gmail.com>
Date: 12/31/2024 23:08
To: Ayush Saxena <ayush...@gmail.com>
Cc: Hui Fei <feihui.u...@gmail.com>, ZanderXu <zande...@apache.org>, Hdfs-dev <hdfs-dev@hadoop.apache.org>, <priv...@hadoop.apache.org>, Xiaoqiao He <hexiaoq...@apache.org>, slfan1989 <slfan1...@apache.org>, <xuzq_zan...@163.com>
Subject: Re: Discussion about NameNode Fine-grained locking

Thanks for your hard work in pushing this forward. It looks good; +1 for merging the phase 1 code. I hope we can work together to promote this major HDFS optimization so that more companies can benefit from it. Thanks everyone~

On Tue, 31 Dec 2024 at 20:33, Ayush Saxena <ayush...@gmail.com> wrote:

+1, Thanx folks for your efforts on this! I didn't have time to review everything thoroughly, but my initial pass suggests it looks good, or at least is safe to merge. If I find some spare time, I'll test it further and submit a ticket or so if I encounter any issues. Good Luck!!!

-Ayush

On Tue, 31 Dec 2024 at 16:39, Hui Fei <feihui.u...@gmail.com> wrote:

Thanks Zander for bringing up this discussion again and doing your best to push it forward. It has really been a long time since the last discussion. It is indeed time; +1 for merging the phase 1 code based on the following points:
- The phase 1 feature has been running at scale within companies for a long time.
- The long-term plan is clear and addresses some questions raised by the community.
- The testing results of future features on memory and performance.

On Tue, 31 Dec 2024 at 15:36, ZanderXu <zande...@apache.org> wrote:

Hi, everyone:

Time to Merge FGL Phase I

The PR for FGL Phase I is ready for merging! Please take a moment to review and cast your vote: https://github.com/apache/hadoop/pull/6762.

FGL Phase I has been running successfully in production for over six months at Shopee and BOSS Zhipin, with no reported performance or stability issues. It's now the right time to merge it into the trunk branch, allowing us to move forward with Phase II.

The global lock remains the default lock mode, but users can enable FGL by configuring dfs.namenode.lock.model.provider.class=org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.

If there are no objections within 7 days, I will propose an official vote.

Performance and Memory Usage of Phase I

Conclusion:
- Fine-grained locks do not lead to significant performance improvements.
- Fine-grained locks do not result in additional memory consumption.

Reasons:
- BM operations heavily depend on FS operations: IBR and BR still acquire the global lock (FSLock and BMLock).
- FS operations depend on BM operations: common operations (create, addBlock, getBlockLocations) also acquire the global lock (FSLock and BMLock).

Phase II will bring significant performance improvements by decoupling the FS and BM dependencies and replacing the global FSLock with a fine-grained IIPLock.
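[Editor's note: for anyone who wants to experiment with Phase I, the switch quoted above would look roughly like the sketch below. This is only an illustration assuming the property name and value exactly as given in this mail; in a real deployment the setting normally goes into hdfs-site.xml on the NameNode rather than into code.]

    import org.apache.hadoop.conf.Configuration;

    public class EnableFglExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The equivalent hdfs-site.xml entry would be a <property> with
        // name dfs.namenode.lock.model.provider.class and the value below.
        conf.set("dfs.namenode.lock.model.provider.class",
            "org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock");
        System.out.println(conf.get("dfs.namenode.lock.model.provider.class"));
      }
    }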
Addressing Common Questions

Thank you all for raising meaningful questions! I have rewritten the design document to improve clarity: https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?usp=sharing

Below is a summary of frequently asked questions and answers.

Question 1: How is the performance of the LockPoolManager?

Performance report:
- Time to acquire a cached lock: 194 ns
- Time to acquire a non-cached lock: 1044 ns
- Time to release an in-use lock: 88 ns
- Time to release an unused lock: 112 ns

Overall performance:
- QPS: over 10 million
- Time to acquire the IIP lock for a path with depth 10:
  - Fully uncached: 10440 ns + 1120 ns (≈ 11 μs)
  - Fully cached: 1940 ns + 1120 ns (≈ 3 μs)

In global lock scenarios, lock wait times are typically in the millisecond range. Therefore, the cost of acquiring and releasing fine-grained locks can be ignored.

Question 2: How much memory does the FGL consume?

Memory consumption: a single LockResource contains a read-write lock and a counter, totaling approximately 200 bytes:
- LockResource: 24 bytes
- ReentrantReadWriteLock: 150 bytes
- AtomicInteger: 16 bytes

Memory usage estimates:
- 10-level directory depth, 100 handlers: 1000 lock resources, approximately 200 KB
- 10-level directory depth, 1000 handlers: 10000 lock resources, approximately 2 MB
- 1,000,000 lock resources: approximately 200 MB

Conclusion: memory consumption is negligible.

Question 3: What happens if no lock is available in the LockPoolManager?

If there are no available LockResources, two solutions are available:
- Return a RetryException, prompting the client to retry later.
- Temporarily increase the lock entity limit, allocate more locks to meet client requests, and use an asynchronous thread to recycle locks periodically.

We can provide multiple LockPoolManager implementations for users to choose from based on their production environments.

Question 4: Regarding the IIPLock lock depth issue, can we consider holding only the first 3 or 4 levels of directory locks?

This approach is not recommended for the following reasons:
- It cannot maximize concurrency.
- The savings in lock acquisition/release time and memory usage are limited, yielding insignificant benefits.

Question 5: How should attributes like StoragePolicy, ErasureCoding, and ACL, which can be set on parent or ancestor directory nodes, be handled?

ErasureCoding and ACL:
- When changing node attributes, hold the corresponding INode's write lock.
- When using ancestor node attributes, hold the corresponding INode's read lock.

StoragePolicy is more complex due to its impact on both directory tree operations and Block operations. To improve performance, commonly used block-related operations (such as BR/IBR) should not acquire the IIPLock.

Detailed design documentation: https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.96lztsl4mwfk

Question 6: How should FGL be implemented for the SNAPSHOT feature?

Since the Rename operation on a SNAPSHOT directory is supported, holding only the write lock of the SNAPSHOT root directory cannot cover the rename case, so the thread safety of SNAPSHOT-related operations cannot be guaranteed that way. It is recommended to use the global FS lock to ensure thread safety.

Detailed design documentation: https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.sm36p6bfcpec

Question 7: How should FGL be implemented for the Symlinks feature?

The target path of a Symlink is a string, and the client performs a second access to resolve the target path, so the fine-grained lock project requires no special handling. For the createSymlink RPC, the FGL needs to acquire the IIPLocks for both the target and link paths.
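[Editor's note: for the createSymlink case above, the usual deadlock-free pattern is to take the two path locks in a canonical order. Below is a hypothetical, heavily simplified Java sketch; the class and method names (TwoPathLocker, lockBoth) are invented for illustration and are not the actual FGL implementation.]

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical helper: acquires write locks for two paths (e.g. the link
    // and the target of createSymlink) in a fixed global order so that two
    // concurrent callers locking the same pair cannot deadlock.
    class TwoPathLocker {
      private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
          new ConcurrentHashMap<>();

      private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
      }

      // Always lock in lexicographic path order, regardless of argument order.
      void lockBoth(String link, String target) {
        String first  = link.compareTo(target) <= 0 ? link : target;
        String second = link.compareTo(target) <= 0 ? target : link;
        lockFor(first).writeLock().lock();
        lockFor(second).writeLock().lock();
      }

      // Unlock in the reverse of the acquisition order.
      void unlockBoth(String link, String target) {
        String first  = link.compareTo(target) <= 0 ? link : target;
        String second = link.compareTo(target) <= 0 ? target : link;
        lockFor(second).writeLock().unlock();
        lockFor(first).writeLock().unlock();
      }
    }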
Question 8: How should FGL be implemented for the reserved feature?

The reserved feature has two usage modes: /.reserved/iNodes/${inode id} and /.reserved/raw/${path}.
- INodeId mode: during the resolvePath phase, obtain the real IIPLock via the INodeId.
- Path mode: during the resolvePath phase, obtain the real IIPLock via the path.

Detailed design documentation: https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.h6rcpzkbpanf

Question 9: Why is INodeFileLock used as the FGL for BlockInfo?

INodeFile and Block have mutual dependencies:
- INodeFile depends on Block for state and size.
- Block depends on INodeFile for state and storage policy.

Therefore, using INodeFileLock as the fine-grained lock for BlockInfo is a reasonable choice.

Detailed design documentation: https://docs.google.com/document/d/1DXkiVxef9wCmICjpZyIQO-yxsgwc4wnf2lTKQ3UXe30/edit?tab=t.0#heading=h.zesd6omuu3kr

Seeking Community Feedback

Your questions and concerns are always welcome. We can discuss them in detail on the Slack channel: https://app.slack.com/client/T4S1WH2J3/C06UDTBQ2SH

Let's work together to advance the Fine-Grained Lock project. I believe this initiative will deliver significant performance improvements to the HDFS community and help reinvigorate its activity.

Wishing everyone a Happy New Year 2025!

On Wed, 5 Jun 2024 at 16:17, ZanderXu <zande...@apache.org> wrote:

I plan to hold a meeting on 2024-06-06 from 3:00 PM to 4:00 PM to share the FGL's motivations and some concerns in detail, in Chinese. The doc is: NameNode Fine-Grained Locking Based On Directory Tree (II). The meeting URL is: https://sea.zoom.us/j/94168001269. You are welcome to join this meeting.

On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:

BTW, there is a Slack channel hdfs-fgl for this feature. You can join it and discuss more details.

Is it necessary to hold a meeting to discuss this, so that we can push it forward quickly? Agreed with ZanderXu, it seems inefficient to discuss details via the mailing list.

On Mon, 6 May 2024 at 23:50, Hui Fei <feihui.u...@gmail.com> wrote:

Thanks all. It seems all the concerns are related to stage 2. We can address these and make it clearer before we start it. From development experience, I think it is reasonable to split this big feature into several stages. Stage 1 is also independent, and it can stand as a minor feature in its own right that uses FS and BM locks instead of the global lock.

On Mon, 29 Apr 2024 at 15:17, ZanderXu <zande...@apache.org> wrote:

Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> for your nice questions. Let me summarize your concerns and the corresponding solutions:

*1. Questions about the Snapshot feature*

It's difficult to apply the FGL to the Snapshot feature, but we can simply use the global FS write lock to make it thread-safe. So if we can identify whether a path involves the snapshot feature, we can just use the global FS write lock to protect it. You can refer to HDFS-17479 <https://issues.apache.org/jira/browse/HDFS-17479> for how to identify it.

Regarding the performance of operations related to the snapshot feature, we can discuss it in two categories. Read operations involving snapshots: the FGL branch uses the global write lock to protect them, while the GLOBAL branch uses the global read lock. It's hard to conclude which version has better performance; it depends on the global lock contention. Write operations involving snapshots: both the FGL and GLOBAL branches use the global write lock to protect them.
It's hard to conclude which version has better performance here either; it also depends on the global lock contention.

So I think that if the NameNode load is low, the GLOBAL branch will perform better than FGL; if the NameNode load is high, the FGL branch may perform better than GLOBAL. It also depends on the ratio of read and write operations on the SNAPSHOT feature. We can do a few things to let end users choose the branch that better fits their business:
- First, make the lock mode selectable, so that end users can choose either FGL or GLOBAL.
- Second, use the global write lock to make operations related to snapshots thread-safe, as I described in HDFS-17479.

*2. Questions about the Symlinks feature*

If a Symlink is related to a snapshot, we can refer to the solution for snapshots; if it is not, I think it fits the FGL easily. Only createSymlink involves two paths; the FGL just needs to lock them in order to make this operation thread-safe. For other operations, a symlink is handled the same as any other normal iNode, right? If I missed any difficult points, please let me know.

*3. Questions about Memory Usage of iNode locks*

There are many ways to limit the memory usage of these iNode locks, such as using a limited-capacity lock pool to bound the maximum memory usage, holding iNode locks only for a fixed depth of directories, etc. We can abstract this LockManager first and then support implementations based on different ideas, so that we can limit the maximum memory usage of these iNode locks. The FGL can acquire or release iNode locks through the LockManager.

*4. Questions about Performance of acquiring and releasing iNode locks*

We can add some benchmarks for the LockManager to test the performance of acquiring and releasing uncontended locks.

*5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*

These policies may be set on an ancestor node and used by some child files. The set operations for these policies will be protected by the directory tree, since they are all file-related operations. Apart from Quota and StoragePolicy, the use of the other policies will also be protected by the directory tree, such as ECPolicy and ACL.

Quota is a little special since its update operations may not be protected by the directory tree; we can assign a lock to each QuotaFeature and use these locks to make updating operations thread-safe. You can refer to HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for detailed information.

StoragePolicy is a little special since it is used not only by file-related operations but also by block-related operations. ProcessExtraRedundancyBlock uses the storage policy to choose redundant replicas, and BlockReconstructionWork uses the storage policy to choose target DNs. In order to maximize the performance improvement, BR and IBR should only involve the iNodeFile to which the currently processed block belongs. These redundancy blocks can be processed by the Redundancy monitor while holding the directory tree locks. You can refer to HDFS-17505 <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed information.

*6. Performance of phase 1*

HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used to do some performance testing for phase 1, and I will complete it later.

Discussing solutions through mail is not efficient; you can create sub-tasks under HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366> to describe your concerns and I will try to give some answers.

Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> again.
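[Editor's note: to make the LockManager abstraction from point 3 above a bit more concrete, here is a hypothetical, simplified sketch of a reference-counted inode lock pool. The names (InodeLockPool, LockResource, acquire/release/evictIdle) are illustrative assumptions, not the actual classes in the FGL branch.]

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical, simplified lock pool keyed by inode id. Each entry holds
    // a read-write lock plus a use counter; entries with a zero count can be
    // recycled by a background sweeper to bound memory usage.
    class InodeLockPool {
      static final class LockResource {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final AtomicInteger inUse = new AtomicInteger();
      }

      private final ConcurrentHashMap<Long, LockResource> pool = new ConcurrentHashMap<>();

      // Pin and return the lock resource for an inode, creating it if absent.
      LockResource acquire(long inodeId) {
        LockResource r = pool.computeIfAbsent(inodeId, id -> new LockResource());
        r.inUse.incrementAndGet();
        return r;
      }

      // Unpin a previously acquired resource.
      void release(long inodeId) {
        LockResource r = pool.get(inodeId);
        if (r != null) {
          r.inUse.decrementAndGet();
        }
      }

      // Periodically called by a sweeper thread: drop entries nobody is using.
      // (A production implementation would also need to guard against races
      // between acquire() and evictIdle(); this sketch ignores that.)
      void evictIdle() {
        pool.entrySet().removeIf(e -> e.getValue().inUse.get() == 0);
      }
    }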
On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com> wrote:

Thanx everyone for chasing this. Great to see some momentum around FGL; that should be a great improvement. I have questions in two broad categories:

* About the process:

I think in the above mails there are mentions that phase one is complete in a feature branch & we are gonna merge that to trunk. If I am catching it right, then you can't hit the merge button like that. To merge a feature branch, you need to call for a vote specific to that branch & it requires 3 binding votes to merge, unlike any other code change which requires 1. It is there in our Bylaws. So, do follow the process.

* About the feature itself: (a very quick look at the doc and the Jira, so please take it with a grain of salt)

* The Google Drive link that you folks shared as part of the first mail: I don't have access to that. So, please open up the permissions for that doc or share a new link.
* Chasing the design doc present on the Jira: I think we only have Phase-1 ready, so can you share some metrics just for that? Perf improvements just with splitting the FS & BM locks.
* The memory implications of Phase-1? I don't think there should be any major impact on memory in the case of just Phase-1.
* Regarding the snapshot stuff, you mentioned taking a lock on the root itself? Does just taking a lock on the snapshot root rather than the FS root work?
* Secondly, about the usage of Snapshots or Symlinks: I don't think we should operate under assumptions about whether they are widely used or not; we might just not know the folks who use them widely, or they are just users, not the ones contributing. We can just accept for now that in those cases it isn't optimised and we just lock the entire FS space, which it does even today, so no regressions there.
* Regarding memory usage: do you have some numbers on how much the memory footprint increases?
* Under the Lock Pool: I think you are assuming there would be very few inodes where a lock would be required at any given time, so there won't be too much heap consumption? I think you are compromising on horizontal scalability here. I suspect that if your assumption doesn't hold true, then under heavy read load from concurrent clients accessing different inodes, the Namenode will start having memory trouble, and that would do more harm than good. Anyway, the Namenode heap is a way bigger problem than anything else, so we should be very careful about increasing the load over there.
* For the locks on the inodes: do you plan to have locks for each inode? Can we somehow limit that to the depth of the tree? Like, currently we take the lock on the root; have a config which makes us take the lock at level 2 or 3 (configurable). That might fetch some perf benefits and could be used to control the memory usage as well.
* What is the cost of creating these inode locks? If the lock isn't already cached it would incur some cost? Do you have some numbers around that? Say I disable caching altogether & then let a test load run, what do the perf numbers look like in that case?
* I think we need to limit the size of the INodeLockPool; we can't let it grow infinitely in case of heavy loads, and we need to have some auto-throttling mechanism for it.
* I didn't catch your Storage Policy problem.
If I decode it right, the problem is that the policy could be set on an ancestor node & the children abide by it, and that is the problem; if that is the case, then isn't that also the case with ErasureCoding policies or even ACLs or so? Can you elaborate a bit on that?

Anyway, regarding Phase-1: if you share (the perf numbers with proper details + the impact on memory, if any) for just phase 1 & they are good, then if you call for a branch merge vote for Phase-1 FGL, you have my vote; however, you'll need to sway the rest of the folks on your own :-)

Good Luck, Nice Work Guys!!!

-Ayush

On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org> wrote:

Thanks ZanderXu and Hui Fei for your work on this feature. It will be a very helpful improvement for the HDFS module going forward.

1. If we need any more review bandwidth, I would like to be involved to help review if possible.
2. The design document is still missing some detailed descriptions, such as snapshots, symbolic links and reserved paths etc., as mentioned above. I think it will be helpful for newbies who want to get involved if all corner cases are considered and described.
3. From Slack, we plan to check into trunk at this phase. I am not sure if it is the proper time; following the dev plan there are two steps left to finish this feature per the design document, right? If so, I think we should postpone checking in until all plans are ready. Considering that there have been many unfinished attempts at this feature in history, I think postponing the check-in will be the safer way; otherwise it will involve more rebase cost if you keep a separate dev branch, though I think that is not a difficult thing for you.

Good luck and look forward to making that happen soon!

Best Regards,
- He Xiaoqiao

On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com> wrote:

Thanks for the interest and advice on this. I would just like to share some info here.

ZanderXu leads this feature and he has spent a lot of time on it. He is the main developer in stage 1. Yuanboliu and Kokonguyen191 also took on some tasks. Other developers (slfan1989, haiyang1987, huangzhaobo99, RocMarshal, kokonguyen191) helped review PRs. (Forgive me if I missed someone.) Actually haiyang1987, Yuanboliu and Kokonguyen191 are also very familiar with this feature. We discussed many details offline.

More people interested in joining the development and review of stages 2 and 3 are welcome.

On Fri, 26 Apr 2024 at 14:56, Zengqiang XU <xuzengqiang5...@gmail.com> wrote:

Thanks Shilun for your response:

1. This is a big and very useful feature, so it really needs more developers to get on board.
2. This fine-grained lock has been implemented on internal branches and has benefited many companies, such as Meituan, Kuaishou, Bytedance, etc. But it has not been contributed to the community for various reasons, for example: there is a big difference between the versions of the internal branches and the community trunk branch; the internal branches may leave out some functionality to keep the FGL simple; and the contribution needs a lot of work and will take a long time. It means that this solution has already been practiced in their prod environments. We have also practiced it in our prod environment and gained benefits, and we are willing to spend a lot of time contributing it to the community.
3. Regarding the benchmark testing, we don't need to pay too much attention to whether the performance is improved by 5 times, 10 times or 20 times, because there are too many factors that affect it.
4. As I described above, this solution is already being practiced by many companies. Right now, we just need to think about how to implement it with high quality and more comprehensively.
5. I firmly believe that all problems can be solved as long as the overall solution is right.
6. I can spend a lot of time leading the promotion of this entire feature, and I hope more people can join us in promoting it.
7. You are always welcome to raise your concerns.

Thanks Shilun again; I hope you can help review the designs and PRs. Thanks.

On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org> wrote:

Thank you for your hard work! This is a very meaningful improvement, and from the design document we can see a significant increase in HDFS read/write throughput. I am happy to see the progress made on HDFS-17384. However, I still have some concerns, which roughly involve the following aspects:

1. While ZanderXu and Hui Fei have deep expertise in HDFS and are familiar with the related development details, we still need more community members to review the code to ensure that the relevant upgrades meet expectations.
2. We need more details on the benchmarks to ensure that test results can be reproduced and to allow more community members to participate in the testing process.

Looking forward to everything going smoothly in the future.

Best Regards,
- Shilun Fan.

On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <hexiaoq...@apache.org> wrote:

cc private@h.a.o.

On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <zande...@apache.org> wrote:

Here are some summaries about the first phase:

1. There are no big changes in this phase.
2. This phase just uses an FS lock and a BM lock to replace the original global lock.
3. It's useful for improving performance, since some operations only need to hold the FS lock or the BM lock instead of the global lock.
4. This feature is turned off by default; you can enable it by setting dfs.namenode.lock.model.provider.class to org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock.
5. This phase is very important for the ongoing development of the entire FGL.

Here I would like to express my special thanks to @kokonguyen191 and @yuanboliu for their contributions. And you are also welcome to join us and complete it together.

On Wed, 24 Apr 2024 at 14:54, ZanderXu <zande...@apache.org> wrote:

Hi everyone,

All subtasks of the first phase of the FGL have been completed, and I plan to merge them into the trunk and start the second phase based on the trunk.

Here is the PR used to merge the first phase into trunk: https://github.com/apache/hadoop/pull/6762
Here is the ticket: https://issues.apache.org/jira/browse/HDFS-17384

I hope you can help to review this PR when you are available and give some ideas. HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385> is used for the second phase, and I have created some subtasks to describe solutions for some problems, such as: snapshot, getListing, quota. You are welcome to join us and complete it together.

---------- Forwarded message ---------
From: Zengqiang XU <zande...@apache.org>
Date: Fri, 2 Feb 2024 at 11:07
Subject: Discussion about NameNode Fine-grained locking
To: <hdfs-dev@hadoop.apache.org>
Cc: Zengqiang XU <xuzengqiang5...@gmail.com>

Hi everyone,

I have started a discussion about NameNode Fine-grained Locking to improve the performance of write operations in the NameNode.

I started this discussion again for several main reasons:

1. We have implemented it and gained nearly 7x performance improvement in our prod environment.
2. Many other companies have made similar improvements based on their internal branches.
3. This topic has been discussed for a long time, but still without any result.

I hope we can push this important improvement forward in the community so that all end users can enjoy this significant improvement.

I'd really appreciate it if you could join in and work with me to push this feature forward. Thanks very much.

Ticket: HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366>
Design: NameNode Fine-grained locking based on directory tree <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org