Re: [RFC] Lock-free XLog Reservation from WAL

Zhou, Zhiguo Mon, 27 Jan 2025 01:41:15 -0800



On 1/26/2025 10:59 PM, Yura Sokolov wrote:

24.01.2025 12:07, Japin Li пишет:
On Thu, 23 Jan 2025 at 21:44, Japin Li <[email protected]> wrote:
On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <[email protected]>wrote:
23.01.2025 11:46, Japin Li пишет:
On Wed, 22 Jan 2025 at 22:44, Japin Li <[email protected]> wrote:
On Wed, 22 Jan 2025 at 17:02, Yura Sokolov<[email protected]> wrote:
I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

          pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
          pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
    of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines"strategy.
You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3and see
if it affects measurably.
Thanks for your quick fixing.  I will retest it tomorrow.
Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case                          | min        | avg        | max
|
|-------------------------------+------------+------------+------------|| master (44b61efb79) | 865,743.55 | 871,237.40 |874,492.59 || v3 | 857,020.58 | 860,180.11 |864,355.00 || v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 |858,436.44 || v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 |865,396.42 |
It seems there are some performance decreases :( or something I
missed?
Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).
Never mind!
v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better thanany
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".
And even this way I don't claim "Lock-free reservation" gives anyprofit.
That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?
Hi, Yura Sokolov
Here is the test result compared with NUM_XLOGINSERT_LOCKS and the v3patch.
| case | min | avg | max |rate% ||-----------------------+--------------+--------------+--------------+-------|| master (4108440) | 891,225.77 | 904,868.75 | 913,708.17 | || lock 64 | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 |11.84 || lock 64 attempt 1 | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 |12.47 || lock 64 attempt 2 | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 |12.52 || lock 128 | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 |12.07 || lock 128 attempt 1 | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 |13.09 || lock 128 attempt 2 | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 |12.61 || lock 64 v3 | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 |13.03 || lock 64 attempt 1 v3 | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 |12.70 || lock 64 attempt 2 v3 | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 |12.54 || lock 128 v3 | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 |12.02 || lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 |12.38 || lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 |13.24 |

The data looks really interesting and I recognize the need for furtherinvestigation. I'm not very familiar with BenchmarkSQL but we've donesimilar tests with HammerDB/TPCC by solely increasingNUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significantperformance drop of ~50% and the cycle ratio of spinlock acquisition(s_lock) rose to over 60% of the total, which is basically consistentwith the previous findings in [1].

Could you please share the details of your test environment, includingthe device, configuration, and test approach, so we can collaborate onunderstanding the differences?


Sorry for pause, it was my birthday, so I was on short vacation.

So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seems to help a bit in this benchmark,
   but it helps more in other (rather synthetic) benchmark [1]
- my version of lock-free reservation looks to help a bit when
   applied alone, but look strange in conjunction with additional
   lock attempts.

I don't see small improvement from my version of Lock-Free reservation
(1.1% = 1023/1012) pays for its complexity at the moment.

Due to limited hardware resources, I only had the opportunity to measurethe performance impact of your v1 patch of the lock-free hash table with64 NUM_XLOGINSERT_LOCKS and the two lock attempt patch. I observed animprovement of *76.4%* (RSD: 4.1%) when combining them together on theSPR with 480 vCPUs. I understand that your test devices may not have asmany cores, which might be why this optimization brings an unnoticeableimpact. However, I don't think this is an unreal problem. In fact, thisissue was raised by our customer who is trying to deploy Postgres ondevices with hundreds of cores, and I believe the resolution of thisperformance issue would result in real impacts.


Probably, when other places will be optimized/improved, it will pay
more.

Or probably Zhiguo Zhou's version will perform better.

Our primary difference lies in the approach to handling the prev-link,either via the hash table or directly within the XLog buffer. During myanalysis, I didn't identify significant hotspots in the hash tablefunctions, leading me to believe that both implementations shouldachieve comparable performance improvements.

Following your advice, I revised my implementation to update theprev-link atomically and resolved the known TAP tests. However, Iencountered the last failure in the recovery/t/027_stream_regress.pltest. Addressing this issue might require a redesign of the underlyingwriting convention of XLog, which I believe is not necessary, especiallysince your implementation already achieves the desired performanceimprovements without suffering from the test failures. I think we mayneed to focus on your implementation in the next phase.

I think, we could measure theoretical benefit by completely ignoring
fill of xl_prev. I've attached patch "Dumb-lock-free..." so you could
measure. It passes almost all "recovery" tests, though fails two
strictly dependent on xl_prev.

I currently don't have access to the high-core-count device, but I planto measure the performance impact of your latest patch and the"Dump-lock-free..." patch once I regain access.

[1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
------

regards
Yura


Hi Yura and Japin,

Thanks so much for your recent patch works and discussions whichinspired me a lot! I agree with you that we need to:

- Align the test approach and environment
- Address the motivation and necessity of this optimization

- Further identify the optimization opportunities after applying Yura'spatch


WDYT?

[1]https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj


Regards,
Zhiguo

Re: [RFC] Lock-free XLog Reservation from WAL

Reply via email to