On 2/11/2025 9:25 AM, Japin Li wrote:
On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.z...@intel.com> wrote:
On 2/5/2025 4:32 PM, Japin Li wrote:
On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.z...@intel.com> wrote:
On 1/26/2025 10:59 PM, Yura Sokolov wrote:
24.01.2025 12:07, Japin Li writes:
On Thu, 23 Jan 2025 at 21:44, Japin Li <japi...@hotmail.com> wrote:
On Thu, 23 Jan 2025 at 15:03, Yura Sokolov
<y.soko...@postgrespro.ru> wrote:
23.01.2025 11:46, Japin Li writes:
On Wed, 22 Jan 2025 at 22:44, Japin Li <japi...@hotmail.com> wrote:
On Wed, 22 Jan 2025 at 17:02, Yura Sokolov
<y.soko...@postgrespro.ru> wrote:
I believe I know why it happens: I was in a hurry making v2 by
cherry-picking the internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add the modulo by
PREV_LINKS_HASH_CAPA.

Here's the fix:

     pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-    pos->pos[1] = pos->pos[0] + 1;
+    pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
     pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-    pos->pos[3] = pos->pos[2] + 2;
+    pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
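To illustrate why the modulo matters (a standalone sketch, not the patch code itself; the struct name and the capacity value here are just placeholders): whenever pos[0] or pos[2] lands near the end of the table, the +1/+2 offsets must wrap around, otherwise they index past the last bucket.

#include <stdint.h>

#define PREV_LINKS_HASH_CAPA 256        /* placeholder capacity, for illustration */

typedef struct CuckooPositions
{
    uint32_t    pos[4];                 /* four candidate buckets per hash */
} CuckooPositions;

static inline void
CalcCuckooPositionsSketch(CuckooPositions *pos, uint32_t hash)
{
    pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
    pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;  /* wrap: pos[0]+1 may equal CAPA */
    pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
    pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;  /* wrap: pos[2]+2 may exceed CAPA */
}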

Anyway, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was the source
  of the white-space apply warnings.
- this mistake fixed
- clearer slot strategies + an "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects the results measurably.

Thanks for your quick fix.  I will retest it tomorrow.
Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case                          | min        | avg        | max        |
|-------------------------------+------------+------------+------------|
| master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
| v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there is some performance decrease :( or is there something I
missed?


Hi, Japin.
(Excuse me for the duplicate message; I found I had sent it only to you
the first time.)

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters, a spin-lock is certainly better than
any more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any
profit.

That is why your benchmarking is very valuable! It could answer whether
we need such a not-small patch, or whether there is no real problem at all.


Hi, Yura Sokolov

Here are the test results comparing increased NUM_XLOGINSERT_LOCKS values
and the v3 patch.

| case                  | min          | avg          | max          | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440)      | 891,225.77   | 904,868.75   | 913,708.17   |       |
| lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

The data looks really interesting, and I recognize the need for further
investigation. I'm not very familiar with BenchmarkSQL, but we've done
similar tests with HammerDB/TPC-C by solely increasing
NUM_XLOGINSERT_LOCKS from 8 to 128: we observed a significant
performance drop of ~50%, and the cycle ratio of spinlock acquisition
(s_lock) rose to over 60% of the total, which is basically consistent
with the previous findings in [1].

Could you please share the details of your test environment, including
the device, configuration, and test approach, so we can collaborate on
understanding the differences?
Sorry for the late reply; I'm on vacation.
I use a Hygon C86 7490 with 64 cores; it has 8 NUMA nodes and 1.5 TB of
memory. I use NUMA nodes 0-3 to run the database and nodes 4-7 to run
BenchmarkSQL.
Here are my database settings:
listen_addresses = '*'
max_connections = '1050'
shared_buffers = '100GB'
work_mem = '64MB'
maintenance_work_mem = '512MB'
max_wal_size = '50GB'
min_wal_size = '10GB'
random_page_cost = '1.1'
wal_buffers = '1GB'
wal_level = 'minimal'
max_wal_senders = '0'
wal_sync_method = 'open_datasync'
wal_compression = 'lz4'
track_activities = 'off'
checkpoint_timeout = '1d'
checkpoint_completion_target = '0.95'
effective_cache_size = '300GB'
effective_io_concurrency = '32'
update_process_title = 'off'
password_encryption = 'md5'
huge_pages = 'on'

Sorry for the pause; it was my birthday, so I was on a short vacation.
So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seem to help a bit in this benchmark,
  but they help more in another (rather synthetic) benchmark [1]
- my version of lock-free reservation looks to help a bit when
  applied alone, but looks strange in conjunction with additional
  lock attempts.
I don't think the small improvement from my version of Lock-Free
reservation (1.1% = 1023/1012) pays for its complexity at the moment.

Due to limited hardware resources, I only had the opportunity to
measure the performance impact of your v1 lock-free hash table patch
combined with NUM_XLOGINSERT_LOCKS=64 and the two-lock-attempt patch. I
observed an improvement of *76.4%* (RSD: 4.1%) when combining them on
the SPR machine with 480 vCPUs. I understand that your test devices may
not have as many cores, which might be why this optimization shows
little impact there. However, I don't think this is an unreal problem.
In fact, this issue was raised by a customer who is trying to deploy
Postgres on machines with hundreds of cores, and I believe resolving
this performance issue would have a real impact.

Probably, when other places are optimized/improved, it will pay off
more.
Or perhaps Zhiguo Zhou's version will perform better.


Our primary difference lies in the approach to handling the prev-link,
either via the hash table or directly within the XLog buffer. During
my analysis, I didn't identify significant hotspots in the hash table
functions, leading me to believe that both implementations should
achieve comparable performance improvements.

Following your advice, I revised my implementation to update the
prev-link atomically and resolved the known TAP test failures. However,
one failure remains, in the recovery/t/027_stream_regress.pl test.
Addressing it might require a redesign of the underlying XLog writing
convention, which I believe is not necessary, especially since your
implementation already achieves the desired performance improvements
without suffering from the test failures. I think we may need to focus
on your implementation in the next phase.
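To make that concrete, the rough shape of the idea is sketched below. This is an illustration only, not code from either patch: the struct and function names are made up, and it only assumes the generic pg_atomic_* primitives from port/atomics.h. It shows the core of the lock-free reservation (a single atomic fetch-add) and, in a deliberately naive way, why filling xl_prev with the previous record's start is the hard part that the hash table (your version) or the in-buffer prev-link update (my attempt) has to solve.

#include "postgres.h"
#include "port/atomics.h"

/* Illustrative shared state, not the real XLogCtlInsert. */
typedef struct InsertSketch
{
    pg_atomic_uint64 CurrBytePos;   /* next free byte position in WAL */
    pg_atomic_uint64 LastRecStart;  /* start of the most recently reserved record */
} InsertSketch;

static void
ReserveSketch(InsertSketch *ins, uint64 rec_size,
              uint64 *start_pos, uint64 *prev_pos)
{
    /* Claim rec_size bytes; the old value is where our record starts. */
    *start_pos = pg_atomic_fetch_add_u64(&ins->CurrBytePos, rec_size);

    /*
     * xl_prev must point at the *start* of the previous record, which the
     * bare fetch-add does not give us.  Reading and publishing it like this
     * is racy under concurrency; that gap is exactly what the lock-free
     * hash table (your patches) or the deferred in-buffer prev-link update
     * (my version) is meant to close.
     */
    *prev_pos = pg_atomic_read_u64(&ins->LastRecStart);
    pg_atomic_write_u64(&ins->LastRecStart, *start_pos);
}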

I think we could measure the theoretical benefit by completely skipping
the fill of xl_prev. I've attached the "Dumb-lock-free..." patch so you
can measure it. It passes almost all "recovery" tests, though it fails
the two that strictly depend on xl_prev.


I currently don't have access to the high-core-count device, but I
plan to measure the performance impact of your latest patch and the
"Dumb-lock-free..." patch once I regain access.
[1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
------
regards
Yura

Hi Yura and Japin,

Thanks so much for your recent patch work and discussions, which have
inspired me a lot! I agree with you that we need to:
- Align the test approach and environment
- Address the motivation and necessity of this optimization
- Further identify the optimization opportunities after applying
  Yura's patch

WDYT?

[1]
https://www.postgresql.org/message-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt%40lkrn4maox2wj

Regards,
Zhiguo


Hi Japin,

Apologies for the delay in responding—I've just returned from
vacation. To move things forward, I'll be running the BenchmarkSQL
workload on my end shortly.

In the meantime, could you run the HammerDB/TPCC workload on your
device? We've observed significant performance improvements with this
test, and it might help clarify whether the discrepancies we're seeing
stem from the workload itself. Thanks!


Sorry, I currently don't have access to the test device; I will try to test
it if I regain access.


Good day, Yura and Japin!

I recently acquired the SUT device again and had the opportunity to conduct performance experiments using the TPC-C benchmark (pg_count_ware 757, vu 256) with HammerDB on an Intel CPU with 480 vCPUs. Below are the results and key observations:

+----------------+-------------+------------+-------------+------------+
| Version        | NOPM        | NOPM Gain% | TPM         | TPM(Gain%) |
+----------------+-------------+------------+-------------+------------+
| master(b4a07f5)|  1,681,233  | 0.0%       |  3,874,491  | 0.0%       |
| 64-lock        |  643,853    | -61.7%     |  1,479,647  | -61.8%     |
| 64-lock-v4     |  2,423,972  | 44.2%      |  5,577,580  | 44.0%      |
| 128-lock       |  462,993    | -72.5%     |  1,064,733  | -72.5%     |
| 128-lock-v4    |  2,468,034  | 46.8%      |  5,673,349  | 46.4%      |
+----------------+-------------+------------+-------------+------------+

- Though the baseline (b4a07f5) has improved compared to when we started this thread, we still achieve a 44% improvement with this optimization.
- Increasing NUM_XLOGINSERT_LOCKS to 64/128 on its own leads to a severe performance regression due to intensified lock contention.
- Increasing NUM_XLOGINSERT_LOCKS and applying the lock-free xlog insertion optimization jointly improves overall performance.
- 64 locks seems to be the sweet spot for achieving the most performance improvement.

I also executed the same benchmark, TPC-C, with BenchmarkSQL (I'm not sure whether differences between their TPC-C implementations would lead to a performance gap). I observed that:

- The performance indicator (NOPM) differs by several orders of magnitude from Japin's results.
- NOPM/TPM seems insensitive to the code changes (lock count increase, lock-free algorithm), which is quite strange.
- Possible reasons include: 1) the scaling parameters [1] are not aligned, 2) the test configuration did not reach the pain point of XLog insertions.

I also noticed that a 64-core device (32 cores for the server) was used in Japin's test. In our previous core-scaling test (attached), 32/64 cores may not be enough to show the impact of the optimization; I think that could be one of the reasons why Japin observed minimal impact from the lock-free optimization.

In summary, I think:
- The TPC-C benchmark (pg_count_ware 757, vu 256) with HammerDB effectively reflects performance in XLog insertions.
- This test on a device with hundreds of cores reflects a real user scenario, making it a significant consideration.
- The lock-free algorithm with the lock count increased to 64 can bring significant performance improvements.

So I propose to continue the code review process for this optimization patch. WDYT?

Regards,
Zhiguo

[1] https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md#scaling-parameters
