On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> # Benchmark results
>
> * Test-1: parallel heap scan on the table without indexes
>
> I created a 20GB table, made garbage on it, and ran vacuum while
> changing the parallel degree:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> Here are the results (total time and heap scan time):
>
> PARALLEL 0: 21.99 s (single process)
> PARALLEL 1: 11.39 s
> PARALLEL 2: 8.36 s
> PARALLEL 3: 6.14 s
> PARALLEL 4: 5.08 s
>
> * Test-2: parallel heap scan on the table with one index
>
> I used a table similar to the one in test case 1 but created one btree
> index on it:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> create index on test (a);
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> I've measured the total execution time as well as the time of each
> vacuum phase (from left: heap scan time, index vacuum time, and heap
> vacuum time):
>
> PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
>
> Overall, I can see that the parallel heap scan in lazy vacuum has decent
> scalability; in both test-1 and test-2, the execution time of the heap
> scan got ~4x faster with 4 parallel workers. On the other hand, when
> it comes to the total vacuum execution time, I could not see much
> performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> (21.89 vs. 12.75) whereas the index vacuum got slower (16.74 vs. 22.04),
> and the heap scan in test-2 was not as fast as in test-1 with 1 parallel
> worker (12.75 vs. 11.39).
>
> I think the reason is that the shared TidStore is not very scalable since
> we have a single lock on it. In all cases in test-1, we don't use the
> shared TidStore since all dead tuples are removed during heap pruning,
> so the scalability was better overall than in test-2. In the parallel 0
> case in test-2, we use the local TidStore, and from a parallel degree of
> 1 in test-2, we use the shared TidStore and parallel workers concurrently
> update it. Also, I guess that the lookup performance of the local
> TidStore is better than the shared TidStore's lookup performance because
> of the differences between a bump context and a DSA area. I think this
> difference contributed to the fact that index vacuuming got slower
> (16.74 vs. 22.04).
>
> There are two obvious improvement ideas to improve the overall vacuum
> execution time: (1) improve the shared TidStore scalability and (2)
> support parallel heap vacuum. For (1), several ideas are proposed by
> the ART authors[1]. I've not tried these ideas but they might be
> applicable to our ART implementation. But I prefer to start with (2)
> since it would be easier. Feedback is very welcome.
>
Starting with (2) sounds like a reasonable approach. We should study a
few more things, though: (a) the performance results when there are 3-4
indexes, and (b) the reason for the performance improvement seen with
heap scans alone. We normally get the benefits of parallelism from using
multiple CPUs, but parallelizing scans (I/O) shouldn't give much benefit.
Is it possible that you are seeing benefits because most of the data is
either in shared_buffers or in memory? We can probably try vacuuming the
tables after restarting the nodes to ensure the data is not in memory.
Rough sketches for both points are below.

--
With Regards,
Amit Kapila.
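
For (a), one way to extend the quoted setup would be to create a few more
indexes before the delete. The extra index definitions here are just
arbitrary examples to add index-vacuum work, not a prescribed set:

create index test_a_btree on test (a);
create index test_a_hash on test using hash (a);
create index test_a_expr on test ((a % 100));

For (b), a rough way to check how much of the table is already resident in
shared_buffers before running vacuum is the pg_buffercache extension
(assuming it is available and the table is named test as in the quoted
script):

create extension if not exists pg_buffercache;

select count(*) * current_setting('block_size')::int / (1024 * 1024) as cached_mb
from pg_buffercache
where reldatabase = (select oid from pg_database where datname = current_database())
  and relfilenode = pg_relation_filenode('test');

If that shows most of the 20GB table cached, restarting the server (and
dropping the OS page cache) before each run should make the I/O cost of
the heap scan visible.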