On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> # Benchmark results
>
> * Test-1: parallel heap scan on the table without indexes
>
> I created a 20GB table, made garbage on it, and ran vacuum while
> changing the parallel degree:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> Here are the results (total time and heap scan time):
>
> PARALLEL 0: 21.99 s (single process)
> PARALLEL 1: 11.39 s
> PARALLEL 2: 8.36 s
> PARALLEL 3: 6.14 s
> PARALLEL 4: 5.08 s
>
> * Test-2: parallel heap scan on the table with one index
>
> I used a table similar to the one in test case 1 but created one btree
> index on it:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> create index on test (a);
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> I've measured the total execution time as well as the time of each
> vacuum phase (from left: heap scan time, index vacuum time, and heap
> vacuum time):
>
> PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
>
> Overall, I can see that the parallel heap scan in lazy vacuum has decent
> scalability; in both test-1 and test-2, the execution time of the heap
> scan got ~4x faster with 4 parallel workers. On the other hand, when
> it comes to the total vacuum execution time, I could not see much
> performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> (21.89 vs. 12.75) whereas the index vacuum got slower (16.74 vs. 22.04),
> and the heap scan in test-2 was not as fast as in test-1 with 1 parallel
> worker (12.75 vs. 11.39).
>
> I think the reason is that the shared TidStore is not very scalable since
> we have a single lock on it. In all cases in test-1, we don't use the
> shared TidStore since all dead tuples are removed during heap pruning,
> so the scalability was better overall than in test-2. In the parallel 0
> case in test-2, we use the local TidStore, and from a parallel degree of
> 1 in test-2, we use the shared TidStore and parallel workers concurrently
> update it. Also, I guess that the lookup performance of the local
> TidStore is better than the shared TidStore's lookup performance because
> of the differences between a bump context and a DSA area. I think this
> difference contributed to the fact that index vacuuming got slower
> (16.74 vs. 22.04).
>
> There are two obvious improvement ideas to improve the overall vacuum
> execution time: (1) improve the shared TidStore scalability and (2)
> support parallel heap vacuum. For (1), several ideas are proposed by
> the ART authors[1]. I've not tried these ideas but they might be
> applicable to our ART implementation. But I prefer to start with (2)
> since it would be easier. Feedback is very welcome.
>
Starting with (2) sounds like a reasonable approach. We should study a
few more things, though: (a) the performance results when there are 3-4
indexes, and (b) the reason for the performance improvement seen with
heap scans alone. We normally get the benefits of parallelism from using
multiple CPUs, but parallelizing scans (I/O) shouldn't give much benefit.
Is it possible that you are seeing benefits because most of the data is
either in shared_buffers or in memory? We can probably try vacuuming the
tables after restarting the nodes to ensure the data is not in memory.
Rough sketches for both points are below.

--
With Regards,
Amit Kapila.
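
For (a), one way to extend the quoted setup would be to create a few more
indexes before the delete. The extra index definitions here are just
arbitrary examples to add index-vacuum work, not a prescribed set:

create index test_a_btree on test (a);
create index test_a_hash on test using hash (a);
create index test_a_expr on test ((a % 100));

For (b), a rough way to check how much of the table is already resident in
shared_buffers before running vacuum is the pg_buffercache extension
(assuming it is available and the table is named test as in the quoted
script):

create extension if not exists pg_buffercache;

select count(*) * current_setting('block_size')::int / (1024 * 1024) as cached_mb
from pg_buffercache
where reldatabase = (select oid from pg_database where datname = current_database())
  and relfilenode = pg_relation_filenode('test');

If that shows most of the 20GB table cached, restarting the server (and
dropping the OS page cache) before each run should make the I/O cost of
the heap scan visible.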