Re: Support tid range scan in parallel?

David Rowley Mon, 29 Apr 2024 16:15:07 -0700

On Tue, 30 Apr 2024 at 10:36, Cary Huang <cary.hu...@highgo.ca> wrote:
> In one of our migration scenarios, we rely on tid range scan to migrate huge 
> table from one database to another once the lower and upper ctid bound is 
> determined. With the support of parallel ctid range scan, this process could 
> be done much quicker.


I would have thought that the best way to migrate would be to further
divide the TID range into N segments and run N queries, one per
segment to get the data out.
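
A minimal sketch of that segment-splitting idea, in Python. The function
name tid_range_queries and the table name are hypothetical, and it assumes
you already know the table's block count (for example from
pg_relation_size() divided by the block_size setting). The last segment is
deliberately left without an upper bound so that blocks added after the
size was measured are still covered.

```python
def tid_range_queries(table, total_blocks, n_segments):
    """Return one SELECT per contiguous block range of the table.

    Assumption: total_blocks is the heap size in blocks, measured
    beforehand. Names here are illustrative, not from the patch.
    """
    if total_blocks < 0 or n_segments < 1:
        raise ValueError("total_blocks must be >= 0 and n_segments >= 1")

    # Never create more segments than there are blocks (but at least one).
    n = min(n_segments, max(total_blocks, 1))
    base, extra = divmod(total_blocks, n)

    queries = []
    start = 0
    for i in range(n):
        # Spread the remainder over the first `extra` segments.
        end = start + base + (1 if i < extra else 0)
        where = f"ctid >= '({start},0)'::tid"
        if i < n - 1:
            # Half-open range [start, end); the last segment is open-ended.
            where += f" AND ctid < '({end},0)'::tid"
        queries.append(f"SELECT * FROM {table} WHERE {where}")
        start = end
    return queries


if __name__ == "__main__":
    for q in tid_range_queries("big_table", 10, 3):
        print(q)
```

Each generated query can then be run on its own connection, so the N
segments are read concurrently without any tuples being pushed between
worker processes.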

From a CPU point of view, I'd find it hard to imagine that a SELECT * query
without any other items in the WHERE clause other than the TID range
quals would run faster with multiple workers than with 1.  The problem
is the overhead of pushing tuples to the main process often outweighs
the benefits of the parallelism.  However, from an I/O point of view
on a server with slow enough disks, I can imagine there'd be a
speedup.

> The attached patch is my approach to add parallel ctid range scan to 
> PostgreSQL's planner and executor. In my tests, I do see an increase in 
> performance using parallel tid range scan over the single worker tid range 
> scan and it is also faster than parallel sequential scan covering similar 
> ranges. Of course, the table needs to be large enough to reflect the 
> performance increase.
>
> below is the timing to complete a select query covering all the records in a 
> simple 2-column table with 40 million records,
>
>  - tid range scan takes 10216ms
>  - tid range scan with 2 workers takes 7109ms
>  - sequential scan with 2 workers takes 8499ms

Can you share more details about this test, i.e. the query and how the
times were measured (EXPLAIN ANALYZE, SELECT, or COPY)?
Also, which version/commit did you patch against?  I was wondering if
the read stream code added in v17 would result in the serial case
running faster because the parallelism just resulted in more I/O
concurrency.

Of course, it may be beneficial to have parallel TID Range for other
cases when more row filtering or aggregation is being done as that
requires pushing fewer tuples over from the parallel worker to the
main process. It would just be good to get to the bottom of whether
there's still any advantage to parallelism when no filtering other than
the ctid quals is being done, now that we've less chance of having to
wait for I/O coming from disk with the read stream code.

David
