On 02/11/2020 09:10, Heikki Linnakangas wrote:
> On 02/11/2020 08:14, Amit Kapila wrote:
>> We have discussed both these approaches (a) single producer multiple
>> consumer, and (b) all workers doing the processing as you are saying
>> in the beginning and concluded that (a) is better, see some of the
>> relevant emails [1][2][3].
>>
>> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
>> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
>> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads. The alternative discussed there seems to
> be something much more fine-grained, where processes claim individual
> lines. I'm not sure, though; I didn't fully understand the
> alternative designs.

I read the thread more carefully, and I think Robert had basically the
right idea here
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):

> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. [...]

And here
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):

> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below a
>> few hundred kb. If there's any sort of natural clustering in the
>> loaded data - extremely common, think timestamps - splitting on a
>> granular basis will make indexing much more expensive. And have a lot
>> more contention.

> That's a fair point. I think the solution ought to be that once any
> process starts finding line endings, it continues until it's grabbed
> at least a certain amount of data for itself. Then it stops and lets
> some other process grab a chunk of data.
Yes! That's pretty close to the design I sketched. I imagined that the
leader would divide the input into 64 kB blocks, and each block would
have a few metadata fields, notably the starting position of the first
line in the block. I think Robert envisioned having a single "next
starting position" field in shared memory. That works too, and is even
simpler, so +1 for that.
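
To make that concrete, here is a minimal standalone sketch of the
claiming protocol as I understand it. All the names here (ChunkCursor,
claim_chunk, MIN_CLAIM) are made up for illustration, it uses plain
pthreads rather than our shared memory and locking primitives, and it
assumes the whole input is in memory and that a newline always ends a
line (which real CSV parsing of course can't assume):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* keep claims coarse, per Andres's concern about granularity */
    #define MIN_CLAIM (64 * 1024)

    typedef struct ChunkCursor
    {
        pthread_mutex_t lock;       /* serializes line-ending scans */
        size_t      next_start;     /* offset of first unclaimed byte */
        const char *buf;            /* the input data */
        size_t      len;
    } ChunkCursor;

    /*
     * Claim the next chunk: starting at next_start, scan forward for
     * line endings until at least MIN_CLAIM bytes are covered, then
     * advance next_start past the last complete line.  Returns false
     * when the input is exhausted.
     */
    static bool
    claim_chunk(ChunkCursor *cur, size_t *start, size_t *end)
    {
        pthread_mutex_lock(&cur->lock);
        *start = cur->next_start;
        if (*start >= cur->len)
        {
            pthread_mutex_unlock(&cur->lock);
            return false;
        }

        size_t pos = *start;
        size_t last_nl = *start;

        while (pos < cur->len)
        {
            const char *nl = memchr(cur->buf + pos, '\n', cur->len - pos);

            if (nl == NULL)
            {
                last_nl = cur->len;     /* final, unterminated line */
                break;
            }
            last_nl = (size_t) (nl - cur->buf) + 1;
            if (last_nl - *start >= MIN_CLAIM)
                break;
            pos = last_nl;
        }

        cur->next_start = last_nl;
        *end = last_nl;
        pthread_mutex_unlock(&cur->lock);
        return true;
    }

A worker calls claim_chunk(), parses the lines in [start, end) without
holding the lock, and loops back for more. The lock is held only while
scanning for the chunk boundary, so only one process at a time is
finding line endings, which is exactly the behavior Robert described.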

For some reason, the discussion took a different turn from there, to
discussing how the line endings (called "chunks" in the discussion)
should be represented in shared memory. But none of that is necessary
with Robert's design.

- Heikki

