On 03/11/2020 10:59, Amit Kapila wrote:
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
However, the point of parallel copy is to maximize bandwidth.
Okay, but this first-phase (finding the line boundaries) can anyway
be not done in parallel and we have seen in some of the initial
benchmarking that this initial phase is a small part of work
especially when the table has indexes, constraints, etc. So, I think
it won't matter much if this splitting is done in a single process
or multiple processes.
Right, it won't matter performance-wise. That's not my point. The
difference is in the complexity. If you don't store the line boundaries
in shared memory, you get away with much simpler shared memory structures.
I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review.
Sure. I think the roadmap here is:
1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nice to review and work with.
2. Refactor CopyReadLine(), so that finding the line-endings and the
rest of the line-parsing are separated into separate functions.
3. Implement parallel copy.
I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) have checks which doesn't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.
Agreed, that needs to be solved. I haven't given it any thought myself.
- Heikki
[1]
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi