Re: pg_dump is broken for partition tablespaces

Amit Langote Mon, 22 Apr 2019 18:49:37 -0700

On 2019/04/23 7:51, Alvaro Herrera wrote:
> On 2019-Mar-06, Tom Lane wrote:
>> David Rowley <[email protected]> writes:
>>> As far as I can see, the biggest fundamental difference with doing
>>> things this way will be that the column order of partitions will be
>>> preserved, where before it would inherit the order of the partitioned
>>> table.  I'm a little unsure if doing this column reordering was an
>>> intended side-effect or not.
>>
>> Well, if the normal behavior results in changing the column order,
>> it'd be necessary to do things differently in --binary-upgrade mode
>> anyway, because there we *must* preserve column order.  I don't know
>> if what you're describing represents a separate bug for pg_upgrade runs,
>> but it might.  Is there any test case for the situation left behind by
>> the core regression tests?
> 
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?


I do remember being too wary initially about letting partitions devolve
into a state of needing tuple conversion during DML execution, which very
well may have been a reason to write the pg_dump support code the way it
is now.  pg_dump chooses to emit partitions with the CREATE TABLE
PARTITION OF syntax because, as has been correctly interpreted on this
thread, it lets partitions end up with the same TupleDesc as the parent
and hence not require tuple conversion during DML execution.  The
exception, of course, is when it's run in --binary-upgrade mode.
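
For illustration, here is roughly what the two forms look like for a
made-up list-partitioned table.  The names (parent, part1) are
hypothetical, and real pg_dump output carries more detail, such as schema
qualification:

```sql
-- Hypothetical parent table, used only for this sketch.
CREATE TABLE parent (a integer, b text) PARTITION BY LIST (a);

-- Form 1: PARTITION OF.  The partition takes its column definitions,
-- and therefore its column order, from the parent.
CREATE TABLE part1 PARTITION OF parent FOR VALUES IN (1);

-- Form 2 (used instead of Form 1): plain CREATE TABLE followed by
-- ATTACH PARTITION.  The partition keeps whatever column order its
-- own definition spells out.
-- CREATE TABLE part1 (b text, a integer);
-- ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);
```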

Needing tuple conversion is still an overhead, but maybe there aren't that
many cases where TupleDescs differ among tables in partition trees, so the
considerations for emitting PARTITION OF syntax may not be all that
relevant.  Also, we've made DML involving partitions pretty efficient
these days by reducing most other overheads, even though nothing has been
done to avoid tuple conversion in the cases where it is still needed.

> To me, it sounds
> unintuitive to accept partitions that don't exactly match the order of
> the parent table; but it's been supported all along.

You might know it already, but even though column sets of two tables may
appear identical, their TupleDescs still may not match due to dropped
columns being different in the two tables.
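
A minimal sketch of that situation, with made-up table and column names:

```sql
-- Hypothetical tables; "junk" exists only to leave a dropped column behind.
CREATE TABLE parent (a integer, b text) PARTITION BY LIST (a);
CREATE TABLE part1 (a integer, junk integer, b text);
ALTER TABLE part1 DROP COLUMN junk;

-- The visible column sets now match, so the attach is accepted.
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- The catalog still records the dropped attribute, so "b" sits at a
-- different attribute number in part1 than in parent.
SELECT attrelid::regclass AS rel, attnum, attname, attisdropped
FROM pg_attribute
WHERE attrelid IN ('parent'::regclass, 'part1'::regclass)
  AND attnum > 0
ORDER BY 1, 2;
```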

> In the status quo,
> if users dump and restore such a database, the restored partition ends
> up with the column order of the parent instead of its own column order
> (by virtue of being created as CREATE TABLE PARTITION OF).  Isn't that
> wrong?  It'll cause an INSERT/COPY direct to the partition that worked
> prior to the restore to fail after the restore, if the column list isn't
> specified.

That's true, although there is a workaround as you mentioned -- specify
column names to match the input data.  pg_dump itself specifies them, so
the dumped output can be loaded unchanged.
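
To make the failure and the workaround concrete, a sketch with made-up
names, assuming the partition was originally created with a column order
different from the parent's:

```sql
-- Hypothetical original setup: part1 lists its columns as (b, a).
CREATE TABLE parent (a integer, b text) PARTITION BY LIST (a);
CREATE TABLE part1 (b text, a integer);
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- Relies on part1's own column order (b, a).  If a restore recreates
-- part1 via PARTITION OF, its order becomes (a, b) and this statement
-- would then try to store 'x' in the integer column and fail.
INSERT INTO part1 VALUES ('x', 1);

-- Naming the columns makes the statement independent of column order,
-- so it behaves the same before and after the restore.
INSERT INTO part1 (b, a) VALUES ('x', 1);
```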

Anyway, I don't see a problem with changing pg_dump to *always* emit
CREATE TABLE followed by ATTACH PARTITION, not just in --binary-upgrade
mode, if it lets us deal with the tablespace-related issues smoothly.

Thanks,
Amit
