On 2019/04/23 7:51, Alvaro Herrera wrote:
> On 2019-Mar-06, Tom Lane wrote:
>> David Rowley <david.row...@2ndquadrant.com> writes:
>>> As far as I can see, the biggest fundamental difference with doing
>>> things this way will be that the column order of partitions will be
>>> preserved, where before it would inherit the order of the partitioned
>>> table. I'm a little unsure if doing this column reordering was an
>>> intended side-effect or not.
>>
>> Well, if the normal behavior results in changing the column order,
>> it'd be necessary to do things differently in --binary-upgrade mode
>> anyway, because there we *must* preserve column order. I don't know
>> if what you're describing represents a separate bug for pg_upgrade runs,
>> but it might. Is there any test case for the situation left behind by
>> the core regression tests?
>
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?

I do remember being quite wary initially about letting partitions devolve
into a state of needing tuple conversion during DML execution, which may
very well have been a reason to write the pg_dump support code the way it
is now. Except in --binary-upgrade mode, pg_dump emits partitions with the
CREATE TABLE PARTITION OF syntax because, as seems to have been correctly
interpreted on this thread, that lets partitions end up with the same
TupleDesc as the parent and hence not require tuple conversion in DML
execution.

Needing tuple conversion is still an overhead, but maybe there aren't that
many cases where TupleDescs differ among tables in partition trees, so the
considerations for emitting the PARTITION OF syntax may not be all that
relevant. Also, we've made DML involving partitions pretty efficient these
days by reducing most other overheads, even though nothing has been done to
avoid tuple conversion in the cases where it is needed anyway.

> To me, it sounds
> unintuitive to accept partitions that don't exactly match the order of
> the parent table; but it's been supported all along.

You might know it already, but even though the column sets of two tables
may appear identical, their TupleDescs still may not match due to dropped
columns being different in the two tables.

> In the statu quo,
> if users dump and restore such a database, the restored partition ends
> up with the column order of the parent instead of its own column order
> (by virtue of being created as CREATE TABLE PARTITION OF). Isn't that
> wrong? It'll cause an INSERT/COPY direct to the partition that worked
> prior to the restore to fail after the restore, if the column list isn't
> specified.

That's true, although there is a workaround as you mentioned -- specify
column names to match the input data. pg_dump itself specifies them, so the
dumped output can be loaded unchanged.

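To make the failure mode concrete, here is a minimal sketch. The table
names (parent, part1) are made up for illustration, and the dumped
statement is shown in simplified form rather than as exact pg_dump output.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE parent (a int, b text) PARTITION BY LIST (a);

-- A partition created standalone with a different column order,
-- then attached. ATTACH PARTITION accepts the differing order.
CREATE TABLE part1 (b text, a int);
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- Works before dump/restore: with no column list, values are matched
-- against part1's own column order (b, a).
INSERT INTO part1 VALUES ('one', 1);

-- Outside --binary-upgrade mode, the dump recreates the partition
-- roughly like this (simplified):
--   CREATE TABLE part1 PARTITION OF parent FOR VALUES IN (1);
-- The restored part1 then has the parent's order (a, b), so the INSERT
-- above fails because 'one' is not valid input for the int column.

-- Naming the columns works with either layout, which is the workaround
-- mentioned above.
INSERT INTO part1 (b, a) VALUES ('one', 1);
```

Emitting CREATE TABLE followed by ATTACH PARTITION in all modes would keep
part1's own order (b, a) across the restore, so the first INSERT would
continue to work.
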
Anyway, I don't see a problem with changing pg_dump to *always* emit CREATE
TABLE followed by ATTACH PARTITION, not just in --binary-upgrade mode, if
it lets us deal with the tablespace-related issues smoothly.

Thanks,
Amit