[HACKERS] pg_dump and dependencies and --section ... it's a mess

Tom Lane Thu, 21 Jun 2012 11:14:09 -0700

Don't know if everybody on this list has been paying attention to the
pgsql-bugs thread about bug #6699.  The shortest example of the problem
is


create table t1 (f1 int primary key, f2 text);
create view v1 as select f1, f2 from t1 group by f1;

The view's query is legal only because of the primary key on f1,
else the reference to f2 would be considered inadequately grouped.
So when we create the view we mark it as dependent on that primary key
(or more accurately, the view's _RETURN rule is so marked).  This all
works fine so far as the backend is concerned: you can't drop the PK
without dropping or redefining the view.  But it gives pg_dump
indigestion.  If you do "pg_dump -Fc | pg_restore -l -v" you see
(relevant entries only):

5; 2615 2200 SCHEMA - public postgres                   PRE_DATA
168; 1259 41968 TABLE public t1 postgres                PRE_DATA
;       depends on: 5
1921; 2606 41975 CONSTRAINT public t1_pkey postgres     POST_DATA
;       depends on: 168 168
169; 1259 41976 VIEW public v1 postgres                 PRE_DATA
;       depends on: 1919 5
1922; 0 41968 TABLE DATA public t1 postgres             DATA
;       depends on: 168

Now, there are two not-nice things about this.  First, the view is shown
as depending on "object 1919", which is pg_dump's internal DumpId for
the view's _RETURN rule --- but that's nowhere to be seen in the dump,
so the fact that it's dependent on the t1_pkey constraint is not
visible in the dump.  This puts parallel pg_restore at risk of doing
the wrong thing.  Second, because view definitions are preferentially
dumped in the PRE_DATA section, the t1_pkey constraint has been hoisted
up to come before the table data.  That means that in an ordinary serial
restore, the index will be created before the table's data is loaded,
which is undesirable for efficiency reasons.

It gets worse though.  I've labeled the above archive items with their
"SECTION" codes (which for some reason pg_restore -l -v doesn't print)
so that you can see that we've got a POST_DATA item that must come
before a PRE_DATA item.  This wreaks havoc with the entire concept
of splitting dump files into three sections.  When I first realized
that, I was thinking that we would have to revert the --section flag
we added to pg_dump/pg_restore in 9.2, for lack of any way to guarantee
that the items can actually be restored if they are split according
to sections.

I think that we can fix it though.  There are provisions in pg_dump
already for splitting a view apart from its _RETURN rule --- and if the
rule comes out as a separate object, it's POST_DATA.  So at least in
principle, we could fix things by dumping this scenario this way:

SCHEMA public           PRE_DATA
TABLE t1                PRE_DATA
TABLE v1                PRE_DATA (at this point it's a table not a view)
TABLE DATA t1           DATA
CONSTRAINT t1_pkey      POST_DATA
RULE for v1             POST_DATA (adding the rule turns v1 into a view)

The problem is how to persuade pg_dump to do that; as the code stands,
it will only break a view apart from its rule if that's necessary to
break a circular dependency loop, and there is none in this example.

Another point worth thinking about is that if --section is to be trusted
at all, we need some way of guaranteeing that the dependency-sorting
code won't happily create other similar cases.  There is nothing in
there right now that would think anything is wrong with an ordering
that breaks the section division.

I believe the right fix for both of these issues is to add knowledge of
the section concept to the topological sort logic, so that an ordering
that puts POST_DATA before DATA or PRE_DATA after DATA is considered to
be a dependency-ordering violation.  One way to do that is to add dummy
"fencepost" objects to the sort, representing the start and end of the
DATA section.  However, these objects would need explicit dependency
links to every other DumpableObject, so that doesn't sound very good
from a performance standpoint.  What I'm going to go look at is whether
we can mark DumpableObjects with their SECTION codes at creation time
(rather than adding that information at ArchiveEntry() time) and then
have the topo sort logic take that marking into account in addition to
the explicit dependency links.

Assuming we can make that work, how far should it be back-patched?
The --section issue is new in 9.2, but this would also take care of
the restore problems for views with constraint dependencies, which
are allowed as of 9.1, so I'm thinking we'd better put it into 9.1.

The other problem is the bogus dependency IDs that pg_dump emits by
virtue of not caring whether the dependency links to an object it's
actually dumped.  I posted a patch for that in the pgsql-bugs thread
but pointed out that it would not work before 9.2 without additional
surgery.  If we fix the view vs constraint issue by adding section
knowledge to the sort, I think we can maybe get away with not fixing
the dependency IDs in back branches.  parallel pg_restore is definitely
at risk without the fix, but in the absence of any other known problem
cases I'm inclined to not change the back branches more than we have to.

Thoughts?

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] pg_dump and dependencies and --section ... it's a mess

Reply via email to