Hello, I studied your lastest patch. At Fri, 27 Nov 2015 16:59:20 +0900, Michael Paquier <michael.paqu...@gmail.com> wrote in <CAB7nPqRoaCMhr4hjEgq4rCZ4GaCB-6=ch8b2u7k7t5-kbgc...@mail.gmail.com> > On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote: > > I am still investigating for a correct fix, looking at reinit.c the > > code in charge of copying the init fork as the main fork for a > > relation at the end of recovery looks to be doing its job correctly... > > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts?
I feel quite uncomfortable that it solves the problem from a kind of nature of unlogged object by arbitrary flagging which is not fully corresponds to the nature. If we can deduce the necessity of fsync from some nature, it would be preferable. In the current patch, is_sync for log_newpage is generally true for and only for INIT_FORKNUM pages. Exceptions as far as I can see are, copy_relation_data: called with arbitrary forknum but it doesn't set is_fsync even for coying INIT_FORKNUM. (Is this not a problem?) spgbuildempty, ginbuildempty: these emits two or three newpage logs at once so only the last one is set is_fsync for performance reason. And other anormallies are, ginbuildempty, gistbuildempty: These funciton doesn't seem to immediately fsync but is_fsync is set to INIT_FORKNUM. Of course it wouldn't be a problem. In short, it seems to me that the reason to choose using XLOG_FPI_FOR_SYNC here is only performance of processing successive FPIs for INIT_FORKNUM. INIT_FORKNUM is generated only for unlogged tables and their belongings. I suppose such successive fsyncs doesn't cause observable performance drop assuming that the number of unlogged tables and belongings is not so high, especially with smarter storages. All we should do is that just fsync only for INIT_FORKNUM's FPIs for the case. If the performance does matter even so, we still can fsync the last md-file when any wal record other than FPI for INIT_FORK comes. (But this would be a bit complex..) By the way, I suppose that fsyncing only the last page in successive new pages still theoretically can cause this problem when the last pages is not in the same file with other pages. That cannot occur for INIT_FORKNUM files though in reality:) Thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers