On Mon, May 11, 2020 at 8:43 PM Paul Guo <p...@pivotal.io> wrote:
> I have this concern since I saw an issue in a real product environment that
> the startup process needs 10+ seconds to start wal replay after relaunch due
> to elog(PANIC) (it was seen on postgres based product Greenplum but it is a
> common issue in postgres also). I highly suspect the delay was mostly due to
> this. Also it is noticed that on public clouds fsync is much slower than that
> on local storage so the slowness should be more severe on cloud. If we at
> least disable fsync on the table directories we could skip a lot of file
> fsync - this may save a lot of seconds during crash recovery.

I've seen this problem be way worse than that. Running fsync() on all
the files and performing the unlogged table cleanup steps can together
take minutes or, I think, even tens of minutes. What I think sucks
most in this area is that we don't even emit any log messages if the
process takes a long time, so the user has no idea why things are
apparently hanging. I think we really ought to try to figure out some
way to give the user a periodic progress indication when this kind of
thing is underway, so that they at least have some idea what's
happening.

As Tom says, I don't think there's any realistic way that we can
disable it altogether, but maybe there's some way we could make it
quicker, like some kind of parallelism, or by overlapping it with
other things. It seems to me that we have to complete the fsync pass
before we can safely checkpoint, but I don't know that it needs to be
done any sooner than that... not sure though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
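
To make the "periodic progress indication" idea above concrete, here is a minimal sketch of the general shape: walk a directory tree, fsync each file, and emit a message whenever a reporting interval has elapsed. It is an illustration only, not PostgreSQL code. The function name `fsync_tree`, the `report_every` and `log` parameters, and the message wording are all hypothetical, and it assumes a POSIX system where fsync() is permitted on a descriptor opened read-only.

```python
import os
import time


def fsync_tree(root, report_every=10.0, log=print):
    """fsync every regular file under root, logging progress periodically.

    Hypothetical helper for illustration; returns the number of files synced.
    """
    synced = 0
    start = last_report = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, dangling symlinks
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # file vanished or is unreadable; ignored in this sketch
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            synced += 1
            # Report only when the interval has elapsed, so a fast pass
            # stays quiet and a slow one tells the user what is going on.
            now = time.monotonic()
            if now - last_report >= report_every:
                log(f"syncing data directory: {synced} files synced, "
                    f"{now - start:.0f} s elapsed, current path: {path}")
                last_report = now
    return synced
```

Reporting on elapsed time rather than every N files keeps the log volume bounded no matter how slow the storage is, which seems to matter most in the slow cloud-storage case described in the quoted message.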