Re: Two fsync related performance issues?

Robert Haas Tue, 19 May 2020 05:51:11 -0700

On Mon, May 11, 2020 at 8:43 PM Paul Guo wrote:
> I have this concern because I saw an issue in a real production environment
> where the startup process needed 10+ seconds to start WAL replay after being
> relaunched due to elog(PANIC). (It was seen on Greenplum, a Postgres-based
> product, but it is a common issue in Postgres as well.) I strongly suspect the
> delay was mostly due to this. I have also noticed that fsync is much slower on
> public clouds than on local storage, so the slowdown should be more severe
> there. If we at least disabled fsync on the table directories, we could skip a
> lot of file fsyncs, which might save many seconds during crash recovery.


I've seen this problem be way worse than that. Running fsync() on all
the files and performing the unlogged table cleanup steps can together
take minutes or, I think, even tens of minutes. What I think sucks
most in this area is that we don't even emit any log messages if the
process takes a long time, so the user has no idea why things are
apparently hanging. I think we really ought to try to figure out some
way to give the user a periodic progress indication when this kind of
thing is underway, so that they at least have some idea what's
happening.
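
For illustration only, here is a minimal C sketch of the kind of periodic progress message described above. It assumes a single recursive walk that fsyncs every file and directory and rate-limits its log line. This is not PostgreSQL's code: `sync_dir_with_progress`, `fsync_entry`, `progress_due` and the 10-second `PROGRESS_INTERVAL_SECS` are hypothetical names, and unlike the real recovery path it does not follow tablespace symlinks.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical setting: emit a progress line at most every 10 seconds. */
#define PROGRESS_INTERVAL_SECS 10

static long entries_done;
static long sync_errors;
static time_t last_report;

/* True if at least interval_secs have passed since the last report. */
static int progress_due(time_t now, time_t last, int interval_secs)
{
    return difftime(now, last) >= interval_secs;
}

/* nftw() callback: fsync one file or directory, then maybe log progress. */
static int fsync_entry(const char *path, const struct stat *sb,
                       int typeflag, struct FTW *ftwbuf)
{
    (void) sb;
    (void) ftwbuf;

    /* Only regular files and readable directories; symlinks are skipped. */
    if (typeflag != FTW_F && typeflag != FTW_D)
        return 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0 || fsync(fd) != 0)
        sync_errors++;
    if (fd >= 0)
        close(fd);
    entries_done++;

    time_t now = time(NULL);
    if (progress_due(now, last_report, PROGRESS_INTERVAL_SECS)) {
        fprintf(stderr, "still syncing data directory: %ld entries done, current \"%s\"\n",
                entries_done, path);
        last_report = now;
    }
    return 0;   /* keep walking even after an error */
}

/* Walk datadir and fsync everything; returns entries visited, or -1. */
long sync_dir_with_progress(const char *datadir)
{
    assert(datadir != NULL);
    entries_done = 0;
    sync_errors = 0;
    last_report = time(NULL);
    if (nftw(datadir, fsync_entry, 64, FTW_PHYS) != 0)
        return -1;
    return entries_done;
}
```

The rate limit costs one time() call per entry, which is negligible next to an fsync, so the user gets a heartbeat without slowing the pass down.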

As Tom says, I don't think there's any realistic way that we can
disable it altogether, but maybe there's some way we could make it
quicker, like some kind of parallelism, or by overlapping it with
other things. It seems to me that we have to complete the fsync pass
before we can safely checkpoint, but I don't know that it needs to be
done any sooner than that... not sure though.
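
Similarly hedged, here is one way the parallelism idea could look. This sketch assumes the list of files to sync has already been collected. It splits that list across a few child processes with fork() and sums their failure counts from the exit statuses. `parallel_fsync`, `sync_one` and `MAX_WORKERS` are hypothetical. Whether parallel fsync actually helps depends on the storage, and the sketch does not address when the pass must finish relative to the checkpoint.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical cap on the number of worker processes. */
#define MAX_WORKERS 16

/* fsync a single file; returns 0 on success, -1 on any failure. */
static int sync_one(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Worker w handles paths[w], paths[w + nworkers], ...
 * Returns the number of files that failed, or -1 if a worker could not
 * be started (in which case some files were never attempted).
 */
int parallel_fsync(const char *const *paths, int npaths, int nworkers)
{
    pid_t pids[MAX_WORKERS];
    int started = 0, failures = 0, fork_failed = 0;

    assert(npaths >= 0);
    if (nworkers < 1)
        nworkers = 1;
    if (nworkers > MAX_WORKERS)
        nworkers = MAX_WORKERS;

    for (int w = 0; w < nworkers; w++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            fork_failed = 1;
            break;
        }
        if (pid == 0) {
            /* Child: fsync its stripe of the list, report count via exit status. */
            int errs = 0;
            for (int i = w; i < npaths; i += nworkers)
                if (sync_one(paths[i]) != 0)
                    errs++;
            _exit(errs > 255 ? 255 : errs);   /* exit status holds only 8 bits */
        }
        pids[started++] = pid;
    }

    /* Parent: wait for every started worker and add up its failures. */
    for (int w = 0; w < started; w++) {
        int status;
        if (waitpid(pids[w], &status, 0) < 0 || !WIFEXITED(status))
            failures++;   /* treat a lost or crashed worker as one failure */
        else
            failures += WEXITSTATUS(status);
    }
    return fork_failed ? -1 : failures;
}
```

Processes are used rather than threads only because the server is process-based; a thread pool pulling from the same list would work the same way.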

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company