On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is called to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty, the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5), each of which can potentially take a
> fair while. If the working set mostly fits into shared_buffers, 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the
> kernel's dirty limits, 5) can take a couple of seconds.
Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly, I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and
> initialize it before us. ISTM that could be resolved by *not* writing
> any data via smgrextend()/mdextend(). If we instead only do the write
> once we've read in & locked the page exclusively, there's no need for
> the extension lock. We probably still should write out the new page to
> the OS immediately once we've initialized it, to avoid creating
> sparse files.
>
> The other reason we need the extension lock is that code like
> lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> that are about to be initialized by the extending backend. I think we
> should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we do this is that we want to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to checkpoint
any more. If we discover the problem while the block is still
all-zeroes, the transaction that uncovers the problem errors out, but
the system as a whole is still OK.

Or at least, that's what I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company