On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is called to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty, the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5), each of which can potentially take a
> fair while. If the working set mostly fits into shared_buffers, 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the
> kernel's dirty limits, 5) can take a couple of seconds.
Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly, I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and
> initialize it before us. ISTM that could be resolved by *not* writing
> any data via smgrextend()/mdextend(). If we instead only do the write
> once we've read in & locked the page exclusively, there's no need for
> the extension lock. We probably still should write out the new page to
> the OS immediately once we've initialized it, to avoid creating
> sparse files.
>
> The other reason we need the extension lock is that code like
> lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> that are about to be initialized by the extending backend. I think we
> should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we do this is that we want to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to checkpoint
any more. If we discover the problem while the block is still
all-zeroes, the transaction that uncovers the problem errors out, but
the system as a whole is still OK.

Or at least, that's what I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company