Re: [HACKERS] Relation extension scalability

Andres Freund Sun, 29 Mar 2015 12:54:05 -0700

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> > There's two things that seem to make sense to me:
>
> > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> > and introduce a bufmgr function specifically for extension.
>
> I think that removing P_NEW is likely to require a fair amount of
> refactoring of calling code, so I'm not thrilled with doing that.
> On the other hand, perhaps all that code would have to be touched
> anyway to modify the scope over which the extension lock is held.


It's not *that* many locations that need to extend relations. In my
playing around it seemed to me they all would need to be modified
anyway; if we want to remove/reduce the scope of extension locks to deal
with the fact that somebody else could have started to use the buffer.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock alltogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend().
>
> I'm afraid this would break stuff rather thoroughly, in particular
> handling of out-of-disk-space cases.

Hm. Not a bad point.

> And I really don't see how you get
> consistent behavior at all for multiple concurrent callers if there's no
> locking.

What I was thinking is something like this:

while (true)
{
    targetBuffer = AcquireFromFSMEquivalent();

    if (targetBuffer == InvalidBuffer)
         targetBuffer = ExtendRelation();

    LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

    if (BufferHasEnoughSpace(targetBuffer))
        break;

    LockBuffer(BUFFER_LOCK_UNLOCK);
}

where ExtendRelation() would basically work like
while (true)
{
    targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ)/8192;
    buffer = ReadBuffer(rel, targetBlock);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buffer);
    if (PageIsNew(page))
    {
        PageInit(page);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        FlushBuffer(buffer);
        break;
    }
    else
    {
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        continue;
    }
}

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you can say, lead to some
oddities with out of space handling. I think it should actually be ok,
it might just be confusing to the user.

I think we might be able to address those issues by not using
lseek(SEEK_SET) but instead
fcntl(fd, F_SETFL, O_APPEND, 1);
write(fd, pre-init-block, BLCKSZ);
fcntl(fd, F_SETFL, O_APPEND, 0);
newblock = (lseek(SEEK_CUR) - BLCKSZ)/BLCKSZ;

by using O_APPEND and a pre-initialized block we can be sure to write a
block at the end, that's valid, and shouldn't run afould of any out of
space issues that we don't already have.

Unfortunately I'm not sure whether fcntl for O_APPEND is portable :(

> One idea that might help is to change smgrextend's API so that it doesn't
> need a buffer to write from, but just has an API of "add a prezeroed block
> on-disk and tell me the number of the block you added".  On the other
> hand, that would then require reading in the block after allocating a
> buffer to hold it (I don't think you can safely assume otherwise) so the
> added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Relation extension scalability

Reply via email to