On 2015-03-29 15:21:44 -0400, Tom Lane wrote: > > There's two things that seem to make sense to me: > > > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW > > and introduce a bufmgr function specifically for extension. > > I think that removing P_NEW is likely to require a fair amount of > refactoring of calling code, so I'm not thrilled with doing that. > On the other hand, perhaps all that code would have to be touched > anyway to modify the scope over which the extension lock is held.
It's not *that* many locations that need to extend relations. In my playing around it seemed to me they all would need to be modified anyway; if we want to remove/reduce the scope of extension locks to deal with the fact that somebody else could have started to use the buffer. > > Secondly I think we could maybe remove the requirement of needing an > > extension lock alltogether. It's primarily required because we're > > worried that somebody else can come along, read the page, and initialize > > it before us. ISTM that could be resolved by *not* writing any data via > > smgrextend()/mdextend(). > > I'm afraid this would break stuff rather thoroughly, in particular > handling of out-of-disk-space cases. Hm. Not a bad point. > And I really don't see how you get > consistent behavior at all for multiple concurrent callers if there's no > locking. What I was thinking is something like this: while (true) { targetBuffer = AcquireFromFSMEquivalent(); if (targetBuffer == InvalidBuffer) targetBuffer = ExtendRelation(); LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE); if (BufferHasEnoughSpace(targetBuffer)) break; LockBuffer(BUFFER_LOCK_UNLOCK); } where ExtendRelation() would basically work like while (true) { targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ)/8192; buffer = ReadBuffer(rel, targetBlock); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); page = BufferGetPage(buffer); if (PageIsNew(page)) { PageInit(page); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); FlushBuffer(buffer); break; } else { LockBuffer(buffer, BUFFER_LOCK_UNLOCK); continue; } } Obviously it's a tad more complex than that pseudocode, but I think that basically should work. Except that it, as you can say, lead to some oddities with out of space handling. I think it should actually be ok, it might just be confusing to the user. I think we might be able to address those issues by not using lseek(SEEK_SET) but instead fcntl(fd, F_SETFL, O_APPEND, 1); write(fd, pre-init-block, BLCKSZ); fcntl(fd, F_SETFL, O_APPEND, 0); newblock = (lseek(SEEK_CUR) - BLCKSZ)/BLCKSZ; by using O_APPEND and a pre-initialized block we can be sure to write a block at the end, that's valid, and shouldn't run afould of any out of space issues that we don't already have. Unfortunately I'm not sure whether fcntl for O_APPEND is portable :( > One idea that might help is to change smgrextend's API so that it doesn't > need a buffer to write from, but just has an API of "add a prezeroed block > on-disk and tell me the number of the block you added". On the other > hand, that would then require reading in the block after allocating a > buffer to hold it (I don't think you can safely assume otherwise) so the > added read step might eat any savings. Yea, I was thinking that as well. We simply could skip the reading step by setting up the contents in the buffer manager without a read in this case... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers