On Sat, 2002-10-05 at 20:32, Tom Lane wrote:
> Hannu Krosing <[EMAIL PROTECTED]> writes:
> > The writer process should just issue a continuous stream of
> > aio_write()'s while there are any waiters and keep track of which
> > waiters are safe to continue - thus no guessing of who's gonna commit.
>
> This recipe sounds like "eat I/O bandwidth whether we need it or not".
> It might be optimal in the case where activity is so heavy that we
> do actually need a WAL write on every disk revolution, but in any
> scenario where we're not maxing out the WAL disk's bandwidth, it will
> hurt performance. In particular, it would seriously degrade performance
> if the WAL file isn't on its own spindle but has to share bandwidth with
> data file access.
>
> What we really want, of course, is "write on every revolution where
> there's something worth writing" --- either we've filled a WAL block
> or there is a commit pending.

That's what I meant by "while there are any waiters".

> But that just gets us back into the
> same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
> I don't see how an extra process makes that problem any easier.

I still think that we could get gang writes automatically, if we just
issue an aio_write() at the completion of each WAL page and keep track
of those that are written. We could also keep track of the write
position inside the WAL page, namely

1. the end of the last write() of each process
2. the WAL file's write position at each aio_write()

Then we can safely(?) assume that each backend only needs its own
write()'s to be on disk before it can consider its transaction
committed. If the fsync()-like request comes in at a time when the
aio_write() covering that process's last position has already
completed, we can let that process continue without even a context
switch.
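To make that bookkeeping concrete, something like the following
(a minimal sketch only - it assumes POSIX aio, simplifies WAL
positions to plain byte offsets, ignores locking and shared memory,
and all the names in it are made up):

#include <aio.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned long XLogOffset;    /* WAL position as a byte offset */

static XLogOffset wal_flushed_up_to; /* highest offset known to be on disk */
static XLogOffset my_last_write_end; /* per backend: end of its last write() */

static struct aiocb wal_aiocb;       /* one write in flight at a time here */
static XLogOffset wal_aiocb_end;     /* WAL write position at the aio_write() */
static bool wal_aio_in_flight = false;

/* Hand a completed WAL page to the kernel and remember how far
 * this write reaches. */
static int
issue_wal_aio_write(int fd, const void *page, size_t len, XLogOffset page_start)
{
    memset(&wal_aiocb, 0, sizeof(wal_aiocb));
    wal_aiocb.aio_fildes = fd;
    wal_aiocb.aio_buf = (void *) page;
    wal_aiocb.aio_nbytes = len;
    wal_aiocb.aio_offset = (off_t) page_start;
    wal_aiocb_end = page_start + len;
    wal_aio_in_flight = true;
    return aio_write(&wal_aiocb);
}

/* Check for completion: once the write is done, everything up to
 * wal_aiocb_end is on disk. */
static void
reap_wal_aio(void)
{
    if (wal_aio_in_flight && aio_error(&wal_aiocb) != EINPROGRESS)
    {
        if (aio_return(&wal_aiocb) >= 0 && wal_aiocb_end > wal_flushed_up_to)
            wal_flushed_up_to = wal_aiocb_end;
        wal_aio_in_flight = false;
    }
}

/* The fsync()-like request at commit: if the write covering our
 * data has already completed, continue without a context switch. */
static bool
commit_can_continue(void)
{
    reap_wal_aio();
    return my_last_write_end <= wal_flushed_up_to;
}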
In the above scenario I assume that the kernel can do the right thing
with multiple aio_write() requests for the same page: handle them in
one sweep, not do one physical write per aio_write().

> BTW, it would seem to me that aio_write() buys nothing over plain write()
> in terms of ability to gang writes. If we issue the write at time T
> and it completes at T+X, we really know nothing about exactly when in
> that interval the data was read out of our WAL buffers.

Yes, most likely. If we do several writes of the same pages, they will
hit the platter in the same physical write.

> We cannot
> assume that commit records that were stored into the WAL buffer during
> that interval got written to disk. The only safe assumption is that
> only records that were in the buffer at time T are down to disk; and
> that means that late arrivals lose.

I assume that if each commit record issues its own aio_write(), then
all of those which actually reached the disk will be notified. IOW the
first aio_write() orders the write, but all the latecomers that arrive
before the actual write will also get written and notified. (See the
sketch in the PS below.)

> You can't issue aio_write
> immediately after the previous one completes and expect that this
> optimizes performance --- you have to delay it as long as you possibly
> can in hopes that more commit records arrive.

I guess we have quite different cases for different hardware
configurations. If we have a separate disk subsystem for WAL, we may
want to keep the log flowing to disk as fast as it is ready, including
writing the last, partial page as often as new writes to it are done.
As we can't write more than ~250 times/sec anyway (a 15K RPM drive
makes 15,000/60 = 250 revolutions per second, thus at most one ordered
write per revolution; no battery-backed RAM assumed), we will always
have at least two context switches between writes (500 Hz scheduler
clock / 250 writes per second = 2 ticks per write), and many more if
processes background themselves while waiting for small transactions
to commit.

> So it comes down to being the same problem.

Or its solution ;) as instead of predicting we just write all log data
that is ready to be written. If we postpone writing, there will be
hiccups when we suddenly discover that we need to write a whole lot of
pages at fsync() after idling the disk for some period.

---------------
Hannu
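PS. To illustrate what I mean by "the latecomers will also get written
and notified": a rough sketch, again assuming POSIX aio and made-up
names, where each backend queues its own aio_write() for the current
WAL page and sleeps on its own control block. Whether the kernel
really coalesces the overlapping requests into one physical write is
the assumption made above, not something this code can force.

#include <aio.h>
#include <errno.h>
#include <string.h>

/* Write the current WAL page (covering at least our own commit
 * record) and wait until it is on disk.  If several backends do
 * this for the same page, each still gets its own completion
 * notification off what is ideally one physical write: the first
 * request orders the write, the latecomers ride along. */
static int
commit_and_wait(int fd, const void *page, size_t len, off_t page_start)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *) page;
    cb.aio_nbytes = len;
    cb.aio_offset = page_start;

    if (aio_write(&cb) != 0)
        return -1;

    /* Sleep until *our* request is done. */
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    return (aio_return(&cb) >= 0) ? 0 : -1;
}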