Thanks for a lot of inspiring discussions. Please note that my proposal changes only a few lines of the recovery code itself. It does not affect buffer management, the order in which WAL records are applied, and so on. The only change needed is to invoke the prefetch feature when redo is about to read WAL that the prefetch has not yet handled (the prefetch function returns the last-handled LSN).
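To make the size of that change concrete, here is a minimal sketch of the check in the redo loop. It is not the actual patch: `Lsn`, `prefetch_from()`, `replay()`, the `prefetch_calls` counter and the fixed prefetch window are all hypothetical names and simplifications introduced only for illustration, and the real readahead work is replaced by a stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; not PostgreSQL's XLogRecPtr. */
typedef uint64_t Lsn;

/* Counts prefetch invocations, for illustration only. */
static int prefetch_calls = 0;

/*
 * Hypothetical prefetch entry point. A real one would scan WAL starting at
 * `from` and ask the OS to read the referenced data pages ahead of time.
 * This stub only pretends to cover a fixed window and returns the last LSN
 * it handled, which is the only thing the redo loop needs to know.
 */
static Lsn
prefetch_from(Lsn from, Lsn window)
{
    assert(window > 0);
    prefetch_calls++;
    return from + window - 1;
}

/*
 * Redo loop sketch. Records are applied strictly in their original order.
 * The only addition is the check that invokes the prefetcher when the next
 * record lies beyond what has already been handled by the prefetch.
 */
static int
replay(const Lsn *records, size_t n, Lsn window)
{
    Lsn prefetched_upto = 0;
    int have_prefetched = 0;
    int applied = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!have_prefetched || records[i] > prefetched_upto)
        {
            prefetched_upto = prefetch_from(records[i], window);
            have_prefetched = 1;
        }

        /* The existing redo work for records[i] would run here, unchanged. */
        applied++;
    }
    return applied;
}
```

With records at LSNs 10, 20, 30, 110, 120 and 250 and a window of 100, this sketch calls the prefetcher three times (at 10, 110 and 250) while still applying all six records in order.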

Before writing the readahead code, I ran several experiments on how posix_fadvise() speeds up random reads, and I found that POSIX_FADV_WILLNEED can improve total read performance by about a factor of five if we issue the posix_fadvise() calls in order of block position. Without random position, the improvement was around threefold. This result was obtained with a single process, but on a RAID configuration. I'd like to make a similar measurement on a single disk.

I'd like to run some benchmarks to clarify the improvement. I agree I should show how my proposal is useful.

In terms of the influence on the recovery code, pg_readahead just calls posix_fadvise() to tell the operating system to prefetch the data page into the kernel's cache, not PG's shared memory, so we don't have to implement this in PG core code. Because of this, and because I think it is more practical to keep platform-specific code outside the core as far as possible, I wrote most of the prefetch as an external process, which could be made available in contrib or on PgFoundry, perhaps the latter.

Heikki suggested having a separate reader process. I think it's a very good idea, but it will change PG's performance dramatically: better in some cases, but possibly even worse in others. I don't have a clear picture of this. So I think the background reader issue should be a challenge for 8.5 or later, and we should call for research work. For now, I think it is reasonable to keep improving specific code.

I'd like to hear some more about these. I'm more than happy to write all the code inside PG core to avoid the overhead of creating another process.

---
Koichi Suzuki

2008/10/29 Gregory Stark <[EMAIL PROTECTED]>:
> Simon Riggs <[EMAIL PROTECTED]> writes:
>
>> On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
>>> Gregory Stark wrote:
>>> > Simon Riggs <[EMAIL PROTECTED]> writes:
>>> >
>>> > > I'm happy with the idea of a readahead process. I thought we were
>>> > > implementing a BackgroundReader process for other uses. Is that dead
>>> > > now?
>>> >
>>> > You and Bruce seem to keep resurrecting that idea. I've never liked it --
>>> > I always hated that in Oracle and thought it was a terrible kludge.
>>>
>>> I didn't think I was promoting the separate reader process after you had
>>> the posix_fadvise() idea.
>
> I'm sorry, I thought I remembered you mentioning it again. But perhaps I was
> thinking of someone else (perhaps it was Simon again?) or perhaps it was
> before you saw the actual patch.
>
>> It would be good if the solutions for normal running and recovery were
>> similar. Greg, please could you look into that?
>
> I could do the readahead side of things but what I'm not sure how to arrange
> is how to restructure the wal reading logic to read records ahead of the
> actual replay.
>
> I think we would have to maintain two pointers one for the prefetch and one
> for the actual running. But the logic in for recovery is complex enough that
> I'm concerned about changing it enough to do that and whether it can be done
> without uglifying the code quite a bit.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's RemoteDBA services!
>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers