* Claudio Freire (klaussfre...@gmail.com) wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it, and I'm
guessing there's a reason the pagecache isn't included normally.

> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.

Any idea why that limitation is there?

> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this: if it's a shared page, it doesn't
matter who starts writing to it; as soon as that happens, it needs to
get copied. Perhaps you mean that the application should keep the
"original" and that the page-cache should get the "copy" (or, really,
perhaps just forget about the page existing at that point, since we
won't want it again...). Would that be a way to go, perhaps?

This does go back to the "make it act like mmap, but not *be* mmap",
but the idea would be:

open(..., O_ZEROCOPY_READ)
read() - Goes to PG's shared buffers, pagecache and PG share the page
page fault (PG writes to it) - pagecache forgets about the page
write() / fsync() - operate as normal

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective: we might
have to evict the page from shared buffers sooner than the kernel
does), and that the write()s happen at the kernel's pace, allowing for
write-combining, etc., until an fsync() happens, of course.

This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.

Thanks,

Stephen
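
For illustration only: O_ZEROCOPY_READ is a hypothetical flag and does
not exist, so the flow above cannot be run as written. The nearest
existing behaviour is a private (copy-on-write) file mapping, which
Python exposes as mmap.ACCESS_COPY. The sketch below uses that as a
stand-in, and it differs from the proposal in one important way: here
the page cache keeps the original page and the process gets the copy,
rather than the page cache forgetting the page. The function name
cow_roundtrip and the temporary file are made up for the example, and
it assumes a Unix-like system (os.pread and os.pwrite).

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def cow_roundtrip(path):
    """Map one page privately, dirty it, then write it back explicitly.

    Returns (file bytes before write-back, private bytes, file bytes after).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Stand-in for open(..., O_ZEROCOPY_READ) + read(): a private
        # mapping shares the page-cache page while it stays clean.
        buf = mmap.mmap(fd, PAGE, access=mmap.ACCESS_COPY)
        try:
            # First store faults; the kernel hands this process its own
            # copy and the file's page is left untouched.
            buf[0:5] = b"dirty"
            on_disk_before = os.pread(fd, 5, 0)
            private_copy = bytes(buf[0:5])

            # write() / fsync() operate as normal: the change reaches the
            # file only through an explicit write.
            os.pwrite(fd, buf[:], 0)
            os.fsync(fd)
            on_disk_after = os.pread(fd, 5, 0)
        finally:
            buf.close()
    finally:
        os.close(fd)
    return on_disk_before, private_copy, on_disk_after


if __name__ == "__main__":
    # Create a one-page scratch file to play the role of a relation block.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"clean" + b"\0" * (PAGE - 5))
    try:
        print(cow_roundtrip(f.name))
    finally:
        os.unlink(f.name)
```

The point of the sketch is only the ordering: the page is shared while
clean, the first write breaks the sharing, and the modified contents
reach the file through an ordinary write() followed by fsync().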