On Mon, Apr 20, 2015 at 07:13:38PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > >
> > > This seems simple to implement: keep two counters, where the second one
> > > is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
> > > reset the first counter so that further 5 pages will get HOT pruned.  5%
> > > seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
> > > essentially +infinity.)
> >
> > This would tend to dirty non-sequential heap pages --- it seems best to
> > just clean as many as we are supposed to, then skip the rest, so we can
> > write sequential dirty pages to storage.
>
> Keep in mind there's a disconnect between dirtying a page and writing it
> to storage.  A page could remain dirty for a long time in the buffer
> cache.  This writing of sequential pages would occur at checkpoint time
> only, which seems the wrong thing to optimize.  If some other process
> needs to evict pages to make room to read some other page in, surely
> it's going to try one page at a time, not write "many sequential dirty
> pages."
Yes, it might be too much optimization to try to get the checkpoint to
flush all those pages sequentially, but I was thinking of our current
behavior where, after an update of all rows, we effectively write out
the entire table because we have dirtied every page.  I guess with later
prune-based writes we aren't really writing all the pages, as the pages
with prunable content are scattered somewhat randomly.

I guess I was just wondering what value there is in your write-then-skip
idea vs. just pruning the first X% of pages we find.  Your idea
certainly spreads out the pruning, and doesn't require knowing the size
of the table, though I thought that information was easily determined.

One thing to consider is how we handle pruning for index scans that hit
multiple heap pages.  Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via the SELECT?  With the
write-then-skip approach, we would prune X% of the pages we access,
while with the first-X% approach we would probably prune all of them, as
we would not be accessing most of the table.  I don't think we can prune
the first X% of pages and base the percentage on the number of pages
accessed, as we have no way to know in advance how many heap pages we
will access from the index.  (We would know for bitmap scans, but that
complexity doesn't seem worth it.)

That would argue, for consistency between sequential and index-based
heap access, that your approach is best.

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +