On Tue, Jan 12, 2016 at 5:52 PM, Andres Freund <and...@anarazel.de> wrote:
>
> On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> > On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <and...@anarazel.de> wrote:
> > >
> > > My theory is that this happens due to the sorting: pgbench is an update
> > > heavy workload, the first few pages are always going to be used if
> > > there's free space as freespacemap.c essentially prefers those. Due to
> > > the sorting all a relation's early pages are going to be "in a row".
> > >
> >
> > Not sure what is the best way to tackle this problem, but I think one way
> > could be to perform sorting at the flush-request level rather than before
> > writing to OS buffers.
>
> I'm not following. If you just sort a couple hundred more or less random
> buffers - which is what you get if you look in buf_id order through
> shared_buffers - the likelihood of actually finding neighbouring writes
> is pretty low.
Why can't we do it at larger intervals (relative to the total amount of
writes)?  To explain what I have in mind, assume that the checkpoint
interval is long (say 10 mins) and in the meantime all the writes are
being done by bgwriter, which registers them in shared memory so that the
checkpointer can later perform the corresponding fsync's.  Now, when the
request queue reaches a threshold size (say 1/3rd full), we can perform
sorting and merging and issue flush hints.

The checkpointer could follow a somewhat similar technique, which means
that once it has written 1/3rd or so of the buffers (which we would need
to track), it can issue flush hints after sort+merge.  Alternatively, I
think we could do this in the checkpointer alone rather than in both
bgwriter and checkpointer.

Basically, I think this could lead to less merging of neighbouring
writes, but that might not hurt if the sync_file_range() API is cheap.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com