Hi,

I've talked a few times about a bgwriter replacement prototype I'd
written a few years back. That happened deep inside another thread
[1], and is thus not easy to find.

Tomas Vondra asked me for a link, but the code had suffered
considerable bitrot since then. Attached is a rebased and slightly
improved version. It's also
available at [2][3].

The basic observation is that there are some fairly fundamental issues
with the current bgwriter implementation:

1) The pacing logic is complicated, yet doesn't work well.
2) If most/all buffers have a nonzero usagecount, it cannot do
   anything, because it doesn't participate in the clock-sweep.
3) Backends have to re-discover the now-clean buffers themselves.


The prototype is much simpler - in my opinion, of course. It has a
ringbuffer of buffers it thinks are clean (though they might be reused
concurrently). It fills that ringbuffer by performing the clock-sweep
and, where necessary, cleaning buffers with usagecount = pincount = 0.
Backends can then pop buffers from that ringbuffer.

Pacing works by bgwriter trying to keep the ringbuffer full while
backends drain it. If the ringbuffer falls below 1/4 full, backends
wake up bgwriter using the existing latch mechanism.
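
A simplified sketch of the backend side - the real version, with
buffer-state rechecks and locking, is in StrategyGetBuffer() in the
patch below:

    BufferDesc *buf;
    bool        found;

    /* try to get a pre-cleaned victim buffer */
    found = ringbuf_pop(VictimBuffers, (void *) &buf);

    /* if the ringbuffer is sufficiently depleted, wake up bgwriter */
    if (!found ||
        ringbuf_elements(VictimBuffers) < VICTIM_BUFFER_PRECLEAN_SIZE / 4)
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);

    if (found)
    {
        /* recheck: the buffer may have been reused concurrently */
    }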

The ringbuffer is a pretty simplistic lockless (but only
obstruction-free, not lock-free) implementation, with a lot of
unnecessary constraints.
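
To illustrate the read_state encoding it uses (per ringbuf.h below,
the upper 16 bits hold the id of the backend currently popping, the
lower 16 bits the read offset) - a toy example, not part of the patch:

    uint32 read_state = (5 << 16) | 42;          /* backend 5, offset 42 */
    uint32 offset     = read_state & 0x0000ffff; /* == 42 */
    uint32 backend_id = read_state >> 16;        /* == 5 */

Mixing the reader id into the word means a concurrent pop by another
backend changes read_state even if the offset itself wraps back to the
same value, which is what lets the compare-and-exchange in
ringbuf_pop() detect interference.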

I've had to improve the current instrumentation for bgwriter
(i.e. pg_stat_bgwriter) considerably - the details in there are, imo,
not even remotely good enough to actually understand the system (nor
are the names understandable). That needs to be split into a separate
commit, and the half dozen different implementations of the counters
need to be unified.

Obviously this is very prototype-stage code. But I think it's a good
starting point for going forward.

To enable it, one currently has to set the bgwriter_legacy GUC to false.
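
E.g., in postgresql.conf (the GUC is SIGHUP-level in the patch, so a
config reload suffices):

    bgwriter_legacy = false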

Some early benchmarks show that in IO-heavy cases the result ranges
from a very mild regression (close to noise) to a pretty considerable
improvement. To see a benefit one - fairly obviously -
needs a workload that is bigger than shared buffers, because otherwise
checkpointer is going to do all writes (and should, it can sort them
perfectly!).

It's quite possible to saturate what a single bgwriter can write out
(as was the case before the replacement, too). I'm inclined to think
the next solution
for that is asynchronous IO, and write-combining, rather than multiple
bgwriters.

Here's an example pg_stat_bgwriter from the middle of a pgbench run
(after resetting the stats a short while earlier):

┌─[ RECORD 1 ]───────────────┬───────────────────────────────┐
│ checkpoints_timed          │ 1                             │
│ checkpoints_req            │ 0                             │
│ checkpoint_write_time      │ 179491                        │
│ checkpoint_sync_time       │ 266                           │
│ buffers_written_checkpoint │ 172414                        │
│ buffers_written_bgwriter   │ 475802                        │
│ buffers_written_backend    │ 7140                          │
│ buffers_written_ring       │ 0                             │
│ buffers_fsync_checkpointer │ 137                           │
│ buffers_fsync_bgwriter     │ 0                             │
│ buffers_fsync_backend      │ 0                             │
│ buffers_bgwriter_clean     │ 832616                        │
│ buffers_alloc_preclean     │ 1306572                       │
│ buffers_alloc_free         │ 0                             │
│ buffers_alloc_sweep        │ 4639                          │
│ buffers_alloc_ring         │ 767                           │
│ buffers_ticks_bgwriter     │ 4398290                       │
│ buffers_ticks_backend      │ 17098                         │
│ maxwritten_clean           │ 17                            │
│ stats_reset                │ 2019-06-10 20:17:56.087704-07 │
└────────────────────────────┴───────────────────────────────┘
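
For reference, the above amounts to something like (using psql's
expanded display, \x):

    SELECT pg_stat_reset_shared('bgwriter');
    -- ... let the pgbench run continue for a while ...
    SELECT * FROM pg_stat_bgwriter;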


Note that buffers_written_backend (like buffers_backend before it)
also accounts for file extensions - which bgwriter can't offload. We
should replace those writes with a non-write (i.e. fallocate) anyway.
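
Roughly: instead of writing out a zero-filled block on extension, the
extension path could do something like the following (a hypothetical
sketch, not part of this patch; fd and blocknum stand in for the real
smgr-level state):

    /* allocate on-disk space without transferring a block of zeroes */
    ret = posix_fallocate(fd, (off_t) blocknum * BLCKSZ, BLCKSZ);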

Greetings,

Andres Freund

[1] https://postgr.es/m/20160204155458.jrw3crmyscusdqf6%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/bgwriter-rewrite
[3] https://github.com/anarazel/postgres/tree/bgwriter-rewrite
From 53094143e3c1fc9a8090cce66e73e26d58c67b93 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 19 Feb 2016 12:07:51 -0800
Subject: [PATCH v7 1/2] Basic obstruction-free single producer, multiple
 consumer ringbuffer.

This is pretty darn limited, supporting only small queues - but could
easily be improved.
---
 src/backend/lib/Makefile  |   3 +-
 src/backend/lib/ringbuf.c | 161 ++++++++++++++++++++++++++++++++++++++
 src/include/lib/ringbuf.h |  72 +++++++++++++++++
 3 files changed, 235 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/ringbuf.c
 create mode 100644 src/include/lib/ringbuf.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 3c1ee1df83a..b0a63fba309 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+       ilist.o integerset.o knapsack.o pairingheap.o rbtree.o ringbuf.o \
+       stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/ringbuf.c b/src/backend/lib/ringbuf.c
new file mode 100644
index 00000000000..3de2a4977d8
--- /dev/null
+++ b/src/backend/lib/ringbuf.c
@@ -0,0 +1,161 @@
+/*-------------------------------------------------------------------------
+ *
+ * ringbuf.c
+ *
+ *	  Single producer, multiple consumer ringbuffer where consumption is
+ *	  obstruction-free (i.e. no progress guarantee, but a consumer that is
+ *	  stopped will not block progress).
+ *
+ * Implemented by essentially using an optimistic lock on the read side.
+ *
+ * XXX: It'd be nice if we could modify this so there's variants for push/pop
+ * that work for different concurrency scenarios. E.g. having spsc_push(),
+ * spmc_push(), ... - that'd avoid having to use different interfaces for
+ * different needs.
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/ringbuf.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/ringbuf.h"
+#include "storage/proc.h"
+
+static inline uint32
+ringbuf_backendid(ringbuf *rb, uint32 pos)
+{
+	return (pos >> 16) & 0xffff;
+}
+
+uint32
+ringbuf_elements(ringbuf *rb)
+{
+	uint32 read_off = ringbuf_pos(rb, pg_atomic_read_u32(&rb->read_state));
+	uint32 write_off = ringbuf_pos(rb, rb->write_off);
+
+	/* not wrapped around */
+	if (read_off <= write_off)
+	{
+		return write_off - read_off;
+	}
+
+	/* wrapped around */
+	return (rb->size - read_off) + write_off;
+}
+
+size_t
+ringbuf_size(size_t nelems)
+{
+	Assert(nelems <= 0x0000FFFF);
+	return sizeof(ringbuf) + sizeof(void *) * nelems;
+}
+
+/*
+ * Memory needs to be externally allocated and be at least
+ * ringbuf_size(nelems) large.
+ */
+ringbuf *
+ringbuf_create(void *target, size_t nelems)
+{
+	ringbuf *rb = (ringbuf *) target;
+
+	Assert(nelems <= 0x0000FFFF);
+
+	memset(target, 0, ringbuf_size(nelems));
+
+	rb->size = nelems;
+	pg_atomic_init_u32(&rb->read_state, 0);
+	rb->write_off = 0;
+
+	return rb;
+}
+
+bool
+ringbuf_push(ringbuf *rb, void *data)
+{
+	uint32 read_off = pg_atomic_read_u32(&rb->read_state);
+
+	/*
+	 * Check if full - can be outdated, but that's ok. New readers are just
+	 * going to further consume elements, never cause the buffer to become
+	 * full.
+	 */
+	if (ringbuf_pos(rb, read_off)
+		== ringbuf_pos(rb, ringbuf_advance_pos(rb, rb->write_off)))
+	{
+		return false;
+	}
+
+	rb->elements[ringbuf_pos(rb, rb->write_off)] = data;
+
+	/*
+	 * The write adding the data needs to be visible before the corresponding
+	 * increase of write_off is visible.
+	 */
+	pg_write_barrier();
+
+	rb->write_off = ringbuf_advance_pos(rb, rb->write_off);
+
+	return true;
+}
+
+
+bool
+ringbuf_pop(ringbuf *rb, void **data)
+{
+	void *ret;
+	uint32 mybackend = MyProc->backendId;
+
+	Assert((mybackend & 0x0000ffff) == mybackend);
+
+	while (true)
+	{
+		uint32 read_state = pg_atomic_read_u32(&rb->read_state);
+		uint32 read_off = ringbuf_pos(rb, read_state);
+		uint32 old_read_state = read_state;
+
+		/* check if empty - can be outdated, but that's ok */
+		if (read_off == ringbuf_pos(rb, rb->write_off))
+			return false;
+
+		/*
+		 * Add our backend id to the position, to detect wrap around.
+		 *
+		 * XXX: Skip if the ID already is ours. That's probably likely enough
+		 * to warrant the additional branch.
+		 */
+		read_state = (read_state & 0x0000ffff) | mybackend << 16;
+
+		/*
+		 * Mix the reader position into the current read_off, otherwise
+		 * unchanged. If the offset changed since, retry from start.
+		 *
+		 * NB: This also serves as the read barrier pairing with the write
+		 * barrier in ringbuf_push().
+		 */
+		if (!pg_atomic_compare_exchange_u32(&rb->read_state, &old_read_state,
+											read_state))
+			continue;
+		old_read_state = read_state; /* with backend id mixed in */
+
+		/* finally read the data */
+		ret = rb->elements[read_off];
+
+		/* compute next offset */
+		read_state = ringbuf_advance_pos(rb, read_state);
+
+		if (pg_atomic_compare_exchange_u32(&rb->read_state, &old_read_state,
+										   read_state))
+			break;
+	}
+
+	*data = ret;
+
+	return true;
+}
diff --git a/src/include/lib/ringbuf.h b/src/include/lib/ringbuf.h
new file mode 100644
index 00000000000..3be450bb8f8
--- /dev/null
+++ b/src/include/lib/ringbuf.h
@@ -0,0 +1,72 @@
+/*
+ * ringbuf.h
+ *
+ * Single writer, multiple reader, lockless & obstruction-free ringbuffer.
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/include/lib/ringbuf.h
+ */
+#ifndef RINGBUF_H
+#define RINGBUF_H
+
+#include "port/atomics.h"
+
+typedef struct ringbuf
+{
+	uint32 size;
+
+	/* 16 bit reader id, 16 bit offset */
+	/* XXX: probably should be on separate cachelines */
+	pg_atomic_uint32 read_state;
+	uint32 write_off;
+
+	void *elements[FLEXIBLE_ARRAY_MEMBER];
+} ringbuf;
+
+size_t ringbuf_size(size_t nelems);
+
+ringbuf *ringbuf_create(void *target, size_t nelems);
+
+static inline uint32
+ringbuf_pos(ringbuf *rb, uint32 pos)
+{
+	/*
+	 * XXX: replacing rb->size with a bitmask op would avoid expensive
+	 * divisions. Requiring a pow2 size seems ok.
+	 */
+	return (pos & 0x0000ffff) % rb->size;
+}
+
+/*
+ * Compute the new offset, slightly complicated by the fact that we only want
+ * to modify the lower 16 bits.
+ */
+static inline uint32
+ringbuf_advance_pos(ringbuf *rb, uint32 pos)
+{
+	return ((ringbuf_pos(rb, pos) + 1) & 0x0000FFFF) | (pos & 0xFFFF0000);
+}
+
+static inline bool
+ringbuf_empty(ringbuf *rb)
+{
+	uint32 read_state = pg_atomic_read_u32(&rb->read_state);
+
+	return ringbuf_pos(rb, read_state) == ringbuf_pos(rb, rb->write_off);
+}
+
+static inline bool
+ringbuf_full(ringbuf *rb)
+{
+	uint32 read_state = pg_atomic_read_u32(&rb->read_state);
+
+	return ringbuf_pos(rb, read_state) ==
+		ringbuf_pos(rb, ringbuf_advance_pos(rb, rb->write_off));
+}
+
+uint32 ringbuf_elements(ringbuf *rb);
+bool ringbuf_push(ringbuf *rb, void *data);
+bool ringbuf_pop(ringbuf *rb, void **data);
+
+#endif
-- 
2.22.0.dirty

From 7c5df799aa12dc43f4d8bcef78120225cda990e0 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 19 Feb 2016 12:07:51 -0800
Subject: [PATCH v7 2/2] Rewrite background writer.

This currently consists of two major parts:

1) Add more statistics, to be able to even evaluate the effects of
   bgwriter changes / problems. This should probably be split into a
   separate commit.

   It's remarkable how odd the set of current measurements is, and how
   many different mechanisms for transporting those values we
   currently have. The patch adds and replaces a few measurements, but
   doesn't yet do enough cleanup (have fewer transport mechanisms,
   split into different views).

2) A new bgwriter implementation (that can be turned on by setting the
   bgwriter_legacy GUC to false). There's a few major differences:

   a) bgwriter performs the clock sweep - that makes it a lot easier
      to actually find buffers worth cleaning. It's quite possible
      for the old bgwriter to get into situations where it can't do
      anything for a while, because all buffers have a usagecount > 0.
   b) When a buffer encountered by bgwriter during the clock sweep is
      already clean and has a usage/pin count of 0 (i.e. it can be
      reclaimed), it is also pushed onto the queue.
   c) It just has a ringbuffer of clean buffers that backends can
      drain. Bgwriter pushes (without any locks) entries onto the
      queue, backends can pop them off.
   d) The pacing logic is a lot simpler. There's a ringbuffer that
      bgwriter tries to fill. There's a low watermark that causes
      backends to wake up bgwriter.
---
 src/backend/access/transam/xlog.c     |   2 +
 src/backend/catalog/system_views.sql  |  25 ++-
 src/backend/postmaster/bgwriter.c     |   9 +-
 src/backend/postmaster/checkpointer.c |  38 ++--
 src/backend/postmaster/pgstat.c       |  20 ++-
 src/backend/storage/buffer/buf_init.c |  22 ++-
 src/backend/storage/buffer/bufmgr.c   | 198 +++++++++++++++++++--
 src/backend/storage/buffer/freelist.c | 240 +++++++++++++++++++-------
 src/backend/utils/adt/pgstatfuncs.c   |  69 +++++++-
 src/backend/utils/misc/guc.c          |  13 +-
 src/include/catalog/pg_proc.dat       |  74 ++++++--
 src/include/pgstat.h                  |  51 +++++-
 src/include/postmaster/bgwriter.h     |   3 +
 src/include/storage/buf_internals.h   |  30 +++-
 src/include/storage/bufmgr.h          |   4 +-
 src/test/regress/expected/rules.out   |  19 +-
 16 files changed, 670 insertions(+), 147 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1c7dd51b9f1..78c1d786fa4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8376,6 +8376,8 @@ LogCheckpointEnd(bool restartpoint)
 	BgWriterStats.m_checkpoint_sync_time +=
 		sync_secs * 1000 + sync_usecs / 1000;
 
+	BgWriterStats.m_buf_fsync_checkpointer += CheckpointStats.ckpt_sync_rels;
+
 	/*
 	 * All of the published timing statistics are accounted for.  Only
 	 * continue if a log message is to be written.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 78a103cdb95..d15aed10ad2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -898,12 +898,27 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
         pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
         pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
-        pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
-        pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
+
+        pg_stat_get_buf_written_checkpoints() AS buffers_written_checkpoint,
+        pg_stat_get_buf_written_bgwriter() AS buffers_written_bgwriter,
+        pg_stat_get_buf_written_backend() AS buffers_written_backend,
+        pg_stat_get_buf_written_ring() AS buffers_written_ring,
+
+        pg_stat_get_buf_fsync_checkpointer() AS buffers_fsync_checkpointer,
+        pg_stat_get_buf_fsync_bgwriter() AS buffers_fsync_bgwriter,
+        pg_stat_get_buf_fsync_backend() AS buffers_fsync_backend,
+
+        pg_stat_get_buf_bgwriter_clean() AS buffers_bgwriter_clean,
+
+        pg_stat_get_buf_alloc_preclean() AS buffers_alloc_preclean,
+        pg_stat_get_buf_alloc_free() AS buffers_alloc_free,
+        pg_stat_get_buf_alloc_sweep() AS buffers_alloc_sweep,
+        pg_stat_get_buf_alloc_ring() AS buffers_alloc_ring,
+
+        pg_stat_get_buf_ticks_bgwriter() AS buffers_ticks_bgwriter,
+        pg_stat_get_buf_ticks_backend() AS buffers_ticks_backend,
+
         pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_stat_progress_vacuum AS
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de5..526304fefc9 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -65,6 +65,7 @@
  * GUC parameters
  */
 int			BgWriterDelay = 200;
+bool		BgWriterLegacy = true;
 
 /*
  * Multiplier to apply to BgWriterDelay when we decide to hibernate.
@@ -264,7 +265,10 @@ BackgroundWriterMain(void)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync(&wb_context);
+		if (BgWriterLegacy)
+			can_hibernate = BgBufferSyncLegacy(&wb_context);
+		else
+			can_hibernate = BgBufferSyncNew(&wb_context);
 
 		/*
 		 * Send off activity statistics to the stats collector
@@ -366,7 +370,8 @@ BackgroundWriterMain(void)
 							 BgWriterDelay * HIBERNATE_FACTOR,
 							 WAIT_EVENT_BGWRITER_HIBERNATE);
 			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(-1);
+			if (BgWriterLegacy)
+				StrategyNotifyBgWriter(-1);
 		}
 
 		prev_hibernate = can_hibernate;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 13f152b4731..e5ecca1e3db 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -102,7 +102,7 @@
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
  *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * Unlike the checkpoint fields, num_written_*, num_fsync_*, and
  * the requests fields are protected by CheckpointerCommLock.
  *----------
  */
@@ -127,8 +127,11 @@ typedef struct
 	ConditionVariable start_cv; /* signaled when ckpt_started advances */
 	ConditionVariable done_cv;	/* signaled when ckpt_done advances */
 
-	uint32		num_backend_writes; /* counts user backend buffer writes */
-	uint32		num_backend_fsync;	/* counts user backend fsync calls */
+	uint32		num_written_backend; /* counts user backend buffer writes */
+	uint32		num_written_ring; /* counts ring buffer writes */
+
+	uint32		num_fsync_bgwriter;	/* counts bgwriter fsync calls */
+	uint32		num_fsync_backend;	/* counts user backend fsync calls */
 
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
@@ -1119,7 +1122,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 
 	/* Count all backend writes regardless of if they fit in the queue */
 	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
+		CheckpointerShmem->num_written_backend++;
 
 	/*
 	 * If the checkpointer isn't running or the request queue is full, the
@@ -1134,8 +1137,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		 * Count the subset of writes where backends have to do their own
 		 * fsync
 		 */
-		if (!AmBackgroundWriterProcess())
-			CheckpointerShmem->num_backend_fsync++;
+		if (AmBackgroundWriterProcess())
+			CheckpointerShmem->num_fsync_bgwriter++;
+		else
+			CheckpointerShmem->num_fsync_backend++;
 		LWLockRelease(CheckpointerCommLock);
 		return false;
 	}
@@ -1295,11 +1300,15 @@ AbsorbSyncRequests(void)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_written_backend;
+	BgWriterStats.m_buf_written_ring += CheckpointerShmem->num_written_ring;
+	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_fsync_backend;
+	BgWriterStats.m_buf_fsync_bgwriter += CheckpointerShmem->num_fsync_bgwriter;
 
-	CheckpointerShmem->num_backend_writes = 0;
-	CheckpointerShmem->num_backend_fsync = 0;
+	CheckpointerShmem->num_written_backend = 0;
+	CheckpointerShmem->num_written_ring = 0;
+	CheckpointerShmem->num_fsync_backend = 0;
+	CheckpointerShmem->num_fsync_bgwriter = 0;
 
 	/*
 	 * We try to avoid holding the lock for a long time by copying the request
@@ -1373,3 +1382,12 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+// FIXME: crappy API
+void
+ReportRingWrite(void)
+{
+	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+	CheckpointerShmem->num_written_ring++;
+	LWLockRelease(CheckpointerCommLock);
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28b517..9aa7b9b8139 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -6313,12 +6313,26 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.requested_checkpoints += msg->m_requested_checkpoints;
 	globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
 	globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
+
 	globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-	globalStats.buf_written_clean += msg->m_buf_written_clean;
-	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
+	globalStats.buf_written_bgwriter += msg->m_buf_written_bgwriter;
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
+	globalStats.buf_written_ring += msg->m_buf_written_ring;
+
+	globalStats.buf_fsync_checkpointer += msg->m_buf_fsync_checkpointer;
+	globalStats.buf_fsync_bgwriter += msg->m_buf_fsync_bgwriter;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-	globalStats.buf_alloc += msg->m_buf_alloc;
+
+	globalStats.buf_alloc_preclean += msg->m_buf_alloc_preclean;
+	globalStats.buf_alloc_free += msg->m_buf_alloc_free;
+	globalStats.buf_alloc_sweep += msg->m_buf_alloc_sweep;
+	globalStats.buf_alloc_ring += msg->m_buf_alloc_ring;
+
+	globalStats.buf_ticks_bgwriter += msg->m_buf_ticks_bgwriter;
+	globalStats.buf_ticks_backend += msg->m_buf_ticks_backend;
+
+	globalStats.buf_clean_bgwriter += msg->m_buf_clean_bgwriter;
+	globalStats.maxwritten_clean += msg->m_maxwritten_clean;
 }
 
 /* ----------
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ccd2c31c0b3..6154f75714f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "lib/ringbuf.h"
 #include "storage/bufmgr.h"
 #include "storage/buf_internals.h"
 
@@ -23,6 +24,7 @@ char	   *BufferBlocks;
 LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
+ringbuf *VictimBuffers = NULL;
 
 
 /*
@@ -70,7 +72,8 @@ InitBufferPool(void)
 	bool		foundBufs,
 				foundDescs,
 				foundIOLocks,
-				foundBufCkpt;
+				foundBufCkpt,
+				foundFreeBufs;
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *)
@@ -91,6 +94,10 @@ InitBufferPool(void)
 	LWLockRegisterTranche(LWTRANCHE_BUFFER_IO_IN_PROGRESS, "buffer_io");
 	LWLockRegisterTranche(LWTRANCHE_BUFFER_CONTENT, "buffer_content");
 
+	VictimBuffers = ShmemInitStruct("Free Buffers",
+									ringbuf_size(VICTIM_BUFFER_PRECLEAN_SIZE),
+									&foundFreeBufs);
+
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
 	 * shared memory, to avoid having to allocate significant amounts of
@@ -102,10 +109,11 @@ InitBufferPool(void)
 		ShmemInitStruct("Checkpoint BufferIds",
 						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
 
-	if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
+	if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt || foundFreeBufs)
 	{
 		/* should find all of these, or none of them */
-		Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
+		Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt && foundFreeBufs);
+
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
@@ -129,6 +137,7 @@ InitBufferPool(void)
 			/*
 			 * Initially link all the buffers together as unused. Subsequent
 			 * management of this list is done by freelist.c.
+			 * FIXME: remove once legacy bgwriter is removed
 			 */
 			buf->freeNext = i + 1;
 
@@ -139,8 +148,10 @@ InitBufferPool(void)
 							 LWTRANCHE_BUFFER_IO_IN_PROGRESS);
 		}
 
-		/* Correct last entry of linked list */
+		/* Correct last entry of linked list: FIXME: remove */
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+		/* FIXME: could fill the first few free buffers? */
+		VictimBuffers = ringbuf_create(VictimBuffers, VICTIM_BUFFER_PRECLEAN_SIZE);
 	}
 
 	/* Init other shared buffer-management stuff */
@@ -189,5 +200,8 @@ BufferShmemSize(void)
 	/* size of checkpoint sort array in bufmgr.c */
 	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
 
+	/* FIXME: better ringbuffer size */
+	size = add_size(size, ringbuf_size(VICTIM_BUFFER_PRECLEAN_SIZE));
+
 	return size;
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7332e6b5903..9d63244ba08 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -39,6 +39,7 @@
 #include "catalog/storage.h"
 #include "executor/instrument.h"
 #include "lib/binaryheap.h"
+#include "lib/ringbuf.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -866,11 +867,29 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start,
+					io_time;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
 		 * although we're essentially performing a write. At least on linux
@@ -1136,6 +1155,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 						UnpinBuffer(buf, true);
 						continue;
 					}
+
+					// FIXME: crappy API
+					StrategyReportWrite(strategy, buf);
 				}
 
 				/* OK, do the I/O */
@@ -1352,6 +1374,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  * trying to write it out.  We have to let them finish before we can
  * reclaim the buffer.
  *
+ * FIXME: ^^^
+ *
  * The buffer could get reclaimed by someone else while we are waiting
  * to acquire the necessary locks; if so, don't mess it up.
  */
@@ -2038,7 +2062,119 @@ BufferSync(int flags)
 }
 
 /*
- * BgBufferSync -- Write out some dirty buffers in the pool.
+ * BgBufferSyncNew -- Write out some dirty buffers in the pool.
+ *
+ * This is called periodically by the background writer process.
+ *
+ * Returns true if it's appropriate for the bgwriter process to go into
+ * low-power hibernation mode.
+ */
+bool
+BgBufferSyncNew(WritebackContext *wb_context)
+{
+	uint32      recent_alloc_preclean;
+	uint32      recent_alloc_free;
+	uint32      recent_alloc_sweep;
+	uint32      recent_alloc_ring;
+	uint32      strategy_passes;
+	uint64		nticks;
+	uint64		nticks_sum = 0;
+
+	/* Make sure we can handle the pin inside SyncOneBuffer */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	/* Know where to start, and report buffer alloc counts to pgstat */
+	StrategySyncStart(&strategy_passes,
+					  &recent_alloc_preclean,
+					  &recent_alloc_free,
+					  &recent_alloc_sweep,
+					  &recent_alloc_ring,
+					  &nticks);
+
+	/* Report buffer alloc counts to pgstat */
+	BgWriterStats.m_buf_alloc_preclean += recent_alloc_preclean;
+	BgWriterStats.m_buf_alloc_free += recent_alloc_free;
+	BgWriterStats.m_buf_alloc_sweep += recent_alloc_sweep;
+	BgWriterStats.m_buf_alloc_ring += recent_alloc_ring;
+	BgWriterStats.m_buf_ticks_backend += nticks;
+
+	/* go and populate freelist */
+	while (!ringbuf_full(VictimBuffers))
+	{
+		BufferDesc *bufHdr;
+		bool pushed;
+		bool dirty;
+		uint32		buf_state;
+
+		ReservePrivateRefCountEntry();
+
+		bufHdr = ClockSweep(NULL, &buf_state, &nticks);
+		nticks_sum += nticks;
+
+		dirty = buf_state & BM_DIRTY;
+
+		if (dirty)
+		{
+			SMgrRelation reln;
+			BufferTag tag;
+			LWLock *content_lock;
+
+
+			/*
+			 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
+			 * buffer is clean by the time we've locked it.)
+			 */
+			PinBuffer_Locked(bufHdr);
+
+			/* open relation before locking the page */
+			reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			LWLockAcquire(content_lock, LW_SHARED);
+			FlushBuffer(bufHdr, reln);
+			LWLockRelease(content_lock);
+
+			/* copy tag before releasing pin */
+			tag = bufHdr->tag;
+
+			UnpinBuffer(bufHdr, true);
+
+			pushed = ringbuf_push(VictimBuffers, bufHdr);
+
+			Assert(wb_context);
+			ScheduleBufferTagForWriteback(wb_context, &tag);
+
+			BgWriterStats.m_buf_written_bgwriter++;
+		}
+		else
+		{
+			UnlockBufHdr(bufHdr, buf_state);
+			pushed = ringbuf_push(VictimBuffers, bufHdr);
+
+			BgWriterStats.m_buf_clean_bgwriter++;
+		}
+
+		/* full, shouldn't normally happen, we're the only writer  */
+		if (!pushed)
+			break;
+
+		/* so we occasionally sleep, even if continually busy */
+		if (BgWriterStats.m_buf_written_bgwriter >= bgwriter_lru_maxpages)
+		{
+			BgWriterStats.m_maxwritten_clean++;
+			break;
+		}
+	}
+
+	BgWriterStats.m_buf_ticks_bgwriter += nticks_sum;
+
+	return BgWriterStats.m_buf_written_bgwriter == 0 &&
+		BgWriterStats.m_buf_clean_bgwriter == 0;
+}
+
+/*
+ * BgBufferSyncLegacy -- Write out some dirty buffers in the pool.
  *
  * This is called periodically by the background writer process.
  *
@@ -2049,12 +2185,16 @@ BufferSync(int flags)
  * bgwriter_lru_maxpages to 0.)
  */
 bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSyncLegacy(WritebackContext *wb_context)
 {
 	/* info obtained from freelist.c */
 	int			strategy_buf_id;
 	uint32		strategy_passes;
-	uint32		recent_alloc;
+	uint32      recent_alloc_preclean;
+	uint32      recent_alloc_free;
+	uint32      recent_alloc_sweep;
+	uint32      recent_alloc_ring;
+	uint64		recent_ticks;
 
 	/*
 	 * Information saved between calls so we can determine the strategy
@@ -2090,16 +2230,25 @@ BgBufferSync(WritebackContext *wb_context)
 
 	/* Variables for final smoothed_density update */
 	long		new_strategy_delta;
-	uint32		new_recent_alloc;
+	uint32		new_recent_alloc_sweep;
 
 	/*
 	 * Find out where the freelist clock sweep currently is, and how many
 	 * buffer allocations have happened since our last call.
 	 */
-	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+	strategy_buf_id = StrategySyncStart(&strategy_passes,
+										&recent_alloc_preclean,
+										&recent_alloc_free,
+										&recent_alloc_sweep,
+										&recent_alloc_ring,
+										&recent_ticks);
 
 	/* Report buffer alloc counts to pgstat */
-	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_alloc_preclean += recent_alloc_preclean;
+	BgWriterStats.m_buf_alloc_free += recent_alloc_free;
+	BgWriterStats.m_buf_alloc_sweep += recent_alloc_sweep;
+	BgWriterStats.m_buf_alloc_ring += recent_alloc_ring;
+	BgWriterStats.m_buf_ticks_backend += recent_ticks;
 
 	/*
 	 * If we're not running the LRU scan, just stop after doing the stats
@@ -2196,9 +2345,9 @@ BgBufferSync(WritebackContext *wb_context)
 	 *
 	 * If the strategy point didn't move, we don't update the density estimate
 	 */
-	if (strategy_delta > 0 && recent_alloc > 0)
+	if (strategy_delta > 0 && recent_alloc_sweep > 0)
 	{
-		scans_per_alloc = (float) strategy_delta / (float) recent_alloc;
+		scans_per_alloc = (float) strategy_delta / (float) recent_alloc_sweep;
 		smoothed_density += (scans_per_alloc - smoothed_density) /
 			smoothing_samples;
 	}
@@ -2216,10 +2365,10 @@ BgBufferSync(WritebackContext *wb_context)
 	 * a true average we want a fast-attack, slow-decline behavior: we
 	 * immediately follow any increase.
 	 */
-	if (smoothed_alloc <= (float) recent_alloc)
-		smoothed_alloc = recent_alloc;
+	if (smoothed_alloc <= (float) recent_alloc_sweep)
+		smoothed_alloc = recent_alloc_sweep;
 	else
-		smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
+		smoothed_alloc += ((float) recent_alloc_sweep - smoothed_alloc) /
 			smoothing_samples;
 
 	/* Scale the estimate by a GUC to allow more aggressive tuning. */
@@ -2297,7 +2446,7 @@ BgBufferSync(WritebackContext *wb_context)
 			reusable_buffers++;
 	}
 
-	BgWriterStats.m_buf_written_clean += num_written;
+	BgWriterStats.m_buf_written_bgwriter += num_written;
 
 #ifdef BGW_DEBUG
 	elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
@@ -2317,22 +2466,22 @@ BgBufferSync(WritebackContext *wb_context)
 	 * density estimates.
 	 */
 	new_strategy_delta = bufs_to_lap - num_to_scan;
-	new_recent_alloc = reusable_buffers - reusable_buffers_est;
-	if (new_strategy_delta > 0 && new_recent_alloc > 0)
+	new_recent_alloc_sweep = reusable_buffers - reusable_buffers_est;
+	if (new_strategy_delta > 0 && new_recent_alloc_sweep > 0)
 	{
-		scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc;
+		scans_per_alloc = (float) new_strategy_delta / (float) new_recent_alloc_sweep;
 		smoothed_density += (scans_per_alloc - smoothed_density) /
 			smoothing_samples;
 
 #ifdef BGW_DEBUG
 		elog(DEBUG2, "bgwriter: cleaner density alloc=%u scan=%ld density=%.2f new smoothed=%.2f",
-			 new_recent_alloc, new_strategy_delta,
+			 new_recent_alloc_sweep, new_strategy_delta,
 			 scans_per_alloc, smoothed_density);
 #endif
 	}
 
 	/* Return true if OK to hibernate */
-	return (bufs_to_lap == 0 && recent_alloc == 0);
+	return (bufs_to_lap == 0 && new_recent_alloc_sweep == 0);
 }
 
 /*
@@ -4321,6 +4470,8 @@ void
 IssuePendingWritebacks(WritebackContext *context)
 {
 	int			i;
+	instr_time	io_start,
+				io_time;
 
 	if (context->nr_pending == 0)
 		return;
@@ -4332,6 +4483,9 @@ IssuePendingWritebacks(WritebackContext *context)
 	qsort(&context->pending_writebacks, context->nr_pending,
 		  sizeof(PendingWriteback), buffertag_comparator);
 
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -4381,6 +4535,14 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, tag.forkNum, tag.blockNum, nblocks);
 	}
 
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+	}
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 06659ab2653..6583f1c3815 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,7 +15,9 @@
  */
 #include "postgres.h"
 
+#include "lib/ringbuf.h"
 #include "port/atomics.h"
+#include "postmaster/bgwriter.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
@@ -51,7 +53,14 @@ typedef struct
 	 * overflow during a single bgwriter cycle.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
-	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
+
+	/* Buffers allocated since last reset */
+	pg_atomic_uint32 numBufferAllocsPreclean;
+	pg_atomic_uint32 numBufferAllocsFree;
+	pg_atomic_uint32 numBufferAllocsSweep;
+	pg_atomic_uint32 numBufferAllocsRing;
+
+	pg_atomic_uint64 numBufferTicksBackend;
 
 	/*
 	 * Bgworker process to be notified upon activity or -1 if none. See
@@ -168,6 +177,62 @@ ClockSweepTick(void)
 	return victim;
 }
 
+BufferDesc *
+ClockSweep(BufferAccessStrategy strategy, uint32 *buf_state, uint64 *nticks)
+{
+	BufferDesc *buf;
+	int			trycounter;
+	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */
+	uint64		local_nticks = 0;
+
+	trycounter = NBuffers;
+	for (;;)
+	{
+
+		buf = GetBufferDescriptor(ClockSweepTick());
+		local_nticks++;
+
+		/*
+		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+		 * it; decrement the usage_count (unless pinned) and keep scanning.
+		 */
+		local_buf_state = LockBufHdr(buf);
+
+		if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
+		{
+			if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
+			{
+				local_buf_state -= BUF_USAGECOUNT_ONE;
+
+				trycounter = NBuffers;
+			}
+			else
+			{
+				/* Found a usable buffer */
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				*buf_state = local_buf_state;
+				*nticks = local_nticks;
+
+				return buf;
+			}
+		}
+		else if (--trycounter == 0)
+		{
+			/*
+			 * We've scanned all the buffers without making any state changes,
+			 * so all the buffers are pinned (or were when we looked at them).
+			 * We could hope that someone will free one eventually, but it's
+			 * probably better to fail than to risk getting stuck in an
+			 * infinite loop.
+			 */
+			UnlockBufHdr(buf, local_buf_state);
+			elog(ERROR, "no unpinned buffers available");
+		}
+		UnlockBufHdr(buf, local_buf_state);
+	}
+}
+
 /*
  * have_free_buffer -- a lockless check to see if there is a free buffer in
  *					   buffer pool.
@@ -202,8 +267,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 {
 	BufferDesc *buf;
 	int			bgwprocno;
-	int			trycounter;
 	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */
+	uint64		nticks;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
@@ -229,26 +294,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	 * some arbitrary process.
 	 */
 	bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
-	if (bgwprocno != -1)
+	if (BgWriterLegacy)
 	{
-		/* reset bgwprocno first, before setting the latch */
-		StrategyControl->bgwprocno = -1;
+		if (bgwprocno != -1)
+		{
+			/* reset bgwprocno first, before setting the latch */
+			StrategyControl->bgwprocno = -1;
 
-		/*
-		 * Not acquiring ProcArrayLock here which is slightly icky. It's
-		 * actually fine because procLatch isn't ever freed, so we just can
-		 * potentially set the wrong process' (or no process') latch.
-		 */
-		SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+			/*
+			 * Not acquiring ProcArrayLock here which is slightly icky. It's
+			 * actually fine because procLatch isn't ever freed, so we just can
+			 * potentially set the wrong process' (or no process') latch.
+			 */
+			SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+		}
 	}
 
-	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
-	 */
-	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
-
 	/*
 	 * First check, without acquiring the lock, whether there's buffers in the
 	 * freelist. Since we otherwise don't require the spinlock in every
@@ -302,6 +363,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 			if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
 				&& BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
 			{
+				// FIXME: possible to do outside of lock?
+				pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocsFree, 1);
+
 				if (strategy != NULL)
 					AddBufferToRing(strategy, buf);
 				*buf_state = local_buf_state;
@@ -312,51 +376,81 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 		}
 	}
 
-	/* Nothing on the freelist, so run the "clock sweep" algorithm */
-	trycounter = NBuffers;
-	for (;;)
+	if (!BgWriterLegacy)
 	{
-		buf = GetBufferDescriptor(ClockSweepTick());
+		int i = 0;
 
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; decrement the usage_count (unless pinned) and keep scanning.
+		 * Try to get a buffer from the clean buffer list.
 		 */
-		local_buf_state = LockBufHdr(buf);
-
-		if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
+		while (!ringbuf_empty(VictimBuffers))
 		{
-			if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
-			{
-				local_buf_state -= BUF_USAGECOUNT_ONE;
+			BufferDesc *buf;
+			bool found;
+			uint32 elements;
 
-				trycounter = NBuffers;
+			found = ringbuf_pop(VictimBuffers, (void *)&buf);
+
+			/* If the ringbuffer is sufficiently depleted, wakeup the bgwriter. */
+			if (bgwprocno != -1 &&
+				(!found ||
+				 (elements = ringbuf_elements(VictimBuffers)) < VICTIM_BUFFER_PRECLEAN_SIZE / 4))
+			{
+#if 0
+				if (!found)
+					elog(LOG, "signalling bgwriter: empty");
+				else
+					elog(LOG, "signalling bgwriter: watermark: %u %u/%u",
+						 elements, VICTIM_BUFFER_PRECLEAN_SIZE / 4, VICTIM_BUFFER_PRECLEAN_SIZE);
+#endif
+				SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+			}
+
+			if (!found)
+				break;
+
+			/* check if the buffer is still unused, done if so */
+			local_buf_state = LockBufHdr(buf);
+			if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
+				&& BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
+			{
+				// FIXME: possible to do outside of lock?
+				pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocsPreclean, 1);
+
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				*buf_state = local_buf_state;
+				return buf;
 			}
 			else
 			{
-				/* Found a usable buffer */
-				if (strategy != NULL)
-					AddBufferToRing(strategy, buf);
-				*buf_state = local_buf_state;
-				return buf;
+				UnlockBufHdr(buf, local_buf_state);
+				//ereport(LOG, (errmsg("buffer %u since reused (hand at %u)",
+				//					 buf->buf_id,
+				//					 pg_atomic_read_u32(&StrategyControl->nextVictimBuffer) % NBuffers),
+				//			  errhidestmt(true)));
 			}
+
+			i++;
 		}
-		else if (--trycounter == 0)
-		{
-			/*
-			 * We've scanned all the buffers without making any state changes,
-			 * so all the buffers are pinned (or were when we looked at them).
-			 * We could hope that someone will free one eventually, but it's
-			 * probably better to fail than to risk getting stuck in an
-			 * infinite loop.
-			 */
-			UnlockBufHdr(buf, local_buf_state);
-			elog(ERROR, "no unpinned buffers available");
-		}
-		UnlockBufHdr(buf, local_buf_state);
+
+#if 0
+		ereport(LOG, (errmsg("ringbuf empty after %u cycles", i),
+					  errhidestmt(true)));
+#endif
+
 	}
+
+	/* Nothing on the freelist, so run the "clock sweep" algorithm */
+	buf = ClockSweep(strategy, buf_state, &nticks);
+
+	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocsSweep, 1);
+	pg_atomic_fetch_add_u64(&StrategyControl->numBufferTicksBackend, nticks);
+
+	return buf;
 }
 
+
 /*
  * StrategyFreeBuffer: put a buffer on the freelist
  */
@@ -381,18 +475,22 @@ StrategyFreeBuffer(BufferDesc *buf)
 }
 
 /*
- * StrategySyncStart -- tell BufferSync where to start syncing
+ * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
- * The result is the buffer index of the best buffer to sync first.
- * BufferSync() will proceed circularly around the buffer array from there.
+ * The result is the buffer index below the current clock-hand. BgBufferSync()
+ * will proceed circularly around the buffer array from there.
  *
- * In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed.  The alloc count is reset after
- * being read.
+ * In addition, we return the completed-pass count (which is effectively the
+ * higher-order bits of nextVictimBuffer) and the counts of recent buffer
+ * allocations.  The allocation counts are reset after being read.
  */
 int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(uint32 *complete_passes,
+				  uint32 *alloc_preclean,
+				  uint32 *alloc_free,
+				  uint32 *alloc_sweep,
+				  uint32 *alloc_ring,
+				  uint64 *ticks_backend)
 {
 	uint32		nextVictimBuffer;
 	int			result;
@@ -410,13 +508,16 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		 * completePasses could be incremented. C.f. ClockSweepTick().
 		 */
 		*complete_passes += nextVictimBuffer / NBuffers;
-	}
 
-	if (num_buf_alloc)
-	{
-		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
 	}
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	*alloc_preclean = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocsPreclean, 0);
+	*alloc_free = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocsFree, 0);
+	*alloc_sweep = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocsSweep, 0);
+	*alloc_ring = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocsRing, 0);
+	*ticks_backend = pg_atomic_exchange_u64(&StrategyControl->numBufferTicksBackend, 0);
+
 	return result;
 }
 
@@ -517,7 +618,11 @@ StrategyInitialize(bool init)
 
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
-		pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocsPreclean, 0);
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocsFree, 0);
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocsSweep, 0);
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocsRing, 0);
+		pg_atomic_init_u64(&StrategyControl->numBufferTicksBackend, 0);
 
 		/* No pending notification */
 		StrategyControl->bgwprocno = -1;
@@ -645,6 +750,9 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
 		&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
 	{
+		// FIXME: possible to do outside of lock?
+		pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocsRing, 1);
+
 		strategy->current_was_in_ring = true;
 		*buf_state = local_buf_state;
 		return buf;
@@ -702,3 +810,11 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
 
 	return true;
 }
+
+void
+StrategyReportWrite(BufferAccessStrategy strategy,
+					BufferDesc *buf)
+{
+	if (strategy->current_was_in_ring)
+		ReportRingWrite();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 05240bfd142..d0d163ea35a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1604,15 +1604,45 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 }
 
 Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
+pg_stat_get_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
 }
 
 Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
+pg_stat_get_buf_written_bgwriter(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_bgwriter);
+}
+
+Datum
+pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+}
+
+Datum
+pg_stat_get_buf_written_ring(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_ring);
+}
+
+Datum
+pg_stat_get_buf_ticks_bgwriter(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_ticks_bgwriter);
+}
+
+Datum
+pg_stat_get_buf_ticks_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_ticks_backend);
+}
+
+Datum
+pg_stat_get_buf_bgwriter_clean(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_clean_bgwriter);
 }
 
 Datum
@@ -1641,10 +1671,17 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 	PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
 }
 
+// FIXME: name
 Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buf_fsync_checkpointer(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_checkpointer);
+}
+
+Datum
+pg_stat_get_buf_fsync_bgwriter(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_bgwriter);
 }
 
 Datum
@@ -1654,9 +1691,27 @@ pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 }
 
 Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
+pg_stat_get_buf_alloc_preclean(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc_preclean);
+}
+
+Datum
+pg_stat_get_buf_alloc_free(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc_free);
+}
+
+Datum
+pg_stat_get_buf_alloc_sweep(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc_sweep);
+}
+
+Datum
+pg_stat_get_buf_alloc_ring(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc_ring);
 }
 
 Datum
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a683..425d057a475 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1434,6 +1434,17 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"bgwriter_legacy", PGC_SIGHUP, RESOURCES_BGWRITER,
+			gettext_noop("Use legacy bgwriter algorithm."),
+			NULL
+		},
+		&BgWriterLegacy,
+		true,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"trace_notify", PGC_USERSET, DEVELOPER_OPTIONS,
 			gettext_noop("Generates debugging output for LISTEN and NOTIFY."),
@@ -2734,7 +2745,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MS
 		},
 		&BgWriterDelay,
-		200, 10, 10000,
+		200, 1, 10000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87335248a03..464e088c346 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5325,16 +5325,70 @@
   proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
   proparallel => 'r', prorettype => 'int8', proargtypes => '',
   prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
+
 { oid => '2771',
   descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
-  proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
+  proname => 'pg_stat_get_buf_written_checkpoints', provolatile => 's',
   proparallel => 'r', prorettype => 'int8', proargtypes => '',
-  prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
+  prosrc => 'pg_stat_get_buf_written_checkpoints' },
 { oid => '2772',
   descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
-  proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
+  proname => 'pg_stat_get_buf_written_bgwriter', provolatile => 's',
   proparallel => 'r', prorettype => 'int8', proargtypes => '',
-  prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
+  prosrc => 'pg_stat_get_buf_written_bgwriter' },
+
+{ oid => '2775',
+  descr => 'statistics: number of buffers written by backends while cleaning dirty buffers',
+  proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_buf_written_backend' },
+{ oid => '270',
+  descr => 'statistics: number of buffers written by backends when recycling ring entries',
+  proname => 'pg_stat_get_buf_written_ring', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_buf_written_ring' },
+
+{ oid => '271',
+  descr => 'statistics: number of fsync requests processed by checkpointer',
+  proname => 'pg_stat_get_buf_fsync_checkpointer', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_buf_fsync_checkpointer' },
+{ oid => '272',
+  descr => 'statistics: number of bgwriter buffer writes that did their own fsync',
+  proname => 'pg_stat_get_buf_fsync_bgwriter', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_buf_fsync_bgwriter' },
+{ oid => '3063',
+  descr => 'statistics: number of backend writes that did their own fsync',
+  proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
+  proparallel => 'r', prorettype => 'int8', proargtypes => '',
+  prosrc => 'pg_stat_get_buf_fsync_backend' },
+
+{ oid => '273', descr => 'statistics: number of reusable clean buffers discovered by bgwriter',
+  proname => 'pg_stat_get_buf_bgwriter_clean', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_bgwriter_clean' },
+
+{ oid => '380', descr => 'statistics: number of backend buffer allocations via preclean list',
+  proname => 'pg_stat_get_buf_alloc_preclean', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc_preclean' },
+{ oid => '2859', descr => 'statistics: number of backend buffer allocations via backend clock sweep',
+  proname => 'pg_stat_get_buf_alloc_sweep', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc_sweep' },
+{ oid => '381', descr => 'statistics: number of backend buffer allocations via ring buffer',
+  proname => 'pg_stat_get_buf_alloc_ring', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc_ring' },
+{ oid => '421', descr => 'statistics: number of backend buffer allocations via free list',
+  proname => 'pg_stat_get_buf_alloc_free', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc_free' },
+
+{ oid => '560', descr => 'statistics: number of clock sweep ticks by bgwriter',
+  proname => 'pg_stat_get_buf_ticks_bgwriter', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_ticks_bgwriter' },
+{ oid => '561', descr => 'statistics: number of clock sweep ticks by backend',
+  proname => 'pg_stat_get_buf_ticks_backend', provolatile => 's', proparallel => 'r',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_ticks_backend' },
+
+
 { oid => '2773',
   descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
   proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5354,18 +5408,6 @@
   proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
   proparallel => 'r', prorettype => 'float8', proargtypes => '',
   prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
-  proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
-  proparallel => 'r', prorettype => 'int8', proargtypes => '',
-  prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
-  descr => 'statistics: number of backend buffer writes that did their own fsync',
-  proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
-  proparallel => 'r', prorettype => 'int8', proargtypes => '',
-  prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
-  proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
-  prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
 { oid => '2978', descr => 'statistics: number of function calls',
   proname => 'pg_stat_get_function_calls', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a1883..54c4765fb11 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -413,14 +413,30 @@ typedef struct PgStat_MsgBgWriter
 
 	PgStat_Counter m_timed_checkpoints;
 	PgStat_Counter m_requested_checkpoints;
-	PgStat_Counter m_buf_written_checkpoints;
-	PgStat_Counter m_buf_written_clean;
-	PgStat_Counter m_maxwritten_clean;
-	PgStat_Counter m_buf_written_backend;
-	PgStat_Counter m_buf_fsync_backend;
-	PgStat_Counter m_buf_alloc;
 	PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
+
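+	/* buffers written, by source */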
+	PgStat_Counter m_buf_written_checkpoints;
+	PgStat_Counter m_buf_written_bgwriter;
+	PgStat_Counter m_buf_written_backend;
+	PgStat_Counter m_buf_written_ring;
+
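+	/* buffer writes that had to do their own fsync, by process */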
+	PgStat_Counter m_buf_fsync_checkpointer;
+	PgStat_Counter m_buf_fsync_bgwriter;
+	PgStat_Counter m_buf_fsync_backend;
+
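+	/* reusable clean buffers discovered by bgwriter */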
+	PgStat_Counter m_buf_clean_bgwriter;
+
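+	/* buffer allocations, by source of the victim buffer */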
+	PgStat_Counter m_buf_alloc_preclean;
+	PgStat_Counter m_buf_alloc_free;
+	PgStat_Counter m_buf_alloc_sweep;
+	PgStat_Counter m_buf_alloc_ring;
+
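+	/* clock-sweep ticks, by process */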
+	PgStat_Counter m_buf_ticks_bgwriter;
+	PgStat_Counter m_buf_ticks_backend;
+
+	PgStat_Counter m_maxwritten_clean;
 } PgStat_MsgBgWriter;
 
 /* ----------
@@ -699,16 +715,33 @@ typedef struct PgStat_ArchiverStats
 typedef struct PgStat_GlobalStats
 {
 	TimestampTz stats_timestamp;	/* time of stats file update */
+
 	PgStat_Counter timed_checkpoints;
 	PgStat_Counter requested_checkpoints;
 	PgStat_Counter checkpoint_write_time;	/* times in milliseconds */
 	PgStat_Counter checkpoint_sync_time;
+
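+	/* cumulative versions of the counters in PgStat_MsgBgWriter */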
 	PgStat_Counter buf_written_checkpoints;
-	PgStat_Counter buf_written_clean;
-	PgStat_Counter maxwritten_clean;
+	PgStat_Counter buf_written_bgwriter;
 	PgStat_Counter buf_written_backend;
+	PgStat_Counter buf_written_ring;
+
+	PgStat_Counter buf_fsync_checkpointer;
+	PgStat_Counter buf_fsync_bgwriter;
 	PgStat_Counter buf_fsync_backend;
-	PgStat_Counter buf_alloc;
+
+	PgStat_Counter buf_clean_bgwriter;
+
+	PgStat_Counter buf_alloc_preclean;
+	PgStat_Counter buf_alloc_free;
+	PgStat_Counter buf_alloc_sweep;
+	PgStat_Counter buf_alloc_ring;
+
+	PgStat_Counter buf_ticks_bgwriter;
+	PgStat_Counter buf_ticks_backend;
+
+	PgStat_Counter maxwritten_clean;
+
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 630366f49ef..892e24e0832 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,6 +26,7 @@ extern int	BgWriterDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
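+/* GUC: if true, use the legacy bgwriter instead of the ringbuffer-based one */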
+extern bool	BgWriterLegacy;
 
 extern void BackgroundWriterMain(void) pg_attribute_noreturn();
 extern void CheckpointerMain(void) pg_attribute_noreturn();
@@ -40,6 +41,8 @@ extern void AbsorbSyncRequests(void);
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
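+/* Report a ring buffer write, for pg_stat_bgwriter's buffers_written_ring. */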
+extern void ReportRingWrite(void);
+
 extern bool FirstCallSinceLastCheckpoint(void);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index df2dda7e7e7..1b58b1db0df 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -142,7 +142,7 @@ typedef struct buftag
  * single atomic operation, without actually acquiring and releasing spinlock;
  * for instance, increase or decrease refcount.  buf_id field never changes
  * after initialization, so does not need locking.  freeNext is protected by
- * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
+ * the buffer_strategy_lock, not the buffer header lock (XXX: remove).  The LWLock can take care
  * of itself.  The buffer header lock is *not* used to control access to the
  * data in the buffer!
  *
@@ -184,7 +184,9 @@ typedef struct BufferDesc
 	pg_atomic_uint32 state;
 
 	int			wait_backend_pid;	/* backend PID of pin-count waiter */
-	int			freeNext;		/* link in freelist chain */
+
+	/* link in freelist chain: only used with legacy bgwriter */
+	int			freeNext;
 
 	LWLock		content_lock;	/* to lock access to buffer contents */
 } BufferDesc;
@@ -232,10 +234,18 @@ extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
 /*
  * The freeNext field is either the index of the next freelist entry,
  * or one of these special values:
+ * XXX: Remove these when removing the legacy bgwriter.
  */
 #define FREENEXT_END_OF_LIST	(-1)
 #define FREENEXT_NOT_IN_LIST	(-2)
 
+/*
+ * Size of the ringbuffer of pre-cleaned victim buffers.
+ *
+ * FIXME: Probably needs to depend on NBuffers or such.
+ */
+#define VICTIM_BUFFER_PRECLEAN_SIZE 4096
+
 /*
  * Functions for acquiring/releasing a shared buffer header's spinlock.  Do
  * not apply these to local buffers!
@@ -274,6 +284,7 @@ typedef struct WritebackContext
 /* in buf_init.c */
 extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
 extern PGDLLIMPORT WritebackContext BackendWritebackContext;
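+/*
+ * Ringbuffer of buffers believed to be clean and reusable, filled by
+ * bgwriter's clock sweep and popped by backends needing a victim buffer.
+ * Entries may be concurrently reused before being popped.
+ */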
+extern PGDLLIMPORT struct ringbuf *VictimBuffers;
 
 /* in localbuf.c */
 extern BufferDesc *LocalBufferDescriptors;
@@ -306,13 +317,24 @@ extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
 /* freelist.c */
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
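+/*
+ * Perform clock sweep to find a victim buffer.  Used by bgwriter to refill
+ * the preclean ringbuffer, and by backends when the ringbuffer is empty.
+ * The number of sweep ticks is reported via *nticks, feeding the
+ * buffers_ticks_* counters.
+ */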
+extern BufferDesc *ClockSweep(BufferAccessStrategy strategy,
+							  uint32 *buf_state, uint64 *nticks);
+
 extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 								 BufferDesc *buf);
 
-extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
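+/* Report the write of a buffer belonging to a buffer access strategy ring. */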
+extern void StrategyReportWrite(BufferAccessStrategy strategy,
+								BufferDesc *buf);
+
+extern int	StrategySyncStart(uint32 *complete_passes,
+							  uint32 *alloc_preclean,
+							  uint32 *alloc_free,
+							  uint32 *alloc_sweep,
+							  uint32 *alloc_ring,
+							  uint64 *ticks_backend);
 extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1c..9957b9c8c27 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -221,7 +221,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+
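+/*
+ * Per-cycle bgwriter work routines for the ringbuffer-based and the legacy
+ * implementation; which one runs is selected by the bgwriter_legacy GUC.
+ */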
+extern bool BgBufferSyncNew(struct WritebackContext *wb_context);
+extern bool BgBufferSyncLegacy(struct WritebackContext *wb_context);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7d365c48d12..da436d982ab 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1796,12 +1796,21 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
     pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
     pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
-    pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
-    pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
+    pg_stat_get_buf_written_checkpoints() AS buffers_written_checkpoint,
+    pg_stat_get_buf_written_bgwriter() AS buffers_written_bgwriter,
+    pg_stat_get_buf_written_backend() AS buffers_written_backend,
+    pg_stat_get_buf_written_ring() AS buffers_written_ring,
+    pg_stat_get_buf_fsync_checkpointer() AS buffers_fsync_checkpointer,
+    pg_stat_get_buf_fsync_bgwriter() AS buffers_fsync_bgwriter,
+    pg_stat_get_buf_fsync_backend() AS buffers_fsync_backend,
+    pg_stat_get_buf_bgwriter_clean() AS buffers_bgwriter_clean,
+    pg_stat_get_buf_alloc_preclean() AS buffers_alloc_preclean,
+    pg_stat_get_buf_alloc_free() AS buffers_alloc_free,
+    pg_stat_get_buf_alloc_sweep() AS buffers_alloc_sweep,
+    pg_stat_get_buf_alloc_ring() AS buffers_alloc_ring,
+    pg_stat_get_buf_ticks_bgwriter() AS buffers_ticks_bgwriter,
+    pg_stat_get_buf_ticks_backend() AS buffers_ticks_backend,
     pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-    pg_stat_get_buf_written_backend() AS buffers_backend,
-    pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-    pg_stat_get_buf_alloc() AS buffers_alloc,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,
-- 
2.22.0.dirty
