On 15/06/2021 14:20, Dilip Kumar wrote:
Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.  Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use some modified versions of the write_relmap_file() and
WAL-log the relmap create operation along with the file content.  Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
don't need the relcache.  Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenode.

I guess that would work, but you could also walk the database directory like copydir() does. How you find the relations to copy is orthogonal to whether you WAL-log them or use checkpoints. And whether you use the buffer cache is also orthogonal to the rest of the proposal; you could issue FlushDatabaseBuffers() instead of a checkpoint.

Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation.  Now read the source relfilenode block by block
using ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be less weird than the current system. As Julien also pointed out, the question is, are there people using on "CREATE DATABASE foo TEMPLATE bar" to copy a large source database, on the premise that it's fast because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged system if the database is small, and the old system with checkpoints if it's large. But I don't like idea of having to maintain both.

- Heikki


Reply via email to