Hi Marco, thank you for sending an updated patch. I am writing down a report of this initial (and partial) review.
IMPORTANT: This patch is not complete, as stated by Marco. See the "Conclusions" section for my proposed TODO list. == Patch application I have been able to successfully apply your patch and compile it. Regression tests passed. == Initial run I have created a fresh new instance of PostgreSQL and activated streaming replication to be used by pg_basebackup. I have done a pgbench run with scale 100. I have taken a full consistent backup with pg_basebackup (in plain format): pg_basebackup -v -F p -D $BACKUPDIR/backup-$(date '+%s') -x I have been able to verify that the backup_profile is correctly placed in the destination PGDATA directory. Here is an excerpt: POSTGRESQL BACKUP PROFILE 1 START WAL LOCATION: 0/3000058 (file 000000010000000000000003) CHECKPOINT LOCATION: 0/300008C BACKUP METHOD: streamed BACKUP FROM: master START TIME: 2015-01-14 10:07:07 CET LABEL: pg_basebackup base backup FILE LIST \N \N t 1421226427 206 backup_label \N \N t 1421225508 88 postgresql.auto.conf ... As suggested by Marco, I have manually taken the LSN from this file (next version must do this automatically). I have then executed pg_basebackup and activated the incremental feature by using the LSN from the previous backup, as follows: LSN=$(awk '/^START WAL/{print $4}' backup_profile) pg_basebackup -v -F p -D $BACKUPDIR/backup-$(date '+%s') -I $LSN -x The time taken by this operation has been much lower than the previous one and the size is much lower (I have not done any operation in the meantime): du -hs backup-1421226* 1,5G backup-1421226427 17M backup-1421226427 I have done some checks on the file system and then used the prototype of recovery script in Python written by Marco. ./recover.py backup-1421226427 backup-1421226427 new-data The cluster started successfully. I have then run a pg_dump of the pgbench database and were able to reload it on the initial cluster. == Conclusions The first run of this patch seems promising. While the discussion on the LSN map continues (which is mainly an optimisation of this patch), I would really like to see this patch progress as it would be a killer feature in several contexts (not in every context). Just in this period we are releasing file based incremental backup for Barman and customers using the alpha version are experiencing on average a deduplication ratio between 50% to 70%. This is for example an excerpt of "barman show-backup" from one of our customers (a daily saving of 550GB is not bad): Base backup information: Disk usage : 1.1 TiB (1.1 TiB with WALs) Incremental size : 564.6 GiB (-50.60%) ... My opinion, Marco, is that for version 5 of this patch, you: 1) update the information on the wiki (it is outdated - I know you have been busy with LSN map optimisation) 2) modify pg_basebackup in order to accept a directory (or tar file) and automatically detect the LSN from the backup profile 3) add the documentation regarding the backup profile and pg_basebackup Once we have all of this, we can continue trying the patch. Some unexplored paths are: * tablespace usage * tar format * performance impact (in both "read-only" and heavily updated contexts) * consistency checks I would then leave for version 6 the pg_restorebackup utility (unless you want to do everything at once). One limitation of the current recovery script is that it cannot accept multiple incremental backups (it just accepts three parameters: base backup, incremental backup and merge destination). Maybe you can change the syntax as follows: ./recover.py DESTINATION BACKUP_1 BACKUP_2 [BACKUP_3, ...] Thanks a lot for working on this. I am looking forward to continuing the review. Ciao, Gabriele -- Gabriele Bartolini - 2ndQuadrant Italia - Managing Director PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it 2015-01-13 17:21 GMT+01:00 Marco Nenciarini <marco.nenciar...@2ndquadrant.it >: > Il 13/01/15 12:53, Gabriele Bartolini ha scritto: > > Hi Marco, > > > > could you please send an updated version the patch against the current > > HEAD in order to facilitate reviewers? > > > > Here is the updated patch for incremental file based backup. > > It is based on the current HEAD. > > I'm now working to the client tool to rebuild a full backup starting > from a file based incremental backup. > > Regards, > Marco > > -- > Marco Nenciarini - 2ndQuadrant Italy > PostgreSQL Training, Services and Support > marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it >