Hi,

> > Oh, I see. If the process will be killed this perhaps is not going to
> > happen. Whether this can happen if we pull the plug from the machine
> > is probably a design implementation of the particular filesystem and
> > whether it's journaled.
>
> Right. I mentioned that scenario in the original email:
>
> "Furthermore, I think that the problem could arise without performing a
> backup at all: say that the server crashes on the OS level in
> mid-truncation, and the truncation of segment 0 reaches disk but the
> removal of segment 1 does not."
Right. I didn't fully understand this scenario at first, so I tried to reproduce it and see what actually happens. Here is what I did.

```
create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);

-- takes ~30 seconds
insert into truncateme
  select
    id,
    ( select string_agg(chr((33+random()*(126-33)) :: integer), '')
      from generate_series(1,1000) )
  from generate_series(1,2*1024*1024) as id;

checkpoint;

select relfilenode from pg_class where relname = 'truncateme';
 relfilenode
-------------
       32778
```

We got 3 segments (and no VM fork, since autovacuum is disabled and no VACUUM has run yet):

```
$ cd ~/pginstall/data-master/base/16384/
$ ls -lah 32778*
-rw------- 1 eax staff 1.0G Apr 21 23:47 32778
-rw------- 1 eax staff 1.0G Apr 21 23:48 32778.1
-rw------- 1 eax staff 293M Apr 21 23:48 32778.2
-rw------- 1 eax staff 608K Apr 21 23:48 32778_fsm
```

Let's back up the last segment:

```
$ cp 32778.2 ~/temp/32778.2_backup
```

Truncate the table:

```
delete from truncateme where id > 1024*1024;
vacuum truncateme;
```

And kill Postgres:

```
$ pkill -9 postgres
```

... before it finishes another checkpoint:

```
LOG: checkpoints are occurring too frequently (4 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
LOG: checkpoint starting: wal

[and there was no "checkpoint complete"]
```

Next:

```
$ ls -lah 32778*
-rw------- 1 eax staff 1.0G Apr 21 23:50 32778
-rw------- 1 eax staff 146M Apr 21 23:50 32778.1
-rw------- 1 eax staff 0B Apr 21 23:50 32778.2
-rw------- 1 eax staff 312K Apr 21 23:50 32778_fsm
-rw------- 1 eax staff 40K Apr 21 23:50 32778_vm

$ cp ~/temp/32778.2_backup ./32778.2
```

I believe this simulates the case where pg_basebackup did a checkpoint, copied 32778.2 before the rest of the segments, Postgres then truncated the table, and pg_basebackup received the rest of the data, including the WAL. The WAL contains the record of the truncation, see RelationTruncate(). (A pg_waldump sketch for checking this is at the end of the email.) Just as an observation: we keep zero-sized segments instead of deleting them.

So if I start Postgres now, I expect it to return to a consistent state, ideally the same state it had before the crash in terms of the segments. What I actually get is:

```
LOG: database system was interrupted; last known up at 2023-04-21 23:50:08 MSK
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 9/C4035B18
LOG: invalid record length at 9/E8FCBF10: expected at least 24, got 0
LOG: redo done at 9/E8FCBEB0 system usage: CPU: user: 1.58 s, system: 0.96 s, elapsed: 2.61 s
LOG: checkpoint starting: end-of-recovery immediate wait
LOG: checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s) added, 0 removed, 36 recycled; write=0.005 s, sync=0.003 s, total=0.016 s; sync files=9, longest=0.002 s, average=0.001 s; distance=605784 kB, estimate=605784 kB; lsn=9/E8FCBF10, redo lsn=9/E8FCBF10
LOG: database system is ready to accept connections
```

... and:

```
-rw------- 1 eax staff 1.0G Apr 21 23:50 32778
-rw------- 1 eax staff 146M Apr 21 23:53 32778.1
-rw------- 1 eax staff 0B Apr 21 23:53 32778.2
```

Naturally, the database is consistent too.

So it looks like the recovery protocol worked as expected after all, at least in the upcoming PG16 and in this specific scenario. Did I miss anything? If not, perhaps we should just update the comments a bit?

--
Best regards,
Aleksander Alekseev
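
For completeness, here is a sketch of how one could confirm that the truncation record really is in the WAL that recovery replays. This is only a sketch, not something I ran as part of the experiment above: the data directory path is the one from my setup, the LSNs are the redo start and end locations from the recovery log, and the exact output format may differ between versions.

```
$ pg_waldump -p ~/pginstall/data-master/pg_wal \
    --start 9/C4035B18 --end 9/E8FCBF10 --rmgr=Storage
```

The smgr truncate record written by RelationTruncate() belongs to the Storage resource manager, so if the simulation really matches the pg_basebackup scenario, a TRUNCATE record for base/16384/32778 should show up in that range.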