Re: Race conditions in 019_replslot_limit.pl

Kyotaro Horiguchi Tue, 15 Feb 2022 21:26:52 -0800

At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <[email protected]> wrote in 
> I think what happened is that there was no WAL to receive between the start of
> the primary and the $node_primary3->wait_for_catchup($node_standby3);
> 
> Because the slot is created without reserving WAL that allows the primary to
> remove the WAL segments without ever creating a slot based conflict. I think
> that should be fixable by reserving the slot at creation time?


Agreed.  Doing this att all slot creation seems fine.

> > and:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
> > 
> > #   Failed test 'have walsender pid 3682154
> > # 3682136'
> > #   at t/019_replslot_limit.pl line 335.
> > #                   '3682154
> > # 3682136'
> > #     doesn't match '(?^:^[0-9]+$)'
> > 
> > The latter looks like there are two walsenders active, which confuses the
> > test.
> 
> Too bad there's no plain pid in the log_line_prefix. Looks like that used to 
> be the
> buildfarm default, and I haven't fixed that animals configuration...
> 
> %c apparently is hex(process startup time).hex(pid) in hex, so we're looking
> for 382f58... Confirmed by the slot name:
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication 
> command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( 
> RESERVE_WAL)
> which pg_basebackup builds using the backend pid:
>               replication_slot = psprintf("pg_basebackup_%d", (int) 
> PQbackendPID(param->bgconn));
> 
> The logs for that pid are:
> 2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG:  connection received: 
> host=[local]
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG:  replication connection 
> authorized: user=bf application_name=019_replslot_limit.pl
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG:  received replication 
> command: SHOW data_directory_mode
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT:  SHOW 
> data_directory_mode
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication 
> command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( 
> RESERVE_WAL)
> 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT:  
> CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( 
> RESERVE_WAL)
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG:  received replication 
> command: IDENTIFY_SYSTEM
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT:  IDENTIFY_SYSTEM
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG:  received replication 
> command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT:  
> START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> 
> Even though the node has log_disconnect = true, and other processes indeed log
> their disconnection, there's no disconnect for the above session until the
> server is shut down.  Even though pg_basebackup clearly finished? Uh, huh?

It seems to me so, too.

> I guess it's conceivable that the backend was still working through process
> shutdown? But it doesn't seem too likely, given that several other connections
> manage to get through entire connect / disconnect cycles?

Yes, but since postmaster seems thinking that process is gone.


> > Has anyone looked into these yet?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Race conditions in 019_replslot_limit.pl

Reply via email to