Hi, 

I have a workflow where I recover from PITR backup and run a query on it. The 
program that runs query 
checks that it can connect to database in a loop, until it can, and then runs 
the query. 
This has worked fine far. Recently I upgraded to 11 and I see that I can 
connect to DB while recovery is 
not complete yet. See this:

< 2020-11-05 03:34:36.114 CST  >LOG:  starting archive recovery
< 2020-11-05 03:34:36.590 CST  >LOG:  restored log file 
"00000001000002EF000000F9" from archive
< 2020-11-05 03:34:36.641 CST  >LOG:  redo starts at 2EF/F9000028
...
< 2020-11-05 03:34:46.392 CST  >LOG:  restored log file 
"00000001000002F000000008" from archive
< 2020-11-05 03:34:46.658 CST 127.0.0.1 >FATAL:  the database system is 
starting up
< 2020-11-05 03:34:47.028 CST  >LOG:  restored log file 
"00000001000002F000000009" from archive

You can see above fail to connect , but sometime into recover I see;

< 2020-11-05 04:07:51.987 CST  >LOG:  restored log file 
"00000001000002F200000029" from archive
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >ERROR:  canceling statement due to 
conflict with recovery
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >DETAIL:  User query might have needed 
to see row versions that must be removed.
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to 
administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to 
administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.232 CST  >LOG:  background worker "parallel worker" (PID 
13577) exited with exit code 1
< 2020-11-05 04:08:23.244 CST  >LOG:  background worker "parallel worker" (PID 
13578) exited with exit code 1
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >FATAL:  terminating connection due to 
conflict with recovery
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >DETAIL:  User query might have needed 
to see row versions that must be removed.
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >HINT:  In a moment you should be able 
to reconnect to the database and repeat your command.
< 2020-11-05 04:08:25.354 CST  >LOG:  restored log file 
"00000001000002F20000002A" from archive
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >ERROR:  canceling statement due to 
conflict with recovery
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >DETAIL:  User query might have needed 
to see row versions that must be removed.
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.556 CST  >FATAL:  terminating connection due to 
administrator command
< 2020-11-05 04:08:55.556 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.561 CST  >FATAL:  terminating connection due to 
administrator command
< 2020-11-05 04:08:55.561 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.640 CST  >LOG:  background worker "parallel worker" (PID 
13683) exited with exit code 1
< 2020-11-05 04:08:55.653 CST  >LOG:  background worker "parallel worker" (PID 
13684) exited with exit code 1
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >FATAL:  terminating connection due to 
conflict with recovery
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >DETAIL:  User query might have needed 
to see row versions that must be removed.
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >HINT:  In a moment you should be able 
to reconnect to the database and repeat your command.
< 2020-11-05 04:09:00.307 CST  >LOG:  restored log file 
"00000001000002F20000002B" from archive

As you can see a query "select count(*) from file" failed due to table not 
being restored yet. BUT connection was allowed 
before DB was ready.... Only few hours after ;

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept 
connections

After which the query runs fine without errors. This is bad because I see that 
select count(*) takes progressively 
longer to execute and the count(*) is not what I am interested in. I run the 
"real" query after that. As a result I add
hours to program execution time. 

Is there a more robust method to it? Ideally I do not want to be able to 
connect to db until :

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept 
connections


And I believe this was the behavior before upgrade. If connection can't be 
disabled, how can I detect 
condition "database system is ready to accept connections"

I believe "pg_isready" utility would succeed once it can connect. And as can 
see I could 
connect way before DB is really ready. 

Thanks!
Dmitry


Reply via email to