Re: src/test/recovery regression failure on bionic

Amit Kapila Wed, 08 Jan 2020 18:32:02 -0800

On Thu, Jan 9, 2020 at 5:48 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Andres Freund <and...@anarazel.de> writes:
> > Is it worth having the test close superfluous FDs? It'd not be hard to do
> > so via brute force (or even going through /proc/self/fd).
>
> No, it isn't, because d20703805's test is broken by design.  There
> are any number of reasons why there might be more than three-or-so
> FDs open during postmaster start.  Here are a few:
>
> * It seems pretty likely that at least one of those FDs is
> intentionally being left open by cron so it can detect death of
> all child processes (like our postmaster death pipe).  Forcibly
> closing them will not necessarily have nice results.  Other
> execution environments might do similar tricks.
>
> * On platforms where semaphores eat a FD apiece, we intentionally
> open those before counting free FDs.
>
> * We run process_shared_preload_libraries before counting free FDs,
> too.  If a loaded library intentionally leaves a FD open in the
> postmaster, counting that against the limit also seems like a good
> idea.
>
> My opinion is still that we should just get rid of that test case.
>
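
For reference, the two enumeration approaches Andres mentions (brute
force, or going through /proc/self/fd) can be sketched roughly as below.
This is only an illustration, not code from the test or from the
postmaster: the function names and the 1024 probe limit are made up for
the example, and /proc/self/fd is Linux-specific.  It only lists the
FDs; per Tom's first point, closing inherited ones may not be safe.

```python
import os


def open_fds_via_proc():
    """List open FD numbers by reading /proc/self/fd (Linux-specific)."""
    fds = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # The FD that was used to read the directory itself shows up
            # in the listing but is already closed by the time we get here.
            continue
        fds.append(fd)
    return sorted(fds)


def open_fds_brute_force(limit=1024):
    """List open FD numbers by probing 0..limit-1 with fstat().

    The limit is an arbitrary assumption for this sketch; FDs at or
    above it are not seen.
    """
    fds = []
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not an open descriptor
        fds.append(fd)
    return fds
```

Either function run at process start would show whatever the execution
environment (cron, for instance) left open before the test's own code
opened anything.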


The point is that we know what is going wrong on sidewinder on the back
branches.  However, we still don't know what is going wrong with tern
and mandrill on v10 [1][2], where the log shows:

2020-01-08 06:38:10.842 UTC [54001846:9] t/006_logical_decoding.pl
STATEMENT:  SELECT data from pg_logical_slot_get_changes('test_slot',
NULL, NULL)
   WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-08 06:38:15.993 UTC [63898020:3] LOG:  server process (PID
54001846) was terminated by signal 11
2020-01-08 06:38:15.993 UTC [63898020:4] DETAIL:  Failed process was
running: SELECT data from pg_logical_slot_get_changes('test_slot',
NULL, NULL)
   WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-08 06:38:15.993 UTC [63898020:5] LOG:  terminating any other
active server processes

Noah has tried to reproduce it [3] on that buildfarm machine by
running the test in a loop, but has not been able to reproduce it so
far.  He is now running the test for a longer duration.  Another point
is that the logic in the v11 code is the same, yet the same test passes
on those machines there, so I have a slight suspicion that this test
has uncovered some other problem in v10, but I am not sure on this
point.
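
To make "running the test in a loop" concrete, here is a minimal sketch
of such a loop.  It is not the script Noah used; the helper name and the
stand-in command are hypothetical, and the real invocation of
t/006_logical_decoding.pl would have to be substituted for it.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that exits non-zero
    (including death by signal, which gives a negative return code),
    or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return attempt
    return None


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace it with the actual
    # test invocation for the branch being investigated.
    stand_in = [sys.executable, "-c", "raise SystemExit(0)"]
    print(run_until_failure(stand_in, max_runs=5))
```

Stopping at the first failure keeps the failing run's logs and data
directory from being overwritten by later iterations.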

Now, if we remove that test as per your suggestion, we might not be
able to find out what is going wrong on those machines in v10.


[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2020-01-08%2004%3A36%3A27
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-01-08%2004%3A36%3A27
[3] - https://www.postgresql.org/message-id/20200104185148.GA2270238%40rfd.leadboat.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: src/test/recovery regression failure on bionic
