On Thu, Jan 9, 2020 at 5:48 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Andres Freund <and...@anarazel.de> writes:
> > Is it worth having the test close superflous FDs? It'd not be hard to do
> > so via brute force (or even going through /proc/self/fd).
>
> No, it isn't, because d20703805's test is broken by design. There
> are any number of reasons why there might be more than three-or-so
> FDs open during postmaster start. Here are a few:
>
> * It seems pretty likely that at least one of those FDs is
> intentionally being left open by cron so it can detect death of
> all child processes (like our postmaster death pipe). Forcibly
> closing them will not necessarily have nice results. Other
> execution environments might do similar tricks.
>
> * On platforms where semaphores eat a FD apiece, we intentionally
> open those before counting free FDs.
>
> * We run process_shared_preload_libraries before counting free FDs,
> too. If a loaded library intentionally leaves a FD open in the
> postmaster, counting that against the limit also seems like a good
> idea.
>
> My opinion is still that we should just get rid of that test case.
>

The point is that we know what is going wrong on sidewinder on the back
branches. However, we still don't know what is going wrong with tern
and mandrill on v10 [1][2], where the log is:

2020-01-08 06:38:10.842 UTC [54001846:9] t/006_logical_decoding.pl STATEMENT: SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-08 06:38:15.993 UTC [63898020:3] LOG: server process (PID 54001846) was terminated by signal 11
2020-01-08 06:38:15.993 UTC [63898020:4] DETAIL: Failed process was running: SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-08 06:38:15.993 UTC [63898020:5] LOG: terminating any other active server processes

Noah has tried to reproduce it [3] on that buildfarm machine by running
that test in a loop, but he has not been able to reproduce it so far.
He is now running the test for a longer duration. Another point is that
the logic in the v11 code is the same, yet the same test passes on those
machines, so I have a slight suspicion that this test is uncovering some
other problem in v10, but I am not sure about that.

Now, if we remove that test as per your suggestion, wouldn't we lose the
chance to find out what is going wrong on those machines in v10?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2020-01-08%2004%3A36%3A27
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-01-08%2004%3A36%3A27
[3] - https://www.postgresql.org/message-id/20200104185148.GA2270238%40rfd.leadboat.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com