On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
Hello hackers,
While trying to reproduce inexplicable drongo failures (e.g., [1])
seemingly related to BackgroundPsql, I stumbled upon a similar, but not
identical, issue. After many (6-8 thousand) iterations of the
015_stream.pl TAP test, psql failed to start with a
STATUS_DLL_INIT_FAILED
error, and a corresponding Windows popup prevented the test from exiting
(see ss-1 in the attached archive).
Once the system reaches that state, subsequent test runs fail with one
memory-related error or another (not only in psql, but also in the
server processes):
testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11
20:01:51.777 PDT [8812] LOG: server process (PID 11532) was
terminated by exception 0xC00000FD
testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11
20:01:19.806 PDT [2036] LOG: server process (PID 10548) was
terminated by exception 0xC00000FD
testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.513 PDT [9128] LOG: server process (PID 14476) was
terminated by exception 0xC0000409
testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11
20:03:27.801 PDT [10156] LOG: server process (PID 2236) was
terminated by exception 0xC0000409
testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.359 PDT [10656] LOG: server process (PID 14712) was
terminated by exception 0xC000012D
testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11
20:02:23.815 PDT [13704] LOG: server process (PID 13992) was
terminated by exception 0xC00000FD
testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.360 PDT [9760] LOG: server process (PID 11608) was
terminated by exception 0xC0000142
...
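(For reference, the exception codes above are standard NTSTATUS values; a quick decoding sketch, with names taken from Windows' ntstatus.h:)

```python
# Friendly names for the NTSTATUS exception codes seen in the
# server logs above (values per Windows' ntstatus.h).
NTSTATUS_NAMES = {
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
    0xC000012D: "STATUS_COMMITMENT_LIMIT_EXCEEDED",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

for code in (0xC00000FD, 0xC0000409, 0xC000012D, 0xC0000142):
    print(f"0x{code:08X}: {NTSTATUS_NAMES[code]}")
```

Note that 0xC000012D (commitment limit exceeded) is consistent with the Commit Charge observation below.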
When the tests fail, I see Commit Charge reaching 100% (see ss-2 in the
attachment), while physical memory isn't exhausted. To get the OS back
to a working state, I had to reboot; killing processes and logging
off/on didn't help.
I reproduced this issue again, investigated it, and found that it is
caused by robocopy (used by PostgreSQL::Test::Cluster->init), which
leaks kernel objects, namely Event objects in the non-paged pool, on
each run.
This can be seen with Kernel Pool Monitor, or just with this simple PS
script:
for ($i = 1; $i -le 100; $i++)
{
    echo "iteration $i"
    rm -r c:\temp\target
    robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
    Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}
On my machine it shows:
iteration 1
            Total   Copied  Skipped Mismatch   FAILED   Extras
  Dirs :       27       27        0        0        0        0
 Files :      968      968        0        0        0        0
 Bytes :  38.29 m  38.29 m        0        0        0        0
 Times :  0:00:00  0:00:00  0:00:00  0:00:00
...
1226063872
...
iteration 100
            Total   Copied  Skipped Mismatch   FAILED   Extras
  Dirs :       27       27        0        0        0        0
 Files :      968      968        0        0        0        0
 Bytes :  38.29 m  38.29 m        0        0        0        0
 Times :  0:00:00  0:00:00  0:00:00  0:00:00
...
1245220864
(That is, roughly 0.1-0.2 MB leaked per robocopy run.)
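(The per-run figure follows directly from the two PoolNonpagedBytes readings quoted above, taken after iterations 1 and 100, i.e. 99 runs apart:)

```python
# Estimate the per-run non-paged pool leak from the two
# PoolNonpagedBytes counter readings quoted above.
first_reading = 1226063872  # bytes, after iteration 1
last_reading = 1245220864   # bytes, after iteration 100
runs_between = 99

leak_per_run = (last_reading - first_reading) / runs_between
print(f"{leak_per_run / 1024 / 1024:.2f} MB per run")  # ~0.18 MB
```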
I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.
Perhaps this information can be helpful for someone who is running
buildfarm/CI tests on Windows animals...
[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53
Interesting, good detective work. Still, wouldn't this mean drongo would
fail consistently?
I wonder why we're using robocopy instead of our own RecursiveCopy module?
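(For comparison, a plain userland recursive copy needs no robocopy at all; a minimal sketch in Python rather than in the Perl RecursiveCopy module itself, with a hypothetical `copy_template` helper and throwaway paths:)

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for the template copy done by
# PostgreSQL::Test::Cluster->init: replicate the initdb template with
# ordinary file APIs instead of spawning robocopy.
def copy_template(template_dir, target_dir):
    shutil.rmtree(target_dir, ignore_errors=True)  # like 'rm -r' in the loop above
    shutil.copytree(template_dir, target_dir)

# Tiny demo tree standing in for an initdb template.
base = Path(tempfile.mkdtemp())
(base / "template" / "base").mkdir(parents=True)
(base / "template" / "postgresql.conf").write_text("port = 5432\n")

copy_template(base / "template", base / "target")
print(sorted(p.name for p in (base / "target").iterdir()))
```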
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com