On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
Hello hackers,
While trying to reproduce inexplicable drongo failures (e.g., [1])
seemingly related to BackgroundPsql, I stumbled upon a similar, but not
identical, issue. After many (6-8 thousand) iterations of the
015_stream.pl TAP test, psql failed to start with a
STATUS_DLL_INIT_FAILED
error, and a corresponding Windows popup prevented the test from exiting
(see ss-1 in the attached archive).
Once the system reaches that state, subsequent test runs fail with one
memory-related error or another (not only in psql, but also in the
server processes):
testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11
20:01:51.777 PDT [8812] LOG: server process (PID 11532) was
terminated by exception 0xC00000FD
testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11
20:01:19.806 PDT [2036] LOG: server process (PID 10548) was
terminated by exception 0xC00000FD
testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.513 PDT [9128] LOG: server process (PID 14476) was
terminated by exception 0xC0000409
testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11
20:03:27.801 PDT [10156] LOG: server process (PID 2236) was
terminated by exception 0xC0000409
testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.359 PDT [10656] LOG: server process (PID 14712) was
terminated by exception 0xC000012D
testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11
20:02:23.815 PDT [13704] LOG: server process (PID 13992) was
terminated by exception 0xC00000FD
testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.360 PDT [9760] LOG: server process (PID 11608) was
terminated by exception 0xC0000142
...
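(For reference, the exception codes above are standard NTSTATUS values; a quick decoding sketch, with names taken from Windows' ntstatus.h:)

```python
# Friendly names for the NTSTATUS exception codes seen in the
# server logs above (values per Windows' ntstatus.h).
NTSTATUS_NAMES = {
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
    0xC000012D: "STATUS_COMMITMENT_LIMIT_EXCEEDED",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

for code in (0xC00000FD, 0xC0000409, 0xC000012D, 0xC0000142):
    print(f"0x{code:08X}: {NTSTATUS_NAMES[code]}")
```

Note that 0xC000012D (commitment limit exceeded) is consistent with the Commit Charge observation below.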
When the tests fail, I see Commit Charge reaching 100% (see ss-2 in the
attachment), while physical memory isn't exhausted. To get the OS back
to a working state, I had to reboot; killing processes and logging
off/on didn't help.
I reproduced this issue again, investigated it, and found that it is
caused by robocopy (used by PostgreSQL::Test::Cluster->init), which
leaks kernel objects, namely Event objects in the non-paged pool, on
each run.
This can be seen with Kernel Pool Monitor, or just with this simple PS
script:
for ($i = 1; $i -le 100; $i++)
{
    echo "iteration $i"
    rm -r c:\temp\target
    robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
    Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}
On my machine it shows:
iteration 1
            Total   Copied  Skipped Mismatch   FAILED   Extras
  Dirs :       27       27        0        0        0        0
 Files :      968      968        0        0        0        0
 Bytes :  38.29 m  38.29 m        0        0        0        0
 Times :  0:00:00  0:00:00  0:00:00  0:00:00
...
1226063872
...
iteration 100
            Total   Copied  Skipped Mismatch   FAILED   Extras
  Dirs :       27       27        0        0        0        0
 Files :      968      968        0        0        0        0
 Bytes :  38.29 m  38.29 m        0        0        0        0
 Times :  0:00:00  0:00:00  0:00:00  0:00:00
...
1245220864
(That is, roughly 0.1-0.2 MB leaked per robocopy run.)
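(The per-run figure follows directly from the two PoolNonpagedBytes readings quoted above, taken after iterations 1 and 100, i.e. 99 runs apart:)

```python
# Estimate the per-run non-paged pool leak from the two
# PoolNonpagedBytes counter readings quoted above.
first_reading = 1226063872  # bytes, after iteration 1
last_reading = 1245220864   # bytes, after iteration 100
runs_between = 99

leak_per_run = (last_reading - first_reading) / runs_between
print(f"{leak_per_run / 1024 / 1024:.2f} MB per run")  # ~0.18 MB
```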
I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.
Perhaps this information can be helpful for someone who is running
buildfarm/CI tests on Windows animals...
[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53
Interesting, good detective work. Still, wouldn't this mean drongo would
fail consistently?
I wonder why we're using robocopy instead of our own RecursiveCopy module?
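(For comparison, a plain userland recursive copy needs no robocopy at all; a minimal sketch in Python rather than in the Perl RecursiveCopy module itself, with a hypothetical `copy_template` helper and throwaway paths:)

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for the template copy done by
# PostgreSQL::Test::Cluster->init: replicate the initdb template with
# ordinary file APIs instead of spawning robocopy.
def copy_template(template_dir, target_dir):
    shutil.rmtree(target_dir, ignore_errors=True)  # like 'rm -r' in the loop above
    shutil.copytree(template_dir, target_dir)

# Tiny demo tree standing in for an initdb template.
base = Path(tempfile.mkdtemp())
(base / "template" / "base").mkdir(parents=True)
(base / "template" / "postgresql.conf").write_text("port = 5432\n")

copy_template(base / "template", base / "target")
print(sorted(p.name for p in (base / "target").iterdir()))
```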
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com