Hello Shlok,
03.07.2025 09:54, Shlok Kyal wrote:
I have also encountered a similar buildfarm failure [1].
| 1/1 + subscription 142 ms FAIL
1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1
diff --strip-trailing-cr -U3
c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
---
c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
2025-06-28 20:13:02 +0900
+++
c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
2025-06-28 20:35:21 +0900
@@ -70,7 +70,7 @@
SELECT :'prev_stats_reset' < stats_reset FROM
pg_stat_subscription_stats WHERE subname = 'regress_testsub';
?column?
----------
- t
+ f
(1 row)
Thank you for your attention to this! Yes, I also noticed that failure,
but could not reproduce it yet.
So according to me, I suspect the following causes
1) The time difference between 'prev_stats_reset' and current
'stats_reset' value is less than 1 microseconds.
'stats_reset' is of type 'timestamp with time zone' and the content of
it is like: '2025-06-30 21:01:07.925253+05:30'. So if the time
difference between 'prev_stats_reset' and current 'stats_reset' is
less than 1 microseconds. The query 'SELECT :'prev_stats_reset' <
stats_reset FROM pg_stat_subscription_stats WHERE subname =
'regress_testsub'' might return 'false' instead of 'true'.
But I was not able to reproduce such a scenario after multiple
testing. Even in high end machines, it takes at least a few
microseconds. Also there are multiple cases where we did similar
timestamp comparison in 'stats.sql' as well. And, I didn't find any
other failure related to such case. So, I feel this is not possible.
Did you try that on Windows (hamerkop is a Windows animal)? IIUC,
GetCurrentTimestamp() -> gettimeofday() implemented on Windows via
GetSystemTimePreciseAsFileTime(), and it has 100ns resolution,
according to [1].
I tried it on a Windows VM and saw the following numbers:
SELECT :'prev_stats_reset' < stats_reset, :'prev_stats_reset', stats_reset FROM pg_stat_subscription_stats WHERE subname
= 'regress_testsub';
?column? | ?column? | stats_reset
----------+-------------------------------------+-------------------------------------
t | Fri Jul 04 11:46:10.224323 2025 PDT | Fri Jul 04 11:46:10.224975
2025 PDT
(1 row)
that is, the difference is about 600 ns — if I could make the VM 6x faster
or run the test on a bare metal, maybe I'd get the test failed.
Although the overall duration of the subscription doesn't differ much:
My VM:
# parallel group (2 tests): subscription publication
ok 164 + publication 806 ms
ok 165 + subscription 210 ms
vs
hamerkop:
# parallel group (2 tests): subscription publication
ok 164 + publication 2383 ms
not ok 165 + subscription 142 ms
So probably the failure is caused by something else.
2) pg_stat_reset_subscription_stats(oid) function did not reset the stats.
We have a shared hash 'pgStatLocal.shared_hash'. If the entry
reference (for the subscription) is not found while executing
'pg_stat_reset_subscription_stats(oid)'. It may not be able to reset
the stats. Maybe somehow this shared hash is getting dropped..
Also, it could be failing due to the same reason as Alexander has
I don't think 2) is relevant here, because shared buffers shouldn't affect
subscription's statistics.
[1]
https://stackoverflow.com/questions/4444090/measuring-time-with-a-resolution-of-microseconds-in-c
Best regards.
Alexander