02.07.2024 15:00, Alexander Lakhin wrote:
> One month later, I'd like to summarize failures that I've investigated
> and classified during June, 2024 on the aforementioned wiki page.
> (Maybe it would make sense to issue a monthly report with such information
> in the future.)
Please take a look at the July report on the buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-07-01' AND
dt < '2024-08-01' GROUP BY br;
REL_12_STABLE: 11
REL_13_STABLE: 9
REL_14_STABLE: 7
REL_15_STABLE: 10
REL_16_STABLE: 9
REL_17_STABLE: 68
HEAD: 106
-- Total: 220
(Counting test failures only, excluding indent-check, Configure, Build
errors.)
# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
dt >= '2024-07-01' AND dt < '2024-08-01');
40
# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-07-01' AND
dt < '2024-08-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 9;
https://www.postgresql.org/message-id/20240404170055.qynecay7szu3d...@awork3.anarazel.de: 29
-- An environmental issue
https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com: 20
-- Probably fixed
https://www.postgresql.org/message-id/1545399.1720554...@sss.pgh.pa.us: 11
-- Fixed
https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319...@gmail.com: 9
https://www.postgresql.org/message-id/db093cce-7eec-8516-ef0f-891895178...@gmail.com: 8
-- An environmental issue; probably fixed
https://www.postgresql.org/message-id/b2037a8d-fe6b-d299-da17-ff5f3214e...@gmail.com: 8
https://www.postgresql.org/message-id/3e2cbd24-f45e-4b2b-ba83-8149214f0...@dunslane.net: 8
-- Fixed
https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0...@gmail.com: 8
-- Fixed
https://www.postgresql.org/message-id/3618203.1722473...@sss.pgh.pa.us: 8
-- Fixed
# SELECT count(*) FROM failures WHERE dt >= '2024-07-01' AND
dt < '2024-08-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
17
One more metric that might be useful, though it also requires time
analysis, is the number of short-lived (eliminated immediately) failures: 83
I also wrote a simple script (see attached) to check for unknown buildfarm
failures using the "HTML API", to make sure no failures are missed. It could
surely be improved in many ways, but I find it rather useful as-is.
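The core check in the attached script (is a failure link already mentioned on the wiki page?) can be illustrated in isolation. The following is only a minimal sketch under stated assumptions: the page text, the links, the `buildfarm.example.org` host and the `find_unknown` helper are hypothetical stand-ins, not real buildfarm data and not part of the attached script.

```shell
#!/bin/bash
# Hypothetical stand-in for the downloaded wiki page (HTML, so "&" is "&amp;").
known_page='<a href="https://buildfarm.example.org/cgi-bin/show_log.pl?nm=alpha&amp;dt=2024-07-01">alpha</a>'

# Hypothetical stand-in for the links extracted from the failures page.
links='show_log.pl?nm=alpha&dt=2024-07-01
show_log.pl?nm=beta&dt=2024-07-02'

# Print every link from the list in $2 that does not occur in the page text $1.
find_unknown() {
    local page=$1 link escaped
    while IFS= read -r link; do
        # Convert each "&" to "&amp;" to match how the link looks inside HTML.
        escaped=${link//\&/\&amp;}
        # -F matches the link literally rather than as a regular expression.
        grep -qF -- "$escaped" <<< "$page" || printf '%s\n' "$link"
    done <<< "$2"
}

find_unknown "$known_page" "$links"
```

In this sketch `grep -F` is used so that characters such as `.` and `?` in a link are not interpreted as regular-expression syntax; that is a choice made here for illustration and differs from the plain `grep -q` in the attached script.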
Best regards,
Alexander
#!/bin/bash
TMP=${TMPDIR:-/tmp}
wget "https://buildfarm.postgresql.org/cgi-bin/show_failures.pl" -O "$TMP/failures.html"
wget "https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures" -O "$TMP/known-test-failures.html"
# Escape bare "&" and "<" in the page so that xsltproc can parse it as XML
sed -E 's/\&max_days/\&amp;max_days/; s/(hours|mins|secs| i) < /\1 \&lt; /' -i "$TMP/failures.html"
cat << 'EOF' > "$TMP/t.xsl"
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select=".//xhtml:tr[@class='fail' or @class='warnx' or @class='warn']//xhtml:td[@class='status'][not(contains(., 'Configure')) and not(contains(., 'Build')) and not(contains(., 'Make')) and not(contains(., 'indent-check'))]">
<xsl:value-of select="xhtml:a/@href" />
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
EOF
for fl in $(xsltproc "$TMP/t.xsl" "$TMP/failures.html"); do
if [[ $fl == show_log* ]]; then
# The wiki page is HTML, so "&" in a link appears there as "&amp;"
sfl=${fl/\&/\&amp;}
grep -q "$sfl" "$TMP/known-test-failures.html" && continue
echo "An unknown failure found: https://buildfarm.postgresql.org/cgi-bin/$fl"
wget "https://buildfarm.postgresql.org/cgi-bin/$fl" -O "$TMP/failure-$fl.log"
il=""
if grep -q -Pzo \
'(?s)pgsql.build/testrun/pg_basebackup/040_pg_createsubscriber/log/regress_log_040_pg_createsubscriber\b.*'\
'ok 29 - run pg_createsubscriber without --databases\s*\n.*'\
'pg_createsubscriber: error: recovery timed out\s*\n.*'\
'not ok 30 - run pg_createsubscriber on node S\s*\n'\
"$TMP/failure-$fl.log"; then
il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#040_pg_createsubscriber.pl_fails_due_to_recovery_timed_out_during_pg_createsubscriber_run"
elif grep -q -Pzo \
'(?s)(pgsql.build/testrun/postgres_fdw-running/regress|pgsql.build/testrun/postgres_fdw/regress|pgsql.build/contrib/postgres_fdw)/regression.diffs<.*'\
' ERROR: canceling statement due to statement timeout\s*\n'\
'\+WARNING: could not get result of cancel request due to timeout\s*\n'\
"$TMP/failure-$fl.log"; then
il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#The_postgres_fdw_test_fails_due_to_an_unexpected_warning_on_canceling_a_statement"
elif grep -q -Pzo \
'(?s)# poll_query_until timed out executing this query:\s*\n'\
'# \s*\n'\
'# SELECT NOT EXISTS \(\s*\n'\
'# SELECT \*\s*\n'\
'# FROM pg_database\s*\n'\
"#\s*WHERE age\(datfrozenxid\) \> current_setting\('autovacuum_freeze_max_age'\)::int\)\s*\n.*"\
'# Looks like your test exited with 29 just after 1.\s*\n'\
't/001_emergency_vacuum.pl ..\s*\n'\
"$TMP/failure-$fl.log"; then
il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing"
elif grep -q -Pzo \
'(?s)Details for system "[^"]+" failure at stage pg_amcheckCheck,.*'\
'postgresql:pg_amcheck / pg_amcheck/005_opclass_damage\s+TIMEOUT'\
"$TMP/failure-$fl.log"; then
il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#005_opclass_damage.pl_fails_on_Windows_animals_due_to_timeout"
elif grep -q -Pzo \
'(?s)pgsql.build/src/test/isolation/output_iso/regression.diffs<.*'\
'\+isolationtester: canceling step d2a1 after (300|360) seconds\s*\n'\
' step d2a1: <... completed>\s*\n'\
'- sum\s*\n'\
'------\s*\n'\
'-10000\s*\n.*'\
'\+ERROR: canceling statement due to user request\s*\n'\
' step e1c: COMMIT;'\
"$TMP/failure-$fl.log"; then
il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#deadlock-parallel.spec_fails_due_to_timeout_on_jit-enabled_animals"
fi
if [ -n "$il" ]; then
echo " The corresponding issue: $il"
echo
fi
else
echo "Invalid link: $fl"
fi
done
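
The if/elif chain above relies on multi-line matching with `grep -Pzo` and on building one long pattern from adjacent quoted strings. The sketch below shows that technique on its own. It assumes GNU grep built with PCRE support; the sample log, the `log_matches` helper and the label `Sample_timeout` are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Hypothetical log excerpt; the assignment keeps the trailing newline.
sample_log='ok 1 - setup
tool: error: recovery timed out
not ok 2 - run tool on node S
'

# Return 0 if the log text in $1 matches the PCRE in $2.
#   -P  use Perl-compatible regular expressions (GNU grep)
#   -z  treat the input as NUL-separated records, so the whole log is one
#       record and "\n" can be matched inside the pattern
#   -o  print only the matching part (suppressed here by -q)
#   -q  report the result through the exit status only
log_matches() {
    printf '%s' "$1" | grep -q -Pzo -- "$2"
}

# Adjacent quoted strings joined by a backslash-newline form one pattern.
# (?s) lets "." match newlines, so ".*" can span several lines.
pattern='(?s)error: recovery timed out\s*\n.*'\
'not ok 2 - run tool on node S\s*\n'

if log_matches "$sample_log" "$pattern"; then
    echo "matched: Sample_timeout"
else
    echo "no match"
fi
```

Wrapping the grep call in a helper like this is one possible way to keep each classification rule to a pattern and a link; it is a suggestion, not how the script above is organized.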