Hello!

Andreas Enge <andr...@enge.fr> writes:

> On Thu, Jun 06, 2024 at 07:48:27PM +0200, Andreas Enge wrote:
>> Could the graph on
>>   https://ci.guix.gnu.org/metrics
>> be augmented by the number of packages to be built for the different
>> architectures?

That would be nice, I agree (I haven’t looked much at that part of the
code).

> In that direction, the metrics now show that very few packages were built
> in the last 24 hours, except maybe for ARM (where we anyway build few
> packages). But the number of waiting builds stalls at around 280000.
>
> Are these all for ARM now? Should we cancel builds a bit more aggressively
> to make sure that recent packages are favoured?

In the meantime, here’s me doing stats-as-a-service:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo -u cuirass psql cuirass
cuirass=> select count(*) from builds where status = -2 ;
 count
--------
 284314
(1 row)

Time: 635.478 ms
cuirass=> select count(*) from builds where status = -2 and system = 'x86_64-linux';
 count
-------
     0
(1 row)

Time: 761.333 ms
cuirass=> select count(*) from builds where status = -2 and system = 'aarch64-linux';
 count
--------
 160847
(1 row)

Time: 661.968 ms
cuirass=> select count(*) from builds where status = -2 and system = 'powerpc64le-linux';
 count
--------
 119124
(1 row)

Time: 589.800 ms
cuirass=> select count(*) from builds where status = -2 and system = 'armhf-linux';
 count
-------
  4343
(1 row)

Time: 549.242 ms
cuirass=> select count(*) from builds where status = -2 and system = 'i686-linux';
 count
-------
     0
(1 row)

Time: 1088.130 ms (00:01.088)
--8<---------------cut here---------------end--------------->8---

So lots of AArch64 and POWER9 builds.

Executive summary:

  1. Of all the AArch64 build machines we have, only ‘overdrive1’ is
     currently actually contributing build power;

  2. AArch64 build machines ‘pankow’, ‘grunewald’, and ‘kreuzberg’
     (HoneyCombs) need on-site intervention so we can reconfigure them
     and reboot them.

  3. Some other AArch64 build machines (‘lieserl’ and ‘monokuma’) have
     been off for months and we’re discussing on guix-sysadmin ways to
     turn them back on;

  4. POWER9, I’m not sure.

  5. ‘cuirass remote-server’ may be too slow at handling incoming
     messages from workers, leading to redundant builds and the
     impression on https://ci.guix.gnu.org/workers that workers are
     idle, even when they’re in fact busy building stuff.

Investigation details:

I noticed that ‘cuirass remote-server’ on berlin would all too often
consider workers as “unresponsive” (meaning that it hasn’t received a
‘ping’ message from them in the past 2 minutes):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep unresponsive /var/log/cuirass-remote-server.log |tail -10
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
--8<---------------cut here---------------end--------------->8---

As shown in this log, the effect is that some builds get restarted, even
though they are still being built by a worker that was wrongly
considered unresponsive. This needs further investigation.
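
To get a feel for how often builds are restarted because of workers
deemed unresponsive, the log lines could be tallied per hour. The
following is only a rough sketch in Python, not part of Cuirass: the
function name ‘restarts_per_hour’ and the regular expression are
assumptions made for illustration, based solely on the message format
visible in the log excerpt from berlin, which is reused here as sample
input.

```python
import re
from collections import Counter

# Assumed message format, taken from the excerpt of
# /var/log/cuirass-remote-server.log, e.g.:
#   2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
RESTART_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2} "
    r"restarted (\d+) builds that were on unresponsive workers"
)


def restarts_per_hour(lines):
    """Map 'YYYY-MM-DD HH' to the number of builds restarted in that hour.

    Hypothetical helper: lines that do not match RESTART_RE are ignored.
    """
    totals = Counter()
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            date, hour, count = match.groups()
            totals[f"{date} {hour}"] += int(count)
    return totals


# The ten log lines from the excerpt, used as sample input.
SAMPLE = """\
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
""".splitlines()

if __name__ == "__main__":
    for hour, count in sorted(restarts_per_hour(SAMPLE).items()):
        print(hour, count)
```

On the excerpt, that adds up to 18 restarted builds in under an hour.
To run it on the real log, one would pass the lines of the log file
instead of ‘SAMPLE’.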

The SQL query for ‘db-get-pending-build’ fixed by Cuirass commit
17338588d4862b04e9e405c1244a2ea703b50d98 is no longer at fault: it’s now
reasonably fast (there’s a warning in ‘cuirass-remote-server.log’ if it
ever takes more than 10s).

It could be that the backlog of incoming messages in ‘remote-server’
still keeps increasing though, since workers send pings every minute no
matter what.

A further problem is that we’re unable to retrieve binaries from a
couple of build machines:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep error: /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:17:29 error: failed to add /gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1 to store: path `/gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1' does not exist and cannot be created
2024-06-17 13:17:29 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:24:03 error: failed to add /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0 to store: path `/gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0' does not exist and cannot be created
2024-06-17 13:24:03 error: The remote-worker signing key might be unauthorized.
--8<---------------cut here---------------end--------------->8---

By picking store items from these error messages, we can determine that
at least ‘pankow’ (10.0.0.8, AArch64) and ‘grunewald’ (10.0.0.10,
AArch64) are at fault:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix gc --derivers /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0
/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv
ludo@berlin ~$ sudo grep 8yc7j6q169f8312wx6jxs7g0z4xy5l5l /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:21:50 10.0.0.8 (uUTl7MVR): build started: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'.
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:24:03 build succeeded: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'
ludo@berlin ~$ guix gc --derivers /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b
/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv
ludo@berlin ~$ sudo grep ygrgwp9jyksjpnd76b83ifdskbcdjbhh /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:34:39 build failed: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
--8<---------------cut here---------------end--------------->8---

The signing key of ‘grunewald’ is definitely registered:

--8<---------------cut here---------------start------------->8---
$ ssh grunewald cat /etc/guix/signing-key.pub
(public-key
 (ecc
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
$ grep -rl 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 ~/src/guix-maintenance/hydra/
$ ssh berlin grep 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 /etc/guix/acl
    (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
--8<---------------cut here---------------end--------------->8---

That of ‘pankow’ I can’t say because I cannot log in. Most likely, it
rebooted and generated a new signing key that differs from the one
that’s registered. So in effect, ‘pankow’ is not contributing any
builds.

The third machine of the HoneyComb family is ‘kreuzberg’: it’s been off
for a few days, after I rebooted it and it didn’t come back.

Thanks,
Ludo’.

PS: I’m traveling this week so I won’t be very responsive.