Package: release.debian.org Severity: normal Tags: bookworm User: release.debian....@packages.debian.org Usertags: pu X-Debbugs-Cc: prometheus-node-exporter-collect...@packages.debian.org Control: affects -1 + src:prometheus-node-exporter-collectors
[ Reason ] Since the bookworm upgrade, all hosts with the prometheus-node-exporter-collectors package install repeatedly hit the mirrors with spurious apt-update runs. The Debian package systemd.timer(1) schedule is once on boot and then every 15 minutes after, which imposes a tremendous load on the mirror system. It also happens to lock the apt status files which can in turn cause deadlocks with other programs. There seems to be an (unrelated?) regression in apt where some (python?) apt program can hang this way indefinitely. That's tracked separately, e.g. https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851 [ Impact ] The mirror networks are going to be badly negatively affected by this. This actually doesn't seem to have been a problem so far, possibly because deb.debian.org is absorbing everything we can throw at it and no one is looking at Fastly (or Amazon?), but I suspect this could become a problem as Prometheus adoption widens. This also can mean some security upgrades don't get deployed, as the apt update lockfile gets hung. [ Tests ] There's no autopkgtest on this package, but the patches provided here have been tested in production in about 90 servers for torproject.org over a few weeks now, fixing the diagnosed issue. https://gitlab.torproject.org/tpo/tpa/team/-/issues/41355 is our tracking ticket documenting this. [ Risks ] Code is *relatively* simple. It diverges from upstream a little now because upstream did a little reactoring post bookworm and I didn't want to make the patch bigger than it absolutely needs to be. [ Checklist ] [x] *all* changes are documented in the d/changelog [x] I reviewed all changes and I approve them [x] attach debdiff against the package in (old)stable [x] the issue is verified as fixed in unstable [ Changes ] There are three individual patches, cherry-picked and backported from upstream. The first one simply stops running apt-update. The second and third patch add a new metric which keeps track of the last update timestamp on the apt metadata. That is important: previously, the script was running apt-update so we could be pretty sure it was running automatically. But by making this change, we're *not* running apt-update automatically and assume users have properly setup something *else* that does. This was the behaviour in bullseye, for the record, but it's possible that some users *assume* the wrong (new) behavior was the correct one and installed the package not knowing they need to deploy their own APT::Periodic thing.
diff -Nru prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/changelog prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/changelog --- prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/changelog 2023-02-02 23:57:45.000000000 -0500 +++ prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/changelog 2023-10-31 13:57:52.000000000 -0400 @@ -1,3 +1,10 @@ +prometheus-node-exporter-collectors (0.0~git20230203.6f710f8-1+deb12u1) bookworm; urgency=medium + + * Team upload + * Fix deadlock with other apt update runs (Closes: #1028212) + + -- Antoine Beaupré <anar...@debian.org> Tue, 31 Oct 2023 13:57:52 -0400 + prometheus-node-exporter-collectors (0.0~git20230203.6f710f8-1) unstable; urgency=medium * New upstream snapshot (Closes: #1030058) diff -Nru prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch --- prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch 1969-12-31 19:00:00.000000000 -0500 +++ prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch 2023-10-31 13:57:52.000000000 -0400 @@ -0,0 +1,65 @@ +From 28c179ddfd3d7e0f5bc49b93f924f0dffba5b71d Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anar...@debian.org> +Date: Fri, 13 Oct 2023 12:29:48 -0400 +Subject: [PATCH] do not run apt update or simulate apt dist-upgrade +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +This is causing all sorts of problems. The first of which is that +we're hitting our poor mirrors every time the script is ran, which, in +the Debian package configuration, is *every 15 minutes* (!!). + +The second is that this locks the cache and makes this script +needlessly stumble upon a possible regression in APT from Debian +bookworm and Ubuntu 22.06: + +https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851 + +That still has to be confirmed: it's possible that `apt update` can +hang for a long time, but that shouldn't concern us if we delegate +this work out of band. + +I also do not believe actually performing the `dist-upgrade` +calculations is necessary to compute the pending upgrades at all. I've +done work with python-apt for other projects and haven't found that to +be required: the cache has the necessary information about pending +upgrades. + +Closes: #179 + +Signed-off-by: Antoine Beaupré <anar...@debian.org> +--- + apt_info.py | 9 --------- + 1 file changed, 9 deletions(-) + +diff --git a/apt_info.py b/apt_info.py +index 59f3ad7..81a276b 100755 +--- a/apt_info.py ++++ b/apt_info.py +@@ -9,7 +9,6 @@ + + import apt + import collections +-import contextlib + import os + + _UpgradeInfo = collections.namedtuple("_UpgradeInfo", ["labels", "count"]) +@@ -90,14 +89,6 @@ def _write_reboot_required(): + def _main(): + cache = apt.cache.Cache() + +- # First of all, attempt to update the index. If we don't have permission +- # to do so (or it fails for some reason), it's not the end of the world, +- # we'll operate on the old index. +- with contextlib.suppress(apt.cache.LockFailedException, apt.cache.FetchFailedException): +- cache.update() +- +- cache.open() +- cache.upgrade(True) + _write_pending_upgrades(cache) + _write_held_upgrades(cache) + _write_autoremove_pending(cache) +-- +2.39.2 + diff -Nru prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-report-the-apt-cache-timestamp.patch prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-report-the-apt-cache-timestamp.patch --- prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-report-the-apt-cache-timestamp.patch 1969-12-31 19:00:00.000000000 -0500 +++ prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-report-the-apt-cache-timestamp.patch 2023-10-31 13:57:52.000000000 -0400 @@ -0,0 +1,52 @@ +From dc6568ef14f49bddef28befef6f5058d38535cbc Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anar...@debian.org> +Date: Fri, 13 Oct 2023 13:01:32 -0400 +Subject: [PATCH] report the apt cache timestamp +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +We use the `pkgcache.bin` modification time as a heuristic, but there +might be a better way to do this. + +We also silently ignore errors when pulling that file to avoid broken +systems from breaking the other metrics. I believe this will lead to a +null metric which is probably what we want anyway. + +A possible improvement to this would be to individually inspect the +`InRelease` files and report the mirror's timestamps as well, but +that's more involved. It's also not a priority compared to this fix, +because we want the update timestamp if we remove the `apt update` +run (in #181). + +Closes: #180 + +Signed-off-by: Antoine Beaupré <anar...@debian.org> +Co-authored-by: Ben Kochie <sup...@gmail.com> +--- + apt_info.py | 9 +++++++++ + 1 file changed, 9 insertions(+) + +Index: b/apt_info.py +=================================================================== +--- a/apt_info.py 2023-10-23 14:58:09.558722141 -0400 ++++ b/apt_info.py 2023-10-23 15:00:07.250721050 -0400 +@@ -77,6 +77,18 @@ def _write_autoremove_pending(cache): + print(f"apt_autoremove_pending {len(autoremovable_packages)}") + + ++def _write_cache_timestamps(registry): ++ print('# HELP apt_package_cache_timestamp_seconds Apt update last run time.') ++ print('# TYPE apt_package_cache_timestamp_seconds gauge') ++ try: ++ print( ++ "apt_package_cache_timestamp_seconds %s" ++ % os.stat('/var/cache/apt/pkgcache.bin').st_mtime ++ ) ++ except OSError: ++ pass ++ ++ + def _write_reboot_required(): + print("# HELP node_reboot_required Node reboot is required for software updates.") + print("# TYPE node_reboot_required gauge") diff -Nru prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-use-a-better-heuristic-for-the-apt-update-last-run-t.patch prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-use-a-better-heuristic-for-the-apt-update-last-run-t.patch --- prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-use-a-better-heuristic-for-the-apt-update-last-run-t.patch 1969-12-31 19:00:00.000000000 -0500 +++ prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/0001-use-a-better-heuristic-for-the-apt-update-last-run-t.patch 2023-10-31 13:57:52.000000000 -0400 @@ -0,0 +1,60 @@ +From feb943f6df8cbc536eaf7fca59a088d59bb39e82 Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anar...@debian.org> +Date: Tue, 17 Oct 2023 10:27:30 -0400 +Subject: [PATCH] use a better heuristic for the apt update last run time +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +As reported in #182, `pkgcache.bin` gets updated on more than just +`apt update`. In particular, it gets modified when a package is +installed, upgraded or removed. + +So let's fallback on a better heuristic, which is the `APT::Periodic` +timestamp. This should give us a much more precise and deliberate +status. + +We also fallback to the `lists` directory, which gets updated when new +mirror lists gets moved into place. This does run the risk of staying +silently unchanged if there's no change on mirrors but (a) that's +rather infrequent and (b) we might actually want to warn on that +anyway. + +Signed-off-by: Antoine Beaupré <anar...@debian.org> +Closes: #183 +--- + apt_info.py | 11 ++++++++++- + 1 file changed, 10 insertions(+), 1 deletion(-) + +Index: b/apt_info.py +=================================================================== +--- a/apt_info.py 2023-10-23 15:01:19.018720385 -0400 ++++ b/apt_info.py 2023-10-23 15:01:46.930720116 -0400 +@@ -8,6 +8,7 @@ + # Author: Kyle Fazzari <kyr...@ubuntu.com> + + import apt ++import apt_pkg + import collections + import os + +@@ -80,10 +81,18 @@ def _write_autoremove_pending(cache): + def _write_cache_timestamps(registry): + print('# HELP apt_package_cache_timestamp_seconds Apt update last run time.') + print('# TYPE apt_package_cache_timestamp_seconds gauge') ++ apt_pkg.init_config() ++ if apt_pkg.config.find_b("APT::Periodic::Update-Package-Lists"): ++ # if we run updates automatically with APT::Periodic, we can ++ # check this timestamp file ++ stamp_file = "/var/lib/apt/periodic/update-success-stamp" ++ else: ++ # if not, let's just fallback on the lists directory ++ stamp_file = '/var/lib/apt/lists' + try: + print( + "apt_package_cache_timestamp_seconds %s" +- % os.stat('/var/cache/apt/pkgcache.bin').st_mtime ++ % os.stat(stamp_file).st_mtime + ) + except OSError: + pass diff -Nru prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/series prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/series --- prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/series 1969-12-31 19:00:00.000000000 -0500 +++ prometheus-node-exporter-collectors-0.0~git20230203.6f710f8/debian/patches/series 2023-10-31 13:57:52.000000000 -0400 @@ -0,0 +1,3 @@ +0001-do-not-run-apt-update-or-simulate-apt-dist-upgrade.patch +0001-report-the-apt-cache-timestamp.patch +0001-use-a-better-heuristic-for-the-apt-update-last-run-t.patch