From: Omar Sandoval <osan...@fb.com>

drgn [1] currently uses debuginfod with great success for debugging
userspace processes. However, for debugging the Linux kernel (drgn's
main use case), we have had some performance issues with debuginfod, so
we intentionally avoid using it. Specifically, it sometimes takes over a
minute for debuginfod to respond to queries for vmlinux and kernel
modules (not including the actual download time).

The reason for the slowness is that Linux kernel debuginfo packages are
very large and contain lots of files. To respond to a query for a Linux
kernel debuginfo file, debuginfod has to decompress and iterate through
the whole package until it finds that file. If the file is towards the
end of the package, this can take a very long time. This was previously
reported for vdso files [2][3], which debuginfod was able to mitigate
with improved caching and prefetching. However, kernel modules are far
greater in number, vary drastically by hardware and workload, and can be
spread all over the package, so in practice I've still been seeing long
delays. This was also discussed on the drgn issue tracker [4].

The fundamental limitation is that Linux packages, which are essentially
compressed archives with extra metadata headers, don't support random
access to specific files. However, the multi-threaded xz compression
format does actually support random access. And, luckily, the kernel
debuginfo packages on Fedora, Debian, and Ubuntu all happen to use
multi-threaded xz compression!

debuginfod can take advantage of this: when it scans a package, if it is
a seekable xz archive, it can save the uncompressed offset and size of
each file. Then, when it needs a file, it can seek to that offset and
extract it from there. This requires some understanding of the xz format
and low-level liblzma code, but the speedup is massive: where the worst
case was previously about 50 seconds just to find a file in a kernel
debuginfo package, with this change the worst case is 0.25 seconds, a
~200x improvement! This works for both .rpm and .deb files.

Patch 1 is a preparatory refactor. Patch 2 implements saving the
uncompressed offsets and sizes in the database. Patch 3 implements the
seekable xz extraction.

I tested this by requesting and verifying the digest of every file from
a few kernel debuginfo rpms and debs [5].

P.S.
The biggest downside of this change is that it depends on a very
specific compression format. I think this is something we should
formalize with Linux distributions: large debuginfo packages should use
a seekable format. Currently, xz in multi-threaded mode is the only
option, but Zstandard also has an experimental seekable format that is
worth looking into [6].

1: https://github.com/osandov/drgn
2: https://sourceware.org/bugzilla/show_bug.cgi?id=29478
3: https://bugzilla.redhat.com/show_bug.cgi?id=1970578
4: https://github.com/osandov/drgn/pull/380
5: https://gist.github.com/osandov/89d521fdc6c9a07aa8bb0ebf91974346
6: https://github.com/facebook/zstd/tree/dev/contrib/seekable_format

Omar Sandoval (3):
  debuginfod: factor out common code for responding from an archive
  debuginfod: add archive entry size, mtime, and uncompressed offset to database
  debuginfod: optimize extraction from seekable xz archives

 configure.ac              |   5 +
 debuginfod/Makefile.am    |   2 +-
 debuginfod/debuginfod.cxx | 870 +++++++++++++++++++++++++++++++-------
 3 files changed, 722 insertions(+), 155 deletions(-)

-- 
2.45.2
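
For readers unfamiliar with why a multi-block xz file is seekable, here
is a minimal sketch of the lookup step described above: given the saved
uncompressed offset and size of an archive member, decide which xz
blocks need to be decoded and how many leading bytes to discard. It is
not code from the patches. The Block, Entry, and Plan types and the
plan_extraction() function are hypothetical names invented for
illustration, and the block table stands in for the index that an xz
file stores and that liblzma can expose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified view of one independently decodable xz block.
struct Block {
    uint64_t uncompressed_offset;  // where this block's data starts in the uncompressed stream
    uint64_t uncompressed_size;    // how many uncompressed bytes the block holds
    uint64_t compressed_offset;    // where the block starts in the .xz file (the seek target)
};

// Hypothetical record saved per archive member when the package is scanned.
struct Entry {
    uint64_t uncompressed_offset;
    uint64_t size;
};

// Which blocks to decode, and how many bytes of the first one to throw away.
struct Plan {
    std::size_t first_block;
    std::size_t last_block;
    uint64_t skip;
};

// Return the index of the block containing uncompressed position `pos`, if any.
static std::optional<std::size_t> find_block(const std::vector<Block>& blocks, uint64_t pos) {
    // First block that starts strictly after pos; the one before it is the candidate.
    auto it = std::upper_bound(
        blocks.begin(), blocks.end(), pos,
        [](uint64_t p, const Block& b) { return p < b.uncompressed_offset; });
    if (it == blocks.begin()) return std::nullopt;
    --it;
    if (pos - it->uncompressed_offset >= it->uncompressed_size) return std::nullopt;
    return static_cast<std::size_t>(it - blocks.begin());
}

// Plan the extraction of one entry; std::nullopt means it is not covered by the blocks.
std::optional<Plan> plan_extraction(const std::vector<Block>& blocks, const Entry& e) {
    // Assumption: the table is sorted by uncompressed offset, as an index would be.
    assert(std::is_sorted(blocks.begin(), blocks.end(),
                          [](const Block& a, const Block& b) {
                              return a.uncompressed_offset < b.uncompressed_offset;
                          }));
    auto first = find_block(blocks, e.uncompressed_offset);
    if (!first) return std::nullopt;
    // An empty entry still needs a valid position; otherwise locate its last byte.
    uint64_t last_byte = e.size ? e.uncompressed_offset + e.size - 1 : e.uncompressed_offset;
    auto last = find_block(blocks, last_byte);
    if (!last) return std::nullopt;
    return Plan{*first, *last, e.uncompressed_offset - blocks[*first].uncompressed_offset};
}
```

With three 100-byte blocks, an entry at uncompressed offset 150 of size
80 maps to blocks 1 through 2 with 50 bytes skipped, so only the tail of
the archive around that file is decompressed rather than everything
before it. A single-block xz file would make every lookup resolve to
block 0, which is why the approach only pays off for archives compressed
in multi-threaded mode.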