Hi folks interested in Sage packaging,

Almost every time the topic comes up, I complain that it isn't easier to use more system packages as both build- and run-time dependencies of Sage. I'd like to make some progress on actually doing something about that, and I have some ideas, but I'd like to bounce them off anyone who's interested first before just going off and doing it.

There is enough work involved in this that I believe it can and should be broken up into a number of smaller tasks. I would also like to approach this in a way that works well and integrates with the existing "sage-the-distribution" infrastructure. I believe there are advantages to being able to develop on Sage in the "normal" way we're already used to, while also being able to take advantage of existing system packages wherever possible.

So I'm just going to try to organize my existing thoughts on this and see what anyone thinks. Sorry if it's TL;DR, but I'm hoping that having a detailed discussion about this will make it more likely that something will actually be accomplished on it soon (because I think the actual implementation, once decided on, is not terribly difficult).

Note: In this message I'm using "package" loosely to refer to any program, library, database, or other collection of files that is distributed and installed as a self-contained unit. It doesn't necessarily relate to any particular "packaging system".

1. Why?
=======

The extent and scope to which Sage "vendors" its dependencies, in the form of what some call "sage-the-distribution", is *not* particularly normal in the open source world. Vendoring *some* dependencies is not unusual, but Sage does nearly all (even down to gcc, in certain cases). I've learned a lot of the history of this over the past year, and agree that most of the time this has been done with good reasons.

For example, I can't think of any other software that forces me to build its own copy of ncurses just to build/install it. This was added for good reasons [1], but not reasons that can't also be resolved in part by installing the appropriate system packages, or that might not be resolved by now in system packages that depend on ncurses (i.e. that should be built with ncurses support). Point being, this issue does not necessarily impact everyone, and building Sage's own ncurses is overkill in that case.

It would be one thing if we were just talking about one or two packages (I didn't pick on ncurses for any deep reason), but now multiply that by around 250 (give or take, depending on how many dependencies are even available as system packages) and it becomes real overhead to getting started *and* making progress with Sage development.

I wouldn't propose *removing* any existing spkgs that are still relevant. I think it's really useful that Sage has a list of known-good pinned versions of its dependencies. Further, "sage-the-distribution" makes it very easy to install those dependencies in such a way that they can be used as build/runtime dependencies by Sage without having to hunt the 'net for the right source packages of the right versions of those dependencies, and figure out how to configure and build them in a piecemeal fashion.

In other words, even if we do expand the ability to use system packages for Sage's dependencies, it's still very nice that it's easy with a few commands to use the spkg if something goes wrong with the system package. It's also, of course, important for power users who wish to compile some dependencies on their own--especially highly tuned numerical libraries (but even those users usually only care about being able to hand-configure a few dependencies, not most).

To summarize: being able to more aggressively rely on system packages can save a lot of time and frustration during normal development of Sage, and is also less jarring, especially to new developers, of whom we would like to attract more. It should also decrease the time required to regularly build binary distributions of Sage (e.g. for Docker, Windows, and Linux distros).

2. Overview of how Sage manages dependencies now (and what won't change)
========================================================================

For many of you this will be unnecessary review, but I want to discuss a little about how dependencies are currently checked and installed in Sage-the-distribution.
Doing so is helpful for me too, to make sure I understand it clearly (and correct me if I have any misunderstandings).

Sage-the-distribution uses *Make* itself (cleverly, IMO) to manage dependencies insofar as making sure all dependencies are installed, and that when a package changes all packages that depend (directly or indirectly) on that package are rebuilt. Make works on files and timestamps, which does not translate directly to entire software packages, so to track whether or not an spkg is up to date, Sage uses the common "stamp pattern" for Make [2]--that is, when an spkg is installed it writes a file that effectively "represents" completion of the installation of that spkg for Make's purposes.

These stamp files are typically stored as $SAGE_LOCAL/var/lib/sage/installed/<spkg>-<version>. This directory is also known in some places as SAGE_SPKG_INST. By including the version number in the name we can also force rebuilds when an spkg's version changes.

When one runs `make <spkg>` with just the spkg name, this is actually a phony target with the path to the stamp file for that package (at its current version) as its sole prerequisite. So `make <spkg>` translates to `make $SAGE_SPKG_INST/<spkg>-<version>` for the current version of that spkg. The associated rule is to run the sage-spkg command for that package, which also takes care of writing the stamp file.

sage-spkg also writes some information into each stamp file in a somewhat loose format that I don't believe is parsed anywhere. However the *existence* of these files is used by the (somewhat controversial, for downstream packagers) `is_package_installed()` function.*

I'm actually going to propose later that we write and use these stamp files (with some slight changes) even when installing dependencies from a system package, so these files might be present even in binary packages for Sage (though that might be up to downstream packagers).

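To make the stamp pattern concrete, here is a minimal Python sketch of the bookkeeping described above. It is only an illustration, not Sage's actual code: the helper names (`stamp_path`, `write_stamp`, `is_installed`) and the zlib version are made up, and the existence-only check merely approximates what `is_package_installed()` is described as doing.

```python
import tempfile
from pathlib import Path


def stamp_path(inst_dir: Path, name: str, version: str) -> Path:
    """Return the stamp file path <inst_dir>/<name>-<version>."""
    return inst_dir / f"{name}-{version}"


def write_stamp(inst_dir: Path, name: str, version: str) -> Path:
    """Record a finished install by creating (or touching) the stamp file."""
    inst_dir.mkdir(parents=True, exist_ok=True)
    path = stamp_path(inst_dir, name, version)
    path.touch()  # Make only cares about existence and timestamp
    return path


def is_installed(inst_dir: Path, name: str) -> bool:
    """Existence-only check: is there any stamp named <name>-<something>?

    Assumes package names contain no '-' (a simplification for this sketch).
    """
    if not inst_dir.is_dir():
        return False
    return any(p.name.split("-", 1)[0] == name for p in inst_dir.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for $SAGE_LOCAL/var/lib/sage/installed
        inst = Path(tmp) / "var" / "lib" / "sage" / "installed"
        print(is_installed(inst, "zlib"))   # no stamp yet
        write_stamp(inst, "zlib", "1.2.11")  # hypothetical version
        print(is_installed(inst, "zlib"))   # stamp now exists
```

Note that nothing here inspects the contents of the stamp file, which matches the observation that only the file's existence (and, for Make, its timestamp) matters.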
When Sage's `./configure` script generates the main Makefile for all of Sage's dependencies, it loops over all the spkgs in build/pkgs/ and creates two make targets for each spkg: the aforementioned phony target consisting of just the package name, and the *real* target for the stamp file. It also creates a make variable named like `$(inst_<spkg>)` (where <spkg> is just the package name, without the version) referring to the full path of the stamp file for that package.

Each spkg may list its build dependencies in its build/pkgs/<spkg>/dependencies file, in the format that it will appear in the Makefile as dependencies for the make target of that package. For convenience's sake, the `dependencies` file just contains the package names, but the `./configure` script converts this to the appropriate `$(inst_<spkg>)` variables, so that the stamp files become the real dependencies (part of how the "stamp pattern" normally works).

When a package is upgraded (i.e. its version number changes) then the Makefile is regenerated, but with the `$(inst_<spkg>)` for that package pointing to a new stamp file, containing the new version number. Thus any dependents of that package will see this as an outdated dependency, and get rebuilt after the upgraded package is built. When packages are rebuilt (even if their version didn't change) their stamp files are touched, forcing further rebuilds of any of their dependents and so on, in normal Make behavior.

As far as I can tell this has worked quite well for Sage--especially as it also allows leveraging Make's parallel build features. So I'm proposing to keep this all pretty much as-is, with possibly only minor tweaks in the details. Instead, many more of the changes will be at configure time.

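As a rough illustration of the Makefile generation just described, the following Python sketch renders the per-package variable, the real stamp-file target, and the phony target. This is not what `./configure` actually emits: the function name, the package versions, and the simplified `sage-spkg` recipe line are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Written literally into the output; stands in for the stamp directory.
INST_DIR = "$(SAGE_SPKG_INST)"


def makefile_fragment(packages: Dict[str, Tuple[str, List[str]]]) -> str:
    """Render stamp-pattern rules for {name: (version, [dependency names])}."""
    lines = []
    # One variable per package, pointing at its versioned stamp file.
    for name in sorted(packages):
        version, _deps = packages[name]
        lines.append(f"inst_{name} = {INST_DIR}/{name}-{version}")
    lines.append("")
    for name in sorted(packages):
        version, deps = packages[name]
        # Plain names from the `dependencies` file become stamp variables.
        prereqs = " ".join(f"$(inst_{dep})" for dep in deps)
        # Real target: the stamp file, rebuilt when a prerequisite is newer.
        lines.append(f"$(inst_{name}): {prereqs}".rstrip())
        lines.append(f"\tsage-spkg {name}-{version}")  # simplified recipe
        # Phony convenience target so that `make <spkg>` works.
        lines.append(f"{name}: $(inst_{name})")
        lines.append(f".PHONY: {name}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical versions; libpng depending on zlib is just an example.
    print(makefile_fragment({
        "zlib": ("1.2.11", []),
        "libpng": ("1.6.29", ["zlib"]),
    }))
```

Because the version is baked into the value of `inst_<spkg>`, bumping a version in the input changes the stamp path, which is what makes dependents look out of date to Make.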
* There is proposed work already mostly done to replace use of is_package_installed() within the Sage library with a way to do runtime feature checks: https://trac.sagemath.org/ticket/20382 Some of this work *might* be redundant with what I want to propose, but can also coexist with it, as it is currently designed for runtime use by the Python code itself, and not during builds.

3. Case study--examples already in Sage
=======================================

Sage-the-distribution already has a few examples of "spkgs" in the system that *may* use a system package, rather than building from source. As it is, this is done in an ad-hoc manner that can be surprising and/or misleading. But I think it's useful to look at them to see how this is done currently and if there's anything we can learn from it.

a) Blas
-------

There are two different BLAS implementation packages to choose from currently in Sage: OpenBLAS and ATLAS.* The selection can be made currently at configure time with a --with-blas= flag which can take either 'openblas' or 'atlas'. The selection is used to write a variable called `$(BLAS)` in the makefile that points to the stamp file path for the actual BLAS implementation spkg selected. Other spkgs that have BLAS as a dependency list the `$(BLAS)` variable in their dependencies, rather than writing "openblas" or "atlas" explicitly.

When openblas is selected (now the default) the openblas spkg is installed unconditionally. However, when *atlas* is selected, there happens to be a mechanism for using a system BLAS (why just with ATLAS I don't know--historical reasons I guess). In this case it still runs the spkg-install for ATLAS like for any other spkg, but its spkg-install checks for a special environment variable, `SAGE_ATLAS_LIB` (the only way to control this behavior). This invokes a search in standard locations first for a "libatlas.so" (or equivalent) explicitly.
If that's not found, it will happily take whatever it does find as long as there's *some* "libblas.so" and "liblapack.so" found on the system. It doesn't do any feature checks or anything--it just takes what it finds.

If it does find something resembling either ATLAS specifically, or a generic BLAS/LAPACK, then it skips installing the actual spkg, but still writes a stamp file indicating that "ATLAS" was installed, with whatever version is in the package-version.txt for the spkg, which can of course be misleading. (It also writes pkgconfig .pc files in $SAGE_LOCAL/lib for blas/cblas/lapack indicating which libs it found, along with a "fake" version of "1.0".) Thus, Sage will use these system libraries for all build and runtime requirements of BLAS, and in my experience this has generally worked.

* There is another issue I would like to address--slightly orthogonal to supporting system packages--of having a regular way to support "abstract" packages that can have multiple alternative implementations (another example being GMP/MPIR). This has been talked about before, such as in this recent thread [3]. I have some ideas about this that integrate well with my ideas for system packages, but I will try to save that for a separate message.

b) GCC
------

The GCC spkg is a bit of a different beast, since it is normally not installed by default, and was only added to support cases where the platform's GCC is broken or too old and has bugs that affect building Sage or its dependencies.

Although Sage's `configure` script is responsible for determining whether or not GCC should be installed (in contrast to hacks in spkg-install like for ATLAS), there is no *flag* for `configure` (e.g. --with-gcc or something like that) for controlling this. Instead the behavior is controlled solely by an environment variable "SAGE_INSTALL_GCC" (this should probably be fixed, but we'll come to that).
If the environment variable is set to "yes"/"no" then that forces the gcc installation behavior one way or the other. However, if the environment variable is not set, then the configure script goes through the necessary checks to see if the installed gcc is new enough, and also if gfortran is installed, among others.

If GCC installation is deemed necessary then it sets a flag indicating as much, called `need_to_install_gcc=yes`. This is used later (see next section) to set the `$(inst_gcc)` variable.

c) git
------

Sage actually includes an spkg for git, and installs it automatically (there is currently no way to control this) if a working 'git' is not found on the system. This is one of the few packages that just has a straightforward check for the system version at configure time. If a working git is not found (where 'working' here just means `git --version` works) the script sets a variable (similar to the gcc case) called `need_to_install_git=yes`. (It also sets a similar variable for `need_to_install_yasm` on x86-based systems.)

Later, while writing the main Makefile, the configure script loops over all spkgs that *might* be installed and checks for a `need_to_install_<spkg>` variable. If not found, or not set to "no", the script sets the `$(inst_<spkg>)` variable to point to the standard stamp file for that package. Otherwise it sets `$(inst_<spkg>)` to a dummy file that always exists (this way any dependencies for that package are still satisfied, but the spkg is never actually built/installed).

4. Package sources
==================

One of the main changes I'm proposing is that stamp files for packages will always be written to SAGE_SPKG_INST even for cases where the system package is used, and the Sage spkg is not actually installed. That is, I want to change the meaning of "spkg" to more broadly represent "a dependency of Sage that *may* be included in Sage-the-distribution".

To this end I want to define a concept of spkg "sources" (not to be confused with source code). Rather, these are sources from which the spkg dependency can be satisfied. Three possible sources I have in mind (and I'm not sure that there would be any others):

a) sage-dist: This is the current notion of an "spkg", where the source tarball is downloaded from one of the Sage mirrors, unpacked and installed to $SAGE_LOCAL using sage-spkg + the spkg's spkg-install script. The resulting stamp file, with the version taken from package-version.txt, is written to $SAGE_SPKG_INST.

b) system: In this case a check is made to see if the dependency is already satisfied by the system. How exactly this check is performed depends heavily on the package. *If possible* the version of the system package is also determined (will discuss the nuts-and-bolts of this later). In this case a stamp file is still written to $SAGE_SPKG_INST, but indicating somehow that the system package was used, not the sage-dist package.

c) source: This case is not necessary for supporting system packages, but I think would be useful for testing new versions of a package. In this case it would be possible to install an spkg from an existing source tree for that package, which would be installed using the spkg-install script. If possible the version number would be determined from the package source code, and not assumed. I think this would be useful, but won't discuss this case any further for now. I just point it out as another possibility within this framework of allowing different spkg "sources".

To summarize, no matter how an spkg dependency is satisfied, a stamp file for that spkg is written to $SAGE_SPKG_INST, possibly indicating the *actual* version of the package being used by Sage, and indicating how the dependency was satisfied.

5. Nuts and bolts
=================

a) New stamp file format
------------------------

As suggested in the previous section, no matter how an spkg dependency was satisfied, a stamp file is written to the $SAGE_SPKG_INST directory. In order to support multiple possible package "sources", the source that was used should be included in the stamp file. This way, it will also be possible to re-run `./configure` and specify a different source for a package, thus forcing a rebuild. So I think the stamp filename format should be something like:

    $SAGE_SPKG_INST/<name>-<source>-<version>

where <name> would be the base package name, <source> would be something like "sagedist" or "system", and <version> the *actual* version of the package being used. I'll discuss in the next section how this might be determined for system packages.

There's plenty of room for bikeshedding in this, but I think this makes sense. We could also support the old filename format, if such files are found, for backwards compatibility.

b) Checking packages
--------------------

For any dependency that may be satisfied by system packages, there needs to be a way to specify what the minimum requirement is for Sage (be it a version number, or the presence of certain features), and a way for each package to check that the requirement is satisfied.

I've gone back and forth on exactly how this should be done, but I think that the best way to do this is to allow per-package m4 files, containing an m4 macro that checks that the dependency on that package is satisfied (again, be it version number or some other check). Each macro could be named something like

    SAGE_SPKG_CHECK_<name>

Optionally the macro should set a variable indicating the package *version* if the package dependency is satisfied. This is the version string that can be used in the stamp file, for example.
If there is no clear way to determine the version (though in most cases there will be), a string like "unknown" could still be allowed for the version.

The macro would be defined in a file like sage_spkg_check.m4 under each build/pkgs/<spkg> directory, and loaded on an as-needed basis using the m4_include command in configure.ac.

Writing an m4 macro for autoconf is not a common skill, which is why I've hesitated on this. But I think it has a few justifications: It allows one to take advantage of the many existing macros that come with autoconf to perform common checks, such as whether a program is installed, or a function is available in a library. For many packages the SAGE_SPKG_CHECK_ macro would probably just wrap one or two existing autoconf macros. Another justification is that for some packages there may be existing macros to check for them that we can borrow from other projects. We can also provide, in the documentation, a simple template macro demonstrating how to wrap a few shell commands.

*NOTE*: To be clear, I'm not proposing that, to implement this proposal, we go through and write 250+ m4 macros for every Sage spkg. This check will be optional, and we can write them one at a time on an as-needed basis, starting with some of the most important ones. I'll discuss more about how missing checks are handled in the next section.

Obviously the packages that already have checks in configure.ac (gcc, git, yasm) would have those checks moved out to their package-specific macros.

c) Driving the system
---------------------

As previously noted, selecting the source for a package would be done at ./configure time. My proposal would be to change very little about the current default behavior. By default, all packages would be installed from the sage-dist source as is the case now. We could still make exceptions for build dependencies like gcc and git. I don't care whether these exceptions are hard-coded in configure.ac, or specified in some generic way.

However, the configure script would support, for all spkgs, a `--with-system-<spkg>` argument (e.g. `--with-system-zlib`).

For each spkg to be installed (all standard packages, optional packages if selected), if the `--with-system-<spkg>` argument is given, it will attempt to load and run the SAGE_SPKG_CHECK_<spkg> macro for that package. If the macro is not defined, there would be a *warning* that the system package was selected for that package, but there is no way to check if it was installed. The warning would make clear that if the build fails it may be due to this dependency being missing. Otherwise it runs the check, and if the check succeeds the configure script would continue, while if the check fails the configure would stop with an error.

Optionally, we could add arguments to control all of this behavior. For example, it might be useful to have an option to install the sage-dist spkg if a check is not defined. This might even be better as the default--a possible bikeshed issue. Another possible option is one that enables system packages, but disables any checks. This might be useful for system packagers who already have external guarantees that the dependencies have been met.

Finally, there should be an option like `--with-system-all` to automatically use system packages for all dependencies, so that downstream packagers don't have to supply hundreds of `--with-system-` flags.

Otherwise, generation of the build/make/Makefile by the configure script would proceed more or less as it does currently. It would just take into account information gained through any `--with-system-` flags to generate the new format stamp filenames. The .dummy stamp file would not be used anymore. Also, the rule for building system packages would be to simply write the stamp file.

6. Q&A
======

Q: What if I install with --with-system-<spkg> but later want to install the sage-dist version of that package?

A: We should also support some way to deselect system packages.
Perhaps --without-system-<spkg> / --with-system-<spkg>=no (these are two ways of saying the same thing in standard configure scripts).

Q: The reverse: What if I install the sage-dist package, but want to switch to the system package?

A: Same thing, but this is a little trickier because we would need to *uninstall* the package from $SAGE_LOCAL. I have a proposal for improving spkg uninstallation written up at https://trac.sagemath.org/ticket/22510

Q: What if I use a system package when building Sage, but that package is later upgraded, or worse, removed?

A: There's no great solution to this. Certainly, I think the ./configure time checks should be cached (since updates are not usually *that* frequent). So there needs to be good documentation on invalidating the cache when re-running ./configure.

Still, that only helps with configure-time detection. Sage can still break at runtime if a system package it depends on changes. This is a generic problem for *any* software development, however, and something developers should be aware of if they're updating their system. Granted, most people don't always closely examine what's changing when they install, for example, OS updates. I certainly don't always check this with a fine-toothed comb. But it's a general issue. Keeping the ability to install the "standard", known-working sage-dist spkgs if needed is also a big advantage of this proposal.

Any other questions?

7. Future concepts
==================

a) Platform hooks
-----------------

It might be nice, when using system packages, for the underlying OS/distribution system to hook into the SAGE_SPKG_CHECK_ system, both to check if a package is installed, and to provide its version number. For example, when building Sage on Debian, it might just hook into the dpkg system to provide this information in a manner consistent with the system.

b) Abstract packages
--------------------

Returning to the question of dependencies that can be satisfied by more than one package (e.g.
BLAS, GMP), I think it would be nice to have a generic way of handling such cases that's a little cleaner than the current ad-hoc system.

I would like a way of specifying an "abstract" package (which might be named "blas", for example). Installing an abstract package would mean installing the concrete package selected to satisfy it, but it would also include a system for switching between concrete implementations. So for example it would be possible to have multiple BLAS implementations installed simultaneously, and installing "blas" with the current selection might just be a matter of updating some symlinks.

I think this concept fits in well with the proposal for handling system packages, but doesn't necessarily need to be handled simultaneously with it. For now we can just maintain the special cases I think...

8. Conclusion (for now)
=======================

I've heard many valid concerns with going beyond sage-the-distribution for building/running Sage. Sage's huge collection of dependencies can lead to many fragilities: Version X of package Y might work with dependency A, but completely break dependency B. And supporting versions V, W, and X of package Y simultaneously is a lot of overhead compared to always just using version X of that package for Sage.

I do personally have a preference, when it comes to writing software, for supporting as wide a range of versions for my dependencies as is feasible. For some dependencies the versions supported may, necessarily, be very narrow. But for other cases there can be a lot more room for flexibility.

Regardless, I think this proposal maintains the current stability of Sage by keeping the current preference for sage-the-distribution in all cases by default. It also maintains the ability to use custom-built versions of some of Sage's dependencies.
But I think this will also provide more flexibility in experimenting with using existing system packages in cases where that's sufficient, and avoid Sage duplicating system packages unnecessarily.

Best,
Erik

[1] https://trac.sagemath.org/ticket/14405
[2] https://www.technovelty.org/tips/the-stamp-idiom-with-make.html
[3] https://groups.google.com/d/msg/sage-devel/8MJBe_qxWJ0/fTzOPVzDAAAJ