[sage-devel] Brainstorming about Sage dependencies from system packages

Erik Bray Fri, 26 May 2017 06:01:45 -0700

Hi folks interested in Sage packaging,

Almost every time the topic comes up, I complain that it isn't easier
to use more system packages as both build- and run-time dependencies
of Sage.  I'd like to make some progress on actually doing something
about that, and I have some ideas, but I'd like to bounce them off
anyone who's interested first before just going off and doing it.


There is enough work involved in this that I believe it can and should
be broken up into a number of smaller tasks.  I would also like to
approach this in a way that works well and integrates with the
existing "sage-the-distribution" infrastructure.  I believe there are
advantages to being able to develop on Sage in the "normal" way we're
already used to, while also being able to take advantage of existing
system packages wherever possible.

So I'm just going to try to organize my existing thoughts on this and
see what anyone thinks.  Sorry if it's TL;DR, but I'm hoping that
having a detailed discussion about this will make it more likely that
something will actually be accomplished on it soon (because I think
the actual implementation, once decided on, is not terribly
difficult).

Note: In this message I'm using "package" loosely to refer to any
program, library, database, or other collection of files that is
distributed and installed as a self-contained unit.  It doesn't
necessarily relate to any particular "packaging system".


1. Why?
=======

The extent and scope to which Sage "vendors" its dependencies, in the
form of what some call "sage-the-distribution", is *not* particularly
normal in the open source world.  Vendoring *some* dependencies is not
unusual, but Sage does nearly all (even down to gcc, in certain
cases).  I've learned a lot of the history to this over the past year,
and agree that most of the time this has been done with good reasons.

For example, I can't think of any other software that forces me to
build its own copy of ncurses just to build/install it.  This was
added for good reasons [1], but not reasons that can't also be resolved
in part by installing the appropriate system packages, or that might
not be resolved by now in system packages that depend on ncurses (i.e.
that should be built with ncurses support).  Point being, this issue
does not necessarily impact everyone, and building Sage's own ncurses
is overkill in that case.  It would be one thing if we were just
talking one or two packages (I didn't pick on ncurses for any deep
reason), but now multiply that by around 250 (give or take, depending
on how many dependencies are even available as system packages) and it
becomes real overhead to getting started *and* making progress with
Sage development.

I wouldn't propose *removing* any existing spkgs that are still
relevant.  I think it's really useful that Sage has a list of
known-good pinned versions of its dependencies. Further,
"sage-the-distribution" makes it very easy to install those
dependencies in such a way that they can be used as build/runtime
dependencies by Sage without having to hunt the 'net for the right
source packages of the right versions of those dependencies, and
figure out how to configure and build them in a piecemeal fashion.  In
other words, even if we do expand the ability to use system packages
for Sage's dependencies, it's still very nice that it's easy with a
few commands to use the spkg if something goes wrong with the system
package.  It's also, of course, important for power users who wish to
compile some dependencies on their own--especially highly tuned
numerical libraries (but even those users usually only care about
being able to hand-configure a few dependencies, not most).

To summarize: being able to more aggressively rely on system packages
can save a lot of time and frustration during normal development of
Sage, and is also less jarring especially to new developers, of whom
we would like to attract more.  It should also decrease the time
required to regularly build binary distributions of Sage (e.g. for
Docker, Windows, and Linux distros).


2. Overview of how Sage manages dependencies now (and what won't change)
========================================================================

For many of you this will be unnecessary review, but I want to discuss
a little about how dependencies are currently checked and installed in
Sage-the-distribution.  Doing so is helpful for me too, to make sure I
understand it clearly (and correct me if I have any
misunderstandings).

Sage-the-distribution uses *Make* itself (cleverly, IMO) to manage
dependencies insofar as making sure all dependencies are installed,
and that when a package changes all packages that depend (directly or
indirectly) on that package are rebuilt.  Make works on files and
timestamps, which does not translate directly to entire software
packages, so to track whether or not an spkg is up to date, Sage uses
the common "stamp pattern" for Make [2]--that is, when an spkg is
installed it writes a file that effectively "represents" completion of
the installation of that spkg for Make's purposes.  These stamp files
are the files typically stored under
$SAGE_LOCAL/var/lib/sage/installed/<spkg>-<version>.  This directory
is also known in some places as SAGE_SPKG_INST.  By including the
version number in the name we can also force rebuilds when an spkg's
version changes.
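
To make the stamp pattern concrete, here is a minimal Python sketch.
The function names are hypothetical and not part of Sage; only the
directory layout is taken from the description above, and the stamp
file contents are purely illustrative.

```python
import os
import tempfile


def stamp_path(sage_local, name, version):
    """Return the stamp file path for an spkg at a given version."""
    # Layout as described above:
    #   $SAGE_LOCAL/var/lib/sage/installed/<spkg>-<version>
    inst_dir = os.path.join(sage_local, "var", "lib", "sage", "installed")
    return os.path.join(inst_dir, "{}-{}".format(name, version))


def mark_installed(sage_local, name, version):
    """Write the stamp file as the last step of a successful install."""
    path = stamp_path(sage_local, name, version)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("installed\n")  # contents are illustrative only
    return path


def is_up_to_date(sage_local, name, version):
    """The stamp for *this* version existing means 'nothing to do'."""
    return os.path.exists(stamp_path(sage_local, name, version))
```

Because the version is part of the filename, asking about a newer
version of the same spkg finds no stamp, which is what forces the
rebuild.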

When one runs `make <spkg>` with just the spkg name, this is actually
a phony target with the path to the stamp file for that package (at
its current version) as its sole prerequisite.  So `make <spkg>`
translates to `make $SAGE_SPKG_INST/<spkg>-<version>` for the current
version of that spkg.  The associated rule is to run the sage-spkg
command for that package, which also takes care of writing the stamp
file.  sage-spkg also writes some information into each stamp file in
a somewhat loose format that I don't believe is parsed anywhere.
However the *existence* of these files is used by the (somewhat
controversial, for downstream packagers) `is_package_installed()`
function.*  I'm actually going to propose later that we write and use
these stamp files (with some slight changes) even when installing
dependencies from a system package, so these files might be present
even in binary packages for Sage (though that might be up to
downstream packagers).

When Sage's `./configure` script generates the main Makefile for all
of Sage's dependencies, it loops over all the spkgs in build/pkgs/ and
creates two make targets for each spkg: the aforementioned phony
target consisting of just the package name, and the *real* target for
the stamp file.  It also creates a make variable named like
`$(inst_<spkg>)` (where <spkg> is just the package name, without the
version) referring to the full path of the stamp file for that
package.  Each spkg may list its build dependencies in its
build/pkgs/<spkg>/dependencies file, in the format that it will appear
in the Makefile as dependencies for the make target of that package.
For convenience's sake, the `dependencies` file just contains the
package names, but the `./configure` script converts this to the
appropriate `$(inst_<spkg>)` variables, so that the stamp files become
the real dependencies (part of how the "stamp pattern" normally
works).
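
As a rough model of the Makefile generation just described, the
following Python sketch renders the variable, the stamp-file rule, and
the phony target for one spkg.  It is a simplification under stated
assumptions: the function name, the `$(SAGE_SPKG_INST)` make variable,
and the bare `sage-spkg <name>` recipe are illustrative, and the real
`dependencies` files can contain more than plain package names.

```python
def makefile_fragment(name, version, deps, inst_dir="$(SAGE_SPKG_INST)"):
    """Render the variable, stamp rule and phony target for one spkg."""
    stamp = "{}/{}-{}".format(inst_dir, name, version)
    # Package names from the dependencies file become $(inst_<dep>)
    # references, so the stamp files are the real prerequisites.
    prereqs = " ".join("$(inst_{})".format(d) for d in deps)
    lines = [
        "inst_{} = {}".format(name, stamp),
        "",
        "$(inst_{}): {}".format(name, prereqs).rstrip(),
        "\tsage-spkg {}".format(name),  # simplified recipe (assumption)
        "",
        "{}: $(inst_{})".format(name, name),
        ".PHONY: {}".format(name),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Package names and versions here are examples only.
    print(makefile_fragment("numpy", "1.11.1", ["python2", "openblas"]))
```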

When a package is upgraded (i.e. its version number changes) then the
Makefile is regenerated, but with the `$(inst_<spkg>)` for that
package pointing to a new stamp file, containing the new version
number.  Thus any dependents of that package will see this as an
outdated dependency, and get rebuilt after the upgraded package is
built.  When packages are rebuilt (even if their version didn't
change) their stamp files are touched, forcing further rebuilds of any
of their dependents and so on, in normal Make behavior.

As far as I can tell this has worked quite well for Sage--especially
as it also allows leveraging Make's parallel build features.  So I'm
proposing to keep this all pretty much as-is, with possibly only minor
tweaks in the details.  Instead, many more of the changes will be at
configure time.


* There is proposed work already mostly done to replace use of
is_package_installed() within the Sage library with a way to do
runtime feature checks: https://trac.sagemath.org/ticket/20382  Some
of this work *might* be redundant with what I want to propose, but can
also coexist with it, as it is currently designed for runtime use by
the Python code itself, and not during builds.


3. Case study--examples already in Sage
=======================================

Sage-the-distribution already has a few examples of "spkgs" in the
system that *may* use a system package, rather than building from
source.  As it is this is done in an ad-hoc manner that can be
surprising and/or misleading.  But I think it's useful to look at them
to see how this is done currently and if there's anything we can learn
from it.

a) BLAS
-------

There are two different BLAS implementation packages to choose from
currently in Sage: OpenBLAS and ATLAS.*  The selection can be made
currently at configure time with a --with-blas= flag which can take
either 'openblas' or 'atlas'.  The selection is used to write a
variable called `$(BLAS)` in the makefile that points to the stamp
file path for the actual BLAS implementation spkg selected.  Other
spkgs that have BLAS as a dependency list the `$(BLAS)` variable in
their dependencies, rather than writing "openblas" or "atlas"
explicitly.

When openblas is selected (now the default) the openblas spkg is
installed unconditionally.

However, when *atlas* is selected, there happens to be a mechanism for
using a system BLAS (why just with ATLAS I don't know--historical
reasons I guess).  In this case it still runs the spkg-install for
ATLAS like for any other spkg, but its spkg-install checks for a
special environment variable, `SAGE_ATLAS_LIB` (the only way to
control this behavior).  This invokes a search in standard locations
first for a "libatlas.so" (or equivalent) explicitly.  If that's not
found, it will happily take whatever it does find as long as there's
*some* "libblas.so" and "liblapack.so" found on the system.  It
doesn't do any feature checks or anything--it just takes what it
finds.
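
The search order described above can be sketched in Python as
follows.  This is only a model of the logic, not the actual
spkg-install code: the function names are hypothetical, the caller
supplies the list of directories to search, and only the ".so" names
are considered (not the "or equivalent" variants on other platforms).

```python
import os
import tempfile


def _first_existing(search_dirs, filename):
    """Return the first path to filename found in search_dirs, or None."""
    for d in search_dirs:
        candidate = os.path.join(d, filename)
        if os.path.exists(candidate):
            return candidate
    return None


def find_system_blas(search_dirs):
    """Return (kind, paths) for what was found, or None."""
    # First preference: an explicit ATLAS library.
    atlas = _first_existing(search_dirs, "libatlas.so")
    if atlas is not None:
        return ("atlas", [atlas])
    # Fallback: *some* generic BLAS plus *some* LAPACK, with no feature
    # checks of any kind.
    blas = _first_existing(search_dirs, "libblas.so")
    lapack = _first_existing(search_dirs, "liblapack.so")
    if blas is not None and lapack is not None:
        return ("generic", [blas, lapack])
    return None
```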

If it does find something resembling either ATLAS specifically, or a
generic BLAS/LAPACK, then it skips installing the actual spkg, but
still writes a stamp file indicating that "ATLAS" was installed, with
whatever version is in the package-version.txt for the spkg, which can
of course be misleading.  (It also writes pkgconfig .pc files in
$SAGE_LOCAL/lib for blas/cblas/lapack indicating which libs it found,
along with a "fake" version of "1.0".)

Thus, Sage will use these system libraries for all build and runtime
requirements of BLAS, and in my experience this has generally worked.

* There is another issue I would like to address--slightly orthogonal
to supporting system packages--of having a regular way to support
"abstract" packages that can have multiple alternative implementations
(another example being GMP/MPIR).  This has been talked about before,
such as in this recent thread [3].  I have some ideas about this that
integrate well with my ideas for system packages, but I will try to
save that for a separate message.


b) GCC
------

The GCC spkg is a bit of a different beast, since it is normally not
installed by default, and was only added to support cases where the
platform's GCC is broken or too old and has bugs that affect building
Sage or its dependencies.

Although Sage's `configure` script is responsible for determining
whether or not GCC should be installed (in contrast to hacks in
spkg-install like for ATLAS), there is no *flag* for `configure` (e.g.
--with-gcc or something like that) for controlling this.  Instead the
behavior is controlled solely by an environment variable
"SAGE_INSTALL_GCC" (this should probably be fixed, but we'll come to
that).  If the environment variable is set to "yes"/"no" then that
forces the gcc installation behavior one way or the other.  However,
if the environment variable is not set, then the configure script goes
through the necessary checks to see if the installed gcc is new
enough, and also if gfortran is installed, among others.  If GCC
installation is deemed necessary then it sets a flag indicating as
much, called `need_to_install_gcc=yes`.
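
The tri-state behavior of SAGE_INSTALL_GCC can be modeled with a few
lines of Python.  This is a sketch, not the configure script itself:
the function name is hypothetical, the result of all the system checks
is collapsed into a single boolean, and treating any value other than
"yes"/"no" as "unset" is my assumption.

```python
def need_to_install_gcc(environ, system_gcc_ok):
    """Return 'yes' or 'no' following the logic described above."""
    override = environ.get("SAGE_INSTALL_GCC", "")
    if override in ("yes", "no"):
        # An explicit setting wins; the system checks are not consulted.
        return override
    # Unset (or unrecognized, by assumption): use the outcome of the
    # configure-time checks (gcc new enough, gfortran present, ...).
    return "no" if system_gcc_ok else "yes"
```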

This is used later (see next section) to set the `$(inst_gcc)` variable.

c) git
------

Sage actually includes an spkg for git, and installs it
automatically (there is currently no way to control this) if a
working 'git' is not found on the system.  This is one of the few
packages that just has a straightforward check for the system version
at configure time.  If a working git is not found (where 'working'
here just means `git --version` works) the script sets a variable
(similar to the gcc case) called `need_to_install_git=yes`.

(It also sets a similar variable for `need_to_install_yasm` on
x86-based systems.)

Later, while writing the main Makefile, the configure script loops
over all spkgs that *might* be installed and checks for a
`need_to_install_<spkg>` variable.  If not found, or not set to "no",
the script sets the `$(inst_<spkg>)` variable to point to the standard
stamp file for that package.  Otherwise it sets `$(inst_<spkg>)` to a
dummy file that always exists (this way any dependencies on that
package are still satisfied, but the spkg is never actually
built/installed).
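
In Python terms, the selection between the real stamp file and the
dummy file looks roughly like this.  The function name, the
`$(SAGE_SPKG_INST)` make variable, and the location of the `.dummy`
file are assumptions for illustration.

```python
def inst_variable(name, version, flags, inst_dir="$(SAGE_SPKG_INST)"):
    """Return the Makefile line defining $(inst_<name>)."""
    # Only an explicit "no" suppresses the build; a missing variable
    # counts as "install the spkg".
    if flags.get("need_to_install_" + name) == "no":
        target = "{}/.dummy".format(inst_dir)
    else:
        target = "{}/{}-{}".format(inst_dir, name, version)
    return "inst_{} = {}".format(name, target)
```

Since the dummy file always exists, anything that lists
`$(inst_<spkg>)` as a prerequisite sees it as already satisfied.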


4. Package sources
==================

One of the main changes I'm proposing is that stamp files for packages
will always be written to SAGE_SPKG_INST even for cases where the
system package is used, and the Sage spkg is not actually installed.

That is, I want to change the meaning of "spkg" to more broadly
represent "a dependency of Sage that *may* be included in
Sage-the-distribution".

To this end I want to define a concept of spkg "sources" (not to be
confused with source code).  Instead, these are sources from which the
spkg dependency can be satisfied.  Three possible sources I have in
mind (and I'm not sure that there would be any others):

a) sage-dist:  This is the current notion of an "spkg", where the
source tarball is downloaded from one of the Sage mirrors, unpacked
and installed to $SAGE_LOCAL using sage-spkg + the spkg's spkg-install
script.  The resulting stamp file, with the version taken from
package-version.txt is written to $SAGE_SPKG_INST.

b) system: In this case a check is made to see if the dependency is
already satisfied by the system. How exactly this check is performed
depends heavily on the package.  *If possible* the version of the
system package is also determined (will discuss the nuts-and-bolts of
this later).  In this case a stamp file is still written to
$SAGE_SPKG_INST, but indicating somehow that the system package was
used, not the sage-dist package.

c) source: This case is not necessary for supporting system packages,
but I think would be useful for testing new versions of a package.  In
this case it would be possible to install an spkg from an existing
source tree for that package, which would be installed using the
spkg-install script.  If possible the version number would be
determined from the package source code, and not assumed.  I think
this would be useful, but won't discuss this case any further for now.
I just point it out as another possibility within this framework of
allowing different spkg "sources".

To summarize, no matter how an spkg dependency is satisfied, a stamp
file for that spkg is written to $SAGE_SPKG_INST, possibly
indicating the *actual* version of the package being used by Sage, and
indicating how the dependency was satisfied.


5. Nuts and bolts
=================

a) New stamp file format
------------------------

As suggested in the previous section, no matter how an spkg dependency
was satisfied, a stamp file is written to the $SAGE_SPKG_INST
directory.  In order to support multiple possible package "sources",
the source that was used should be included in the stamp file.  This
way, it will also be possible to re-run `./configure` and specify a
different source for a package, thus forcing a rebuild.  So I think
the stamp filename format should be something like:

    $SAGE_SPKG_INST/<name>-<source>-<version>

where <name> would be the base package name, <source> would be
something like "sagedist" or "system", and <version> the *actual*
version of the package being used.  I'll discuss in the next section
how this might be determined for system packages.  There's plenty of
room for bikeshedding in this, but I think this makes sense.  We could
also support the old filename format, if such files are found, for
backwards compatibility.
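
Here is a minimal Python sketch of how the proposed filenames could be
written and read back, including the fallback for old-style names.
The function names are hypothetical, and the parser assumes that
package names never contain a "-" and that the source token is one of
a small known set; neither assumption is settled by this proposal.

```python
KNOWN_SOURCES = ("sagedist", "system")


def format_stamp(name, source, version):
    """Build a new-style stamp filename <name>-<source>-<version>."""
    if source not in KNOWN_SOURCES:
        raise ValueError("unknown source: {!r}".format(source))
    return "{}-{}-{}".format(name, source, version)


def parse_stamp(filename):
    """Return (name, source, version) for a stamp filename.

    Old-style names of the form <name>-<version> are reported with the
    source 'sagedist'.
    """
    name, _, rest = filename.partition("-")
    source, sep, version = rest.partition("-")
    if sep and source in KNOWN_SOURCES:
        return (name, source, version)
    # Old format: everything after the first '-' is the version.
    return (name, "sagedist", rest)
```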


b) Checking packages
--------------------

For any dependency that may be satisfied by system packages, there
needs to be a way to specify what the minimum requirement is for Sage
(be it a version number, or the presence of certain features), and
there needs to be a way for each package to check that the dependency
is satisfied.

I've gone back and forth on exactly how this should be done, but I
think that the best way to do this is to allow per-package m4 files,
each containing an m4 macro that checks that the dependency on that
package is satisfied (again, be it version number or some other
check).  Each macro could be named something like

    SAGE_SPKG_CHECK_<name>

Optionally the macro should set a variable indicating the package
*version* if the package dependency is satisfied.  This is the version
string that can be used in the stamp file, for example.  If there is
no clear way to determine the version (though in most cases there will
be), a string like "unknown" could still be allowed for the version.
The macro would be defined in a file like sage_spkg_check.m4 under
each build/pkgs/<spkg> directory, and loaded on an as-needed basis
using the m4_include command in configure.ac.

Writing an m4 macro for autoconf is not a common skill, which is why
I've hesitated on this.  But I think it has a few justifications: It
allows one to take advantage of the many existing macros that come
with autoconf to perform common checks, such as whether a program is
installed, or a function is available in a library.  For many packages
the SAGE_SPKG_CHECK_ macro would probably just wrap one or two
existing autoconf macros.  Another justification is that for some
packages there may be existing macros to check for them that we can
borrow from other projects.

We can also provide, in the documentation, a simple template macro
demonstrating how to wrap a few shell commands.
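
To illustrate what such a check has to decide, independent of m4
syntax, here is the logic for a simple "program with a minimum
version" check modeled in Python.  This is not the proposed macro
itself: the function names are hypothetical, and the three outcomes
(a version string, "unknown", or failure) follow the description
above.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted number in text as a tuple of ints."""
    m = re.search(r"\d+(?:\.\d+)*", text)
    return tuple(int(p) for p in m.group(0).split(".")) if m else None


def check_program(program, minimum, version_arg="--version"):
    """Return a version string, 'unknown', or None if unsatisfied."""
    if shutil.which(program) is None:
        return None  # not installed at all
    try:
        out = subprocess.run([program, version_arg],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT,
                             universal_newlines=True,
                             check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None  # present, but not working
    found = parse_version(out)
    if found is None:
        return "unknown"  # usable, but no version could be determined
    if found >= tuple(minimum):
        return ".".join(str(p) for p in found)
    return None  # too old
```

For example, `check_program("git", (1, 7))` would return the detected
git version on a system with a recent enough git, and None otherwise.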

*NOTE*: To be clear, I'm not proposing that, to implement this
proposal, we go through and write 250+ m4 macros for every Sage spkg.
This check will be optional, and we can write them one at a time on an
as-needed basis, starting with some of the most important ones.  I'll
discuss more about how missing checks are handled in the next section.

Obviously the packages that already have checks in configure.ac (gcc,
git, yasm) would have those checks moved out to their package-specific
macros.


c) Driving the system
---------------------

As previously noted, selecting the source for a package would be done
at ./configure time.  My proposal would be to change very little about
the current default behavior.

By default, all packages would be installed from the sage-dist source
as is the case now.  We could still make exceptions for build
dependencies like gcc and git.  I don't care whether these exceptions
are hard-coded in configure.ac, or specified in some generic way.

However, the configure script would support, for all spkgs, a
`--with-system-<spkg>` argument (e.g. `--with-system-zlib`).

For each spkg to be installed (all standard packages, optional
packages if selected), if the `--with-system-<spkg>` argument is
given, the configure script will attempt to load and run the
SAGE_SPKG_CHECK_<spkg> macro for that package.  If the macro is not
defined, there would be a *warning* that the system package was
selected for that package, but that there is no way to check if it was
installed.  The warning would make clear that if the build fails it
may be due to this dependency being missing.  Otherwise it runs the
check, and if the check succeeds the configure script would continue,
while if the check fails the configure would stop with an error.
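
The three outcomes just described (no check defined, check passes,
check fails) can be modeled in Python like this.  The function name
and the `checks` mapping are hypothetical stand-ins for the proposed
SAGE_SPKG_CHECK_<spkg> macros, and the message texts are invented for
illustration.

```python
def resolve_system_package(name, checks):
    """Model what configure would do for --with-system-<name>.

    `checks` maps package names to zero-argument callables returning a
    version string on success or None on failure.  Returns a tuple
    (version, warning); raises SystemExit when a check fails.
    """
    check = checks.get(name)
    if check is None:
        # No check available: warn, but carry on.
        warning = ("system package selected for {0}, but no check is "
                   "defined; a later build failure may be due to {0} "
                   "being missing".format(name))
        return ("unknown", warning)
    version = check()
    if version is None:
        raise SystemExit(
            "configure: error: system {} is not usable".format(name))
    return (version, None)
```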

Optionally, we could add arguments to control all of this behavior.
For example, it might be useful to have an option to install the
sage-dist spkg if a check is not defined.  This might even be better
as the default--a possible bikeshed issue.

Another possible option is one that enables system packages, but
disables any checks.  This might be useful for system packagers who
already have external guarantees that the dependencies have been met.

Finally, there should be an option like `--with-system-all` to
automatically use system packages for all dependencies, so that
downstream packagers don't have to supply hundreds of `--with-system-`
flags.

Otherwise, generation of the build/make/Makefile by the configure
script would proceed more or less as it does currently.  It would just
take into account information gained through any `--with-system-`
flags to generate the new format stamp filenames.  The .dummy stamp
file would not be used anymore.  Also, the rule for building system
packages would be to simply write the stamp file.


6. Q&A
======

Q: What if I install with --with-system-<spkg> but later want to
install the sage-dist version of that package?

A: We should also support some way to deselect system packages.
Perhaps --without-system-<spkg> / --with-system-<spkg>=no (these are
two ways of saying the same thing in standard configure scripts).
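
As a sketch of how the selection flags discussed so far could combine,
here is a small Python model.  The function name is hypothetical, and
letting later flags override earlier ones (so that `--with-system-all`
can be followed by individual `--without-system-<spkg>` exceptions) is
my assumption, not a settled part of the proposal.

```python
def system_selection(argv, all_packages):
    """Return the set of packages to be taken from the system."""
    selected = set()
    for arg in argv:
        if arg.startswith("--without-system-"):
            # --without-X is equivalent to --with-X=no
            name, value = arg[len("--without-system-"):], "no"
        elif arg.startswith("--with-system-"):
            name, _, value = arg[len("--with-system-"):].partition("=")
            value = value or "yes"
        else:
            continue  # unrelated configure argument
        targets = set(all_packages) if name == "all" else {name}
        if value == "no":
            selected -= targets
        else:
            selected |= targets
    return selected
```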

Q: The reverse: What if I install the sage-dist package, but want to
switch to the system package?

A: Same thing, but this is a little trickier because we would need to
*uninstall* the package from $SAGE_LOCAL.  I have a proposal for
improving spkg uninstallation written up at
https://trac.sagemath.org/ticket/22510

Q: What if I use a system package when building Sage, but that package
is later upgraded, or worse, removed?

A: There's no great solution to this.  Certainly, I think the
./configure time checks should be cached (since updates are not
usually *that* frequent).  So there needs to be good documentation on
invalidating the cache when re-running ./configure.  Still, that only
helps with configure-time detection.  Sage can still break at runtime
if a system package it depends on changes.  This is a generic problem
for *any* software development, however, and something developers
should be aware of if they're updating their system.  Granted, most
people don't always closely examine what's changing when they install,
for example, OS updates.  I certainly don't always check this with a
fine-toothed comb.  But it's a general issue.  Keeping the ability to
install the "standard", known-working sage-dist spkgs if needed is
also a big advantage of this proposal.

Any other questions?


7. Future concepts
==================

a) Platform hooks
-----------------

It might be nice, when using system packages, for the underlying
OS/distribution system to hook into the SAGE_SPKG_CHECK_ system, both
to check if a package is installed, and to provide its version number.
For example, when building Sage on Debian, it might just hook into the
dpkg system to provide this information in a manner consistent with
the system.
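
For the Debian case, such a hook might boil down to something like the
following Python sketch built on `dpkg-query`.  The function names are
hypothetical, and the mapping from an spkg name to the corresponding
Debian package name (which is often different) is left to the caller.

```python
import subprocess


def parse_dpkg_output(output):
    """Return the version if the status says 'installed', else None."""
    status, _, version = output.strip().partition("\t")
    if status.split()[-1:] == ["installed"] and version:
        return version
    return None


def dpkg_version(package):
    """Ask dpkg-query for the installed version of a Debian package."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}\\t${Version}", package],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            universal_newlines=True)
    except OSError:
        return None  # dpkg-query is not available on this system
    if result.returncode != 0:
        return None  # package unknown to dpkg
    return parse_dpkg_output(result.stdout)
```

Checking the status field as well as the version avoids reporting a
package that was removed but still has configuration files left
behind.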

b) Abstract packages
--------------------

Returning to the question of dependencies that can be satisfied by
more than one package (e.g. BLAS, GMP), I think it would be nice to
have a generic way of handling such cases that's a little cleaner than
the current ad-hoc system.  I would like a way of specifying an
"abstract" package (which might be named "blas", for example).
Installing an abstract package would mean installing the concrete
package selected to satisfy it, but it would also include a system for
switching between concrete implementations.  So for example it would
be possible to have multiple BLAS implementations installed
simultaneously, and installing "blas" with the current selection might
just be a matter of updating some symlinks.
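
A toy version of the symlink-switching idea, in Python, might look
like this.  It is a sketch under strong simplifying assumptions: the
function name and the `lib<name>.so` naming are illustrative, and a
real BLAS switch would also have to deal with cblas, lapack, sonames
and pkg-config files.

```python
import os
import tempfile


def select_implementation(lib_dir, abstract, concrete):
    """Point lib<abstract>.so at lib<concrete>.so inside lib_dir."""
    link = os.path.join(lib_dir, "lib{}.so".format(abstract))
    target = "lib{}.so".format(concrete)
    if not os.path.exists(os.path.join(lib_dir, target)):
        raise FileNotFoundError(target)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)  # relative link within lib_dir
    os.replace(tmp, link)    # swap the new link into place
    return link
```

With both implementations installed side by side, switching is then
just `select_implementation(lib_dir, "blas", "atlas")` or
`select_implementation(lib_dir, "blas", "openblas")`.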

I think this concept fits in well with the proposal for handling
system packages, but doesn't necessarily need to be handled
simultaneously with it.  For now we can just maintain the special
cases I think...


8. Conclusion (for now)
=======================

I've heard many valid concerns with going beyond sage-the-distribution
for building/running Sage.  Sage's huge collection of dependencies can
lead to many fragilities: Version X of package Y might work with
dependency A, but completely break dependency B.  And supporting
versions V, W, and X of package Y simultaneously is a lot of overhead
compared to always just using version X of that package for Sage.

I do personally have a preference, when it comes to writing software,
to supporting as wide a range of versions for my dependencies as is
feasible.  For some dependencies the versions supported may,
necessarily, be very narrow.  But for other cases there can be a lot
more room for flexibility.

Regardless, I think this proposal maintains the current stability of
Sage by keeping the current preference for sage-the-distribution in
all cases by default.  It also maintains the ability to use
custom-built versions of some of Sage's dependencies.  But I think this
will also provide more flexibility in experimenting with using
existing system packages in cases where that's sufficient, and avoid
Sage duplicating system packages unnecessarily.

Best,
Erik


[1] https://trac.sagemath.org/ticket/14405
[2] https://www.technovelty.org/tips/the-stamp-idiom-with-make.html
[3] https://groups.google.com/d/msg/sage-devel/8MJBe_qxWJ0/fTzOPVzDAAAJ
