Hello Mo, all,

I've now read through the full GR text and commentary. I've a bunch of comments, but I'll post them separately (and/or in the MR). In this mail I'd like to focus on one important aspect related to implications:
On Sun, Feb 02, 2025 at 12:56:59AM -0500, M. Zhou wrote:
> (2) are the options clear enough for vote? Considering lots of the readers may
> not be familiar with how AI is created. I tried to explain it, as well as
> the implication if some components are missing.

I'd like to understand, in case option A passes, when and how Debian will rebuild AI models that are included in some free software in the archive from their "source" (which will include the full training dataset, as per option A indeed).

Concrete examples
-----------------

Let's take two simple examples that, size-wise, could fit in the Debian archive together with their training pipelines and datasets.

First, let's take a "small" image classification model that one day might be included in Digikam or similar free software. Let's say the trained model is ~1 GiB (like Moondream [1] today) and that the training dataset is ~10 GiB. (I've no idea whether the Moondream training dataset is open data, and I'm probably being very conservative with its size here; just assume these figures are correct for now.)

[1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01

For a second, even smaller example, let's consider gnubg (GNU Backgammon), which today ships, in the Debian archive, a trained neural network [2] of less than 1 MiB. Its training data is *not* in the archive, but is available online without a license (AFAICT) [3] and weighs about ~80 MiB. The training code is available as well [4], even though still in Python 2.

[2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
[3]: https://alpha.gnu.org/gnu/gnubg/nn-training
[4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git

What do we put where?
---------------------

Regarding source packages, I suspect that most of our upstream authors who end up using free AI will *not* include training datasets in the distribution tarballs or Git repositories of the main software. So what will we do downstream?
Do we repack source packages to include the training datasets? Do we create *separate* source packages for the training datasets? Do we set up a separate hosting place (ftp? git-annex? git-lfs?) for large training datasets, to avoid exploding mirror sizes? Or do we simply refer to external hosting places that are not under Debian control?

Of course, this would be an irrelevant problem in the gnubg case (we can just store everything in the source package), but it will become more significant in the Digikam case, and the number of such cases will probably increase over time.

(Note that I do not have definitive answers to any of the questions in this email. Also, I'm *not* raising them as counter-arguments to option A, which is my favorite one at the moment. I just want to make sure that we have a rough idea of how we will in practice *implement* option A in a way that fits Debian processes, rather than thinking about that only after the vote. We have been there for a number of GRs in the past, and it has not been fun.)

Regarding binary packages, the question applies to large trained AI models too. On this front, we could rely entirely on upstream software to "unbundle" trained models from the software itself, so that they are downloaded on the fly on user machines and never enter the Debian archive. Based on previous answers, I suspect this might be what Mo has in mind. But I don't find it very satisfactory, for a number of reasons: it will not be universal, we might end up having to host some of the large models ourselves at some point, and even when it is handled upstream we will leave users on their own in terms of installation risks, etc. (Yes, this is not a new problem, and it applies to other software that automatically downloads plugins and whatnot, but I still don't like it.)

When do we retrain?
-------------------

The most difficult question for me is: when do we retrain AI models (shipped in Debian binary packages) from their training datasets (shipped in source packages)?

In some cases, as pointed out in the GR commentary, it will be computationally impossible for Debian to do so. We can mostly ignore these cases, but they raise the question: do we want to ship in Debian trained AI models that *allegedly* have all their training data and pipeline available under free licenses, if we cannot verify/rebuild them ourselves? (If this smells like the XZ Utils hack to you, you're not alone!)

Let's focus now on the cases that *could* be retrained by Debian, possibly after buying a dozen or so GPUs to put in dedicated buildds. Potential answers on when to retrain are:

- We never retrain. We are then back to the already discussed XZ Utils smell. I don't think this would be wise/acceptable in cases where it is feasible for us to retrain.

- We retrain at every package build. Why not, but it will be quite expensive. (Add here your favorite environmental concerns.) It will also require some dedicated scheduling to separate packages that need GPUs to build from the others. This will probably result in a natural separation between source packages containing training datasets, which can be tagged as needing GPUs to build, and which result in binary packages that are dependencies of the final software installed by users. Seems appealing to me.

- We retrain every now and then (e.g., once per release). A compromise between the previous two. It can be analogous to bootstrapping compilers, which we don't do systematically, but which can be done by motivated developers and porters. If we go down this path, we probably want to standardize some debian/rules target ("bootstrap"?) that recreates trained models from sources and can then be used to update the source packages that ship them.
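To make the verification idea above concrete, here is a minimal, purely illustrative Python sketch of the regenerate-and-compare check that such a retraining step could perform: retrain from the dataset shipped in the source package, serialize the weights deterministically, and compare digests against the model we ship. The "training" here is a toy least-squares fit, a stand-in I made up for the sake of the example; it is *not* how any real pipeline works, and real GPU training would generally break the digest equality, which is exactly the reproducibility problem.

```python
# Hypothetical sketch: check that a shipped model can be regenerated
# from its training data. All names here (train, weights_digest, the
# toy dataset) are illustrative, not from any real package.
import hashlib
import struct

def train(dataset):
    """Toy deterministic 'training': fit y = a*x by least squares."""
    num = sum(x * y for x, y in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den

def weights_digest(weight):
    """Serialize the single model weight deterministically, then hash it."""
    return hashlib.sha256(struct.pack("<d", weight)).hexdigest()

dataset = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

shipped = weights_digest(train(dataset))  # digest recorded when the package was built
rebuilt = weights_digest(train(dataset))  # digest from an independent rebuild

# A fully deterministic pipeline makes the two digests match; real-world
# (GPU, parallel, nondeterministic) training usually does not.
assert shipped == rebuilt
```

The point of the sketch is only that the check is trivial *if* retraining is deterministic; everything hard about it lives in making (or declaring) the pipeline deterministic, which is the subject of the next section.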
Reproducible builds
-------------------

Side, but important, consideration: retraining will in most cases not be bitwise reproducible, as pointed out already in the GR commentary. The practical consequence for Debian is that all packages that end up containing the logic for retraining AI models will remain non-bitwise-reproducible for the foreseeable future. (Which is an additional good argument for clearly separating those packages from the others.)

Have I missed any other specific Debian process that will be impacted?

Cheers

-- 
Stefano Zacchiroli . z...@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage . Mastodon: https://mastodon.xyz/@zacchiro