Hello Mo, all,

I've now read through the full GR text and commentary. I've a bunch of comments, but I'll post them separately (and/or in the MR). In this mail I'd like to focus on one important aspect related to implications:
On Sun, Feb 02, 2025 at 12:56:59AM -0500, M. Zhou wrote:
> (2) are the options clear enough for vote? Considering lots of the readers may
> not be familiar with how AI is created. I tried to explain it, as well as
> the implication if some components are missing.

I'd like to understand, in case option A passes, when and how Debian will rebuild AI models that are included in some free software in the archive from their "source" (which will include the full training dataset, as per option A indeed).

Concrete examples
-----------------

Let's take two simple examples that, size-wise, could fit in the Debian archive together with their training pipelines and datasets.

First, let's take a "small" image classification model that one day might be included in Digikam or similar free software. Let's say the trained model is ~1 GiB (like Moondream [1] today) and that the training dataset is ~10 GiB. (I've no idea whether the Moondream training dataset is open data, and I'm probably being very conservative with its size here; just assume these figures are correct for now.)

[1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01

For a second, even smaller example, let's consider gnubg (GNU Backgammon), which today ships, in the Debian archive, a trained neural network [2] of less than 1 MiB. Its training data is *not* in the archive, but is available online without a license (AFAICT) [3] and weighs about ~80 MiB. The training code is available as well [4], even though still in Python 2.

[2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
[3]: https://alpha.gnu.org/gnu/gnubg/nn-training
[4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git

What do we put where?
---------------------

Regarding source packages, I suspect that most of our upstream authors who end up using free AI will *not* include training datasets in the distribution tarballs or Git repositories of the main software. So what will we do downstream?
Do we repack source packages to include the training datasets? Do we create *separate* source packages for the training datasets? Do we set up a separate hosting place (ftp? git-annex? git-lfs?) for large training datasets, to avoid exploding mirror sizes? Or do we simply refer to external hosting places that are not under Debian control?

Of course, this would be an irrelevant problem in the gnubg case (we can just store everything in the source package), but it will become more significant in the Digikam case, and the number of such cases will probably increase over time.

(Note that I do not have definitive answers to any of the questions in this email. Also, I'm *not* raising them as counter-arguments to option A, which is my favorite one at the moment. I just want to make sure that we have a rough idea of how we will in practice *implement* option A in a way that fits Debian processes, rather than thinking about that only after the vote. We have been there for a number of GRs in the past, and it has not been fun.)

Regarding binary packages, the question applies to large trained AI models too. On this front, we could rely entirely on upstream software to "unbundle" trained models from the software itself, so that they are downloaded on the fly on user machines and never enter the Debian archive. Based on previous answers, I suspect this might be what Mo has in mind. But I don't find it very satisfactory, for a number of reasons: it will not be universal, we might end up having to host some of the large models ourselves at some point, and even when it is handled upstream we will leave users on their own in terms of installation risks, etc. (Yes, this is not a new problem, and it applies to other software that automatically downloads plugins and whatnot, but I still don't like it.)

When do we retrain?
-------------------

The most difficult question for me is: when do we retrain AI models (shipped in Debian binary packages) from their training datasets (shipped in source packages)?

In some cases, as pointed out in the GR commentary, it will be computationally impossible for Debian to do so. We can mostly ignore these cases, but they raise the question: do we want to ship in Debian trained AI models that *allegedly* have all their training data and pipeline available under free licenses, if we cannot verify/rebuild them ourselves? (If this smells like the XZ Utils hack to you, you're not alone!)

Let's focus now on the cases that *could* be retrained by Debian, possibly after buying a dozen or so GPUs to put in dedicated buildds. Potential answers on when to retrain are:

- We never retrain. We are then back to the already discussed XZ Utils smell. I don't think this would be wise/acceptable in cases where it is feasible for us to retrain.

- We retrain at every package build. Why not, but it will be quite expensive. (Add here your favorite environmental concerns.) It will also require some dedicated scheduling to separate packages that need GPUs to build from the others. This will probably result in a natural separation between source packages containing training datasets, which can be tagged as needing GPUs to build, and which result in binary packages that are dependencies of the final software installed by users. Seems appealing to me.

- We retrain every now and then (e.g., once per release). A compromise between the previous two. It can be analogous to bootstrapping compilers, which we don't do systematically, but which can be done by motivated developers and porters. If we go down this path, we probably want to standardize some debian/rules target ("bootstrap"?) that recreates trained models from sources and can then be used to update the source packages that ship them.
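To make the verification idea above concrete, here is a minimal, purely illustrative Python sketch of the regenerate-and-compare check that such a retraining step could perform: retrain from the dataset shipped in the source package, serialize the weights deterministically, and compare digests against the model we ship. The "training" here is a toy least-squares fit, a stand-in I made up for the sake of the example; it is *not* how any real pipeline works, and real GPU training would generally break the digest equality, which is exactly the reproducibility problem.

```python
# Hypothetical sketch: check that a shipped model can be regenerated
# from its training data. All names here (train, weights_digest, the
# toy dataset) are illustrative, not from any real package.
import hashlib
import struct

def train(dataset):
    """Toy deterministic 'training': fit y = a*x by least squares."""
    num = sum(x * y for x, y in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den

def weights_digest(weight):
    """Serialize the single model weight deterministically, then hash it."""
    return hashlib.sha256(struct.pack("<d", weight)).hexdigest()

dataset = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

shipped = weights_digest(train(dataset))  # digest recorded when the package was built
rebuilt = weights_digest(train(dataset))  # digest from an independent rebuild

# A fully deterministic pipeline makes the two digests match; real-world
# (GPU, parallel, nondeterministic) training usually does not.
assert shipped == rebuilt
```

The point of the sketch is only that the check is trivial *if* retraining is deterministic; everything hard about it lives in making (or declaring) the pipeline deterministic, which is the subject of the next section.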
Reproducible builds
-------------------

Side, but important, consideration: retraining will in most cases not be bitwise reproducible, as pointed out already in the GR commentary. The practical consequence for Debian is that all packages that end up containing the logic for retraining AI models will remain non-bitwise-reproducible for the foreseeable future. (Which is an additional good argument for clearly separating those packages from the others.)

Have I missed any other specific Debian process that will be impacted?

Cheers

-- 
Stefano Zacchiroli . z...@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage . Mastodon: https://mastodon.xyz/@zacchiro