Stefano Zacchiroli wrote:
> Regarding source packages, I suspect that most of our upstream authors
> that will end up using free AI will *not* include training datasets in
> distribution tarballs or Git repositories of the main software. So what
> will we do downstream? Do we repack source packages to include the
> training datasets? Do we create *separate* source packages for the
> training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
> hosting place where to host large training datasets to avoid exploding
> mirror sizes?
I'd suggest separate source packages, *and* putting them in a special section of the archive that mirrors may choose not to host.

I'm not sure whether there could also be technical problems with multi-gigabyte packages. For example: is there an upper limit on file size that could be hit? Can the package download be resumed if it is interrupted? Might tar fail to unpack the package? (Although all of those could be worked around by splitting the package into chunks...)

> Do we simply refer to external hosting places that are not under Debian
> control?

As I understand it, if we rule that the training dataset is "part of the source", and if we allow the software package into main, then we must host the training dataset on Debian servers (though not necessarily as a package). Otherwise the software package must go to non-free.

Gerardo
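P.S. For the record, the chunk-splitting workaround mentioned above is mechanically trivial with plain coreutils; here is a sketch with made-up file names and chunk size (none of this is existing Debian tooling or policy):

```shell
# Stand-in for a multi-gigabyte training-data tarball (name hypothetical):
head -c 10M /dev/urandom > foo-data_1.0.orig.tar.xz

# Split into fixed-size chunks, so no single file hits a per-file size
# limit and an interrupted transfer only costs one chunk:
split -b 4M -d foo-data_1.0.orig.tar.xz foo-data_1.0.orig.tar.xz.

# Downstream, concatenating the chunks restores the original
# byte-for-byte (numeric suffixes sort correctly in the glob):
cat foo-data_1.0.orig.tar.xz.* > rejoined.tar.xz
cmp foo-data_1.0.orig.tar.xz rejoined.tar.xz && echo "identical"
```

The real question is not the mechanics but where the chunks live and how checksums over the reassembled whole are recorded.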