Stefano Zacchiroli wrote:
> Regarding source packages, I suspect that most of our upstream authors
> that will end up using free AI will *not* include training datasets in
> distribution tarballs or Git repositories of the main software. So what
> will we do downstream? Do we repack source packages to include the
> training datasets? Do we create *separate* source packages for the
> training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
> hosting place where to host large training datasets to avoid exploding
> mirror sizes?
I'd suggest separate source packages, *and* putting them in a special section of the archive that mirrors may choose not to host.

I'm not sure whether there could also be technical problems with multi-gigabyte packages. For example: is there an upper limit on file size that could be hit? Can the package download be resumed if it is interrupted? Might tar fail to unpack the package? (Although all of those could be worked around by splitting the package into chunks...)

> Do we simply refer to external hosting places that are not under Debian
> control?

As I understand it, if we rule that the training dataset is "part of the source", and if we allow the software package into main, then we must host the training dataset on Debian servers (though not necessarily as a package). Otherwise the software package must go to non-free.

Gerardo
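P.S. For the record, the chunk-splitting workaround mentioned above is mechanically trivial with plain coreutils; here is a sketch with made-up file names and chunk size (none of this is existing Debian tooling or policy):

```shell
# Stand-in for a multi-gigabyte training-data tarball (name hypothetical):
head -c 10M /dev/urandom > foo-data_1.0.orig.tar.xz

# Split into fixed-size chunks, so no single file hits a per-file size
# limit and an interrupted transfer only costs one chunk:
split -b 4M -d foo-data_1.0.orig.tar.xz foo-data_1.0.orig.tar.xz.

# Downstream, concatenating the chunks restores the original
# byte-for-byte (numeric suffixes sort correctly in the glob):
cat foo-data_1.0.orig.tar.xz.* > rejoined.tar.xz
cmp foo-data_1.0.orig.tar.xz rejoined.tar.xz && echo "identical"
```

The real question is not the mechanics but where the chunks live and how checksums over the reassembled whole are recorded.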