On Mon, 10 Feb 2025, Gerardo Ballabio wrote:

> Stefano Zacchiroli wrote:
> > Regarding source packages, I suspect that most of our upstream authors
> > that will end up using free AI will *not* include training datasets in
> > distribution tarballs or Git repositories of the main software. So what
> > will we do downstream? Do we repack source packages to include the
> > training datasets? Do we create *separate* source packages for the
> > training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
> > hosting place where to host large training datasets to avoid exploding
> > mirror sizes?
>
> I'd suggest separate source packages *and* put them in a special
> section of the archive, that mirrors may choose not to host.
>
> I'm not sure whether there could also be technical problems with
> many-gigabytes-sized packages, e.g., is there an upper limit to file
> size that could be hit? Can the package download be resumed if it is

Just want to chime in to support using git-annex as an underlying
technology, and to provide a possible sketch of a solution:

- git-annex allows for (just a few of the points most relevant here, out
  of a wide range of general aspects):

  - "linking" into a wide range of data sources, and, if needed, creating
    custom "special remotes" to access data (a command-level sketch
    follows this list). https://datasets.datalad.org/ is proof of that --
    it provides access to 100s of TBs of data from a wide range of hosting
    solutions (S3, tarballs on an HTTP server, some rclone-compatible
    storage solutions, ...)

  - diversifying/tiering data backup/storage seamlessly for the end user.
    To that degree, I have (ab)used a claimed-to-be-"unlimited"
    institutional Dropbox to back up over 600TBs of a public data archive,
    and could then easily announce it "dead" whenever the data was no
    longer available there

  - separating "data availability" tracking (stored in the git-annex
    branch) from actual version tracking (your "master" branch). This way,
    adjusting data availability does not require any changes to your
    "versioned data release".

- similarly to how we have https://neuro.debian.net/debian/dists/data/ for
  "classical" Debian packages, there could be a similar suite in Debian
  with multi-version packages (multiple versions of a package allowed
  within the same suite), which would deploy a git-annex repository upon
  installation. Then individual Debian suites (stable, unstable) would
  rely on using specific version(s) of packages from that "data" suite

- a "data source package" could be just a prescription on how to establish
  data access -- lean and nice. The "binary package" would also be
  relatively lean, since the data itself is accessible via git-annex

- a separate service could observe/verify continued availability of the
  data and, when necessary, (re)establish data access (e.g., by marking a
  defunct source "dead" and recording an alternative one)
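To make the first point above more concrete, here is a minimal
command-level sketch; the URL and file name are made up for illustration,
and the exact workflow would of course need refinement:

    # create a fresh annex and track a remote tarball by its URL,
    # without copying its content into the repository
    git init training-data && cd training-data
    git annex init
    # --relaxed: just record the URL now; download and verify on demand
    git annex addurl --relaxed --file=train.tar.gz \
        https://example.org/datasets/train.tar.gz
    git commit -m "track training tarball by URL"

    # fetch the content only when it is actually needed
    git annex get train.tar.gz
    # availability information lives in the git-annex branch
    git annex whereis train.tar.gz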
> interrupted? Might tar fail to unpack the package? (Although those
> could all be solved by splitting the package into chunks...)

FWIW -- https://datasets.datalad.org could be considered a "single
package", as it is a single Git repository leading to the next tier of git
submodules, overall reaching into thousands of them. But, logically, a
separate git repository could be equated to a separate Debian package.

Additionally, "flavors" of packages could subset the types of files to
retrieve: e.g., something like openclipart-png could depend on
openclipart-annex, which would just install the git-annex repository, and
the -png flavor would then fetch all *.png files only (see the sketch
below).

Access to individual files is orchestrated via git-annex, which already
has built-in mechanisms for data integrity validation (often "on the fly"
while downloading), retries, stall detection, etc.
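A minimal sketch of what such a "flavor" package's maintainer script could
do; the package names and the installation path are hypothetical:

    # hypothetical postinst of openclipart-png: openclipart-annex is
    # assumed to have placed a git-annex repository at this (made-up) path
    cd /usr/share/openclipart
    # fetch only the PNG files; everything else stays remote
    git annex get --include='*.png' .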
Cheers,
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW: http://www.linkedin.com/in/yarik