On Mon, 10 Feb 2025, Gerardo Ballabio wrote:
> Stefano Zacchiroli wrote:
> > Regarding source packages, I suspect that most of our upstream authors
> > that will end up using free AI will *not* include training datasets in
> > distribution tarballs or Git repositories of the main software. So what
> > will we do downstream? Do we repack source packages to include the
> > training datasets? Do we create *separate* source packages for the
> > training datasets? Do we create a separate (ftp?  git-annex?  git-lfs?)
> > hosting place where to host large training datasets to avoid exploding
> > mirror sizes?

> I'd suggest separate source packages *and* put them in a special
> section of the archive, that mirrors may choose not to host.

> I'm not sure whether there could also be technical problems with
> many-gigabytes-sized packages, e.g., is there an upper limit to file
> size that could be hit? Can the package download be resumed if it is

Just want to chime in in support of using git-annex as an underlying
technology, and to provide a possible sketch of a solution:

- git-annex allows one to (listing just a few points most relevant here,
  out of a wide range of general capabilities; a rough Python sketch of
  these follows right after this list)

  - "link" into a wide range of data sources, and if needed to create
    custom "special remotes" to access data.

    https://datasets.datalad.org/ is proof of that -- it provides access
    to hundreds of TBs of data from a wide range of hosting solutions (S3,
    tarballs on an http server, some rclone-compatible storage solutions, ...)

  - diversify/tier data backup/storage seamlessly for the end user.

    To that end, I have (ab)used a claimed-to-be-"unlimited" institutional
    Dropbox to back up over 600TB of a public data archive, and could then
    easily announce that remote "dead" whenever the data was no longer
    available there.

  - separate "data availability" tracking (stored in git-annex
    branch) from actual version tracking (your "master" branch).

    This way adjustment of data availability does nohow require changes
    to your "versioned data release".
  
- similarly to how we have https://neuro.debian.net/debian/dists/data/
  for "classical" Debian packages, there could be a similar multi-version
  suite in Debian (multiple versions of a package allowed within the same
  suite) with packages which would deploy a git-annex repository upon
  installation.  Then the individual Debian suites (stable, unstable)
  would rely on specific version(s) of packages from that "data" suite.

  - "data source package" could be just prescription on how to establish
    data access, lean and nice.  "binary package" also relatively lean,
    since data itself accessible via git-annex

  - a separate service could observe/verify continued availability of the
    data and, when necessary, establish
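
For concreteness, here is a minimal sketch (in Python, driving git-annex
via subprocess) of the git-annex points above; the repository layout,
URL, and remote name are made up for illustration and not part of any
actual proposal:

  import subprocess

  def annex(*args, repo="."):
      """Run a git-annex subcommand inside the given repository."""
      subprocess.run(["git", "-C", repo, "annex", *args], check=True)

  # 1. "link" a file to an external data source without storing its
  #    content in git: the URL is recorded, content is fetched on demand
  annex("addurl", "--file=training-data.tar.gz",
        "https://example.org/datasets/training-data.tar.gz")

  # 2. add a tiered backup location (a plain directory special remote
  #    here; S3, rclone-backed storage, etc. work analogously) and copy
  #    the content there
  annex("initremote", "backup",
        "type=directory", "directory=/mnt/backup", "encryption=none")
  annex("copy", "--to=backup", "training-data.tar.gz")

  # 3. availability tracking lives in the git-annex branch, separate from
  #    "master": query where content is, and declare a remote dead once
  #    it no longer provides the data -- no change to the versioned
  #    release is needed
  annex("whereis", "training-data.tar.gz")
  annex("dead", "backup")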

> interrupted? Might tar fail to unpack the package? (Although those
> could all be solved by splitting the package into chunks...)

FWIW -- https://datasets.datalad.org could be considered a "single
package", as it is a single git repository leading to the next tier of
git submodules, overall reaching into thousands of them.
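
As a rough illustration (using DataLad's Python API, which wraps
git/git-annex), navigating that "single package" amounts to cloning a
tiny superdataset and then installing nested subdatasets on demand; the
subdataset path used below is only an example:

  import datalad.api as dl

  # clone the lightweight superdataset; "///" is DataLad's shortcut for
  # https://datasets.datalad.org -- no large data is transferred yet
  dl.install(path="datasets.datalad.org", source="///")

  # install one nested subdataset: still just a small git repo, no content
  dl.get(path="datasets.datalad.org/labs",
         dataset="datasets.datalad.org", get_data=False)

  # actual file content would then be fetched selectively with further
  # dl.get() calls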

But, logically, a separate git repo could be equated to a separate
Debian package.  Additionally, "flavors" of packages could subset the
types of files to retrieve: e.g., something like openclipart-png could
depend on openclipart-annex, which would just install the git-annex
repo, while the -png flavor would fetch only the *.png files.
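
To make that a bit more concrete, here is a purely hypothetical sketch of
what the -png flavor's deployment step could boil down to, once the
(equally hypothetical) openclipart-annex package has installed the
git-annex repository; package names, paths, and locations are all
invented for illustration:

  import pathlib
  import subprocess

  # location the hypothetical openclipart-annex package would have
  # prepared, e.g. via "git clone ... && git annex init"
  REPO = pathlib.Path("/usr/share/openclipart")

  # the -png flavor asks git-annex for the PNG content only; annexed
  # files are present as symlinks, so globbing works before download
  pngs = [str(p.relative_to(REPO)) for p in REPO.rglob("*.png")]
  if pngs:  # avoid a bare "get", which would fetch everything
      subprocess.run(["git", "-C", str(REPO), "annex", "get", *pngs],
                     check=True)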

Access to individual files is orchestrated via git-annex, which already
has built-in mechanisms for data integrity validation (often "on the
fly" while downloading), retries, stall detection, etc.

Cheers,
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        
