Thanks for your real-world input! It helped me clarify my thinking on a few technical and societal impacts.
One point I want to clarify (with the comments below) is: what is the practical difference between Debian including in its mirrors a ~100 TiB dataset such as the crawl-data/CC-MAIN-2025-18/ archive (whose WARC files are listed in warc.paths.gz) from https://commoncrawl.org/blog/april-2025-crawl-archive-now-available, versus having a field in the Debian source package metadata that simply contains the HTTPS link to that same dataset? Because the latter is *far* easier for Debian to do and to maintain.

On Thu, 15 May 2025 at 02:04, Arian Ott <arian....@ieee.org> wrote:
> In my undergraduate work, we frequently relied on publicly available
> datasets from sources such as Kaggle. These enabled us to train our own
> models, interpret results, and explore data-driven questions in a hands-on
> manner. Providing access to training data empowers researchers,
> institutions, and independent developers to create models adapted to their
> specific needs. Moreover, it facilitates the composability of data, an
> essential feature in interdisciplinary research and real-world
> applications.

I wanted to highlight this part - there already exist organisations that gather and maintain datasets and provide access to them, including access to frozen snapshots that never change, be it Kaggle or Common Crawl or others. It would take a very specific need for Debian to duplicate their efforts and take on *massive* infrastructure commitments as well as the legal risk.

> Debian’s commitment to reproducibility and openness logically extends to
> the realm of AI. Distributing a model without its corresponding training
> data violates this principle and undermines the ability of users to
> validate, audit, or adapt the model for their own contexts.

That is a good point. But the same can be achieved by simply pointing to the relevant dataset snapshot from a dataset provider.

> If Debian were to allow AI models to be packaged without the accompanying
> data, it would risk reducing its standards to those of existing platforms
> such as Hugging Face, where reproducibility is often not enforced. In
> contrast, requiring training data to be available fosters trust, academic
> rigour, and long-term sustainability.

To enforce reproducibility Debian would need to actually spend the resources to retrain the models, and I do not think that is feasible at this point. It also remains possible to do so when the data is hosted outside Debian. The assumption here is that all the models we are talking about are sufficiently complex that full retraining will not be part of the regular compilation of a Debian source package into binary packages (something Debian does *quite* often).

> The strategic value of Debian enforcing open data is clear:
> Data scientists and developers can rely on Debian-hosted datasets being
> legally sound and freely reusable.
> This lowers the barrier to entry for high-quality, ethical AI development.
> It also positions Debian as a trusted ecosystem for research-grade and
> production-ready AI tooling.

Here as well, I do not believe it is actually necessary for Debian to host and redistribute the data to achieve that, and I do not think there would be any additional practical benefit in doing so. Pointing from Debian metadata to a particular snapshot of a particular Kaggle or Common Crawl dataset would suffice for any reproduction or modification work (a concrete sketch of such a pointer follows below).

There is also a thorny problem: most datasets may *only* be freely usable by researchers and for data-mining purposes, and very explicitly may not be usable for anything else.
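To make the comparison concrete, here is roughly how small such a pointer could be. This is only a sketch of the idea: the field names below are invented for the example and nothing like them is defined in Policy or dpkg today, and the host, checksum and size are placeholders.

# Hypothetical fields in the Debian source package metadata; the names
# are made up for illustration and are not defined anywhere today.
Training-Data-Url: https://<dataset-host>/crawl-data/CC-MAIN-2025-18/warc.paths.gz
Training-Data-Sha256: <sha256 of the frozen snapshot file>
Training-Data-Size: <byte size of the frozen snapshot file>

Everything heavier - mirroring, long-term storage, legal vetting of the data itself - stays with the dataset provider.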
> In summary, my vision would include:
> Making datasets available through apt or similar tools

We already have a separation between "apt-get download/install", which fetches binary packages, and "apt-get source", which downloads source packages. And source packages can have extra targets inside their debian/rules Makefile. Downloading the training dataset and re-training the model could be separate (Policy-defined) debian/rules targets, without having to put multiple 100 TiB files onto the Debian mirrors and store them there for decades.

> Treating AI models as first-class citizens in Debian’s packaging ecosystem
> Enforcing that models included in Debian main must be accompanied by the
> training data that enables their reproducibility

Enforcing that the external training data is still accessible could be as simple as doing an HTTPS HEAD request to the specified source data URLs as part of the package tests, and failing if a file is no longer offered or if its size has changed. That would be a good addition to Policy for such models, if we ever get that far. A rough sketch of what such debian/rules targets and checks could look like is appended at the end of this mail.

--
Best regards,
    Aigars Mahinovs
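As promised above, here is a rough, purely illustrative sketch of what such optional debian/rules targets and the availability check could look like. None of these target or variable names are defined by current Policy; they are invented for this example, the dataset host is a placeholder, train.sh stands in for whatever the upstream training entry point would be, and the checksum and size values would have to be pinned by the maintainer.

#!/usr/bin/make -f
# Hypothetical excerpt from a debian/rules file. Target and variable
# names are invented for illustration; recipe lines are tab-indented.

TRAINING_DATA_URL := https://<dataset-host>/crawl-data/CC-MAIN-2025-18/warc.paths.gz
TRAINING_DATA_SHA256 := <pinned sha256 of the frozen snapshot file>
TRAINING_DATA_SIZE := <recorded byte size of the frozen snapshot file>

# Hypothetical optional target: fetch the external training data and
# verify it against the pinned checksum. Never run during a normal
# binary package build.
get-training-data:
	mkdir -p training-data
	wget -O training-data/warc.paths.gz "$(TRAINING_DATA_URL)"
	echo "$(TRAINING_DATA_SHA256)  training-data/warc.paths.gz" | sha256sum -c -

# Hypothetical optional target: retrain the model from the fetched snapshot.
retrain: get-training-data
	./train.sh training-data/

# Cheap availability check for package tests: an HTTPS HEAD request that
# fails if the snapshot is gone or its advertised size has changed.
check-training-data:
	curl -sSfI "$(TRAINING_DATA_URL)" >/dev/null
	test "$$(curl -sSfI "$(TRAINING_DATA_URL)" | tr -d '\r' | awk 'tolower($$1) == "content-length:" { print $$2 }')" = "$(TRAINING_DATA_SIZE)"

.PHONY: get-training-data retrain check-training-data

The point being: the heavy artefacts never touch the Debian mirrors; the package only records where the frozen snapshot lives and how to verify it.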