While I'm still digesting the very impactful (for me) message by the other Sam (hartmans), a quick but important note on the following:
On Fri, Feb 07, 2025 at 01:35:00PM +0100, Sam Johnston wrote:
> "Large language models (LMs) have been shown to memorize parts of
> their training data, and when prompted appropriately, they will emit
> the memorized training data verbatim."

I don't think we should focus our conversation on LLMs much, if at all.
The reason is that, even if a completely free-as-in-freedom (including
in its training dataset), high-quality LLM were to materialize in the
future, its preferred form of modification (which includes the dataset)
would be practically impossible for Debian to distribute due to its
size.

So when we think of concrete examples, let's focus on what Debian could
reasonably distribute. This includes small(er) generative AI language
models, but also all sorts of *non-generative* AI models, e.g.,
classification models. The latter do not generate copyrightable
content, so most of the issues you pointed out do not apply to them.
Other issues still apply to them, including bias analyses (at a scale
which *is* manageable, addressing some of the issues pointed out by
hartmans) and ethical data sourcing.

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . https://upsilon.cc/zack
Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
Co-founder & CSO Software Heritage . Mastodon: https://mastodon.xyz/@zacchiro