On Fri, 7 Feb 2025, Sam Johnston wrote:

>On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <t...@debian.org> wrote:
>>
>> I’d like to remind you that these huge binary blobs still contain,
>> in lossily compressed form, illegally obtained and unethically
>> pre-prepared, copies of copyrighted works, whose licences are not
>> honoured by the proposed implementations.
>>
>> As such I cannot consider them acceptable even for Debian’s non-free.
>
>Agreed, we know these models can and do routinely recall training data
>in the course of normal operation[1]: […]
>We also know that even models carefully designed to avoid this, often
>using guardrails that would be trivially removed when running locally
>rather than as a service like OpenAI, will divulge their secrets if
>coerced[2]:
Indeed, I’ve seen more examples of this.

>The OSI and others arguing […] demonstrates they either do not
>understand the technology, or worse, do and are trying to deceive us.
>For me, the debate should end here.

+1

>> While the act of training such a model *for data analysēs* may be
>> legal, distributing it, or output gained from it that is not a, and
>> I quote the copyright law, “pattern, trend [or] correlation” isn’t
>> legal.
>
>Some 4D chess players have argued that a model is not copyrightable as
>it is merely "a set of factual observations about the data", and that

This sounds like the usual “some random software developer trying
their hand at legalese” which lawyers routinely laugh about. I’ve
heard that this is irrelevant: as long as the model’s output can
reproduce sufficiently recognisable parts of others’ works, standalone
and/or as a collage, it’s a derived work. IANAL, ofc.

>excusable (if unethical) under fair use

… which is purely a US-American thing…

> and text and data mining exemptions.

… which does not allow reproduction of works, only what I quoted above.

>This ignores the intentions of the authors of the content

That, too.

>Unless otherwise specified, content is typically published with "all
>rights reserved" by default.

The Berne Convention says so, indeed.

>In any case, the result is "a statistical model that spits out
>memorized information [that] might infringe [...] copyright". The
>exemptions relied upon for training do not extend to reproduction
>during inference, for which a test of “substantial similarity” would
>apply (otherwise one might argue such copyright violations are
>coincidental).

+1

>Allowing this would be knowingly shipping obfuscated binary blobs in
>main, akin to a book archive (Authors Guild v. Google, 2015) with
>trivially reversible encryption, or a printer driver that can
>spontaneously reproduce copyrighted content from memory.

Interesting comparisons.
If you take the lossy compression into account (book archive as JPEG
or so), with possibly increased lossiness/compression rate, this is a
good fit.

>Digital Public Goods (albeit not yet certified like Fedora[3]), as the
>DPGA has just today "finalized the decision to make training data
>mandatory for AI systems applying to become DPGs. This requirement will

Interesting, I didn’t know about DPGs yet. (Hmm, they have a
requirement for CC licences for data collections, which (except CC0,
which on the other hand is problematic for reuse in/of code) aren’t
Copyfree… grml…)

>While I'm not trying to be alarmist, I am alarmed. Our community was
>built on respect for rights, and dropping this principle out of
>expediency now would be a radical departure from the norm. I don't
>think this is clear enough in lumin's proposal and "Toxic Candy"

I’ve not read the actual proposal (I saw the mail only after
responding to *this* thread), but the summary by lumin in this thread
makes it clear to me that it doesn’t go far enough; see also below.

What does this mean for the proposed GR? Honestly, I’d rather skip it,
as it’s clear enough that it’s unacceptable (your “should end here”).
Mo, can you perhaps solicit input from ftpmasters first (also to see
whether they lean towards a similar hard stance)? If so, we can
probably end up not needing one.

*peeks at the proposals in the current text in your (Mo’s) repo*

“A free software AI should publish the training data and training
software under free software license, not just a FOSS-licensed
pre-trained model along with the inference software.”

OK, this doesn’t mention that they are acceptable for non-free, which
your wording in the thread indicated. I could vote for that… except
“free software license” isn’t what we need; we need “DFSG-compliant
licence” (whether software or not, as most FOSS data licences aren’t
software licences, with The MirOS Licence as a notable exception).
I’ll file an issue about that.
“Downside: This is not compatible with OSAID.”

That’s an upside in my book, and many on Fedi would agree. I’d also
argue that upsides/downsides belong in a conclusion, not in the text
of the proposal, as they (some more than others) are subjective
statements of the drafter. I’ll file that separately.

(Note I haven’t looked at the rest. Still hoping we can avoid needing
a GR.)

Zack wrote:

>let's focus on what could be reasonably distributed by Debian. This
>includes small(er) generative AI language models, but also all sorts of
>*non-generative* AI models, e.g., classification models.

I think the same rule as for other statically linked binaries applies:
all sources must be available and in Debian (or at least to Debian and
its users), and their licences must work together and be honoured.

For non-free we can waive the requirement to reproduce sources, but
not the requirement that the licences of the sources are honoured and
compatible, which includes auditability. The licence terms of the
model itself must be suitable, of course, but must also include the
licence terms of the “training data”.

The output (which isn’t just a pattern/trend/correlation) made from
the model must also be considered “potentially a derivative work of
parts or all of its input, including training data”, and so default to
having the entire terms applicable to it, unless the model can know
which parts were used and which weren’t. (The output is
machine-generated, like a compiler’s, so it cannot be copyrighted as a
new work by itself, but this doesn’t mean it can’t be copyrighted as a
derived work.)

I think this is true for all kinds of models, generative or not,
though a classification model that is small enough that it can be
proven, to at least reasonable exclusion, not to be able to reproduce
its inputs in a form sufficient for copyright could get partial
exceptions. For main I think I’d still want sources available.
In a twist, I agree that those small-enough classification models (not
sure about generative models) could go to non-free-firmware.

>The latter do not generate copyrightable content,
>so most of the issues you pointed out do not apply to them.

AIUI, models and the software making use of them are distinct (data
vs. code; otherwise the models couldn’t go to non-free-firmware). It
would have to be seen whether you could take a sufficiently large
classification model and plug it, possibly with minor changes, into a
“generative AI” program (gods, I hate that term: it regurgitates, it
doesn’t generate, and it misrepresents true generative art as well);
if so, or if there’s something that can take a model and “disassemble”
it into recognisable parts of the training material, it’d still be an
issue.

>The reason is that, even if a completely free-as-in-freedom (including
>in its training dataset), high quality LLM were to materialize in the
>future, its preferred form of modification (which includes the dataset)
>will be practically impossible to distribute by Debian due to its size.

Probably/possibly, but there’s still a distinction between contrib and
non-free (and “just no”) on the line. It’d also most likely not
realistically be reproducible by Debian.

I was once asked (while preparing a response to a questionnaire about
this) what conditions it would take for me to accept “an AI”, and
besides honouring and reproducing licence terms, attributions, etc.,
one condition was participating in a “reproducible builds” effort,
where the “training data” and all other input used during training,
such as the PRNG stream, would be recorded, and others with
sufficiently beefy systems could then reproduce the created model. If
this is occasionally checked (and if, during development, steps are
taken to not “accidentally” break it), then we could deal with the
ready-made model.
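To make the “reproducible builds for training” idea concrete, here is
a toy sketch (not any real Debian tooling; all names are illustrative
and the “model” is a trivial linear fit): if the dataset and the full
PRNG state are recorded, an independent party can re-run training and
verify the resulting parameters bit-for-bit via a hash.

```python
# Toy sketch of reproducible model training, assuming the PRNG stream
# is fully determined by a recorded seed. Illustrative only; a real
# pipeline would also have to pin library versions, hardware maths, etc.
import hashlib
import random
import struct

def train(dataset, seed, epochs=100, lr=0.01):
    """Fit y = w*x + b by per-sample SGD; the seed fixes sample order."""
    rng = random.Random(seed)  # the recorded "PRNG stream": one seed
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rng.sample(dataset, len(dataset)):
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def fingerprint(model):
    """Stable hash of the model parameters, for third-party verification."""
    return hashlib.sha256(struct.pack("<2d", *model)).hexdigest()

dataset = [(float(x), 2.0 * x + 1.0) for x in range(10)]
m1 = train(dataset, seed=42)
m2 = train(dataset, seed=42)  # independent re-run from the same inputs
assert fingerprint(m1) == fingerprint(m2)  # bit-identical model
```

The same idea scales (in principle) to large models: publish the data,
the training code, and the recorded randomness, and spot-check that
re-training yields the same artifact.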
From a freedom perspective, we would still want all sources available,
so that people with the means to do so can still produce a model from
modified sources.

I admit I haven’t thought about some of the things I wrote above, like
how they can fit into a Debian-ish model, as much as about the other
things (especially what I put on the webpages linked in the previous
mail), but they should serve as a good start.

>Other issues still apply to them, including biases analyses (at a scale
>which *is* manageable, addressing some of the issues pointed out by
>hartmans), and ethical data sourcing.

And environmental concerns, indeed, indeed. These can probably be
handled by the relevant team (d-science?) like they are with other
prospective packages, should the other concerns (DFSG-freeness,
archive rules, etc.) pass.

bye,
//mirabilos (still not subscribed)
-- 
Save the environment. Don’t use “AI” to summarise this eMail, please.