On Fri, 7 Feb 2025, Sam Johnston wrote:

>On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <t...@debian.org> wrote:
>>
>> I’d like to remind you that these huge binary blobs still contain,
>> in lossily compressed form, illegally obtained and unethically
>> pre-prepared, copies of copyrighted works, whose licences are not
>> honoured by the proposed implementations.
>>
>> As such I cannot consider them acceptable even for Debian’s non-free.
>
>Agreed, we know these models can and do routinely recall training data
>in the course of normal operation[1]:
[…]
>We also know that even models carefully designed to avoid this, often
>using guardrails that would be trivially removed when running locally
>rather than as a service like OpenAI, will divulge their secrets if
>coerced[2]:

Indeed, I’ve seen more examples of this.

>The OSI and others arguing […] demonstrates they either do not
>understand the technology, or worse, do and are trying to deceive us.
>For me, the debate should end here.

+1

>> While the act of training such a model *for data analysēs* may be
>> legal, distributing it, or output gained from it that is not a, and
>> I quote the copyright law, “pattern, trend [or] correlation” isn’t
>> legal.
>
>Some 4D chess players have argued that a model is not copyrightable as
>it is merely "a set of factual observations about the data", and that

This sounds like the usual “some random software developer trying
their hand at legalese” which lawyers routinely laugh about.

I’ve heard that this is irrelevant: as long as the model’s output
can reproduce sufficiently recognisable parts of others’ works,
standalone and/or as a collage, it’s a derived work. IANAL, ofc.

>excusable (if unethical) under fair use

… which is purely a US-American thing…

> and text and data mining exemptions.

… which does not allow reproduction of works, only what I quoted above.

>This ignores the intentions of the authors of the content

That, too.

>Unless otherwise specified, content is typically published with "all
>rights reserved" by default.

The Berne Convention says so, indeed.

>In any case, the result is "a statistical model that spits out
>memorized information [that] might infringe [...] copyright". The
>exemptions relied upon for training do not extend to reproduction
>during inference, for which a test of “substantial similarity” would
>apply (otherwise one might argue such copyright violations are
>coincidental).

+1

>Allowing this would be knowingly shipping obfuscated binary blobs in
>main, akin to a book archive (Authors Guild v. Google, 2015) with
>trivially reversible encryption, or a printer driver that can
>spontaneously reproduce copyrighted content from memory.

Interesting comparisons. If you take the lossy compression into
account (the book archive as JPEG or so), possibly with an increased
lossiness/compression rate, this is a good fit.

>Digital Public Goods (albeit not yet certified like Fedora[3]), as the
>DPGA has just today "finalized the decision to make training data
>mandatory for AI systems applying to become DPGs. This requirement will

Interesting, didn’t know about DPGs yet. (Hmm, they have a requirement
for CC licences for data collections, which (except CC0 which on the
other hand is problematic for reuse in/of code) aren’t Copyfree…grml…)

>While I'm not trying to be alarmist, I am alarmed. Our community was
>built on respect for rights, and dropping this principle out of
>expediency now would be a radical departure from the norm. I don't
>think this is clear enough in lumin's proposal and "Toxic Candy"

I’ve not read the actual proposal (I saw the mail after responding
to *this* thread only), but the summary by lumin in this thread makes
it clear to me that it doesn’t go far enough, see also below.

What does this mean for the proposed GR? Honestly, I’d rather skip it
as it’s clear enough it’s unacceptable (your “should end here”). Mo,
can you perhaps solicit input from ftpmasters first (also to see if
they lean towards a similar hard stance)? If so, we can probably end
up not needing one.

*peeks at the proposals in the current text in your (Mo’s) repo*

“A free software AI should publish the training data and training
 software under free software license, not just a FOSS-licensed
 pre-trained model along with the inference software.”

OK, this doesn’t mention that they are acceptable for non-free, which
your wording in the thread indicated. I could vote for that… except
“free software license” isn’t what we need, we need “DFSG-compliant
licence” (whether software or not as most FOSS data licences aren’t
software licences, with The MirOS Licence as notable exception). I’ll
file an issue about that.

“Downside: This is not compatible with OSAID.”

That’s an upside in my book, and many on Fedi would agree.

I’d also argue that upsides/downsides belong in a conclusion, not
into the text of the proposal, as they (some more than others) are
subjective statements of the drafter. I’ll file that separately.

(Note I haven’t looked at the rest. Still hoping we can avoid needing a GR.)


Zack wrote:

>let's focus on what could be reasonably distributed by Debian. This
>includes small(er) generative AI language models, but also all sorts of
>*non-generative* AI models, e.g., classification models.

I think the same rule as for other statically linked binaries applies.
All sources must be available and in Debian (or at least to Debian and
its users) and their licences must work together and be honoured.

For non-free we can waive the requirement to reproduce sources, but not
that the licences of the sources are honoured and are compatible, which
includes auditability.

The licence terms of the model itself must be suitable, of course, but
must also include the licence terms of the “training data”. The output
(which isn’t just a pattern/trend/correlation) made from the model must
also be considered “potentially a derivative work of parts or all of its
input, including training data”, and so default to have the entire terms
applicable to it, unless the model can know which parts were used and
which weren’t. (The output is machine-generated, like a compiler’s,
so it cannot be copyrighted as a new work by itself, but this doesn’t
mean it can’t be copyrighted as a derived work.)

I think this is true for all kinds of models, generative or not, though
a classification model that is small enough that it can be proven, to
at least reasonable certainty, that it cannot reproduce its inputs in
a form sufficient for copyright could get partial exemptions.
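To make “proven, to at least reasonable certainty” slightly less
hand-wavy: one crude check would be scanning model output for long
verbatim n-gram overlaps with the training corpus. A minimal sketch in
Python; the function names and the 10-word threshold are my own
illustrative assumptions, not any established standard:

```python
def shared_ngrams(training_text: str, output_text: str,
                  n: int = 10) -> set[str]:
    """Return the word n-grams that appear in both texts."""
    def ngrams(text: str) -> set[str]:
        words = text.split()
        return {" ".join(words[i:i + n])
                for i in range(len(words) - n + 1)}
    return ngrams(training_text) & ngrams(output_text)

def looks_memorised(training_text: str, output_text: str,
                    n: int = 10) -> bool:
    """Crude red flag: any n-word verbatim overlap (threshold is
    an arbitrary illustrative choice, not a legal standard)."""
    return bool(shared_ngrams(training_text, output_text, n))
```

Real memorisation audits (e.g. the extraction attacks cited earlier in
this thread) are far more sophisticated, but even an overlap scan of
this kind catches plain verbatim regurgitation.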

For main I think I’d still want sources available. In a twist, I agree
that those small-enough classification models (not sure about generative
models) could go to non-free-firmware.

>The latter do not generate copyrightable content,
>so most of the issues you pointed out do not apply to them.

AIUI, models and software making use of them are distinct (data vs. code,
otherwise the models couldn’t go to non-free-firmware). It would have to
be seen if you could take a sufficiently large classification model and
plug it, possibly with minor changes, into a “generative AI” program (gods
I hate that term, it regurgitates, doesn’t generate, and it misrepresents
true generative art as well); if so, or if there’s something that can
take a model and “disassemble” it into recognisable parts of the
training material, it’d still be an issue.

>The reason is that, even if a completely free-as-in-freedom (including
>in its training dataset), high quality LLM were to materialize in the
>future, its preferred form of modification (which includes the dataset)
>will be practically impossible to distribute by Debian due to its size.

Probably/possibly, but there’s still a distinction between contrib and
non-free (and “just no”) on the line.

It’d also most likely not realistically be reproducible by Debian.

I was once asked (while preparing a response to a questionnaire about
this) what conditions it would take for me to accept “an AI”. Besides
honouring and reproducing licence terms, attributions, etc., one
condition was participating in a “reproducible builds” effort: the
“training data” and all other input used during training, such as the
PRNG stream, would be recorded, and others with sufficiently beefy
systems could then reproduce the created model. If this is occasionally
checked (and if, during development, steps are taken not to
“accidentally” break it), then we could deal with the ready-made model.

From a freedom perspective, we would still want all sources available,
so that people with the means to do so can still produce a model from
modified sources.


I admit I haven’t thought about some of the things I wrote above, like
how they can fit into a Debian-ish model, as much as about the other
things (especially what I put on the webpages linked in the previous
mail), but they should serve as a good start.


>Other issues still apply to them, including biases analyses (at a scale
>which *is* manageable, addressing some of the issues pointed out by
>hartmans), and ethical data sourcing.

And environmental concerns, indeed.

These can probably be handled by the relevant team (d-science?) like
they are with other prospective packages, should the other concerns
(DFSG-freeness, archive rules, etc.) pass.

bye,
//mirabilos (still not subscribed)
-- 
Save the environment. Don’t use “AI” to summarise this eMail, please.
