On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <t...@debian.org> wrote:
>
> I’d like to remind you that these huge binary blobs still contain,
> in lossily compressed form, illegally obtained and unethically
> pre-prepared, copies of copyrighted works, whose licences are not
> honoured by the proposed implementations.
>
> As such I cannot consider them acceptable even for Debian’s non-free.
Agreed, we know these models can and do routinely recall training
data in the course of normal operation[1]: "Large language models
(LMs) have been shown to memorize parts of their training data, and
when prompted appropriately, they will emit the memorized training
data verbatim." (A probe anyone can run against a local model is
sketched below.)

We also know that even models carefully designed to avoid this, often
using guardrails that would be trivially removed when running locally
rather than behind a service like OpenAI's, will divulge their
secrets if coerced[2]: "The Times paid someone to hack OpenAI's
products," and even so, it "took them tens of thousands of attempts
to generate the highly anomalous results".

That the OSI and others argue this is a valid way to protect
sensitive training data (copyrighted content, but also personally
identifiable information, medical records, proprietary datasets for
federated learning, and even CSAM) demonstrates that they either do
not understand the technology or, worse, do and are trying to deceive
us. For me, the debate should end here.

> While the act of training such a model *for data analysēs* may be
> legal, distributing it, or output gained from it that is not a, and
> I quote the copyright law, “pattern, trend [or] correlation” isn’t
> legal.

Some 4D chess players have argued that a model is not copyrightable
as it is merely "a set of factual observations about the data", and
that the copyright violations necessary for training are technically
excusable (if unethical) under fair use and text and data mining
exemptions. This ignores the intentions of the content's authors (and
of the exemptions themselves, which pre-date LLMs), with training on
e.g. Common Crawl being done without their consent. Unless otherwise
specified, content is typically published with "all rights reserved"
by default. In any case, the result is "a statistical model that
spits out memorized information [that] might infringe [...]
copyright".

The exemptions relied upon for training do not extend to reproduction
during inference, to which a test of "substantial similarity" would
apply (otherwise one might argue such copyright violations are
coincidental). Allowing this would mean knowingly shipping obfuscated
binary blobs in main, akin to a book archive (Authors Guild v.
Google, 2015) with trivially reversible encryption, or a printer
driver that can spontaneously reproduce copyrighted content from
memory.

That we've been discussing these AI policy issues on the public
record for years could even expose the project to claims of
contributory copyright infringement when our users inevitably commit
direct infringement (deliberately or inadvertently). It would be a
shame to see Debian enter the same category as Grokster and Napster.

It would also be unfortunate if Debian and derivatives could no
longer be considered Digital Public Goods (albeit not yet certified
like Fedora[3]), as the DPGA has just today "finalized the decision
to make training data mandatory for AI systems applying to become
DPGs. This requirement will help ensure that AI systems are built
ethically and are transparent and interpretable"[4]. This too should
give pause to advocates of allowing obviously non-free models in
main.

While I'm not trying to be alarmist, I am alarmed. Our community was
built on respect for rights, and abandoning this principle out of
expediency now would be a radical departure from the norm.
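As an aside, the memorization claim in [1] is straightforward to
verify against any locally run model. What follows is a minimal
sketch of the prefix/continuation probe described in that paper,
assuming a Hugging Face transformers causal LM; the model name,
sample passage, and split lengths are illustrative placeholders, not
claims about any particular model:

# Sketch of the verbatim-memorization probe from [1]: prompt a local
# model with a prefix of a known passage and check whether greedy
# decoding reproduces the true continuation verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder: substitute the local model under test

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def emits_verbatim(text, prefix_len=50, suffix_len=50):
    # Split a known passage into a prompt and its true continuation.
    ids = tok(text, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_len + suffix_len:
        raise ValueError("passage too short for this prefix/suffix split")
    prompt = ids[:prefix_len].unsqueeze(0)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    # Greedy decoding (no sampling), so a verbatim match is not luck.
    out = model.generate(prompt, max_new_tokens=suffix_len,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    generated = out[0][prefix_len:prefix_len + suffix_len]
    return bool((generated == true_suffix).all())

# e.g. emits_verbatim(open("passage_from_training_set.txt").read())

Greedy decoding makes the test conservative: a model that completes
fifty tokens of a copyrighted passage verbatim is reproducing, not
paraphrasing, and no hosted-service guardrail stands between a local
user and that output.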
I don't think this is clear enough in lumin's proposal and "Toxic
Candy" language, but rather than splitting the vote we should work on
a consolidated, clear, and concise position, keeping the context
separate. The alternative would also have unintended consequences,
including chilling effects on open data and on the high-quality open
models that emerged around or after (and in many cases before) OSI's
contentious OSAID release.

- samj

1. https://arxiv.org/abs/2202.07646
2. https://hls.harvard.edu/today/does-chatgpt-violate-new-york-times-copyrights/
3. https://www.networkworld.com/article/970236/fedora-linux-declared-a-digital-public-good.html
4. https://github.com/DPGAlliance/dpg-standard/issues/193#issuecomment-2642584851