Here we take a different position: Enrico Bonadio, Giancarlo Frosio, Christophe Geiger, Andrés Guadamuz, Stavroula Karapapa and Irini A. Stamatoudi, 'Preserving Balance in the EU Digital Single Market: How Like Company Could Reframe Copyright and Innovation in the Generative AI Era', SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6326401
The Ahmed et al. study shows those results on a book that was "overfitted" thousands of times, and with adversarial prompting whose cost runs three to four times the value of the book itself. I will read the other study, which I see has Ginsburg as a co-author. In the meantime, this study takes a different position: Haviv, A. et al., 'We Should Separate Memorization from Copyright', arXiv:2602.08632v1 [cs.CY], 9 February 2026, https://doi.org/10.48550/arXiv.2602.08632. Niva Elkin-Koren is one of the co-authors of this second study. Ginsburg and Elkin-Koren both enjoy strong international reputations in copyright law; Ginsburg, however, is a copyright maximalist, while Elkin-Koren is a minimalist...

Giancarlo

On Fri, 27 Mar 2026 at 10:22, Enrico Nardelli via nexa <[email protected]> wrote:

> For anyone who missed them: a couple of papers that indicate fairly
> clearly how LLMs are indeed gigantic memories containing entire
> copyrighted works.
>
> Cheers, Enrico
>
> 1)
>
> Extracting books from production language models
> Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
> https://arxiv.org/abs/2601.02671
>
> Many unresolved legal questions over LLMs and copyright center on
> memorization: whether specific training data have been encoded in the
> model's weights during training, and whether those memorized data can
> be extracted in the model's outputs. While many believe that LLMs do
> not memorize much of their training data, recent work shows that
> substantial amounts of copyrighted text can be extracted from
> open-weight models. However, it remains an open question if similar
> extraction is feasible for production LLMs, given the safety measures
> these systems implement. We investigate this question ... and we
> measure extraction success with a score computed from a block-based
> approximation of longest common substring (nv-recall). With different
> per-LLM experimental configurations, we were able to extract varying
> amounts of text. ... e.g., nv-recall of 76.8% and 70.3%, respectively,
> for Harry Potter and the Sorcerer's Stone ... Taken together, our work
> highlights that, even with model- and system-level safeguards,
> extraction of (in-copyright) training data remains a risk for
> production LLMs.
>
> ----------------
>
> 2)
>
> Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty
> Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of
> Copyrighted Books in Large Language Models
> https://arxiv.org/abs/2603.20957
>
> Frontier LLM companies have repeatedly assured courts and regulators
> that their models do not store copies of training data. They further
> rely on safety alignment strategies via RLHF, system prompts, and
> output filters to block verbatim regurgitation of copyrighted works,
> and have cited the efficacy of these measures in their legal defenses
> against copyright infringement claims. We show that finetuning bypasses
> these protections: by training models to expand plot summaries into
> full text, a task naturally suited for commercial writing assistants,
> we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to
> 85-90% of held-out copyrighted books, with single verbatim spans
> exceeding 460 words, using only semantic descriptions as prompts and
> no actual book text.
>
> ...
>
> Our findings offer compelling evidence that model weights store copies
> of copyrighted works and that the security failures that manifest
> after finetuning on individual authors' works undermine a key premise
> of recent fair use rulings, where courts have conditioned favorable
> outcomes on the adequacy of measures preventing reproduction of
> protected expression.
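The first abstract scores extraction with "nv-recall", described only as a block-based approximation of longest common substring. The quoted text does not give the exact definition, so what follows is a minimal sketch of one plausible reading: split the reference book into fixed-size word blocks and report the fraction that reappear verbatim in the extracted output. The block size, word-level tokenization, and exact-match rule are all illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch of a block-based recall score in the spirit of
# "nv-recall". The quoted abstract does not define the metric precisely;
# block size and the exact-match rule below are assumptions.

def block_recall(reference: str, extracted: str, block_words: int = 50) -> float:
    """Fraction of fixed-size word blocks of `reference` that occur
    verbatim in `extracted` -- a cheap approximation of how much of the
    book a longest-common-substring measure would credit."""
    ref_words = reference.split()
    blocks = [
        " ".join(ref_words[i:i + block_words])
        for i in range(0, len(ref_words) - block_words + 1, block_words)
    ]
    if not blocks:
        return 0.0
    hits = sum(1 for block in blocks if block in extracted)
    return hits / len(blocks)


if __name__ == "__main__":
    # Toy corpus: 400 distinct "words"; the model output reproduces the
    # first half verbatim, so recall should come out at 0.50.
    words = [f"w{i:03d}" for i in range(400)]
    reference = " ".join(words)
    extracted = " ".join(words[:200])
    print(f"block recall: {block_recall(reference, extracted, 20):.2f}")
```

Matching whole fixed blocks keeps the check far cheaper than an exact longest-common-substring computation over an entire novel, at the price of missing partial or shifted overlaps, which is presumably why the paper calls its score an approximation.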
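The second abstract reports "single verbatim spans exceeding 460 words". Assuming a span means the longest contiguous run of words shared verbatim between the model's completion and the source book (the quoted text does not say how the authors measure it), Python's standard difflib gives a direct way to compute that figure:

```python
# Hedged sketch: longest verbatim span, in words, shared by a model
# completion and the source text. Whether the second paper measures its
# "460 words" spans exactly this way is an assumption; SequenceMatcher's
# find_longest_match returns the longest common contiguous run.

from difflib import SequenceMatcher


def longest_verbatim_span(source: str, completion: str) -> int:
    """Length, in words, of the longest contiguous word run that the
    completion shares verbatim with the source."""
    src, out = source.split(), completion.split()
    matcher = SequenceMatcher(None, src, out, autojunk=False)
    match = matcher.find_longest_match(0, len(src), 0, len(out))
    return match.size


if __name__ == "__main__":
    source = "the boy who lived had never even heard of hogwarts"
    completion = "we recall that the boy who lived had never once heard"
    # Shared run: "the boy who lived had never" -> 6 words.
    print(longest_verbatim_span(source, completion))
```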
