Dear Fabio,

I agree with you that (c) has its limits in this context, and above all
I do not believe that /fair use/ is the only useful criterion.

Courts in various parts of the world will increasingly be asked to rule
on this; we shall see what direction they take and whether (c) law turns
out to be the only instrument invoked. LLMs challenge a legal apparatus
that was not designed for the industrial production of texts, a
phenomenon that until now did not exist.

As for what you say about '/learning/' from the same text being
equivalent for machines and humans, for my part I believe there is an
enormous difference between human /learning/ and the /training/ of an
LLM.

A human being can express themselves in language even without the texts
in question, and can extract the 'ideas' those texts contain
independently of their exact formulation, whereas the machine produces
language statistically correlated with the semantics associated with
those ideas only by following the linguistic formulation of the relevant
texts, and only with those texts. I could not say the same of a student
who /learns/ from books.

For the machine the idea (which for us is the signified, the content)
does not exist; only the language (the signifier) does: even if the
machine will not produce sentences that literally copy its training
input, that specific input is essential to producing texts with the
semantics of the input in question.
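
To make this concrete, here is a deliberately toy sketch (a bigram
model in Python; my own illustration, nothing like how a real LLM is
built): its only 'knowledge' is the word-to-word transitions observed
in its training text. Remove the corpus and it is mute; change the
corpus and its language changes with it.

# A toy bigram 'language model': the only statistics it has are the
# exact word-to-word transitions found in its training text.
import random
from collections import defaultdict

def train(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    table = defaultdict(list)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table

def generate(table, start, length=10):
    """Sample a continuation; it can only go where the training text led."""
    out = [start]
    for _ in range(length):
        followers = table.get(out[-1])
        if not followers:  # no training text, no language
            break
        out.append(random.choice(followers))
    return " ".join(out)

table = train("the idea lives in the reader but the model sees only the words")
print(generate(table, "the"))

A real LLM replaces these word counts with billions of learned
parameters, but the dependence on the formulations of its training
texts is of the same nature.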

I also see other aspects that the courts might take into consideration,
which will perhaps emerge more fully in the future but deserve scrutiny
now:

- the appropriation of the unacknowledged linguistic labour of authors,
including authorship as a collective endeavour. This already seems
visible in the automatic production of software code drawn from public
repositories;

- liability for the consequences of the content of these texts and for
any harm arising from their poor quality (we already see poisonous
recipes, fake news, suicides, incitement to dangerous behaviour, bugs in
code ...);

- environmental pollution. Not only the energy cost (of building and
updating LLMs), but the informational pollution of the linguistic (or,
more generally, symbolic) ecosystem. If we accept language as a systemic
good and a common heritage, the massive injection of artificially
generated texts (or images) interferes with that ecosystem and its
natural evolution.

This last aspect is the least immediately visible but will be the most
intense, and it will hit search engines and the other actors of the
digital ecosystem first: as artificially generated texts multiply, they
will see their signal-to-noise ratio diluted, and with it the value of
their service. They will have to decide which side they are on... Open-
source software production will suffer from the same problem.

In general, the communities that produce texts, code and images and pour
them into the commons are harmed, since industrial artificial production
will flood their ecosystem with products of dubious quality that strip
them of work and recognition.

We shall (also) wait and see...

Alberto


On 29/09/23 10:24, Fabio Alemagna wrote:
The idea that training a model on copyrighted texts constitutes a
violation of that copyright is highly debatable, and the reasoning is
actually quite simple: if learning from a text violated its copyright,
we would all be criminals. There is physically no trace of those texts
inside the models; nothing is copied. The models are a transformative
work based on those texts, not a derivative one.

Creative Commons argues this very well:
https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/

That said, let me quote the words of another author, Jeff Jarvis:
https://www.facebook.com/jeff.jarvis/posts/pfbid0LMFeqdTYoxnGHQAZwp5HMmeeVqgMSjL2dkcwMcBojkb2cinBpgYTHyc7Fhq1B9NPl

«I, for one, am not complaining about my books being in large
language model training sets. I write to enter ideas into public
discourse. I prefer informed over ignorant AI. I believe it is fair
use for anyone to read & use books for transformative work. In fact,
I'd probably feel snubbed if my books were not there. I'm happy when
they are in libraries. I'm fine that they're here.»

Fabio

On Fri, 29 Sep 2023 at 07:52, Alberto Cammozzo via nexa
<nexa@server-nexa.polito.it> wrote:
<https://www.theguardian.com/australia-news/2023/sep/28/australian-books-training-ai-books3-stolen-pirated>


Thousands of books from some of Australia’s most celebrated authors have 
potentially been caught up in what Booker prize-winning novelist Richard 
Flanagan has called “the biggest act of copyright theft in history”.

The works have allegedly been pirated by the US-based Books3 dataset and used 
to train generative AI for corporations such as Meta and Bloomberg.

Flanagan, who found 10 of his works, including the multi-international 
award-winning 2013 novel The Narrow Road to the Deep North, on the Books3 
dataset, told Guardian Australia he was deeply shocked by the discovery made 
several days ago.

“I felt as if my soul had been strip mined and I was powerless to stop it,” he 
said in a statement.

“This is the biggest act of copyright theft in history.”

The Australian Publishers Association confirmed to Guardian Australia on 
Wednesday that as many as 18,000 fiction and nonfiction titles with Australian 
ISBNs (unique international standard book numbers) appeared to be affected by 
the copyright infringement, although it is not yet clear what proportion of 
these are Australian editions of internationally authored books.

“We’re still working through [the data] to work out the impact in terms of 
Australian authors,” APA spokesperson Stuart Glover said.

“This is a massive legal and ethical challenge for the publishing industry and 
for authors globally.”

A search tool published on Monday by US media platform The Atlantic and 
uploaded by the US Authors Guild on Wednesday revealed the works of Peter 
Carey, Helen Garner, Kate Grenville, Anna Funder, Christos Tsiolkas and Thomas 
Keneally, as well as Flanagan and dozens of other high-profile Australian 
authors, were included in the pirated dataset containing more than 180,000 
titles.

On Thursday, the Australian Society of Authors issued a statement saying it was 
“horrified” to learn that the works of Australian writers were being used to 
train artificial intelligence without permission from the authors.

ASA chief executive, Olivia Lanchester, described the Books3 dataset as piracy 
on an industrial scale.

“Authors appropriately feel outraged,” Lanchester said. “The fact is this 
technology relies upon books, journals, essays written by authors, yet 
permission was not sought nor compensation granted.”

Lanchester said the Australian literary industry, while not objecting per se to 
emerging technologies such as AI, was deeply concerned about the lack of 
transparency evident in the development and monetisation of AI by global tech 
companies.

“Turning a blind eye to the legitimate rights of copyright owners threatens to 
diminish already precarious creative careers,” she said.

“The enrichment of a few powerful companies is at the cost of thousands of 
individual creators. This is not how a fair market functions.”

Josephine Johnston, chief executive of Australia’s Copyright Agency, described 
the Books3 development as “a free kick to big tech” at the expense of 
Australia’s creative and cultural life.

“We’re going to need greater transparency – how these tools have been 
developed, trained, how they operate – before people can truly understand what 
their legal rights might be,” she said.

“We seem to be in this terrible position now where content owners – remembering 
that the vast majority of them will be individual authors – may actually have 
to take out court cases to enforce their rights.”

Australian copyright law protects creators of original content from data 
scraping.

Litigation in the US against ChatGPT creator OpenAI over use of allegedly 
pirated book datasets, Books1 and Books2 (which do not appear to be affiliated 
with Books3), has already commenced.


In July, North American horror/fantasy writers Mona Awad (author of Bunny) and 
Paul Tremblay (author of The Cabin at the End of the World) filed a lawsuit in 
a San Francisco federal court, alleging ChatGPT unlawfully digested their books 
as part of its AI training data.

On 28 August, OpenAI filed a motion to dismiss the lawsuit, arguing that the 
authors “misconceive the scope of copyright, failing to take into account the 
limitations and exceptions (including fair use) that properly leave room for 
innovations like the large language models now at the forefront of artificial 
intelligence”.

On 19 September the Authors Guild and 17 of its members, including bestselling 
novelists John Grisham, George RR Martin and Jodi Picoult, filed a complaint in 
a New York district court against OpenAI, seeking redress for “flagrant and 
harmful infringements” of guild members’ registered copyrights.

In a statement on its website, the guild says while it is aware that companies 
such as Meta and Bloomberg have used the Books3 dataset to train their LLMs, it 
is not yet clear whether OpenAI is using Books3 to train its ChatGPT models 
GPT-3.5 or GPT-4.

Guardian Australia has sought comment from OpenAI, which has yet to officially 
respond to the guild’s complaint, and Meta.

On 4 September, US technology magazine Wired reported that a Danish anti-piracy 
group called Rights Alliance had been told by Bloomberg that the company did 
not plan to train future versions of its BloombergGPT using Books3.

Bloomberg declined to respond to the Guardian’s queries.

The APA said the global nature of the issue would present significant 
challenges in enforcement and prosecution, and has joined the authors’ society 
in calling for AI technologies to be regulated.

Consultation closed last month for a Department of Industry, Science and 
Resources discussion paper on supporting responsible AI.

A parliamentary inquiry is under way examining the use of generative artificial 
intelligence in the Australian education system.

Flanagan said it was up to the Australian government to act to protect 
Australia’s writers.

“It has power and we do not,” he said.

“If it cares for our culture it must now stand up and fight for it.”

_______________________________________________
nexa mailing list
nexa@server-nexa.polito.it
https://server-nexa.polito.it/cgi-bin/mailman/listinfo/nexa