Dear Stefano, your reasoning is correct. I would add that DRM removal is already
actionable as copyright infringement, not only in EU countries but also in the US.
If the work is not protected by DRM, the matter is somewhat more complicated: in the EU,
the licence agreement cannot exclude certain permitted uses, in particular text and data
mining for non-commercial purposes. It can, however, exclude that same use when the
purpose is commercial and the use has been expressly reserved. In the US there are no
precise rules, but contractual freedom usually tends to prevail over the availability of
exceptions (fair use). It is no coincidence that in the class action against GitHub /
Copilot the claims rest entirely on breach of the (open source) licence agreements and on
DRM removal, rather than on infringement of the copyright in the software used to train
the algorithm.
Warm regards,
Maurizio
On Fri, 29 Sep 2023 at 15:21, Stefano Quintarelli <stef...@quintarelli.it> wrote:
I have a question for the lawyers (actually, more than one).
To train a model, I need a file with the digital version of a text
(obviously I mean texts that are not PD, CC0, etc.).
I can obtain the digital version of a text from an ebook (already digital), by removing
the likely DRM.
But an ebook is not a good: it is a service subject to a licence agreement. So if the
licence does not grant the right to extract the digital text in order to train a model
on it, it seems to me there is already a breach of the licence, and therefore, I believe,
it cannot be used as training material, all the more so if the purpose of the training is
commercial (if I sell a service based on that model).
If that is so, then to train my model I have to obtain the digital text by scanning and
OCRing a printed copy.
But that, if I am not mistaken, is allowed only for personal, non-commercial use.
If this is correct, I see no way to obtain a digital text without breaching a licence
agreement or copyright.
Where is the fallacy in this reasoning?
Thanks, s.
On 29/09/23 15:00, Stefano Borroni Barale wrote:
> Good morning, list,
>
>> The idea that training a model on copyright-protected texts is an infringement of that copyright is highly debatable
>
> So far, I have the impression that all the lawyers on the list will agree.
>
>> the reasoning is actually quite simple: if learning from a
>> text infringed its copyright, we would all be criminals.
>
> But since we are human, and what we produce is, politicians' speeches aside(*), not ontologically identical to the output of non-living technical beings, logic dictates that what applies to us cannot be applied to an LLM, just as copyright law does not apply slavishly to the use of human texts to build language models.
>
> This is why all attempts to "protect via copyright" the output of generative software have failed miserably, with the grounds set out in written court rulings, which I believe carry far greater weight in law than the CC website.
>
> My impression is that this question will keep lawyers, computer scientists, philosophers and society busy for a veeeery long time yet.
> SBB
>
> (*) As children of the 1980s who played with this hilarious toy know well:
> https://www.enricodalbosco.it/giochi/tubolario/
>
>
>> Of those texts
>> there is physically no trace inside the models; nothing is
>> copied. The models are a transformative work based on those texts, not
>> a derivative one.
>>
>> Creative Commons argues this very well:
>> https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/
>>
>> That said, I quote the words of another author, Jeff Jarvis:
>> https://www.facebook.com/jeff.jarvis/posts/pfbid0LMFeqdTYoxnGHQAZwp5HMmeeVqgMSjL2dkcwMcBojkb2cinBpgYTHyc7Fhq1B9NPl
>>
>> «I, for one, am not complaining about my books being in large
>> language model training sets. I write to enter ideas into public
>> discourse. I prefer informed over ignorant AI. I believe it is fair
>> use for anyone to read & use books for transformative work. In fact,
>> I'd probably feel snubbed if my books were not there. I'm happy when
>> they are in libraries. I'm fine that they're here.»
>>
>> Fabio
>>
>> On Fri, 29 Sep 2023 at 07:52, Alberto Cammozzo via nexa
>> <nexa@server-nexa.polito.it> wrote:
>>
>>> https://www.theguardian.com/australia-news/2023/sep/28/australian-books-training-ai-books3-stolen-pirated
>>>
>>> Thousands of books from some of Australia’s most celebrated authors have potentially been caught up in what Booker prize-winning novelist Richard Flanagan has called “the biggest act of copyright theft in history”.
>>>
>>> The works have allegedly been pirated by the US-based Books3 dataset and used to train generative AI for corporations such as Meta and Bloomberg.
>>>
>>> Flanagan, who found 10 of his works, including the multi-international award-winning 2013 novel The Narrow Road to the Deep North, on the Books3 dataset, told Guardian Australia he was deeply shocked by the discovery made several days ago.
>>>
>>> “I felt as if my soul had been strip mined and I was powerless to stop it,” he said in a statement.
>>>
>>> “This is the biggest act of copyright theft in history.”
>>>
>>> The Australian Publishers Association confirmed to Guardian Australia on Wednesday that as many as 18,000 fiction and nonfiction titles with Australian ISBNs (unique international standard book numbers) appeared to be affected by the copyright infringement, although it is not yet clear what proportion of these are Australian editions of internationally authored books.
>>>
>>> “We’re still working through [the data] to work out the impact in terms of Australian authors,” APA spokesperson Stuart Glover said.
>>>
>>> “This is a massive legal and ethical challenge for the publishing industry and for authors globally.”
>>>
>>> A search tool published on Monday by US media platform The Atlantic and uploaded by the US Authors Guild on Wednesday revealed the works of Peter Carey, Helen Garner, Kate Grenville, Anna Funder, Christos Tsiolkas and Thomas Keneally, as well as Flanagan and dozens of other high-profile Australian authors, were included in the pirated dataset containing more than 180,000 titles.
>>>
>>> On Thursday, the Australian Society of Authors issued a statement saying it was “horrified” to learn that the works of Australian writers were being used to train artificial intelligence without permission from the authors.
>>>
>>> ASA chief executive, Olivia Lanchester, described the Books3 dataset as piracy on an industrial scale.
>>>
>>> “Authors appropriately feel outraged,” Lanchester said. “The fact is this technology relies upon books, journals, essays written by authors, yet permission was not sought nor compensation granted.”
>>>
>>> Lanchester said the Australian literary industry, while not objecting per se to emerging technologies such as AI, was deeply concerned about the lack of transparency evident in the development and monetisation of AI by global tech companies.
>>>
>>> “Turning a blind eye to the legitimate rights of copyright owners threatens to diminish already precarious creative careers,” she said.
>>>
>>> “The enrichment of a few powerful companies is at the cost of thousands of individual creators. This is not how a fair market functions.”
>>>
>>> Josephine Johnston, chief executive of Australia’s Copyright Agency, described the Books3 development as “a free kick to big tech” at the expense of Australia’s creative and cultural life.
>>>
>>> “We’re going to need greater transparency – how these tools have been developed, trained, how they operate – before people can truly understand what their legal rights might be,” she said.
>>>
>>> “We seem to be in this terrible position now where content owners – remembering that the vast majority of them will be individual authors – may actually have to take out court cases to enforce their rights.”
>>>
>>> Australian copyright law protects creators of original content from data scraping.
>>>
>>> Litigation in the US against ChatGPT creator OpenAI over use of allegedly pirated book datasets, Books1 and Books2 (which do not appear to be affiliated with Books3) has already commenced.
>>>
>>> In July, North American horror/fantasy writers Mona Awad (author of Bunny) and Paul Tremblay (author of The Cabin at the End of the World) filed a lawsuit in a San Francisco federal court, alleging ChatGPT unlawfully digested their books as part of its AI training data.
>>>
>>> On 28 August, OpenAI filed a motion to dismiss the lawsuit, arguing that the authors “misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence”.
>>>
>>> On 19 September the Writers Guild and 17 of its members, including bestselling novelists John Grisham, George RR Martin and Jodi Picoult, filed a complaint in a New York district court against OpenAI, seeking redress for “flagrant and harmful infringements” of guild members’ registered copyrights.
>>>
>>> In a statement on its website, the guild says while it is aware that companies such as Meta and Bloomberg have used the Books3 dataset to train their LLMs, it is not yet clear whether OpenAI is using Books3 to train its ChatGPT models GPT 3.5 or GPT 4.
>>>
>>> Guardian Australia has sought comment from OpenAI, which has yet to officially respond to the guild’s complaint, and Meta.
>>>
>>> On 4 September, US technology magazine Wired reported that a Danish anti-piracy group called Rights Alliance had been told by Bloomberg that the company did not plan to train future versions of its BloombergGPT using Books3.
>>>
>>> Bloomberg declined to respond to the Guardian’s queries.
>>>
>>> The APA said the global nature of the issue would present significant challenges in enforcement and prosecution, and has joined the authors’ society in calling for AI technologies to be regulated.
>>>
>>> Consultation closed last month for a Department of Industry, Science and Resources discussion paper on supporting responsible AI.
>>>
>>> A parliamentary inquiry is under way examining the use of generative artificial intelligence in the Australian education system.
>>>
>>> Flanagan said it was up to the Australian government to act to protect Australia’s writers.
>>>
>>> “It has power and we do not,” he said.
>>>
>>> “If it cares for our culture it must now stand up and fight for it.”