Re: [nexa] Bruce Schneier on OSI's "Open Source AI"

Stefano Maffulli Tue, 12 Nov 2024 06:28:14 -0800

It is a pity that Schneier devoted only a few words to such a complex
topic. The fundamental point in this debate is making the effort to
understand that there are 4 types of "data"
<https://hackmd.io/@opensourceinitiative/osaid-faq#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition>
(from the OSI FAQ):

   - Open training data: data that can be copied, preserved, modified and
   reshared. It provides the best way to enable users to study the system.
   This must be shared.
   - Public training data: data that others can inspect as long as it
   remains available. This also enables users to study the work. However, this
   data can degrade as links or references are lost or removed from network
   availability. To obviate this, different communities will have to work
   together to define standards, procedures, tools and governance models to
   overcome this risk, and Data Information is required in case the data
   becomes later unavailable. This must be disclosed with full details on
   where to obtain it.
   - Obtainable training data: data that can be obtained, including for a
   fee. This information provides transparency and is similar to a purchasable
   component in an open hardware system. The Data Information provides a means
   of understanding this data other than obtaining or purchasing it. This is
   an area that is likely to change rapidly and will need careful monitoring
   to protect Open Source AI developers. This must be disclosed with full
   details on where to obtain it.
   - Unshareable non-public training data: data that cannot be shared for
   explainable reasons, like Personally Identifiable Information (PII). For
   this class of data, the ability to study some of the system's biases
   demands a detailed description of the data – what it is, how it was
   collected, its characteristics, and so on – so that users can understand
   the biases and categorization underlying the system. This must be revealed
   in detail so that, for example, a hospital can create a dataset with
   identical structure using their own patient data.
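
As a rough illustration only (not part of the FAQ), the four classes above
and the obligation attached to each could be sketched as a small lookup
table. All names here (DataClass, OBLIGATION, Dataset, obligations) and the
sample dataset names are hypothetical, and the obligation strings are my
paraphrase of the FAQ text:

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    """The four classes of training data listed in the FAQ excerpt."""
    OPEN = "open"
    PUBLIC = "public"
    OBTAINABLE = "obtainable"
    UNSHAREABLE = "unshareable non-public"


# Paraphrased obligations; the wording is an assumption, not the FAQ's text.
OBLIGATION = {
    DataClass.OPEN: "share the data itself",
    DataClass.PUBLIC: "disclose full details on where to obtain it",
    DataClass.OBTAINABLE: "disclose full details on where to obtain it",
    DataClass.UNSHAREABLE: "describe the data in detail",
}


@dataclass
class Dataset:
    name: str              # hypothetical dataset label
    data_class: DataClass  # which of the four classes it falls into


def obligations(datasets):
    """Map each dataset name to the obligation for its class."""
    return {d.name: OBLIGATION[d.data_class] for d in datasets}


if __name__ == "__main__":
    training_set = [
        Dataset("web-crawl", DataClass.PUBLIC),
        Dataset("patient-records", DataClass.UNSHAREABLE),
    ]
    for name, duty in obligations(training_set).items():
        print(f"{name}: {duty}")
```

The point of the sketch is only that a single system can mix several
classes, and each one carries a different duty.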

Some of these cannot be distributed (copyright and assorted other rights),
and some cannot even be requested (PII).

So requiring that Open Source AI be built only from open data means that
Open Source AI will always be reduced to an inferior product (a smaller
amount of data) and will be shut out of some uses (medical AI, for
example)... as if GNU had resigned itself to being a cut-down copy of Unix
because ISO standards are not freely copyable.

Some people dismiss the issue lightly, without noticing that practically
all the large "open" datasets have more or less the same problems,
including those that have not yet been taken to court for copyright
infringement and other claims.

The "Open" world is already at a deep disadvantage in this field, and the
OSI board of directors has an obligation to defend the Open Source
ecosystem.

Prof. Liang <https://press.airstreet.com/p/percy-liang-on-truly-open-ai>
sums it up well:

Companies also want to avoid costly and time-consuming litigation (think
OpenAI and the New York Times), so it makes short-term economic sense to be
closed about their data.

We’re also seeing a shrinking of the AI Data Commons - the crawlable web
data assembled <https://arxiv.org/abs/2407.14933> in corpora like C4,
RefinedWeb, and Dolma.

5-7% of previously available training data has been restricted
<https://arxiv.org/abs/2407.14933> in just the past year, with rates
getting closer to 20-33% for valuable sources like news sites and social
media. The current tools for managing this access, like robots.txt (which
dates back to 1995) are inconsistently implemented, creating a messy
patchwork of restrictions.

[...] For a model to be open-source, we need full information about the
processing code and data.

Liang again <https://xcancel.com/percyliang/status/1832164166314160163#m>:

On the flip side, even a full data release is insufficient for the goals of
open-source (to study and modify). From just the data, it is hard to
understand *why* certain tokens are included or excluded. For this, you
really need the code for the *full* data processing pipeline.

The complexity of the issue is also confirmed by the position of the FSF
<https://www.fsf.org/news/fsf-is-working-on-freedom-in-machine-learning-applications>,
which essentially reaches the same conclusion as the OSI while accepting a
paradox: for the FSF, AI systems trained on private data are "nonfree" but
morally acceptable.

Ridiculous, too, is the accusation that the Open Source AI definition is
welcomed by the mega-corporations: Meta is fighting it tooth and nail, with
IBM and Amazon following close behind. Mistral and its backers likewise
keep a safe distance, because the definition requires revealing the real
"secrets", which lie not so much in the data as in the code that produces
the dataset and the training.

Finally, I would like to point out that while the Open Source AI
definition was being developed, the OSI coordinated an analysis of about a
dozen publicly available systems and validated the definition, finding that
the Open Source systems are indeed the ones that at least make an effort to
produce their training datasets. For those who keep saying that Llama is
"open source", you will find counterarguments in the FAQ
<https://hackmd.io/@opensourceinitiative/osaid-faq#Which-AI-systems-comply-with-the-Open-Source-AI-Definition>
as well as in the specific commentary on the license
<https://blog.opensource.org/metas-llama-2-license-is-not-open-source/>.

Unfortunately, the topic is not intuitive for those who have always
promoted freedom and openness: there is a paradox, and the OSI has
suggested a way around it that respects its founding values. In the coming
months we will see how to apply the new Definition in practice, and it will
become clearer what to change in future versions.

/stefano maffulli
Executive Director - Open Source Initiative

On Mon, Nov 11, 2024 at 10:10 AM Roberto Resoli <robe...@resolutions.it>
wrote:

> Bruce Schneier joins the critics of the OSI definition of Open Source
> AI, reporting on it in his blog:
>
>
> https://www.schneier.com/blog/archives/2024/11/ai-industry-is-trying-to-subvert-the-definition-of-open-source-ai.html
>
> ... and he does not pull his punches:
>
>  > The Open Source Initiative has published (news article here) its
> definition of “open source AI,” and it’s terrible. It allows for secret
> training data and mechanisms. It allows for development to be done in
> secret. Since for a neural network, the training data is the source
> code—it’s how the model gets programmed—the definition makes no sense.
>  > ...
>
> The article contains many links to other criticisms (from Debian[1] too,
> for example).
>
> rob
>
> [1]
>
> https://samjohnston.org/2024/10/22/debian-general-resolution-gr-drafted-opposing-osis-open-source-ai-definition-osaid/
>
