Oops! My post has been "hidden"...
On Fri, 6 Sep 2024 00:55:55 +0200 Giacomo Tesio <giac...@tesio.it>
wrote:

> Here you can find my counter-proposal:
> 
> https://discuss.opensource.org/t/draft-v-0-0-9-of-the-open-source-ai-definition-is-available-for-comments/513/11

Fortunately, the Wayback Machine was faster:

http://web.archive.org/web/20240905230145/https://discuss.opensource.org/t/draft-v-0-0-9-of-the-open-source-ai-definition-is-available-for-comments/513/#post_11

In any case, I am reproducing the content below, in case they ask to have
it deleted from there as well (it has already happened to me in the past...)

```
Totally agree with @thesteve0.

Systems based on machine learning techniques are composed of two kinds
of software: a virtual machine (with a specific architecture) that
basically maps vectors to vectors, and a set of “weight” matrices that
constitute the software executed by that virtual machine (the “AI
model”).

The source code of the virtual machine can be open source, so that,
given the proper compiler, we can create an exact copy of that software.

In the same way, the software executed by the virtual machine (usually
referred to as “the AI model”) is encoded in a binary form that the
specific machine can directly execute (the weight matrices). The source
code of such a binary is composed of all the data required to recreate an
exact copy of the binary (the weights). Such data include the full
dataset used, but also any random seed or input used during the process,
such as the initial random values used to initialize an artificial
neural network.

While the weights are handy for modifying an AI system, they are in no
way enough to study it.

So any system that does not provide the whole dataset required to
recreate an exact copy of the model cannot be defined as open source.

Note that in an age of supply-chain attacks that leverage open source,
the right to study the system also has huge practical security value,
as arXiv:2204.06974 showed that you can plant undetectable backdoors in
machine learning models.

Thus I suggest modifying the definition so that it reads:

    Data information: Sufficiently detailed information about all the
    data used to train the system (including any random value used
    during the process), so that a skilled person can recreate an exact
    copy of the system using the same data. Data information shall be
    made available with licenses that comply with the Open Source
    Definition.

Being able to build a “substantially equivalent” system means not being
able to build that system, but only a different one. It would be like
defining Google Chrome as “open source” just because we have access to
the Chromium source code.

When its training data cannot legally be shared, an AI system cannot be
defined as “open source” even if all the other components comply with
the Open Source Definition, because you cannot study that system, but
only the components that are available under an open source license.

Such a system can be valuable, but it is not open source, even if the
weights are available under an OSD-compliant license, because they
encode an opaque binary for a specific architecture, not source code.

Let's properly call such models and systems “freeware” and build a
definition of Open Source AI that is coherent with the Open Source one.
```
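The reproducibility point above can be sketched with a toy example. This is
not a real training pipeline: `train`, its learning rate, and the tiny
dataset are all hypothetical stand-ins, used only to show that the result of
a stochastic training process is bit-exact reproducible when, and only when,
both the data and every random value (initialization, shuffling order) are
the same:

```python
import random

def train(dataset, seed):
    """Toy 'training': a seeded stochastic process over the data.
    Stands in for SGD, whose outcome likewise depends on the dataset
    AND on every random value used (weight init, shuffling, ...)."""
    rng = random.Random(seed)
    # Random initialization of the "weights" (drawn from the seeded RNG).
    weights = [rng.gauss(0.0, 1.0) for _ in range(4)]
    for _epoch in range(3):
        order = list(range(len(dataset)))
        rng.shuffle(order)  # seeded random shuffling of the data
        for i in order:
            x, y = dataset[i]
            for j in range(len(weights)):
                # Tiny gradient-style update toward y = w * x.
                weights[j] += 0.01 * (y - x * weights[j]) * x
    return weights

dataset = [(0.5, 1.0), (1.5, 2.0), (-1.0, -0.5)]

# Same data + same seed: a bit-exact copy of the "model".
assert train(dataset, seed=42) == train(dataset, seed=42)
# Same data, different seed: a different "model".
assert train(dataset, seed=42) != train(dataset, seed=43)
```

With the dataset but without the seed, you can only rebuild a
"substantially equivalent" model; with both, you can rebuild the exact one.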



Giacomo
