Hi Community Council and Technical Board,


This is a discussion for both, given that this decision has both technical
and community/governance implications.

As you might have seen, the Open Source Initiative has been developing a
definition for "Open Source AI" for the past few years. They have recently
released version 1.0 and are asking for organisations and individuals to
endorse it. The Eclipse Foundation, SUSE, Mozilla and many other
organisations have already endorsed it.

https://opensource.org/ai/endorsements

*I think Ubuntu should endorse the Open Source AI definition.*



*Why is this important for Ubuntu?*
There is a big issue of "openwashing" in the AI space. Many large
organisations such as Meta blatantly call their AI open source, even
though the models are released under licenses that severely restrict use.
The media in general is not catching on to this ruse. As an example, take
a look at my analysis of the Llama license:
https://merlijn.sebrechts.be/blog/2024-08-06-problematic-meta-llama-3-1-license/

   - *Ubuntu heavily benefits from the fact that "Open Source" is a strong,
   well-defined term.* Our users trust that they can use Ubuntu for any
   purpose, without needing to consult lawyers first.
   - *AI Openwashing dilutes the meaning of "open source"*. As a result,
   our users become confused and either start to lose trust in the term
   open source or are drawn to competitors who use the term but don't
   actually walk the talk.
   - *AI Openwashing creates a complicated compliance issue for Ubuntu.*
   Even though these tools do not adhere to our license requirements, users
   might start to expect us to ship and/or integrate with them. When, for
   example, MongoDB switched to the SSPL, at least everyone agreed that it
   was no longer open source.


*What is our goal for endorsing it?*

*Solving the issue of Openwashing requires a strong definition of what Open
Source AI means*: one supported both in law and in public perception. The
Open Source Initiative is perfectly positioned to do this. They have been
stewards of the Open Source Software definition for a long time, and their
authority has been confirmed in court. For example:
https://opensource.org/blog/court-affirms-its-false-advertising-to-claim-software-is-open-source-when-its-not

Our goal in endorsing this definition is to give more weight to the OSI
definition of Open Source, so that it has a higher chance of being adopted
both in public perception and in the legal system.

*What are we endorsing specifically?*

Our endorsement means two things.

   - Firstly, we are endorsing that the current definition is a good step
   in the right direction.
   - Secondly, we are endorsing that the OSI has the authority to create
   this definition and that we have faith in their process to create and
   improve this definition.

As for the actual current version, you can find version 1.0 of the
definition here: https://opensource.org/ai/open-source-ai-definition

The gist of the definition is this:

> An Open Source AI is an AI system made available under terms and in a way
> that grant the freedoms to:
>
>    -  Use the system for any purpose and without having to ask for
>    permission.
>    -  Study how the system works and inspect its components.
>    -  Modify the system for any purpose, including to change its output.
>    -  Share the system for others to use with or without modifications,
>    for any purpose.
>
> These freedoms apply both to a fully functional system and to discrete
> elements of a system. A precondition to exercising these freedoms is to
> have access to the preferred form to make modifications to the system.


The definition further explains what the "preferred form to make
modifications to the system" is:

   - *Detailed information about the training data.* This does not include
   the data itself, but it needs to be detailed enough so that a third party
   can create a similar data set.
   - *The code to train and run the system.*
   - *The model parameters/weights/configuration values.*

Most current "open source" AI models only release the code to run the
system under an open source license. All other parts are either hidden or
released only under "source-available" licenses such as Meta's Llama
license.


*Isn't there some controversy surrounding the definition?*

There is some debate in the community about whether only detailed
information about the data needs to be open, or the data itself as well.
The current OSAID is a pragmatic definition: only detailed information
about the data needs to be open source, not the data itself. It makes this
compromise (not enforcing open data) because open data for AI training is a
legally and practically difficult subject. Most countries actually allow
training of open source AI on copyright-restricted datasets, so retraining
an AI does not require open source datasets. Moreover, due to inconsistent
data IP laws around the world, it is very difficult to even determine
whether data is open source, and some countries, such as Japan, have
copyright rules that effectively break the concept of open source data.
Finally, with the most privacy-respecting form of AI training (federated
learning), the model creator never even has access to the training data,
precisely to protect the privacy of the data subjects.

For more information on why the current proposal doesn't require open
source training data, see

   - Explaining the concept of Data Information
   https://opensource.org/blog/explaining-the-concept-of-data-information
   - Open Data and Open Source AI: Charting a course to get more of both
   https://opensource.org/blog/open-data-and-open-source-ai-charting-a-course-to-get-more-of-both
   - Why datasets built on public domain might not be enough for AI
   https://opensource.org/blog/why-datasets-built-on-public-domain-might-not-be-enough-for-ai

All in all, I think the current proposal is a pragmatic compromise that
ensures the Open Source AI definition is practical and reflects the
current technological and legal landscape, rather than an aspirational
text with no real-world effect. This, in my opinion, fits well with
Ubuntu's pragmatic ethos.

*I think a world in which the OSI defines what open source AI is, is a much
better world for Ubuntu than a world where every AI creator can decide for
themselves what that means.* Even though the current definition is not
perfect, I trust the OSI and their community processes as the steward for
this definition.



*What are your opinions on this? Do you agree we should endorse the Open
Source AI definition as Ubuntu?*
-- 
technical-board mailing list
technical-board@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/technical-board
