Welcome to the new Ubuntu Technical Board!

I'm forwarding this old thread. What are your opinions about endorsing the
open source AI definition?

It's incredibly frustrating to see models like DeepSeek being referred to
as "open source AI" when even the license of the model itself heavily
restricts usage. We need more pressure on companies to either stop using
the term open source or to actually release these models as Open Source AI.

I think it would benefit us greatly to have more clarity about what the
bare minimum requirements are to be able to call an AI model open source.
We are a large and prominent community so our endorsement will carry weight.




Kind regards
Merlijn


On Fri 31 Jan 2025, 13:28 Merlijn Sebrechts, <merlijn.sebrec...@gmail.com>
wrote:

> Hi Community Council and Technical Board
>
>
>
> This is a discussion for both, given that this decision has both technical
> and community/governance implications.
>
> As you might have seen, the Open Source Initiative has been developing a
> definition for "Open Source AI" for the past few years. They have recently
> released version 1.0 and are asking for organisations and individuals to
> endorse it. The Eclipse Foundation, SUSE, Mozilla, and many other
> organisations have already endorsed it.
>
> https://opensource.org/ai/endorsements
>
> *I think Ubuntu should endorse the Open Source AI definition.*
>
>
>
> *Why is this important for Ubuntu?*
> There is a big issue of "openwashing" in the AI space. Many large
> organizations like Meta are blatantly calling their AI open source, even
> though the models are released under a license severely restricting use.
> The media in general is not catching on to this ruse. For example, take
> a look at my analysis of the Llama license:
> https://merlijn.sebrechts.be/blog/2024-08-06-problematic-meta-llama-3-1-license/
>
>    - *Ubuntu heavily benefits from the fact that "Open Source" is a
>    strong, well-defined term.* Our users trust that they can use Ubuntu
>    for any purpose, without needing to consult lawyers first.
>    - *AI Openwashing dilutes the meaning of "open source"*. As a result,
>    our users become confused and either lose trust in the term
>    open source, or are drawn to competitors who use the term but don't
>    actually walk the talk.
>    - *AI Openwashing creates a complicated compliance issue for Ubuntu.*
>    Even though these tools do not adhere to our license requirements, users
>    might start to expect us to ship and/or integrate with them. At least when,
>    for example, MongoDB switched to the SSPL, everyone agreed that it was no
>    longer open source.
>
>
> *What is our goal for endorsing it?*
>
> *Solving the issue of Openwashing requires a strong definition of what
> Open Source AI means*, one supported both by law and by public
> perception. The Open Source Initiative is perfectly
> positioned to do this. They have been stewards of the Open Source Software
> definition for a long time, and their authority has been confirmed in
> court. Example:
> https://opensource.org/blog/court-affirms-its-false-advertising-to-claim-software-is-open-source-when-its-not
>
> Our goal in endorsing this definition is to give more weight to the OSI
> definition of Open Source, so that it has a higher chance of being
> adopted both in public perception and in the legal system.
>
> *What are we endorsing specifically?*
>
> Our endorsement means two things.
>
>    - Firstly, we are endorsing that the current definition is a good step
>    in the right direction.
>    - Secondly, we are endorsing that the OSI has the authority to create
>    this definition and that we have faith in their process to create and
>    improve this definition.
>
> As for the actual current version, you can find version 1.0 of the
> definition here: https://opensource.org/ai/open-source-ai-definition
>
> The gist of the definition is this:
>
> An Open Source AI is an AI system made available under terms and in a way
>> that grant the freedoms to:
>>
>>    -  Use the system for any purpose and without having to ask for
>>    permission.
>>    -  Study how the system works and inspect its components.
>>    -  Modify the system for any purpose, including to change its output.
>>    -  Share the system for others to use with or without modifications,
>>    for any purpose.
>>
>> These freedoms apply both to a fully functional system and to discrete
>> elements of a system. A precondition to exercising these freedoms is to
>> have access to the preferred form to make modifications to the system.
>
>
> The definition further explains what the "preferred form to make
> modifications to the system" is.
>
>    - *Detailed information about the training data.* This does not
>    include the data itself, but it needs to be detailed enough so that a third
>    party can create a similar data set.
>    - *The code to train and run the system.*
>    - *The model parameters, weights, and configuration values.*
>
> Most current "open source" AI models only open source the code to run the
> system. All other parts are either hidden or are only released under
> "source-available" licenses such as Meta's Llama license.
>
>
> *Isn't there some controversy surrounding the definition?*
>
> There is some debate in the community about whether only detailed
> information about the data needs to be open, or the data itself. The
> current OSAID takes the pragmatic position that only detailed information
> about the data must be open source, not the data itself. It makes this
> compromise (not enforcing open data) because open data for AI training is
> a difficult legal and practical subject. Most countries actually allow
> training of open source AI on copyright-restricted datasets. As a result,
> retraining an AI does not require open source datasets. Moreover, due to
> inconsistent data IP laws around the world, it is very difficult to even
> determine whether data is open source, and some countries, such as Japan,
> have copyright exceptions that explicitly break the concept of open
> source data. Finally, with the most privacy-respecting form of AI
> training (federated learning), the model creator never even has access to
> the training data, precisely to protect the privacy of the data subjects.
>
> For more information on why the current proposal doesn't require open
> source training data, see
>
>    - Explaining the concept of Data Information
>    https://opensource.org/blog/explaining-the-concept-of-data-information
>    - Open Data and Open Source AI: Charting a course to get more of both
>    
> https://opensource.org/blog/open-data-and-open-source-ai-charting-a-course-to-get-more-of-both
>    - Why datasets built on public domain might not be enough for AI
>    
> https://opensource.org/blog/why-datasets-built-on-public-domain-might-not-be-enough-for-ai
>
> All in all, I think the current proposal is a pragmatic compromise that
> keeps the Open Source AI definition practical and reflective of the
> current technological and legal landscape, rather than an aspirational
> text with no practical effect. This, in my opinion, fits well with
> Ubuntu's pragmatic ethos.
>
> *I think a world in which the OSI defines what open source AI is, is a
> much better world for Ubuntu than a world where every AI creator can decide
> for themselves what that means.* Even though the current definition is
> not perfect, I trust the OSI and their community processes as the steward
> for this definition.
>
>
>
> *What are your opinions on this? Do you agree we should endorse the Open
> Source AI definition as Ubuntu?*
>
>
>
>
>
-- 
technical-board mailing list
technical-board@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/technical-board
