On Wed, 14 May 2025 at 08:58, Simon Josefsson <si...@josefsson.org> wrote:
> To me I think we have at least two camps:
>
> 1) We must have DFSG-compliant licensing of source code for everything
> in main, and that source code should encompass everything needed for a
> skilled person to re-create identical (although possibly not bit-by-bit
> identical) artifacts.
>
2) We must have DFSG-compliant licensing of source code for everything
in main, but training data is not part of source code. Instead, the
source code for training models would be the code and a protocol
describing how to generate or gather training data in such a way that
a skilled person would be able to re-create functionally the same
(although not identical) artifacts. If re-creation is impractical (due
to compute costs), then the model must also be modifiable after
training by a skilled person with tooling in the archive. This matches
the meaning of the https://opensource.org/ai definition, just mapped
onto the DFSG criteria (using the "Data Information" definition from
OSI).

Or, reformulated extremely concisely as clarifications to DFSG scope:

1) AI training data is source code.
2a) AI training data is not source code.
2b) AI training data is not source code, but "Data Information" is
source code.
2b+) AI models must either be easily retrainable from training data
*or* be easily modifiable and adaptable after training to satisfy
DFSG §3.

Option 2a is likely an obsolete, maximally permissive option in this
discussion context. The combination of 2b and 2b+ is, for me, the
preferable position.

> Neither position has much to do with AI models as far as I can tell.

It is a bit more clearly to do with AI after the reformulation.

> Is there any complication beyond size and infrastructure to recreate models
> that are a factor here? Or is this "just" a re-hash of the perpetual
> main vs non-free discussion?

Whether an OSI-free LLM would be acceptable to distribute in non-free
as it is right now is an interesting side question. non-free does not
have a strict full-source-code requirement, but it does have a binary
redistributability requirement. IMHO, distribution could currently be
opposed either with moral argumentation against non-consensually
trained AIs (as Russ described) or by arguing the legal position that
the model itself is a derived work of all of its training data, so we
cannot trust the copyright or license terms that the creators of the
LLM claim and thus cannot assume we have the rights to redistribute
the model.

While I find the legal component of these arguments shaky, the moral
argument is a matter of opinion. I do not agree with that opinion, but
I can see how it is a perfectly valid and consistent opinion to hold.

--
Best regards,
    Aigars Mahinovs