On Mon, 2025-02-10 at 19:12 +0100, Christian Kastner wrote:
> > Preferred Form of Modification
> > ==============================
> > [...]
> > As a practical matter, for the non-monopolies in the free software
> > ecosystem, the preferred form of modification for base models is the
> > models themselves.
>
> I would have strongly disagreed with this until a short while ago, and
> stated that unless I can run a modified training process -- which would
> require the training data -- I don't have the preferred form of
> modification.
>
> However, recent advances point to new useful models being built from
> other models, for example what DeepSeek accomplished with Llama. They
> obviously didn't have the original training data, yet still built
> something very useful from the base model.
>
> So I now have a slight doubt. But it is only slight; my gut says that
> even many useful derivations cannot "heal" an initial problem of
> free-ness. Because if the original base model were to disappear (as you
> put it in "Free Today, Gone Tomorrow"), all derivations in the chain
> would lose their reproducibility, too.
And independence too, which connects to a healthy ecosystem in the long run. Consider the case where basemodel-v1 is released under MIT and some derivative works grow around it, but the license of basemodel-v2 is then changed to a proprietary one. The open source ecosystem around v2 will simply decay. With traditional open source or free software, if people are unsatisfied with how software-v1 is written, or the upstream of software-v1 decides to discontinue the effort, people can still fork the v1 work and potentially create a v2 independently.

Data access matters even more for academia. Without the original training data, there can never be a fair comparison, let alone the rigorous research needed to make real improvements. For example, ResNet (for image classification) was trained on ImageNet (a large-scale image dataset, academic-use-only). The original authors have long since stopped improving this "base model". However, people can still train new "base models" such as ViT (vision transformer) on ImageNet and make real improvements. The original training dataset being accessible, even if academic-use-only, is one key factor that keeps this line of research healthy: if anybody is unsatisfied with ResNet, ViT, etc., they can reproduce the original base model and try to improve on it (a rough sketch of what this looks like in practice is at the end of this mail).

No model is the endgame so far; pre-trained models get replaced very quickly. An open source ecosystem built upon a frozen "toxic candy" base model cannot iterate. Once the frozen base model becomes outdated, the whole ecosystem is outdated, because the ecosystem is not independent and cannot iterate on its own.

Similarly, suppose we treat "sbuild" as a "frozen base model". The community can create sbuild-schroot, sbuild-unshare, etc. around it, and when sbuild is discontinued, those derivative works will be impacted. However, as long as the fundamentals (dpkg-dev) remain public, people can still independently design other "frozen base models", like debspawn (based on systemd-nspawn) or even docker-based ones. In that sense, the Debian package builder ecosystem is still healthy.

My interpretation of "toxic candy" focuses not only on the present, but also on the future, especially on the key factors that contribute to a healthy, positive loop in which the ecosystem can constructively grow. If software freedom is defined on top of a "toxic candy" base model and depends on it, then once the base model quits the game and is discontinued, that "software freedom" has to quit the game and discontinue as well, because nobody other than the original author has the freedom to improve the base model itself. "Toxic candy" models are not reproducible, and are not something people can independently improve. By that definition, I don't believe they satisfy the definition of software freedom.

If one disagrees on this point, then the question becomes whether "being able to do secondary development" covers all the freedoms in the definition. Independence matters for some aspects, like trustworthiness. For example, what if a "toxic candy" language model responds with spam advertisements instead of actually answering the user's question? Nobody other than the original author is able to fix that "base model". Should I trust such a "toxic candy" model and regard it as "free software", while being unable to study or modify the "base model" itself?
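
To make the ImageNet point above concrete, here is a minimal sketch, assuming PyTorch/torchvision and a local copy of ImageNet; the dataset path and hyperparameters are placeholders of mine, not anyone's actual training recipe. The point is only that, with the training data accessible, anyone can train a new "base model" under identical conditions and compare it fairly against the old one:

    # A minimal sketch, assuming torchvision >= 0.12 and a local copy
    # of ImageNet; path and hyperparameters below are placeholders.
    import torch
    import torchvision
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    # The same data feeds every candidate architecture; this shared,
    # accessible dataset is what makes the comparison fair and the
    # line of research reproducible.
    dataset = torchvision.datasets.ImageNet("/path/to/imagenet",
                                            split="train",
                                            transform=preprocess)
    loader = torch.utils.data.DataLoader(dataset, batch_size=256,
                                         shuffle=True, num_workers=8)

    candidates = {
        "resnet50": torchvision.models.resnet50(num_classes=1000),
        "vit_b_16": torchvision.models.vit_b_16(num_classes=1000),
    }

    for name, model in candidates.items():
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=1e-4)
        criterion = torch.nn.CrossEntropyLoss()
        model.train()
        for images, labels in loader:  # one pass shown; a real run
            optimizer.zero_grad()      # needs many epochs and an
            loss = criterion(model(images), labels)  # lr schedule
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f"{name}-imagenet.pt")

If the dataset disappeared tomorrow, neither candidate could be retrained, and the whole line of comparison would be frozen at whatever checkpoints happen to survive -- which is exactly the "Free Today, Gone Tomorrow" failure mode applied to research.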