On Mon, 2025-02-10 at 19:12 +0100, Christian Kastner wrote:
> > Preferred Form of Modification
> > ==============================
> > [...]
> > As a practical matter, for the non-monopolies in the free software
> > ecosystem, the preferred form of modification for base models is the
> > models themselves.
> 
> I would have strongly disagreed with this until a short while ago, and
> stated that unless I can run a modified training process -- which would
> require the training data -- I don't have the preferred form of
> modification.
> 
> However, recent advances point to new useful models being built from
> other models, for example what DeepSeek accomplished with Llama. They
> obviously didn't have the original training data, yet still built
> something very useful from the base model.
> 
> So I now have a slight doubt. But it is only slight; my gut says that
> even many useful derivations cannot "heal" an initial problem of
> free-ness. Because if the original base model were to disappear (as you
> put it in "Free Today, Gone Tomorrow"), all derivations in the chain
> would lose their reproducibility, too.

And independence, too, which is key to a healthy ecosystem in the long run.

Think about the case where basemodel-v1 is released under MIT, and some
derivative works grow around this v1 model. Then someday, the license of
basemodel-v2 is changed to a proprietary one. The open-source ecosystem
around the model will simply decay.
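
To make the dependency concrete, here is a minimal sketch (the
checkpoint name "example-org/basemodel-v1" and the Hugging Face
transformers workflow are my assumptions for illustration, not any
real release) of how such a derivative work is typically produced:

    # Hypothetical: a "derivative work" of basemodel-v1 is usually
    # fine-tuning of the released weights. Every step below depends
    # on the frozen checkpoint staying downloadable; nothing here
    # can re-create it from scratch.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "example-org/basemodel-v1"  # hypothetical checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # ... fine-tune on our own data here ...

    model.save_pretrained("basemodel-v1-derivative")
    tokenizer.save_pretrained("basemodel-v1-derivative")

If the v1 checkpoint is pulled (or v2 goes proprietary), the
from_pretrained() calls fail for everyone who has not mirrored the
weights, and no amount of MIT-licensed derivatives can regenerate
the base.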

For traditional open-source or free software, if people are dissatisfied
with how software-v1 is written, or the upstream of software-v1 decides to
discontinue the effort, people can still fork the v1 work and potentially
create a v2 independently.

Data access matters even more for academia. Without the original training
data, there can never be a fair comparison, let alone rigorous research
toward real improvements. For example, ResNet (for image classification)
is trained on ImageNet (a large-scale image dataset, academic-use-only).
The original authors have long stopped improving this "base model".
However, people can still train new "base models", such as ViT (Vision
Transformer), on ImageNet to make real improvements. The original training
dataset being accessible, although academic-use-only, is one key factor
that keeps this line of research healthy. If anybody is dissatisfied with
ResNet / ViT / etc., they can reproduce the original base model and try
to make improvements.
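
As a rough sketch of what that looks like in practice (assuming a
local academic-use copy of ImageNet and the stock torchvision APIs;
the hyperparameters and paths are placeholders, not the original
training recipes):

    # With the original dataset accessible, anyone can train a
    # competing "base model" from scratch. Assumes ImageNet is
    # unpacked under ./imagenet (placeholder path).
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageNet("./imagenet", split="train",
                                  transform=transform)
    loader = DataLoader(train_set, batch_size=256, shuffle=True)

    # Swap architectures freely: the data, not the old weights,
    # anchors the research line.
    model = models.vit_b_16(num_classes=1000)  # or models.resnet50()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for images, labels in loader:  # minimal single-epoch loop
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

The point is that the same data lets a newcomer replace the base
model entirely, which is exactly what a withheld training set
forbids.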

No model is the endgame so far; pre-trained models are replaced very
quickly. An open-source ecosystem built upon a frozen "toxic candy" base
model cannot iterate. Once the frozen base model becomes outdated, the
whole ecosystem is outdated, because the ecosystem is not independent and
cannot iterate by itself.

Similarly, suppose we treat sbuild as a "frozen base model". The community
can create sbuild-schroot, sbuild-unshare, etc. around it. If sbuild is
discontinued, the derivative works will be impacted. However, as long as
the fundamentals (dpkg-dev) remain public, people can still independently
design other "frozen base models", like debspawn (systemd-nspawn based) or
even Docker-based ones. In that sense, the Debian package builder ecosystem
is still healthy.

My interpretation of "toxic candy" focuses not only on the present, but
also on the future, especially on the key factors that contribute to a
healthy, positive loop in which the ecosystem can constructively grow.

If software freedom is defined on top of a "toxic candy" base model and
depends on it, then once the base model quits the game and is
discontinued, that "software freedom" has to quit the game and
discontinue as well, because nobody other than the original author has
the freedom to improve the base model itself.

"Toxic candy" models are not reproducible, and are not something people can
independently improve. By definition, I don't believe this satisfies the 
definition
of software freedom. If disagreed on this point, then the question turns to
whether "being able to do secondary development" covers all freedoms in 
definition.

Independence also matters for aspects like trustworthiness. For example,
what if a "toxic candy" language model responds with spam advertisements
instead of actually answering the user's question? Nobody other than the
original author is able to fix this "base model". Should I trust the
"toxic candy" model and regard it as "free software" while being unable
to study or modify the "base model" itself?
