Hi all,

On 2025-02-05 15:45, Sam Hartman wrote:
> First, thanks for all your work on AI and free software.
> When I started my own AI explorations, I found your ML policy
> inspirational in how I thought about AI and free software.

Same here -- thanks, Mo!

> I have come to believe that:
> 
> 1) AI models are not very transparent even if you have the training
> data. taking advantage of the training data for a base model is probably
> outside the scope of most of us even if we had it.  That's definitely
> true for retraining a model, but I think is also true for understanding
> where bias is coming from.

Sam, apologies in advance if I'm wrong, but it sounds to me as if you
are thinking in the context of a good actor. In that case I would
concur: transparency alone might not help understanding much.

My concern, however, is bad actors.

I have often thought that the simplest way to "prove" that a free model
trained on private data cannot really be free is to train one that
purposefully introduces an undocumented bias, such that it creates a
self-contradicting model ("I am not free").

I'd publish code, tech report (in which I lie), the works, everything
but the data. But without the data, how would anyone else be able to
reasonably reject my model's assertion that it is not free, and still
consider the model free?

That was how I always perceived the "toxic" part of Toxic Candy. To me,
it was not about understanding all of the ingredients of regular candy,
but about knowing whether poison was added to it.

Actually, "toxic" doesn't even require a bad actor. The toxin may have
been added inadvertently, e.g. through a copyright issue with some of
the training data, as others have pointed out.

> With my Debian hat on, I don't really care whether base models are
> considered free or non-free. I don't think it will be important for
> Debian to include base-models in our archive.

I concur. Also, quoting Mo from another thread:

On 2025-02-05 18:20, M. Zhou wrote:
>> I do not see how proposal A harms the ecosystem. It just prevents huge
>> binary blobs from entering Debian's main section of the archive. It does not
>> stop people from uploading the binary blobs to non-free section.

On 2025-02-05 15:45, Sam Hartman wrote:
> What I do care about is what we can do with software that takes base
> models and adapts them for a particular use case.

Sure, but we still need to resolve the question of how to treat these
base models.

> Debian as Second Class
> ======================
> 
> I am concerned that if we are not careful the quality of models we are
> able to offer our users will lag significantly behind the rest of the
> world.
> If we are much more strict than other free-software projects, we will
> limit the models our users can use.

I think we, or any other free-software project, won't be able to offer
our users the models at all. Or at least, our current methods for
offering them (.debs) seem impractical.

So even if we deem these models "free" (=Mo's Proposal B), users would
still get them from somewhere else.

I really think what we are discussing is more about a principle, but
nevertheless an important one to have as a distribution.

(Note that if we were to deem these models unfit even for non-free, that
would still not obstruct users in any meaningful way. They already get
these models from Hugging Face, or a dozen other places.)

> Our social contract promises we will value our users and free software.
> If we reduce the selection (and thus quality) of what we offer our
> users, it should somehow serve free software.
> In this instance, I believe that it probably does not serve transparency
> and harms our core goal of making software easy to modify. In other
> words I do not believe free software is being helped enough to justify
> disadvantaging our users.

Again, I don't think we are meaningfully reducing anything, or
meaningfully disadvantaging anyone. These models today are already just
a `wget` away from any user. I think it mostly matters to _us_ how we
want to treat these models.

> Preferred Form of Modification
> ==============================
> [...]
> As a practical matter, for the non-monopolies in the free software
> ecosystem, the preferred form of modification for base models is the
> model themselves.

I would have strongly disagreed with this until a short while ago, and
stated that unless I can run a modified training process -- which would
require the training data -- I don't have the preferred form of
modification.

However, recent advances point to new useful models being built from
other models, for example what DeepSeek accomplished with Llama. They
obviously didn't have the original training data, yet still built
something very useful from the base model.

So I now have a slight doubt. But it is only slight; my gut says that
even many useful derivations cannot "heal" an initial problem of
free-ness. Because if the original base model were to disappear (as you
put it in "Free Today, Gone Tomorrow"), all derivations in the chain
would lose their reproducibility, too.

Best,
Christian
