TL;DR: I think it is important for Debian to consider AI models free
even if those models are based on models that do not release their
training data. In the terms of the DFSG, I think that a model itself is
often a preferred form of modification for creating derived works. Put
another way, I don't think toxic candy is as toxic as I thought it was
when I read lumin's original ML policy.
If we focus too much on availability of data, I think we will help the
large players and force individuals and small contributors out of the
free software ecosystem.
I will be drafting a GR option to support this position.


Dear lumin:

First, thanks for all your work on AI and free software.
When I started my own AI explorations, I found your ML policy
inspirational in how I thought about AI and free software.
As those explorations have progressed (they often involve trying to
change or remove bias from models), I have come to think somewhat
differently than you did in your original ML policy.

I apologize that I did not include a lot of references in this message.
I found that I was having trouble coming up with enough time to write it
at all.  I wanted to give you some notice that I planned to draft what I
believe is a competing GR option, and doing that took the time I have.
I am not a researcher by trade, and I do not have at hand all the
references and links I wish I did.
I'm just a free software person who has been working on AI as a side
project, because I hope it can make parts of the world I care about
better.

As I understand it, you believe that:

1) Looking at the original training data would be the best approach for
trying to remove bias from a model.

2) It would be difficult/impossible to do that kind of work without
access to the original training data.

I have come to believe that:

1) AI models are not very transparent even if you have the training
data.  Taking advantage of the training data for a base model is
probably outside the scope of most of us even if we had it.  That's
definitely true for retraining a model, but I think it is also true for
understanding where bias is coming from.  That's for base models; I
think that fine tuning datasets for things like Open Assistant are
within the scope of the masses to examine and use.

2) I think that retraining, particularly with training techniques like
ORPO, is a more effective strategy for the democratized (read:
non-Google, non-Meta) community to change bias than using training data.
In other words, I am not convinced that we would use the training data,
even if we had it, to adjust the bias of our models.
Which is to say, I think the preferred form of modification for models
is often the model itself rather than the training data; see the sketch
below for what that retraining looks like in practice.
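
To make that concrete, here is a minimal sketch of what retraining with
ORPO looks like using Hugging Face's TRL library.  The model name, file
names, and hyperparameters are placeholders of my own, and exact
argument names vary a bit between TRL versions:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import ORPOConfig, ORPOTrainer

    # Any permissively licensed base model will do; this one is a placeholder.
    base = "mistralai/Mistral-7B-v0.1"
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # A preference dataset with "prompt", "chosen", and "rejected" columns,
    # for example pairs that steer the model away from a bias you want gone.
    dataset = load_dataset("json", data_files="debias_preferences.jsonl",
                           split="train")

    config = ORPOConfig(
        output_dir="debiased-model",
        beta=0.1,                       # weight of the odds-ratio preference term
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )

    trainer = ORPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )
    trainer.train()
    trainer.save_model("debiased-model")

Everything an individual needs here is the model weights and their own
preference data; the base model's original corpus never enters into it.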

Goals
=====


I think both of us are chiefly concerned with democratizing AI. I think we
are more interested in preserving individuals' ability to modify and
create software than we are in promoting monopolies or advantaging
OpenAI, Meta, Google, and the like.
I think we may disagree about how to do that.

With my Debian hat on, I don't really care whether base models are
considered free or non-free. I don't think it will be important for
Debian to include base models in our archive.
What I do care about is what we can do with software that takes base
models and adapts them for a particular use case.
If LibreOffice gained an AI assistant, our users would be well served
if we were able to include a high quality AI assistant that preserves
their core freedoms.
With my Debian hat on, I care more about what we can do with things
derived from base models than the base models themselves.

Core Freedoms
=============

I think that the core freedoms we care about are:

1) Being able to use software.
2) Being able to share software.
3) Being able to modify software.
4) Transparency: being able to understand how software works.

Debian has always valued transparency, but I think the DFSG and our
practices have always valued transparency less than the other freedoms.
There's nothing in the DFSG itself that requires transparency.
We've had plenty of arguments over the years about things like minimized
forms of code and whether they met the conditions of the DFSG.
One factor that has been raised is transparency, but it mostly gets
swept aside by the question of whether we can modify the software.
The idea appears to be that if we have the preferred form of
modification, that's transparent enough.
If the upstream doesn't have any advantages in transparency, well, we
decide that's free enough.

One argument that has come up over the years when looking at vendoring
is whether replacement is the preferred form of modification.
Say I have some vendored blob that is a minimized representation of an
upstream software project: minified JavaScript, or some form of
bytecode.
Most of the time I'm going to modify that by replacing the upstream
sources entirely with a new version.
So, at least for vendored code, is that good enough?
Generally we've decided that no, it is not.
We want individuals to be able to make arbitrary modifications to the
code, not just replace it.

My claim is that this analysis works differently for AI than for
minified JavaScript.

AI is Big
=========

I cannot get my head around how big AI training sets for base models
are.

I was recently looking at the DeepSeek Math paper [1]:

  [1]: https://arxiv.org/abs/2402.03300

As I understand it, they took their DeepSeek Coder model as a base.
So that's trained on some huge dataset--so big that they didn't even
want to repeat it.

Then they had a 1.2 billion token dataset (say 6G uncompressed text)
that they used for a training round--some sort of fine tuning round.

Then they applied 2**17 examples (so over a hundred thousand examples)
where they knew both the question and a correct answer.
But the impressive part for me was how the 1.2 billion token dataset was
produced.  I found the discussion of that process fascinating, but it
involved going over a significant chunk of the Common Crawl dataset,
which is mind-bogglingly huge, to figure out which fraction
of that dataset talks about math reasoning.
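
For scale, a quick back-of-envelope calculation (my own arithmetic, not
figures from the paper; the bytes-per-token ratio is an assumption):

    # Rough sizing of the numbers above.  ~5 bytes of UTF-8 text per token
    # is an assumed average and depends on the tokenizer.
    rl_examples = 2 ** 17                # question/answer pairs
    tokens = 1.2e9                       # the math training dataset
    approx_gb = tokens * 5 / 1e9         # roughly 6 GB of uncompressed text
    print(rl_examples, approx_gb)        # 131072 6.0

    # Common Crawl, which this dataset was filtered out of, runs to tens of
    # terabytes per crawl even compressed: orders of magnitude beyond this.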

Searching the 1.2 billion token dataset is clearly within our
capability.
But it's not at all clear to me that I could find what, in a 6G dataset,
is producing bias.
I think it would be quite possible to hide bias in such a dataset
intentionally, in such a way that even given the 1.2 billion tokens we
would find it difficult to remove the bias by modifying the dataset.  I
think there will certainly be unintentional bias there I could not find.

So, to really have the training data, we need Common Crawl, and we need
the scripts and random seeds necessary to reproduce the 1.2 billion
token dataset.
I also believe there was at least one language model in that process, so
you would also need the training data for that model.

I am quite sure that finding bias in something that large, or even
examining it, is outside the scope of all but well-funded players.

I am absolutely sure that reducing Common Crawl to the 1.2 billion
tokens--that is, actually running the data analysis, including all the
runs of any language models involved--is outside the scope of all but
well-funded players.  In other words, taking that original training data
and using it as a preferred form of modification is outside the scope of
everyone we want to center in our work.

And then we're left repeating the process for the base model, DeepSeek
Coder.

My position is that by taking this approach we've sacrificed
modifiability for transparency, and I am not even sure we have gained
transparency at a price that is available to the members of the
community we want to center in our work.
In this focus on data, we have taken the wrong value trade off for
Debian.
Debian has always put the ability to modify software first.

Free Today, Gone Tomorrow
=========================

One significant concern I know lumin is aware of with requiring data is
what happens when the data is available today but not tomorrow.
One of the models that tried to be as open as possible ran into problems
because they were forced to take down part of their dataset after the
model was released.
(I believe a copyright issue.)

The AI copyright landscape is very fluid.
Right now we do not know what is fair use.
We do not even have a firm ethical ground for what sharing of data for
AI training should look like in terms of social policy.

We run a significant risk that large chunks of free software will
depend on what we believe is a free model today, only to have it get
reclassified as non-free tomorrow when some of the training data is no
longer available to the public.

We run significant risks when different jurisdictions have different
laws.

It is very likely that there will be cases where models will still be
distributable even when some of the training datasets underlying the
model can no longer be distributed.

So, you say, let's have several models and switch from one to another if
we run into problems with one model.
Hold that in the back of your mind.  We'll come back to it.

Debian as Second Class
======================

I am concerned that if we are not careful the quality of models we are
able to offer our users will lag significantly behind the rest of the
world.
If we are much more strict than other free-software projects, we will
limit the models our users can use.
Significant sources of training data will be available to others but not
our users.
I suspect that models that only need to release data information rather
than training data will be of higher quality, because they can have
access to things like published books: works that can be freely used,
but not freely distributed, and the like.

Our social contract promises we will value our users and free software.
If we reduce the selection (and thus quality) of what we offer our
users, it should somehow serve free software.
In this instance, I believe that the restriction probably does not serve
transparency and harms our core goal of making software easy to modify.
In other words, I do not believe free software is being helped enough to
justify disadvantaging our users.

Preferred Form of Modification
==============================

I talked earlier about how if one model ended up being non-free, we
could switch to another one.
That happens all the time in the AI ecosystem.
A software system has a fine tuning dataset.
Its authors might fine tune a version of Llama3, Mistral, or one of the
newer models, all against that dataset.
They will pick the one that performs best.
As new models come out, the base model for some software might switch.

As a practical matter, for the non-monopolies in the free software
ecosystem, the preferred form of modification for base models is the
models themselves.
We switch out models and then adjust the code on top of that, using
various fine tuning and prompt engineering techniques to adapt a model,
as sketched below.
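
As a sketch of that workflow (illustrative model names and paths, not a
real project): the base model is just a swappable identifier, and the
project's own work lives in an adapter trained on its own data, for
example with the PEFT library:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Change this one line to switch to a better (or freer) base model.
    BASE_MODEL = "mistralai/Mistral-7B-v0.1"
    ADAPTER = "./assistant-adapter"   # fine-tuned on the project's own dataset

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(model, ADAPTER)  # layer the adapter on top

Moving to a new base model is then a matter of changing BASE_MODEL and
retraining the adapter against the same project dataset.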

The entire ecosystem has evolved to support this.  There are
competitions between models with similar (or the same) inputs.
There are sites that allow you to interact with more than one model at
once so you can choose which works best for you and switch out.  (Or get
around biases or restrictions, perhaps using ChatGPT to write part of a
story, and a more open model to write adult scenes that ChatGPT would
refuse to write.)

On the other hand, I did talk about fine tuning and task-specific or
program-specific datasets.
Many of those are at scopes we could modify, and fine tuning models (or
producing adapters) based on those datasets is part of the preferred
form of modification for the programs involved.

What I want for Debian
======================

Here's what I want to be able to do for Debian:

* First, the bits of the model--its code and parameters--need to be
  under a DFSG-free license.  So Llama3 is never going to meet Debian
  main's needs under its current license.

* We look at what the software authors actually do to modify models they
  incorporate to determine the preferred form of modification. If in
  practice they switch out base models and fine tune, that's okay.  In
  this situation we probably would need full access to the fine tuning
  data, but not the training data for the base model.

* Where it is plausible that the preferred form of modification works
  this way, we effectively cut off the source chain there and do not
  look further.  If you are integrating model x into your software, your
  software is free if model x is under a free license and any fine
  tuning data/scripts you use are free.  I.e., if our users could
  actually go from upstream model x to what the software uses, that's
  DFSG-free enough even if the user could not reproduce model x itself.

I firmly believe that the ability to retrain models to change their bias
without access to the original training data will only continue to
improve.
Especially with techniques like ORPO, my explorations suggest that for
smaller models we may already have reached a point that is good enough
for free software.

So What about the OSI Definition
================================

I don't know.
I think it depends on how the OSI definition treats derivative works.
If what we're saying is base models need to release training data, I
think that would harm the free software community.
It would mean free models were always of lower quality than proprietary
models, at least unless the fair use cases go in a direction where all
the models are of low quality.
I think data information is best for base models.

If instead what we're saying is that OSI's definition is more focused
on software incorporating models, and that it is okay to use a model
without fully specified data as an input so long as you give all the
data for what you do to that model in your program, I could agree.

If we are saying that to be open source software, any model you use
needs to provide full training data all the way back to the original
training run from randomly initialized parameters, I think that would
harm our community.
