TL;DR: I think it is important for Debian to consider AI models free even if those models are based on models that do not release their training data. In terms of the DFSG, I think that a model itself is often a preferred form of modification for creating derived works. Put another way, I don't think toxic candy is as toxic as I thought it was when I read lumin's original ML policy. If we focus too much on the availability of data, I think we will help the large players and force individuals and small contributors out of the free software ecosystem. I will be drafting a GR option to support this position.

Dear lumin:

First, thanks for all your work on AI and free software. When I started my own AI explorations, I found your ML policy inspirational in how I thought about AI and free software. As I have begun my own explorations, which often involve trying to change or remove bias from models, I have come to think somewhat differently than you did in your original ML policy.

I apologize that I did not include a lot of references in this message. I found that I was having trouble coming up with enough time to write it at all. I wanted to give you some notice that I planned to draft what I believe is a competing GR option, and doing that took the time I have. I am not a researcher by trade, and I do not have all the references and links I wish I did handy. I'm just a free software person who has been working on AI as a side project, because I hope it can make parts of the world I care about better.

As I understand it, you believe that:

1) Looking at the original training data would be the best approach for trying to remove bias from a model.

2) It would be difficult or impossible to do that kind of work without access to the original training data.

I have come to believe that:

1) AI models are not very transparent even if you have the training data. Taking advantage of the training data for a base model is probably outside the scope of most of us even if we had it. That's definitely true for retraining a model, but I think it is also true for understanding where bias is coming from. That's for base models. I think that fine-tuning datasets for things like Open Assistant are within the scope of the masses to examine and use.

2) I think that retraining, particularly with training techniques like ORPO, is a more effective strategy for the democratized (read: non-Google, non-Meta) community to change bias than using training data. In other words, I am not convinced that we would use training data even if we had it to adjust the bias of our models.

Which is to say, I think the preferred form of modification for models is often the model itself rather than the training data.

Goals
=====

I think both of us are concerned with democratizing AI. I think we are more interested in preserving individuals' ability to modify and create software than we are in promoting monopolies or advantaging OpenAI, Meta, Google, and the like. I think we may disagree about how to do that.

With my Debian hat on, I don't really care whether base models are considered free or non-free. I don't think it will be important for Debian to include base models in our archive. What I do care about is what we can do with software that takes base models and adapts them for a particular use case. If LibreOffice gained an AI assistant, our users would be well served if we were able to include a high-quality AI assistant that preserves their core freedoms. With my Debian hat on, I care more about what we can do with things derived from base models than about the base models themselves.

Core Freedoms
=============

I think that the core freedoms we care about are:

1) Being able to use software.

2) Being able to share software.

3) Being able to modify software.

4) Transparency: being able to understand how software works.

Debian has always valued transparency, but I think the DFSG and our practices have always valued transparency less than the other freedoms. There's nothing in the DFSG itself that requires transparency. We've had plenty of arguments over the years about things like minimized forms of code and whether they met the conditions of the DFSG.

One factor that has been raised is transparency, but it mostly gets swept aside by the question of whether we can modify the software. The idea appears to be that if we have the preferred form of modification, that's transparent enough. If the upstream doesn't have any advantages in transparency, well, we decide that's free enough.

One argument that has come up over the years when looking at vendoring is whether replacement is the preferred form of modification. I have some vendored blob that is a minimized representation of an upstream software project--say minimized JavaScript or some form of bytecode. Most of the time I'm going to modify that by replacing the upstream sources entirely with a new version. So, at least for vendored code, is that good enough? Generally we've decided that no, it is not. We want individuals to be able to make arbitrary modifications to the code, not just replace it.

My claim is that this analysis works differently for AI than for minimized JavaScript.

AI is Big
=========

I cannot get my head around how big AI training sets for base models are. I was recently looking at the DeepSeekMath paper [1]:

[1]: https://arxiv.org/abs/2402.03300

As I understand it, they took their DeepSeek Coder model as a base. So that's trained on some huge dataset--so big that they didn't even want to repeat it. Then they had a 1.2 billion token dataset (say 6G of uncompressed text) that they used for a training round--some sort of fine-tuning round. Then they applied 2**17 examples (so over a hundred thousand examples) where they knew both the question and a correct answer.

But the impressive part for me was how the 1.2 billion token dataset was produced. I found the discussion of that process fascinating, but it involved going over a significant chunk of the Common Crawl dataset, which is mind-bogglingly, unbelievably huge, to figure out which fraction of that dataset talks about math reasoning.

Searching the 1.2 billion token dataset is clearly within our capability. But it's not at all clear to me that I could find what in a 6G dataset is going to be producing bias. I think it would be quite possible to hide bias in such a dataset intentionally, in such a way that even given the 1.2 billion tokens we would find it difficult to remove the bias by modifying the dataset. I think there will certainly be unintentional bias there I could not find.

So, to really have the training data, we need Common Crawl, and we need the scripts and random seeds necessary to reproduce the 1.2 billion token dataset. I also believe there was at least one language model in that process, so you would also need the training data for that model. I am quite sure that finding bias in something that large, or even examining it, is outside the scope of all but well-funded players. I am absolutely sure that reducing Common Crawl to the 1.2 billion tokens--that is, actually running the data analysis, including all the runs of any language models involved--is outside the scope of all but well-funded players.

In other words, taking that original training data and using it as a preferred form of modification is outside the scope of everyone we want to center in our work. And then we're left repeating the process for the base model, DeepSeek Coder.

My position is that by taking this approach we've sacrificed modifiability for transparency, and I am not even sure we have gained transparency at a price that is available to the members of the community we want to center in our work.
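
To put rough numbers on why I say this is out of scope for most of us, here is a back-of-envelope sketch in Python. The token and example counts are the figures I quoted above; the bytes-per-token ratio and the remark about Common Crawl's size are my own approximations.

    # Back-of-envelope arithmetic for the scale argument above.
    corpus_tokens = 1.2e9      # the filtered math-reasoning dataset
    bytes_per_token = 5        # rough average for plain English text
    corpus_gb = corpus_tokens * bytes_per_token / 1e9
    print(f"filtered corpus: roughly {corpus_gb:.0f} GB of text")  # ~6 GB

    qa_examples = 2 ** 17
    print(f"question/answer examples: {qa_examples}")              # 131072

    # Grepping or auditing ~6 GB of text is a laptop-scale job.  Re-running
    # the filtering that selected those 6 GB out of Common Crawl is not:
    # Common Crawl's raw archives run to petabytes, and the selection loop
    # repeats classifier/model inference over a large slice of them.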

In this focus on data, we have taken the wrong value trade-off for Debian. Debian has always put the ability to modify software first.

Free Today, Gone Tomorrow
=========================

One significant concern I know lumin is aware of with requiring data is what happens when the data is available today but not tomorrow. One of the models that tried to be as open as possible ran into problems because they were forced to take down part of their dataset after the model was released. (I believe it was a copyright issue.) The AI copyright landscape is very fluid. Right now we do not know what is fair use. We do not even have a firm ethical ground for what sharing of data for AI training should look like in terms of social policy.

We run a significant risk that significant chunks of free software will depend on what we believe is a free model today, only to have it get reclassified as non-free tomorrow when some of the training data is no longer available to the public. We run significant risks when different jurisdictions have different laws. It is very likely that there will be cases where models will still be distributable even when some of the training datasets underlying the model can no longer be distributed.

So, you say, let's have several models and switch from one to another if we run into problems with one model. Hold that in the back of your mind; we'll come back to it.

Debian as Second Class
======================

I am concerned that if we are not careful, the quality of models we are able to offer our users will lag significantly behind the rest of the world. If we are much more strict than other free-software projects, we will limit the models our users can use. Significant sources of training data will be available to others but not our users. I suspect that models that only need to release data information rather than training data will be higher quality, because they can have access to things like published books, works that can be freely used but not freely distributed, and the like.

Our social contract promises we will value our users and free software. If we reduce the selection (and thus quality) of what we offer our users, it should somehow serve free software. In this instance, I believe that it probably does not serve transparency and harms our core goal of making software easy to modify. In other words, I do not believe free software is being helped enough to justify disadvantaging our users.

Preferred Form of Modification
==============================

I talked earlier about how if one model ended up being non-free, we could switch to another one. That happens all the time in the AI ecosystem. A software system has a fine-tuning dataset. Its developers might fine-tune a version of Llama3, Mistral, or one of the newer models against their dataset. They will pick the one that performs best. As new models come out, the base model for some software might switch.

As a practical matter, for the non-monopolies in the free software ecosystem, the preferred form of modification for base models is the models themselves. We switch out models and then adjust the code on top of that, using various fine-tuning and prompt-engineering techniques to adapt a model. The entire ecosystem has evolved to support this. There are competitions between models with similar (or the same) inputs. There are sites that allow you to interact with more than one model at once so you can choose which works best for you and switch out.
(Or get around biases or restrictions, perhaps using ChatGPT to write part of a story and a more open model to write adult scenes that ChatGPT would refuse to write.)

On the other hand, I did talk about fine-tuning and task-specific or program-specific datasets. Many of those are at a scale we could modify, and fine-tuning models (or producing adapters) based on those datasets is part of the preferred form of modification for the programs involved.

What I want for Debian
======================

Here's what I want to be able to do for Debian:

* First, the bits of the model--its code and parameters--need to be under a DFSG-free license. So Llama3 is never going to meet Debian main's needs under its current license.

* We look at what the software authors actually do to modify the models they incorporate to determine the preferred form of modification. If in practice they switch out base models and fine-tune, that's okay. In this situation we would probably need full access to the fine-tuning data, but not the training data for the base model.

* Where it is plausible that the preferred form of modification works this way, we effectively cut off the source code there and do not look further. If you are integrating model x into your software, your software is free if model x is under a free license and any fine-tuning data/scripts you use are free. I.e., if our users could actually go from upstream model x to what the software uses, that's DFSG-free enough even if the user could not reproduce model x itself.

I firmly believe that the ability to retrain models to change their bias without access to the original training data will only continue to improve. Especially with techniques like ORPO, my explorations suggest that for smaller models we may already have reached a point that is good enough for free software. (A small sketch of what that retraining workflow looks like appears at the end of this message.)

So What about the OSI Definition
================================

I don't know. I think it depends on how the OSI definition treats derivative works.

If what we're saying is that base models need to release training data, I think that would harm the free software community. It would mean free models were always of lower quality than proprietary models, at least unless the fair-use cases go in a direction where all the models are of low quality. I think data information is best for base models.

If instead what we're saying is that OSI's definition is more focused on software incorporating models, and it is okay to use a model without fully specified data as an input so long as you give all the data for what you do to that model in your program, I could agree.

If we are saying that to be open-source software, any model you use needs to provide full training data all the way back to the original training run starting from random parameters, I think that would harm our community.
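
To make the "retrain rather than re-derive" workflow I keep referring to concrete, here is a minimal sketch of the kind of thing I have been experimenting with. It uses the Hugging Face transformers, datasets, peft, and trl libraries; the base-model identifier and the preference-pair file are placeholders, argument names shift a bit between trl versions, and this is an illustration of the shape of the inputs rather than a recipe.

    # Sketch: adjust a model's behaviour/bias with ORPO preference tuning,
    # treating the base model as a swappable input rather than something
    # we rebuild from its original training data.
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import ORPOConfig, ORPOTrainer

    # Swap this out as better (or freer) base models appear.
    BASE_MODEL = "some-org/some-dfsg-free-base-model"   # placeholder

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Our own, fully distributable preference data: each record holds a
    # "prompt", a "chosen" answer, and a "rejected" answer.
    pairs = load_dataset("json", data_files="debias-pairs.jsonl",  # placeholder
                         split="train")

    config = ORPOConfig(
        output_dir="adapter-out",
        beta=0.1,                      # weight of the odds-ratio preference term
        per_device_train_batch_size=2,
        num_train_epochs=1,
        max_length=1024,
    )

    trainer = ORPOTrainer(
        model=model,
        args=config,
        train_dataset=pairs,
        tokenizer=tokenizer,
        # Train a small LoRA adapter instead of updating every parameter.
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    trainer.save_model("adapter-out")

What I care about here is what has to be shipped for someone else to reproduce the result: a free base model, a preference dataset and a short script we can distribute in full, and an adapter anyone can regenerate from those two things. None of it requires access to the base model's original training corpus.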