Clint Adams <cl...@debian.org> writes:

> I'm not sure that these are quite the right terms. This email itself
> is non-free software, but if Sam wants to train some kind of deep
> learning model on it and release the model, without training data,
> under the Expat license, I definitely would not refer to the model
> as non-free. Would I prefer that copyright law be abolished and
> there be no impediments to providing the training data as well?
> Of course I would. But, absent that, there would be no way for Sam
> to distribute the training data as free software.
I'm not sure that I agree that it would be great if copyright law were abolished. I think it's deeply flawed, and I can certainly imagine different legal structures for achieving some of the same goals that I think would be superior, but right now, for all of its many problems, copyright law is one of the few tools that we have for consent.

One of my problems with the stance that Aigars has summarized (not his fault -- it's a common view) is that consent should not be required to train models. I think your point is that someone training a Bayesian filter on my email messages should not require my consent. My views on that are more complicated. I think there are circumstances when it shouldn't require that consent and circumstances when it should, and it's a tricky moral question that, for me, is heavily influenced by how the model is used.

But let me slide down the slippery slope a bit farther and present a case that I think is a natural extension of that position. Suppose that instead of training a Bayesian spam filter on a bunch of mail messages without explicit consent, someone instead gathered every email message that I had ever sent to a public mailing list and used them to train an LLM to impersonate me. I don't think someone should be allowed to do that without my consent. Right now, the tool I have for expressing that consent is based on copyright law, for better or worse.

Now, there is a pretty good argument here that copyright law is the wrong tool to prevent that and we should have other laws that tackle that directly, such as the laws now being passed, independent of copyright law, to prohibit "nudification" image transformation models. And I would agree! But those laws largely don't exist right now, copyright law does, and until someone fixes the problem in some other way, I don't want to give up the protection that I may still have, even if it's murky and contingent.
That is about larger questions of morality and law, but what I would say about Debian's rules specifically is that we should have some obligation to behave ethically. That's going to mean different things to different people, and we quite rightly don't incorporate in our foundation documents ethical principles beyond the scope of free software. But I still have my personal ethics, and those will guide my vote on questions of what ethics Debian should adopt around free software.

I think using other people's work without their consent is sometimes unethical. For me, it depends a *lot* on the circumstances, but I think machine learning models, and LLMs and image manipulation models in particular, have opened new frontiers for unethical things that can be done using other people's work. This is not equivalent to the existing human capability to do the same thing manually, precisely because the whole point of writing computer programs to do something is that you can do that thing cheaply and at scale. Some other human being can, today, study my writing style and try to impersonate me, and I can't stop that with copyright law. I understand that. But that is hard, manual work, and it's very difficult for someone to keep it up at length. An LLM trained on my writing can potentially impersonate me trivially and extensively, for essentially free.

Debian's free software principles cannot solve all, or even most, problems in this world. But I think they are both directly relevant and rather good at addressing at least Debian's involvement in this sort of activity. Applying free software rules to training data is a bit of a heavy hammer, and maybe it's too much, but it does hold an ethical line about consent that I think we should hold. Maybe there's a different way to hold that line, and I'm open to being convinced by a different approach, but I don't want to give up this ethical line completely.

-- 
Russ Allbery (r...@debian.org)              <https://www.eyrie.org/~eagle/>