On Sat, 10 May 2025 at 01:22, Russ Allbery <r...@debian.org> wrote:
>
> Matthias Urlichs <matth...@urlichs.de> writes:
>
> > I'm not disputing any of that. *Of course* we should write our rules and
> > laws to benefit humans / humanity, not robots or AIs or corporate
> > profiteering or what-have-you.
>
> > All I'm saying is that the idea "a human can examine a lot of
> > copyrighted stuff and then produce non-copyrighted output but a computer
> > cannot" might still hold some water today, but the bucket is leaky and
> > getting leakier every couple of months, if not weeks.
The thing is - I do not believe that "bucket" has ever existed. If you look
at it starting from fundamentals, I believe that becomes very obvious.

Say, for example, the output of:

$ cat /usr/share/common-licenses/GPL-2 | wc --bytes
18092

Is that number a derived work of the GPL licence? It is not. In fact it is
not creative or expressive enough to even have copyright.

aigarius@home:~$ cat /usr/share/common-licenses/GPL-2 | wc --words
2968

Same here.

$ sha256sum /usr/share/common-licenses/GPL-2
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643  /usr/share/common-licenses/GPL-2

Again - not really copyrightable and not a derivative work.

And how about:

$ cat /usr/share/common-licenses/GPL-2 | tr ' ' '\12' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head
    503
    194 the
    106 to
    104 of
     72 you
     64 and
     63 or
     55 a
     52 is
     50 program

The words themselves are not copyrightable - we already have a bunch of
wordlist packages in Debian main. And their frequencies in the document are
no different from the wc example above, just after filtering.

AI training and LLM training use the same kind of statistical
transformations, just a couple of steps more advanced than this: tracking
word pairs, or tracking the statistical chance of one word following
another if a certain third word appears in the attention context window (a
small bigram-counting sketch in the same style is appended after my
signature). But an intermediate step is always a bunch of
non-copyrightable statistical data.

So yes - computers *can* examine a copyrighted work and produce a
non-copyrighted result. The only thing that is changing is that this
non-copyrighted result is becoming more and more complex and more and more
useful.

As soon as you have a single intermediate step where the copyright of the
source does not survive to the output of that step (and assuming you only
use the output of that step, and no data from previous steps, further down
your pipeline), then *regardless* of further processing, the copyright of
the input data before that step no longer matters. The result may acquire a
new copyright (yours) if you do something creative enough with it, or amass
a sufficient amount of it to qualify for a database copyright. But the
copyright of the training data simply does not survive the step of
completely destructive statistical analysis.

-- 
Best regards,
Aigars Mahinovs
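
P.S. To make the "word pairs" step concrete, here is a minimal sketch in
the same shell style as the examples above. It is only my illustration of
the kind of intermediate data involved, not what any actual LLM training
pipeline runs; it counts adjacent word pairs (bigrams) within each line of
the GPL-2 text:

$ cat /usr/share/common-licenses/GPL-2 | tr 'A-Z' 'a-z' | awk '{for (i = 1; i < NF; i++) print $i, $(i+1)}' | sort | uniq -c | sort -nr | head

As before, the output is a frequency table - statistics about the text,
not the text itself.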