On Fri, 24 Feb 2023 at 05:23, Charles Plessy <ple...@debian.org> wrote:
>
> Dear Mo,
>
> thank you for the heads-up.
>
> I was using permissive licenses in the past thinking about making life
> easier to individuals, but I feel robbed by massive scrapping to train
> AI models.
>
> Just in case I updated my email signature.
>
> Also, is there a DFSG-free license that forces the training dataset and
> the result of the training process to be open source if a work under that
> license is present in the training data? Would GPLv3 be sufficient?
>

Dear Charles,

imagine that you have a collection of data files, images for example,
and each of them has its own license and copyright owner. The ML/AI
trained on that image set will produce a neural network which, for the
purpose of this example, is an N-dimensional grid of 64-bit floating
point numbers (the NN weights).

To be even more clear for those who have never trained a NN (almost all
the lawyers), I will present here a very simple and explicit example.

- The dataset is 100 images, protected by GPLv3 as a composition, like
  a database.

- Joe is the ML trainer, the employee who is going to train the ML.

- Joe legally acquired the dataset because all the licenses allow it.

- Joe wrote a script that renames the images to 00.jpg .. 99.jpg and
  ran it. This new dataset is still protected by the GPLv3.

- Joe wrote a script that randomly chooses 90 images as the learning
  set (LS) and the others as the test set (TS). These are two
  sub-compositions and both are covered by GPLv3 because both contain
  more than one file/piece of the original composition. In fact, in
  the way I applied the GPLv3 to the composition, I cannot enforce it
  on a single file/piece, because in that case I would change the
  license terms decided by the original author of that file/piece, and
  I do not want to do that even if I could (ethics).

- The ML aims to decide whether an image contains at least one dog or
  not: image input, binary output. Thus Joe can add his dog image to
  the TS; that image then becomes part of the TS composition, so he
  should share it under a license compatible with the GPLv3 on the
  composition. However, Joe is smart and does not want to share his
  dog image, which amounts to saying that we cannot prove that Joe put
  that image into the TS composition by moving the file into that
  folder. However, from a legal point of view, the simple fact that it
  is used as part of the TS clearly states his will to use his dog
  image as part of the training set.
  So, in principle, Joe is smart but honest and, to avoid legal issues
  for his employer, will share his dog image.

- So now the sharing pool brings a little information: the LS, the TS,
  and Joe's dog picture. One more image is only +1%, but that image
  can be very tricky/important for the ML, in the same way some
  patches are a single line but make a huge difference. So quantity is
  not a universal metric of contribution. Moreover, we now know which
  TS/LS Joe used to obtain the NN, which in some cases could be
  relevant information.

- Joe also needs to tag every LS image in order to back-propagate the
  feedback to the NN and train it. This can be a file in which each
  filename is associated with a binary label: dog or not. Again, Joe
  did not put this file into the LS folder but, as described above,
  that file is part of the training set to which a GPLv3 composition
  belongs. IMHO this means that Joe should also share this file. This
  information could also be relevant, because the most expensive job
  is labelling the data.

- Joe trains the NN with an ML engine which produces the NN weights
  matrix (BIN). This binary object is a derived work of a GPLv3
  composition, like a binary executable is a derived work of GPLv3
  source code. Thus Joe should share the BIN as well under the GPLv3
  terms, which also oblige him to explain the inner coding (BIN +
  format specifications). As you can imagine, this is another step
  towards freedom even if that BIN is supposed to run on patented
  hardware: because we know the format specification, we can write an
  emulator (much slower, and without commercial value because of its
  performance) that can be used for learning purposes or to check a
  questionable NN output.

- Joe tests the NN with the 10+1 images of the TS and decides whether
  the NN is fine or not. If he decides that it is fine and can go into
  production, then Joe's employer should share everything stated
  above.
  Instead, if he decides that it is crap, he will trash it and he does
  not have to share anything, because the sharing would have zero
  value for anyone. This is compliant with the fair use clause, in
  which I explicitly added "testing" as a condition to avoid sharing.
  After all, if there is no value produced, why should we force Joe to
  share his failure? In particular cases a failure (vulnerability) is
  valuable information, but for security reasons it is better that Joe
  is not forced to comply with the GPLv3 terms. It is better to give
  Joe the freedom to share only the information that he considers safe
  to share in public. However, if Joe's company does business with
  this (providing a PoC to a client, for example), then they have to
  comply with GPLv3 because, as stated, commercial and business uses
  are covered by the GPLv3.

- Joe is a student at university and his work has nothing to do with
  commercial / business purposes. However, if his university decides
  to use Joe's work for commercial or business purposes, then they
  should ask Joe for all the information which needs to be shared
  under the GPLv3 terms. This forces Joe to hand over that information
  when he delivers his work to his teacher, so that the university can
  also store the information that might or might not be shared in the
  future. Again, no value produced, then no need to share. After all,
  the work of Joe could be a completely useless failure and then
  rejected. We do not need to know about it.

To invalidate the application of the GPLv3 to the NN binary, someone
should explain, in legal terms compliant with some law, that training
a neural network is a completely different thing from compiling a
binary from source code. By the same analogy, compiling GPLv3 source
code does not imply that you have to share under GPLv3 the proprietary
compiler that was used for it, right? The same holds for the ML
training engine.
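
For those who prefer to read code, the first steps of the example above
(renaming to 00.jpg .. 99.jpg, the random 90/10 split into LS and TS,
adding Joe's picture and preparing the label file) can be sketched in a
few lines of Python. This is only an illustration under my own
assumptions: the function names, the fixed seed and the file name
joes_dog.jpg are hypothetical, and it works on file names only, without
touching any real image.

```python
import random


def renamed_names(count):
    """Return the new file names, e.g. 00.jpg .. 99.jpg for count == 100."""
    width = max(2, len(str(count - 1)))
    return [f"{i:0{width}d}.jpg" for i in range(count)]


def split_dataset(names, n_learning, seed):
    """Randomly pick n_learning names as learning set, the rest as test set."""
    rng = random.Random(seed)  # fixed seed (assumption) so the split is reproducible
    chosen = set(rng.sample(names, n_learning))
    learning_set = [n for n in names if n in chosen]
    test_set = [n for n in names if n not in chosen]
    return learning_set, test_set


names = renamed_names(100)                # the renamed GPLv3 composition
ls, ts = split_dataset(names, 90, seed=0)

ts.append("joes_dog.jpg")                 # Joe's own picture joins the TS (10+1)

# Label file for the LS: file name -> True (dog) / False (no dog).
# None means "not labelled yet"; the labelling itself is manual work.
labels = {name: None for name in ls}
```

Everything this sketch produces (the two lists, the extra picture and the
label mapping) corresponds to the items that, in the example, Joe should
share once the NN goes into production.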

Please feel free to contact me in person in order to go deeper into the
aspects which, as AI experts or law experts, you might want to
challenge or improve. I will be happy to read/hear from you. Just take
into consideration that every relevant discovery (good or bad) about
this new way of using the GPLv3 will be shared here or wherever I
decide to share it. So, if you are under NDA, I am not: do not
write/talk to me, or otherwise do it at your own risk. :-)

I hope this helps,
-- 
Roberto A. Foglietta
+49.176.274.75.661
+39.349.33.30.697