On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berra...@redhat.com> writes:
>
> > On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> >> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> >> > The license of a code generation tool itself is usually considered
> >> > to be not a factor in the license of its output.
> >>
> >> Really? I would find it very surprising if a code generation tool that
> >> is not a language model, and so does not understand the code it is
> >> generating, did not include some of its own code snippets in the
> >> output. It is also possible to unintentionally run afoul of the GPL's
> >> definition of source code, which is "the preferred form of the work
> >> for making modifications to it". So even if you have copyright to the
> >> input, dumping just the output and putting the GPL on it might or
> >> might not be OK.
> >
> > Consider the C pre-processor. This takes an input .c file, expands
> > all the macros, and spits out a new .c file.
> >
> > The license of the output .c file is determined by the license of the
> > input .c file. The license of the CPP impl (whether OSS or proprietary)
> > doesn't have any influence on the license of the output file; it cannot
> > magically force the output file to be proprietary any more than it can
> > force it to be GPL.
>
> LLMs are just a tool like a compiler (albeit with spookier internals).
> The prompt and the instructions are arguably the more important part of
> how to get good results from the LLM transformation. In fact, most of
> the way I've been using them has been by pasting in some existing code
> and asking for a review or transformation of it.
>
> However, I totally get that with the various online LLMs you have very
> little transparency about what has gone into their training, and
> therefore there is a danger of proprietary code being hallucinated out
> of their matrices. Conversely, what if I use an LLM like OpenLLaMa:
>
>   https://github.com/openlm-research/open_llama
>
> where there are fairly exhaustive definitions of what went into the
> training data? Of most interest is probably the StarCoder dataset
> (paper):
>
>   https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
>
> for which there are tools to detect whether generated code has been
> lifted directly from the dataset or is indeed a transformation.
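(To make the pre-processor analogy quoted above concrete, here is a
minimal sketch; the file name is hypothetical and any macro-using
source would do.)

    /* input.c -- hypothetical example */
    #include <stdio.h>

    /* a trivial macro that cpp expands mechanically */
    #define GREET(name) printf("Hello, %s\n", name)

    int main(void)
    {
        GREET("world");  /* after cpp: printf("Hello, %s\n", "world"); */
        return 0;
    }

Running "cpp input.c" (or "gcc -E input.c") emits an expanded .c file in
which the macro call is replaced by the printf it stands for. The result
is a mechanical transformation of input.c, which is why its license
follows the input rather than the preprocessor that produced it.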
I've not looked at the links above, but if someone can make a compelling
argument that *specific* tools have sufficient transparency to be
compatible with signing the DCO, then I think we could maintain a list
of exceptions in the policy.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|