On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berra...@redhat.com> writes:
>
> > On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> >> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> >> > The license of a code generation tool itself is usually considered
> >> > to be not a factor in the license of its output.
> >>
> >> Really? I would find it very surprising if a code generation tool that
> >> is not a language model, and so does not understand the code it is
> >> generating, did not include some of its own code snippets in the
> >> output. It is also possible to unintentionally run afoul of the GPL's
> >> definition of source code, which is "the preferred form of the work
> >> for making modifications to it". So even if you have copyright to the
> >> input, dumping just the output and putting the GPL on it might or
> >> might not be OK.
> >
> > Consider the C pre-processor. This takes an input .c file, expands
> > all the macros, and spits out a new .c file.
> >
> > The license of the output .c file is determined by the license of the
> > input .c file. The license of the CPP impl (whether OSS or proprietary)
> > doesn't have any influence on the license of the output file; it cannot
> > magically force the output file to be proprietary any more than it can
> > force it to be GPL.
>
> LLMs are just a tool like a compiler (albeit with spookier internals).
> The prompt and the instructions are arguably the more important part of
> how to get good results from the LLM transformation. In fact, most of
> the way I've been using them has been by pasting in some existing code
> and asking for a review or transformation of it.
>
> However, I totally get that with the various online LLMs you have very
> little transparency about what has gone into their training, and
> therefore there is a danger of proprietary code being hallucinated out
> of their matrices. Conversely, what if I use an LLM like OpenLLaMa:
>
>   https://github.com/openlm-research/open_llama
>
> where there are fairly exhaustive definitions of what went into the
> training data? Of most interest is probably the StarCoder dataset
> (paper):
>
>   https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
>
> for which there are tools to detect whether generated code has been
> lifted directly from the dataset or is indeed a transformation.
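(To make the pre-processor analogy quoted above concrete, here is a
minimal sketch; the file name is hypothetical and any macro-using
source would do.)

    /* input.c -- hypothetical example */
    #include <stdio.h>

    /* a trivial macro that cpp expands mechanically */
    #define GREET(name) printf("Hello, %s\n", name)

    int main(void)
    {
        GREET("world");  /* after cpp: printf("Hello, %s\n", "world"); */
        return 0;
    }

Running "cpp input.c" (or "gcc -E input.c") emits an expanded .c file in
which the macro call is replaced by the printf it stands for. The result
is a mechanical transformation of input.c, which is why its license
follows the input rather than the preprocessor that produced it.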
I've not looked at the links above, but if someone can make a compelling
argument that *specific* tools have sufficient transparency to be
compatible with signing the DCO, then I think we could maintain a list
of exceptions in the policy.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|