On Thu, Nov 23, 2023 at 09:35:43AM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> > There has been an explosion of interest in so-called "AI" (LLM)
> > code generators in the past year or so. Thus far though, this
> > has not been matched by a broadly accepted legal interpretation
> > of the licensing implications for code generator outputs. While
> > the vendors may claim there is no problem and a free choice of
> > license is possible, they have an inherent conflict of interest
> > in promoting this interpretation. More broadly there is, as yet,
> > no consensus on the licensing implications of code generators
> > trained on inputs under a wide variety of licenses.
> >
> > The DCO requires contributors to assert they have the right to
> > contribute under the designated project license. Given the lack
> > of consensus on the licensing of "AI" (LLM) code generator output,
> > it is not considered credible to assert compliance with DCO
> > clauses (b) or (c) where a patch includes such generated code.
> >
> > This patch thus defines a policy that the QEMU project will not
> > accept contributions where use of "AI" (LLM) code generators is
> > either known or suspected.
> >
> > Signed-off-by: Daniel P. Berrangé <berra...@redhat.com>
> > ---
> >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > index b4591a2dec..a6e42c6b1b 100644
> > --- a/docs/devel/code-provenance.rst
> > +++ b/docs/devel/code-provenance.rst
> > @@ -195,3 +195,43 @@ example::
> >     Signed-off-by: Some Person <some.per...@example.com>
> >     [Rebased and added support for 'foo']
> >     Signed-off-by: New Person <new.per...@example.com>
> > +
> > +Use of "AI" (LLM) code generators
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +TL;DR:
> > +
> > +  **Current QEMU project policy is to DECLINE any contributions
> > +  which are believed to include or derive from "AI" (LLM)
> > +  generated code.**
> > +
> > +The existence of "AI" (`Large Language Model
> > +<https://en.wikipedia.org/wiki/Large_language_model>`__ / LLM)
> > +code generators raises a number of difficult legal questions,
> > +several of which impact on Open Source projects. As noted earlier, the
> > +QEMU community requires that contributors certify their patch submissions
> > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > +patch contains "AI" generated code, this raises difficulties with code
> > +provenance and thus DCO compliance.
> > +
> > +To satisfy the DCO, the patch contributor has to fully understand
> > +the origins and license of code they are contributing to QEMU. The
> > +license terms that should apply to the output of an "AI" code generator
> > +are ill-defined, given that both the training data and the operation of
> > +the "AI" are typically opaque to the user. Even where the training data
> > +is said to all be open source, it will likely be under a wide variety
> > +of license terms.
> > +
> > +While the vendors of "AI" code generators may promote the idea that
> > +code output can be taken under a free choice of license, this is not
> > +yet considered to be a generally accepted, nor tested, legal opinion.
> > +
> > +With this in mind, the QEMU maintainers do not consider it
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
> > +
> > +The QEMU maintainers thus require that contributors refrain from using
> > +"AI" code generators on patches intended to be submitted to the project,
> > +and will decline any contribution if use of "AI" is known or suspected.
> > +
> > +Examples of tools impacted by this policy include both GitHub Copilot
> > +and ChatGPT, amongst many others which are less well known.
>
> So you called out these two by name, fine, but given "AI" is in scare
> quotes I don't really know what is or is not allowed, and I don't know
> how contributors will know. Is the "AI" that one must not use
> necessarily an LLM? And how do you define LLM even? Wikipedia says
> "general-purpose language understanding and generation".
I used "AI" in quotes because I think it can mean different things
to different people. In practical terms it has become a bit of a
catch-all term for a wide variety of tools. Thus I think the quotes
serve to express this as a loose generalization, rather than a
precise definition. The same goes for "LLM": I don't want to try to
define it, as it has also become somewhat of a general term.

> All this seems vague to me.

Deliberately so, as there are a wide variety of tools working in
varying ways, but all with similar caveats around the licensing of
the output "derivative" work.

> However, can't we define a simpler, more specific policy?
> For example, isn't it true that *any* automatically generated code
> can only be included if the scripts producing said code
> are also included or otherwise available under GPLv2?

The license of a code generation tool itself is usually not
considered a factor in the license of its output. In most cases the
license of the input data will determine the license of the output
data, since the latter is a derivative work of the former. The
person running the tool will typically know exactly what the input
data is, and so can have confidence in the license of the output.

If there are questions about whether the output is a derivative of
the tool's code itself, then the tool author can provide a
disclaimer for this. Such a disclaimer, though, would not erase the
derivative link between input data and output data.

One example is GCC, where the output .o/executable is a derivative
of the input .c. The output, however, may also link to the GCC
runtime library, and so GCC has a license exception saying that this
runtime linkage doesn't affect the license of the output program.
This is OK, since the GCC authors who added this exception own the
copyright over the runtime library they're adding an exception for.

If we apply this to LLMs, the output of the LLM is a derivative of
the training data. The output is not a derivative of the LLM code.
The LLM copyright holders could make this latter point explicit,
since they own the copyright of the LLM code, but they do not own
the copyright of the training data, and neither does the person
using the LLM; hence the legal uncertainty.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|