On 4/6/25 10:40, Daniel P. Berrangé wrote:
On Wed, Jun 04, 2025 at 09:54:33AM +0200, Philippe Mathieu-Daudé wrote:
On 4/6/25 09:15, Daniel P. Berrangé wrote:
On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
Stefan Hajnoczi <stefa...@gmail.com> writes:
On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <arm...@redhat.com> wrote:
From: Daniel P. Berrangé <berra...@redhat.com>
>> +
+The increasing prevalence of AI code generators, most notably but not limited
More detail is needed on what an "AI code generator" is. Coding
assistant tools range from autocompletion to linters to automatic code
generators. In addition there are other AI-related tools like ChatGPT
or Gemini as a chatbot that people can use like Stack Overflow or an
API documentation summarizer.
I think the intent is to say: do not put code that comes from _any_ AI
tool into QEMU.
It would be okay to use AI to research APIs, algorithms, brainstorm
ideas, debug the code, analyze the code, etc but the actual code
changes must not be generated by AI.
The scope of the policy is around contributions we receive as
patches with SoB. Researching / brainstorming / analysis etc
are not contribution activities, so not covered by the policy
IMHO.
The existing text is about "AI code generators". However, the "most
notably LLMs" that follows it could lead readers to believe it's about
more than just code generation, because LLMs are in fact used for more.
I figure this is your concern.
We could instead start wide, then narrow the focus to code generation.
Here's my try:
The increasing prevalence of AI-assisted software development results
in a number of difficult legal questions and risks for software
projects, including QEMU. Of particular concern is code generated by
`Large Language Models
<https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
Documentation we maintain has the same concerns as code.
So I'd suggest to substitute 'code' with 'code / content'.
Why couldn't we accept documentation patches improved using LLM?
I would flip it around and ask why would documentation not be held
to the same standard as code, when it comes to licensing and legal
compliance ?
This is all copyright content that we merge & distribute under the
same QEMU licensing terms, and we have the same legal obligations
whether it is "source code" or "documentation" or other content
that is not traditional "source code" (images for example).
As a non-native English speaker who is often stuck trying to describe
function APIs, I'm very tempted to use an LLM to review my sentences
and make them easier to understand.
I can understand that desire, and it is an admittedly tricky situation
and tradeoff for which I don't have a great answer.
As a starting point we (as reviewers/maintainers) must be broadly
very tolerant & accepting of content that is not perfect English,
because we know many (probably even the majority of) contributors
won't have English as their first language.
As a reviewer I don't mind imperfect language in submissions. Even
if language is not perfect it is at least a direct expression of
the author's understanding and thus we can have a level of trust
in the docs based on our community experience with the contributor.
If docs have been altered in any significant manner by an LLM,
even if they are linguistically improved, IMHO, knowing that an
LLM was used would reduce my personal trust in the technical accuracy
of the contribution.
This is straying into the debate around the accuracy of LLMs though,
which is interesting, but tangential to the purpose of this policy
which aims to focus on the code provenance / legal side.
So, back on track, an important point is that this policy (& the
legal concerns/risks it attempts to address) is implicitly
about contributions that can be considered copyrightable.
Some so called "trivial" work can be so simplistic as to not meet
the threshold for copyright protection, and it is thus easy for the
DCO requirements to be satisfied.
When you, as a person, write the API documentation from scratch,
your output would generally be considered a copyrightable
contribution by you as the author.
When a reviewer then suggests changes to your docs, most of the
time those changes are so trivial, that the reviewer wouldn't be
claiming copyright over the resulting work.
If the reviewer completely rewrites entire sentences in the
docs though, they would be able to claim copyright over part
of the resulting work.
The tipping point between copyrightable/non-copyrightable is
hard to define in a policy. It is inherently fuzzy, and somewhat
of a "you'll know it when you see it" or "let's debate it in court"
situation...
So back to LLMs.
If you ask the LLM (or an agent using an LLM) to entirely write
the API docs from scratch, I think that should be expected to
fall under this proposed contribution policy in general.
If you write the API docs yourself and ask the LLM to review and
suggest improvements, that MAY or MAY NOT fall under this policy.
If the LLM-suggested tweaks were minor enough not to meet the
threshold to be copyrightable, it would be fine; this is little
different from a human reviewer suggesting tweaks.
Good.
If the LLM suggested large-scale rewriting, it would be harder
to draw the line, but that would tend towards falling under this
contribution policy.
So it depends on the scope of what the LLM suggested as a change
to your docs.
IOW, LLM-as-sparkling-auto-correct is probably OK, but
LLM-as-book-editor / LLM-as-ghost-writer is probably NOT OK
OK.
This is a scenario where the QEMU contributor has to use their
personal judgement as to whether their use of LLM in a docs context
is compliant with this policy, or not. I don't think we should try
to describe this in the policy given how fuzzy the situation is.
Thank you very much for this detailed explanation!
NB, this copyrightable/non-copyrightable situation applies to source
code too, not just docs.
With regards,
Daniel