Hi all,

It looks like we haven't discussed this much and haven't settled on a policy 
for what kinds of AI-generated contributions we accept and what vetting is 
required for them.

https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results.

```
Given the above, code generated in whole or in part using AI can be contributed 
if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any 
restrictions on use of the output that would be inconsistent with the Open 
Source Definition.
2. At least one of the following conditions is met:
    2.1 The output is not copyrightable subject matter (and would not be even 
if produced by a human).
    2.2 No third party materials are included in the output.
    2.3 Any third party materials that are included in the output are being 
used with permission (e.g., under a compatible open-source license) of the 
third party copyright holders and in compliance with the applicable license 
terms.
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are 
met if the AI tool itself provides sufficient information about output that may 
be similar to training data, or from code scanning results.
```

There is a lot to unpack there, but it seems like any one of the sub-conditions 
of 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.

2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 set a pretty 
high bar in that it's hard to know whether you have met them. Do we have anyone 
in the community already running code scanning tools?

Here is the JIRA for addition of the generative AI policy: 
https://issues.apache.org/jira/browse/LEGAL-631
Legal mailing list discussion of the policy: 
https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
Legal mailing list discussion of compliant tools: 
https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
Legal mailing list discussion about how OpenAI terms are not Apache 
compatible: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
Hadoop mailing list message hinting that they accept such contributions but 
ask which tool was used: https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
Spark mailing list message where they have given up on stopping people: 
https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr

I didn't see other projects discussing and deciding how to handle these 
contributions, but I also only checked a few of them: Hadoop, Spark, Druid, 
and Pulsar. I also can't see their PMC mailing lists.

I asked O3 to do deep research on what is done to avoid producing copyrighted 
code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d

To summarize: the training data is deduplicated so the model is less likely to 
reproduce it verbatim, prompts and fine-tuning discourage reproducing things 
verbatim, inference sampling is biased to not always pick the most likely 
option but a neighboring one (encouraging originality), and in some instances 
the output is checked to make sure it doesn't match the training data. So to 
some extent 2.2 is being addressed, to different degrees depending on which 
product you are using.

It's worth noting that output scanning can be probabilistic (in the case of, 
say, Anthropic), and they still recommend code scanning.

Quite notably, Anthropic indemnifies its enterprise users against copyright 
claims. It's not perfect, but it does mean they have an incentive to make sure 
there are fewer copyright claims. We could choose to be picky and only accept 
specific sources of LLM-generated code based on perceived safety.

I think not reproducing copyrighted output from training data is technically 
feasible for these vendors, so I have a moderate level of trust that they will 
succeed at it if they say they do it.

I could send a message to the legal list asking for clarification and a set of 
approved tools, but based on Roman's communication 
(https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think this 
is kind of what we get. It's on us to ensure the contributions are kosher, 
either by code scanning or by accepting that the LLM vendors are doing a good 
job at avoiding copyrighted output.

My personal opinion is that we should at least consider allow-listing a few 
specific sources (any vendor that scans output for infringement) and add that 
to the PR template and other locations (README, web site). Bonus points if we 
can set up code scanning (useful for non-AI contributions too!).
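
For concreteness, here is a rough sketch of what a PR template addition could 
look like. The wording and the idea of recording the tool name are just my 
suggestion, not settled policy:

```markdown
### Generative AI disclosure
- [ ] No portion of this PR was generated by an AI tool, OR
- [ ] Portions were generated by an allow-listed AI tool
      (tool and version: ______), and I believe the conditions of the
      ASF generative tooling policy are met
      (https://www.apache.org/legal/generative-tooling.html)
```

Checking boxes in the template at least forces contributors to think about the 
question and gives us a record to point at later.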

Regards,
Ariel
