Legal’s opinion is that this is not an acceptable workaround to the policy.

On 22 Sep 2023, at 23:51, German Eichberger via dev <dev@cassandra.apache.org> wrote:
+1 with taking it to legal
Like anyone else, I enjoy speculating about legal stuff, and I think for jars you probably need plausible deniability aka no paper trail that we knowingly... but that horse is out of the barn. So I'm really interested in what legal says
🙂
Thanks,
German
From: Josh McKenzie <jmcken...@apache.org>
Sent: Friday, September 22, 2023 7:43 AM
To: dev <dev@cassandra.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30
I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that there are literally dozens of legal jurisdictions in which this project is used and the foundation operates.
As a PMC let's take this to legal.
On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.
- my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright
I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I'm not sure where that got to and I don't remember all the details.
But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.
This GenAI-generated-code discussion should probably be its own mailing list DISCUSS thread? It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on, and not only to vector-related code contributions.
So if we're going to chat about GenAI on this thread here, 2 things:
- A dependency we pull in != a code contribution (I am not a lawyer, but my understanding is that with the former, the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright, and it's not sticky). Easier to transition to a different dep if there's something API compatible or similar.
- With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.
For this thread, here's an excerpt from the ASF policy:
a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool's training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT's terms are inconsistent).
2. At least one of the following conditions is met:
   2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
   2.2. No third party materials are included in the output.
   2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms.
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results.
   - E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution.

When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase "Generated-by".
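For illustration (this example is mine, not from the policy text; the ticket number and tool name are placeholders), such a token could appear as a trailer at the end of a commit message:

    CASSANDRA-XXXXX: add vector search helper

    Generated-by: <AI tool name and version>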
I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That, plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal liability on either the users of the tools or the creators of the models for some kind of copyright infringement. That can be sticky; if we take PRs that end up with that liability exposure, we end up in a place where either the foundation could be legally exposed and/or we'd need to revert some pretty invasive code / changes.
So while the usage of these things is apparently incredibly pervasive right now, "everybody is doing it" is a pretty high risk legal defense. :)
On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
At some point we have to discuss this, and here's as good a place as any. There's a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I'm not sure if that avoids the issue, and I'm equally uncertain how much this issue is isolated to the dependency (or affects it at all?)
Anyway, this is an annoying discussion we need to have at some point, so raising it here now so we can figure it out.
My reading of the ASF's GenAI policy is that any generated work in the JVector library (and CEP-30?) is not copyrightable, and that makes it ok for us to include. If there were a trace to copyrighted work, or the tooling imposed a copyright or restrictions, we would then have to take that into consideration.