Legal’s opinion is that this is not an acceptable workaround to the policy.

On 22 Sep 2023, at 23:51, German Eichberger via dev <dev@cassandra.apache.org> wrote:
+1 with taking it to legal
Like anyone else, I enjoy speculating about legal stuff, and I think for jars you probably need plausible deniability aka no paper trail that we knowingly... but that horse is out of the barn. So I'm really interested in what legal says
🙂
Thanks,
German
From: Josh McKenzie <jmcken...@apache.org>
Sent: Friday, September 22, 2023 7:43 AM
To: dev <dev@cassandra.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30
I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that there are literally dozens of legal jurisdictions in which this project is used and the foundation operates.
As a PMC let's take this to legal.
On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.
- my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright
I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I'm not sure where that got to and I don't remember all the details.
But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.
This GenAI-generated-code discussion should probably be its own mailing list DISCUSS thread? It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on, and not only to vector-related code contributions.
So if we're going to chat about GenAI on this thread here, 2 things:
- A dependency we pull in != a code contribution (I am not a lawyer, but my understanding is that with the former, the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright, and it's not sticky). Easier to transition to a different dep if there's something API compatible or similar.
- With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.
For this thread, here's an excerpt from the ASF policy:
a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool's training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT's terms are inconsistent).
2. At least one of the following conditions is met:
   2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
   2.2. No third party materials are included in the output.
   2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms.
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results.
   - E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution.

When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase "Generated-by".
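For illustration (this example is mine, not from the policy text; the ticket number and tool name are placeholders), such a token could appear as a trailer at the end of a commit message:

    CASSANDRA-XXXXX: add vector search helper

    Generated-by: <AI tool name and version>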
I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That, plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal liability on either the users of the tools or the creators of the models for some kind of copyright infringement. That can be sticky; if we take PRs that end up with that liability exposure, we end up in a place where either the foundation could be legally exposed and/or we'd need to revert some pretty invasive code / changes.
So while the usage of these things is apparently incredibly pervasive right now, "everybody is doing it" is a pretty high risk legal defense. :)
On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
At some point we have to discuss this, and here's as good a place as any. There's a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I'm not sure if that avoids the issue, and I'm equally uncertain how much this issue is isolated to the dependency (or affects it at all?)
Anyway, this is an annoying discussion we need to have at some point, so raising it here now so we can figure it out.
My reading of the ASF's GenAI policy is that any generated work in the JVector library (and CEP-30?) is not copyrightable, and that makes it ok for us to include. If there were a trace to copyrighted work, or the tooling imposed a copyright or restrictions, we would then have to take that into consideration.