+1 with taking it to legal

Like anyone else, I enjoy speculating about legal stuff, and I think for jars you 
probably need plausible deniability, aka no paper trail that we knowingly... but 
that horse is out of the barn. So I'm really interested in what legal says 🙂

If you can stomach non-Java, here is an alternate DiskANN implementation: 
microsoft/DiskANN (Graph-structured Indices for Scalable, Fast, Fresh and 
Filtered Approximate Nearest Neighbor Search): 
https://github.com/microsoft/DiskANN

Thanks,
German

________________________________
From: Josh McKenzie <jmcken...@apache.org>
Sent: Friday, September 22, 2023 7:43 AM
To: dev <dev@cassandra.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that 
there are literally dozens of legal jurisdictions in which this project is used 
and the foundation operates.

As a PMC let's take this to legal.

On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
To do that, the Cassandra PMC can open a legal JIRA and ask for a (durable, 
concrete) opinion.


On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:


  1.  my understanding is that with the former the liability rests on the 
provider of the lib to ensure it's in compliance with their claims to copyright

I highly doubt liability works like that in all jurisdictions, even if it might 
in some. I can even think of some historic cases related to Linux where patent 
trolls went after users of Linux, though I’m not sure where that got to and I 
don’t remember all the details.

But anyway, none of us are lawyers and we shouldn't be depending on this kind 
of analysis. At a minimum we should invite legal to proffer an opinion on 
whether dependencies are a valid loophole in the policy.



On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:


This discussion of GenAI-generated code use should probably be its own DISCUSS 
thread on the mailing list?  It applies to all source code we take in, and 
accept copyright assignment of, not to jars we depend on, and not only to 
vector-related code contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:

So if we're going to chat about GenAI on this thread here, 2 things:

  1.  A dependency we pull in != a code contribution (I am not a lawyer, but my 
understanding is that with the former the liability rests on the provider of 
the lib to ensure it's in compliance with their claims to copyright, and it's 
not sticky). It's also easier to transition to a different dep if there's 
something API-compatible or similar.
  2.  With code contributions we take in, we take on some exposure in terms of 
copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:

a recommended practice when using generative AI tooling is to use tools with 
features that identify any included content that is similar to parts of the 
tool’s training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed 
if the contributor ensures that:

  1.  The terms and conditions of the generative AI tool do not place any 
restrictions on use of the output that would be inconsistent with the Open 
Source Definition (e.g., ChatGPT’s terms are inconsistent).
  2.  At least one of the following conditions is met:
     *   The output is not copyrightable subject matter (and would not be even 
if produced by a human)
     *   No third party materials are included in the output
     *   Any third party materials that are included in the output are being 
used with permission (e.g., under a compatible open source license) of the 
third party copyright holders and in compliance with the applicable license 
terms
  3.  The contributor obtains reasonable certainty that conditions 2.2 or 2.3 
are met if the AI tool itself provides sufficient information about materials 
that may have been copied, or from code scanning results
     *   E.g. AWS CodeWhisperer recently added a feature that provides notice 
and attribution

When providing contributions authored using generative AI tooling, a 
recommended practice is for contributors to indicate the tooling used to create 
the contribution. This should be included as a token in the source control 
commit message, for example including the phrase “Generated-by”.
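
As a purely made-up illustration (the policy quoted above doesn't prescribe an 
exact format, and the ticket number and tool name here are invented), such a 
commit message might look like:

    CASSANDRA-XXXXX: Add vector similarity search to SAI

    Parts of this patch were written with generative AI assistance.
    Generated-by: GitHub Copilot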

I think the real challenge right now is ensuring that the output from an LLM 
doesn't include a string of tokens that's identical to something in its 
training dataset if it was trained on non-permissively licensed inputs. Add to 
that the risk of, at least in the US, the courts landing on the side of saying 
not only that the output of generative AI is not copyrightable, but that 
there's legal liability on either the users of the tools or the creators of the 
models for some kind of copyright infringement. That can be sticky; if we take 
PRs that end up with that liability exposure, we end up in a place where the 
foundation could be legally exposed and/or we'd need to revert some pretty 
invasive code / changes.

For example, Microsoft and OpenAI have publicly committed to paying legal fees 
for people sued for copyright infringement for using their tools: 
https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view.
 Pretty interesting, and not a step a provider would take in an environment 
where things were legally clear and settled.

So while the usage of these things is apparently incredibly pervasive right 
now, "everybody is doing it" is a pretty high risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:


On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:

At some point we have to discuss this, and here's as good a place as any. 
There's a great news article [1] talking about how generative AI was used to 
assist in developing the new vector search feature, which is itself really 
cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy [2] 
on its use for contributions to the project. This proposal is to include a 
dependency, but I'm not sure that avoids the issue, and I'm equally uncertain 
how much of this issue is isolated to the dependency (or affects it at all?)

Anyway, this is an annoying discussion we need to have at some point, so 
raising it here now so we can figure it out.

[1] 
https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
[2] https://www.apache.org/legal/generative-tooling.html



My reading of the ASF's GenAI policy is that any generated work in the jvector 
library (and CEP-30?) is not copyrightable, and that makes it OK for us to 
include.

If there were a trace to copyrighted work, or if the tooling imposed a 
copyright or restrictions, we would then have to take that into consideration.

