So if we're going to chat about GenAI on this thread, two things:
 1. A dependency we pull in != a code contribution (I am not a lawyer, but my 
understanding is that with the former, the liability rests on the provider of 
the lib to ensure it complies with their copyright claims, and it doesn't stick 
to us). It's also easier to transition to a different dep if there's something 
API-compatible or similar.
 2. With code contributions we take in, we take on some exposure in terms of 
copyright and infringement, and git revert can be painful.
For this thread, here's an excerpt from the ASF policy:
> a recommended practice when using generative AI tooling is to use tools with 
> features that identify any included content that is similar to parts of the 
> tool’s training data, as well as the license of that content.
> 
> Given the above, code generated in whole or in part using AI can be 
> contributed if the contributor ensures that:
> 
>  1. The terms and conditions of the generative AI tool do not place any 
> restrictions on use of the output that would be inconsistent with the Open 
> Source Definition (e.g., ChatGPT’s terms are inconsistent).
>  2. At least one of the following conditions is met:
>    1. The output is not copyrightable subject matter (and would not be even 
> if produced by a human)
>    2. No third party materials are included in the output
>    3. Any third party materials that are included in the output are being 
> used with permission (e.g., under a compatible open source license) of the 
> third party copyright holders and in compliance with the applicable license 
> terms
>  3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are 
> met if the AI tool itself provides sufficient information about materials 
> that may have been copied, or from code scanning results
>    1. E.g. AWS CodeWhisperer recently added a feature that provides notice 
> and attribution
> When providing contributions authored using generative AI tooling, a 
> recommended practice is for contributors to indicate the tooling used to 
> create the contribution. This should be included as a token in the source 
> control commit message, for example including the phrase “Generated-by”
> 
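For what it's worth, the commit-message token the policy describes can be a plain trailer. A minimal sketch below; the tool name and exact trailer format are illustrative, not something the policy mandates:

```shell
set -e
# Scratch repo so the example is self-contained
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "dev@example.org"
git config user.name "Dev"

echo demo > file.txt
git add file.txt

# Record the generative tooling used as a commit-message trailer
# ("ExampleAI Assistant" is a placeholder tool name)
git commit -q -m "Add demo file" -m "Generated-by: ExampleAI Assistant"

# The trailer is now visible in the commit body
git log -1 --format=%B | grep "Generated-by"
```

Reviewers (and later audits) can then find AI-assisted commits with `git log --grep="Generated-by"`.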

I think the real challenge right now is ensuring that the output from an LLM 
doesn't include a string of tokens identical to something in its training 
dataset when it's trained on non-permissively licensed inputs. Add to that the 
risk that, at least in the US, the courts land on the side of saying not only 
that the output of generative AI isn't copyrightable, but that there's legal 
liability on either the users of the tools or the creators of the models for 
some kind of copyright infringement. That could get sticky: if we take PRs 
that end up carrying that liability exposure, we end up in a place where the 
foundation could be legally exposed and/or we'd need to revert some pretty 
invasive code changes.

For example, Microsoft and OpenAI have publicly committed to paying the legal 
fees of customers sued for copyright infringement while using their tools: 
https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view.
Pretty interesting, and not a step a provider would take in an environment 
where things were legally clear and settled.

So while the use of these things is apparently incredibly pervasive right now, 
"everybody is doing it" is a pretty high-risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
> 
> 
> On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:
>> 
>> At some point we have to discuss this, and here’s as good a place as any. 
>> There’s a great news article published talking about how generative AI was 
>> used to assist in developing the new vector search feature, which is itself 
>> really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal 
>> policy on use for contributions to the project. This proposal is to include 
>> a dependency, but I’m not sure if that avoids the issue, and I’m equally 
>> uncertain how much this issue is isolated to the dependency (or affects it 
>> at all?)
>> 
>> Anyway, this is an annoying discussion we need to have at some point, so 
>> raising it here now so we can figure it out.
>> 
>> [1] 
>> https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>> [2] https://www.apache.org/legal/generative-tooling.html
>> 
> 
> 
> My reading of the ASF's GenAI policy is that any generated work in the 
> jvector library (and CEP-30?) is not copyrightable, and that makes it ok 
> for us to include.
> 
> If there were a trace to copyrighted work, or the tooling imposed a copyright 
> or restrictions, we would then have to take that into consideration.
