So the general concern we're talking about is identifying/avoiding cases in which a community member has contributed code generated with the support of a model that has reproduced training data verbatim, posing copyright risk to the Apache Cassandra project.
Work in this area seems early, but there are a couple of examples we should know about. Copilot and Amazon Q seem to have the most advanced capability here, unless I've missed others.

– Copilot offers a feature called Code Referencing, which provides inline feedback indicating that generated source may match publicly-available source code, along with that source's license and references to where it appears: https://github.blog/news-insights/product-news/code-referencing-now-generally-available-in-github-copilot-and-with-microsoft-azure-ai/

– Amazon Q / CodeWhisperer implement a similar feature called Reference Tracker. It detects whether a suggestion is similar to publicly-available code and surfaces a repo link and license information: https://www.infoworld.com/article/2337694/amazon-q-developer-review-code-completions-code-chat-and-aws-skills.html

I'm not aware of specific functionality in models from OpenAI or Anthropic, though the Claude system prompt apparently directs the model not to reproduce copyrighted material in runs greater than 15 words. It's not clear that there's a model feature that explicitly guarantees this, though: https://simonwillison.net/2025/May/25/claude-4-system-prompt/#seriously-don-t-regurgitate-copyrighted-content

A spectrum of approaches we could consider to manage the project's exposure to this risk:

(1) Requiring contributors using generative models to disclose that an assistive model was used, and what steps were taken to mitigate the risk of inclusion of copyrighted output. This could include requiring disclosure of the models used and whether features like code referencing / reference tracking were applied to identify material similar to known OSS code.

(2) Requiring contributors using generative models to adopt features that proactively attempt to identify similarities to publicly-available source code as part of their development workflow, and to warrant either that no instances of code referencing were reported, or that any generated suggestions were transformed significantly enough to constitute a transformative work.

(3) Implementing proactive scanning and policing this ourselves, similar to approaches used in education like “TurnItIn,” which are deeply disliked by honest and dishonest students alike; or

(4) Disallowing contribution of source code developed with the support of generative models as a matter of policy.

Speaking for myself, I’d support an approach that:

(a) Requires disclosure of source code generated with the support of a coding LLM, including which model and version.

(b) Requires contributors to enumerate steps taken to minimize the risk of inclusion of copyrighted content - e.g., use of features similar to the ones above; specificity of the work to the Cassandra codebase; or generality of the code generated (boilerplate).

(c) Requires contributors using assistive LLMs to warrant that they are the primary author of the patch - primarily to distinguish between use of language models as “enhanced IntelliSense” vs. unconscious vibe coding.

(A rough sketch of how (a)-(c) might look in a PR template follows below.)

Note that the above a/b/c items are specific to the copyright concern and don’t touch other questions, such as whether we should be using generative tooling to develop a database used industry-wide as a system of record, or what burden this places on reviewers. My own bread is buttered on the side of “I can’t be bothered to read something that someone didn’t bother to write.” But my hope is that any use of LLM-based tooling is sufficiently transformative as to become the original work of the contributor.
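To make (a)-(c) concrete, here’s a rough sketch of what a disclosure section in a PR template might look like. The file path, wording, and checkboxes are mine and purely illustrative - the project would want to adjust the details:

```
<!-- hypothetical addition to .github/pull_request_template.md -->
### Generative tooling disclosure

- [ ] No generative / assistive model was used to produce this patch.

If a model was used:
- Model(s) and version(s): ____
- Mitigation steps taken (check all that apply):
  - [ ] Code referencing / reference tracking was enabled and reported no matches
  - [ ] Matches were reported; the affected suggestions were rewritten or dropped
  - [ ] Generated content is boilerplate and/or specific to the Cassandra codebase
- [ ] I warrant that I am the primary author of this patch.
```

The same checklist could be mirrored in the contributing guide and on the website for folks who submit patches through JIRA rather than GitHub.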
In the spirit of disclosure, every word above was written by me. I used Kagi and ChatGPT-4 turbo (free) very lightly as search aids to look up vendor-specific implementations of the code referencing and reference tracking features, because I wasn’t aware of the name of the relevant product category.

– Scott

> On Jun 2, 2025, at 4:54 PM, Jeremiah Jordan <jerem...@apache.org> wrote:
>
> I don’t think I said we should abdicate responsibility? I said the key point is that contributors, and more importantly reviewers and committers, understand the ASF guidelines and hold all code to those standards. Any suspect code should be blocked during review. As Roman says in your quote, this isn’t about AI, it’s about copyright. If someone submits copyrighted code to the project, whether an AI generated it or they just grabbed it from a Google search, it’s on the project to try not to accept it.
>
> I don’t think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can’t know what someone did on their laptop; all we can do is evaluate the code they submit.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
>> Hi,
>>
>> As PMC members/committers we aren't supposed to abdicate this to legal or to contributors. Despite the fact that we aren't equipped to solve this problem, we are supposed to be making sure that code contributed is non-infringing.
>>
>> This is a quotation from Roman Shaposhnik from this legal thread: https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>>
>>> Yes, because you have to. Again -- forget about AI -- if a drive-by contributor submits a patch that has huge amounts of code stolen from some existing copyright holder -- it is very much ON YOU as a committer/PMC to prevent that from happening.
>>
>> We aren't supposed to knowingly allow people to use AI tools that are known to generate infringing contributions or contributions which are not license compatible (such as OpenAI terms of use).
>>
>> Ariel
>>
>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>>> > Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance
>>>
>>> To me this is the key point. Given how pervasive this stuff is becoming, I don’t think it’s feasible to make some list of tools and enforce it. Even without getting into extra tools, IDEs (including IntelliJ) are doing more and more LLM-based code suggestion as time goes on.
>>>
>>> I think we should point people to the ASF Guidelines around such tools, and the guidelines around copyrighted code, and then continue to review patches with the high standards we have always had in this project.
>>>
>>> -Jeremiah
>>>
>>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
>>>
>>> Hi,
>>>
>>> To clarify, are you saying that we should not accept AI generated code until it has been looked at by a human and then written again with different "wording" to ensure that it doesn't directly copy anything?
>>>
>>> Or do you mean something else about the quality of "vibe coding" and how we shouldn't allow it because it makes bad code?
>>> Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance, and I don't think we should tell them how to go about meeting that bar beyond what is needed to address the copyright concern.
>>>
>>> I agree that the bar set by the Apache guidelines is pretty high. They are simultaneously impossible and trivial to meet depending on how you interpret them, and we are not very well equipped to interpret them.
>>>
>>> It would have been more straightforward for them to simply say no, but they didn't opt to do that, implying there is some way for PMCs to acceptably take AI generated contributions.
>>>
>>> Ariel
>>>
>>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>>>>> fine tuning encourage not reproducing things verbatim
>>>>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors so I have a moderate level of trust they will succeed at it if they say they do it.
>>>>
>>>> Some team members and I discussed this in the context of my documentation patch (which utilized Claude during composition). I conducted an experiment posing high-level Cassandra-related questions to a model without additional context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 0.8). The results revealed that each test generated content copied verbatim from a specific non-Apache (and non-DSE) website. I did not verify whether this content was copyrighted, though it was easily identifiable through a simple Google search. This occurred as a single sentence within the generated document, and as I am not a legal expert, I cannot determine whether this constitutes a significant issue.
>>>>
>>>> The complexity increases when considering models trained on content in other languages, which they may translate into English. In such cases, a Google search would fail to detect the origin. Is this still considered plagiarism? Does it violate copyright laws? I am uncertain.
>>>>
>>>> Similar challenges arise with code generation. For instance, if a model is trained on a GPL-licensed Python library that implements a novel data structure, and the model subsequently rewrites this structure in Java, a Google search is unlikely to identify the source.
>>>>
>>>> Personally, I do not assume these models will avoid producing copyrighted material. This doesn’t mean I am against AI at all, but rather reflects my belief that the requirements set by Apache are not easily “provable” in such scenarios.
>>>>
>>>>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>>>
>>>> My perspective, after trying to see what AI can do, is the following:
>>>>
>>>> Strengths
>>>> * Generating a preliminary draft of a document and assisting with iterative revisions
>>>> * Documenting individual methods
>>>> * Generating “simple” methods and scripts, provided the underlying libraries are well-documented in public repositories
>>>> * Managing repetitive or procedural tasks, such as “migrating from X to Y” or “converting serializations to the X interface”
>>>>
>>>> Limitations
>>>> * Producing a fully functional document in a single attempt that meets merge standards. When documenting Gens.java and Property.java, the output appeared plausible but contained frequent inaccuracies.
>>>> * Addressing complex or ambiguous scenarios (“gossip”), though this challenge is not unique to AI. Matt Byrd and I tested Claude for CASSANDRA-20659, where it could identify relevant code but proposed solutions that risked corrupting production clusters.
>>>> * Interpreting large-scale codebases. Beyond approximately 300 lines of actual code (excluding formatting), performance degrades significantly, leading to a marked decline in output quality.
>>>>
>>>> Note: When referring to AI/LLMs, I am not discussing interactions with a user interface to execute specific tasks, but rather leveraging code agents like Roo and Aider to provide contextual information to the LLM.
>>>>
>>>> Given these observations, it remains challenging to determine optimal practices. In some contexts it is very easy to tell that nothing was taken from external work (e.g., “create a test using our BTree class that inserts a row with a null column,” “analyze this function’s purpose”). However, for substantial tasks, the situation becomes more complex. If the author employed AI as a collaborative tool during “pair programming,” the concerns are not really that different from Google searches (unless the work involves unique elements like introducing new data structures or indexes). Conversely, if the author “vibe coded” the entire patch, two primary concerns arise: does the author have rights to the code, and does its quality align with requirements.
>>>>
>>>> TL;DR - I am not against AI contributions, but strongly prefer it be done as “pair programming”. My experience with “vibe coding” makes me worry about the quality of the code, and that the author is less likely to validate that the code generated is safe to donate.
>>>>
>>>> This email was generated with the help of AI =)
>>>>
>>>>> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> It looks like we haven't discussed this much and haven't settled on a policy for what kinds of AI generated contributions we accept and what vetting is required for them.
>>>>>
>>>>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results.
>>>>>
>>>>> ```
>>>>> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>>>>>
>>>>> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
>>>>> 2. At least one of the following conditions is met:
>>>>>    2.1 The output is not copyrightable subject matter (and would not be even if produced by a human).
>>>>>    2.2 No third party materials are included in the output.
>>>>>    2.3 Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
>>>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
>>>>> ```
>>>>>
>>>>> There is a lot to unpack there, but it seems like any one of the sub-conditions of 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.
>>>>>
>>>>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 are a pretty high bar in that it's hard to know whether you have met them. Do we have anyone in the community running any code scanning tools already?
>>>>>
>>>>> Here is the JIRA for the addition of the generative AI policy: https://issues.apache.org/jira/browse/LEGAL-631
>>>>> Legal mailing list discussion of the policy: https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
>>>>> Legal mailing list discussion of compliant tools: https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
>>>>> Legal mailing list discussion about how Open AI terms are not Apache compatible: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
>>>>> Hadoop mailing list message hinting that they accept contributions but ask which tool: https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
>>>>> Spark mailing list message where they have given up on stopping people: https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>>>>>
>>>>> I didn't see other projects discussing and deciding how to handle these contributions, but I also didn't check that many of them (only Hadoop, Spark, Druid, and Pulsar). I also can't see their PMC mailing lists.
>>>>>
>>>>> I asked O3 to deep research what is done to avoid producing copyrighted code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>>>>>
>>>>> To summarize: training data is deduplicated so the model is less likely to reproduce it verbatim; prompts and fine tuning encourage not reproducing things verbatim; inference is biased to not always pick the best option but some neighboring one, encouraging originality; and in some instances the output is checked to make sure it doesn't match the training data. So to some extent 2.2 is being done to different degrees depending on what product you are using.
>>>>>
>>>>> It's worth noting that scanning the output can be probabilistic (in the case of, say, Anthropic), and they still recommend code scanning.
>>>>>
>>>>> Quite notably, Anthropic indemnifies its enterprise users against copyright claims. It's not perfect, but it does mean they have an incentive to make sure there are fewer copyright claims. We could choose to be picky and only accept specific sources of LLM generated code based on perceived safety.
>>>>>
>>>>> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors, so I have a moderate level of trust that they will succeed at it if they say they do it.
>>>>>
>>>>> I could send a message to the legal list asking for clarification and a set of tools, but based on Roman's communication (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think this is kind of what we get. It's on us to ensure the contributions are kosher, either by code scanning or by accepting that the LLM vendors are doing a good job of avoiding copyrighted output.
>>>>>
>>>>> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>>>>>
>>>>> Regards,
>>>>> Ariel
>>>
>>
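One last thought on the code-scanning idea Ariel mentions above: even before the project picks a real scanner, a crude local check is possible. What follows is a rough, hypothetical sketch of my own (not any existing tool) that flags patches suspiciously similar to files in a local corpus of known third-party code, using only the Python standard library:

```
#!/usr/bin/env python3
# Hypothetical illustration only: flag patches whose added lines closely
# resemble a file in a local corpus of known third-party code. A real setup
# would use a proper scanner; this just shows the shape of the check.
import difflib
import sys
from pathlib import Path

SIMILARITY_THRESHOLD = 0.85  # arbitrary cutoff; would need tuning


def added_text(patch_path: Path) -> str:
    """Collect the lines a unified diff adds (lines starting with '+')."""
    keep = []
    for line in patch_path.read_text(errors="ignore").splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            keep.append(line[1:].strip())
    return "\n".join(l for l in keep if l)


def main(patch_file: str, corpus_dir: str) -> None:
    added = added_text(Path(patch_file))
    for candidate in Path(corpus_dir).rglob("*"):
        if not candidate.is_file():
            continue
        ratio = difflib.SequenceMatcher(
            None, added, candidate.read_text(errors="ignore")).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            print(f"WARNING: added code is {ratio:.0%} similar to {candidate}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

It has obvious limitations - the pairwise comparison is slow, someone has to curate the corpus, and it would miss the cross-language translation cases David describes - but something in this spirit could run in CI alongside whatever vendor-side referencing contributors disclose.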