Re: [DISCUSS] Forking Cassandra utilities into a separately released library

David Capwell Mon, 08 Jun 2026 15:24:58 -0700

> but that introduces the inverse problem where you'd have to make a change 
> across N branches on the shared library if you have a patch that introduces 
> testing that hits all our GA C* and need to backport that functionality 
> instead of changing it in one place.


In the case I was talking about, it's the Property, Gen, and Gens classes, and 
not cluster-level tests (similar to python dtest), so I don’t think that would 
happen?


> Do we expect the shared functionality in this lib would change frequently in 
> ways that would impact multiple branches, or do we think it would be mostly 
> stable for older branches and mutate more frequently on trunk?

I went through our mailing list to see where this has been brought up, and a 
common set that comes up is "executors/futures/collections/concurrency 
utilities".  I feel these cases should work the same way: new features are for 
trunk, and we don’t really need to backport to older branches unless there are 
bug fixes (in which case we bump the version).  So I work with the assumption 
that backports to older branches aren’t that likely.  Bug fixes might need a 
version bump but should be backwards compatible; new features should also not 
break the public API.

One advantage of being a separate, versioned dependency is that it's easier to 
track when the API is broken; keeping the code in tree makes this more painful.

Now, going through the history of this topic, there is a group of things that I 
don’t think make sense to fork: stuff like AbstractType / Index / 
IAuthenticator, etc.  Plugin authors want a way to build their plugins without 
cassandra-all, and these APIs are structurally Cassandra related.  The stuff I 
propose extracting out of the code base is generic and unaware of Cassandra as 
a project.
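
To make "generic and unaware of Cassandra" concrete, here is a minimal sketch 
of a property-test helper of the kind under discussion.  It is an illustration 
only: the names PropertySketch, Gen, ints, and forAll are hypothetical and are 
not the actual Property / Gen / Gens classes in the Cassandra tree.  The point 
is that nothing in it imports from Cassandra, so it could live in a separately 
released jar.

```java
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical names throughout; these are NOT the real Cassandra classes.
final class PropertySketch {
    /** A generator produces a value from a source of randomness. */
    interface Gen<T> {
        T next(Random random);
    }

    /** Generator of ints in [min, max] inclusive (assumes a small range). */
    static Gen<Integer> ints(int min, int max) {
        return random -> min + random.nextInt(max - min + 1);
    }

    /**
     * Checks the property against `examples` generated values.  The seed is
     * fixed by the caller so that a failure can be reproduced.
     */
    static <T> void forAll(long seed, int examples, Gen<T> gen, Predicate<T> property) {
        Random random = new Random(seed);
        for (int i = 0; i < examples; i++) {
            T value = gen.next(random);
            if (!property.test(value))
                throw new AssertionError("Property failed for " + value
                                         + " (seed=" + seed + ", example=" + i + ")");
        }
    }

    public static void main(String[] args) {
        // Property: adding zero is a no-op for every generated int.
        forAll(42L, 1000, ints(-100, 100), x -> x + 0 == x);
        System.out.println("ok");
    }
}
```

A consumer such as Sidecar would only need the published jar on its test 
classpath to write checks in this style.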

> If the latter (mostly stable, trunk only changes) then having a branch of 
> tools per GA branch would be optimal

What do you mean by this?  I don’t follow "a branch of tools per GA branch".

> From a workflow perspective, a shared library factored out to its own repo 
> and embedded into C* branches as a submodule has some attractive properties 
> either way. It gives you "best of both worlds" (or least-worst-option) by 
> allowing you to work on things seamlessly as though they were one project but 
> keep the branching strategies of the tooling and the dependents decoupled. 
> Even if we only had 1 branch of the test tooling that all C* versions 
> depended on, having it separate and embedded as a submodule should give us 
> the same devx ergonomics while preserving the option to customize per C* 
> branch fairly easily.


Yep!  While working on accord I never needed 2 different IDEs open, one for 
accord and one for cassandra; I was able to make changes as if it was a single 
project and the only complexity for development was making sure CI knew about 
my accord branch (we have a script in tree for that) and merge is 3 steps 
rather than 1 (merge accord, update cassandra to point to latest accord, merge 
cassandra).

Submodules do have downsides we are currently living with (as you have seen 
working with CI), and I do hope it's been mostly seamless for people…

I can also see us trying out a hybrid model: trunk uses a submodule, but once 
we fork a major branch we switch to release jars instead.  We get the 
trunk-level velocity and lose all the pain points of submodules when working in 
a release branch.

> On Jun 8, 2026, at 7:25 AM, Josh McKenzie <[email protected]> wrote:
> 
>> One other motivation for forking is that we can fix issues one time rather 
>> than have to fix in 5 branches that have slightly different versions of our 
>> libraries. 
> The pain on this one is real. Spit-balling, but I wonder if there'd be a way 
> to sustainably have all GA branches depend on this code from trunk and we use 
> testing and validation to ensure the code on trunk stays compatible with 
> older releases.
> 
> There's a lot of complexity there since we'd need CI updated to run that 
> subset of tooling tests across all GA branches before a commit (i.e. trunk 
> only changes would then potentially impact all GA branches), but maybe that 
> actually wouldn't be so bad if we just had a new pipeline that pulled and 
> built all GA branches from HEAD and ran through the tooling test suites 
> against those releases. That, and it'd only really be in scope if you were 
> making changes to that tooling. That said, it would seem pretty weird for 5.0 
> to need to check out code from the trunk branch to build and run tests 
> against though... =/
> 
>> My primary need is for test utilities so my focus is there.
> Hm. Yeah, the more I think through this, having a versioned set of test 
> utilities in trunk for instance would definitely feel like "crossing the 
> streams" (i.e. PropertyTestingBase4.0, PropertyTestingBase4.1, etc). Big 
> separation of concerns / scope failure if people working on a trunk branch in 
> C* are having to think about other branches and API breakage with them 
> (moreso than we already have to w/mixed version upgrades etc.)
> 
> Having things like that in a separate repo where we could cut and iterate on 
> things to update for a single branch would alleviate that immediate 
> versioning / mismatch context leak, but that introduces the inverse problem 
> where you'd have to make a change across N branches on the shared library if 
> you have a patch that introduces testing that hits all our GA C* and need to 
> backport that functionality instead of changing it in one place.
> 
> Blech.
> 
> So as I was drafting the above, my thinking has distilled down to the 
> following as being important to have a shared mental model on:
> 
> Do we expect the shared functionality in this lib would change frequently in 
> ways that would impact multiple branches, or do we think it would be mostly 
> stable for older branches and mutate more frequently on trunk?
> 
> - If the former (multi-branch impacting blast radius, we keep older GA 
>   branches in sync / compatible with test harness changes), a single golden 
>   copy of the shared code that each branch shares would minimize toil
> - If the latter (mostly stable, trunk only changes) then having a branch of 
>   tools per GA branch would be optimal
> 
> From a workflow perspective, a shared library factored out to its own repo 
> and embedded into C* branches as a submodule has some attractive properties 
> either way. It gives you "best of both worlds" (or least-worst-option) by 
> allowing you to work on things seamlessly as though they were one project but 
> keep the branching strategies of the tooling and the dependents decoupled. 
> Even if we only had 1 branch of the test tooling that all C* versions 
> depended on, having it separate and embedded as a submodule should give us 
> the same devx ergonomics while preserving the option to customize per C* 
> branch fairly easily.
> 
> On Fri, Jun 5, 2026, at 9:25 AM, David Capwell wrote:
>> One other motivation for forking is that we can fix issues one time rather 
>> than have to fix in 5 branches that have slightly different versions of our 
>> libraries. A recent example is CASSANDRA-21216 which was a bug fix for 
>> btree.  
>> 
>> One of the other reasons brought up in the past is that many libraries are 
>> needed by accord but accord can’t depend on Cassandra else we have a 
>> cyclical dependency, so forking off lets accord use our libraries.  For the 
>> time being accord had to fork many libraries in accord to make progress; 
>> this is a common issue right now.
>> 
>> 
>> 
>> Sent from my iPhone
>> 
>>> On Jun 3, 2026, at 1:45 PM, Josh McKenzie <[email protected]> wrote:
>>> 
>>>> delays this effort for years as we need time to get people on board and 
>>>> used to gradle before we flip that switch. 
>>> Oof. I'm way more optimistic on this one; if we can get a PR that has ant 
>>> targets as dumb wrappers that instead call gradle targets (i.e. all 
>>> workflows and local scripting Just Work), I don't see why we couldn't merge 
>>> that as soon as we ironed out kinks.
>>> 
>>> Is there anyone that's broadly against that approach? Or did I just 
>>> misunderstand the other thread / JIRA you'd created David?
>>> 
>>> On Wed, Jun 3, 2026, at 1:21 PM, David Capwell wrote:
>>>> Fair point but one thing to point out, if this work depends on gradle that 
>>>> delays this effort for years as we need time to get people on board and 
>>>> used to gradle before we flip that switch.  So leaving in tree means we 
>>>> have to hand roll all that logic in ant. 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Jun 3, 2026, at 12:33 PM, Jon Haddad <[email protected]> wrote:
>>>>> 
>>>>> Josh is right.  Gradle subprojects could allow this without dealing with 
>>>>> separate repo.  I've done this before and am about to again for some 
>>>>> stuff I maintain.  I spent a long time agonizing over this for my other 
>>>>> projects and found it works exceptionally well, especially bc you 
>>>>> frequently develop things that are tightly coupled.  
>>>>> 
>>>>> Juggling repos sucks, this solves it (imo) perfectly.
>>>>> 
>>>>> Jon
>>>>> 
>>>>> On Tue, Jun 2, 2026 at 1:18 PM Josh McKenzie <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>>> Is there a reason not to use a folder in the current repo that becomes 
>>>>>> its own jar?  It can even be published separately if we like?
>>>>> 
>>>>>> Mostly to decouple from Cassandra release.
>>>>> I think we could just have that .jar release on its own cadence 
>>>>> independently of the parent C* project.
>>>>> 
>>>>> Some of us have talked about taking this same approach to making some 
>>>>> code from C* available to the ecosystem (think I/O .jar that has SSTable 
>>>>> read/write, CommitLog read/write, etc). This feels like a very similarly 
>>>>> shaped thing.
>>>>> 
>>>>> I assume w/a modern build / publish / etc system we'd be able to publish 
>>>>> a release that represents a strict subset of the parent project out of 
>>>>> the repo right?
>>>>> 
>>>>> On Mon, Jun 1, 2026, at 8:18 PM, David Capwell wrote:
>>>>>> Mostly to decouple from Cassandra release.  If there is a feature added 
>>>>>> does it have to wait for the next major release of Cassandra so others 
>>>>>> can consume?  Even if we can get to yearly releases that’s still a long 
>>>>>> wait.
>>>>>> 
>>>>>> For example Alex and I have been talking about proper fuzz testing, so 
>>>>>> best case is a year before 3rd parties could use.
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On Jun 1, 2026, at 4:32 PM, Jeremiah Jordan <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Does it need to be a separate repo? Is there a reason not to use a 
>>>>>>> folder in the current repo that becomes its own jar?  It can even be 
>>>>>>> published separately if we like?
>>>>>>> 
>>>>>>> -Jeremiah
>>>>>>> 
>>>>>>> On Jun 1, 2026 at 10:00:15 AM, David Capwell <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> We've discussed pulling utilities out of trunk before. I'd like to 
>>>>>>>> actually start.  My primary need is for test utilities so my focus is 
>>>>>>>> there.
>>>>>>>> 
>>>>>>>> This isn't just my need. Sidecar wants property/stateful tests but 
>>>>>>>> can't use ours without a published jar.
>>>>>>>> 
>>>>>>>> Proposed approach:
>>>>>>>> 
>>>>>>>> 1. Define scope — start with property/stateful test utilities
>>>>>>>> 2. Set up the repo and release independently of Cassandra
>>>>>>>> 3. ...
>>>>>>>> 4. Cassandra depends on the library
>>>>>>>> 
>>>>>>>> I'd focus on the fork first, before making Cassandra depend on it — 
>>>>>>>> keeps our builds simple and gives the lib room to stabilize. We can 
>>>>>>>> sort out the dependency question later (wait on releases, or use 
>>>>>>>> submodules?).
>>>>>>>> 
>>>>>>>> Happy to drive this if there's interest.
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>> 
>>> 
>
