Re: Intra-project dependencies

David Capwell Wed, 18 Jan 2023 12:44:19 -0800

Been out, sorry for just catching up now…

I feel this thread has pigeonholed on the word Accord and ignored the fact we 
are dealing with this pain today with python/jvm dtest, and that trying to 
improve that would help the project…. We also have other related projects that 
we are developing in parallel to Cassandra such as Harry, and there is interest 
in exporting our utils + simulator for other projects to use…. We also depend 
on related projects such as JAMM which block us from bumping JDK versions...

Accord is just 1 example of a Cassandra dependency needed for a release… by 
only focusing on Accord and “should it be external” this thread is ignoring the 
pain we face today and how we could improve.

We tried in-tree for in-jvm dtest and found that this broke every other commit… 
maintaining the APIs across all our supported branches was too hard to do and 
moving it outside of the tree helped make the upgrade tests more stable (there 
was still breakage, but less frequent)…. We currently have to release this for 
every patch, which has actually caused us to rely on class path ordering to 
have some branches fork the classes so they can avoid this….  We tried to do 
snapshot builds where the version contained the SHA, but this has the issue 
that snapshot builds “may” go away over time, leaving older SHAs unable to 
build… Jvm-dtest is in bad shape and really could benefit from us looking to 
improve this process…

We break python-dtest when cross-cutting changes are added as CI is hard to do 
correctly or not supported (testing downstream users (our 4 supported branches) 
is rarely done).  

We want to start using Harry as part of our test suite, so if a patch needs to 
change harry then what “should” we do? Do we block merging into Cassandra until 
we vote on a Harry release?

Maybe we should be asking what capabilities we need and how to address each?  I 
believe Mick has focused on this capabilities conversation and feel it's 100% 
the best route to take; we should be listing out what we need to do our work 
and if/how the different solutions address this.

For me I need the following:

* be able to make cross-cutting changes in 1 ticket
** in my PR override CI to use my PRs for sub-projects
* commits to Cassandra should be reproducible and buildable
* downstream testing support… if we make a change to python-dtest or Harry we 
should know if this breaks Cassandra before merging and which supported branches
* [nice to have] be able to work with all subprojects in one IDE and not have 
to switch between windows while making cross-cutting changes
* [nice to have] the commit process understands dependencies and commits things 
in the correct order
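
As an illustration of that last nice-to-have, here is a minimal sketch of 
working out a commit order from a dependency map. The map itself and the 
function name are assumptions made up for this example, not anything the 
project has today.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each repo lists the repos it depends on.
DEPENDS_ON = {
    "cassandra": {"accord", "harry"},
    "accord": set(),
    "harry": set(),
}


def commit_order(depends_on, touched):
    """Return the touched repos ordered so that dependencies come first."""
    # static_order() yields every node after all of its dependencies.
    full_order = TopologicalSorter(depends_on).static_order()
    return [repo for repo in full_order if repo in touched]


# A cross-cutting patch that touches Cassandra and Accord.
print(commit_order(DEPENDS_ON, {"cassandra", "accord"}))
# prints ['accord', 'cassandra']
```

The dependency has to land first so that the pointer committed into Cassandra 
refers to a SHA that already exists upstream.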

Now, for the “how”, I am open but see two leading options: git submodule and a 
script that mimics git submodules…. I have used other tools that boil down to 
fetching a list of repo/sha into specific directories and find them more 
annoying than git submodules…
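
To make the “script” option concrete, here is a rough sketch of what fetching 
a list of repo/sha into specific directories could look like. The manifest 
format (`<path> <url> <sha>` per line), the example URL, and all function 
names are assumptions for illustration only. The sketch only builds the git 
commands; it does not run them.

```python
import re
from typing import List, NamedTuple


class Pin(NamedTuple):
    path: str  # directory to check the dependency out into
    url: str   # repository to fetch from
    sha: str   # exact commit to build against


SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def parse_manifest(text: str) -> List[Pin]:
    """Parse lines of '<path> <url> <sha>'; blank and '#' lines are skipped."""
    pins = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected '<path> <url> <sha>'")
        path, url, sha = parts
        if not SHA_RE.match(sha):
            # Require a full SHA so the pin cannot drift the way a branch can.
            raise ValueError(f"line {lineno}: {sha!r} is not a full SHA")
        pins.append(Pin(path, url, sha))
    return pins


def fetch_commands(pin: Pin) -> List[List[str]]:
    """Git commands that place the pinned commit at pin.path (fresh clone)."""
    return [
        ["git", "clone", "--no-checkout", pin.url, pin.path],
        ["git", "-C", pin.path, "checkout", "--detach", pin.sha],
    ]


# Hypothetical manifest content.
MANIFEST = """
# path          url                              sha
modules/accord  https://example.org/accord.git   0123456789abcdef0123456789abcdef01234567
"""

for pin in parse_manifest(MANIFEST):
    for cmd in fetch_commands(pin):
        print(" ".join(cmd))
```

This is essentially what `.gitmodules` plus the recorded gitlink already give 
you with submodules, which is part of why the two options end up so similar.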

For me, both ways address my needs above; I can make cross-cutting changes with 
ease and could change CI to build my changes rather than the HEAD of a specific 
branch.

To address Mick’s capabilities I think I saw the following (correct me if 
missing any):

>  - you can no longer just `git clone …`  (and we clone automatically in a 
> number of places)

With both submodules and script that no longer works, but we can make this less 
painful by enhancing build.xml to make sure it builds out the gate; we can’t 
see all the code on a fresh clone but we would still be buildable

>  - same with `git pull …` (easy to be left with out-of-sync submodules)

Correct, if you use submodules/script you have a text file saying what we 
“should” use, but this does not enforce actually using them… again we could 
make sure build.xml does the right thing, but this can be confusing for people 
who mainly build in IDE and don’t depend on build.xml until later in 
development… this is something we should think about…

A project I am familiar with has their build auto-inject git hooks to make sure 
things “just work”, we may be able to solve this in a similar way?
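
As a sketch of the kind of check a build step or git hook could run, the 
snippet below reads the output of `git submodule status` and reports anything 
that is not in sync. It relies on the documented one-character prefix of each 
line (`-` not initialized, `+` checked-out commit differs from the recorded 
SHA, `U` merge conflicts). The function names and sample paths are made up for 
this example.

```python
import subprocess

# Meaning of the first character of each `git submodule status` line.
STATES = {
    "-": "not initialized",
    "+": "checked-out commit differs from the recorded SHA",
    "U": "has merge conflicts",
}


def out_of_sync(status_output):
    """Return (path, reason) for every submodule that is not in sync."""
    problems = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        flag = line[0]              # ' ' means the submodule is in sync
        fields = line[1:].split()   # <sha> <path> [(<describe>)]
        if flag in STATES and len(fields) >= 2:
            problems.append((fields[1], STATES[flag]))
    return problems


def current_status():
    """Ask git for the real status; must run inside the superproject."""
    result = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hand-written sample output with hypothetical paths and SHAs.
SAMPLE = (
    " 1111111111111111111111111111111111111111 modules/accord (heads/trunk)\n"
    "+2222222222222222222222222222222222222222 modules/harry (heads/trunk)\n"
    "-3333333333333333333333333333333333333333 modules/dtest-api\n"
)

for path, reason in out_of_sync(SAMPLE):
    print(f"{path}: {reason}")
```

A build could fail fast, or just warn, when `out_of_sync(current_status())` is 
non-empty. That still does not help people who only build in the IDE.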

>  - permanence from a git SHA no longer exists

Why is this?  The SHA points to other SHAs, so it is still immutable.  If we 
claim that pointing to other SHAs doesn’t count then why do library versions?  
Both are immutable snapshots of code at a specific point in time?

>  - our releases get more complicated (our source tarballs are the asf 
> releases)

We don’t include our dependencies do we?  If so, then does it really?  If 
Accord is a library we use, why would we include its source in the build?  
Isn’t it just another library from this point of view?

>  - handling patches cover submodules

I don’t know what you mean by this, do you mean how do we submit cross-cutting 
patches?  How I do this in the cep-15-accord branch is by updating the pointer 
to point to my dependency PR, that way the build “does the right thing”, I just 
have to fix this up before merging into Cassandra (have to commit in the 
“correct” order)

>  - switching branches, and using git worktrees, during dv

What is the concern here?  I am using work trees for PR review and 
cep-15-accord development and have zero issues with this; can you expand more 
on this?

> And who would be fixing our build/test/release scripts to accommodate?

100% valid question to ask.  I personally am in favor of the proposer doing the 
work and not depending on specific CI people to do the work for them….  But I 
am cool with others helping out… I do feel it's not good to depend on a single 
CI person to do all this, whatever it is we define.

> I'm thinking about reproducible builds,

Is the concern that checking out a submodule’s SHA may not compile, breaking 
C*?  Is there another concern here?  Want to fully understand

> switching between branches,

This is a pain point that I feel should be handled by git hooks.  We have this 
issue when you try to reuse the same directory for different release branches, 
and it's super annoying when you go back in time to when jars were in-tree as 
you need to clean up after switching back…. I do agree that we should flesh 
this out as it's going to be the common case, so how we “do the right thing” 
should be handled
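
For the submodule case, a hook along these lines is one possible shape; this 
is a sketch under assumptions, not a worked-out proposal. Git invokes 
`post-checkout` with three arguments: the previous HEAD, the new HEAD, and a 
flag that is `1` for a branch checkout and `0` for a file checkout. The 
function names are made up for this example.

```python
import subprocess

# Bring every submodule to the SHA recorded by the commit just checked out.
SYNC_COMMAND = ["git", "submodule", "update", "--init", "--recursive"]


def needs_sync(previous_head, new_head, branch_flag):
    """Only sync on branch checkouts that actually moved HEAD."""
    return branch_flag == "1" and previous_head != new_head


def main(argv):
    """Entry point for a post-checkout hook; argv is sys.argv."""
    if len(argv) != 4:
        return 0  # not called by git as a post-checkout hook, do nothing
    if needs_sync(argv[1], argv[2], argv[3]):
        return subprocess.call(SYNC_COMMAND)
    return 0

# Installed as an executable .git/hooks/post-checkout, the file would end with:
#     import sys; sys.exit(main(sys.argv))
```

This does nothing about build output left behind by the previous branch; that 
cleanup would still need its own handling.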

> and git bisecting

Isn’t this just another example of switching branches?  If we solve that case 
then doesn’t git bisect come in for free?  

> To include forward-merging

What is the concern here?

> Rather that you need to know in advance when the SHA is not HEAD.

Do you?  Or do you really need to know which “branch” it is following?  For 
example, let's say we release 5.0 then 5.1 then 5.2, and there are accord 
versions for each: 1.0, 1.2, 2.0… don't we really just need to know which 
branch it is following, and only when you are trying to do a cross-cutting 
change?

For example, if I want to fix a bug in 5.1 that modifies accord, I need to know 
to use the accord-1.2 branch?  I think this is a solvable problem with both 
submodules and script, but agree we should think about this case as it's going 
to come up

> Correct. submodules does not solve/remove the need to commit to multiple 
> branches and forward merge. Furthermore submodules means at least one 
> additional commit, and possibly twice as many commits.

We have 4 maintained branches atm, so if there is a bug in accord that needs to 
be fixed in all 4 we need:

* 4 commits for C*
* 1 to 4 for Accord, depending on release history.

If we make sure all branches are using the latest “stable” accord then this is 
6 commits (4 for C*, 1 for the accord stable branch, then 1 to merge into 
trunk)
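
Spelling the arithmetic out as a tiny sketch (the function and its parameter 
names are made up for this example):

```python
def total_commits(cassandra_branches, accord_commits, merge_commits=0):
    """Total commits needed to land one Accord fix on every maintained branch."""
    return cassandra_branches + accord_commits + merge_commits


# Each C* branch pins its own Accord line: somewhere between 5 and 8 commits.
print(total_commits(4, 1), total_commits(4, 4))   # prints 5 8

# All C* branches follow one "stable" Accord branch, plus a merge into trunk.
print(total_commits(4, 1, merge_commits=1))       # prints 6
```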

Our current commit process is human controlled, so every commit is a chance for 
human error.  Maybe we should look to improve this?  I know I have my own 
script to avoid human error (which supports jvm/python dtest), maybe it would 
be best if the project had automation to make sure everyone “does the right 
thing”?

> On Jan 18, 2023, at 3:06 AM, Benedict <bened...@apache.org> wrote:
> 
>> Linking or merging while it is still also being a separate library and repo.
> 
> I am still unclear why you think this is “a significant thing”?
> 
>> On 18 Jan 2023, at 10:41, Mick Semb Wever <m...@apache.org> wrote:
>> 
>> 
>> 
>> 
>>> You would reference the snapshot dependency by the timestamped snapshot. 
>>> This makes it a reproducible build.
>> 
>> How confident are we that the repository will not alter or delete them?
>> 
>> 
>> They cannot be altered.
>> 
>> I see artefacts there that are more than a decade old. But we cannot rely on 
>> their permanence. 
>> 
>> Putting the SHA into the jar's manifest is easy.  And this blog post shows 
>> how you can also expose this info on the command line: 
>> https://medium.com/liveramp-engineering/identifying-maven-snapshot-artifacts-by-git-revision-15b860d6228b
>>  
>> 
>> Given there's no guaranteed permanence to the snapshots, we would need to 
>> have the git sha in the version, so if much older versions can't be 
>> downloaded it can still be rebuilt.
>> 
>> This is done like: <revision>1.0.0_${sha1}-SNAPSHOT</revision>
>> 
>>  
>>> linking in the source code into in-tree is a significant thing to do
>> 
>> Could you explain why? I thought your preferred alternative was merging the 
>> source trees permanently
>> 
>> 
>> Linking or merging while it is still also being a separate library and repo.
>> If we are really not that interested in it as a separate library, and dev 
>> change is high, or the code is somewhere less accessible, then in tree makes 
>> sense IMHO.
>>
