My high-level hope is that we can:
1. Unstick mainline yearly releases (vote to accept circle results, make circle more robust <- WE ARE HERE)
2. Invest resources into the ASF CI environment to get it to be a viable replacement for circle (requirements for this qualification TBD)
3. Deprecate circle
I'm doing my best to help make the above a reality. There's a lot of hand-waving in "ASF CI as viable replacement", but it's 2022, and there is a decade's worth of learning about modern build and CI systems that our industry has gained and that we're not yet taking advantage of.

~Josh

On Thu, Oct 20, 2022, at 1:17 PM, Derek Chen-Becker wrote:
> +1 to all of these, especially improving CircleCI generation and ergonomics. I still have a bunch of reservations around CircleCI in general, but in the short term we can make it less painful (to a point).
>
> Cheers,
>
> Derek
>
> On Thu, Oct 20, 2022 at 6:38 AM Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
>> Yes, they do. This is the only test suite that gets max resources with -m. You probably had some other issue, Berenguer, as I can confirm I was running them successfully these days.
>>
>> On Thu, 20 Oct 2022 at 6:54, Brandon Williams <dri...@gmail.com> wrote:
>>> They passed with -m for me recently.
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Thu, Oct 20, 2022 at 12:03 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>> >
>>> > Can python upgrade tests be run without -h? Last time I tried, IIRC they fail on -m.
>>> >
>>> > On 20/10/22 4:11, Ekaterina Dimitrova wrote:
>>> >
>>> > Thank you, Josh. Glad to see that our CI is getting more attention; no Cassandra feature will be there if we don't do proper testing, right? It is as important as all the suites and tools we have. With that being said, I am glad to see Derek volunteering to spend more time on this, as I believe the main issue is always the same: ideas and willingness for improvements are there, but people are swamped with other things and we lack manpower for something so important.
>>> >
>>> > 1. Tune parallelism levels per job (David and Ekaterina have insight on this)
>>> >
>>> > Question for David: do you tune only parallelism and use only xlarge? If yes, we need to talk :D
>>> > Reading the experience/feedback Stefan shared, I think we can revise the current config and move to a more reasonable one that works for most people, though there will always be someone who needs something a bit different. With that said, maybe we can add an option to our scripts/menu to change parallelism and/or resources from the command line through parameters, for those who want further customization (a rough sketch of the idea follows below)? I see this as a separate, additional ticket. In that case we could probably skip the use of circleci config process for that part of the menu (but not for adding new jobs and meaningful permanent updates).
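>>> >
>>> > For illustration only, a minimal untested sketch of such an override; the flag names and the idea of post-processing the generated config are hypothetical, not something generate.sh supports today:
>>> >
>>> >     # Hypothetical helper: apply command-line overrides to a generated
>>> >     # .circleci/config.yml instead of hand-editing it. Requires PyYAML.
>>> >     # Note: safe_dump drops comments/anchors, which is fine for a sketch.
>>> >     import argparse
>>> >     import yaml
>>> >
>>> >     def apply_overrides(path, parallelism=None, resource_class=None):
>>> >         with open(path) as f:
>>> >             config = yaml.safe_load(f)
>>> >         for job in config.get('jobs', {}).values():
>>> >             # Leave parallelism: 1 jobs alone; splitting them makes no sense.
>>> >             if parallelism is not None and job.get('parallelism', 1) > 1:
>>> >                 job['parallelism'] = parallelism
>>> >             if resource_class is not None and 'resource_class' in job:
>>> >                 job['resource_class'] = resource_class
>>> >         with open(path, 'w') as f:
>>> >             yaml.safe_dump(config, f, sort_keys=False)
>>> >
>>> >     if __name__ == '__main__':
>>> >         p = argparse.ArgumentParser()
>>> >         p.add_argument('--parallelism', type=int)
>>> >         p.add_argument('--resource-class')
>>> >         p.add_argument('--config', default='.circleci/config.yml')
>>> >         args = p.parse_args()
>>> >         apply_overrides(args.config, args.parallelism, args.resource_class)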
>>> > 2. Rename jobs on circle to be more indicative of their function
>>> >
>>> > +0. I am probably just used to the current names, but Derek brought to my attention that some of them are confusing for someone new to the Cassandra world. With that said, I would do this in a separate ticket, as a mass update.
>>> >
>>> > 3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see: https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595)
>>> >
>>> > I am against unifying the per-JDK workflows, but I am all in for unifying the pre-commit/separate workflows and getting back to 2 workflows, as suggested by Andres. If we think of how having both JDKs unified in one workflow would look in the UI, I think it would be super hard to follow.
>>> >
>>> > 4. Update documentation w/guidance on using circle, .circleci/generate.sh examples, etc
>>> > 4a. How to commit: https://cassandra.apache.org/_/development/how_to_commit.html
>>> > 4b. Testing: https://cassandra.apache.org/_/development/testing.html
>>> >
>>> > I will open a ticket and post the guide I was working on. But it also doesn't make sense to fully update it now if we are going to significantly change the workflow soon. Until then, I believe Andres has updated the circleci README and provided good usage examples.
>>> >
>>> > 5. Flag on generate.sh to allow auto-run on push
>>> >
>>> > Auto-run on push? Can you elaborate? As in starting your whole workflow directly, without using the UI? There is an approval step in the config file; we could add some flags that make the pre-commit workflows start the build without approval when those flags are used. But having everything start on push by default is overkill in my opinion. People will forget it and push builds for nothing on WIP branches. Talking from experience :D
>>> >
>>> > 6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all suites, default to -m, deprecate -h?) <- may not be a code-change issue and instead be a documentation issue
>>> >
>>> > If we agree that, besides the free-tier config file, we want one more reasonable config that does not bump resources to the max without need but provides balanced use of resources - absolutely. -h was kept because there was an understanding that people in the community were actively using it.
>>> >
>>> > 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] temporary circleci config" as the commit message
>>> >
>>> > +0
>>> >
>>> > I also wanted to address a few of the points David made.
>>> >
>>> > "something I feel Ekaterina is probably dealing with in her JDK17 work" - if you mean ensuring we have all jobs for all JDKs properly, yes, that was my plan, until Derek was kind enough to suggest working on adding the missing jobs in CircleCI now, so my work on that will be a bit less for certain things. This is an effort related to the recent changes in our release document, ticket CASSANDRA-17950 :-) I am helping with mentoring/reviews. Everyone is welcome to join the party.
>>> >
>>> > "1) The resource_class used is not because it's needed… in the HIGHER file we default to xlarge, but only python upgrade tests need that… reported in CASSANDRA-17600" - that is one of the reasons we had MIDRES in the first place, as I mentioned in my other email the other day. [1]
>>> >
>>> > "our current patching allows MID/HIGHER to drift, as changes need new patches or else patching may do the wrong thing… reported in CASSANDRA-17600" - I'd say the patching is indeed annoying at times, but with or without it, any change to the config means we need to check it by reading through the diff and pushing a run to CI before commit. With that said, I am all in for automation, but it will not change the fact that we need to push test runs and verify the changes did not hurt us in some way; the same as testing patches on all branches, running all needed tests and confirming no regressions. Nothing new or changing here, IMHO.
But this is currently >>> > controlled and fleshed out by humans who want to add new jobs.. we should >>> > move away from maintaining .circleci/config-2_1.yml and instead >>> > auto-generate it. Simple example of this problem is jdk11 support… we run >>> > a subset of tests on jdk11 and say its supported… will jdk17 have the >>> > same issue? Will it be even less tests? Why does the burden lie on >>> > everyone to “do the right thing” when all they want is a simple job?" >>> > Controlled and fleshed by humans it will always be but I agree we need >>> > to automate the steps to make it easier for people to add most of the >>> > combinations and not to skip any because it is too much work. We will >>> > always need a human to decide which jdks, cdc, vnodes, etc. With that >>> > said I shared your ticket/patch with Derek as he had similar thoughts, we >>> > need to get back to that one at some point. (CASSANDRA-17600) Thanks for >>> > working on that! >>> > >>> > "why do we require people to install “circleci” command to contribute? If >>> > you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will >>> > work just fine… we don’t need to call “circleci config process” every >>> > time we touch circle config…. Also, seems that w/e someone new to circle >>> > config (but not cassandra) touch it they always mutate LOW/MID/HIGH and >>> > not .circleci/config-2_1.yml… so I keep going back to fix >>> > .circleci/config-2_1.yml…." >>> > I'd say config-2_1.yml is mainly for those who will make permanent >>> > changes to config (like adding/removing jobs). config-2_1.yml is actually >>> > created as per the CircleCI automation rules - 1st we add and reuse >>> > executors, parameters and commands but I think we can reduce further >>> > things if we add even more parameters probably. I have to look more into >>> > the current file. I am sure there is room for further improvement. 2nd >>> > circleci cli tool can verify the config file for errors and helps with >>> > debugging before we push to CircleCI. There is circleci config validate. >>> > If we make changes manually we are on our own to verify the long yml and >>> > also deal with duplication in config.yml. My concern is that things that >>> > need to be almost identical might start to diverge easier. Though I made >>> > my suggestion in point 1 for what cases probably we can add menu options >>> > that potentially will not require using circleci cli tool. There might be >>> > more cases though. >>> > Currently config-2_1.yml is 2256 lines while config.yml is 5793 lines. >>> > I'd say lots of duplication there >>> > >>> > [1] https://lists.apache.org/thread/htxoh60zt8zxc4vgxj9zh71trk0zxwhl >>> > >>> > On Wed, 19 Oct 2022 at 17:20, David Capwell <dcapw...@apple.com> wrote: >>> >> >>> >> 1. Tune parallelism levels per job (David and Ekaterina have insight on >>> >> this) >>> >> >>> >> >>> >> +1 to this! 
>>> > On Wed, 19 Oct 2022 at 17:20, David Capwell <dcapw...@apple.com> wrote:
>>> >>
>>> >> 1. Tune parallelism levels per job (David and Ekaterina have insight on this)
>>> >>
>>> >> +1 to this! I drastically lower our parallelism, as only python-dtest upgrade tests need many resources…
>>> >>
>>> >> What I do for JVM unit/jvm-dtest is the following:
>>> >>
>>> >> import math
>>> >> import os
>>> >>
>>> >> def java_parallelism(src_dir, kind, num_file_in_worker, include = lambda a, b: True):
>>> >>     # Count the test files for this suite, then derive the worker count
>>> >>     # from the desired number of files per worker.
>>> >>     d = os.path.join(src_dir, 'test', kind)
>>> >>     num_files = 0
>>> >>     for root, dirs, files in os.walk(d):
>>> >>         for f in files:
>>> >>             if f.endswith('Test.java') and include(os.path.join(root, f), f):
>>> >>                 num_files += 1
>>> >>     return math.floor(num_files / num_file_in_worker)
>>> >>
>>> >> def fix_parallelism(args, contents):
>>> >>     jobs = contents['jobs']
>>> >>
>>> >>     # The target files-per-worker differs per suite; upgrade tests are
>>> >>     # the most expensive, so they get the fewest files per worker.
>>> >>     unit_parallelism = java_parallelism(args.src, 'unit', 20)
>>> >>     jvm_dtest_parallelism = java_parallelism(args.src, 'distributed', 4, lambda full, name: 'upgrade' not in full)
>>> >>     jvm_dtest_upgrade_parallelism = java_parallelism(args.src, 'distributed', 2, lambda full, name: 'upgrade' in full)
>>> >>
>>> >> TL;DR - I find all the test files we are going to run and, based off a pre-defined variable that gives the “ideal” number of files per worker, I calculate how many workers we need. So unit tests are num_files / 20 ~= 35 workers. Could I be “smarter” by knowing which files have a higher cost? Sure… but the “perfect” and the “average” were so similar that it wasn’t worth it...
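>>> >>
>>> >> For concreteness, a worked example with a made-up file count (roughly 700 unit test files, consistent with the ~35 workers above):
>>> >>
>>> >>     # floor(700 / 20) = 35 workers for the unit suite
>>> >>     unit_workers = java_parallelism('.', 'unit', 20)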
>>> >> 2. Rename jobs on circle to be more indicative of their function
>>> >>
>>> >> Have an example? I am not against it, I just don’t know the problem you are referring to.
>>> >>
>>> >> 3. Unify j8 and j11 workflow pairs into single
>>> >>
>>> >> Fine by me, but we need to keep in mind that j17 is coming. Also, most developmental CI builds don’t really need to run across every JDK, so we need some way to disable different JDKs…
>>> >>
>>> >> When I am testing out a patch I tend to run the following (my script): "circleci-enable.py --no-jdk11"; this removes the JDK11 builds. I know I am going to run them pre-merge, so I know it's safe for me.
>>> >>
>>> >> 5. Flag on generate.sh to allow auto-run on push
>>> >>
>>> >> I really hate that we don’t do this by default… I still to this day strongly feel you should opt out of CI rather than opt in… I have seen several commits get merged because no one saw an error in circle… because circle didn’t do any work…. Yes, I am fully aware that I am beating a dead horse…
>>> >>
>>> >> TL;DR +1
>>> >>
>>> >> 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] temporary circleci config" as the commit message
>>> >>
>>> >> +0 from me… I have seen people not realize you have to commit after typing “higher” (a wrapper around my personal circleci-enable.py script that applies my defaults to the build), but that's not an issue I have… so I don’t mind if people want the tool to integrate with git…
>>> >>
>>> >> With all that said, I do feel there is more, and something I feel Ekaterina is probably dealing with in her JDK17 work…
>>> >>
>>> >> 1) The resource_class used is not because it's needed… in the HIGHER file we default to xlarge, but only python upgrade tests need that… reported in CASSANDRA-17600
>>> >> 2) Our current patching allows MID/HIGHER to drift, as changes need new patches or else patching may do the wrong thing… reported in CASSANDRA-17600
>>> >> 3) CI is a combinatorial problem: we need to run all jobs for all JDKs, vnode on/off, cdc on/off, compression on/off, etc…. But this is currently controlled and fleshed out by humans who want to add new jobs… we should move away from maintaining .circleci/config-2_1.yml and instead auto-generate it. A simple example of this problem is jdk11 support… we run a subset of tests on jdk11 and say it's supported… will jdk17 have the same issue? Will it be even fewer tests? Why does the burden lie on everyone to “do the right thing” when all they want is a simple job?
>>> >> 4) Why do we require people to install the “circleci” command to contribute? If you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will work just fine… we don’t need to call “circleci config process” every time we touch circle config…. Also, it seems that whenever someone new to circle config (but not cassandra) touches it, they always mutate LOW/MID/HIGH and not .circleci/config-2_1.yml… so I keep going back to fix .circleci/config-2_1.yml….
>>> >>
>>> >> On Oct 19, 2022, at 1:32 PM, Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
>>> >>
>>> >> 1) would be nice to have. The first thing I do is change the parallelism to 20. None of the committed config.yamls are appropriate for our company CircleCI, so I have to tweak this manually. I think we cannot run more than 25/30 containers in parallel, something like that. HIGHRES has 100, and MIDRES has some jobs with parallelism equal to 50 or so, so that is not good either. I would be happy with a simple way to modify the default config.yaml's parallelism. I use "sed" to change parallelism: 4 to parallelism: 20 and leave parallelism: 1 where it does not make sense to increase it (see the sketch after this list). However, I noticed that "4" is not set everywhere; some jobs have it set to "1", so I have to take extra care of those cases (I consider that a bug; I think there are two or three such jobs, I do not remember exactly). Once set, I keep that config in "git stash" and just apply it every time I need it.
>>> >>
>>> >> 5) would be nice too.
>>> >> 7) is nice but not crucial, it takes no time to commit that.
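>>> >>
>>> >> For illustration, a rough untested Python equivalent of that sed tweak; unlike a YAML round-trip, a plain text substitution keeps the file's formatting, and it leaves parallelism: 1 untouched:
>>> >>
>>> >>     # Untested sketch: bump 'parallelism: 4' to 'parallelism: 20' in a
>>> >>     # generated config, same effect as the sed command described above.
>>> >>     import re
>>> >>     import sys
>>> >>
>>> >>     def bump_parallelism(path, old=4, new=20):
>>> >>         with open(path) as f:
>>> >>             text = f.read()
>>> >>         # Match only the exact old value, so 'parallelism: 1' stays as-is.
>>> >>         text = re.sub(r'\bparallelism: %d\b' % old, 'parallelism: %d' % new, text)
>>> >>         with open(path, 'w') as f:
>>> >>             f.write(text)
>>> >>
>>> >>     if __name__ == '__main__':
>>> >>         bump_parallelism(sys.argv[1] if len(sys.argv) > 1 else '.circleci/config.yml')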
>>> >>
>>> >> ________________________________________
>>> >> From: Josh McKenzie <jmcken...@apache.org>
>>> >> Sent: Wednesday, October 19, 2022 21:50
>>> >> To: dev
>>> >> Subject: [DISCUSS] Potential circleci config and workflow changes
>>> >>
>>> >> While working w/Andres on CASSANDRA-17939, a variety of things came up regarding our circleci config and opportunities to improve it. Figured I'd hit the list up here to see what people's thoughts are, since many of us intersect with these systems daily, and having your workflow disrupted without having a chance to provide input is bad.
>>> >>
>>> >> The ideas:
>>> >> 1. Tune parallelism levels per job (David and Ekaterina have insight on this)
>>> >> 2. Rename jobs on circle to be more indicative of their function
>>> >> 3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see: https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595)
>>> >> 4. Update documentation w/guidance on using circle, .circleci/generate.sh examples, etc
>>> >> 4a. How to commit: https://cassandra.apache.org/_/development/how_to_commit.html
>>> >> 4b. Testing: https://cassandra.apache.org/_/development/testing.html
>>> >> 5. Flag on generate.sh to allow auto-run on push
>>> >> 6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all suites, default to -m, deprecate -h?) <- may not be a code-change issue and instead be a documentation issue
>>> >> 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] temporary circleci config" as the commit message
>>> >>
>>> >> Curious to see what folks think.
>>> >>
>>> >> ~Josh
>
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> +---------------------------------------------------------------+