Bringing discussion from JIRA (CASSANDRA-17729) to here:

Mick said:
> Agree with the notion that Jenkins (lower resources/more contention) is
> better at exposing flakies, but that there's a trade-off between encouraging
> flakies and creating difficult-to-deal-with noise.

I come back to the question: what minimum spec of hardware do we want to
support for C*, and how can we best configure our CI infrastructure to be
representative of that?

Given the complexity and temporal relationships w/multiple actors in a
distributed system, there's *always* going to be "defects" that show up if
you sufficiently under-provision a host. That doesn't necessarily mean it's
a user-facing bug that needs to be fixed.

What I mean by that specifically: if you under-provision a node with 2 cpus,
1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the
nodes take so long with GC pauses, compaction, streaming, etc. that they
don't correctly complete certain operations in expected time, completely
time out, fall over, or otherwise *preserve correctness but die or don't
complete operations in time* - is that a bug?
And if the angle is more "the test isn't deterministic and fails on
under-provisioned hosts; there's a bug *in the test*" - well, that's just
our lives. We have a lot of technical debt in the form of brittle,
non-deterministic tests we'd have to target excising to get past this if we
keep our container provisioning where it is.

If in the lead-up to 4.0 we saw a sub-20% hit rate of product defects from
flaky tests vs. test-environment flakes alone, we have to consider how much
effort from how many engineers it's taking in the run-up to a release to
hammer all these "flaky due to provisioning" tests back down vs. using
other methodologies of testing to uncover correctness defects in timing,
schema propagation, consistency-level guarantees, etc.

On Wed, Jul 6, 2022, at 10:43 AM, Brandon Williams wrote:
> I suspect there's another problem with some of the Jenkins nodes where
> the system CPU usage is high and drives the load much higher than
> other nodes, possibly causing timeouts. However, the docker space
> issue needs to be resolved first since we don't have the capacity to
> experiment with those nodes out of commission.
>
> On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > Another option would be to increase the resources dedicated to each agent
> > container and run less in parallel. Or, best yet, do both (up timeouts and
> > lower parallelization / up resources).
> >
> > As far as I can tell, the failures on Jenkins aren't value-add compared to
> > what we're seeing on circleci and are just generating busywork.
> >
> > There's a reasonable discussion to be had about "what's the smallest
> > footprint of hardware we consider C* supported on" and targeting ASF CI to
> > validate that. I believe the noisy env + low resources on ASF CI currently
> > are lower than whatever floor we'd reasonably agree on.
> >
> > On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> >
> > Hi All,
> >
> > bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML
> > for visibility as this has been a discussion point with some of you.
> >
> > I noticed tests time out much more on jenkins than circle. I was
> > wondering if legit bugs were hiding behind those timeouts, and it might
> > be the case. Feel free to jump in the ticket :-)
> >
> > Regards
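The "lower parallelization / up resources" option quoted above boils down to a simple budget calculation: the number of agent containers a host can run is capped by whichever resource runs out first. A minimal sketch - the host and per-agent figures are purely illustrative assumptions, not the actual specs of the ASF Jenkins hosts:

```shell
# Hypothetical host budget (assumption for illustration, not real ASF specs)
HOST_CPUS=32
HOST_MEM_GB=64

# Hypothetical per-agent floor we might agree on (also illustrative)
AGENT_CPUS=4
AGENT_MEM_GB=8

# Parallelism is capped by whichever resource is exhausted first
BY_CPU=$((HOST_CPUS / AGENT_CPUS))
BY_MEM=$((HOST_MEM_GB / AGENT_MEM_GB))
AGENTS=$((BY_CPU < BY_MEM ? BY_CPU : BY_MEM))

echo "max agents per host: $AGENTS"   # prints "max agents per host: 8" with these numbers
```

Each agent container could then be pinned to that floor with Docker's real resource flags, e.g. `docker run --cpus="$AGENT_CPUS" --memory="${AGENT_MEM_GB}g" ...`, so a noisy neighbor can't starve the tests on the same host.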