Re: Gating Fedora updates on Fedora CoreOS CI

Dusty Mabe Fri, 14 Feb 2025 13:32:21 -0800


On 2/14/25 5:07 AM, Pierre-Yves Chibon wrote:
> On Fri, Feb 14, 2025 at 10:34:51AM +0100, Clement Verna wrote:
>>    On Thu, 13 Feb 2025 at 17:44, Kevin Fenzi <ke...@scrye.com> wrote:
>>
>>      I agree with downthread folks that that seems like way too high a
>>      failure rate to enable gating on. However, a few questions if I can:
>>
>>    Yes the failure rate is quite high and most of these are real failures,
>>    that we deal with in Fedora CoreOS. So I am reading this like, because the
>>    tests are catching too many failures we should continue ignoring them 🫤
> 
> I think what is scaring people with the data you've provided is that we do not
> know which %/numbers of these failures are genuine failures that should gate
> the update because they are bugs vs infrastructure/pipeline issues.
> Would you have a way to distinguish between the two? Basically a failure vs
> error output.


I think what you bring up here is valid, and in our next round of metrics we
will come up with a way to classify the failures so we can get a better idea of
how many are genuine bugs versus infrastructure/pipeline issues.

However, I'd like to propose that we don't let this discourage us from moving
forward. You've raised concerns and we hear you. What I will say, though, is
that we do monitor these failures (hence the Matrix channel) and we do restart
tests if we believe they are failing due to flakes or issues on our side.

In other words, if the failure is believed to be on our side we try to resolve
the issue without package maintainers needing to do anything.

Now, will we always be looking at them in real time? No. However, I would
propose that we gate by default and try to give some time to determine the root
cause before waiving.

> 
> The push-back I'm hearing is more toward: there are a lot of failures here and
> if they are all related to infrastructure issues then we're going to cause
> disruption without a clear benefits.

I'd like to push back slightly on the word "disruption" here. IMO disruption is
more applicable in the case where a test fails (keep in mind we are already
running the tests and reporting the results) and the update goes in anyway and
causes issues in downstream built artifacts. We (Fedora as a whole) were given
bad results and the update went in anyway.

> Now if you're able to say: "95% of these errors are genuine bug that today are
> impacting our users despite our pipeline having found it and 5% are
> infrastructure related", that's a different story :)

IMO the bar would only need to be that high if the user had no way to ignore the
test results. All gating does here (IIUC) is require them to do an extra step
before it automatically flows into the next rawhide compose.

Dusty
-- 
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
