Re: [Mesa-dev] [RFC] Mesa 17.3.x release problems and process improvements

Timothy Arceri Tue, 13 Mar 2018 22:07:51 -0700

On 14/03/18 07:36, Mark Janes wrote:

Daniel Vetter <dan...@ffwll.ch> writes:

On Tue, Mar 13, 2018 at 4:46 PM, Mark Janes <mark.a.ja...@intel.com> wrote:

Daniel Vetter <dan...@ffwll.ch> writes:

On Mon, Mar 12, 2018 at 11:54:45PM -0700, Kenneth Graunke wrote:

On Friday, March 9, 2018 12:12:28 PM PDT Mark Janes wrote:
[snip]

I've been doing this for Intel.  Developers are on the hook to fix their
bugs, but you can't make them do it.  They have many pressures on them,
and a maintainer can't make the call as to whether a rendering bug is
more important than day-1 vulkan conformance, for example.

We could heighten the transparency of what is blocking the build by
publicizing the authors of bisected blocking bugs to Phoronix, which
might get things moving.


I hope you're being sarcastic here, or else I'm misunderstanding your
proposal.  Public shaming of developers who create bugs has absolutely
no place in the Mesa community, IMHO.  It would foster the kind of toxic
community that none of us want to be a part of.

Sometimes, people who create bugs are the very people that work the
hardest, who the project may not even exist without.  Would you want
to chew out someone for creating a bug in a Vulkan driver when...if it
weren't for that person, you wouldn't have a Vulkan driver at all?  Or,
maybe they caused a couple bad bugs...but also fixed hundreds of them.

Other times, they're new contributors or volunteers who do this, not as
their day job.  Frankly, those people are under no obligation to help us
at all, so we need to thank them and appreciate the time and effort they
spend - and give them a hand fixing things when they're too busy, or
don't have the relevant hardware or skill to track down a regression.

It's easy to be pissed off when there are bugs, and things seem to not
be making progress, but let's try and keep things positive and work
together to make Mesa the best we can.


I'd like to second this with my experience from the kernel community. The
public shaming game for when you create a regression is very strong there,
led by Linus Torvalds. In my experience this directly causes:

- Maintainers to hide bug reports and regression reports at all costs,
   because having Linus destroy you just ain't never worth it. The meta game
   becomes "avoid getting railed" instead of "deliver quality code", and
   there's lots of ways to easily achieve the former that seriously hurt the
   latter.

- Best practice (in my experience) is to not mention the dreaded
   "REGRESSION" tag when you need another maintainer's help to fix a
   regression, because it's too likely they'll just panic. That means they
   start screaming at you to go away, or their brain locks up and they can't
   effectively help you track down the bug (seen both cases).

- Creates a culture where talking about process/tooling improvements to
   prevent regressions and/or handle them quicker becomes too dangerous,
   because it all turns into a personal shaming game of who maintains the
   worst subsystem.

Long term you end up with a culture fucked up for good :-/

Imo the only way to make this better is to try analyzing why a regression
happened, and fix the tooling to prevent that in the future. Maybe better
test coverage (and long term efforts to fix known gaps), maybe better
presentation of automated checks (stuff like github pull requests that
automatically run CI and report full results, blocking the merge if
anything is amiss).


You have to have a very strong CI to use it to block commits.  i965 Mesa
has a big CI which identifies many regressions, but I wouldn't want to
checkpoint commits in an automated way.  A large pool of obsolete
CI hardware will have lower reliability than the mesa master branch --
which generates noise for developers and impedes progress.


This was all in general about blaming regressions on people, not
specifically for the stable-backporting-from-master issue here.

And if parts of your CI can't autogate then you can make it more
informal - there's definitely stuff you want to autogate, like "does
it compile everywhere in all configs", and probably you don't want to
autogate on gen2 dying :-)


It's a bit different for us, because multiple companies and volunteers
can push.  We have a buildtest which prevents Intel engineers and any CI
user from breaking radeon for example.  However, radeon still breaks
when AMD devs push LLVM-version-dependent patches.  We can't stop that,
and there is a set of similar situations where builds break.  Reverts
and quick fixes are fine for this IMO.

My point was if you don't want regressions, make it as easy as
possible for people to never push a regression (whether master or
stable trees) instead of a pillory or other blaming exercises. Little
things (like whether your CI results are in some mail somewhere, maybe
for an outdated version of your patches on a different baseline, or
right next to the "do you really want to merge" button) matter.


Agreed.  Anyone can painlessly test in our CI, and the majority of
developers verifying patches in our CI are external.  We offer it to
them after a regression is detected.  Usually, they make use of the CI,
because they care about the product, and they want their patches to be
great.

There have been a few situations where developers have skipped CI for
what they thought was a trivial patch, and they caused regressions for
everyone.  Lazy behavior can be quite disruptive, and can inflict cost
on the community that you want to participate in.

I'd just like to point out that as an outside user of Intel's CI I have
missed regressions on a couple of occasions. However, this was not due to
"Lazy behaviour". Having a CI system is fantastic and I'm very grateful
to have access to it. That said, it's not uncommon to run into issues and
have no idea what is going on with the system.

Some examples are getting no emails back from the system after pushing,
results that look like a successful run even though things have failed,
or just no results email at all, results with tonnes and tonnes of fails
which are clearly unrelated to my latest branch, and on occasion the
system seems to have crashed and been unresponsive for a whole weekend
(which means down for me on a Monday in Australia).

If the change is for i965 I either wait or try to bug someone at Intel
(if anyone happens to be around) to find out what is going on, but for
core mesa changes, having run piglit on radeonsi locally, I tend to push
my changes. I get that regressions are frustrating at times, but using
the CI system as an outside user can also be frustrating when you have
no idea what's going on in the black box after pushing a branch,
especially when you need to wait an hour or so to try again in between
runs.

I gave feedback at first on ways the system could be better, or errors I
seemed to hit, but was told there wouldn't be any improvements made for
the foreseeable future, so I stopped giving feedback and instead switched
to relying on my own local testing when the CI system seemed to have
lost its mind.

Anyway, this is not meant to be a criticism; I just wanted to share my
experience as an outside user.

