Hi Tom,

On Tue, 4 Mar 2025 at 09:12, Tom Rini <tr...@konsulko.com> wrote:
>
> On Tue, Mar 04, 2025 at 08:35:56AM -0700, Simon Glass wrote:
> > Hi Tom,
> >
> > On Thu, 27 Feb 2025 at 10:03, Tom Rini <tr...@konsulko.com> wrote:
> > >
> > > On Thu, Feb 27, 2025 at 09:26:10AM -0700, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Mon, 24 Feb 2025 at 16:14, Tom Rini <tr...@konsulko.com> wrote:
> > > > >
> > > > > On Sat, Feb 22, 2025 at 05:24:05PM -0700, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Sat, 22 Feb 2025 at 14:37, Tom Rini <tr...@konsulko.com> wrote:
> > > > > > >
> > > > > > > On Sat, Feb 22, 2025 at 10:23:59AM -0700, Simon Glass wrote:
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > On Fri, 21 Feb 2025 at 17:08, Tom Rini <tr...@konsulko.com> 
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Feb 21, 2025 at 04:42:09PM -0700, Simon Glass wrote:
> > > > > > > > > > Hi Tom,
> > > > > > > > > >
> > > > > > > > > > On Mon, 17 Feb 2025 at 07:14, Tom Rini <tr...@konsulko.com> 
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 17, 2025 at 06:14:06AM -0700, Simon Glass 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, 16 Feb 2025 at 14:52, Tom Rini 
> > > > > > > > > > > > <tr...@konsulko.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Feb 16, 2025 at 12:39:34PM -0700, Simon Glass 
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sun, 16 Feb 2025 at 09:07, Tom Rini 
> > > > > > > > > > > > > > <tr...@konsulko.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sun, Feb 16, 2025 at 07:10:12AM -0700, Simon 
> > > > > > > > > > > > > > > Glass wrote:
> > > > > > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Sat, 15 Feb 2025 at 11:12, Tom Rini 
> > > > > > > > > > > > > > > > <tr...@konsulko.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Sat, Feb 15, 2025 at 10:21:16AM -0700, 
> > > > > > > > > > > > > > > > > Simon Glass wrote:
> > > > > > > > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Sat, 15 Feb 2025 at 07:41, Tom Rini 
> > > > > > > > > > > > > > > > > > <tr...@konsulko.com> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sat, Feb 15, 2025 at 04:59:40AM -0700, 
> > > > > > > > > > > > > > > > > > > Simon Glass wrote:
> > > > > > > > > > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, 10 Feb 2025 at 09:25, Tom Rini 
> > > > > > > > > > > > > > > > > > > > <tr...@konsulko.com> wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Thu, Feb 06, 2025 at 03:38:55PM 
> > > > > > > > > > > > > > > > > > > > > -0700, Simon Glass wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > This is a global default, so put it 
> > > > > > > > > > > > > > > > > > > > > > under 'default' like the tags.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Simon Glass 
> > > > > > > > > > > > > > > > > > > > > > <s...@chromium.org>
> > > > > > > > > > > > > > > > > > > > > > Suggested-by: Tom Rini 
> > > > > > > > > > > > > > > > > > > > > > <tr...@konsulko.com>
> > > > > > > > > > > > > > > > > > > > > > Reviewed-by: Tom Rini 
> > > > > > > > > > > > > > > > > > > > > > <tr...@konsulko.com>
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Please make v4 include the way you 
> > > > > > > > > > > > > > > > > > > > > redid the second patch and be on top
> > > > > > > > > > > > > > > > > > > > > of mainline, thanks.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > That's enough versions for me, so I'll 
> > > > > > > > > > > > > > > > > > > > let you do that, if you'd like.
> > > > > > > > > > > > > > > > > > > > It probably doesn't affect your tree as 
> > > > > > > > > > > > > > > > > > > > not as much is done in
> > > > > > > > > > > > > > > > > > > > parallel.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I am disappointed.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm sorry to disappoint you.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The background is that I looked at the 
> > > > > > > > > > > > > > > > > > difference between our trees
> > > > > > > > > > > > > > > > > > and the gitlab files are quite different. 
> > > > > > > > > > > > > > > > > > My CI runs take about 35
> > > > > > > > > > > > > > > > > > mins and it seems that yours is around 90 
> > > > > > > > > > > > > > > > > > mins. I would like to reduce
> > > > > > > > > > > > > > > > > > / remove the delta (for time and patch 
> > > > > > > > > > > > > > > > > > diff), but I'm not sure how.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > My goal is to get CI runs to below 20 
> > > > > > > > > > > > > > > > > > minutes, best case.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I'm sure CI could be quicker still with a 
> > > > > > > > > > > > > > > > > number of faster runners. But
> > > > > > > > > > > > > > > > > if you can't be bothered to make changes 
> > > > > > > > > > > > > > > > > against mainline, what is the
> > > > > > > > > > > > > > > > > point?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If you recall, I was working with your tree and 
> > > > > > > > > > > > > > > > had various ideas to
> > > > > > > > > > > > > > > > speed things up, but you didn't like it. So 
> > > > > > > > > > > > > > > > I've had to do it in my
> > > > > > > > > > > > > > > > tree. This is not about more runners (although 
> > > > > > > > > > > > > > > > I might have another
> > > > > > > > > > > > > > > > one soon). It is about running jobs in parallel.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And I wasn't sure more runners in parallel would 
> > > > > > > > > > > > > > > help (as it would slow
> > > > > > > > > > > > > > > down the fast runner which is what keeps the long 
> > > > > > > > > > > > > > > jobs from being even
> > > > > > > > > > > > > > > longer) as much as adding more regular runners 
> > > > > > > > > > > > > > > would (which we've done)
> > > > > > > > > > > > > > > and noted that in the end it's a configuration on 
> > > > > > > > > > > > > > > the runner side so to
> > > > > > > > > > > > > > > go ahead. And I reviewed and ack'd the patches 
> > > > > > > > > > > > > > > here which exposed the
> > > > > > > > > > > > > > > issues your patch revealed. I just can't apply 
> > > > > > > > > > > > > > > them because they need to
> > > > > > > > > > > > > > > be rebased (and squashed).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You have already added tags for things, but (IIUC) 
> > > > > > > > > > > > > > they are around the
> > > > > > > > > > > > > > other way from what I have added.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have a tag called 'single' which means that the 
> > > > > > > > > > > > > > machine is only
> > > > > > > > > > > > > > allowed to one of those jobs. The world-build jobs 
> > > > > > > > > > > > > > are marked with
> > > > > > > > > > > > > > 'single'.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For other jobs, I allow the runners to pick up some 
> > > > > > > > > > > > > > in parallel
> > > > > > > > > > > > > > depending on their performance (for moa and tui 
> > > > > > > > > > > > > > that is 10).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So at most, there is a 'world build' and 10 test.py 
> > > > > > > > > > > > > > jobs running on
> > > > > > > > > > > > > > the same machine. It seems to work fine in 
> > > > > > > > > > > > > > practice, although I would
> > > > > > > > > > > > > > rather be able to make these two types of jobs 
> > > > > > > > > > > > > > mutually exclusive, so
> > > > > > > > > > > > > > that a runner is either running 10 parallel jobs or 
> > > > > > > > > > > > > > 1 'single' job,
> > > > > > > > > > > > > > but not both. I'm not sure how to do that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So unless I'm missing something, in both cases the 
> > > > > > > > > > > > > bottleneck is that
> > > > > > > > > > > > > for world build jobs you don't want anything else 
> > > > > > > > > > > > > going on with the
> > > > > > > > > > > > > underlying build host. You could register 10 "all" 
> > > > > > > > > > > > > runners and 1 "fast
> > > > > > > > > > > > > amd64" runner (and something similar but smaller for 
> > > > > > > > > > > > > alexandra). If you
> > > > > > > > > > > > > update the registrations on source.denx.de can you 
> > > > > > > > > > > > > then shut down your
> > > > > > > > > > > > > gitlab instance?
> > > > > > > > > > > >
> > > > > > > > > > > > I've put a tag of 'single' on things that should run on 
> > > > > > > > > > > > the single-job
> > > > > > > > > > > > runner. Everything else can run concurrently, e.g. up 
> > > > > > > > > > > > to 10 jobs. So I
> > > > > > > > > > > > have two runners on the same host. E.g. tui-single has 
> > > > > > > > > > > > 'limit = 1',
> > > > > > > > > > > > but 'tui' has no limit and is just governed by the 
> > > > > > > > > > > > 'concurrent = 10'
> > > > > > > > > > > > at the top of the file.
> > > > > > > > > > >
> > > > > > > > > > > Yes. And you could move those runners to the mainline 
> > > > > > > > > > > gitlab. There is
> > > > > > > > > > > no "single" tag, that would be the "all" tag. And 
> > > > > > > > > > > "tui-single" would be
> > > > > > > > > > > "fast amd64".
> > > > > > > > > >
> > > > > > > > > > They are still attached to the Denx gitlab. Nothing has 
> > > > > > > > > > changed on my
> > > > > > > > > > side. I'm not sure that your new tags are working though. I 
> > > > > > > > > > have a
> > > > > > > > > > feeling something broke along the way when you made all 
> > > > > > > > > > your tag
> > > > > > > > > > changes. One of my servers makes a bit of noise and I 
> > > > > > > > > > haven't heard it
> > > > > > > > > > in quite a while.
> > > > > > > > >
> > > > > > > > > There's a few of your runners that are "stale" and haven't 
> > > > > > > > > contacted
> > > > > > > > > gitlab in a long time. I'll double check the tags tho.
> > > > > > > > >
> > > > > > > > > > If Denx would like to give me access to their gitlab 
> > > > > > > > > > instances, I'd be
> > > > > > > > > > happy to play around and figure out how to get it going as 
> > > > > > > > > > fast as my
> > > > > > > > > > tree does, and send a patch.
> > > > > > > > >
> > > > > > > > > I'm not sure what you mean by that? The instance itself?
> > > > > > > >
> > > > > > > > Yes. I can fiddle with tags on my runners and try to figure it 
> > > > > > > > out.
> > > > > > >
> > > > > > > I'm not sure what you're getting at here. If you mean "tags" in
> > > > > > > /etc/gitlab-runner/config.toml those aren't relevant here I 
> > > > > > > believe.
> > > > > >
> > > > > > No, I mean the tags in CI. If I fiddle with them I can probably come
> > > > > > up with a way to run your CI much faster. Mine is about 35mins.
> > > > >
> > > > > I'm not so sure about that. Yours runs faster because it tests less. 
> > > > > Now
> > > > > that we've got some of your other fast runners showing up again, this 
> > > > > is
> > > > > more instructive of current times I think:
> > > > > https://source.denx.de/u-boot/u-boot/-/pipelines/24802
> > > >
> > > > But not this? :
> > >
> > > You forgot a link. But presumably to some run yesterday which took
> > > longer. And because Ilias was tweaking the currently donated arm64
> > > runners (that have other jobs to run) and also we had two or three
> > > custodians at a time preparing trees, things ran slower.
> >
> > Maybe, but I don't think so.
>
> No need to "think" about it. You can look at the pipeline history and
> see what was in queue for how long. And since I needed to be keeping an
> eye on two of the 3 arm64 runners, I could see when custodians were
> firing off tests. Aside from the number of pull requests I had waiting
> that morning.

Your runs reliably take around an hour, while mine reliably finish in
just over 30 minutes, and I only have three runners.

https://source.denx.de/u-boot/u-boot/-/pipelines
https://sjg.u-boot.org/u-boot/u-boot/-/pipelines

I know you have added a duplicate build on arm64, but I can still
speed it up significantly if you'll allow me.
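
For reference, the setup I described earlier boils down to a
gitlab-runner config.toml along the lines of the sketch below (names
and tokens are placeholders, and the real file has more fields):

```toml
# Global cap: this host may run up to 10 jobs at once, across all
# runners registered in this file.
concurrent = 10

# Runner for the heavy world-build jobs. 'limit = 1' means it never
# picks up more than one job at a time. Its tag list (e.g. "single")
# is set on the server side at registration time, not here.
[[runners]]
  name = "tui-single"
  url = "https://example.com/"   # placeholder URL
  token = "REDACTED"
  executor = "docker"
  limit = 1

# Runner for everything else: no per-runner limit, so it is governed
# only by the global 'concurrent = 10' above.
[[runners]]
  name = "tui"
  url = "https://example.com/"   # placeholder URL
  token = "REDACTED"
  executor = "docker"
```

So at worst a host runs one world build plus up to nine smaller jobs;
what I can't express in this file is making the two kinds of job
mutually exclusive.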

>
> > > > > If you want to make mainline CI run faster you will need to catch up
> > > > > with the missing coverage or argue that some things are redundant.
> > > >
> > > > Or perhaps I can actually just make it faster without dropping coverage?
> > >
> > > I mean, I don't know how that's physically possible, outside of adding
> > > many more expensive build hosts. We have two-three fast arm64 hosts and
> > > that world builds between 30-45 minutes. That's the biggest time
> > > bottleneck.
> >
> > Why did you join those builds up? It is better for throughput to have
> > a few runners working in parallel.
>
> Because I'm not optimizing for the single developer running CI case (or
> the loads of fast runners case). If we had sufficient resources, yes,
> the fastest possible way would be 4 "fast arm64" servers and 4 "fast
> amd64" servers each running if not 25% of the world build, at least 4
> easy to make and maintain groupings.
>
> However, we don't have that many of either. And they also need to be
> used for the biggest sandbox test suite jobs so that they run in about 5
> minutes, not about 10 minutes. So in order to not entirely block other
> custodians we do a single world build. Because make and buildman are
> very good about otherwise fully loading the server. Running anything
> else while that is going on will slow down the world build (and, the
> other job too).

Parallel jobs are OK alongside a world build (on a fast machine) so
long as the 'other' load is not too heavy.

>
> Aside, maintaining groupings is a pain. It was very bad with Travis, and
> it's only moderately painful with Azure where at least the end goal is
> 10 pipelines for maximum concurrency. And with Azure everyone *can* get
> their own "project" or whatever the right term is, and utilize 10
> runners at once.

Yes, but we can update buildman to handle grouping automatically, as I
suggested once.
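
To illustrate what I mean by automatic grouping: the idea is simply to
split the board list into N roughly equal buckets so each CI job builds
one bucket. This is a hypothetical sketch, not buildman's actual API;
the board names are just examples:

```python
# Hypothetical sketch of automatic build grouping: round-robin the
# boards into num_groups buckets of roughly equal size, one bucket
# per CI job. Not buildman's real interface -- just the idea.

def split_into_groups(boards, num_groups):
    """Distribute boards round-robin into num_groups lists."""
    groups = [[] for _ in range(num_groups)]
    for i, board in enumerate(boards):
        groups[i % num_groups].append(board)
    return groups

# Example board names (illustrative only)
boards = ['sandbox', 'qemu_arm64', 'rpi_4', 'am62x_evm', 'imx8mp_evk']
print(split_into_groups(boards, 2))
```

The maintenance win is that the groupings no longer need to be curated
by hand in the CI file; adding a board changes nothing.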

>
>
> > > The next biggest is that unless sandbox tests are run on a fast host,
> > > they take upwards of 10 minutes, rather than 5.
> >
> > Yes, they are just getting slower and slower.
>
> Adding more tests takes more time. But the real question is which tests
> take wall clock noticeable time, and why, and if we can do anything
> about it. My gut feeling is that it's in the disk image related tests
> and the user space verification of them.
>
> > > But please, rebase your work to next and see what you can do. There is
> > > likely some speed-ups possible if we allow for failures to take longer
> > > to happen (and don't gate world builds on all of test.py stage
> > > completing, just say sandbox). And if you do the work on source.denx.de
> > > (as there is *NOTHING* stopping you from registering more runners to
> > > your tree and using whatever tagging scheme you like) you might even see
> > > more of the time variability due to load from other custodians.
> >
> > I can't edit the tags on the runners, nor can I adjust them to run
> > untagged jobs, nor can I delete runners I don't want, so no, I believe
> > I need access to do that.
>
> You can do all of that with runners specific to u-boot-dm, and you can
> disable project / group runners yourself too. So yes, you can.

Nope, sorry, I wasn't able to do any of this with the Denx tree as I
can't adjust tags and can't delete and recreate runners.

>
> > > > > > > > > > I also have another runner to add.
> > > > > > > > >
> > > > > > > > > I'll contact you off-list with the token.
> > > > > > > > >
> > > > > > > > > > > > From my side, I have found it helpful and refreshing to 
> > > > > > > > > > > > have a gitlab
> > > > > > > > > > > > instance which I can control, e.g. it runs in half the 
> > > > > > > > > > > > time and if my
> > > > > > > > > > > > patches are completely blocked by Linaro, etc., I have 
> > > > > > > > > > > > an escape
> > > > > > > > > > > > valve.
> > > > > > > > > > >
> > > > > > > > > > > Yes, and I have no idea what any of that has to do with 
> > > > > > > > > > > anything other
> > > > > > > > > > > than leading to confusion about what tree is or is not 
> > > > > > > > > > > mainline. Since
> > > > > > > > > > > you own u-boot.org and ci.u-boot.org is your gitlab and
> > > > > > > > > > > https://ci.u-boot.org/u-boot/u-boot/ is your personal 
> > > > > > > > > > > tree.
> > > > > > > > > >
> > > > > > > > > > For now I am working with my tree, so that I am not blocked 
> > > > > > > > > > by Linaro,
> > > > > > > > > > etc. but as you have seen I can rebase series for your tree 
> > > > > > > > > > as needed.
> > > > > > > > >
> > > > > > > > > And you're not addressing my point about using the project 
> > > > > > > > > domain for
> > > > > > > > > your personal tree. That's my big huge "are you forking the 
> > > > > > > > > project or
> > > > > > > > > what" problem.
> > > > > > > >
> > > > > > > > I'm just making sure that my work is not blocked or lost, as 
> > > > > > > > that has
> > > > > > > > happened too many times in the past few years.
> > > > > > >
> > > > > > > Again, are you intending to fork the project? Putting your 
> > > > > > > personal tree
> > > > > > > in as "https://ci.u-boot.org/u-boot/u-boot.git" is not OK. I keep 
> > > > > > > asking
> > > > > > > you to stop it.
> > > > > >
> > > > > > No, I'm not intending to fork anything. But I need a tree that I can
> > > > > > control and push things into.
> > > > >
> > > > > I don't know how you can call your personal tree being at
> > > > > "https://ci.u-boot.org/u-boot/u-boot.git" and saying it's somewhere 
> > > > > you
> > > > > control and can push to while not also saying it's a fork. If you want
> > > > > to close down your gitlab and CNAME ci.u-boot.org to source.denx.de, 
> > > > > you
> > > > > can still push things to u-boot-dm. Or if that's too constrained of a
> > > > > namespace you can also get a contributors/sjg/ namespace. But what
> > > > > you're doing today WILL lead to confusion.
> > > >
> > > > I believe I've answered this question before. It is simply that I
> > > > cannot get certain patches (bloblist, EFI, devicetree) into your tree.
> > > > There really isn't any other reason.
> > >
> > > Yes, that's still not an answer to my question.
> > >
> > > Or is the answer to my question "Yes, I'm trying to confuse people to
> > > thinking my tree is mainline."
> >
> > No, it's simply that you are not taking some patches in your tree and
> > complaining about the amount of patches.
>
> That's misleading at best. I'm not taking the patches that other
> custodians have repeatedly rejected and explained why they're rejecting
> them.

Yes, and this has affected my ability to move things forward so much
that I've had to set up my own tree. Having that relief valve has been
working very well.

>
> > > > At the moment your CI seems to be flaky as well:
> > > >
> > > > https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/1038174
> > >
> > > [aside, I think you meant to link to the pipeline itself, which also
> > > passed, but had some retries]
> > >
> > > Funny story. Ilias needed to tweak the fast arm64 hosts and also wanted
> > > to explore "What if we have concurrency higher?" and ran in to the
> > > problems you also ran in to with respect to git seeing an existing clone
> > > in progress and bailing. Followed by the problem of multiple non-trivial
> > > jobs running concurrently.
> > >
> > > All of which is why I keep trying to tell you that while "single" and
> > > concurrent runners work fine for you on a single user instance it will
> > > not scale.
> >
> > Yes, but I solved that with the patch I sent and it seems to be 100%
> > reliable now.
>
> Yes, you eventually solved it with 3 patches, which I asked you to
> rebase and squash to two patches (because #3 just fixes that #2 wasn't
> sufficient) and you declined.

In general, why not be more open to my ideas, even just trying them
for a year? Given access to the tools, I'm confident I can speed up
your CI as well.

Regards,
Simon
