[CI] Change to be made in Pulsar CI to mitigate CI resource consumption issues

Lari Hotari Thu, 15 Sep 2022 01:36:33 -0700

Hi all,

The GitHub Actions based Pulsar CI has been experiencing issues for
multiple weeks. The condition is currently better, but the resource
shortage issue remains. CI builds will take a long time to complete even
after many optimizations have been made.

There's a long email thread with some details about the past issues:
https://lists.apache.org/thread/p7rb04vf1mt0kk3v2r7xl9dvb3zkhtxf

I have filed an issue to GitHub support about the CI issues over a week
ago, and I finally received an answer a few hours ago. However the
GitHub support person didn't reply to my questions at all, but instead
suggested that there's a beta program where it's possible to pay for
more resources. That solution isn't suitable for our case, since it
doesn't seem to be possible to assign GitHub Actions Runner VM resources
only for a specific Apache project. I'll follow up with GitHub support, but
I don't expect that to resolve our problems in the near term. We need
to make changes in our CI resource consumption.

In a the-asf Slack thread [1] about Pulsar CI issues, Martin Grigorov
suggested: "Apache Spark project requires that all PRs are executed in
the contributor's GHA quota. Maybe Pulsar can do the same ?!"

The Apache Spark contributing guide contains details about this in the
"Pull request" section, https://spark.apache.org/contributing.html .

"Before creating a pull request in Apache Spark, it is important to
check if tests can pass on your branch because our GitHub Actions
workflows automatically run tests for your pull request/following
commits and every run burdens the limited resources of GitHub Actions in
Apache Spark repository. "

In Pulsar, we will need to do the same. As a solution to this, Tison
suggested that we would not run all tests for the PR unless there's a
"ready-to-test" label on the PR.

I think this is a good suggestion. We could extend the existing
"pulsarbot" to help with the automation.

A reviewer could comment "/pulsarbot ready-to-test" on the PR and
pulsarbot would add the label and also restart the CI workflow to make
it proceed and run the tests.
pulsarbot would check for authorized users. One simple
approach would be to add a file ".pulsarci.yaml" in apache/pulsar
repository with the relevant information:

committer_github_ids:
- committer1
- committer2
...

ready_to_test:
authorized_github_ids:
- userid1
- userid2
...

We would have a script to synchronize all Pulsar committers to this file
peridiotically (manual step after there's a new committer). ASF provides
public json files for project members at
https://whimsy.apache.org/public/public_ldap_projects.json , however the
mapping to github user names seems to be missing. That could be done
with a custom script since ASF LDAP contains the github username.

All Pulsar committers would have access. In addition, there could be other
users that are authorized for using "/pulsarbot ready-to-test".

This solution would also require changes in the GitHub Actions workflows
so that the workflow is failed in an early step unless there's a
ready-to-test label for the PR.

With the above solution, we would be able to cut the amount of
unnecessary builds and get the excessive resource consumption issue
under control. The PR authors would be instructed to run initial PR
builds in their own fork and the reviewer should check that this is done
before approving the PR for testing with "/pulsarbot ready-to-test".

I would suggest proceeding quickly on this matter without separate PIPs
or votes. We could follow the Apache lazy consensus
(https://community.apache.org/committers/lazyConsensus.html) principle
and make this happen if there aren't objections in the next 72 hours.
The improvement suggestions to this proposal would obviously be taken
into account and if someone objects, we wouldn't have reached lazy
consensus and we wouldn't proceed.

-Lari

1 -
https://the-asf.slack.com/archives/CBX4TSBQ8/p1661849820238809?thread_ts=1661512133.913279&cid=CBX4TSBQ8

[CI] Change to be made in Pulsar CI to mitigate CI resource consumption issues

Reply via email to