Thanks for the feedback, Michael.

> I left a question on the doc about how concurrent runs affect the
> repository's 5 GB cache limit.

This is a good question. There isn't a clear answer in the GitHub Actions
Cache documentation.

The documentation is at
https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows

Based on this document and some testing, I have come to these conclusions:

For GitHub Actions Cache, pull requests get executed in the context of the
forked repository.
The workflow triggered by a pull request event can only update its own
cache. It has read-only access to upstream caches.
If there's a cache miss, the entry will get written to the cache of the
forked repository. If the PR could write to the upstream cache, it would
be a security issue since this would be vulnerable to cache poisoning
attacks.

Each repository has a 5 GB quota for writes. The entries are kept for
up to 7 days.

The performance is fairly good. Loading docker images from the cache
happens at about 15 MB/s. Writing is 2-3x slower, about 5-7 MB/s. (The
performance of the GHA cache itself is most likely higher, since this is
the throughput measured for docker load / docker save.)
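
To put the throughput figures in perspective, here is a minimal Python
sketch that converts them into wall-clock time. The 1500 MB image size is a
hypothetical value chosen for illustration, not a measurement from the
prototype.

```python
def transfer_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time needed to move size_mb megabytes at the given throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mb_per_s


# Throughput figures quoted above (docker load / docker save via the cache).
LOAD_MB_PER_S = 15.0
SAVE_MB_PER_S_SLOW = 5.0
SAVE_MB_PER_S_FAST = 7.0

# Hypothetical image size, for illustration only.
IMAGE_MB = 1500.0

load_s = transfer_seconds(IMAGE_MB, LOAD_MB_PER_S)
save_slow_s = transfer_seconds(IMAGE_MB, SAVE_MB_PER_S_SLOW)
save_fast_s = transfer_seconds(IMAGE_MB, SAVE_MB_PER_S_FAST)

print(f"load: {load_s:.0f} s")
print(f"save: {save_fast_s:.0f} to {save_slow_s:.0f} s")
```

Under these assumptions, restoring an image costs well under two minutes,
while saving it takes several minutes.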

If a single repository has a lot of concurrent jobs, it could start
evicting caches.
However, that isn't likely to happen with the way Pulsar is developed, since
pull requests are created from personal forks and each fork writes to its
own cache.
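
The eviction concern can be illustrated with a small Python model. This is
only a sketch: the `RepoCache` class is hypothetical, and evicting the least
recently accessed entry first is my assumption about the eviction order, not
something I have verified in testing.

```python
from collections import OrderedDict
from typing import List

GIB = 1024 ** 3
QUOTA_BYTES = 5 * GIB  # per-repository quota discussed above


class RepoCache:
    """Toy model of one repository's cache (hypothetical, for illustration)."""

    def __init__(self, quota_bytes: int = QUOTA_BYTES) -> None:
        self.quota_bytes = quota_bytes
        # key -> size in bytes, ordered from least to most recently accessed
        self.entries = OrderedDict()

    def total_bytes(self) -> int:
        return sum(self.entries.values())

    def read(self, key: str) -> bool:
        """Return True on a cache hit and mark the entry as recently used."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        return False

    def write(self, key: str, size_bytes: int) -> List[str]:
        """Store an entry and return the keys evicted to stay within quota."""
        self.entries[key] = size_bytes
        self.entries.move_to_end(key)
        evicted = []
        # Assumed policy: drop least recently accessed entries first.
        while self.total_bytes() > self.quota_bytes and len(self.entries) > 1:
            old_key, _ = self.entries.popitem(last=False)
            evicted.append(old_key)
        return evicted


cache = RepoCache()
for job in ("job-a", "job-b", "job-c"):
    evicted = cache.write(job, 2 * GIB)
    print(job, "evicted:", evicted)
```

In this model, three concurrent jobs that each write 2 GiB into the same
repository push the oldest entry out. With one fork per contributor, each
fork has its own quota, so the same three writes would not collide.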

> I also think it could be helpful to explicitly document, or reference
> github documentation, on how failure will affect the DAG. I'm assuming
> that if an action fails, its parallel peer actions will run to completion,
> and that the rest of the remaining stages will get canceled, but I haven't
> worked with github actions before.

For matrix jobs, "fail fast" is the default, which cancels all jobs in the
matrix if one fails. Other parallel flows would run to completion by
default.
In the prototype, I have added a GitHub Script step to each job to cancel
the complete workflow when a failure occurs.
Here's an example:
https://github.com/lhotari/pulsar/blob/lh-refactor-pulsar-ci-with-retries/.github/workflows/pulsar-ci.yaml#L281-L289
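
For readers who haven't used GitHub Actions, the idea behind that step can
be sketched in Python. This is not the prototype's code (the prototype uses
a GitHub Script step); it is an equivalent call to the GitHub REST API
endpoint for cancelling a workflow run. The function names are hypothetical,
and I'm assuming the token is passed to the step in a `GITHUB_TOKEN`
environment variable with permission to cancel runs.

```python
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_cancel_request(repository: str, run_id: str,
                         token: str) -> urllib.request.Request:
    """Build the POST request that asks GitHub to cancel a workflow run.

    repository is in "owner/name" form, as in GITHUB_REPOSITORY.
    """
    url = f"{API_ROOT}/repos/{repository}/actions/runs/{run_id}/cancel"
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def cancel_current_run() -> int:
    """Cancel the run this job belongs to and return the HTTP status code."""
    request = build_cancel_request(
        os.environ["GITHUB_REPOSITORY"],  # set by GitHub Actions
        os.environ["GITHUB_RUN_ID"],      # set by GitHub Actions
        os.environ["GITHUB_TOKEN"],       # assumption: passed in explicitly
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

In a workflow, a step like this would be guarded with `if: failure()` so
that it only runs after an earlier step in the same job has failed.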

The prototype follows a "fail fast" design. When a failure occurs, fail
fast and don't continue with other jobs.
The benefit of this is that it reduces resource consumption. This helps
keep the build queue short.
When the build queue is short, developers get quick feedback from CI.

Documenting all details in the PIP document isn't practical.
*I'm hoping to start a separate document on low-level details once there is
high-level acceptance of the proposed "Changes to GitHub Actions based
Pulsar CI".*
Together we can make this happen. We need decisions too. This proposal
cannot stay as a draft forever.
I'm looking forward to hearing from the Pulsar community, Pulsar
committers and Pulsar PMC members on how to take this forward.

BR,
-Lari

On Fri, Mar 12, 2021 at 8:06 AM Michael Marshall <mikemars...@gmail.com> wrote:

> This will be a great improvement. I read through the PIP, and overall, it
> looks good to me.
>
> I left a question on the doc about how concurrent runs affect the
> repository's 5 GB cache limit.
>
> I also think it could be helpful to explicitly document, or reference
> github documentation, on how failure will affect the DAG. I'm assuming that
> if an action fails, its parallel peer actions will run to completion, and
> that the rest of the remaining stages will get canceled, but I haven't
> worked with github actions before.
>
> Thanks for all of the work you've put in so far.
>
> On Thu, Mar 11, 2021 at 6:37 PM Yuva raj <uvar...@gmail.com> wrote:
>
> > This is great news. Thanks Lari, Matteo and the Pulsar community
> >
> > On Fri, Mar 12, 2021, 2:04 AM Lari Hotari <lari.hot...@sagire.fi> wrote:
> >
> > > Dear Pulsar community members,
> > >
> > > The work on "Changes to GitHub Actions based Pulsar CI" has gone forward
> > > based on your feedback. Here are some updates about the work.
> > >
> > > The draft PIP proposal document is here:
> > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry
> > > There's a *detailed status update in the document about a prototype for
> > > the refactored Pulsar CI GitHub Actions based workflow*.
> > >
> > > Thanks for all the suggestions and feedback so far. A lot of improvements
> > > have been made by the Pulsar contributors to overcome the technical
> > > obstacles.
> > > Special thanks go to Matteo for reducing the sizes of docker images. A lot
> > > of small improvements have been made to the Pulsar maven build to enable
> > > the new refactored GitHub Actions workflow. Thank you for all PR reviews
> > > and feedback.
> > >
> > > The main goal of the "Changes to GitHub Actions based Pulsar CI" work has
> > > been to *reduce the resource consumption of the Pulsar CI build and to
> > > speed up Pulsar development by improving the developer productivity* when
> > > less time is wasted in waiting for Pulsar CI build feedback. The prototype
> > > demonstrates these improvements.
> > >
> > > As you can see from the email from Jan 28 below, *the resource consumption
> > > was 19 hrs 36 minutes* for a single pull request that was observed when
> > > the work began.
> > > Now, with the prototype of the refactored Pulsar CI build, the resource
> > > consumption is *7 hrs 9 minutes.*
> > > *This is about a 60% reduction in resource consumption.* The whole
> > > pipeline completes in 75-100 minutes.
> > >
> > > Here's a breakdown of the duration (resource consumption) of each build
> > > job in the refactored workflow:
> > >
> > > Workflow   Job                                         seconds  h:mm:ss
> > > Pulsar CI  Changed files check                               4  0:00:04
> > > Pulsar CI  Go 1.11 Functions                               155  0:02:35
> > > Pulsar CI  Go 1.12 Functions                               166  0:02:46
> > > Pulsar CI  Go 1.13 Functions                               113  0:01:53
> > > Pulsar CI  Go 1.14 Functions                                96  0:01:36
> > > Pulsar CI  Build on MacOS                                 1017  0:16:57
> > > Pulsar CI  Build and License check                         346  0:05:46
> > > Pulsar CI  Build Pulsar CPP and Python clients             683  0:11:23
> > > Pulsar CI  Build Pulsar java-test-image docker image       405  0:06:45
> > > Pulsar CI  CI - Unit - Other                              1580  0:26:20
> > > Pulsar CI  CI - Unit - Brokers - Broker Group 1            968  0:16:08
> > > Pulsar CI  CI - Unit - Brokers - Broker Group 2           2223  0:37:03
> > > Pulsar CI  CI - Unit - Brokers - Client Api               1652  0:27:32
> > > Pulsar CI  CI - Unit - Brokers - Client Impl               916  0:15:16
> > > Pulsar CI  CI - Unit - Brokers - Other                     522  0:08:42
> > > Pulsar CI  CI - Unit - Proxy                               331  0:05:31
> > > Pulsar CI  Build Pulsar docker image                      2343  0:39:03
> > > Pulsar CI  CI - Integration - Shade                        414  0:06:54
> > > Pulsar CI  CI - Integration - Backwards Compatibility      849  0:14:09
> > > Pulsar CI  CI - Integration - Cli                         1490  0:24:50
> > > Pulsar CI  CI - Integration - Messaging                    857  0:14:17
> > > Pulsar CI  CI - Integration - Schema                       468  0:07:48
> > > Pulsar CI  CI - Integration - Standalone                   286  0:04:46
> > > Pulsar CI  CI - Integration - Transaction                  362  0:06:02
> > > Pulsar CI  CI - System - Function State                    699  0:11:39
> > > Pulsar CI  CI - System - Tiered FileSystem                 779  0:12:59
> > > Pulsar CI  CI - System - Tiered JCloud                     529  0:08:49
> > > Pulsar CI  CI - System - Pulsar Connectors - Thread       1795  0:29:55
> > > Pulsar CI  CI - System - Pulsar Connectors - Process      2312  0:38:32
> > > Pulsar CI  CI - System - Sql                              1377  0:22:57
> > > *Total resource consumption*                                    7:08:57
> > >
> > >
> > > GitHub Actions doesn't support restarting a single job
> > > (https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234).
> > > However, this is not a showstopper since there are ways to address the
> > > issues that cause flakiness.
> > > There is a separate PIP for changing the way to handle flaky tests. You
> > > can find the link to that in the "Changes to GitHub Actions based Pulsar
> > > CI" document's header.
> > >
> > > *Some requests for the Pulsar community:*
> > >
> > > 1) *Please take a look at the updated PIP document*:
> > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry
> > > *It also contains more details of the prototype that has been
> > > successfully completed.*
> > >
> > > 2) *Please share your feedback and suggest a way forward.*
> > >
> > > *Thank you for your help!*
> > >
> > > BR, Lari
> > >
> > > On Thu, Jan 28, 2021 at 7:13 PM Lari Hotari <lari.hot...@sagire.fi> wrote:
> > >
> > > > Dear Pulsar community members,
> > > >
> > > > Currently, the Pulsar GitHub Actions workflows are consuming the
> > > > majority of the shared pool of resources allocated for
> > > > github.com/apache projects.
> > > > Other Apache projects have been impacted and there is a demand to
> > > > improve the Pulsar CI
> > > > <https://github.com/apache/pulsar/pull/9159#issuecomment-766915396>
> > > > asap.
> > > >
> > > > In GitHub Actions Runners, the unit of resources is the time that a
> > > > Runner is occupied. I observed the workflow runs for handling a single
> > > > Pull Request (in my personal fork) and these were the running durations:
> > > >
> > > > Workflow name                               Duration
> > > > CI - Build - MacOS                          0:17:23
> > > > CI - Go Functions style check               0:02:38
> > > > CI - Unit - Brokers - Other                 0:15:40
> > > > CI - Unit - Brokers - Client Impl           0:16:28
> > > > CI - Misc                                   0:16:51
> > > > CI - Unit - Proxy                           0:14:23
> > > > CI - Go Functions Tests                     0:22:08
> > > > CI - CPP, Python Tests                      0:23:30
> > > > CI - Unit                                   0:42:11
> > > > CI - Integration - Sql                      1:00:13
> > > > CI - Integration - Tiered JCloud            1:00:18
> > > > CI - Integration - Tiered FileSystem        1:00:13
> > > > CI - Integration - Function State           1:00:12
> > > > CI - Integration - Cli                      1:10:22
> > > > CI - Integration - Transaction              1:16:34
> > > > CI - Integration - Process                  1:11:23
> > > > CI - Shade - Test                           1:15:45
> > > > CI - Unit - Brokers - Client Api            0:26:13
> > > > CI - Unit - Brokers - Broker Group 2        0:35:05
> > > > CI - Integration - Standalone               0:45:29
> > > > CI - Integration - Messaging                1:00:23
> > > > CI - Integration - Thread                   1:00:19
> > > > CI - Integration - Backwards Compatibility  1:00:19
> > > > CI - Integration - Schema                   1:00:19
> > > > CI - Unit - Brokers - Broker Group 1        2:02:31
> > > > TOTAL                                       19:36:50
> > > >
> > > > *In this case, the total resource consumption of GitHub Actions Runners
> > > > is 19 hours 36 minutes 50 seconds for a single pull request to
> > > > apache/pulsar.*
> > > >
> > > > Since GitHub Actions Runner resource pool utilization is very high,
> > > > the build queue grows and takes a long time to process.
> > > >
> > > > I have been looking for ways to improve the Pulsar CI for the last 3
> > > > months. During this period I worked on a few experiments. The learnings
> > > > from the past experiments are documented at a high level in the
> > > > following draft PIP document.
> > > >
> > > > *The draft PIP "Changes to GitHub Actions based Pulsar CI" document is
> > > > a Google doc:*
> > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit?usp=sharing
> > > >
> > > > *Please participate* so that we get the plan adjusted based on the
> > > > feedback asap. If there's already a similar effort ongoing, I hope we
> > > > can join efforts.
> > > >
> > > > *Let's fix Pulsar CI!*
> > > >
> > > > BR, Lari
> > > >
> > >
> >
>
