The PR https://github.com/apache/pulsar/pull/14819 is now ready for review.
The changes in the PR now use GitHub Actions Artifacts for sharing binary files (such as docker images) between the build steps. This saves a lot of GitHub Actions VM resources since the docker images are built once and shared in downstream jobs. GitHub Actions Artifacts are meant to be used for sharing data between the jobs in a GitHub Actions workflow [1]. I'm looking forward to your review and feedback on https://github.com/apache/pulsar/pull/14819 . BR, -Lari References: [1] GitHub Actions: Storing workflow data as artifacts - https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts On 2022/03/23 10:46:10 Lari Hotari wrote: > I have submitted the PR for refactoring the apache/pulsar GitHub Actions > based CI. Please review https://github.com/apache/pulsar/pull/14819 . > > BR, > -Lari > > On 2022/03/22 13:38:36 Enrico Olivelli wrote: > > Lari, > > > > Il Mar 22 Mar 2022, 14:32 Lari Hotari <lhot...@apache.org> ha scritto: > > > > > I have resumed work to improve our GitHub Actions based Pulsar CI. > > > > > > Last year, I worked on a proof-of-concept which significantly reduced the > > > resource consumption and improved the usability of the build by combining > > > multiple workflows into a single larger workflow. > > > > > > The showstopper a year ago was the lack of being able to re-run a single > > > failed job in a larger workflow. > > > GitHub has since then delivered this feature and no showstoppers are > > > present. > > > > > > I have been posting updates to > > > https://github.com/apache/pulsar/issues/14401 "Speed up CI workflows" > > > about the progress. > > > I have rebased the changes from last year's PoC, and I'm finalizing and > > > testing the changes in my fork under > > > https://github.com/lhotari/pulsar/pull/59 . I'll send a PR to > > > apache/pulsar, when the refactoring is ready. > > > > > > > This is great news ! > > > > Looking forward to your patch > > > > Enrico > > > > > > > > > -Lari > > > > > > On 2021/03/16 01:10:52 Sijie Guo wrote: > > > > > The prototype has demonstrated about 60% reduction in > > > > resource consumption. > > > > > > > > It is hard to quantify. Merging them into one large workflow can result > > > in > > > > more failures. Re-running those failures can consume resources as well. > > > > > > > > > Isn't it urgent to resolve it? > > > > > > > > I think we are in a stage that gives us breathing room to fix flaky > > > > tests > > > > and solve other problems, no? > > > > I don't mean we stop the effort here. I mean we have other enhancements > > > > that we can do to improve the situation. > > > > Once we get into a position where the flakiness is reduced, we can merge > > > > them into one workflow. > > > > > > > > - Sijie > > > > > > > > > > > > On Mon, Mar 15, 2021 at 2:48 AM Lari Hotari <l...@hotari.net> wrote: > > > > > > > > > Thanks for the feedback Sijie. > > > > > > > > > > > We are using a lazy consensus approach. Typically if there is no > > > > > objection, > > > > > > please go ahead and not need to wait for approval. > > > > > > If people raise concerns, please address the concerns. > > > > > > > > > > You and Ali have raised concerns about changing the existing GitHub > > > Actions > > > > > workflows in a way where multiple workflows would be combined to a > > > single > > > > > workflow. Before proceeding, there is a need to address the concerns. > > > We > > > > > might end up with a completely different type of solution of what has > > > been > > > > > proposed initially. :) > > > > > > > > > > > Yes. So I am in favor of addressing flaky tests than merging all > > > > > workflows > > > > > > into one giant workflow. > > > > > > > > > > I agree that addressing flaky tests is favorable. The main reason for > > > PIP > > > > > "Changes to GitHub Actions based Pulsar CI" is to > > > > > 1) Reduce GitHub Action Runner resource consumption of Pulsar PR > > > > > builds > > > > > 2) Reduce lead times for Pull Request feedback > > > > > We cannot ignore these problems. If we don't change anything, the > > > problems > > > > > won't get fixed. The prototype has demonstrated about 60% reduction in > > > > > resource consumption. Measuring the lead times hasn't been done in the > > > > > prototype, but since the reason for long lead times has been long > > > > > build > > > > > queues due to excessive resource consumption, it's likely that the > > > > > lead > > > > > times would be reduced. > > > > > > > > > > I know that switching to a single workflow isn't the only solution to > > > the > > > > > above problems. I had a discussion with Ali. He recommended reducing > > > the > > > > > modules in Pulsar repository (PIP-62), reducing the docker container > > > size > > > > > and improving the Pulsar Broker unit test harness so that tests using > > > it > > > > > would be less flaky and that it would be easier to fix the issues in > > > > > failing test when there would be better information about what was the > > > > > state problem that caused the test to fail. > > > > > > > > > > As mentioned in the earlier email about the optimizations in the > > > Pulsar CI > > > > > refactoring prototype, the main benefits come from reusing binary > > > artifacts > > > > > from previous build stages so that each job doesn't have to build > > > > > everything from scratch. This becomes irrelevant when the build is > > > > > very > > > > > fast and there isn't a benefit of reusing artifacts. > > > > > This means that it's possible to resolve the resource consumption > > > problem > > > > > of Pulsar PR builds in the way that Ali is recommending, without > > > switching > > > > > from multiple workflows to a single workflow that can reuse binary > > > > > artifacts in the build. > > > > > > > > > > > Hence I am +1 to "changes to flaky test handing" and suggest > > > > > > focusing > > > > > more > > > > > > on solving flaky tests. > > > > > > Consider merging them into one workflow when the tests are in a > > > better > > > > > > situation. > > > > > > > > > > Makes sense for minimizing the risk of change, but we cannot just wait > > > for > > > > > things to fix themselves. > > > > > How long will other Apache projects tolerate the resource consumption > > > > > issues Pulsar is causing in the shared GitHub Actions Runner VM quota? > > > For > > > > > example, > > > https://github.com/apache/pulsar/pull/9159#issuecomment-766915396 > > > > > . > > > > > Isn't it urgent to resolve it? > > > > > > > > > > I'll revisit the plan for PIP "Changes to GitHub Actions based Pulsar > > > CI" > > > > > based on the community feedback in the upcoming days. That might mean > > > that > > > > > the current solution is pivoted. The goal is to solve the problems of > > > high > > > > > resource consumption and long lead time for PR build in Pulsar CI. > > > Please > > > > > continue to provide feedback so that we get a revisited plan together > > > that > > > > > addresses these problems.Thank you! > > > > > > > > > > BR, > > > > > -Lari > > > > > > > > > > On Fri, Mar 12, 2021 at 11:06 PM Sijie Guo <guosi...@gmail.com> wrote: > > > > > > > > > > > > *Sijie, how far are we from getting the draft PIP "Changes to > > > GitHub > > > > > > Actions based Pulsar CI" into an actual PIP that gets put on the > > > > > > wiki > > > > > > page https://github.com/apache/pulsar/wiki > > > > > > <https://github.com/apache/pulsar/wiki> ?* > > > > > > > > > > > > I see what you referred to before now. This can be easily done. I > > > (or any > > > > > > other committer) can do it for you. > > > > > > > > > > > > There is no real blocker for you to continue work even there are > > > concerns > > > > > > or it is not listed in the PIP. > > > > > > We are using a lazy consensus approach. Typically if there is no > > > > > objection, > > > > > > please go ahead and not need to wait for approval. > > > > > > If people raise concerns, please address the concerns. > > > > > > > > > > > > > The reason why > > > > > > re-runs happen currently is because of the high flakiness of tests. > > > > > > > > > > > > Yes. So I am in favor of addressing flaky tests than merging all > > > > > workflows > > > > > > into one giant workflow. > > > > > > It is not about "No pain, no gain". The community has suffered a lot > > > with > > > > > > giant workflow before. > > > > > > There were a lot of committers and contributors working hard to > > > split one > > > > > > giant workflow into multiple > > > > > > current workflows. Unless there is really strong evidence that > > > merging > > > > > them > > > > > > back to one will improve > > > > > > the entire CI experience, I will still have concerns about one giant > > > > > > workflow approach. > > > > > > > > > > > > Hence I am +1 to "changes to flaky test handing" and suggest > > > > > > focusing > > > > > more > > > > > > on solving flaky tests. > > > > > > Consider merging them into one workflow when the tests are in a > > > better > > > > > > situation. > > > > > > > > > > > > > This solution would also require disabling > > > > > > required status checks > > > > > > > > > > > > I don't think it is a good idea to disable status checks. We can > > > consider > > > > > > running "dark mode" but it will just overload the action quota. > > > > > > > > > > > > Another alternative is to mirror the pull requests into another > > > Github > > > > > > account to test that and get more concrete statistics on the > > > flakiness > > > > > rate > > > > > > of one workflow approach. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021 at 1:57 AM Lari Hotari <l...@hotari.net> wrote: > > > > > > > > > > > > > Thanks for the feedback, Sijie. > > > > > > > > > > > > > > > The "Fail fast" approach is great. That would be super helpful > > > > > > > > if > > > > > there > > > > > > > are > > > > > > > > multiple workflows and each workflow is retryable. > > > > > > > > However, I am not sure how much it will help if you run all > > > workflows > > > > > > in > > > > > > > > one giant workflow. Or is it making things worse? > > > > > > > > > > > > > > We can reduce the need for re-running workflow runs. The reason > > > > > > > why > > > > > > > re-runs happen currently is because of high flakiness of tests. > > > > > > > Addressing flakiness continues to be top-priority. Now that the > > > Pulsar > > > > > CI > > > > > > > workflow prototype is finished, I'll be focusing more in the other > > > > > draft > > > > > > > PIP, "Changes to flaky test handling". > > > > > > > We as a community should address the critical problem that the > > > current > > > > > > > retry solution has: it can mask bugs in production code and make > > > the > > > > > > build > > > > > > > pass and allow changes to be merged that cause regressions. > > > > > > > It's a false sense of security what the green builds after all the > > > > > > retries > > > > > > > bring us. Bringing Pulsar to the next level in stability requires > > > > > > > addressing this. > > > > > > > > > > > > > > If something doesn't work, it can be adapted and improved. Changes > > > can > > > > > be > > > > > > > rolled back and revisited when things go worse. We also need a > > > leap of > > > > > > > faith. > > > > > > > "No pain, no gain", like any change, it will be painful at first, > > > but > > > > > we > > > > > > > will get over the bump. > > > > > > > > > > > > > > > Secondly, your test has been done in your folk where there are > > > not a > > > > > > lot > > > > > > > of > > > > > > > > concurrent pushes and pull requests. I am not sure how your > > > approach > > > > > > will > > > > > > > > behave once it is merged into master. Can you simulate multiple > > > > > > > concurrent > > > > > > > > pull requests in your account to prove your approach doesn't > > > bring > > > > > side > > > > > > > > effects? > > > > > > > > > > > > > > One possibility to address this is to introduce the new workflow > > > in a > > > > > > mode > > > > > > > where you need to opt-in to the new workflow in some way. > > > > > > > This was an idea brought up by my colleagues Enrico and Andrey. > > > > > > > It might be possible to configure the existing workflow and this > > > new > > > > > > > workflow in a way where some condition (for example whitelisted > > > github > > > > > > user > > > > > > > name or a certain keyword in the PR title/description) chooses > > > either > > > > > one > > > > > > > for the Pull request. This solution would also require disabling > > > > > > > required status checks ("Require status checks to pass before > > > merging" > > > > > > > feature in GitHub branch protection rules) since the names of the > > > > > checks > > > > > > > would be different. > > > > > > > > > > > > > > > Lastly, can we apply those optimizations to current workflows > > > without > > > > > > > > merging them into one giant workflow? > > > > > > > > > > > > > > This is what I have been doing. All individual optimizations have > > > > > already > > > > > > > been sent as PRs in the last months. I guess there's 20-30 PRs > > > > > > > that > > > > > have > > > > > > > already been merged. > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/pulsar/pulls?q=is%3Apr+author%3Alhotari+is%3Amerged > > > > > > > There's also 2 build related PRs from yesterday which haven't been > > > > > merged > > > > > > > yet: > > > > > > > Fix Maven download issues (ported from the prototype to our > > > existing > > > > > > Pulsar > > > > > > > CI): https://github.com/apache/pulsar/pull/9883 > > > > > > > improve Maven module build order (required for more efficient > > > builds > > > > > that > > > > > > > selectively build required artifacts): > > > > > > > https://github.com/apache/pulsar/pull/9882 > > > > > > > > > > > > > > There aren't many optimizations left that could be ported from the > > > > > > > prototype to the existing workflow. There are a few, but the > > > impact is > > > > > > > minor. > > > > > > > The reason for this is that the optimization with the greatest > > > impact > > > > > are > > > > > > > the ones that build a binary artifacts (maven libs, docker images) > > > once > > > > > > and > > > > > > > share it with the downstream jobs in the pipeline. > > > > > > > Applying this type of solution has certain challenges when there > > > are > > > > > > > multiple separate workflow. Sharing binary artifacts to other > > > workflows > > > > > > > would require that the workflow to reuse the artifacts gets > > > triggered > > > > > by > > > > > > > the workflow that produced the artifacts. This wouldn't be secure > > > or > > > > > > > practical for handling pull requests. Triggering a workflow > > > explicitly > > > > > > > would require a token from the main repository and using that for > > > pull > > > > > > > request builds would be a serious security vulnerability. (more > > > details > > > > > > > about the GitHub Actions security model in > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://securitylab.github.com/research/github-actions-preventing-pwn-requests > > > > > > > ) > > > > > > > > > > > > > > *Sijie, how far are we from getting the draft PIP "Changes to > > > GitHub > > > > > > > Actions based Pulsar CI" into an actual PIP that gets put on the > > > wiki > > > > > > > page https://github.com/apache/pulsar/wiki > > > > > > > <https://github.com/apache/pulsar/wiki> ?* > > > > > > > > > > > > > > -Lari > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021 at 10:46 AM Sijie Guo <guosi...@gmail.com> > > > wrote: > > > > > > > > > > > > > > > This is good progress. However, my main concern is still merging > > > all > > > > > > > > workflows into one giant workflow. > > > > > > > > > > > > > > > > The "Fail fast" approach is great. That would be super helpful > > > > > > > > if > > > > > there > > > > > > > are > > > > > > > > multiple workflows and each workflow is retryable. > > > > > > > > However, I am not sure how much it will help if you run all > > > workflows > > > > > > in > > > > > > > > one giant workflow. Or is it making things worse? > > > > > > > > > > > > > > > > Secondly, your test has been done in your folk where there are > > > not a > > > > > > lot > > > > > > > of > > > > > > > > concurrent pushes and pull requests. I am not sure how your > > > approach > > > > > > will > > > > > > > > behave once it is merged into master. Can you simulate multiple > > > > > > > concurrent > > > > > > > > pull requests in your account to prove your approach doesn't > > > bring > > > > > side > > > > > > > > effects? > > > > > > > > > > > > > > > > Lastly, can we apply those optimizations to current workflows > > > without > > > > > > > > merging them into one giant workflow? > > > > > > > > > > > > > > > > - Sijie > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021 at 12:30 AM Lari Hotari <l...@hotari.net> > > > > > wrote: > > > > > > > > > > > > > > > > > Thanks for the feedback Michael. > > > > > > > > > > > > > > > > > > > I left a question on the doc about how concurrent runs > > > affect the > > > > > > > > > > repository's 5 GB cache limit. > > > > > > > > > > > > > > > > > > This is a good question. There isn't a clear answer in the > > > GitHub > > > > > > > Actions > > > > > > > > > Cache documentation. > > > > > > > > > > > > > > > > > > The documentation is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows > > > > > > > > > . > > > > > > > > > Based on this document and some testing, I have made these > > > > > > conclusions: > > > > > > > > > For GitHub Actions Cache, pull requests get executed in the > > > context > > > > > > of > > > > > > > > the > > > > > > > > > forked repository. > > > > > > > > > The workflow triggered by a pull request event can only update > > > it's > > > > > > own > > > > > > > > > cache. It has read-only access to upstream caches. > > > > > > > > > If there's a cache miss, the entry will get written to the > > > cache of > > > > > > the > > > > > > > > > forked repository. If the PR could to write to the upstream > > > cache, > > > > > it > > > > > > > > would > > > > > > > > > be a security issue since this would be vulnerable to cache > > > > > poisoning > > > > > > > > > attacks. Each repository has a 5GB quota for writes. The > > > entries > > > > > are > > > > > > > kept > > > > > > > > > up to 7 days. > > > > > > > > > The performance is fairly good. Loading docker images from the > > > > > > > repository > > > > > > > > > happens about 15MB/s. Writing is 2-3x slower, about 5-7MB/s. > > > (the > > > > > > > > > performance of the GHA cache is most likely higher since this > > > is > > > > > the > > > > > > > > > throughput for docker load / docker save) > > > > > > > > > > > > > > > > > > If a single repository has a lot of concurrent jobs, it could > > > start > > > > > > > > > evicting caches. > > > > > > > > > However that isn't likely to happen with the way Pulsar is > > > > > developed > > > > > > > > since > > > > > > > > > pull requests are created from personal forks. > > > > > > > > > > > > > > > > > > > I also think it could be helpful to explicitly document, or > > > > > > reference > > > > > > > > > > github documentation, on how failure will affect the DAG. > > > > > > > > > > I'm > > > > > > > assuming > > > > > > > > > that > > > > > > > > > > if an action fails, its parallel peer actions will run to > > > > > > completion, > > > > > > > > and > > > > > > > > > > that the rest of the remaining stages will get canceled, but > > > I > > > > > > > haven't > > > > > > > > > > worked with github actions before. > > > > > > > > > > > > > > > > > > For matrix jobs, "fail fast" is the default, which cancels all > > > jobs > > > > > > in > > > > > > > > the > > > > > > > > > matrix if one fails. Other parallel flows would run to > > > completion > > > > > by > > > > > > > > > default. > > > > > > > > > In the prototype, I have added a Github script step to each > > > job to > > > > > > > cancel > > > > > > > > > the complete workflow when a failure occurs. > > > > > > > > > Here's an example: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/lhotari/pulsar/blob/lh-refactor-pulsar-ci-with-retries/.github/workflows/pulsar-ci.yaml#L281-L289 > > > > > > > > > > > > > > > > > > The prototype follows a "fail fast" design. When a failure > > > occurs, > > > > > > fail > > > > > > > > > fast and don't continue with other jobs. > > > > > > > > > The benefit of this is that it reduces resource consumption. > > > This > > > > > > helps > > > > > > > > > keep the build queue short. > > > > > > > > > When the build queue is short, developers get quick feedback > > > from > > > > > CI. > > > > > > > > > > > > > > > > > > Documenting all details in the PIP document isn't practical. > > > > > > > > > *I'm hoping to start a separate document on low level details > > > when > > > > > > > there > > > > > > > > is > > > > > > > > > a high level acceptance of the proposed "Changes to GitHub > > > Actions > > > > > > > based > > > > > > > > > Pulsar CI".* > > > > > > > > > Together we can make this happen. We need decisions too. This > > > > > > proposal > > > > > > > > > cannot stay as a draft forever. > > > > > > > > > I'm looking forward to hearing from the Pulsar community, > > > Pulsar > > > > > > > > > committer and Pulsar PMC members how to take this forward. > > > > > > > > > > > > > > > > > > BR, > > > > > > > > > -Lari > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021 at 8:06 AM Michael Marshall < > > > > > > > mikemars...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > This will be a great improvement. I read through the PIP, > > > > > > > > > > and > > > > > > > overall, > > > > > > > > it > > > > > > > > > > looks good to me. > > > > > > > > > > > > > > > > > > > > I left a question on the doc about how concurrent runs > > > affect the > > > > > > > > > > repository's 5 GB cache limit. > > > > > > > > > > > > > > > > > > > > I also think it could be helpful to explicitly document, or > > > > > > reference > > > > > > > > > > github documentation, on how failure will affect the DAG. > > > > > > > > > > I'm > > > > > > > assuming > > > > > > > > > that > > > > > > > > > > if an action fails, its parallel peer actions will run to > > > > > > completion, > > > > > > > > and > > > > > > > > > > that the rest of the remaining stages will get canceled, but > > > I > > > > > > > haven't > > > > > > > > > > worked with github actions before. > > > > > > > > > > > > > > > > > > > > Thanks for all of the work you've put in so far. > > > > > > > > > > > > > > > > > > > > On Thu, Mar 11, 2021 at 6:37 PM Yuva raj <uvar...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > This is great news. Thanks Hari , Mateo and pulsar > > > community > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021, 2:04 AM Lari Hotari < > > > > > lari.hot...@sagire.fi > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Dear Pulsar community members, > > > > > > > > > > > > > > > > > > > > > > > > The work on "Changes to GitHub Actions based Pulsar CI" > > > has > > > > > > gone > > > > > > > > > > forward > > > > > > > > > > > > based on your feedback. Here are some updates about the > > > work. > > > > > > > > > > > > > > > > > > > > > > > > The draft PIP proposal document is here: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry > > > > > > > > > > > > There's a *detailed status update in the document about > > > > > > > > > > > > a > > > > > > > prototype > > > > > > > > > for > > > > > > > > > > > the > > > > > > > > > > > > refactored Pulsar CI GitHub Actions based workflow*. > > > > > > > > > > > > > > > > > > > > > > > > Thanks for all the suggestions and feedback by now. A > > > lot of > > > > > > > > > > improvements > > > > > > > > > > > > have been made by the Pulsar contributors to overcome > > > > > > > > > > > > the > > > > > > > technical > > > > > > > > > > > > obstacles. > > > > > > > > > > > > Special thanks go to Matteo for reducing the sizes of > > > docker > > > > > > > > images. > > > > > > > > > A > > > > > > > > > > > lot > > > > > > > > > > > > of small improvements have been made to the Pulsar maven > > > > > build > > > > > > to > > > > > > > > > > enable > > > > > > > > > > > > the new refactored GitHub Actions workflow. Thank you > > > for all > > > > > > PR > > > > > > > > > > reviews > > > > > > > > > > > > and feedback. > > > > > > > > > > > > > > > > > > > > > > > > The main goal of the "Changes to GitHub Actions based > > > Pulsar > > > > > > CI" > > > > > > > > work > > > > > > > > > > has > > > > > > > > > > > > been to *reduce the resource consumption of the Pulsar > > > > > > > > > > > > CI > > > > > build > > > > > > > and > > > > > > > > > to > > > > > > > > > > > > speed up Pulsar development by improving the developer > > > > > > > > productivity* > > > > > > > > > > when > > > > > > > > > > > > less time is wasted in waiting for Pulsar CI build > > > feedback. > > > > > > The > > > > > > > > > > > prototype > > > > > > > > > > > > demonstrates these improvements. > > > > > > > > > > > > > > > > > > > > > > > > As you can see from the email from Jan 28 below, *the > > > > > resource > > > > > > > > > > > consumption > > > > > > > > > > > > was 19 hrs 36 minutes* for a single pull request that > > > > > > > > > > > > was > > > > > > > observed > > > > > > > > > when > > > > > > > > > > > the > > > > > > > > > > > > work began. > > > > > > > > > > > > Now, with the prototype of the refactored Pulsar CI > > > build, > > > > > the > > > > > > > > > resource > > > > > > > > > > > > consumption is *7 hrs 9 minutes.* > > > > > > > > > > > > *This is about 60% reduction in resource consumption.* > > > The > > > > > > whole > > > > > > > > > > pipeline > > > > > > > > > > > > completes in 75-100 minutes. > > > > > > > > > > > > > > > > > > > > > > > > Here's a breakdown of the duration (resource > > > consumption) of > > > > > > each > > > > > > > > > build > > > > > > > > > > > job > > > > > > > > > > > > in the refactored workflow: > > > > > > > > > > > > Workflow Job seconds h:mm:ss > > > > > > > > > > > > Pulsar CI Changed files check 4 0:00:04 > > > > > > > > > > > > Pulsar CI Go 1.11 Functions 155 0:02:35 > > > > > > > > > > > > Pulsar CI Go 1.12 Functions 166 0:02:46 > > > > > > > > > > > > Pulsar CI Go 1.13 Functions 113 0:01:53 > > > > > > > > > > > > Pulsar CI Go 1.14 Functions 96 0:01:36 > > > > > > > > > > > > Pulsar CI Build on MacOS 1017 0:16:57 > > > > > > > > > > > > Pulsar CI Build and License check 346 0:05:46 > > > > > > > > > > > > Pulsar CI Build Pulsar CPP and Python clients 683 > > > > > > > > > > > > 0:11:23 > > > > > > > > > > > > Pulsar CI Build Pulsar java-test-image docker image 405 > > > > > 0:06:45 > > > > > > > > > > > > Pulsar CI CI - Unit - Other 1580 0:26:20 > > > > > > > > > > > > Pulsar CI CI - Unit - Brokers - Broker Group 1 968 > > > 0:16:08 > > > > > > > > > > > > Pulsar CI CI - Unit - Brokers - Broker Group 2 2223 > > > 0:37:03 > > > > > > > > > > > > Pulsar CI CI - Unit - Brokers - Client Api 1652 0:27:32 > > > > > > > > > > > > Pulsar CI CI - Unit - Brokers - Client Impl 916 0:15:16 > > > > > > > > > > > > Pulsar CI CI - Unit - Brokers - Other 522 0:08:42 > > > > > > > > > > > > Pulsar CI CI - Unit - Proxy 331 0:05:31 > > > > > > > > > > > > Pulsar CI Build Pulsar docker image 2343 0:39:03 > > > > > > > > > > > > Pulsar CI CI - Integration - Shade 414 0:06:54 > > > > > > > > > > > > Pulsar CI CI - Integration - Backwards Compatibility 849 > > > > > > 0:14:09 > > > > > > > > > > > > Pulsar CI CI - Integration - Cli 1490 0:24:50 > > > > > > > > > > > > Pulsar CI CI - Integration - Messaging 857 0:14:17 > > > > > > > > > > > > Pulsar CI CI - Integration - Schema 468 0:07:48 > > > > > > > > > > > > Pulsar CI CI - Integration - Standalone 286 0:04:46 > > > > > > > > > > > > Pulsar CI CI - Integration - Transaction 362 0:06:02 > > > > > > > > > > > > Pulsar CI CI - System - Function State 699 0:11:39 > > > > > > > > > > > > Pulsar CI CI - System - Tiered FileSystem 779 0:12:59 > > > > > > > > > > > > Pulsar CI CI - System - Tiered JCloud 529 0:08:49 > > > > > > > > > > > > Pulsar CI CI - System - Pulsar Connectors - Thread 1795 > > > > > 0:29:55 > > > > > > > > > > > > Pulsar CI CI - System - Pulsar Connectors - Process 2312 > > > > > > 0:38:32 > > > > > > > > > > > > Pulsar CI CI - System - Sql 1377 0:22:57 > > > > > > > > > > > > *Total resource consumption* > > > > > > > > > > > > 7:08:57 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > GitHub Actions doesn't support restarting a single job ( > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234 > > > > > > > > > > > > ). > > > > > > > > > > > > However, this is not a showstopper since there are ways > > > to > > > > > > > address > > > > > > > > > the > > > > > > > > > > > > issues that cause flakiness. > > > > > > > > > > > > There is a separate PIP for changing the way to handle > > > flaky > > > > > > > tests. > > > > > > > > > You > > > > > > > > > > > can > > > > > > > > > > > > find the link to that in the "Changes to GitHub Actions > > > based > > > > > > > > Pulsar > > > > > > > > > > CI" > > > > > > > > > > > > document's header. > > > > > > > > > > > > > > > > > > > > > > > > *Some requests for the Pulsar community:* > > > > > > > > > > > > > > > > > > > > > > > > 1) *Please take a look at the updated PIP document*: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry > > > > > > > > > > > > . *It also contains more details of the prototype that > > > has > > > > > been > > > > > > > > > > > > successfully completed.* > > > > > > > > > > > > > > > > > > > > > > > > 2) *Please share your feedback and suggest a way > > > forward.* > > > > > > > > > > > > > > > > > > > > > > > > *Thank you for your help!* > > > > > > > > > > > > > > > > > > > > > > > > BR, Lari > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jan 28, 2021 at 7:13 PM Lari Hotari < > > > > > > > lari.hot...@sagire.fi > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Dear Pulsar community members, > > > > > > > > > > > > > > > > > > > > > > > > > > Currently, the Pulsar GitHub Actions workflows are > > > > > consuming > > > > > > > the > > > > > > > > > > > majority > > > > > > > > > > > > > of the shared pool of resources allocated for > > > > > > > github.com/apache > > > > > > > > > > > > projects. > > > > > > > > > > > > > Other Apache projects have been impacted and there is > > > > > > > > > > > > > a > > > > > > demand > > > > > > > to > > > > > > > > > > > improve > > > > > > > > > > > > > the Pulsar CI > > > > > > > > > > > > > < > > > > > > > > > > > https://github.com/apache/pulsar/pull/9159#issuecomment-766915396 > > > > > > > > > > > > > > > > > > > > > > asap. > > > > > > > > > > > > > > > > > > > > > > > > > > In GitHub Actions Runners, the unit of resources is > > > > > > > > > > > > > the > > > > > time > > > > > > > > that a > > > > > > > > > > > > Runner > > > > > > > > > > > > > is occupied. I observed the workflow runs for handling > > > a > > > > > > single > > > > > > > > > Pull > > > > > > > > > > > > > Request (in my personal fork) and these were the > > > running > > > > > > > > durations: > > > > > > > > > > > > > Workflow name Duration > > > > > > > > > > > > > CI - Build - MacOS 0:17:23 > > > > > > > > > > > > > CI - Go Functions style check 0:02:38 > > > > > > > > > > > > > CI - Unit - Brokers - Other 0:15:40 > > > > > > > > > > > > > CI - Unit - Brokers - Client Impl 0:16:28 > > > > > > > > > > > > > CI - Misc 0:16:51 > > > > > > > > > > > > > CI - Unit - Proxy 0:14:23 > > > > > > > > > > > > > CI - Go Functions Tests 0:22:08 > > > > > > > > > > > > > CI - CPP, Python Tests 0:23:30 > > > > > > > > > > > > > CI - Unit 0:42:11 > > > > > > > > > > > > > CI - Integration - Sql 1:00:13 > > > > > > > > > > > > > CI - Integration - Tiered JCloud 1:00:18 > > > > > > > > > > > > > CI - Integration - Tiered FileSystem 1:00:13 > > > > > > > > > > > > > CI - Integration - Function State 1:00:12 > > > > > > > > > > > > > CI - Integration - Cli 1:10:22 > > > > > > > > > > > > > CI - Integration - Transaction 1:16:34 > > > > > > > > > > > > > CI - Integration - Process 1:11:23 > > > > > > > > > > > > > CI - Shade - Test 1:15:45 > > > > > > > > > > > > > CI - Unit - Brokers - Client Api 0:26:13 > > > > > > > > > > > > > CI - Unit - Brokers - Broker Group 2 0:35:05 > > > > > > > > > > > > > CI - Integration - Standalone 0:45:29 > > > > > > > > > > > > > CI - Integration - Messaging 1:00:23 > > > > > > > > > > > > > CI - Integration - Thread 1:00:19 > > > > > > > > > > > > > CI - Integration - Backwards Compatibility 1:00:19 > > > > > > > > > > > > > CI - Integration - Schema 1:00:19 > > > > > > > > > > > > > CI - Unit - Brokers - Broker Group 1 2:02:31 > > > > > > > > > > > > > TOTAL 19:36:50 > > > > > > > > > > > > > > > > > > > > > > > > > > *In this case, the total resource consumption of > > > > > > > > > > > > > GitHub > > > > > > Actions > > > > > > > > > > Runners > > > > > > > > > > > > is > > > > > > > > > > > > > 19 hours 36 minutes 50 seconds for a single pull > > > request to > > > > > > > > > > > > apache/pulsar.* > > > > > > > > > > > > > > > > > > > > > > > > > > Since GitHub Actions Runner resource pool utilization > > > is > > > > > very > > > > > > > > high, > > > > > > > > > > > this > > > > > > > > > > > > > leads to the build queue to grow and take a long time > > > to > > > > > > > process. > > > > > > > > > > > > > > > > > > > > > > > > > > I have been looking for ways to improve the Pulsar CI > > > for > > > > > the > > > > > > > > last > > > > > > > > > 3 > > > > > > > > > > > > > months. During this period I worked on a few > > > experiments. > > > > > The > > > > > > > > > > learnings > > > > > > > > > > > > > from the past experiments are documented at a high > > > level in > > > > > > the > > > > > > > > > > > following > > > > > > > > > > > > > draft PIP document. > > > > > > > > > > > > > > > > > > > > > > > > > > *The draft PIP "Changes to GitHub Actions based Pulsar > > > CI" > > > > > > > > document > > > > > > > > > > is > > > > > > > > > > > a > > > > > > > > > > > > > Google doc:* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit?usp=sharing > > > > > > > > > > > > > > > > > > > > > > > > > > *Please participate* so that we get the plan adjusted > > > based > > > > > > on > > > > > > > > the > > > > > > > > > > > > > feedback asap. If there's already a similar effort > > > > > ongoing, I > > > > > > > > hope > > > > > > > > > we > > > > > > > > > > > can > > > > > > > > > > > > > join efforts. > > > > > > > > > > > > > > > > > > > > > > > > > > *Let's fix Pulsar CI!* > > > > > > > > > > > > > > > > > > > > > > > > > > BR, Lari > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >