Thank you for the reviews and feedback. I have started making the switch to the new refactored Pulsar CI.
Merging new PRs is blocked until the switch is ready. The reason for this is that I have merged https://github.com/apache/pulsar/pull/14939 in preparation for merging https://github.com/apache/pulsar/pull/14819 . The GitHub Actions "required checks" change in the refactored Pulsar CI, and there can be only one effective set of "required checks" for a branch.

After the new Pulsar CI workflow PR has been merged, each in-progress PR has to be closed and immediately reopened to pick up the new workflow, and the PR build has to run through the new workflow. Another way to pick up the new workflow is to rebase the PR (or merge master branch changes into it).

Please let me know if you experience any issues with the new Pulsar CI workflow. I'll be on the #testing channel on Pulsar Slack too.

-Lari

On 2022/03/29 15:43:31 Michael Marshall wrote:
> Great work, Lari! It's great news that GitHub's new feature helps this
> valuable work move forward. I look forward to seeing your PR merged,
> and I am happy to help resolve any issues that might pop up.
>
> Thanks,
> Michael
>
> On Tue, Mar 29, 2022 at 7:55 AM Lari Hotari <lhot...@apache.org> wrote:
> >
> > The PR has sufficient reviews, and I'll proceed with merging it today or
> > tomorrow.
> > Please provide feedback now if you want to do that before the PR is merged.
> >
> > Thanks!
> >
> > -Lari
> >
> > On 2022/03/28 20:05:14 Lari Hotari wrote:
> > > The PR https://github.com/apache/pulsar/pull/14819 is now ready for
> > > review.
> > >
> > > The changes in the PR now use GitHub Actions Artifacts for sharing binary
> > > files (such as docker images) between the build steps. This saves a lot
> > > of GitHub Actions VM resources since the docker images are built once and
> > > shared in downstream jobs.
> > > GitHub Actions Artifacts are meant to be used for sharing data between
> > > the jobs in a GitHub Actions workflow [1].
> > >
> > > I'm looking forward to your review and feedback on
> > > https://github.com/apache/pulsar/pull/14819 .
> > >
> > > BR,
> > >
> > > -Lari
> > >
> > > References:
> > > [1] GitHub Actions: Storing workflow data as artifacts -
> > > https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts
> > >
> > > On 2022/03/23 10:46:10 Lari Hotari wrote:
> > > > I have submitted the PR for refactoring the apache/pulsar GitHub
> > > > Actions based CI. Please review
> > > > https://github.com/apache/pulsar/pull/14819 .
> > > >
> > > > BR,
> > > > -Lari
> > > >
> > > > On 2022/03/22 13:38:36 Enrico Olivelli wrote:
> > > > > Lari,
> > > > >
> > > > > On Tue, Mar 22, 2022, 14:32 Lari Hotari <lhot...@apache.org> wrote:
> > > > >
> > > > > > I have resumed work to improve our GitHub Actions based Pulsar CI.
> > > > > >
> > > > > > Last year, I worked on a proof-of-concept which significantly
> > > > > > reduced the resource consumption and improved the usability of the
> > > > > > build by combining multiple workflows into a single larger workflow.
> > > > > >
> > > > > > The showstopper a year ago was the lack of being able to re-run a
> > > > > > single failed job in a larger workflow.
> > > > > > GitHub has since then delivered this feature and no showstoppers are
> > > > > > present.
> > > > > >
> > > > > > I have been posting updates to
> > > > > > https://github.com/apache/pulsar/issues/14401 "Speed up CI workflows"
> > > > > > about the progress.
> > > > > > I have rebased the changes from last year's PoC, and I'm finalizing
> > > > > > and testing the changes in my fork under
> > > > > > https://github.com/lhotari/pulsar/pull/59 . I'll send a PR to
> > > > > > apache/pulsar, when the refactoring is ready.
> > > > > >
> > > > >
> > > > > This is great news!
> > > > >
> > > > > Looking forward to your patch
> > > > >
> > > > > Enrico
> > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2021/03/16 01:10:52 Sijie Guo wrote:
> > > > > > > > The prototype has demonstrated about 60% reduction in
> > > > > > > > resource consumption.
> > > > > > >
> > > > > > > It is hard to quantify. Merging them into one large workflow can
> > > > > > > result in more failures. Re-running those failures can consume
> > > > > > > resources as well.
> > > > > > >
> > > > > > > > Isn't it urgent to resolve it?
> > > > > > >
> > > > > > > I think we are in a stage that gives us breathing room to fix
> > > > > > > flaky tests and solve other problems, no?
> > > > > > > I don't mean we stop the effort here. I mean we have other
> > > > > > > enhancements that we can do to improve the situation.
> > > > > > > Once we get into a position where the flakiness is reduced, we
> > > > > > > can merge them into one workflow.
> > > > > > >
> > > > > > > - Sijie
> > > > > > >
> > > > > > > On Mon, Mar 15, 2021 at 2:48 AM Lari Hotari <l...@hotari.net> wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback Sijie.
> > > > > > > >
> > > > > > > > > We are using a lazy consensus approach. Typically if there is
> > > > > > > > > no objection, please go ahead and not need to wait for approval.
> > > > > > > > > If people raise concerns, please address the concerns.
> > > > > > > >
> > > > > > > > You and Ali have raised concerns about changing the existing
> > > > > > > > GitHub Actions workflows in a way where multiple workflows would
> > > > > > > > be combined to a single workflow. Before proceeding, there is a
> > > > > > > > need to address the concerns.
> > > > > > > > We might end up with a completely different type of solution
> > > > > > > > than what has been proposed initially. :)
> > > > > > > >
> > > > > > > > > Yes. So I am in favor of addressing flaky tests rather than
> > > > > > > > > merging all workflows into one giant workflow.
> > > > > > > >
> > > > > > > > I agree that addressing flaky tests is favorable. The main
> > > > > > > > reason for PIP "Changes to GitHub Actions based Pulsar CI" is to
> > > > > > > > 1) Reduce GitHub Action Runner resource consumption of Pulsar
> > > > > > > > PR builds
> > > > > > > > 2) Reduce lead times for Pull Request feedback
> > > > > > > > We cannot ignore these problems. If we don't change anything,
> > > > > > > > the problems won't get fixed. The prototype has demonstrated
> > > > > > > > about 60% reduction in resource consumption. Measuring the lead
> > > > > > > > times hasn't been done in the prototype, but since the reason
> > > > > > > > for long lead times has been long build queues due to excessive
> > > > > > > > resource consumption, it's likely that the lead times would be
> > > > > > > > reduced.
> > > > > > > >
> > > > > > > > I know that switching to a single workflow isn't the only
> > > > > > > > solution to the above problems. I had a discussion with Ali.
> > > > > > > > He recommended reducing the modules in Pulsar repository
> > > > > > > > (PIP-62), reducing the docker container size and improving the
> > > > > > > > Pulsar Broker unit test harness so that tests using it would be
> > > > > > > > less flaky and so that it would be easier to fix the issues in
> > > > > > > > failing tests, given better information about the state problem
> > > > > > > > that caused the test to fail.
> > > > > > > >
> > > > > > > > As mentioned in the earlier email about the optimizations in the
> > > > > > > > Pulsar CI refactoring prototype, the main benefits come from
> > > > > > > > reusing binary artifacts from previous build stages so that each
> > > > > > > > job doesn't have to build everything from scratch. This becomes
> > > > > > > > irrelevant when the build is very fast and there isn't a benefit
> > > > > > > > of reusing artifacts.
> > > > > > > > This means that it's possible to resolve the resource
> > > > > > > > consumption problem of Pulsar PR builds in the way that Ali is
> > > > > > > > recommending, without switching from multiple workflows to a
> > > > > > > > single workflow that can reuse binary artifacts in the build.
> > > > > > > >
> > > > > > > > > Hence I am +1 to "changes to flaky test handling" and suggest
> > > > > > > > > focusing more on solving flaky tests.
> > > > > > > > > Consider merging them into one workflow when the tests are in
> > > > > > > > > a better situation.
> > > > > > > >
> > > > > > > > Makes sense for minimizing the risk of change, but we cannot
> > > > > > > > just wait for things to fix themselves.
> > > > > > > > How long will other Apache projects tolerate the resource
> > > > > > > > consumption issues Pulsar is causing in the shared GitHub
> > > > > > > > Actions Runner VM quota? For example,
> > > > > > > > https://github.com/apache/pulsar/pull/9159#issuecomment-766915396 .
> > > > > > > > Isn't it urgent to resolve it?
> > > > > > > >
> > > > > > > > I'll revisit the plan for PIP "Changes to GitHub Actions based
> > > > > > > > Pulsar CI" based on the community feedback in the upcoming days.
> > > > > > > > That might mean that the current solution is pivoted. The goal
> > > > > > > > is to solve the problems of high resource consumption and long
> > > > > > > > lead time for PR builds in Pulsar CI. Please continue to provide
> > > > > > > > feedback so that we get a revisited plan together that addresses
> > > > > > > > these problems. Thank you!
> > > > > > > >
> > > > > > > > BR,
> > > > > > > > -Lari
> > > > > > > >
> > > > > > > > On Fri, Mar 12, 2021 at 11:06 PM Sijie Guo <guosi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > > *Sijie, how far are we from getting the draft PIP "Changes
> > > > > > > > > > to GitHub Actions based Pulsar CI" into an actual PIP that
> > > > > > > > > > gets put on the wiki page
> > > > > > > > > > https://github.com/apache/pulsar/wiki ?*
> > > > > > > > >
> > > > > > > > > I see what you referred to before now. This can be easily
> > > > > > > > > done. I (or any other committer) can do it for you.
> > > > > > > > >
> > > > > > > > > There is no real blocker for you to continue work even if
> > > > > > > > > there are concerns or it is not listed in the PIP.
> > > > > > > > > We are using a lazy consensus approach. Typically if there is
> > > > > > > > > no objection, please go ahead and not need to wait for approval.
> > > > > > > > > If people raise concerns, please address the concerns.
> > > > > > > > >
> > > > > > > > > > The reason why re-runs happen currently is because of the
> > > > > > > > > > high flakiness of tests.
> > > > > > > > >
> > > > > > > > > Yes. So I am in favor of addressing flaky tests rather than
> > > > > > > > > merging all workflows into one giant workflow.
> > > > > > > > > It is not about "No pain, no gain". The community has
> > > > > > > > > suffered a lot with giant workflow before.
> > > > > > > > > There were a lot of committers and contributors working hard
> > > > > > > > > to split one giant workflow into multiple current workflows.
> > > > > > > > > Unless there is really strong evidence that merging them back
> > > > > > > > > to one will improve the entire CI experience, I will still
> > > > > > > > > have concerns about one giant workflow approach.
> > > > > > > > >
> > > > > > > > > Hence I am +1 to "changes to flaky test handling" and suggest
> > > > > > > > > focusing more on solving flaky tests.
> > > > > > > > > Consider merging them into one workflow when the tests are in
> > > > > > > > > a better situation.
> > > > > > > > >
> > > > > > > > > > This solution would also require disabling
> > > > > > > > > > required status checks
> > > > > > > > >
> > > > > > > > > I don't think it is a good idea to disable status checks. We
> > > > > > > > > can consider running "dark mode" but it will just overload the
> > > > > > > > > action quota.
> > > > > > > > >
> > > > > > > > > Another alternative is to mirror the pull requests into
> > > > > > > > > another Github account to test that and get more concrete
> > > > > > > > > statistics on the flakiness rate of one workflow approach.
> > > > > > > > >
> > > > > > > > > On Fri, Mar 12, 2021 at 1:57 AM Lari Hotari <l...@hotari.net> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback, Sijie.
> > > > > > > > > >
> > > > > > > > > > > The "Fail fast" approach is great. That would be super
> > > > > > > > > > > helpful if there are multiple workflows and each workflow
> > > > > > > > > > > is retryable.
> > > > > > > > > > > However, I am not sure how much it will help if you run
> > > > > > > > > > > all workflows in one giant workflow. Or is it making
> > > > > > > > > > > things worse?
> > > > > > > > > >
> > > > > > > > > > We can reduce the need for re-running workflow runs. The
> > > > > > > > > > reason why re-runs happen currently is because of high
> > > > > > > > > > flakiness of tests.
> > > > > > > > > > Addressing flakiness continues to be top-priority. Now that
> > > > > > > > > > the Pulsar CI workflow prototype is finished, I'll be
> > > > > > > > > > focusing more on the other draft PIP, "Changes to flaky test
> > > > > > > > > > handling".
> > > > > > > > > > We as a community should address the critical problem that
> > > > > > > > > > the current retry solution has: it can mask bugs in
> > > > > > > > > > production code and make the build pass and allow changes to
> > > > > > > > > > be merged that cause regressions.
> > > > > > > > > > It's a false sense of security what the green builds after
> > > > > > > > > > all the retries bring us. Bringing Pulsar to the next level
> > > > > > > > > > in stability requires addressing this.
> > > > > > > > > >
> > > > > > > > > > If something doesn't work, it can be adapted and improved.
> > > > > > > > > > Changes can be rolled back and revisited when things go
> > > > > > > > > > worse. We also need a leap of faith.
> > > > > > > > > > "No pain, no gain", like any change, it will be painful at
> > > > > > > > > > first, but we will get over the bump.
> > > > > > > > > >
> > > > > > > > > > > Secondly, your test has been done in your fork where
> > > > > > > > > > > there are not a lot of concurrent pushes and pull
> > > > > > > > > > > requests. I am not sure how your approach will behave once
> > > > > > > > > > > it is merged into master. Can you simulate multiple
> > > > > > > > > > > concurrent pull requests in your account to prove your
> > > > > > > > > > > approach doesn't bring side effects?
> > > > > > > > > >
> > > > > > > > > > One possibility to address this is to introduce the new
> > > > > > > > > > workflow in a mode where you need to opt-in to the new
> > > > > > > > > > workflow in some way.
> > > > > > > > > > This was an idea brought up by my colleagues Enrico and
> > > > > > > > > > Andrey.
> > > > > > > > > > It might be possible to configure the existing workflow and
> > > > > > > > > > this new workflow in a way where some condition (for example
> > > > > > > > > > whitelisted github user name or a certain keyword in the PR
> > > > > > > > > > title/description) chooses either one for the Pull request.
> > > > > > > > > > This solution would also require disabling required status
> > > > > > > > > > checks ("Require status checks to pass before merging"
> > > > > > > > > > feature in GitHub branch protection rules) since the names
> > > > > > > > > > of the checks would be different.
> > > > > > > > > >
> > > > > > > > > > > Lastly, can we apply those optimizations to current
> > > > > > > > > > > workflows without merging them into one giant workflow?
> > > > > > > > > >
> > > > > > > > > > This is what I have been doing. All individual
> > > > > > > > > > optimizations have already been sent as PRs in the last
> > > > > > > > > > months. I guess there's 20-30 PRs that have already been
> > > > > > > > > > merged.
> > > > > > > > > > https://github.com/apache/pulsar/pulls?q=is%3Apr+author%3Alhotari+is%3Amerged
> > > > > > > > > > There's also 2 build related PRs from yesterday which
> > > > > > > > > > haven't been merged yet:
> > > > > > > > > > Fix Maven download issues (ported from the prototype to our
> > > > > > > > > > existing Pulsar CI):
> > > > > > > > > > https://github.com/apache/pulsar/pull/9883
> > > > > > > > > > Improve Maven module build order (required for more
> > > > > > > > > > efficient builds that selectively build required artifacts):
> > > > > > > > > > https://github.com/apache/pulsar/pull/9882
> > > > > > > > > >
> > > > > > > > > > There aren't many optimizations left that could be ported
> > > > > > > > > > from the prototype to the existing workflow. There are a
> > > > > > > > > > few, but the impact is minor.
> > > > > > > > > > The reason for this is that the optimizations with the
> > > > > > > > > > greatest impact are the ones that build binary artifacts
> > > > > > > > > > (maven libs, docker images) once and share them with the
> > > > > > > > > > downstream jobs in the pipeline.
> > > > > > > > > > Applying this type of solution has certain challenges when
> > > > > > > > > > there are multiple separate workflows. Sharing binary
> > > > > > > > > > artifacts to other workflows would require that the workflow
> > > > > > > > > > that reuses the artifacts gets triggered by the workflow
> > > > > > > > > > that produced the artifacts. This wouldn't be secure or
> > > > > > > > > > practical for handling pull requests.
> > > > > > > > > > Triggering a workflow explicitly would require a token from
> > > > > > > > > > the main repository and using that for pull request builds
> > > > > > > > > > would be a serious security vulnerability. (more details
> > > > > > > > > > about the GitHub Actions security model in
> > > > > > > > > > https://securitylab.github.com/research/github-actions-preventing-pwn-requests
> > > > > > > > > > )
> > > > > > > > > >
> > > > > > > > > > *Sijie, how far are we from getting the draft PIP "Changes
> > > > > > > > > > to GitHub Actions based Pulsar CI" into an actual PIP that
> > > > > > > > > > gets put on the wiki page
> > > > > > > > > > https://github.com/apache/pulsar/wiki ?*
> > > > > > > > > >
> > > > > > > > > > -Lari
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 12, 2021 at 10:46 AM Sijie Guo <guosi...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > This is good progress. However, my main concern is still
> > > > > > > > > > > merging all workflows into one giant workflow.
> > > > > > > > > > >
> > > > > > > > > > > The "Fail fast" approach is great. That would be super
> > > > > > > > > > > helpful if there are multiple workflows and each workflow
> > > > > > > > > > > is retryable.
> > > > > > > > > > > However, I am not sure how much it will help if you run
> > > > > > > > > > > all workflows in one giant workflow. Or is it making
> > > > > > > > > > > things worse?
> > > > > > > > > > >
> > > > > > > > > > > Secondly, your test has been done in your fork where
> > > > > > > > > > > there are not a lot of concurrent pushes and pull
> > > > > > > > > > > requests. I am not sure how your approach will behave once
> > > > > > > > > > > it is merged into master. Can you simulate multiple
> > > > > > > > > > > concurrent pull requests in your account to prove your
> > > > > > > > > > > approach doesn't bring side effects?
> > > > > > > > > > >
> > > > > > > > > > > Lastly, can we apply those optimizations to current
> > > > > > > > > > > workflows without merging them into one giant workflow?
> > > > > > > > > > >
> > > > > > > > > > > - Sijie
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 12, 2021 at 12:30 AM Lari Hotari <l...@hotari.net> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback Michael.
> > > > > > > > > > > >
> > > > > > > > > > > > > I left a question on the doc about how concurrent runs
> > > > > > > > > > > > > affect the repository's 5 GB cache limit.
> > > > > > > > > > > >
> > > > > > > > > > > > This is a good question. There isn't a clear answer in
> > > > > > > > > > > > the GitHub Actions Cache documentation.
> > > > > > > > > > > >
> > > > > > > > > > > > The documentation is
> > > > > > > > > > > > https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows
> > > > > > > > > > > > .
> > > > > > > > > > > > Based on this document and some testing, I have made
> > > > > > > > > > > > these conclusions:
> > > > > > > > > > > > For GitHub Actions Cache, pull requests get executed in
> > > > > > > > > > > > the context of the forked repository.
> > > > > > > > > > > > The workflow triggered by a pull request event can only
> > > > > > > > > > > > update its own cache. It has read-only access to
> > > > > > > > > > > > upstream caches.
> > > > > > > > > > > > If there's a cache miss, the entry will get written to
> > > > > > > > > > > > the cache of the forked repository. If the PR could
> > > > > > > > > > > > write to the upstream cache, it would be a security
> > > > > > > > > > > > issue since this would be vulnerable to cache poisoning
> > > > > > > > > > > > attacks. Each repository has a 5GB quota for writes. The
> > > > > > > > > > > > entries are kept up to 7 days.
> > > > > > > > > > > > The performance is fairly good. Loading docker images
> > > > > > > > > > > > from the repository happens at about 15MB/s. Writing is
> > > > > > > > > > > > 2-3x slower, about 5-7MB/s. (the performance of the GHA
> > > > > > > > > > > > cache is most likely higher since this is the throughput
> > > > > > > > > > > > for docker load / docker save)
> > > > > > > > > > > >
> > > > > > > > > > > > If a single repository has a lot of concurrent jobs, it
> > > > > > > > > > > > could start evicting caches.
> > > > > > > > > > > > However that isn't likely to happen with the way Pulsar
> > > > > > > > > > > > is developed since pull requests are created from
> > > > > > > > > > > > personal forks.
> > > > > > > > > > > >
> > > > > > > > > > > > > I also think it could be helpful to explicitly
> > > > > > > > > > > > > document, or reference github documentation, on how
> > > > > > > > > > > > > failure will affect the DAG. I'm assuming that if an
> > > > > > > > > > > > > action fails, its parallel peer actions will run to
> > > > > > > > > > > > > completion, and that the rest of the remaining stages
> > > > > > > > > > > > > will get canceled, but I haven't worked with github
> > > > > > > > > > > > > actions before.
> > > > > > > > > > > >
> > > > > > > > > > > > For matrix jobs, "fail fast" is the default, which
> > > > > > > > > > > > cancels all jobs in the matrix if one fails. Other
> > > > > > > > > > > > parallel flows would run to completion by default.
> > > > > > > > > > > > In the prototype, I have added a Github script step to
> > > > > > > > > > > > each job to cancel the complete workflow when a failure
> > > > > > > > > > > > occurs.
> > > > > > > > > > > > Here's an example:
> > > > > > > > > > > > https://github.com/lhotari/pulsar/blob/lh-refactor-pulsar-ci-with-retries/.github/workflows/pulsar-ci.yaml#L281-L289
> > > > > > > > > > > >
> > > > > > > > > > > > The prototype follows a "fail fast" design.
> > > > > > > > > > > > When a failure occurs, fail fast and don't continue with
> > > > > > > > > > > > other jobs.
> > > > > > > > > > > > The benefit of this is that it reduces resource
> > > > > > > > > > > > consumption. This helps keep the build queue short.
> > > > > > > > > > > > When the build queue is short, developers get quick
> > > > > > > > > > > > feedback from CI.
> > > > > > > > > > > >
> > > > > > > > > > > > Documenting all details in the PIP document isn't
> > > > > > > > > > > > practical.
> > > > > > > > > > > > *I'm hoping to start a separate document on low level
> > > > > > > > > > > > details when there is a high level acceptance of the
> > > > > > > > > > > > proposed "Changes to GitHub Actions based Pulsar CI".*
> > > > > > > > > > > > Together we can make this happen. We need decisions
> > > > > > > > > > > > too. This proposal cannot stay as a draft forever.
> > > > > > > > > > > > I'm looking forward to hearing from the Pulsar
> > > > > > > > > > > > community, Pulsar committers and Pulsar PMC members how
> > > > > > > > > > > > to take this forward.
> > > > > > > > > > > >
> > > > > > > > > > > > BR,
> > > > > > > > > > > > -Lari
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Mar 12, 2021 at 8:06 AM Michael Marshall <mikemars...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > This will be a great improvement. I read through the
> > > > > > > > > > > > > PIP, and overall, it looks good to me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I left a question on the doc about how concurrent runs
> > > > > > > > > > > > > affect the repository's 5 GB cache limit.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I also think it could be helpful to explicitly
> > > > > > > > > > > > > document, or reference github documentation, on how
> > > > > > > > > > > > > failure will affect the DAG. I'm assuming that if an
> > > > > > > > > > > > > action fails, its parallel peer actions will run to
> > > > > > > > > > > > > completion, and that the rest of the remaining stages
> > > > > > > > > > > > > will get canceled, but I haven't worked with github
> > > > > > > > > > > > > actions before.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for all of the work you've put in so far.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 11, 2021 at 6:37 PM Yuva raj <uvar...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > This is great news. Thanks Hari, Mateo and pulsar
> > > > > > > > > > > > > > community
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Mar 12, 2021, 2:04 AM Lari Hotari <lari.hot...@sagire.fi> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear Pulsar community members,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The work on "Changes to GitHub Actions based
> > > > > > > > > > > > > > > Pulsar CI" has gone forward based on your
> > > > > > > > > > > > > > > feedback. Here are some updates about the work.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The draft PIP proposal document is here:
> > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry
> > > > > > > > > > > > > > > There's a *detailed status update in the document
> > > > > > > > > > > > > > > about a prototype for the refactored Pulsar CI
> > > > > > > > > > > > > > > GitHub Actions based workflow*.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for all the suggestions and feedback by
> > > > > > > > > > > > > > > now. A lot of improvements have been made by the
> > > > > > > > > > > > > > > Pulsar contributors to overcome the technical
> > > > > > > > > > > > > > > obstacles.
> > > > > > > > > > > > > > > Special thanks go to Matteo for reducing the
> > > > > > > > > > > > > > > sizes of docker images. A lot of small
> > > > > > > > > > > > > > > improvements have been made to the Pulsar maven
> > > > > > > > > > > > > > > build to enable the new refactored GitHub Actions
> > > > > > > > > > > > > > > workflow. Thank you for all PR reviews and
> > > > > > > > > > > > > > > feedback.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The main goal of the "Changes to GitHub Actions based Pulsar CI" work has been to *reduce the resource consumption of the Pulsar CI build and to speed up Pulsar development by improving developer productivity*, since less time is wasted waiting for Pulsar CI build feedback. The prototype demonstrates these improvements.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As you can see from the email from Jan 28 below, *the resource consumption was 19 hrs 36 minutes* for a single pull request when the work began.
> > > > > > > > > > > > > > > Now, with the prototype of the refactored Pulsar CI build, the resource consumption is *7 hrs 9 minutes.*
> > > > > > > > > > > > > > > *This is about a 60% reduction in resource consumption.* The whole pipeline completes in 75-100 minutes.
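As a quick check of the figures quoted above, the reduction can be recomputed from the two totals. This is only a minimal sketch: the helper name `reduction_percent` is made up for illustration, and the inputs are the rounded values from the message (19 h 36 min and 7 h 9 min), not the per-job data.

```python
from datetime import timedelta

# Totals as quoted in the message (rounded to whole minutes).
before = timedelta(hours=19, minutes=36)
after = timedelta(hours=7, minutes=9)


def reduction_percent(before: timedelta, after: timedelta) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    # Dividing one timedelta by another yields a plain float ratio.
    return (1 - after / before) * 100


reduction = reduction_percent(before, after)
print(f"Reduction in runner time: {reduction:.1f}%")
```

With these inputs the result lands a little above 63%, which is consistent with the "about 60%" stated in the message.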
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Here's a breakdown of the duration (resource consumption) of each build job in the refactored workflow:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Workflow | Job | seconds | h:mm:ss
> > > > > > > > > > > > > > > Pulsar CI | Changed files check | 4 | 0:00:04
> > > > > > > > > > > > > > > Pulsar CI | Go 1.11 Functions | 155 | 0:02:35
> > > > > > > > > > > > > > > Pulsar CI | Go 1.12 Functions | 166 | 0:02:46
> > > > > > > > > > > > > > > Pulsar CI | Go 1.13 Functions | 113 | 0:01:53
> > > > > > > > > > > > > > > Pulsar CI | Go 1.14 Functions | 96 | 0:01:36
> > > > > > > > > > > > > > > Pulsar CI | Build on MacOS | 1017 | 0:16:57
> > > > > > > > > > > > > > > Pulsar CI | Build and License check | 346 | 0:05:46
> > > > > > > > > > > > > > > Pulsar CI | Build Pulsar CPP and Python clients | 683 | 0:11:23
> > > > > > > > > > > > > > > Pulsar CI | Build Pulsar java-test-image docker image | 405 | 0:06:45
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Other | 1580 | 0:26:20
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Brokers - Broker Group 1 | 968 | 0:16:08
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Brokers - Broker Group 2 | 2223 | 0:37:03
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Brokers - Client Api | 1652 | 0:27:32
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Brokers - Client Impl | 916 | 0:15:16
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Brokers - Other | 522 | 0:08:42
> > > > > > > > > > > > > > > Pulsar CI | CI - Unit - Proxy | 331 | 0:05:31
> > > > > > > > > > > > > > > Pulsar CI | Build Pulsar docker image | 2343 | 0:39:03
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Shade | 414 | 0:06:54
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Backwards Compatibility | 849 | 0:14:09
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Cli | 1490 | 0:24:50
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Messaging | 857 | 0:14:17
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Schema | 468 | 0:07:48
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Standalone | 286 | 0:04:46
> > > > > > > > > > > > > > > Pulsar CI | CI - Integration - Transaction | 362 | 0:06:02
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Function State | 699 | 0:11:39
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Tiered FileSystem | 779 | 0:12:59
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Tiered JCloud | 529 | 0:08:49
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Pulsar Connectors - Thread | 1795 | 0:29:55
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Pulsar Connectors - Process | 2312 | 0:38:32
> > > > > > > > > > > > > > > Pulsar CI | CI - System - Sql | 1377 | 0:22:57
> > > > > > > > > > > > > > > *Total resource consumption* 7:08:57
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > GitHub Actions doesn't support restarting a single job (https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234).
> > > > > > > > > > > > > > > However, this is not a showstopper since there are ways to address the issues that cause flakiness.
> > > > > > > > > > > > > > > There is a separate PIP for changing the way to handle flaky tests. You can find the link to that in the "Changes to GitHub Actions based Pulsar CI" document's header.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Some requests for the Pulsar community:*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) *Please take a look at the updated PIP document*:
> > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry .
> > > > > > > > > > > > > > > *It also contains more details of the prototype that has been successfully completed.*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) *Please share your feedback and suggest a way forward.*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Thank you for your help!*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > BR, Lari
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Jan 28, 2021 at 7:13 PM Lari Hotari <lari.hot...@sagire.fi> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Dear Pulsar community members,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Currently, the Pulsar GitHub Actions workflows are consuming the majority of the shared pool of resources allocated for github.com/apache projects.
> > > > > > > > > > > > > > > > Other Apache projects have been impacted and there is a demand to improve the Pulsar CI <https://github.com/apache/pulsar/pull/9159#issuecomment-766915396> asap.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In GitHub Actions Runners, the unit of resources is the time that a Runner is occupied. I observed the workflow runs for handling a single Pull Request (in my personal fork) and these were the running durations:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Workflow name | Duration
> > > > > > > > > > > > > > > > CI - Build - MacOS | 0:17:23
> > > > > > > > > > > > > > > > CI - Go Functions style check | 0:02:38
> > > > > > > > > > > > > > > > CI - Unit - Brokers - Other | 0:15:40
> > > > > > > > > > > > > > > > CI - Unit - Brokers - Client Impl | 0:16:28
> > > > > > > > > > > > > > > > CI - Misc | 0:16:51
> > > > > > > > > > > > > > > > CI - Unit - Proxy | 0:14:23
> > > > > > > > > > > > > > > > CI - Go Functions Tests | 0:22:08
> > > > > > > > > > > > > > > > CI - CPP, Python Tests | 0:23:30
> > > > > > > > > > > > > > > > CI - Unit | 0:42:11
> > > > > > > > > > > > > > > > CI - Integration - Sql | 1:00:13
> > > > > > > > > > > > > > > > CI - Integration - Tiered JCloud | 1:00:18
> > > > > > > > > > > > > > > > CI - Integration - Tiered FileSystem | 1:00:13
> > > > > > > > > > > > > > > > CI - Integration - Function State | 1:00:12
> > > > > > > > > > > > > > > > CI - Integration - Cli | 1:10:22
> > > > > > > > > > > > > > > > CI - Integration - Transaction | 1:16:34
> > > > > > > > > > > > > > > > CI - Integration - Process | 1:11:23
> > > > > > > > > > > > > > > > CI - Shade - Test | 1:15:45
> > > > > > > > > > > > > > > > CI - Unit - Brokers - Client Api | 0:26:13
> > > > > > > > > > > > > > > > CI - Unit - Brokers - Broker Group 2 | 0:35:05
> > > > > > > > > > > > > > > > CI - Integration - Standalone | 0:45:29
> > > > > > > > > > > > > > > > CI - Integration - Messaging | 1:00:23
> > > > > > > > > > > > > > > > CI - Integration - Thread | 1:00:19
> > > > > > > > > > > > > > > > CI - Integration - Backwards Compatibility | 1:00:19
> > > > > > > > > > > > > > > > CI - Integration - Schema | 1:00:19
> > > > > > > > > > > > > > > > CI - Unit - Brokers - Broker Group 1 | 2:02:31
> > > > > > > > > > > > > > > > TOTAL | 19:36:50
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *In this case, the total resource consumption of GitHub Actions Runners is 19 hours 36 minutes 50 seconds for a single pull request to apache/pulsar.*
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Since GitHub Actions Runner resource pool utilization is very high, the build queue grows and takes a long time to process.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have been looking for ways to improve the Pulsar CI for the last 3 months. During this period I worked on a few experiments. The learnings from the past experiments are documented at a high level in the following draft PIP document.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *The draft PIP "Changes to GitHub Actions based Pulsar CI" document is a Google doc:*
> > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit?usp=sharing
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *Please participate* so that we get the plan adjusted based on the feedback asap. If there's already a similar effort ongoing, I hope we can join efforts.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *Let's fix Pulsar CI!*
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > BR, Lari
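The TOTAL in the Jan 28 message can be reproduced by summing the per-workflow durations listed there. The sketch below does that in Python; the helper names `to_seconds` and `to_hms` are hypothetical, and the duration strings are copied from the quoted table in the order they appear.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Durations of the 25 separate workflows observed for one pull request.
durations = [
    "0:17:23", "0:02:38", "0:15:40", "0:16:28", "0:16:51",
    "0:14:23", "0:22:08", "0:23:30", "0:42:11", "1:00:13",
    "1:00:18", "1:00:13", "1:00:12", "1:10:22", "1:16:34",
    "1:11:23", "1:15:45", "0:26:13", "0:35:05", "0:45:29",
    "1:00:23", "1:00:19", "1:00:19", "1:00:19", "2:02:31",
]

total = to_hms(sum(to_seconds(d) for d in durations))
print(f"Total runner time: {total}")
```

The same two helpers can be applied to the `seconds` column of the refactored workflow's breakdown to confirm its total.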