Thanks for the feedback Sijie. > We are using a lazy consensus approach. Typically if there is no objection, > please go ahead and not need to wait for approval. > If people raise concerns, please address the concerns.
You and Ali have raised concerns about changing the existing GitHub Actions workflows in a way where multiple workflows would be combined to a single workflow. Before proceeding, there is a need to address the concerns. We might end up with a completely different type of solution of what has been proposed initially. :) > Yes. So I am in favor of addressing flaky tests than merging all workflows > into one giant workflow. I agree that addressing flaky tests is favorable. The main reason for PIP "Changes to GitHub Actions based Pulsar CI" is to 1) Reduce GitHub Action Runner resource consumption of Pulsar PR builds 2) Reduce lead times for Pull Request feedback We cannot ignore these problems. If we don't change anything, the problems won't get fixed. The prototype has demonstrated about 60% reduction in resource consumption. Measuring the lead times hasn't been done in the prototype, but since the reason for long lead times has been long build queues due to excessive resource consumption, it's likely that the lead times would be reduced. I know that switching to a single workflow isn't the only solution to the above problems. I had a discussion with Ali. He recommended reducing the modules in Pulsar repository (PIP-62), reducing the docker container size and improving the Pulsar Broker unit test harness so that tests using it would be less flaky and that it would be easier to fix the issues in failing test when there would be better information about what was the state problem that caused the test to fail. As mentioned in the earlier email about the optimizations in the Pulsar CI refactoring prototype, the main benefits come from reusing binary artifacts from previous build stages so that each job doesn't have to build everything from scratch. This becomes irrelevant when the build is very fast and there isn't a benefit of reusing artifacts. This means that it's possible to resolve the resource consumption problem of Pulsar PR builds in the way that Ali is recommending, without switching from multiple workflows to a single workflow that can reuse binary artifacts in the build. > Hence I am +1 to "changes to flaky test handing" and suggest focusing more > on solving flaky tests. > Consider merging them into one workflow when the tests are in a better > situation. Makes sense for minimizing the risk of change, but we cannot just wait for things to fix themselves. How long will other Apache projects tolerate the resource consumption issues Pulsar is causing in the shared GitHub Actions Runner VM quota? For example, https://github.com/apache/pulsar/pull/9159#issuecomment-766915396 . Isn't it urgent to resolve it? I'll revisit the plan for PIP "Changes to GitHub Actions based Pulsar CI" based on the community feedback in the upcoming days. That might mean that the current solution is pivoted. The goal is to solve the problems of high resource consumption and long lead time for PR build in Pulsar CI. Please continue to provide feedback so that we get a revisited plan together that addresses these problems.Thank you! BR, -Lari On Fri, Mar 12, 2021 at 11:06 PM Sijie Guo <guosi...@gmail.com> wrote: > > *Sijie, how far are we from getting the draft PIP "Changes to GitHub > Actions based Pulsar CI" into an actual PIP that gets put on the wiki > page https://github.com/apache/pulsar/wiki > <https://github.com/apache/pulsar/wiki> ?* > > I see what you referred to before now. This can be easily done. I (or any > other committer) can do it for you. > > There is no real blocker for you to continue work even there are concerns > or it is not listed in the PIP. > We are using a lazy consensus approach. Typically if there is no objection, > please go ahead and not need to wait for approval. > If people raise concerns, please address the concerns. > > > The reason why > re-runs happen currently is because of the high flakiness of tests. > > Yes. So I am in favor of addressing flaky tests than merging all workflows > into one giant workflow. > It is not about "No pain, no gain". The community has suffered a lot with > giant workflow before. > There were a lot of committers and contributors working hard to split one > giant workflow into multiple > current workflows. Unless there is really strong evidence that merging them > back to one will improve > the entire CI experience, I will still have concerns about one giant > workflow approach. > > Hence I am +1 to "changes to flaky test handing" and suggest focusing more > on solving flaky tests. > Consider merging them into one workflow when the tests are in a better > situation. > > > This solution would also require disabling > required status checks > > I don't think it is a good idea to disable status checks. We can consider > running "dark mode" but it will just overload the action quota. > > Another alternative is to mirror the pull requests into another Github > account to test that and get more concrete statistics on the flakiness rate > of one workflow approach. > > > > > > > On Fri, Mar 12, 2021 at 1:57 AM Lari Hotari <l...@hotari.net> wrote: > > > Thanks for the feedback, Sijie. > > > > > The "Fail fast" approach is great. That would be super helpful if there > > are > > > multiple workflows and each workflow is retryable. > > > However, I am not sure how much it will help if you run all workflows > in > > > one giant workflow. Or is it making things worse? > > > > We can reduce the need for re-running workflow runs. The reason why > > re-runs happen currently is because of high flakiness of tests. > > Addressing flakiness continues to be top-priority. Now that the Pulsar CI > > workflow prototype is finished, I'll be focusing more in the other draft > > PIP, "Changes to flaky test handling". > > We as a community should address the critical problem that the current > > retry solution has: it can mask bugs in production code and make the > build > > pass and allow changes to be merged that cause regressions. > > It's a false sense of security what the green builds after all the > retries > > bring us. Bringing Pulsar to the next level in stability requires > > addressing this. > > > > If something doesn't work, it can be adapted and improved. Changes can be > > rolled back and revisited when things go worse. We also need a leap of > > faith. > > "No pain, no gain", like any change, it will be painful at first, but we > > will get over the bump. > > > > > Secondly, your test has been done in your folk where there are not a > lot > > of > > > concurrent pushes and pull requests. I am not sure how your approach > will > > > behave once it is merged into master. Can you simulate multiple > > concurrent > > > pull requests in your account to prove your approach doesn't bring side > > > effects? > > > > One possibility to address this is to introduce the new workflow in a > mode > > where you need to opt-in to the new workflow in some way. > > This was an idea brought up by my colleagues Enrico and Andrey. > > It might be possible to configure the existing workflow and this new > > workflow in a way where some condition (for example whitelisted github > user > > name or a certain keyword in the PR title/description) chooses either one > > for the Pull request. This solution would also require disabling > > required status checks ("Require status checks to pass before merging" > > feature in GitHub branch protection rules) since the names of the checks > > would be different. > > > > > Lastly, can we apply those optimizations to current workflows without > > > merging them into one giant workflow? > > > > This is what I have been doing. All individual optimizations have already > > been sent as PRs in the last months. I guess there's 20-30 PRs that have > > already been merged. > > > > > https://github.com/apache/pulsar/pulls?q=is%3Apr+author%3Alhotari+is%3Amerged > > There's also 2 build related PRs from yesterday which haven't been merged > > yet: > > Fix Maven download issues (ported from the prototype to our existing > Pulsar > > CI): https://github.com/apache/pulsar/pull/9883 > > improve Maven module build order (required for more efficient builds that > > selectively build required artifacts): > > https://github.com/apache/pulsar/pull/9882 > > > > There aren't many optimizations left that could be ported from the > > prototype to the existing workflow. There are a few, but the impact is > > minor. > > The reason for this is that the optimization with the greatest impact are > > the ones that build a binary artifacts (maven libs, docker images) once > and > > share it with the downstream jobs in the pipeline. > > Applying this type of solution has certain challenges when there are > > multiple separate workflow. Sharing binary artifacts to other workflows > > would require that the workflow to reuse the artifacts gets triggered by > > the workflow that produced the artifacts. This wouldn't be secure or > > practical for handling pull requests. Triggering a workflow explicitly > > would require a token from the main repository and using that for pull > > request builds would be a serious security vulnerability. (more details > > about the GitHub Actions security model in > > > > > https://securitylab.github.com/research/github-actions-preventing-pwn-requests > > ) > > > > *Sijie, how far are we from getting the draft PIP "Changes to GitHub > > Actions based Pulsar CI" into an actual PIP that gets put on the wiki > > page https://github.com/apache/pulsar/wiki > > <https://github.com/apache/pulsar/wiki> ?* > > > > -Lari > > > > > > > > On Fri, Mar 12, 2021 at 10:46 AM Sijie Guo <guosi...@gmail.com> wrote: > > > > > This is good progress. However, my main concern is still merging all > > > workflows into one giant workflow. > > > > > > The "Fail fast" approach is great. That would be super helpful if there > > are > > > multiple workflows and each workflow is retryable. > > > However, I am not sure how much it will help if you run all workflows > in > > > one giant workflow. Or is it making things worse? > > > > > > Secondly, your test has been done in your folk where there are not a > lot > > of > > > concurrent pushes and pull requests. I am not sure how your approach > will > > > behave once it is merged into master. Can you simulate multiple > > concurrent > > > pull requests in your account to prove your approach doesn't bring side > > > effects? > > > > > > Lastly, can we apply those optimizations to current workflows without > > > merging them into one giant workflow? > > > > > > - Sijie > > > > > > > > > On Fri, Mar 12, 2021 at 12:30 AM Lari Hotari <l...@hotari.net> wrote: > > > > > > > Thanks for the feedback Michael. > > > > > > > > > I left a question on the doc about how concurrent runs affect the > > > > > repository's 5 GB cache limit. > > > > > > > > This is a good question. There isn't a clear answer in the GitHub > > Actions > > > > Cache documentation. > > > > > > > > The documentation is > > > > > > > > > > > > > > https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows > > > > . > > > > Based on this document and some testing, I have made these > conclusions: > > > > For GitHub Actions Cache, pull requests get executed in the context > of > > > the > > > > forked repository. > > > > The workflow triggered by a pull request event can only update it's > own > > > > cache. It has read-only access to upstream caches. > > > > If there's a cache miss, the entry will get written to the cache of > the > > > > forked repository. If the PR could to write to the upstream cache, it > > > would > > > > be a security issue since this would be vulnerable to cache poisoning > > > > attacks. Each repository has a 5GB quota for writes. The entries are > > kept > > > > up to 7 days. > > > > The performance is fairly good. Loading docker images from the > > repository > > > > happens about 15MB/s. Writing is 2-3x slower, about 5-7MB/s. (the > > > > performance of the GHA cache is most likely higher since this is the > > > > throughput for docker load / docker save) > > > > > > > > If a single repository has a lot of concurrent jobs, it could start > > > > evicting caches. > > > > However that isn't likely to happen with the way Pulsar is developed > > > since > > > > pull requests are created from personal forks. > > > > > > > > > I also think it could be helpful to explicitly document, or > reference > > > > > github documentation, on how failure will affect the DAG. I'm > > assuming > > > > that > > > > > if an action fails, its parallel peer actions will run to > completion, > > > and > > > > > that the rest of the remaining stages will get canceled, but I > > haven't > > > > > worked with github actions before. > > > > > > > > For matrix jobs, "fail fast" is the default, which cancels all jobs > in > > > the > > > > matrix if one fails. Other parallel flows would run to completion by > > > > default. > > > > In the prototype, I have added a Github script step to each job to > > cancel > > > > the complete workflow when a failure occurs. > > > > Here's an example: > > > > > > > > > > > > > > https://github.com/lhotari/pulsar/blob/lh-refactor-pulsar-ci-with-retries/.github/workflows/pulsar-ci.yaml#L281-L289 > > > > > > > > The prototype follows a "fail fast" design. When a failure occurs, > fail > > > > fast and don't continue with other jobs. > > > > The benefit of this is that it reduces resource consumption. This > helps > > > > keep the build queue short. > > > > When the build queue is short, developers get quick feedback from CI. > > > > > > > > Documenting all details in the PIP document isn't practical. > > > > *I'm hoping to start a separate document on low level details when > > there > > > is > > > > a high level acceptance of the proposed "Changes to GitHub Actions > > based > > > > Pulsar CI".* > > > > Together we can make this happen. We need decisions too. This > proposal > > > > cannot stay as a draft forever. > > > > I'm looking forward to hearing from the Pulsar community, Pulsar > > > > committer and Pulsar PMC members how to take this forward. > > > > > > > > BR, > > > > -Lari > > > > > > > > On Fri, Mar 12, 2021 at 8:06 AM Michael Marshall < > > mikemars...@gmail.com> > > > > wrote: > > > > > > > > > This will be a great improvement. I read through the PIP, and > > overall, > > > it > > > > > looks good to me. > > > > > > > > > > I left a question on the doc about how concurrent runs affect the > > > > > repository's 5 GB cache limit. > > > > > > > > > > I also think it could be helpful to explicitly document, or > reference > > > > > github documentation, on how failure will affect the DAG. I'm > > assuming > > > > that > > > > > if an action fails, its parallel peer actions will run to > completion, > > > and > > > > > that the rest of the remaining stages will get canceled, but I > > haven't > > > > > worked with github actions before. > > > > > > > > > > Thanks for all of the work you've put in so far. > > > > > > > > > > On Thu, Mar 11, 2021 at 6:37 PM Yuva raj <uvar...@gmail.com> > wrote: > > > > > > > > > > > This is great news. Thanks Hari , Mateo and pulsar community > > > > > > > > > > > > On Fri, Mar 12, 2021, 2:04 AM Lari Hotari <lari.hot...@sagire.fi > > > > > > wrote: > > > > > > > > > > > > > Dear Pulsar community members, > > > > > > > > > > > > > > The work on "Changes to GitHub Actions based Pulsar CI" has > gone > > > > > forward > > > > > > > based on your feedback. Here are some updates about the work. > > > > > > > > > > > > > > The draft PIP proposal document is here: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry > > > > > > > There's a *detailed status update in the document about a > > prototype > > > > for > > > > > > the > > > > > > > refactored Pulsar CI GitHub Actions based workflow*. > > > > > > > > > > > > > > Thanks for all the suggestions and feedback by now. A lot of > > > > > improvements > > > > > > > have been made by the Pulsar contributors to overcome the > > technical > > > > > > > obstacles. > > > > > > > Special thanks go to Matteo for reducing the sizes of docker > > > images. > > > > A > > > > > > lot > > > > > > > of small improvements have been made to the Pulsar maven build > to > > > > > enable > > > > > > > the new refactored GitHub Actions workflow. Thank you for all > PR > > > > > reviews > > > > > > > and feedback. > > > > > > > > > > > > > > The main goal of the "Changes to GitHub Actions based Pulsar > CI" > > > work > > > > > has > > > > > > > been to *reduce the resource consumption of the Pulsar CI build > > and > > > > to > > > > > > > speed up Pulsar development by improving the developer > > > productivity* > > > > > when > > > > > > > less time is wasted in waiting for Pulsar CI build feedback. > The > > > > > > prototype > > > > > > > demonstrates these improvements. > > > > > > > > > > > > > > As you can see from the email from Jan 28 below, *the resource > > > > > > consumption > > > > > > > was 19 hrs 36 minutes* for a single pull request that was > > observed > > > > when > > > > > > the > > > > > > > work began. > > > > > > > Now, with the prototype of the refactored Pulsar CI build, the > > > > resource > > > > > > > consumption is *7 hrs 9 minutes.* > > > > > > > *This is about 60% reduction in resource consumption.* The > whole > > > > > pipeline > > > > > > > completes in 75-100 minutes. > > > > > > > > > > > > > > Here's a breakdown of the duration (resource consumption) of > each > > > > build > > > > > > job > > > > > > > in the refactored workflow: > > > > > > > Workflow Job seconds h:mm:ss > > > > > > > Pulsar CI Changed files check 4 0:00:04 > > > > > > > Pulsar CI Go 1.11 Functions 155 0:02:35 > > > > > > > Pulsar CI Go 1.12 Functions 166 0:02:46 > > > > > > > Pulsar CI Go 1.13 Functions 113 0:01:53 > > > > > > > Pulsar CI Go 1.14 Functions 96 0:01:36 > > > > > > > Pulsar CI Build on MacOS 1017 0:16:57 > > > > > > > Pulsar CI Build and License check 346 0:05:46 > > > > > > > Pulsar CI Build Pulsar CPP and Python clients 683 0:11:23 > > > > > > > Pulsar CI Build Pulsar java-test-image docker image 405 0:06:45 > > > > > > > Pulsar CI CI - Unit - Other 1580 0:26:20 > > > > > > > Pulsar CI CI - Unit - Brokers - Broker Group 1 968 0:16:08 > > > > > > > Pulsar CI CI - Unit - Brokers - Broker Group 2 2223 0:37:03 > > > > > > > Pulsar CI CI - Unit - Brokers - Client Api 1652 0:27:32 > > > > > > > Pulsar CI CI - Unit - Brokers - Client Impl 916 0:15:16 > > > > > > > Pulsar CI CI - Unit - Brokers - Other 522 0:08:42 > > > > > > > Pulsar CI CI - Unit - Proxy 331 0:05:31 > > > > > > > Pulsar CI Build Pulsar docker image 2343 0:39:03 > > > > > > > Pulsar CI CI - Integration - Shade 414 0:06:54 > > > > > > > Pulsar CI CI - Integration - Backwards Compatibility 849 > 0:14:09 > > > > > > > Pulsar CI CI - Integration - Cli 1490 0:24:50 > > > > > > > Pulsar CI CI - Integration - Messaging 857 0:14:17 > > > > > > > Pulsar CI CI - Integration - Schema 468 0:07:48 > > > > > > > Pulsar CI CI - Integration - Standalone 286 0:04:46 > > > > > > > Pulsar CI CI - Integration - Transaction 362 0:06:02 > > > > > > > Pulsar CI CI - System - Function State 699 0:11:39 > > > > > > > Pulsar CI CI - System - Tiered FileSystem 779 0:12:59 > > > > > > > Pulsar CI CI - System - Tiered JCloud 529 0:08:49 > > > > > > > Pulsar CI CI - System - Pulsar Connectors - Thread 1795 0:29:55 > > > > > > > Pulsar CI CI - System - Pulsar Connectors - Process 2312 > 0:38:32 > > > > > > > Pulsar CI CI - System - Sql 1377 0:22:57 > > > > > > > *Total resource consumption* > > > > > > > 7:08:57 > > > > > > > > > > > > > > > > > > > > > GitHub Actions doesn't support restarting a single job ( > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234 > > > > > > > ). > > > > > > > However, this is not a showstopper since there are ways to > > address > > > > the > > > > > > > issues that cause flakiness. > > > > > > > There is a separate PIP for changing the way to handle flaky > > tests. > > > > You > > > > > > can > > > > > > > find the link to that in the "Changes to GitHub Actions based > > > Pulsar > > > > > CI" > > > > > > > document's header. > > > > > > > > > > > > > > *Some requests for the Pulsar community:* > > > > > > > > > > > > > > 1) *Please take a look at the updated PIP document*: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#heading=h.f53rkcu20sry > > > > > > > . *It also contains more details of the prototype that has been > > > > > > > successfully completed.* > > > > > > > > > > > > > > 2) *Please share your feedback and suggest a way forward.* > > > > > > > > > > > > > > *Thank you for your help!* > > > > > > > > > > > > > > BR, Lari > > > > > > > > > > > > > > On Thu, Jan 28, 2021 at 7:13 PM Lari Hotari < > > lari.hot...@sagire.fi > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Dear Pulsar community members, > > > > > > > > > > > > > > > > Currently, the Pulsar GitHub Actions workflows are consuming > > the > > > > > > majority > > > > > > > > of the shared pool of resources allocated for > > github.com/apache > > > > > > > projects. > > > > > > > > Other Apache projects have been impacted and there is a > demand > > to > > > > > > improve > > > > > > > > the Pulsar CI > > > > > > > > < > > > https://github.com/apache/pulsar/pull/9159#issuecomment-766915396 > > > > > > > > > > > > asap. > > > > > > > > > > > > > > > > In GitHub Actions Runners, the unit of resources is the time > > > that a > > > > > > > Runner > > > > > > > > is occupied. I observed the workflow runs for handling a > single > > > > Pull > > > > > > > > Request (in my personal fork) and these were the running > > > durations: > > > > > > > > Workflow name Duration > > > > > > > > CI - Build - MacOS 0:17:23 > > > > > > > > CI - Go Functions style check 0:02:38 > > > > > > > > CI - Unit - Brokers - Other 0:15:40 > > > > > > > > CI - Unit - Brokers - Client Impl 0:16:28 > > > > > > > > CI - Misc 0:16:51 > > > > > > > > CI - Unit - Proxy 0:14:23 > > > > > > > > CI - Go Functions Tests 0:22:08 > > > > > > > > CI - CPP, Python Tests 0:23:30 > > > > > > > > CI - Unit 0:42:11 > > > > > > > > CI - Integration - Sql 1:00:13 > > > > > > > > CI - Integration - Tiered JCloud 1:00:18 > > > > > > > > CI - Integration - Tiered FileSystem 1:00:13 > > > > > > > > CI - Integration - Function State 1:00:12 > > > > > > > > CI - Integration - Cli 1:10:22 > > > > > > > > CI - Integration - Transaction 1:16:34 > > > > > > > > CI - Integration - Process 1:11:23 > > > > > > > > CI - Shade - Test 1:15:45 > > > > > > > > CI - Unit - Brokers - Client Api 0:26:13 > > > > > > > > CI - Unit - Brokers - Broker Group 2 0:35:05 > > > > > > > > CI - Integration - Standalone 0:45:29 > > > > > > > > CI - Integration - Messaging 1:00:23 > > > > > > > > CI - Integration - Thread 1:00:19 > > > > > > > > CI - Integration - Backwards Compatibility 1:00:19 > > > > > > > > CI - Integration - Schema 1:00:19 > > > > > > > > CI - Unit - Brokers - Broker Group 1 2:02:31 > > > > > > > > TOTAL 19:36:50 > > > > > > > > > > > > > > > > *In this case, the total resource consumption of GitHub > Actions > > > > > Runners > > > > > > > is > > > > > > > > 19 hours 36 minutes 50 seconds for a single pull request to > > > > > > > apache/pulsar.* > > > > > > > > > > > > > > > > Since GitHub Actions Runner resource pool utilization is very > > > high, > > > > > > this > > > > > > > > leads to the build queue to grow and take a long time to > > process. > > > > > > > > > > > > > > > > I have been looking for ways to improve the Pulsar CI for the > > > last > > > > 3 > > > > > > > > months. During this period I worked on a few experiments. The > > > > > learnings > > > > > > > > from the past experiments are documented at a high level in > the > > > > > > following > > > > > > > > draft PIP document. > > > > > > > > > > > > > > > > *The draft PIP "Changes to GitHub Actions based Pulsar CI" > > > document > > > > > is > > > > > > a > > > > > > > > Google doc:* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit?usp=sharing > > > > > > > > > > > > > > > > *Please participate* so that we get the plan adjusted based > on > > > the > > > > > > > > feedback asap. If there's already a similar effort ongoing, I > > > hope > > > > we > > > > > > can > > > > > > > > join efforts. > > > > > > > > > > > > > > > > *Let's fix Pulsar CI!* > > > > > > > > > > > > > > > > BR, Lari > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >