Dear Apache Pulsar Committers,

I wish to address a few pressing concerns that emerged while I was working on cherry-picking PR #20461 [1]. That PR upgrades Jetty from 9.4.48.v20220622 to 9.4.51.v20230217 to address CVE-2023-26048 and CVE-2023-26049. I discovered that Jetty had already been upgraded in the maintenance branches through four separate PRs (#20162, #20226, #20227, and #20228), all titled "[improve][build] Upgrade dependencies to reduce CVE" [2].
My concerns are the following:

1. The newly adopted process of combining multiple dependency updates into a single PR, while skipping the corresponding changes to the master branch, has not been discussed on the mailing list.

2. Our current cherry-picking-based process should maintain traceability across maintenance branches, so that it's possible to tell whether a change made to the master branch is available in the maintenance branches. The approach used in these PRs breaks that traceability.

3. Each dependency (or group of related dependencies) should be upgraded in its own PR, rather than upgrading multiple unrelated dependencies in a single PR.

4. We should aim for all changes to be made to the master branch first and then cherry-picked to other branches, to prevent the maintenance branches from diverging from the master branch.

5. Compiling release notes becomes challenging when PRs aren't atomic.

6. Similarly, detecting regressions becomes problematic when PRs aren't atomic.

Now, I want to clarify that I'm not entirely supportive of the cherry-picking process as it currently stands. I personally believe that a merge-based strategy could be more effective. In this strategy, a change would first be made to the oldest maintenance branch where a feature (or a dependency, as in this instance) exists. All changes in a maintenance branch would then be propagated forward towards the master branch using git merges, resolving any merge conflicts that arise along the way (a rough sketch of this flow is included below). Features wouldn't be added to maintenance branches. This strategy is employed in several open source projects, such as Grails [3] and Micronaut [4]. There might be exceptions, of course, and for such cases cherry-picking would remain a tool within our strategy.

The principal advantage of this approach is that it puts adequate focus on the maintenance branch, thereby curbing the instability we typically experience with our intermediate maintenance versions. Additionally, the merge-based approach addresses the issue with CI pipelines: if the PR is made against the maintenance branch, it ensures the changes integrate well and all tests pass in the maintenance version, which improves stability.

I understand the counter-argument that this could confuse our contributors if they have to make the PR against a maintenance branch. However, this could be mitigated by guidance from the PR reviewer and by adding further information to the contribution guide and PR template. There are also more radical options, such as making the main maintenance branch the default branch, like the "4.1" branch in Netty.

The merge strategy also helps ensure that the LTS maintenance branch is always in a releasable state. Currently, it takes a significant amount of time to "stabilize" the branch before releasing. This is a counterproductive pattern and a waste of time that we must address and improve.

There seem to be inherent obstacles in our existing process, evidenced by the recent adoption of bundled PRs that circumvent our cherry-picking process. Ordinarily, we insist on creating atomic PRs to the master branch before any cherry-picking and backporting happens.

I would be keen to hear about the issues others have encountered with the cherry-picking process. Identifying these pain points is the first step towards refining and optimizing our process.
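To make the proposed merge-forward flow more concrete, here is a minimal sketch of how a fix could travel from the oldest affected maintenance branch to master. The branch names and the topic branch are only illustrative, and whether the forward merges are pushed directly or go through merge PRs is an open detail, not a finalized proposal:

    # Hypothetical example: the fix first lands on branch-2.10, assumed here to be
    # the oldest maintenance branch that needs it (branch names are illustrative).
    git checkout branch-2.10
    git checkout -b fix/dependency-upgrade   # hypothetical topic branch
    # ...commit the fix and get it merged via a PR targeting branch-2.10...

    # Then merge each maintenance branch forward, resolving conflicts once per step:
    git checkout branch-2.11
    git merge branch-2.10

    git checkout branch-3.0
    git merge branch-2.11

    git checkout master
    git merge branch-3.0

In practice, each forward merge could itself go through a PR so that the target branch's CI runs before the merge lands, which ties in with the CI point above.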
With Pulsar's recent transition to a new Long-Term Support (LTS) release strategy, the stability of the LTS release has emerged as a vital concern. Our current cherry-picking process, which has sometimes led to insufficient integration testing within the maintenance branches, has proven ineffective at maintaining the requisite stability. If we do not revisit our maintenance processes, the new LTS release strategy could encounter the same instability issues. Thus, in order to fully reap the benefits of the LTS release strategy, we must prioritize improving our maintenance processes.

In the existing procedure, cherry-picking individual commits can become quite tedious, especially when it requires crafting a new PR for each cherry-picked commit. One possible solution to this inefficiency would be to coordinate cherry-picking better: a committer could trigger a test run covering a sensible batch of cherry-picked commits, avoiding the need for a separate PR for each cherry-picked item. Furthermore, a nightly build for all maintenance branches, set to run whenever changes have landed since the last run, would be useful. This way, we can consistently keep our branches in an optimal, release-ready state.

A significant deficiency in our current cherry-picking process is that it remains tribal knowledge, without a clearly documented description in place. While we do have a release process guide [5], it does not adequately elaborate on the procedure. Similarly, our release policy [6] does not delve into the specifics of this process either. This lack of comprehensive documentation leaves a significant knowledge gap in our workflow.

Our existing documentation [6] on the cherry-picking process states: "Generally, one committer shall volunteer as the release manager (RM) for a specific release. For feature releases and LTS releases, the last 3 weeks of the release cycle will be marked as a code-freeze period. The RM will branch off from master, and the RM is also responsible for selecting the changes that will be cherry-picked in the release branch." Unfortunately, this description does not match the actual process. As it stands, we frequently cherry-pick commits as soon as the master branch PR has been merged. The description also makes the Release Manager (RM) responsible for selecting the changes, which isn't even the usual case in practice. This practice is opaque and problematic, and it prompts several crucial questions: what decision-making criteria does the RM use, and how do they manage quality assurance? It currently takes a substantial amount of time to prepare a maintenance branch for release, which clearly underscores that our current process requires significant improvement.

Moreover, while the recent adoption of the Long-Term Support (LTS) strategy is a significant step, it doesn't appear to have brought about a radical shift in our approach. Aside from committing to maintain a specific version for a longer duration, our operational methodology hasn't changed substantially. To truly honor our commitment to long-term support, it's incumbent upon us to reform our processes, making them more efficient, reliable, and effective. Merely increasing the responsibilities of the Release Manager isn't the solution. An enterprise IT professional might suggest introducing a Change Advisory Board (CAB), but such a measure doesn't necessarily address the core issue at hand.
As the book "Accelerate: The Science of Lean Software and DevOps" [7] describes, approval by an external body (such as a manager or CAB), contrary to common belief, often does not result in higher levels of stability and can actually slow down the development process. We need to seek strategies that not only preserve stability but also promote agility and efficiency in our workflows.

Thank you for your attention, and I look forward to hearing your thoughts on these matters. Meanwhile, I kindly request that we stick to our established cherry-picking process until a collective decision is made on a potential alternative. This means discontinuing the current practice of bundling multiple changes into PRs made directly against maintenance branches.

Moreover, I earnestly hope for widespread involvement in refining this process. Specifically, I look forward to significant participation from the Apache Pulsar committers and PMC members in this pivotal discussion. Your collective insights and contributions will be important in effecting the much-needed improvements. Beyond discussion, there will also be a need for substantial effort: we must document the process thoroughly and continuously improve it as we gather more feedback along the way. I'm looking forward to an active discussion and to concrete contributions as PRs to our release policy & process documentation! Sharing the tribal knowledge is also welcome if you don't feel like contributing directly to the documentation. ;)

-Lari

[1] - https://github.com/apache/pulsar/pull/20461
[2] - https://github.com/apache/pulsar/pulls?q=is%3Apr+%22Upgrade+dependencies+to+reduce+CVE%22+is%3Aclosed
[3] - https://github.com/grails/grails-core
[4] - https://github.com/micronaut-projects/micronaut-core
[5] - https://pulsar.apache.org/contribute/release-process/
[6] - https://pulsar.apache.org/contribute/release-policy/
[7] - https://itrevolution.com/book/accelerate/

Appendix: quote from "Accelerate: The Science of Lean Software and DevOps" [7] on change approval by an external body (such as a manager or Change Advisory Board):

"We investigated further the case of approval by an external body to see if this practice correlated with stability. We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all. Our recommendation based on these results is to use a lightweight change approval process based on peer review, such as pair programming or intrateam code review, combined with a deployment pipeline to detect and reject bad changes. This process can be used for all kinds of changes, including code, infrastructure, and database changes."