From Tribal Knowledge to Transparency: Enhancing and Documenting the LTS Maintenance & Cherry-Picking Process

Lari Hotari Fri, 02 Jun 2023 01:24:49 -0700

Dear Apache Pulsar Committers,

I wish to address a few pressing concerns that emerged while I was
working on cherry-picking PR #20461 [1]. This PR was aimed at upgrading
Jetty from 9.4.48.v20220622 to 9.4.51.v20230217 to address the CVEs
(CVE-2023-26048 and CVE-2023-26049). I discovered that Jetty had already
been upgraded in the maintenance branches through four separate PRs
(#20162, #20226, #20227, and #20228), all titled "[improve][build]
Upgrade dependencies to reduce CVE" [2]. This raises several concerns:

1. The newly adopted process of combining multiple dependency updates
into a single PR, while omitting changes to the master branch, has
not been discussed on the mailing list.
2. Our current process, which is based on cherry-picking, should
maintain traceability across maintenance branches so that we can tell
whether a change made to the master branch is available in the
maintenance branches. The bundled PRs break with this established
approach.
3. Each dependency (or group of related dependencies) should be
upgraded in its own PR, rather than upgrading multiple unrelated
dependencies in a single PR.
4. We should aim for all changes to be first made to the master branch
and then cherry-picked to other branches to prevent the maintenance
branches from diverging from the master branch.
5. The compilation of release notes becomes challenging when PRs aren't
atomic.
6. Similarly, detecting regressions can be problematic when PRs aren't
atomic.
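
To illustrate the traceability concern in point 2, here is a minimal
Python sketch. It assumes that squash-merged commit subjects end with
the PR number in the form "(#12345)", which may not hold for every
commit. The function names and commit subjects are hypothetical and for
illustration only; #11111 is a placeholder, and only #20461 and #20162
come from the discussion above.

```python
import re

# Assumption: each squash-merged commit subject ends with "(#<PR number>)".
PR_SUFFIX = re.compile(r"\(#(\d+)\)\s*$")


def pr_numbers(subjects):
    """Return the set of PR numbers found at the end of commit subjects."""
    found = set()
    for subject in subjects:
        match = PR_SUFFIX.search(subject)
        if match:
            found.add(int(match.group(1)))
    return found


def missing_backports(master_subjects, branch_subjects):
    """Return PR numbers present on master but not traceable on the branch."""
    return sorted(pr_numbers(master_subjects) - pr_numbers(branch_subjects))


# Hypothetical commit subjects, e.g. collected from `git log --format=%s`.
master = [
    "[fix][broker] Placeholder fix (#11111)",
    "[improve][build] Placeholder Jetty upgrade (#20461)",
]
branch = [
    "[fix][broker] Placeholder fix (#11111)",
    "[improve][build] Upgrade dependencies to reduce CVE (#20162)",
]

print(missing_backports(master, branch))  # prints [20461]
```

With atomic cherry-picks, the PR number of the master change shows up
in the maintenance branch history, so a check like this can answer
whether a change has been backported. A bundled PR carries only its own
number, so the Jetty upgrade looks missing even though the branch
already contains it.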

Now, I want to clarify that I'm not entirely supportive of the
cherry-picking process as it currently stands. I personally believe that
a merge-based strategy could be more effective. This strategy would
entail initially making changes to the oldest maintenance branch where a
feature (or a dependency, as in this instance) exists. Subsequently, we
would propagate all changes in a maintenance branch forward towards the
master branch using git merges, effectively managing and resolving any
merge conflicts that might arise along the way. Features wouldn't be
added to maintenance branches. This strategy is employed in several open
source projects, such as Grails [3] and Micronaut [4].

Of course, there might be exceptions, and for such instances,
cherry-picking would still be a tool within our strategy. The principal
advantage of this proposed approach is that it puts adequate focus on
the maintenance branch, thereby curbing the instability typically
experienced with our intermediate maintenance versions.

Additionally, the merge-based approach addresses the issue with CI
pipelines. If the PR is made to the maintenance branch, it ensures the
changes integrate well and all tests pass in the maintenance version,
enhancing stability. I understand the counter-argument that this could
confuse our contributors if they have to make the PR against the
maintenance branch. However, this could be mitigated by guidance from
the PR reviewer and adding further information in the contribution guide
and PR template. There are also more radical solutions, such as making
the main maintenance branch the default branch, like the "4.1" branch in
Netty.

The merge strategy also helps ensure that the LTS maintenance branch
is always in a releasable state. Currently, it takes a significant
amount of time to "stabilize" the branch before releasing. This is a
counterproductive pattern and a waste of time that we must address and
improve.

There seem to be inherent obstacles in our existing process, evidenced
by the recent adoption of bundled PR types that circumvent our
cherry-picking process. Ordinarily, we insist on creating atomic PRs to
the master branch prior to initiating cherry-picking and backporting. I
would be keen to hear about the issues others have encountered with the
cherry-picking process. Identifying these pain points is the first step
towards refining and optimizing our process.

With Pulsar's recent transition to a new Long-Term Support (LTS) release
strategy, the stability of the LTS release has emerged as a vital
concern. Our current cherry-picking process, which has sometimes led to
insufficient integration testing within the maintenance branches, has
proven ineffective at maintaining the requisite stability. If we do
not revisit our maintenance processes, the new LTS release strategy could
encounter the same instability issues. Thus, in order to fully reap the
benefits of the LTS release strategy, we must prioritize the improvement
of our maintenance processes.

In the existing procedure, the task of cherry-picking individual commits
can become quite tedious, especially when it necessitates crafting a new
PR for each cherry-picked commit. One possible solution to this
inefficiency may be to enhance the coordination of cherry-picking. Under
such a system, the committer could instigate a test run encompassing a
sensible quantity of cherry-picked commits, thereby circumventing the
need for separate PRs for each cherry-picked item. Furthermore, the
implementation of a nightly build for all maintenance branches, set to
execute if any changes have transpired since the last run, could be
advantageous. By employing this approach, we can consistently maintain
our branches in an optimal and release-ready state.

A significant deficiency in our current cherry-picking process is its
status as tribal knowledge, without a clearly documented description in
place. While we do possess a release process guide [5], it does not
adequately elaborate on the procedure. Similarly, our release policy [6]
does not delve into the specifics of this process either. This lack of
comprehensive documentation leaves a significant knowledge gap in our
workflow.

Our existing documentation [6] on the cherry-picking process states,
"Generally, one committer shall volunteer as the release manager (RM) for
a specific release. For feature releases and LTS releases, the last 3
weeks of the release cycle will be marked as a code-freeze period. The
RM will branch off from master, and the RM is also responsible for
selecting the changes that will be cherry-picked in the release branch."

Unfortunately, this description falls short of the actual process. As it
stands, we frequently cherry-pick commits as soon as the master branch
PR has been merged. The description says that the Release Manager (RM)
is responsible for selecting the changes, which isn't even the usual
case. This practice is opaque and problematic. It prompts several
crucial questions: what decision-making criteria does the RM use, and
how do they manage quality assurance? We currently need a substantial
amount of time to prepare a maintenance branch for release, which
clearly shows that our current process requires significant
improvement.

Moreover, while the recent implementation of the Long-Term Support (LTS)
strategy is a significant step, it doesn't appear to have brought about
a radical shift in our approach. Aside from committing to maintain a
specific version for a longer duration, our operational methodology
hasn't undergone substantial enhancements. To truly honor our commitment
to long-term support, it's incumbent upon us to reform our processes,
making them more efficient, reliable, and effective. Merely increasing
the responsibilities of the Release Manager isn't the solution.

An enterprise IT professional might suggest the introduction of a Change
Advisory Board (CAB). However, such a measure doesn't necessarily
address the core issue at hand. As the book "Accelerate: The Science of
Lean Software and DevOps" [7] describes, approval by an external body
(such as a manager or CAB), contrary to common belief, often does not
result in higher levels of stability and can actually slow down the
development process. We need to seek strategies that not only preserve
stability but also promote agility and efficiency in our workflows.

Thank you for your attention, and I look forward to hearing your
thoughts on these matters. Meanwhile, I kindly request that we stick to
our established cherry-picking process until a collective decision is
made on a potential alternative. This implies discontinuing the current
practice of bundling multiple changes in PRs to maintenance branches.

Moreover, I earnestly hope for widespread involvement in refining this
process. Specifically, I look forward to significant participation from
the Apache Pulsar committers and PMC members in this pivotal discussion.
Your collective insights and contributions will be important in
effecting the much-needed improvements.

In addition to discussions, there will also be a need for substantial
effort. We must document the process thoroughly and continuously improve
it as we gather more feedback during its progress.

I'm looking forward to an active discussion and concrete contributions
as PRs to our release policy & process documentation! Sharing the tribal
knowledge is also welcome if you don't feel like contributing directly
to documentation. ;)

-Lari

[1] - https://github.com/apache/pulsar/pull/20461
[2] - https://github.com/apache/pulsar/pulls?q=is%3Apr+%22Upgrade+dependencies+to+reduce+CVE%22+is%3Aclosed
[3] - https://github.com/grails/grails-core
[4] - https://github.com/micronaut-projects/micronaut-core
[5] - https://pulsar.apache.org/contribute/release-process/
[6] - https://pulsar.apache.org/contribute/release-policy/
[7] - https://itrevolution.com/book/accelerate/

Appendix:
Quote from "Accelerate: The Science of Lean Software and DevOps" [7]
related to change approval by an external body (such as a manager or
Change Advisory Board):

"We investigated further the case of approval by an external body to see
if this practice correlated with stability. We found that external
approvals were negatively correlated with lead time, deployment
frequency, and restore time, and had no correlation with change fail
rate. In short, approval by an external body (such as a manager or CAB)
simply doesn’t work to increase the stability of production systems,
measured by the time to restore service and change fail rate. However,
it certainly slows things down. It is, in fact, worse than having no
change approval process at all.

Our recommendation based on these results is to use a lightweight change
approval process based on peer review, such as pair programming or
intrateam code review, combined with a deployment pipeline to detect and
reject bad changes. This process can be used for all kinds of changes,
including code, infrastructure, and database changes."
