Hi Tom,

I like the idea of starting with X=1 year and Y=3 months. That, plus having
CHI review recent PRs, would be helpful.

Your suggestion of having a bot which "lints" naming conventions could be
useful.  Maybe there could be a CHI-bot to automate what you and others are
doing manually. :)
I would not suggest that this bot close PRs, but rather it could suggest
how to improve them.  Or if CI is green and other conditions are met, it
could apply a "ready for review" label to draw attention.
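
To make the idea a little more concrete, here is a rough Python sketch of the decision logic such a bot might use. It is illustrative only: `PullRequest` and `review_pr` are hypothetical names, not FlinkBot or GitHub APIs. The title pattern is an assumption loosely modelled on Flink's `[FLINK-XXXX][component] Title` convention, and "has a description" stands in for whatever the "other conditions" turn out to be.

```python
import re
from dataclasses import dataclass, field

# Assumed title convention, loosely modelled on "[FLINK-XXXX][component] Title";
# "[hotfix]" is also accepted in place of a Jira key.
TITLE_PATTERN = re.compile(r"^\[(FLINK-\d+|hotfix)\](\s*\[[\w.-]+\])*\s+\S")


@dataclass
class PullRequest:
    """Hypothetical, minimal view of a PR (not a real GitHub API object)."""
    title: str
    ci_green: bool
    has_description: bool
    labels: set = field(default_factory=set)


def review_pr(pr: PullRequest) -> list:
    """Return suggestions for the author; this never closes the PR.

    If CI is green and there is nothing to suggest, add the
    "ready for review" label to draw committers' attention.
    """
    suggestions = []
    if not TITLE_PATTERN.match(pr.title):
        suggestions.append(
            "Title should look like '[FLINK-12345][component] Short summary'."
        )
    if not pr.has_description:
        suggestions.append("Please describe what the change does and why.")
    if pr.ci_green and not suggestions:
        pr.labels.add("ready for review")
    return suggestions
```

The sketch only ever returns suggestions or adds a label; closing stays a human decision.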

Cheers,

Jim



On Tue, Dec 17, 2024 at 11:21 AM Tom Cooper <c...@tomcooper.dev> wrote:

> TL;DR
>
> - We have over 1.2k open PRs. This is an issue, as it makes new
> contributors think twice about opening PRs and looks like a
> too-big-to-solve problem to committers.
> - There have been various attempts, over the last 6 years, to enable the
> Stale PR bot/action to prompt authors to refresh old PRs and auto-close
> them if no action is taken.
> - These were rejected as some committers felt this was punishing
> contributors for the committers not reviewing/closing PRs fast enough.
> - Others felt that, rather than "sweeping the problem under the rug",
> using the Stale PR functionality would actually reveal the true scale of
> the issue, allowing committers to see which PRs were truly active.
> - Other Apache projects such as Kafka, Beam, Spark, Airflow and many
> others have enabled the stale PR GitHub action.
> - Despite this, Kafka still has 1k open PRs. However, these PRs have all
> been updated/commented on in the last 3-4 months, so they can be
> considered active.
> - For Flink, only 12% of the open PRs have been updated in the last 3
> months and only 41% in the last year.
> - I propose we enable the Stale PR GitHub action to clear the backlog and
> reduce the PRs down to those that are active and relevant.
> - We can start with PRs that haven't been active in the last year and give
> authors 3 months to refresh them. These thresholds could then be reduced
> over time towards the norm for other Apache projects: 3 months of
> inactivity and 1 month to refresh.
>
> The Problem
>
> Currently, we have 1245 open PRs in the main upstream Flink GitHub
> repository, the oldest of which was created over seven and a half years ago.
> Many of these PRs haven't been commented on or interacted with in years.
>
> I am definitely not here to cast blame. Flink is a huge project, and the
> committers are volunteers who only have so much time. Also, Flink is
> certainly not the only open source project to face this issue. However, the
> large number of open PRs is a drag on the community: it makes new
> contributors think twice about opening PRs, and I am sure it is demoralising
> for committers to see the mountain keep growing.
>
> Dealing with this was a big part of why the [Community Health Initiative
> (CHI)](
> https://cwiki.apache.org/confluence/display/FLINK/Community+Health+Initiative+%28CHI%29+workgroup)
> working group was set up. We are [making progress](
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332500010)
> on reviewing and triaging the top of the PR stack. However, the bottom of
> the stack is also an issue.
>
> Background
>
> It seems reasonable that a PR that is the better part of a decade old and
> hasn't been commented on in years is probably not relevant and could be
> closed. Indeed, this very point has been brought up before, first in [2018](
> https://lists.apache.org/thread/qpxkm7s0t6qy571bh14lr26jmos6plfp), [where](
> https://lists.apache.org/thread/k11krmyr5rbclw7jhcmshp828fov3w0n) it was
> noted that:
>
> > The current situation with 350 open PRs may send a signal to
> > contributors that it may actually be too much hassle to get a change
> > committed in Flink.
>
> At that time, there was some [push-back](
> https://lists.apache.org/thread/6qlytq81zrctbv9kbk4z30vkngxvw9o1) to the
> proposal of using the stale PR bot, mostly because auto-closing PRs was
> perceived as harsh, given that the issue was largely due to a lack of
> committer review. The Beam community went ahead and enabled it, but the
> discussion on the Flink side seems to have then died out.
>
> The stale PR bot was raised again in [2019](
> https://lists.apache.org/thread/9gz363224zn48n03yffgsczm8gk3ohro) and had
> a lot of support, including several examples of other Apache projects using
> it to good effect. However, this was again [pushed back against](
> https://lists.apache.org/thread/z5r5sy5j98yb8p02l243gc2qlo267d2w) as
> hiding the symptoms of the underlying problem, namely committers not
> engaging actively enough to close PRs that were no longer relevant or had
> no hope of being merged. The [counter-argument](
> https://lists.apache.org/thread/rz6vbzs16jbrs4smhqs940g3cg6godl0) to this
> was that the PR closing bot was only one part of a solution, not the whole
> solution, and that, far from hiding the problem, the stale labelling would
> highlight the scale of the issue.
>
> The Stale PR closing issue was raised again in [2022](
> https://lists.apache.org/thread/2z51fxm201z1tnvm2jklmtrhox4t9olw) and
> [2023](https://lists.apache.org/thread/pml95msx21sdc539404xs9tk209sdd55)
> with similar arguments, including from CHI's own [David Radley](
> https://lists.apache.org/thread/0poktkjxhdpv15s0xgj7oksr7209w737):
>
> > We have over 1000 open prs. This is a lot of technical debt. I came
> > across a 6 month old pr recently that had not been merged.
> > A second Jira issue was raised for the same problem and a second pr
> > fixed the issue (identically). The first pr was still on the backlog until
> > we noticed it.
>
> What other Apache projects are doing
>
> The [Stale PR/Issue GitHub action](https://github.com/actions/stale) is
> used by many Apache projects including [Beam](
> https://github.com/apache/beam/blob/master/.github/workflows/stale.yml),
> [Kafka](
> https://github.com/apache/kafka/blob/trunk/.github/workflows/stale.yml),
> [Spark](
> https://github.com/apache/spark/blob/master/.github/workflows/stale.yml)
> and [Airflow](
> https://github.com/apache/airflow/blob/main/.github/workflows/stale.yml)
> to name a few.
>
> Apache Kafka uses a 90-day (3-month) limit to define a stale PR and then
> allows a further 30 days (1 month) for the author to refresh the PR before
> it is auto-closed. Even with the Stale PR action enabled, Kafka still has
> over 1000 open PRs. However, all of these PRs were updated, commented on or
> otherwise interacted with in the last 3-4 months. This gives a much clearer
> picture of the true size of the active PR backlog.
>
> For comparison, I did some [basic analysis](
> https://gist.github.com/tomncooper/d92985065e948fb8fc6284a5438ff195) of
> Flink's open PRs. 55% were updated in the last 2 years, 41% in the last
> year, 12% in the last 3 months and only 8% in the last month. Since 59% have
> not been touched in over a year, it is reasonable to conclude that over half
> of these PRs are probably no longer relevant or would need significant
> updates to be compatible with the current codebase.
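
Percentages like these are straightforward to recompute. The following Python sketch is not the code from the linked gist; it assumes you already have each open PR's last-updated timestamp (for example the `updated_at` field returned by GitHub's REST API for pull requests) and approximates a month as 30 days. `activity_shares` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Reporting windows; a month is approximated as 30 days (an assumption).
WINDOWS = {
    "last month": timedelta(days=30),
    "last 3 months": timedelta(days=90),
    "last year": timedelta(days=365),
    "last 2 years": timedelta(days=730),
}


def activity_shares(updated_at, now):
    """Return the percentage of open PRs updated within each window.

    `updated_at` is a list of timezone-aware datetimes, one per open PR.
    """
    total = len(updated_at)
    if total == 0:
        return {name: 0.0 for name in WINDOWS}
    return {
        name: 100.0 * sum(1 for t in updated_at if now - t <= window) / total
        for name, window in WINDOWS.items()
    }
```

The shares are cumulative, so each wider window includes the narrower ones, which is how the figures above are reported.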
>
> What should we do?
>
> My personal take on this is that, while I agree that the issue is mostly
> one of committer capacity to review these PRs, the upstream PR count is
> currently too high and is discouraging engagement. But it is also not fair
> to blame committers for not wanting to spend time on PRs that are years out
> of date and clearly no longer relevant.
>
> So I think we should declare PR bankruptcy and attempt to clear away the
> bulk of the old PRs. I don't use the word "bankruptcy" flippantly or to
> provoke, just to acknowledge that the scale of the issue has gotten too
> large to be dealt with through the hard work of committers alone. Once we
> get the PR backlog to a manageable size, we can then focus on using
> initiatives like CHI and other workflow improvements to keep the PR
> count low.
>
> Proposal
>
> Enable the [stale PR GitHub action](https://github.com/actions/stale).
> This action would:
>
> - Identify any PR that has not been interacted with in the last `X` months
>   as `Stale`:
>   - Apply a `Stale` label to the PR. This will also allow committers to
>     easily get a list of stale PRs to review and refresh/close.
>   - Comment on the PR explaining that it is considered `Stale`, what to do
>     to refresh it and how to engage further with the community.
> - Identify any `Stale` PR that hasn't been refreshed (commented on or
>   otherwise updated) after a further `Y` months as closeable:
>   - Close the PR.
>   - Leave a closing comment highlighting that it can be reopened at any
>     point, with pointers on how to engage the community.
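
The lifecycle described above is essentially a two-stage state machine. Below is a hedged Python sketch of it, not the action's actual implementation: thresholds are given in days, any interaction after the `Stale` label is applied counts as a refresh, and `pr_action` is a hypothetical helper. In the action itself, these thresholds correspond to inputs such as `days-before-pr-stale` and `days-before-pr-close`.

```python
from datetime import datetime, timedelta
from typing import Optional


def pr_action(last_activity: datetime, stale_since: Optional[datetime],
              now: datetime, x_days: int, y_days: int) -> str:
    """Decide what to do with one PR on a scheduled run.

    last_activity: time of the last human interaction with the PR
        (assumed to exclude the bot's own stale comment).
    stale_since: when the Stale label was applied, or None if unlabelled.
    Returns one of "none", "mark_stale", "unmark_stale" or "close".
    """
    if stale_since is None:
        if now - last_activity >= timedelta(days=x_days):
            return "mark_stale"  # apply the Stale label and explanatory comment
        return "none"
    if last_activity > stale_since:
        return "unmark_stale"  # refreshed after labelling: drop the label
    if now - stale_since >= timedelta(days=y_days):
        return "close"  # close with a comment saying it can be reopened
    return "none"
```

Starting with `x_days=365, y_days=90` and later moving to `x_days=90, y_days=30` only changes the arguments, not the logic.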
>
> The values of the stale (`X`) and close (`Y`) thresholds are up for
> discussion. At least initially, given the sheer number of old PRs, we may
> want to be more lenient. For example, X = 1 year and Y = 3 months would limit
> the initial number of stale PRs and allow committers more time to review
> the stale PR list. Once the PR list has been reduced sufficiently, we may
> want to reduce these values in increments until, for example, X = 3 months
> and Y = 1 month, which seem to be the values other Apache projects have
> settled on.
>
> Obviously, I am a relative newcomer to the community. I would really like
> to hear what others, especially committers, think of the above proposal and
> hear any other ideas people have for taming the PR count.
>
> Alternatives
>
> Looking through the history of discussion on the subject, on several
> occasions people have suggested doing more fine-grained checks before
> closing PRs, [such as](
> https://lists.apache.org/thread/j8s4khdm4wdhvr2px6qkq0f0kwlpc2vg):
>
> > closing up PRs after X days which:
> > a) Don't have a CI that has passed
> > b) Don't follow the code contribution guide (like commit naming
> > conventions)
> > c) Have changes requested but aren't being followed-up by the contributor
>
> This is, of course, an option, but it would probably require updating
> FlinkBot. There is no reason we couldn't enable both the Stale PR GitHub
> Action and update FlinkBot to enforce rules like those above.
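
For comparison, the quoted fine-grained rules could be combined with an inactivity limit roughly as follows. This Python sketch is hypothetical and does not reflect existing FlinkBot behaviour: `PrStatus` and `should_close` are illustrative names, "after X days" is read as X days without any activity, and "not followed up" is read as no author activity since changes were requested.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PrStatus:
    """Hypothetical summary of a PR's state, gathered elsewhere."""
    last_activity: datetime
    ci_passed: bool
    follows_contribution_guide: bool
    changes_requested_at: Optional[datetime]
    last_author_activity: Optional[datetime]


def should_close(pr: PrStatus, now: datetime, x_days: int) -> bool:
    """Close only if idle for X days AND at least one quoted rule applies."""
    if now - pr.last_activity < timedelta(days=x_days):
        return False
    # Rule c): changes were requested and the author has not acted since.
    ignored_review = (
        pr.changes_requested_at is not None
        and (pr.last_author_activity is None
             or pr.last_author_activity < pr.changes_requested_at)
    )
    # Rules a) and b): failing CI or not following the contribution guide.
    return (not pr.ci_passed
            or not pr.follows_contribution_guide
            or ignored_review)
```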
>
> Tom Cooper
> [@tomncooper](https://twitter.com/tomncooper) | [tomcooper.dev](
> https://tomcooper.dev/)
