> I just fear that (soon) if AI costs are put to realistic price levels we
> need to check if contributors still have and get free AI bot access,
> else the idea is melting fast. (Low risk though, let's see if this
> happens we need to just change the approach... or look for funding)

If that happens, we will not have to deal with the problem in the
first place, because it will also be costly for those who create the
slop, not only for us.

Also - I estimate (I will know more once I start doing it, and this is
one of the things I am going to track over time) that ~90% of the
filtering for now is purely deterministic and FAST - I think the crux
of the solution is not to employ AI, but to assess as quickly as
possible whether we should look at the PR at all.

So this change is mostly a change to our process:

a) maintainers won't look at drafts (firmly)
b) clearly communicate to contributors that this will happen and
specify what they need to do
c) Relentlessly and without hesitation (but with oversight) convert
PRs to drafts when we quickly assess they are bad - and tell authors
how to fix them

The LLM there is just one of the checks - and the LLM check is fired
only when all other easily and deterministically verifiable criteria
are met. And I do hope that by the time PRs reach the LLM check, it
will mostly say "fine" - because it's very likely that those PRs are
**actually** worth looking at. I think most of our future work as
maintainers will be
deciding what we want to accept (or work on) - rather than spending
time assessing code quality nitpicks. For me this is a natural
consequence of what we've always been doing with static code checks. I
do remember times when (even in Airflow) our reviews included comments
about bad formatting and missing licences. Yes, that was the case - up
until we introduced pre-commit (one reason I introduced it, and one of
the first rules was to add licence headers automatically). This grew
to over 170 checks that we don't even have to think about. I see what
we are doing here as the natural next step.
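
To make that ordering concrete, here is a minimal, purely
illustrative sketch - the check names and PR fields below are
invented for this email and are not the actual API of `breeze pr
auto-triage`. Cheap deterministic checks run first and fail fast;
the LLM is consulted only when all of them already pass:

```python
# Illustrative only: deterministic checks first, LLM last.
# All names here are hypothetical, not the real tool's interface.

def triage(pr, llm_check):
    """Return (new_state, comment) for a PR-like dict."""
    deterministic_checks = [
        ("CI is green", lambda p: p["ci_status"] == "success"),
        ("has a description", lambda p: bool(p["description"].strip())),
        ("includes tests", lambda p: p["touches_tests"]),
    ]
    for name, passes in deterministic_checks:
        if not passes(pr):
            # Fail fast: no LLM tokens spent on an obviously unready PR.
            return ("draft", f"Please fix: {name}")
    # Every cheap criterion is met - only now pay for the LLM check.
    verdict = llm_check(pr)
    if verdict == "fine":
        return ("ready-for-maintainer-review", None)
    return ("draft", verdict)

print(triage(
    {"ci_status": "failure", "description": "", "touches_tests": False},
    llm_check=lambda p: "fine",
))
# -> ('draft', 'Please fix: CI is green')
```

In this shape the LLM call is the last gate, matching the ordering
above; rate-limiting the whole loop (e.g. running it daily) would sit
outside this function.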

I am of course exaggerating a bit. I still review AI generated code
and check its quality, asking agents to correct it when it doesn't
meet my standards. In fact, I review it in detail because I learn
something new every time. But I am exaggerating only slightly when
describing the focus I think we as maintainers will need to prioritize
in the future.

Another thing - ASF is already looking for a sponsor to cover AI usage
for ASF maintainers. I also know of at least one company considering
giving free access (under certain conditions - not sponsorship, but
tied to what the tokens will be used for) to all OSS maintainers in
general, in case this is needed in the future.

J.


On Wed, Mar 4, 2026 at 9:42 PM Jens Scheffler <[email protected]> wrote:
>
> I like the idea and also assume that we can adjust and improve rules and
> expectations over time.
>
> I just fear that (soon) if AI costs are put to realistic price levels we
> need to check if contributors still have and get free AI bot access,
> else the idea is melting fast. (Low risk though, let's see if this
> happens we need to just change the approach... or look for funding)
>
> On 04.03.26 08:13, Jarek Potiuk wrote:
> >>   Another manual step (and bottleneck) in triaging PRs is that maintainers
> > will still need to approve CI runs on GitHub.
> >
> > Great point ... and ... it's already handled :)  - look at my PR.
> >
> > When - during the triage - the triager will see that workflow approval is
> > needed, my nice little tool will print the diff of the incoming PR on
> > terminal and ask the triager to confirm that there is nothing suspicious
> > and after saying "y" the workflow run will be approved.
> >
> > J.
> >
> >
> > On Wed, Mar 4, 2026 at 3:35 AM Zhe-You Liu <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Thanks Jarek for bringing up the auto-triage idea!
> >> Big +1 from me on the “let’s try” decision.
> >>
> >> I really like this feature; it can help avoid copy‑pasting or repeatedly
> >> writing similar instructions for contributors to fix baseline test
> >> failures.
> >>
> >> I had the same thoughts as Wei regarding flaky tests. Having deterministic
> >> checks or automated comments should be enough to handle flaky test issues,
> >> and contributors can still reach out on Slack to get their PRs reviewed, so
> >> this should not be a problem.
> >>
> >> Another manual step (and bottleneck) in triaging PRs is that maintainers
> >> will still need to approve CI runs on GitHub. It doesn’t seem safe to fully
> >> automate CI approval, as there could still be rare cases where an attacker
> >> creates a vulnerable PR that logs environment variables during tests. Even
> >> though we could use an LLM to check for these kinds of vulnerabilities
> >> before approving a CI run, it is still not as safe as a manual review in
> >> most cases (e.g. prompt injection attack). I’m not sure whether anyone has
> >> a good idea for fully automated PR triaging -- for example, automatically
> >> approving CI, periodically checking test baselines for quality (via the
> >> `breeze pr auto-triage`), re‑approving CI as needed, and continuing this
> >> loop until all CI checks are green.
> >>
> >> Best regards,
> >> Jason
> >>
> >> On Tue, Mar 3, 2026 at 10:48 PM Vincent Beck <[email protected]> wrote:
> >>
> >>> I like the overall strategy, for sure the tool will need continuous
> >>> iterations to handle all the different scenarios. But this is definitely
> >>> needed, the number of open PRs just skyrocketed the last few months, it
> >> is
> >>> very hard/impossible to keep track of everything.
> >>>
> >>> On 2026/03/03 14:39:41 Jarek Potiuk wrote:
> >>>>>
> >>>>> Thanks for bringing this up! Overall, I like this idea, but it's
> >> worth
> >>>>> testing it for a bit before we enforce it, especially the LLM-verify
> >>> part.
> >>>> Oh absolutely. My plan to introduce it is (after the community
> >> hopefully
> >>>> makes an overall "let's try" decision):
> >>>>
> >>>> * The human triager is always in the loop, quickly reviewing comments
> >>> just
> >>>> before they are posted to the user (until we achieve high confidence)
> >>>> * I plan to run it myself as the sole triager for some time to perfect
> >> it
> >>>> and to pay much more attention initially. I will start with smaller
> >>>> groups/areas of code and expand as we go - possibly adding more
> >>> maintainers
> >>>> willing to participate in triaging and testing/improving the tool
> >>>> * See how quickly we can do it on a regular basis - whether we need
> >>> several
> >>>> triagers or perhaps one rotational triager handling all PRs from all
> >>> areas
> >>>> at a time.
> >>>> * Possibly further automate it. My assessment is that we will have 90%
> >> of
> >>>> deterministic "fails"—those we can easily automate without hesitation
> >>> once
> >>>> the process and expectations are in place. The LLM part is a bit
> >> more
> >>>> nuanced and we can decide after we try.
> >>>>
> >>>>> * The author ensures the PR passes ALL the checks and tests (i.e.
> >>> green).
> >>>>>> It might sometimes mean we have to react even more quickly to `main`
> >>>>> breakages,
> >>>>>> and probably provide some "status" info and exceptions when we know
> >>> main
> >>>>> is
> >>>>>> broken.
> >>>>> Probably, we should exempt some checks that might be flaky?
> >>>>>
> >>>> Yeah - this part is a bit problematic - but we can likely add also an
> >>> easy
> >>>> automated, deterministic check if the failure is happening for others.
> >>>> Sending an automated comment like, "Please rebase now, the issue is
> >>> fixed,"
> >>>> to the authors would be super useful when they see unrelated failures.
> >>> This
> >>>> is something we **should** figure out during testing. There will be
> >>> plenty
> >>>> of opportunities :D
> >>>>
> >>>>
> >>>>>> * All PRs that do not meet this requirement will be converted to
> >>> Drafts
> >>>>>> with automated suggestions (reviewed quickly and efficiently by a
> >>>>>> triager) provided to the author on the next steps.
> >>>>> This will be super helpful! I also do it manually from time to time.
> >>>>
> >>>> Yes. I believe converting to Draft is an extremely strong (but fair)
> >>> signal
> >>>> to the author: "Hey, you have work to do.".
> >>>>
> >>>> Also when this is accompanied by an actionable comment like, "Here is
> >>> what
> >>>> you should do and here is the link describing it," it immediately
> >> filters
> >>>> out people who submit PRs without much work.
> >>>>
> >>>> Surely - they might feed the comment into their agent anyway (or it can
> >>>> read it automatically and act). But if our tool is faster and cheaper
> >> and
> >>>> more accurate (because of smart human in the driver's seat) than their
> >>>> tools, we gain an upper hand.
> >>>> And it should be faster - because we only check the expectation rather
> >>> than
> >>>> figuring out what to do, which should be much faster.
> >>>>
> >>>> Then in the worst case we will have continuous ping-pong (Draft ->
> >>> Undraft
> >>>> -> Draft), but we will control how fast this loop runs. Generally, our
> >>> goal
> >>>> should be to slow it down rather than respond immediately; for example,
> >>>> running it daily or every two days is a good idea.
> >>>>
> >>>> Effectively, if the PR is in the "ready for maintainer review" state,
> >> the
> >>>> maintainer should be quite certain, that the code quality, tests, etc.,
> >>> are
> >>>> all good. Only then should they take a look (and they can immediately
> >>> say,
> >>>> "No, this is not what we want")—and this is absolutely fine as well. We
> >>>> should not optimize for contributors spending time on work we might not
> >>>> accept. This is deliberately not a goal for me. This will automatically
> >>>> mean that new contributors who want to contribute significant changes
> >>> will
> >>>> mostly waste a lot of time and their PRs will be rejected.
> >>>>
> >>>> This is largely what we are already doing, mostly because those PRs do
> >>> not
> >>>> follow our "tribal knowledge," which the agent cannot easily derive.
> >>>> Naturally new contributors should start with small, easy-to-complete
> >>> tasks
> >>>> that can be easily discarded if reviewers reject them. This is what we
> >>>> always asked people to start with. So this approach with the triage
> >> tool,
> >>>> also largely supports this: someone new rewriting the proverbial
> >>> scheduler
> >>>> will have to spend significant time ensuring "auto-triage" passes, only
> >>> to
> >>>> have the idea completely rejected by the reviewer or be asked for a
> >>>> complete rewrite. And this is perfectly fine. We always encouraged
> >>>> newcomers to start with small tasks, learn the basics, and "grow" until
> >>>> they were ready to propose bigger changes or split it into much smaller
> >>>> chunks. With "auto-triage" this will be natural and expected, requiring
> >>>> authors to invest more time and effort to reach the "ready for review"
> >>>> status.
> >>>>
> >>>> And I think it's absolutely fair and restores the balance we so much
> >> need
> >>>> now.
> >>>>
> >>>>
> >>>>>
> >>>>> Best,
> >>>>> Wei
> >>>>>
> >>>>>> On Mar 3, 2026, at 9:34 PM, Jarek Potiuk <[email protected]> wrote:
> >>>>>>
> >>>>>> *TL;DR; I propose a stricter (automation-assisted) approach for the
> >>>>> "ready
> >>>>>> for review" state and clearer expectations for contributors
> >> regarding
> >>>>> when
> >>>>>> maintainers review PRs of non-collaborators.*
> >>>>>>
> >>>>>> Following the
> >>>>>> https://lists.apache.org/thread/8tzwwwd7jmtmfo4j9pzg27704g10vpr4
> >>> where I
> >>>>>> showcased a tool that I claude-coded, I would like to have a
> >>> (possibly
> >>>>>> short) discussion on this subject and reach a stage where I can
> >>> attempt
> >>>>> to
> >>>>>> try the tool out.
> >>>>>>
> >>>>>> *Why? *
> >>>>>>
> >>>>>> Because we maintainers are overwhelmed and burning out, we no
> >> longer
> >>> see
> >>>>>> how our time invested in Airflow can bring significant returns to
> >> us
> >>>>>> (personally) and the community.
> >>>>>>
> >>>>>> While some of us spend a lot of time reviewing, commenting on, and
> >>>>> merging
> >>>>>> code, with the current rate of AI-generated PRs and other things we
> >>> do,
> >>>>>> this is not sustainable. Also there is a mismatch—or lack of
> >>>>>> clarity—regarding the quality expectations for the PRs we want to
> >>> review.
> >>>>>> *Social Contract Issue*
> >>>>>>
> >>>>>> We are a good (I think) open source project with a thriving
> >> community
> >>>>> and a
> >>>>>> great group of maintainers who are also friends and like to work
> >> with
> >>>>> each
> >>>>>> other but also are very open to bringing new community members in.
> >> As
> >>>>>> maintainers, we are willing to help new contributors grow and
> >>> generally
> >>>>>> willing to spend some of our time doing so. This is the social
> >>> contract
> >>>>> we
> >>>>>> signed up for as OSS maintainers and as committers for the Apache
> >>>>> Software
> >>>>>> Foundation PMC. Community Over Code.
> >>>>>>
> >>>>>> However, this social contract - this community-building aspect is
> >>>>> currently
> >>>>>> heavily imbalanced because AI-generated content takes away time,
> >>> focus
> >>>>> and
> >>>>>> energy from the maintainers. Instead of having meaningful
> >>> discussions in
> >>>>>> PRs about whether changes are needed and communicating with people,
> >>> we
> >>>>>> start losing time talking to - effectively - AI agents about
> >>> hundreds of
> >>>>>> smaller and bigger things that should not be there in the first
> >> place.
> >>>>>> Currently - collaboration and community building suffer. Even if
> >> real
> >>>>>> people submit code generated by agents (which is becoming really
> >>> good,
> >>>>> fast
> >>>>>> and cheap to produce), we simply lack the time as maintainers to
> >> have
> >>>>>> meaningful conversations with the people behind those agents.
> >>>>>>
> >>>>>> Sometimes we lose time talking to agents. Sometimes we lose time on
> >>>>> talking
> >>>>>> to people who have 0 understanding of what they are doing and
> >> submit
> >>>>>> continuous crap, and we should not be having that conversation at
> >>>>>> all. Sometimes, we just look at the number of PRs opened in a given
> >>> day
> >>>>> in
> >>>>>> despair, dreading even trying to bring order to them.
> >>>>>>
> >>>>>> And many of us also have some "work" to do or a "feature" to work
> >> on
> >>> top
> >>>>> of
> >>>>>> that.
> >>>>>>
> >>>>>> I think we need to reclaim the maintainers' collective time to
> >> focus
> >>> on
> >>>>>> what matters: delegating more responsibility to authors so they
> >> meet
> >>> our
> >>>>>> expected quality bar (and efficiently verifying it with tools
> >> without
> >>>>>> losing time and focus).
> >>>>>>
> >>>>>> *What do we have now?*
> >>>>>>
> >>>>>> We have already done a lot to help with it - AGENTS.md. The PR
> >>> guidelines,
> >>>>>> overhauled by Kaxil and updated by others, will certainly help
> >>> clarify
> >>>>>> expectations for agents in the future. I know Kaxil is also
> >>> exploring a
> >>>>> way
> >>>>>> to enable automated copilot code reviews in a manner that will not
> >>> be too
> >>>>>> "dehumanizing" and will work well. This is all good. The better the
> >>>>> agents
> >>>>>> people use and the more closely they follow those instructions, the
> >>>>> higher
> >>>>>> the quality of incoming PRs will be. But we also need to help
> >>> maintainers
> >>>>>> easily identify what to focus on—distinguishing work in progress
> >> and
> >>>>>> unfinished PRs that need work from those truly "Ready for (human)
> >>>>> review."
> >>>>>> *How?*
> >>>>>>
> >>>>>> My proposal has two parts:
> >>>>>>
> >>>>>> * Define and communicate expectations for PRs that maintainers can
> >>>>> manage.
> >>>>>> * Relentlessly automate it to ensure expectations are met and that
> >>>>>> maintainers can easily focus on those PRs that are "Ready for review."
> >>>>>>
> >>>>>> My tool (needs a bit more fine-tuning and refinement):
> >>>>>> https://github.com/apache/airflow/pull/62682 `*breeze pr
> >>> auto-triage*`
> >>>>> is
> >>>>>> designed to do exactly this: automate those expectations by
> >>> auto-triaging
> >>>>>> the PRs. It not only converts them to Draft when they are not yet
> >>> "Ready
> >>>>>> For Review," but also provides actionable, automated
> >> (deterministic +
> >>>>> LLM)
> >>>>>> comments to the authors. A concrete maintainer (the current
> >> triager)
> >>> is
> >>>>>> using the tool very efficiently.
> >>>>>>
> >>>>>> *Proposed expectations (for non-collaborators):*
> >>>>>>
> >>>>>> Those are not "new" expectations. Really, I'm proposing we
> >> completely
> >>>>>> delegate the responsibility for fulfilling those expectations to
> >> the
> >>>>> author
> >>>>>> (with helpful, automated comments - reviewed and confirmed by a
> >> human
> >>>>>> triager for now). And simply be very clear that generally no
> >>> maintainer
> >>>>>> will look at a PR until:
> >>>>>>
> >>>>>> * The author ensures the PR passes ALL the checks and tests (i.e.
> >>> green).
> >>>>>> It might sometimes mean we have to react even more quickly to `main`
> >>>>> breakages,
> >>>>>> and probably provide some "status" info and exceptions when we know
> >>> main
> >>>>> is
> >>>>>> broken.
> >>>>>>
> >>>>>> * The author follows all PR guidelines (LLM-verified) regarding
> >>>>>> description, content, quality, and presence of tests.
> >>>>>>
> >>>>>> * All PRs that do not meet this requirement will be converted to
> >>> Drafts
> >>>>>> with automated suggestions (reviewed quickly and efficiently by a
> >>>>>> triager) provided to the author on the next steps.
> >>>>>>
> >>>>>> * Drafts with no activity will be more aggressively pruned by our
> >>>>> stalebot.
> >>>>>> The triager is there mostly to quickly assess and generate
> >>> comments—with
> >>>>>> tool/AI assistance. The triager won't be the one who actually
> >> reviews
> >>>>> those
> >>>>>> PRs when they are "ready for review."
> >>>>>>
> >>>>>> * Only after that do we mark the PR as "*ready for maintainer
> >>> review*"
> >>>>>> (label)
> >>>>>>
> >>>>>> * Only such PRs should be reviewed and it is entirely up to the
> >>> author to
> >>>>>> make them ready.
> >>>>>>
> >>>>>> Note: This approach is only for non-collaborators. For
> >>> collaborators: we
> >>>>>> might have just one expectation - mark your PR with "ready for
> >>> maintainer
> >>>>>> review" when you think it's ready.
> >>>>>> We accept people as committers and collaborators because we already
> >>> know
> >>>>>> they generally know and follow the rules; automating this step
> >> isn't
> >>>>>> necessary.
> >>>>>>
> >>>>>> This is nothing new; we've already been doing this with humans
> >>> handling
> >>>>> all
> >>>>>> the heavy lifting without much strictness or organization, but
> >>> this is
> >>>>>> no longer sustainable.
> >>>>>>
> >>>>>> I propose we make the expectations explicit, communicate them
> >>> clearly,
> >>>>> and
> >>>>>> relentlessly automate their execution.
> >>>>>>
> >>>>>> I would love to hear what y'all think.
> >>>>>>
> >>>>>> J.
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: [email protected]
> >>>>> For additional commands, e-mail: [email protected]
> >>>>>
> >>>>>
> >>>
>
>

