Hi Dennis,

Thanks for the clarifications!

Overall, I think this is an interesting idea and worth trying as a pilot project.

I assume these tests are supposed to be executed locally (at least initially), right?

Cheers,
Dmitri.

On Fri, May 22, 2026 at 4:30 AM Dennis Huo <[email protected]> wrote:

> Based on feedback, edited the doc with some more detail and some
> clarifications worth calling out here:
>
> 1. The originally stated core concept was aspirational/long-term in nature,
> but naturally we're nowhere close to having a reliable, automatable eval
> set or framework yet - clarified that the MVP goal here is actually just to
> focus on seeding an initial harness/framework so that we have a common
> framework within which to perform meta-analysis towards better
> understanding how our code/doc evolution impacts agentic behavior. MVP
> scope copied from the doc here for easy reading:
>
> *Introduce the basic process and machinery as a basic eval framework geared
> towards the evolution of AI-facing docs that produces measurable signals to
> co-evolve the maturity of the eval framework in conjunction with the rest
> of the codebase.*
>
> *Take advantage of the agentic driver of the harness producing a
> meta-analysis to help connect the numerical measurements to concrete
> agentic behaviors taken by the test subjects. The eval can initially be run
> selectively/ad-hoc for PRs deemed “relevant” for this analysis; having the
> shared framework within the project allows different community members to
> share and contribute to a common set of metrics and methodologies.*
>
> 2. Initial target PRs are more for things like changes to AGENTS.md,
> addition of rules/skills md files, etc., rather than run-of-the-mill code
> changes - the extrapolation of this into "refactoring" and other code
> changes is more speculative/experimental.
> Scenario statement from doc:
>
> *I added 200 lines of “hints” and “rules” to AGENTS.md*
> *- How do I know if those changes improve anything?*
> *- Are there unintended second-order changes to agentic behavior caused by
> the change?*
> *- How do I prevent unintended regressions in behavior driven by AGENTS.md
> changes over time?*
>
> On Thu, May 21, 2026 at 12:57 PM Dennis Huo <[email protected]> wrote:
>
> > You can basically think of it as unittests and/or benchmarks for
> > documentation or agent skills (or codebase health). Except since they can't
> > always be pass/fail, we also need something sliding-scale that measures a
> > degree of success/failure.
> >
> > If we didn't have LLMs, we theoretically could've still "tested"
> > documentation by having new developers who know nothing about the project
> > get locked in a room with a sample coding task. Group A gets updated docs.
> > Group B gets old docs. Measure how many of them succeed and how long they
> > take, ask them how hard the task was.
> >
> > If Group A always takes 30 minutes to finish and group B takes 60 minutes
> > to finish, you have a delta of 30 minutes.
> >
> > On Thu, May 21, 2026 at 12:35 PM Dmitri Bourlatchkov <[email protected]> wrote:
> >
> >> Hi Dennis,
> >>
> >> This proposal looks interesting, but I'm not sure I understand the purpose
> >> :) The doc and the PR give a lot of information about what happens, but
> >> almost nothing about "why" (at least I could not easily deduce that).
> >>
> >> Could you expand your proposal a bit on that aspect?
> >>
> >> More specifically, what is the "quantitative A/B delta" exactly? How is it
> >> envisioned to be used?
> >>
> >> Thanks,
> >> Dmitri.
> >>
> >> On Thu, May 21, 2026 at 5:13 AM Dennis Huo <[email protected]> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Now that agentic development is evolving to be a more fundamental and
> >> > pervasive tool, I wanted to explore ways to address both a "need" and an
> >> > "opportunity" under one umbrella - adding an agentic (meta-)skill to start
> >> > codifying a way for us to bake in quantifiable metrics to the impact of
> >> > "non-functional" changes on repository "health" (in terms of extensibility
> >> > and maintainability).
> >> >
> >> > Basically, if we extrapolate from getting into the habit of formalizing our
> >> > AGENTS.md files towards likely adding well-defined agent "skills" for
> >> > repeatable agentic workflows, and those becoming more ingrained in the
> >> > development process over time, the basic "need" is to standardize our evals
> >> > against the addition of new skills and mdfile documentation, but also to
> >> > recognize the opportunity of addressing three related types of
> >> > nonfunctional changes:
> >> >
> >> > 1. Refactoring code - sometimes subjective, sometimes partially objective
> >> > (consolidating duplicate code), but the *effects* are rarely quantifiable
> >> > 2. Adding documentation/code comments - Generally regarded as being good,
> >> > but sometimes verbosity can hurt, and certainly "incorrect" documentation
> >> > can hurt
> >> > 3. Addition of agent skills or rules - possibly manually tested to some
> >> > extent when added, but usually not consistently and rarely with
> >> > reproducible evals
> >> >
> >> > To that end I put together this proposal doc with some lightweight design
> >> > elements for this agentic skill:
> >> >
> >> > https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
> >> >
> >> > Would love to discuss folks' thoughts here or in comments in the doc.
> >> >
> >> > Recapping the core concept from the doc:
> >> >
> >> > *Treat any candidate change as an intervention in a measurable A/B. Take a
> >> > baseline ref and a candidate ref, run a fixed set of agent-driven sample
> >> > tasks against both refs, collect a small number of metrics (success vs. an
> >> > oracle, wall-clock, tokens, agent rounds, crash count, etc), and emit a
> >> > delta report a reviewer can actually interpret.*
> >> >
> >> > And the three component carveouts:
> >> >
> >> >    - Static task corpus - hand curated set of initial development tasks
> >> >    (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
> >> >    signal
> >> >    - Task synthesizer - More advanced meta-evolution step - the agentic
> >> >    driver of the harness can intelligently synthesize tasks that exercise
> >> >    newly identified segments of coding complexity
> >> >    - Eval harness - the overall framework that isolates subagents, sets up
> >> >    the task experiments, collects metrics, etc.
> >> >
> >> > I have an initial v1 available for review:
> >> > https://github.com/apache/polaris/pull/4519
> >> >
> >> > This includes the end-to-end working v1 eval harness and prospective
> >> > initial set of static tasks, no codified task synthesizer yet. I ran an
> >> > initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus
> >> > 4.7, and Codex GPT 5.4) and just the "add new privilege" task; more
> >> > detailed results posted in the PR, abridged here - we should iterate a bit
> >> > more on the task corpus, but at least it's a proof-of-concept of the
> >> > end-to-end flow.
> >> >
> >> > ## Task & fixture
> >> >
> >> > - **Task**: `tasks/seed/T-priv-add.yaml` - add the enum constant
> >> > `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
> >> > ensure compile + `*PolarisAuthorizer*` tests pass without modifying
> >> > any test file.
> >> > The task is a *probe* of the authorizer SPI: a naive
> >> > one-file edit (enum only) trips the static initializer in
> >> > `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
> >> > change (enum + register call) passes.
> >> > - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> >> > - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines -
> >> > "Recipes for Common Extension Tasks" section that explicitly tells
> >> > agents to also edit `RbacOperationSemantics.register(...)`). The
> >> > fixture only changes `AGENTS.md`; no source code differs between BEFORE
> >> > and AFTER.
> >> >
> >> > The task's deterministic verifier runs out-of-band from the worker
> >> > agent (separate `bash` subprocess after the worker's transcript is
> >> > captured) so worker self-reports cannot fake a PASS.
> >> >
> >> > ## Headline results
> >> >
> >> > | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> >> > |------|---------|---------:|-----------:|-----------:|------:|---------------|
> >> > | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> >> > | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> >> > | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> >> > | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> >> > | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> >> > | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
> >> >
> >> > Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
> >> >
> >> > | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> >> > |--------|-------:|--------:|--------:|-----------|
> >> > | haiku | -42% | -38% | -39% | PASS → PASS (soft improvement) |
> >> > | opus | -39% | -42% | -38% | PASS → PASS (soft improvement) |
> >> > | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
> >> >
> >> > Total: 6 cells, 13m 49s wall, $2.92 spend.
> >> > One discriminating
> >> > verdict-flip + two consistent ~40% cost reductions on the same
> >> > task - clear, replicable signal that the AGENTS.md recipe addition is
> >> > agent-load-bearing.
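
The per-arm deltas quoted in the results can be reproduced from the headline table as a relative change, (AFTER - BEFORE) / BEFORE. The following is a minimal sketch, assuming the table values are copied by hand into a dictionary; the names `CELLS`, `pct_delta`, and `arm_deltas` are illustrative only and are not taken from the harness in the PR, whose actual output format is not shown in this thread.

```python
from typing import Dict, Optional

# Headline numbers copied by hand from the results table quoted in the thread.
# The dictionary layout is an assumption for illustration, not the harness's
# real report format. None stands for the "n/a" cells.
CELLS: Dict[str, Dict[str, Dict[str, Optional[float]]]] = {
    "haiku": {
        "base": {"wall": 270, "cost": 0.362, "turns": 59},
        "after": {"wall": 157, "cost": 0.226, "turns": 36},
    },
    "opus": {
        "base": {"wall": 204, "cost": 1.481, "turns": 24},
        "after": {"wall": 124, "cost": 0.854, "turns": 15},
    },
    "codex": {
        "base": {"wall": 37, "cost": None, "turns": None},
        "after": {"wall": 39, "cost": None, "turns": None},
    },
}

METRICS = ("wall", "cost", "turns")


def pct_delta(before: Optional[float], after: Optional[float]) -> Optional[int]:
    """Relative change from before to after as a whole percentage.

    Returns None when either value is missing or the baseline is zero.
    """
    if before is None or after is None or before == 0:
        return None
    return round(100 * (after - before) / before)


def arm_deltas(cells):
    """Compute the BEFORE -> AFTER percentage delta per model and metric."""
    return {
        model: {m: pct_delta(arms["base"][m], arms["after"][m]) for m in METRICS}
        for model, arms in cells.items()
    }


if __name__ == "__main__":
    for model, deltas in arm_deltas(CELLS).items():
        print(model, deltas)
```

Negative values mean the AFTER arm was cheaper or faster. With a single run per cell these figures are point estimates; repeating each cell several times and comparing distributions would show how much of a delta is run-to-run noise.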
