Hi all,

Now that agentic development is becoming a more fundamental and pervasive tool, I'd like to explore a way to address both a "need" and an "opportunity" under one umbrella: adding an agentic (meta-)skill that lets us start putting quantifiable metrics on the impact of "non-functional" changes on repository "health" (in terms of extensibility and maintainability).

Basically, if we extrapolate from our growing habit of formalizing AGENTS.md files, we will likely add well-defined agent "skills" for repeatable agentic workflows, and those will become more ingrained in the development process over time. The basic "need" is to standardize our evals for newly added skills and `.md` file documentation. The "opportunity" is to cover three related types of non-functional changes with the same mechanism:

1. Refactoring code - sometimes subjective, sometimes partially objective (e.g. consolidating duplicate code), but the *effects* are rarely quantifiable
2. Adding documentation/code comments - generally regarded as good, but verbosity can sometimes hurt, and "incorrect" documentation certainly can
3. Adding agent skills or rules - possibly tested manually to some extent when added, but usually not consistently and rarely with reproducible evals

To that end I put together this proposal doc with some lightweight design elements for this agentic skill:

https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0

I would love to discuss folks' thoughts here or in comments in the doc.

Recapping the core concept from the doc:

*Treat any candidate change as an intervention in a measurable A/B. Take a baseline ref and a candidate ref, run a fixed set of agent-driven sample tasks against both refs, collect a small number of metrics (success vs. an oracle, wall-clock, tokens, agent rounds, crash count, etc.), and emit a delta report a reviewer can actually interpret.*

And the three component carve-outs:

- Static task corpus - a hand-curated set of initial development tasks (e.g. "Add a new Polaris privilege") that provides basic cross-cutting signal
- Task synthesizer - a more advanced meta-evolution step: the agentic driver of the harness can intelligently synthesize tasks that exercise newly identified segments of coding complexity
- Eval harness - the overall framework that isolates subagents, sets up the task experiments, collects metrics, etc.

I have an initial v1 available for review:

https://github.com/apache/polaris/pull/4519

It includes the end-to-end working v1 eval harness and a prospective initial set of static tasks; there is no codified task synthesizer yet.

I ran an initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus 4.7, and Codex GPT 5.4) and just the "add new privilege" task. More detailed results are posted in the PR; they are abridged here. We should iterate a bit more on the task corpus, but this is at least a proof of concept of the end-to-end flow.

## Task & fixture

- **Task**: `tasks/seed/T-priv-add.yaml`: add the enum constant `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`, and ensure that compilation and the `*PolarisAuthorizer*` tests pass without modifying any test file. The task is a *probe* of the authorizer SPI: a naive one-file edit (enum only) trips the static initializer in `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file change (enum + register call) passes.
- **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
- **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines, a "Recipes for Common Extension Tasks" section that explicitly tells agents to also edit `RbacOperationSemantics.register(...)`).

The fixture only changes `AGENTS.md`; no source code differs between the BEFORE and AFTER refs. The task's deterministic verifier runs out-of-band from the worker agent (in a separate `bash` subprocess, after the worker's transcript is captured), so worker self-reports cannot fake a PASS.

## Headline results

| Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
|------|---------|---------:|-----------:|-----------:|------:|---------------|
| haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
| haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
| opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
| opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
| codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
| codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |

Per-arm deltas (BEFORE → AFTER, AFTER doc helps):

| Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
|--------|-------:|--------:|--------:|-----------|
| haiku | -42% | -38% | -39% | PASS → PASS (soft-improvement) |
| opus | -39% | -42% | -38% | PASS → PASS (soft-improvement) |
| codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |

Total: 6 cells, 13m 49s wall, $2.92 spend.

One discriminating verdict flip plus two consistent ~40% cost reductions on the same task: a clear, replicable signal that the AGENTS.md recipe addition is agent-load-bearing.
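
For anyone who wants to sanity-check the per-arm deltas, the following Python sketch recomputes them from the headline numbers above. It is only an illustration, not the report code from the PR: the `Cell` type and the function names are hypothetical, and the rule used for "soft improvement" (both arms PASS and the AFTER arm has lower wall-clock) is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """One (model, ref) run, with values copied from the headline table."""
    passed: bool
    wall_s: float
    cost_usd: Optional[float] = None  # n/a for the codex cells
    turns: Optional[int] = None       # n/a for the codex cells


def pct_delta(before, after):
    """Relative change in percent, rounded to an integer; None if not computable."""
    if before is None or after is None or before == 0:
        return None
    return round(100.0 * (after - before) / before)


def verdict_delta(before: Cell, after: Cell) -> str:
    # Assumed classification rule, chosen to mirror the labels in the table.
    if not before.passed and after.passed:
        return "hard improvement"
    if before.passed and not after.passed:
        return "hard regression"
    if before.passed and after.passed and after.wall_s < before.wall_s:
        return "soft improvement"
    return "no change"


def delta_row(before: Cell, after: Cell) -> dict:
    return {
        "wall_pct": pct_delta(before.wall_s, after.wall_s),
        "cost_pct": pct_delta(before.cost_usd, after.cost_usd),
        "turns_pct": pct_delta(before.turns, after.turns),
        "verdict": verdict_delta(before, after),
    }


# (BEFORE, AFTER) pairs per model, taken from the headline table.
RUNS = {
    "haiku": (Cell(True, 270, 0.362, 59), Cell(True, 157, 0.226, 36)),
    "opus": (Cell(True, 204, 1.481, 24), Cell(True, 124, 0.854, 15)),
    "codex": (Cell(False, 37), Cell(True, 39)),
}

if __name__ == "__main__":
    for model, (before, after) in RUNS.items():
        print(model, delta_row(before, after))
```

Note that each cell here is a single run, so the percentages describe this one experiment; repeated runs per cell would be needed to put error bars on them.
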
