[DISCUSS] Proposal - Agentic Eval (Meta-)Skill for Extensibility and Maintainability

Dennis Huo Thu, 21 May 2026 02:13:37 -0700

Hi all,

Now that agentic development is becoming a more fundamental and pervasive
tool, I wanted to explore a way to address both a "need" and an
"opportunity" under one umbrella: adding an agentic (meta-)skill that
starts codifying how we quantify the impact of "non-functional" changes on
repository "health" (in terms of extensibility and maintainability).


Basically, if we extrapolate from our habit of formalizing AGENTS.md files,
we will likely add well-defined agent "skills" for repeatable agentic
workflows, and those will become more ingrained in the development process
over time. The basic "need" is to standardize how we evaluate the addition
of new skills and Markdown documentation. The "opportunity" is to address
three related types of non-functional changes:

1. Refactoring code - sometimes subjective, sometimes partially objective
(consolidating duplicate code), but the *effects* are rarely quantifiable
2. Adding documentation/code comments - Generally regarded as being good,
but sometimes verbosity can hurt, and certainly "incorrect" documentation
can hurt
3. Addition of agent skills or rules - possibly manually tested to some
extent when added, but usually not consistently and rarely with
reproducible evals

To that end I put together this proposal doc with some lightweight design
elements for this agentic skill:

https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0

Would love to discuss folks' thoughts here or in comments in the doc.
Recapping the core concept from the doc:

*Treat any candidate change as an intervention in a measurable A/B. Take a
baseline ref and a candidate ref, run a fixed set of agent-driven sample
tasks against both refs, collect a small number of metrics (success vs. an
oracle, wall-clock, tokens, agent rounds, crash count, etc), and emit a
delta report a reviewer can actually interpret.*
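
As a rough illustration of that A/B loop (not the code in the PR), here is a minimal Python sketch. The `CellResult` fields, the `run_ab` name, and the `run_task(ref, task)` callback signature are all assumptions made for the example; the real harness may structure this quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CellResult:
    """Metrics for one (ref, task) run. Field names are illustrative only."""
    passed: bool
    wall_s: float
    tokens_out: int
    turns: int


def run_ab(baseline_ref: str,
           candidate_ref: str,
           tasks: List[str],
           run_task: Callable[[str, str], CellResult]) -> Dict[str, dict]:
    """Run every task against both refs and return a per-task delta report.

    run_task is a hypothetical callback: (git_ref, task_id) -> CellResult.
    """
    def label(result: CellResult) -> str:
        return "PASS" if result.passed else "FAIL"

    report = {}
    for task in tasks:
        base = run_task(baseline_ref, task)
        cand = run_task(candidate_ref, task)
        report[task] = {
            "verdict": f"{label(base)} -> {label(cand)}",
            "wall_s_delta": cand.wall_s - base.wall_s,
            "tokens_out_delta": cand.tokens_out - base.tokens_out,
            "turns_delta": cand.turns - base.turns,
        }
    return report
```

Passing the runner in as a callback keeps the comparison logic testable with a fake runner, without launching any agent.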

And the three component carveouts:

   - Static task corpus - hand-curated set of initial development tasks
   (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
   signal
   - Task synthesizer - a more advanced meta-evolution step in which the
   agentic driver of the harness synthesizes tasks that exercise newly
   identified segments of coding complexity
   - Eval harness - the overall framework that isolates subagents, sets up
   the task experiments, collects metrics, etc.

I have an initial v1 available for review:
https://github.com/apache/polaris/pull/4519

This includes the end-to-end working v1 eval harness and a prospective
initial set of static tasks; there is no codified task synthesizer yet. I
ran an initial meta-eval on it with three models (Claude Haiku 4.5, Claude
Opus 4.7, and Codex GPT 5.4) and just the "add new privilege" task. More
detailed results are posted in the PR and abridged here. We should iterate
a bit more on the task corpus, but at least it's a proof of concept of the
end-to-end flow.

## Task & fixture

- **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
`LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
ensure compile + `*PolarisAuthorizer*` tests pass without modifying
any test file. The task is a *probe* of the authorizer SPI: a naive
one-file edit (enum only) trips the static initializer in
`RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
change (enum + register call) passes.
- **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
- **AFTER ref**: `c9b37227` (temporary local fixture: AGENTS.md +100
lines, adding a "Recipes for Common Extension Tasks" section that
explicitly tells agents to also edit
`RbacOperationSemantics.register(...)`). The fixture only changes
`AGENTS.md`; no source code differs between BEFORE (the `*-base` cells
below) and AFTER.
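
To make the probe concrete, here is a Python analogue of the pattern, not the Polaris Java code. It assumes the check is of the "every enum constant must have a registered entry" kind; the `Operation` enum, the `registry` dict, and `check_all_registered` are hypothetical stand-ins, and the actual static initializer in `RbacOperationSemantics.java` may work differently.

```python
from enum import Enum, auto


class Operation(Enum):
    # Stand-in for an authorizable-operation enum after the naive edit.
    LIST_NAMESPACES = auto()
    LIST_NAMESPACE_TABLES_RECURSIVE = auto()  # newly added constant


def check_all_registered(enum_cls, registry):
    """Raise if any enum member lacks a registry entry (load-time style check)."""
    missing = [member.name for member in enum_cls if member not in registry]
    if missing:
        raise RuntimeError(f"unregistered operations: {missing}")


# Naive one-file edit: the enum grew, but the registry did not.
registry = {Operation.LIST_NAMESPACES: "read"}
try:
    check_all_registered(Operation, registry)
    naive_ok = True
except RuntimeError:
    naive_ok = False

# Two-file edit: also register the new constant, and the check passes.
registry[Operation.LIST_NAMESPACE_TABLES_RECURSIVE] = "read"
check_all_registered(Operation, registry)
```

In this analogue the naive edit fails as soon as the check runs, which is why such a task discriminates between agents that find the second file and agents that do not.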

The task's deterministic verifier runs out-of-band from the worker
agent (separate `bash` subprocess after the worker's transcript is
captured) so worker self-reports cannot fake a PASS.
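
A minimal sketch of that out-of-band idea, under the assumption that the verifier is any command whose exit code decides the verdict. The function name and the `worker_self_report` parameter are hypothetical, and the demo uses the Python interpreter as a stand-in verifier command rather than the `bash` verifier from the PR.

```python
import subprocess
import sys


def out_of_band_verdict(verifier_cmd, workdir, worker_self_report=None):
    """Return "PASS" or "FAIL" based only on the verifier's exit code.

    worker_self_report is accepted to make the point explicit:
    it is never consulted when deciding the verdict.
    """
    proc = subprocess.run(verifier_cmd, cwd=workdir,
                          capture_output=True, text=True)
    return "PASS" if proc.returncode == 0 else "FAIL"


# Stand-in verifier that always fails, while the worker claims success.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(out_of_band_verdict(failing, ".", worker_self_report="PASS"))  # prints FAIL
```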

## Headline results

| Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
|------|---------|---------:|-----------:|-----------:|------:|---------------|
| haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
| haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
| opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
| opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
| codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
| codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |

Per-model deltas (BEFORE → AFTER; a negative Δ means the AFTER doc helped):

| Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
|--------|-------:|--------:|--------:|-----------|
| haiku | -42% | -38% | -39% | PASS → PASS (soft improvement) |
| opus | -39% | -42% | -38% | PASS → PASS (soft improvement) |
| codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
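
For reference, the percentages can be recomputed from the headline table with a few lines of Python. The `pct_delta` helper and the dict layout are just for this illustration; the input numbers are copied from the table.

```python
def pct_delta(before, after):
    """Relative change from BEFORE to AFTER, as a rounded percentage."""
    return round(100 * (after - before) / before)


# (BEFORE, AFTER) pairs copied from the headline results table.
cells = {
    "haiku": {"wall_s": (270, 157), "cost_usd": (0.362, 0.226), "turns": (59, 36)},
    "opus": {"wall_s": (204, 124), "cost_usd": (1.481, 0.854), "turns": (24, 15)},
    "codex": {"wall_s": (37, 39)},  # cost and turns are n/a for codex
}

deltas = {
    model: {name: pct_delta(before, after)
            for name, (before, after) in metrics.items()}
    for model, metrics in cells.items()
}

for model, row in deltas.items():
    print(model, row)
```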

Total: 6 cells, 13m 49s wall, $2.92 spend (the spend covers the Claude
cells; Codex cost is n/a). One discriminating verdict flip plus two
consistent ~40% cost reductions on the same task: a clear, replicable
signal that the AGENTS.md recipe addition is agent-load-bearing.
