feat(config): support structured custom review agents #278

New issue

Open

opened 2026-05-19 16:50:28 -07:00 by jwilger · 1 comment

jwilger commented

2026-05-19 16:50:28 -07:00

Owner

Problem

Repo-level guidelines currently provide one shared block of free-form review guidance. That is useful for broad conventions, but it does not let a repo define multiple focused review perspectives with distinct checklists.

Repos should be able to ask Auto Review to apply custom, named review agents such as a security-focused checklist, API-design checklist, migration checklist, or domain-specific reviewer, without turning the config into an unconstrained prompt-injection surface.

Proposed enhancement

Add repo-config support for custom named review agents with structured checklist rules.

The config should avoid completely free-form prompt text. Instead, each custom agent should have a stable name and a list of specific checks to perform.

Example shape:

review_agents:
  agent_a:
    - "check this"
    - "check that"
  agent_b:
    - "check for ones"
    - "check for twos"

Exact key names/schema are TBD, but the important constraints are:

Agents are named by the repo config.
Each agent has a bounded list of checklist-style review items.
The review prompt presents these as structured checks, not arbitrary instructions that can override core reviewer/system behavior.
Default review behavior remains unchanged when no custom agents are configured.
Custom agents should supplement the default reviewer rather than replacing safety, bug, and correctness review.

Design considerations

Decide whether custom agents are separate LLM passes or one structured section in the default review prompt.
Decide whether findings should record which custom agent/check produced them.
Consider limits for number of agents, number of checks per agent, and total bytes.
Treat repo config as trusted operator/repo-maintainer configuration but still keep prompt boundaries clear.
Preserve compatibility with existing guidelines while making this the preferred shape for reusable checklists.

Acceptance criteria

.auto_review.yaml can define named custom review agents with checklist items.
The review prompt includes those named agents/checks in a structured way.
The config validator catches malformed custom-agent shapes and unknown keys under --strict.
Existing repos with no custom agents behave exactly as before.
Docs and .auto_review.example.yaml describe the feature and show examples.
Tests cover parsing, validation, prompt rendering, limits, and at least one custom-agent-driven finding path or prompt assertion.

## Problem Repo-level `guidelines` currently provide one shared block of free-form review guidance. That is useful for broad conventions, but it does not let a repo define multiple focused review perspectives with distinct checklists. Repos should be able to ask Auto Review to apply custom, named review agents such as a security-focused checklist, API-design checklist, migration checklist, or domain-specific reviewer, without turning the config into an unconstrained prompt-injection surface. ## Proposed enhancement Add repo-config support for custom named review agents with structured checklist rules. The config should avoid completely free-form prompt text. Instead, each custom agent should have a stable name and a list of specific checks to perform. Example shape: ```yaml review_agents: agent_a: - "check this" - "check that" agent_b: - "check for ones" - "check for twos" ``` Exact key names/schema are TBD, but the important constraints are: - Agents are named by the repo config. - Each agent has a bounded list of checklist-style review items. - The review prompt presents these as structured checks, not arbitrary instructions that can override core reviewer/system behavior. - Default review behavior remains unchanged when no custom agents are configured. - Custom agents should supplement the default reviewer rather than replacing safety, bug, and correctness review. ## Design considerations - Decide whether custom agents are separate LLM passes or one structured section in the default review prompt. - Decide whether findings should record which custom agent/check produced them. - Consider limits for number of agents, number of checks per agent, and total bytes. - Treat repo config as trusted operator/repo-maintainer configuration but still keep prompt boundaries clear. - Preserve compatibility with existing `guidelines` while making this the preferred shape for reusable checklists. ## Acceptance criteria - `.auto_review.yaml` can define named custom review agents with checklist items. - The review prompt includes those named agents/checks in a structured way. - The config validator catches malformed custom-agent shapes and unknown keys under `--strict`. - Existing repos with no custom agents behave exactly as before. - Docs and `.auto_review.example.yaml` describe the feature and show examples. - Tests cover parsing, validation, prompt rendering, limits, and at least one custom-agent-driven finding path or prompt assertion.

jwilger added the

enhancement

label

2026-05-19 16:50:40 -07:00

jwilger added this to the 1.1 milestone

2026-05-19 16:50:40 -07:00

jwilger commented

2026-05-20 15:35:44 -07:00

Author

Owner

Design-plan update: evaluate inline vs focused custom review agents

Initial implementation plan was to add review_agents as structured repo config and render all named agents/checklists into one structured section of the default review prompt. That plan prioritized minimal architecture change, lower cost, lower latency, and reuse of the existing single review/verifier path.

John pushed back that this assumes the model will attend well to several review perspectives in one pass. Production review runs on GPT-4o / GPT-4o-mini tiers, not a stronger high-reasoning model, so combining security, architecture, migration, API-design, and domain-specific checklists into one prompt may dilute the intended focus. The core reason to name custom agents may be to run focused review passes, not just organize prompt text.

Revised plan: do not choose the production strategy by intuition. Add an evaluation path first and compare:

inline: one normal review pass with a structured Custom review agents prompt section.
separate: one focused LLM pass per named review agent, then merge/dedupe findings before the existing verifier.

Evaluation should extend the existing auto-review bench run fixture harness rather than creating a separate tool.

Metrics to capture:

custom-agent recall
overall precision/recall/F1
false positives per fixture
duplicate findings before dedupe
verifier drop rate
JSON/self-heal success rate
latency p50/p95/p99
input/output tokens
estimated USD using the existing price table
cost per true positive
cost per verified finding

Corpus requirements:

fixtures with one custom agent
fixtures with multiple agents
cases where only one agent should fire
negative fixtures where no custom agent should fire
distractor changes near labelled lines
baseline review fixtures to ensure normal safety/bug/correctness review does not regress
labels that can identify custom-agent expected findings, e.g. agent_id / check_id

Predeclared decision rule:

Choose separate focused passes only if, on GPT-4o-mini:

custom-agent recall improves by at least +10 percentage points absolute over inline
overall precision drops by no more than 5 percentage points
false positives increase by no more than 0.2 per fixture
baseline review recall drops by no more than 3 percentage points
JSON/self-heal success remains at least 95%
estimated cost is no more than 2x inline unless the recall gain is clearly material and explicitly accepted
p95 latency is no more than 2x inline, or we explicitly decide parallel focused passes are worth the operational cost

Choose inline if:

it is within 5 percentage points of separate on custom-agent recall, or
separate improves recall but exceeds the precision, false-positive, cost, latency, or success-rate thresholds.

Implementation order:

Add benchmark/evaluation support for custom review-agent strategies.
Add labelled custom-agent fixture corpus.
Run paired GPT-4o-mini evaluation, then confirm with GPT-4o if feasible.
Record the results and selected strategy.
Implement .auto_review.yaml review_agents production behavior using the winning strategy.

We will stop after the eval report and consult before implementing production review behavior beyond what the evaluation harness needs.

## Design-plan update: evaluate inline vs focused custom review agents Initial implementation plan was to add `review_agents` as structured repo config and render all named agents/checklists into one structured section of the default review prompt. That plan prioritized minimal architecture change, lower cost, lower latency, and reuse of the existing single review/verifier path. John pushed back that this assumes the model will attend well to several review perspectives in one pass. Production review runs on GPT-4o / GPT-4o-mini tiers, not a stronger high-reasoning model, so combining security, architecture, migration, API-design, and domain-specific checklists into one prompt may dilute the intended focus. The core reason to name custom agents may be to run focused review passes, not just organize prompt text. Revised plan: do not choose the production strategy by intuition. Add an evaluation path first and compare: 1. `inline`: one normal review pass with a structured `Custom review agents` prompt section. 2. `separate`: one focused LLM pass per named review agent, then merge/dedupe findings before the existing verifier. Evaluation should extend the existing `auto-review bench run` fixture harness rather than creating a separate tool. Metrics to capture: - custom-agent recall - overall precision/recall/F1 - false positives per fixture - duplicate findings before dedupe - verifier drop rate - JSON/self-heal success rate - latency p50/p95/p99 - input/output tokens - estimated USD using the existing price table - cost per true positive - cost per verified finding Corpus requirements: - fixtures with one custom agent - fixtures with multiple agents - cases where only one agent should fire - negative fixtures where no custom agent should fire - distractor changes near labelled lines - baseline review fixtures to ensure normal safety/bug/correctness review does not regress - labels that can identify custom-agent expected findings, e.g. `agent_id` / `check_id` Predeclared decision rule: Choose separate focused passes only if, on GPT-4o-mini: - custom-agent recall improves by at least +10 percentage points absolute over inline - overall precision drops by no more than 5 percentage points - false positives increase by no more than 0.2 per fixture - baseline review recall drops by no more than 3 percentage points - JSON/self-heal success remains at least 95% - estimated cost is no more than 2x inline unless the recall gain is clearly material and explicitly accepted - p95 latency is no more than 2x inline, or we explicitly decide parallel focused passes are worth the operational cost Choose inline if: - it is within 5 percentage points of separate on custom-agent recall, or - separate improves recall but exceeds the precision, false-positive, cost, latency, or success-rate thresholds. Implementation order: 1. Add benchmark/evaluation support for custom review-agent strategies. 2. Add labelled custom-agent fixture corpus. 3. Run paired GPT-4o-mini evaluation, then confirm with GPT-4o if feasible. 4. Record the results and selected strategy. 5. Implement `.auto_review.yaml review_agents` production behavior using the winning strategy. We will stop after the eval report and consult before implementing production review behavior beyond what the evaluation harness needs.