Add per-review cost estimate based on model + tokens used #28

Open
opened 2026-05-01 23:34:49 -07:00 by jwilger · 0 comments
Owner

Idea

Every review the bot posts should include a cost estimate — model used per stage × input/output tokens × current per-token price — so the operator can see where the spend is going on a per-PR basis.

Why

Right now the gateway emits Prometheus counters (auto_review_llm_tokens_total{tier="reasoning|cheap|embedding",dir="input|output"}) but there is no per-review attribution. A misconfigured tier (e.g. accidentally pointing the cheap tier at an expensive reasoning model), an oversized prompt, or a runaway agentic-verify loop is invisible until the monthly invoice. Surfacing the cost in the review body itself (and in review history) closes that loop:

  • PR authors and reviewers see whether the bot was cheap-and-cheerful or expensive on this particular review.
  • Operators get a per-PR signal for outlier detection without scraping Prometheus.
  • The history store can summarise spend per repo / per author / per week without a separate counter aggregation pipeline.

Sketch

  • ar-llm: provider responses already surface usage.prompt_tokens and usage.completion_tokens for OpenAI-compatible APIs — extend the Router so each complete() / embed() call returns a Usage { input, output, model } alongside the response. (Embedding endpoints only report prompt_tokens.)
  • ar-llm: introduce a PriceTable keyed by (provider_base_url, model_name) → { input_per_1m_usd, output_per_1m_usd, embedding_per_1m_usd }. Default it to current OpenAI public pricing for the models we ship in QUICKSTART.md. Operators override via AR_PRICE_TABLE_PATH (TOML/JSON) so on-prem and OpenRouter / Together / Groq deployments can plug in their own rates.
  • ar-orchestrator::dispatcher: aggregate Usage across every LLM call in the per-PR pipeline (triage, embedding, reasoning, self-heal, verify, agentic-verify, pre-merge custom). Persist the aggregate in the review-history record alongside the existing finding count + diff size.
  • ar-review::pipeline: append a one-line cost summary to the review markdown body — e.g.
    _Reviewed with reasoning=gpt-4o (12 432 in / 1 089 out), cheap=gpt-4o-mini (8 920 in / 312 out), embed=text-embedding-3-small (4 096 in). Estimated cost: $0.14._
    
    Hidden behind a default-on AR_REVIEW_COST_FOOTER env var so operators can disable it for repos where the visibility is unwanted.
  • ar-cli status / ar-cli history: show cost per review in the existing tables; add a --since / --repo filter so an operator can answer "what did this bot cost me last week?" in one command.
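The usage, pricing, and aggregation pieces sketched above could look roughly like this. This is a minimal sketch, not the shipped API: the Usage/Price field names follow the bullet points, but the per-1M rates are illustrative numbers, and the tier-keyed aggregation is one possible shape for what the dispatcher would persist.

```rust
use std::collections::HashMap;

/// Token usage reported by a single complete()/embed() call
/// (field names assumed; embedding calls would leave `output` at 0).
#[derive(Debug, Default, Clone, Copy)]
pub struct Usage {
    pub input: u64,
    pub output: u64,
}

/// USD rates per 1M tokens for one model (illustrative numbers only).
#[derive(Debug, Clone, Copy)]
pub struct Price {
    pub input_per_1m_usd: f64,
    pub output_per_1m_usd: f64,
}

/// model × tokens × per-token price, as described in the Idea section.
pub fn estimate_cost_usd(u: Usage, p: Price) -> f64 {
    (u.input as f64 * p.input_per_1m_usd + u.output as f64 * p.output_per_1m_usd)
        / 1_000_000.0
}

/// Aggregate usage per tier across every LLM call in the per-PR pipeline,
/// as the dispatcher bullet suggests.
pub fn aggregate(calls: &[(&'static str, Usage)]) -> HashMap<&'static str, Usage> {
    let mut totals: HashMap<&'static str, Usage> = HashMap::new();
    for &(tier, u) in calls {
        let t = totals.entry(tier).or_default();
        t.input += u.input;
        t.output += u.output;
    }
    totals
}

fn main() {
    // Token counts from the example footer above; the rates are made up.
    let reasoning = Usage { input: 12_432, output: 1_089 };
    let price = Price { input_per_1m_usd: 2.50, output_per_1m_usd: 10.00 };
    println!("estimated: ${:.2}", estimate_cost_usd(reasoning, price));
}
```

Keeping `estimate_cost_usd` a pure function of `(Usage, Price)` makes the pinned-rate test harness in the Acceptance section trivial to write.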

Out of scope

  • Predicting cost before the review runs (would need diff-size → token estimate; deferrable).
  • Provider-side rate limiting / budget enforcement (separate concern; this issue is observability only).
  • Dollar-denominated invoicing reconciliation against actual provider bills (rounding, tier discounts, batch credits all differ per provider).

Acceptance

  • A review body posted by the bot ends with a cost-summary line listing each tier's model + token counts + an estimated USD figure (estimate, not invoice).
  • review-history records carry the per-review aggregate so historical queries don't re-scan logs.
  • A PriceTable test harness pins the default OpenAI rates so a price drift is visible in git diff.
  • AR_PRICE_TABLE_PATH allows an operator to override the table without rebuilding.
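One possible shape for the file AR_PRICE_TABLE_PATH points at (the field names mirror the PriceTable sketch; the array-of-tables layout, model names, and rates here are illustrative assumptions, not defaults the project ships):

```toml
# Illustrative AR_PRICE_TABLE_PATH override (TOML variant).
# Each entry is keyed by (provider_base_url, model_name);
# rates are USD per 1M tokens.

[[model]]
provider_base_url = "https://api.openai.com/v1"
model_name = "gpt-4o"
input_per_1m_usd = 2.50
output_per_1m_usd = 10.00

[[model]]
provider_base_url = "https://api.openai.com/v1"
model_name = "text-embedding-3-small"
embedding_per_1m_usd = 0.02
```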