Add per-review cost estimate based on model + tokens used #28

Open
opened 2026-05-01 23:34:49 -07:00 by jwilger · 0 comments
Owner

Idea

Every review the bot posts should include a cost estimate — model used per stage × input/output tokens × current per-token price — so the operator can see where the spend is going on a per-PR basis.

Why

Right now the gateway emits Prometheus counters (auto_review_llm_tokens_total{tier="reasoning|cheap|embedding",dir="input|output"}) but there is no per-review attribution. A misconfigured tier (e.g. accidentally pointing the cheap tier at an expensive reasoning model), an oversized prompt, or a runaway agentic-verify loop is invisible until the monthly invoice. Surfacing the cost in the review body itself (and in review history) closes that loop:

  • PR authors and reviewers see whether the bot was cheap-and-cheerful or expensive on this particular review.
  • Operators get a per-PR signal for outlier detection without scraping Prometheus.
  • The history store can summarise spend per repo / per author / per week without a separate counter aggregation pipeline.

Sketch

  • ar-llm: provider responses already surface usage.prompt_tokens and usage.completion_tokens for OpenAI-compatible APIs — extend the Router so each complete() / embed() call returns a Usage { input, output, model } alongside the response. (Embedding endpoints only report prompt_tokens.)
  • ar-llm: introduce a PriceTable keyed by (provider_base_url, model_name) → { input_per_1m_usd, output_per_1m_usd, embedding_per_1m_usd }. Default it to current OpenAI public pricing for the models we ship in QUICKSTART.md. Operators override via AR_PRICE_TABLE_PATH (TOML/JSON) so on-prem and OpenRouter / Together / Groq deployments can plug in their own rates.
  • ar-orchestrator::dispatcher: aggregate Usage across every LLM call in the per-PR pipeline (triage, embedding, reasoning, self-heal, verify, agentic-verify, pre-merge custom). Persist the aggregate in the review-history record alongside the existing finding count + diff size.
  • ar-review::pipeline: append a one-line cost summary to the review markdown body — e.g.
    _Reviewed with reasoning=gpt-4o (12 432 in / 1 089 out), cheap=gpt-4o-mini (8 920 in / 312 out), embed=text-embedding-3-small (4 096 in). Estimated cost: $0.14._
    
    Hidden behind a default-on AR_REVIEW_COST_FOOTER env var so operators can disable it for repos where the visibility is unwanted.
  • ar-cli status / ar-cli history: show cost per review in the existing tables; add a --since / --repo filter so an operator can answer "what did this bot cost me last week?" in one command.
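The usage, pricing, and aggregation pieces sketched above could look roughly like this. This is a minimal sketch, not the shipped API: the Usage/Price field names follow the bullet points, but the per-1M rates are illustrative numbers, and the tier-keyed aggregation is one possible shape for what the dispatcher would persist.

```rust
use std::collections::HashMap;

/// Token usage reported by a single complete()/embed() call
/// (field names assumed; embedding calls would leave `output` at 0).
#[derive(Debug, Default, Clone, Copy)]
pub struct Usage {
    pub input: u64,
    pub output: u64,
}

/// USD rates per 1M tokens for one model (illustrative numbers only).
#[derive(Debug, Clone, Copy)]
pub struct Price {
    pub input_per_1m_usd: f64,
    pub output_per_1m_usd: f64,
}

/// model × tokens × per-token price, as described in the Idea section.
pub fn estimate_cost_usd(u: Usage, p: Price) -> f64 {
    (u.input as f64 * p.input_per_1m_usd + u.output as f64 * p.output_per_1m_usd)
        / 1_000_000.0
}

/// Aggregate usage per tier across every LLM call in the per-PR pipeline,
/// as the dispatcher bullet suggests.
pub fn aggregate(calls: &[(&'static str, Usage)]) -> HashMap<&'static str, Usage> {
    let mut totals: HashMap<&'static str, Usage> = HashMap::new();
    for &(tier, u) in calls {
        let t = totals.entry(tier).or_default();
        t.input += u.input;
        t.output += u.output;
    }
    totals
}

fn main() {
    // Token counts from the example footer above; the rates are made up.
    let reasoning = Usage { input: 12_432, output: 1_089 };
    let price = Price { input_per_1m_usd: 2.50, output_per_1m_usd: 10.00 };
    println!("estimated: ${:.2}", estimate_cost_usd(reasoning, price));
}
```

Keeping `estimate_cost_usd` a pure function of `(Usage, Price)` makes the pinned-rate test harness in the Acceptance section trivial to write.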

Out of scope

  • Predicting cost before the review runs (would need diff-size → token estimate; deferrable).
  • Provider-side rate limiting / budget enforcement (separate concern; this issue is observability only).
  • Dollar-denominated invoicing reconciliation against actual provider bills (rounding, tier discounts, batch credits all differ per provider).

Acceptance

  • A review body posted by the bot ends with a cost-summary line listing each tier's model + token counts + an estimated USD figure (estimate, not invoice).
  • review-history records carry the per-review aggregate so historical queries don't re-scan logs.
  • A PriceTable test harness pins the default OpenAI rates so a price drift is visible in git diff.
  • AR_PRICE_TABLE_PATH allows an operator to override the table without rebuilding.
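One possible shape for the file AR_PRICE_TABLE_PATH points at (the field names mirror the PriceTable sketch; the array-of-tables layout, model names, and rates here are illustrative assumptions, not defaults the project ships):

```toml
# Illustrative AR_PRICE_TABLE_PATH override (TOML variant).
# Each entry is keyed by (provider_base_url, model_name);
# rates are USD per 1M tokens.

[[model]]
provider_base_url = "https://api.openai.com/v1"
model_name = "gpt-4o"
input_per_1m_usd = 2.50
output_per_1m_usd = 10.00

[[model]]
provider_base_url = "https://api.openai.com/v1"
model_name = "text-embedding-3-small"
embedding_per_1m_usd = 0.02
```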