RAG diff embedding skipped on large PRs: 32 KiB byte cap exceeds 8192-token model limit #26

Closed
opened 2026-05-01 23:18:40 -07:00 by jwilger · 0 comments
Owner

Symptom

On PR #25 the gateway logged:

2026-05-02T06:13:57.279342Z  WARN ar_review::context_builder: diff embedding failed; skipping RAG context error=provider error 400: {
    "message": "Invalid 'input[0]': maximum input length is 8192 tokens.",
    "type": "invalid_request_error",
    ...
}

The pipeline swallowed the error and continued without RAG context. The review subsequently posted with findings=0 after the severity floor dropped 9 candidate findings — quality silently degraded with only a WARN.

Root cause

crates/ar-review/src/context_builder.rs:112-113 hardcodes the diff-embedding cap as bytes:

const EMBED_QUERY_CAP: usize = 32 * 1024;
let query_text = crate::diff::cap_for_prompt(diff, EMBED_QUERY_CAP, "");

OpenAI's text-embedding-3-small rejects any input over 8192 tokens. The 32 KiB byte cap is a rough match for English prose, but Rust diffs (dense in identifiers, punctuation, leading whitespace) routinely exceed 8192 tokens well below 32 KiB. PR #25 is the proof.

The EmbedConfig infrastructure introduced by #20 (PR #21) exists for exactly this knob — AR_EMBED_INPUT_CAP_BYTES with DEFAULT_EMBED_INPUT_CAP_BYTES = 6 KiB, see crates/ar-index/src/embed.rs:45. But the diff-embedding path in context_builder.rs was never wired through it. So operators raising/lowering AR_EMBED_INPUT_CAP_BYTES change symbol embedding caps without touching the diff query cap.

Proposed fix

  1. Replace the hardcoded EMBED_QUERY_CAP in context_builder.rs with EmbedConfig::from_env().input_cap_bytes (or pass an EmbedConfig through the call site, matching how embed_symbols_with_config already works).
  2. Lower the default cap from 32 KiB toward something safe for the 8192-token ceiling on dense source — somewhere in the 12–16 KiB range. Operators on Ollama with num_ctx=2048 already need to drop to ~6 KiB; OpenAI users with the 8192-token model should land closer to 16 KiB. The EmbedConfig default of 6 KiB is fine as the conservative floor.
  3. Add a regression test covering the diff-embedding path with an oversized input — assert the request body never exceeds the configured cap.
  4. Optionally: when the embedding 400s, prefer to retry once with a halved cap before giving up. The current skip RAG and continue path is fine as a fallback but a halve-and-retry would recover most cases without operator tuning.

Acceptance

  • A diff that previously triggered the 400 finishes its review with non-empty RAG context (verified in a regression test against a wiremock'd 8192-token boundary, or by replaying PR #25 with the new default).
  • AR_EMBED_INPUT_CAP_BYTES overrides the diff-embedding cap, not just the symbol-embedding cap.
  • No new WARN ar_review::context_builder: diff embedding failed entries for PR diffs the size of #25.

References

  • Log: 2026-05-02T06:13:57Z, repo jwilger/auto_review, PR #25
  • Code: crates/ar-review/src/context_builder.rs:99-120
  • Related: #20 (Ollama embed overflow), PR #21 (introduced EmbedConfig + AR_EMBED_*)
## Symptom On PR #25 the gateway logged: ``` 2026-05-02T06:13:57.279342Z WARN ar_review::context_builder: diff embedding failed; skipping RAG context error=provider error 400: { "message": "Invalid 'input[0]': maximum input length is 8192 tokens.", "type": "invalid_request_error", ... } ``` The pipeline swallowed the error and continued without RAG context. The review subsequently posted with `findings=0` after the severity floor dropped 9 candidate findings — quality silently degraded with only a `WARN`. ## Root cause `crates/ar-review/src/context_builder.rs:112-113` hardcodes the diff-embedding cap as **bytes**: ```rust const EMBED_QUERY_CAP: usize = 32 * 1024; let query_text = crate::diff::cap_for_prompt(diff, EMBED_QUERY_CAP, ""); ``` OpenAI's `text-embedding-3-small` rejects any input over **8192 tokens**. The 32 KiB byte cap is a rough match for English prose, but Rust diffs (dense in identifiers, punctuation, leading whitespace) routinely exceed 8192 tokens well below 32 KiB. PR #25 is the proof. The `EmbedConfig` infrastructure introduced by #20 (PR #21) exists for exactly this knob — `AR_EMBED_INPUT_CAP_BYTES` with `DEFAULT_EMBED_INPUT_CAP_BYTES = 6 KiB`, see `crates/ar-index/src/embed.rs:45`. But the diff-embedding path in `context_builder.rs` was never wired through it. So operators raising/lowering `AR_EMBED_INPUT_CAP_BYTES` change symbol embedding caps without touching the diff query cap. ## Proposed fix 1. Replace the hardcoded `EMBED_QUERY_CAP` in `context_builder.rs` with `EmbedConfig::from_env().input_cap_bytes` (or pass an `EmbedConfig` through the call site, matching how `embed_symbols_with_config` already works). 2. Lower the default cap from 32 KiB toward something safe for the 8192-token ceiling on dense source — somewhere in the 12–16 KiB range. Operators on Ollama with `num_ctx=2048` already need to drop to ~6 KiB; OpenAI users with the 8192-token model should land closer to 16 KiB. The `EmbedConfig` default of 6 KiB is fine as the conservative floor. 3. Add a regression test covering the diff-embedding path with an oversized input — assert the request body never exceeds the configured cap. 4. Optionally: when the embedding 400s, prefer to retry once with a halved cap before giving up. The current `skip RAG and continue` path is fine as a fallback but a halve-and-retry would recover most cases without operator tuning. ## Acceptance - A diff that previously triggered the 400 finishes its review with non-empty RAG context (verified in a regression test against a `wiremock`'d 8192-token boundary, or by replaying PR #25 with the new default). - `AR_EMBED_INPUT_CAP_BYTES` overrides the diff-embedding cap, not just the symbol-embedding cap. - No new `WARN ar_review::context_builder: diff embedding failed` entries for PR diffs the size of #25. ## References - Log: 2026-05-02T06:13:57Z, repo `jwilger/auto_review`, PR #25 - Code: `crates/ar-review/src/context_builder.rs:99-120` - Related: #20 (Ollama embed overflow), PR #21 (introduced `EmbedConfig` + `AR_EMBED_*`)
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
jwilger/auto_review#26
No description provided.