fix(rag): wire diff embed through EmbedConfig (closes #26) #27

Merged
jwilger merged 1 commit from fix/rag-diff-embed-cap into main 2026-05-01 23:38:04 -07:00
Summary

  • crates/ar-review/src/context_builder.rs: replace hardcoded 32 KiB diff-embed cap with EmbedConfig::from_env().input_cap_bytes so AR_EMBED_INPUT_CAP_BYTES governs every embed call. Default drops to 6 KiB (DEFAULT_EMBED_INPUT_CAP_BYTES), within the 8192-token ceiling of text-embedding-3-small even on dense source. Symbol-embed pass now reuses the same EmbedConfig instead of re-reading env.
  • crates/ar-review/src/diff.rs: add cap_for_embed helper — strict byte-bounded truncation, no trailing marker (embeddings encode semantic content, the [truncated] tag only inflated token usage and added 2 bytes of overhead).
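The helper's byte-bounded truncation can be sketched as follows — a minimal stand-in, assuming the real cap_for_embed backs off to the nearest UTF-8 character boundary rather than slicing mid-codepoint:

```rust
/// Minimal sketch of a strict byte-bounded cap (the crate's actual
/// `cap_for_embed` may differ). Returns at most `cap` bytes, backing
/// off to the nearest UTF-8 character boundary; no trailing
/// "[truncated]" marker is appended.
fn cap_for_embed(text: &str, cap: usize) -> &str {
    if text.len() <= cap {
        return text;
    }
    // `cap` may land inside a multi-byte character; walk back until the
    // index is a valid char boundary (index 0 always is).
    let mut end = cap;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

fn main() {
    // Each 'é' is 2 bytes; a cap of 5 falls mid-char and backs off to 4.
    let accented = "é".repeat(4); // 8 bytes total
    assert_eq!(cap_for_embed(&accented, 5), "éé");
    assert_eq!(cap_for_embed("abcdef", 3), "abc");
    assert_eq!(cap_for_embed("hi", 10), "hi"); // short input passes through
}
```

Unlike cap_for_prompt, nothing is appended after the cut, so the output length is always <= cap bytes.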

Why

Closes #26. PR #25's review hit HTTP 400: maximum input length is 8192 tokens on the diff-embedding call, the pipeline silently dropped RAG context, and the review came back with 0 findings after the severity floor dropped 9 candidates. Quality degraded with only a WARN. The EmbedConfig infrastructure introduced for #20 already governs per-symbol embed caps but never reached the diff-embed call site.

Test plan

  • [x] cargo nextest run --workspace --no-tests=pass (998/998 pass)
  • [x] cargo fmt --all -- --check
  • [x] cargo clippy --workspace --all-targets -- -D warnings (no issues)
  • [x] cargo deny check licenses bans sources (ok)
  • [x] New regression test: diff_embedding_input_is_capped_to_embed_config_default drives a 256 KiB diff through the path and asserts the embedder saw <= 6 KiB.
  • [x] Three new unit tests cover cap_for_embed's strict-bound, short-input, and UTF-8 boundary semantics.
  • [ ] Operator with hosted OpenAI: raise AR_EMBED_INPUT_CAP_BYTES=16384 for richer diff context (still well under 8192 tokens for English-heavy diffs).
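The regression test above can be sketched roughly as follows — a stand-in for the real test, where the synthetic diff and the capping step are modeled directly (the actual test drives build_review_context with a recording embedder stub):

```rust
// Default cap value from the PR; the capping helper below is a
// simplified stand-in for the path under test.
const DEFAULT_EMBED_INPUT_CAP_BYTES: usize = 6 * 1024;

/// Strict byte-bounded truncation, backing off to a UTF-8 boundary.
fn cap_for_embed(text: &str, cap: usize) -> &str {
    let mut end = cap.min(text.len());
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

fn main() {
    // ~256 KiB synthetic diff, as in the regression test.
    let diff = "+ let x = 1;\n".repeat(20_200);
    assert!(diff.len() >= 256 * 1024);
    // What the embedder "sees" after capping must fit the default cap.
    let seen = cap_for_embed(&diff, DEFAULT_EMBED_INPUT_CAP_BYTES);
    assert!(seen.len() <= DEFAULT_EMBED_INPUT_CAP_BYTES);
}
```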

References

  • Issue: #26
  • Related: #20 / PR #21 (Ollama embed overflow — introduced EmbedConfig)
  • Symptom log: 2026-05-02T06:13:57Z, repo jwilger/auto_review, PR #25
fix(rag): wire diff embed through EmbedConfig (closes #26)
All checks were successful
auto_review auto_review: no findings
CI / Nix flake check (pull_request) Successful in 1m40s
ff4c6bf239
The diff-embedding path in build_review_context capped its input at a
hardcoded 32 KiB before calling the Embedding tier. text-embedding-3-small
rejects any input over 8192 tokens, and dense source crosses that
threshold well before 32 KiB — large PR diffs were hitting HTTP 400 and
the pipeline was silently dropping RAG context for the whole review.

The EmbedConfig knob introduced for #20 was already governing per-symbol
embedding caps but never reached this call site. Route the diff embed
through the same EmbedConfig::from_env() so AR_EMBED_INPUT_CAP_BYTES
controls every embed call. Default cap drops from 32 KiB to 6 KiB,
matching DEFAULT_EMBED_INPUT_CAP_BYTES — operators on hosted OpenAI can
raise it to ~16 KiB; Ollama users stay safe at the floor.
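The env-driven cap resolution might look like this — the struct field, constant, and variable names come from the PR, while the parse-and-fallback details are assumptions:

```rust
use std::env;

// 6 KiB floor, safe for text-embedding-3-small even on dense source.
const DEFAULT_EMBED_INPUT_CAP_BYTES: usize = 6 * 1024;

/// Sketch of the config knob; only the cap-relevant field is shown.
struct EmbedConfig {
    input_cap_bytes: usize,
}

impl EmbedConfig {
    /// Falls back to the 6 KiB default when AR_EMBED_INPUT_CAP_BYTES is
    /// unset or unparsable (the fallback behavior is an assumption).
    fn from_env() -> Self {
        let input_cap_bytes = env::var("AR_EMBED_INPUT_CAP_BYTES")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .unwrap_or(DEFAULT_EMBED_INPUT_CAP_BYTES);
        EmbedConfig { input_cap_bytes }
    }
}

fn main() {
    // With the variable unset, the default 6 KiB cap is resolved.
    let cfg = EmbedConfig::from_env();
    assert!(cfg.input_cap_bytes > 0);
}
```

Threading one resolved EmbedConfig through both the diff-embed and symbol-embed passes avoids the two call sites disagreeing after a mid-run env change.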

Add crate::diff::cap_for_embed for strict byte-bounded truncation
(unlike cap_for_prompt, no trailing marker — embeddings encode semantic
content, so a "[truncated]" tag would only inflate token usage). The
symbol-embed pass now also threads through the same EmbedConfig instead
of re-reading from env independently.

Test added: diff_embedding_input_is_capped_to_embed_config_default
drives a 256 KiB diff through the path and asserts the embedder saw
<= DEFAULT_EMBED_INPUT_CAP_BYTES (6 KiB).
auto-review left a comment

This PR addresses issue #26 by replacing the hardcoded 32 KiB cap for diff embedding with a configurable cap using EmbedConfig. This change prevents HTTP 400 errors due to exceeding token limits and ensures consistent embedding input sizes. The PR appears safe to merge, with comprehensive tests verifying the new behavior.

Walkthrough

  • context_builder.rs:
    • Replaced the hardcoded 32 KiB cap with EmbedConfig::from_env().input_cap_bytes to ensure embedding input size is governed by environment configuration.
    • Updated embed_and_query_symbols to accept EmbedConfig, ensuring consistent configuration use.
  • diff.rs:
    • Introduced cap_for_embed function for strict byte-bound truncation without a trailing marker, optimizing token usage.
  • Tests:
    • Added tests to verify the new embedding cap behavior and ensure UTF-8 boundary handling.

Pre-merge checks

  • [x] CHANGELOG updated — CHANGELOG.md is in the diff
  • [ ] Tests touched — source changed but no test file appears in the diff
  • [x] No new TODO/FIXME comments — no new TODO/FIXME markers
jwilger deleted branch fix/rag-diff-embed-cap 2026-05-01 23:38:04 -07:00