Embedding pipeline sends too much context for local Ollama embedders #20

Closed
opened 2026-05-01 19:39:09 -07:00 by jwilger · 0 comments

Problem

The embedding pipeline sizes its per-input cap for hosted OpenAI text-embedding-3-* (~8191 tokens / ~32 KiB English). When the embedding tier is pointed at a local Ollama model — e.g. nomic-embed-text (default 2048 tokens / ~8 KiB context) — we routinely send inputs larger than the model's context window. The model silently truncates (or, depending on Ollama version, errors), embedding quality degrades, and the local embedder spends much more compute than it needs to.

Current behaviour

crates/ar-index/src/embed.rs:42-52:

```rust
pub const EMBED_BATCH_SIZE: usize = 32;
pub const EMBED_INPUT_CAP_BYTES: usize = 24 * 1024;
```

  • EMBED_INPUT_CAP_BYTES = 24 * 1024 is hard-coded and explicitly justified by the OpenAI 8191-token limit in the doc comment.
  • EMBED_BATCH_SIZE = 32 is the only batching control; there is no aggregate token/byte cap per request.
  • truncate_at_char_boundary cuts at a flat byte boundary (sketched after this list), which is far too generous for embedders with smaller context windows: nomic-embed-text's default num_ctx=2048 corresponds to roughly 8 KiB, yet we feed it up to 24 KiB.
  • crates/ar-llm/src/openai.rs (the OpenAI-shaped client used for Ollama) sends no options.num_ctx and no Ollama-specific tuning; the per-input cap is the only guardrail.
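
For context, a flat byte-boundary cut looks roughly like the sketch below (hypothetical; the real truncate_at_char_boundary in embed.rs may differ in detail). The point is that the cut is a fixed byte count with no awareness of the embedder's actual context window.

```rust
/// Hypothetical sketch of a flat byte-cap truncation helper, roughly the
/// shape of truncate_at_char_boundary (the real function may differ).
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Back up from the byte cap until the cut lands on a UTF-8 character
    // boundary, so a multi-byte character is never split.
    let mut end = max_bytes;
    while end > 0 && !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```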

Why this matters

  • For the operator's documented dev setup (qwen3-coder:30b reasoning + nomic-embed-text embedding on Ollama), every symbol whose snippet exceeds ~8 KiB exhausts the embedder's context window. The model emits an embedding that doesn't represent the full snippet, so RAG retrieval ranks worse on exactly the long functions where context matters most.
  • Local embedders run on the host CPU/GPU; sending 3× the model's actual context per call is a real wall-clock cost on every review.
  • There is no per-batch token budget, so a batch of 32 medium-large snippets can also blow past the model's effective batch limit even if no single input is over-cap.

Proposed direction

  1. Make EMBED_INPUT_CAP_BYTES and EMBED_BATCH_SIZE configurable rather than pub const. Reasonable env-var names: AR_EMBED_INPUT_CAP_BYTES, AR_EMBED_BATCH_SIZE (see the configuration sketch after this list).
  2. Pick defaults that are safe for small local embedders (e.g. cap ≈ 6 KiB to fit comfortably under a 2048-token window with a margin), and document the recommended override for hosted OpenAI-class embedders.
  3. Optionally, expose embedder metadata on LlmProvider (e.g. embedding_context_window() returning Option<usize>) so the embed pass can size the cap from the actual model config when known.
  4. For the OpenAI-shaped client, when the base URL points at Ollama (localhost:11434 or any non-OpenAI host), pass options.num_ctx explicitly so the server doesn't fall back to its default 2048 silently.
  5. Add a debug/trace log on every embed batch with the selected cap, the largest input size in the batch, and the batch input count, so the next person debugging "why is RAG returning garbage on local Ollama" can see immediately whether truncation is happening (see the logging sketch below).
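
A minimal sketch of points 1–3, assuming the env-var names above and a hypothetical embedding_context_window() value supplied by the provider; exact names, defaults, and plumbing are open:

```rust
use std::env;

/// Conservative bytes-per-token ratio for byte-based sizing (we don't ship a
/// tokenizer); 3 bytes/token leaves headroom under a 2048-token window.
const BYTES_PER_TOKEN: usize = 3;

/// Default cap chosen to fit a 2048-token local embedder with margin (~6 KiB).
const DEFAULT_INPUT_CAP_BYTES: usize = 6 * 1024;
const DEFAULT_BATCH_SIZE: usize = 32;

fn env_usize(name: &str) -> Option<usize> {
    env::var(name).ok()?.parse().ok()
}

/// Resolve the per-input cap: an explicit env override wins, then a cap
/// derived from the provider's advertised context window (if it exposes
/// one), then the conservative default.
fn resolve_input_cap(embedding_context_window: Option<usize>) -> usize {
    if let Some(cap) = env_usize("AR_EMBED_INPUT_CAP_BYTES") {
        return cap;
    }
    if let Some(tokens) = embedding_context_window {
        return tokens.saturating_mul(BYTES_PER_TOKEN);
    }
    DEFAULT_INPUT_CAP_BYTES
}

fn resolve_batch_size() -> usize {
    env_usize("AR_EMBED_BATCH_SIZE").unwrap_or(DEFAULT_BATCH_SIZE)
}
```

With this in place, hosted OpenAI-class embedders would be documented as setting AR_EMBED_INPUT_CAP_BYTES back up (e.g. to 24 KiB) rather than relying on the small-embedder default.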

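For point 5, assuming the workspace already logs via the tracing crate (an assumption; adapt the same fields to whatever logging facade is actually in place), the per-batch trace point could be as small as:

```rust
// Hypothetical per-batch trace point, called with the pre-truncation inputs
// so over-cap inputs are still visible. Assumes the `tracing` crate.
fn log_embed_batch(cap_bytes: usize, inputs: &[String]) {
    let largest = inputs.iter().map(|s| s.len()).max().unwrap_or(0);
    let over_cap = inputs.iter().filter(|s| s.len() > cap_bytes).count();
    tracing::debug!(
        cap_bytes,
        largest_input_bytes = largest,
        batch_len = inputs.len(),
        inputs_over_cap = over_cap,
        "embedding batch"
    );
}
```
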
Out of scope

  • Switching the workspace embedder away from Ollama.
  • Token-accurate sizing (we don't ship a tokenizer); byte-based caps with a conservative ratio are fine.