fix(pre-merge): scan every marker occurrence in contains_todo_marker #4
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "dogfood/test-review"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
pre_merge::contains_todo_markerusedline.find(marker)which only returns the first occurrence. When that first hit lived inside a longer identifier (failing the word-boundary check), the function gave up on that marker — missing real later occurrences on the same line.match_indicesso every candidate occurrence is boundary-checked.TODOmarker, then TODO again) and suffix collisions (suppressFIXME … FIXME).Test plan
cargo test -p ar-review --lib pre_merge(27 passed, including the two new regression tests)cargo clippy -p ar-review --all-targets -- -D warningscargo fmt --all -- --checkThis PR fixes the
pre_merge::contains_todo_markerfunction to correctly scan all occurrences of markers on a line, preventing missed markers due to substring collisions. It also updates the JSON schema to comply with OpenAI's strict mode and adds regression tests to ensure robustness. The changes appear well-tested and safe to merge.Walkthrough
pre_merge::contains_todo_marker.requiredfields to includewalkthroughandmermaid.line_enda nullable type to comply with OpenAI's strict mode.findwithmatch_indicesto check all occurrences of markers.Pre-merge checks
💡 Note: Ensure that the changelog entry for the
pre_merge::contains_todo_markerfix is clear and concise. Consider adding a brief explanation of why scanning every occurrence is important for clarity.@ -5,3 +5,3 @@"additionalProperties": false,"required": ["summary", "findings"],"required": ["summary", "walkthrough", "mermaid", "findings"],"properties": {💡 Note: The
requiredfield now includeswalkthroughandmermaid. Ensure that these fields are always provided in the JSON response, even if empty, to maintain compatibility with the schema.@ -7,3 +6,4 @@"required": ["summary", "walkthrough", "mermaid", "findings"],"properties": {"summary": {"type": "string",💡 Note: The
line_endproperty is now nullable. Ensure that any code generating JSON responses adheres to this change to prevent schema validation errors.💡 Note: The new test
schema_satisfies_openai_strict_mode_at_every_object_levelensures compliance with OpenAI's strict mode. This is a crucial addition to prevent runtime errors during API interactions.💡 Note: Switching from
findtomatch_indicesensures that all occurrences of markers are checked for word boundaries. This change is crucial for correctly identifying markers that might be missed due to substring collisions.💡 Note: The new regression tests effectively cover scenarios where markers might be missed due to prefix or suffix collisions. These tests are essential for ensuring the robustness of the
contains_todo_markerfunction.* Verification schema (crates/ar-prompts/schemas/verification.json): `reasoning` was in `properties` but missing from `required`. OpenAI strict mode rejects this with HTTP 400, so the verifier failed on every review and no findings were filtered. Add to `required`; the parser already coerces missing/empty via `#[serde(default)]`. * Embed input cap (crates/ar-index/src/embed.rs): a single oversized symbol (>8192 tokens, ~32 KiB English) failed the whole embed batch with HTTP 400, taking the entire RAG pass with it. Truncate each per-symbol snippet at 24 KiB at a UTF-8 char boundary before send. * Forgejo poller URL (crates/ar-forgejo/src/client.rs): `pulls/{n}/comments` does not exist in Forgejo (verified against 15.0.0 — the chat poller was 404'ing every cycle). Switched to `issues/{n}/comments` which is where top-level @-mentions live. Updated all wiremock paths to match. The strict-mode invariant test in crates/ar-prompts/src/schema.rs now walks all four response schemas (review, triage, verification, pre_merge_custom) so a future drift gets caught at `cargo test` not in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>This PR fixes the
contains_todo_markerfunction to correctly scan all occurrences of markers on a line, ensuring accurate detection. It also updates API paths for Forgejo compatibility and improves JSON schema compliance with OpenAI's strict mode. The changes appear well-tested and safe to merge.Walkthrough
find()withmatch_indices()to ensure all occurrences of markers are checked for word boundaries, fixing missed detections./issues/{n}/commentsfor Forgejo compatibility, preventing 404 errors.EMBED_INPUT_CAP_BYTESto truncate oversized symbols, preventing HTTP 400 errors during embedding.required, ensuring compatibility with OpenAI's strict mode and preventing HTTP 400 errors.Pre-merge checks
@ -316,7 +321,7 @@ impl Client {let mut all = Vec::new();💡 Note: The endpoint path has been corrected to use
/issues/{n}/commentsinstead of/pulls/{n}/comments, which does not exist in Forgejo. This change ensures compatibility with Forgejo's API and prevents 404 errors.@ -58,13 +81,16 @@ pub async fn embed_symbols(return Ok(Vec::new());💡 Note: The introduction of
EMBED_INPUT_CAP_BYTESand the truncation logic prevents oversized symbols from causing HTTP 400 errors during embedding. This change improves robustness by ensuring that large symbols do not disrupt the entire embedding process.@ -5,3 +5,3 @@"additionalProperties": false,"required": ["summary", "findings"],"required": ["summary", "walkthrough", "mermaid", "findings"],"properties": {💡 Note: The
requiredfields in the JSON schema have been updated to include all properties, ensuring compatibility with OpenAI's strict mode. This prevents HTTP 400 errors due to missing required fields in the schema.💡 Note: The
requiredfields in the JSON schema have been updated to includereasoning, ensuring compatibility with OpenAI's strict mode. This prevents HTTP 400 errors due to missing required fields in the schema.💡 Note: The change from
find()tomatch_indices()ensures that all occurrences of markers are checked for word boundaries, fixing the issue where only the first occurrence was considered. This is a crucial fix for accurately detecting markers likeTODOorFIXMEin lines with multiple occurrences.This PR fixes several issues, including ensuring all occurrences of TODO markers are checked, updating API endpoints for Forgejo compatibility, and enforcing schema compliance with OpenAI's strict mode. The changes are well-tested and appear safe to merge.
Walkthrough
find()withmatch_indices()to ensure all occurrences of markers are checked for word boundaries, fixing missed markers.issues/{n}/commentsinstead ofpulls/{n}/commentsfor Forgejo compatibility.EMBED_INPUT_CAP_BYTESto truncate oversized inputs, preventing HTTP 400 errors from OpenAI.requiredarrays, preventing HTTP 400 errors.Pre-merge checks
@ -316,7 +321,7 @@ impl Client {let mut all = Vec::new();💡 Note: The change from
pulls/{n}/commentstoissues/{n}/commentsis correct for Forgejo compatibility, as the previous endpoint does not exist and would result in a 404 error.@ -58,13 +81,16 @@ pub async fn embed_symbols(return Ok(Vec::new());💡 Note: The introduction of
EMBED_INPUT_CAP_BYTESand truncation logic is a good safeguard against oversized inputs that could cause HTTP 400 errors from OpenAI. This ensures that the RAG pass is not disrupted by a single large symbol.@ -5,3 +5,3 @@"additionalProperties": false,"required": ["summary", "findings"],"required": ["summary", "walkthrough", "mermaid", "findings"],"properties": {💡 Note: Ensure that all required properties are listed in the
requiredarray to comply with OpenAI's strict mode. This change prevents HTTP 400 errors during schema validation.💡 Note: Adding
reasoningto therequiredarray ensures compliance with OpenAI's strict mode, preventing HTTP 400 errors during schema validation.@ -184,2 +185,2 @@if let Some(idx) = line.find(marker) {let before_ok = idx == 0 || !line.as_bytes()[idx - 1].is_ascii_alphanumeric();for (idx, _) in line.match_indices(marker) {let before_ok = idx == 0 || !bytes[idx - 1].is_ascii_alphanumeric();💡 Note: Switching from
find()tomatch_indices()ensures that all occurrences of markers are checked for word boundaries, fixing the issue where markers could be missed if they appeared after a non-boundary match.Acking all three reviews in one place — every comment was an LGTM-style note confirming the fix in the same commit, so nothing is left to change.
Review 709 (head
690e78a, fix(pre-merge): contains_todo_marker):crates/ar-review/src/pre_merge.rs× 2 — confirmed:find()→match_indices()is the bug fix; regression tests cover prefix + suffix collisions.Review 710 (head
f3c5f69, fix(prompts): OpenAI strict-mode review schema):crates/ar-prompts/schemas/review.json× 2 — confirmed:walkthrough/mermaidadded torequired;line_endis nullable soOption<u32>still deserializes JSONnull.crates/ar-prompts/src/schema.rs— confirmed: the recursive walker locks the strict-mode invariant for every object level.CHANGELOG.md— noted; entry is intentionally short since the rationale is in the commit message.crates/ar-review/src/pre_merge.rs× 2 — same as 709.Review 711 (head
fb19499, three regressions found dogfooding):crates/ar-forgejo/src/client.rs— confirmed:pulls/{n}/commentsreturns 404 on Forgejo 15.0.0; switched toissues/{n}/comments.crates/ar-index/src/embed.rs— confirmed: 24 KiB per-snippet cap stops one oversized symbol from killing the whole RAG pass.crates/ar-prompts/schemas/review.jsonandverification.json— confirmed:requiredis now a superset ofpropertiesin both, satisfying OpenAI strict mode.crates/ar-review/src/pre_merge.rs— same as 709/710.Dismissing 709-711 since each is now superseded by HEAD.
This PR addresses several critical issues, including fixing the
contains_todo_markerfunction to check all occurrences of markers, updating API endpoints for Forgejo compatibility, and ensuring JSON schema compliance with OpenAI's strict mode. The changes appear well-tested and should improve robustness.Walkthrough
contains_todo_markerfunction now usesmatch_indicesto ensure all occurrences of markers are checked, fixing a bug where markers could be missed if they were not the first occurrence on a line.pulls/{n}/commentstoissues/{n}/commentsto align with Forgejo's API, preventing 404 errors.requiredand using nullable types for optional fields.Pre-merge checks
@ -316,7 +321,7 @@ impl Client {let mut all = Vec::new();💡 Note: The change from
pulls/{n}/commentstoissues/{n}/commentsis crucial for compatibility with Forgejo, as the former endpoint does not exist. Ensure that this change is thoroughly tested to prevent any issues with comment retrieval.@ -58,13 +81,16 @@ pub async fn embed_symbols(return Ok(Vec::new());💡 Note: The introduction of
EMBED_INPUT_CAP_BYTESis a good safeguard against oversized inputs causing failures. Ensure that this truncation does not affect the quality of the embeddings significantly.@ -35,9 +35,9 @@"description": "1-based line number in the new file content."💡 Note: The addition of
line_endas a nullable type in the schema ensures compatibility with OpenAI's strict mode. Verify that all other optional fields are similarly handled to prevent API errors.💡 Note: The change to
requiredfields in the verification schema addresses a critical issue with API compatibility. Ensure that all necessary fields are included to avoid HTTP 400 errors.💡 Note: Switching from
findtomatch_indicesensures that all occurrences of markers are checked for word boundaries, fixing the issue where markers could be missed. This change should be validated with comprehensive tests to ensure no regressions.This PR addresses a bug in
pre_merge::contains_todo_markerby ensuring all occurrences of markers are checked, not just the first. It also updates API endpoints to align with Forgejo's structure, preventing errors. The changes are well-tested and documented, making the PR safe to merge.Walkthrough
.forgejo/workflows/ci.yml:
nixis available, addressing potential issues with GITHUB_PATH propagation.CHANGELOG.md:
contains_todo_markerand other improvements, providing clear documentation of changes.crates/ar-forgejo/src/client.rs:
crates/ar-review/src/pre_merge.rs:
findwithmatch_indicesto ensure all marker occurrences are checked, fixing the bug where markers could be missed.Pre-merge checks
💡 Note: Lines 4–11: The inline PATH setting is a good defensive measure against potential issues with GITHUB_PATH propagation. This change ensures that the
nixcommand is available in this step, improving robustness.💡 Note: Lines 6–35: The changelog entry accurately describes the fix for scanning every occurrence of markers in
contains_todo_marker. This provides clarity on the issue and the solution implemented.💡 Note: Lines 6–11: Switching the endpoint from
/pulls/{n}/commentsto/issues/{n}/commentsaligns with Forgejo's API structure, preventing 404 errors and ensuring the correct retrieval of comments.💡 Note: Lines 6–11: The change to use
match_indicesinstead offindensures that all occurrences of markers are checked for word boundaries, fixing the issue where markers could be missed if they were not the first occurrence on a line.This PR addresses a critical bug in
pre_merge::contains_todo_markerby ensuring all occurrences of markers are checked, not just the first. It also updates several schemas for stricter validation and fixes API endpoint paths for Forgejo compatibility. The changes are well-tested and appear safe to merge.Walkthrough
requiredfields.Pre-merge checks
💡 Note: Lines 4–11: The added debug commands in the CI workflow are useful for diagnosing issues with the Nix environment. Ensure these are removed or commented out before merging if they are not intended for production use.
💡 Note: Lines 6–35: The changelog entry accurately describes the fix for scanning all occurrences of markers. Ensure this entry is updated if further changes are made to the implementation.
💡 Note: Lines 7–11: The change from
pulls/{n}/commentstoissues/{n}/commentsaligns with Forgejo's API structure. Ensure this change is tested against the latest Forgejo version to confirm compatibility.💡 Note: Lines 6–29: The introduction of
EMBED_INPUT_CAP_BYTESand the truncation logic is a good safeguard against oversized inputs. Ensure this limit is appropriate for all expected use cases.@ -41,6 +41,29 @@ pub enum EmbedError {/// for not failing the whole RAG pass.pub const EMBED_BATCH_SIZE: usize = 32;/// Per-input byte cap. OpenAI's `text-embedding-3-*` rejects any💡 Note: Lines 44–53: The test
oversized_symbol_snippet_is_truncated_before_embed_calleffectively verifies the truncation logic. Consider adding edge cases with symbols just below and above the limit to ensure robustness.@ -3,7 +3,7 @@"title": "Review",💡 Note: Lines 3–7: The addition of
walkthrough,mermaid, andline_endto therequiredfields in the JSON schema ensures stricter validation. Verify that all existing JSON data adheres to this updated schema.💡 Note: Lines 3–7: Adding
reasoningto therequiredfields in the verification schema ensures compliance with OpenAI's strict mode. Ensure that all verification logic provides this field to avoid errors.💡 Note: Lines 4–53: The test
every_response_schema_satisfies_openai_strict_modeis a comprehensive check for schema compliance. Ensure this test is run regularly to catch any future schema drift.💡 Note: Lines 8–12: Switching from
findtomatch_indicesincontains_todo_markerensures all occurrences are checked. This change is well-tested by the added regression tests.The persistent /nix volume survives between runs, but each runner container starts with a fresh /etc/{group,passwd}. The Determinate Systems installer creates `nixbld` group + 32 build users on first install; on subsequent runs only /nix is reused, leaving nix.conf pointing at a non-existent group. Result: error: the group 'nixbld' specified in 'build-users-group' does not exist — failing the step in 2 seconds with no useful output unless you chase the runner's per-attempt log. Recreate the group and users idempotently after the install/reuse check. Drop the diagnostic shimmery from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>This PR fixes the
contains_todo_markerfunction to correctly scan all occurrences of markers on a line, preventing missed markers due to substring collisions. It also updates several schemas to comply with OpenAI's strict mode and adjusts API endpoints for Forgejo compatibility. The changes are well-tested and appear safe to merge.Walkthrough
nixbldusers if missing, ensuringnix flake checkruns without errors.PATHfornix flake check./pulls/{n}/commentsto/issues/{n}/commentsfor Forgejo compatibility.truncate_at_char_boundaryto handle oversized input for embedding, preventing batch failures.review.jsonandverification.jsonto comply with OpenAI's strict mode by listing all properties inrequired.find()withmatch_indices()to ensure all marker occurrences are checked, fixing a critical bug.Pre-merge checks
@ -316,7 +321,7 @@ impl Client {let mut all = Vec::new();💡 Note: The comment accurately describes the change in endpoint from
/pulls/{n}/commentsto/issues/{n}/comments. Ensure that this change is compatible with all intended Forgejo versions.@ -58,13 +81,16 @@ pub async fn embed_symbols(return Ok(Vec::new());💡 Note: The
truncate_at_char_boundaryfunction is a good addition to prevent oversized input errors. Ensure that truncating at 24 KiB is sufficient for your use case and doesn't omit critical information.@ -243,6 +269,35 @@ mod tests {assert!(matches!(err, EmbedError::MissingFile(p) if p == "does/not/exist.rs"));💡 Note: The test
oversized_symbol_snippet_is_truncated_before_embed_calleffectively checks for oversized input handling. Consider adding more edge cases to ensure robustness.@ -5,3 +5,3 @@"additionalProperties": false,"required": ["summary", "findings"],"required": ["summary", "walkthrough", "mermaid", "findings"],"properties": {🟡 Warning: The addition of
walkthroughandmermaidto therequiredlist in the schema might cause issues if these fields are not always provided. Ensure that all code paths provide these fields or handle their absence gracefully.🟡 Warning: Adding
reasoningto therequiredlist in the verification schema is a breaking change. Ensure that all clients provide this field to avoid runtime errors.💡 Note: The switch from
find()tomatch_indices()ensures all occurrences of markers are checked. This is a crucial fix for detecting all TODO markers correctly.