week_2: Module C (The Librarian) — C.0 input boundary: SectionValidator + ExplicitLinkResolver by PRAteek-singHWY · Pull Request #925 · OWASP/OpenCRE

PRAteek-singHWY · 2026-06-10T21:35:04Z

week_2: Module C (The Librarian) — C.0 deterministic input boundary: SectionValidator + ExplicitLinkResolver

Stacked on #922. This branch is based on gsocmodule_C_week_1; only the top commit is new. I will rebase onto main as soon as #922 merges, which will shrink the diff to the Week 2 files only.

Overview

This is the Week 2 deliverable for Module C (The Librarian): C.0, the deterministic input boundary. Before any retrieval or ML, every incoming chunk passes two deterministic stages:

SectionValidator — is this input well-formed and usable? Validates the row, adapts it into the internal Section the pipeline consumes, and rejects bad input with typed errors.
ExplicitLinkResolver — does the text already cite a CRE id (ddd-ddd)? If exactly one known id is cited, link it directly with no ML. Unknown or conflicting references never auto-link — they route to human review (fail-safe).
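The fast path described above can be sketched as follows. This is a minimal illustration, not the code in `explicit_link_resolver.py`: the regex and the four outcome names come from this PR, while `Resolution`, `resolve_explicit_links`, and the tie-breaking order (any unknown id wins over a conflict) are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import AbstractSet, Optional, Tuple

# Pattern quoted in the PR description: three digits, hyphen, three digits.
CRE_ID_PATTERN = re.compile(r"\b\d{3}-\d{3}\b")


@dataclass(frozen=True)
class Resolution:
    """Hypothetical result type; the real module may shape this differently."""

    outcome: str
    cre_id: Optional[str] = None
    suggestions: Tuple[str, ...] = ()


def resolve_explicit_links(text: str, known_ids: AbstractSet[str]) -> Resolution:
    """Resolve cited CRE ids against an injected set of known ids."""
    cited = sorted(set(CRE_ID_PATTERN.findall(text)))
    if not cited:
        # Nothing cited: continue to the semantic path.
        return Resolution("no_reference")
    known = tuple(c for c in cited if c in known_ids)
    if len(known) < len(cited):
        # Assumption: any unknown id blocks auto-linking, known ids become suggestions.
        return Resolution("unknown_reference", suggestions=known)
    if len(known) > 1:
        # Several distinct known ids: never guess, route to review.
        return Resolution("conflicting_references", suggestions=known)
    return Resolution("resolved", cre_id=known[0])
```

Because the known-id set is a plain argument, the same function can be fed from a fixture today and from a registry lookup later without changing its body.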

A naming note for reviewers: the plan documents called this component "SectionNormalizer", but the RFC (#734) assigns text normalization to Module A (harvest + normalize + chunk), with Module B's sanitizer on top. By the time text reaches Module C it is contractually clean, and re-cleaning it here would silently drift C from what A hashed and B classified. This component validates and adapts — it never transforms text — so it is named SectionValidator.

Scope: 4 new files + 1 updated (the eval harness). No frontend, no migrations, no DB access, no behaviour change to OpenCRE proper.

What changed

| Area | Files | Description |
| --- | --- | --- |
| C.0 validator | `section_validator.py` | Validates both upstream shapes (Module B's reduced `knowledge_queue` row and the full RFC `KnowledgeItem` envelope) into a frozen internal `Section`. Synthesizes the RFC identity fields from B's row (`chunk_id = chk:{repo}@{sha}:{path}`, `artifact_id = art:{repo}:{path}`, `source`, `locator`). Strips volatile audit metadata (`llm_reasoning`, filter stages, run timestamps) so downstream stages can never key decisions on it. Rejections are typed (`MalformedKnowledgeItemError`, `EmptyTextError`, `UnsupportedLanguageError`, `NotKnowledgeError`); raw Pydantic `ValidationError` never escapes the boundary. |
| C.0.5 fast path | `explicit_link_resolver.py` | Deterministic regex (`\b\d{3}-\d{3}\b`) extraction + resolution against an injected set of known CRE ids. Outcomes: `resolved` (single known id → auto-link), `no_reference` (continue to the semantic path, W3+), `unknown_reference` / `conflicting_references` (→ review, with the known ids preserved as suggestions). The known-id set is injected so this module stays dependency-free: the harness seeds it from the golden dataset today; the DB-backed `cre.external_id` registry arrives with the retriever (W3). |
| Eval harness | `evaluate_librarian.py` | Now runs every golden row through C.0: prints the validation pass rate per slice, and gates the explicit slice at 100% resolver correctness; the script exits non-zero on any regression, so CI can block on it. `predict()` makes its first real predictions (explicit path only; semantic path still stubbed). |
| Tests | `section_validator_test.py`, `explicit_link_resolver_test.py` | 15 new tests, table-driven: every rejection class, identity synthesis, volatile-metadata stripping, language variants (`en-GB` accepted, `fr` rejected), and every resolver outcome including pattern boundaries (`027-5555` and `CVE-2024-1234-567` must not match). One test asserts the boundary never leaks a raw Pydantic error. |
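The validator row above can be illustrated with a standard-library sketch. The error class names, the `Section` name, the `section_from_queue_row` name, and the two identity formats are taken from this PR; the row keys (`repo`, `sha`, `path`, `text`, `language`, `is_knowledge`), the default language, and the check order are assumptions, and the real module validates through Pydantic models rather than a plain mapping.

```python
from dataclasses import dataclass
from typing import Any, Mapping


class SectionValidationError(Exception):
    """Base class so callers can catch every boundary rejection at once."""


class MalformedKnowledgeItemError(SectionValidationError): ...
class EmptyTextError(SectionValidationError): ...
class UnsupportedLanguageError(SectionValidationError): ...
class NotKnowledgeError(SectionValidationError): ...


@dataclass(frozen=True)
class Section:
    """Internal, immutable view of a chunk; carries no audit metadata."""

    chunk_id: str
    artifact_id: str
    text: str
    language: str


_REQUIRED = ("repo", "sha", "path", "text")  # hypothetical row keys


def section_from_queue_row(row: Mapping[str, Any]) -> Section:
    bad = [k for k in _REQUIRED if not isinstance(row.get(k), str)]
    language = row.get("language", "en")  # assumption: absent means English
    if bad or not isinstance(language, str):
        raise MalformedKnowledgeItemError(f"missing or non-string fields: {bad}")
    if not row["text"].strip():
        raise EmptyTextError("text is empty or whitespace-only")
    if language.split("-")[0].lower() != "en":
        # Regional variants such as en-GB pass; other languages are rejected.
        raise UnsupportedLanguageError(language)
    if row.get("is_knowledge") is False:
        raise NotKnowledgeError("row was not classified as knowledge")
    repo, sha, path = row["repo"], row["sha"], row["path"]
    # Text is copied verbatim: this boundary validates and adapts, never transforms.
    return Section(
        chunk_id=f"chk:{repo}@{sha}:{path}",
        artifact_id=f"art:{repo}:{path}",
        text=row["text"],
        language=language,
    )
```

Building `Section` from an explicit allow-list of fields is what drops keys such as `llm_reasoning`: anything not named in the constructor call never reaches downstream stages.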

How the pieces connect

flowchart TB
    subgraph UPSTREAM["Upstream shapes (Week 1 contracts)"]
        row["KnowledgeQueueItem<br/>(Module B's reduced row)"]
        ki["KnowledgeItem<br/>(RFC envelope)"]
    end

    subgraph C0["C.0 — input boundary (this PR)"]
        val["section_validator.py<br/>validate · adapt · synthesize identity"]
        sec["Section<br/>(internal, frozen)"]
        res["explicit_link_resolver.py<br/>regex ddd-ddd, no ML"]
    end

    subgraph OUT["Outcomes"]
        link["resolved →<br/>deterministic link"]
        sem["no_reference →<br/>semantic path (W3+)"]
        rev["unknown / conflicting →<br/>human review"]
        err["typed errors:<br/>Malformed · EmptyText ·<br/>UnsupportedLanguage · NotKnowledge"]
    end

    subgraph HARNESS["Eval harness (updated)"]
        eval["evaluate_librarian.py<br/>pass rate per slice ·<br/>explicit gate 100% (exit 1 on fail)"]
    end

    row --> val
    ki --> val
    val --> sec
    val -. reject .-> err
    sec --> res
    res --> link
    res --> sem
    res --> rev
    sec --> eval
    res --> eval

Results

validation pass rate (C.0 boundary):
  ambiguous      5/5 (100%)
  explicit       5/5 (100%)
  hard_negative  12/12 (100%)
  positive       292/292 (100%)
  update         5/5 (100%)
explicit slice (C.0.5 resolver): 5/5 — gate 100%: PASS

66 tests passing (51 from Week 1 + 15 new)
mypy --strict clean on both new modules (repo flags)
black clean
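The gate behind the `PASS` line above can be sketched as a function that turns per-row results into a process exit code. This is an illustration only: `explicit_gate` and its input shape are hypothetical, and treating an empty gated slice as a failure is an assumption rather than documented behaviour of `evaluate_librarian.py`.

```python
from collections import Counter
from typing import Iterable, Tuple


def explicit_gate(results: Iterable[Tuple[str, bool]], gated_slice: str = "explicit") -> int:
    """Print per-slice pass counts and return 0 only if the gated slice is 100% correct."""
    totals: Counter = Counter()
    passed: Counter = Counter()
    for slice_name, ok in results:
        totals[slice_name] += 1
        passed[slice_name] += int(ok)
    for name in sorted(totals):
        pct = passed[name] * 100 // totals[name]
        print(f"  {name:<14} {passed[name]}/{totals[name]} ({pct}%)")
    # Assumption: an empty gated slice fails, so a missing fixture cannot pass silently.
    gate_ok = totals[gated_slice] > 0 and passed[gated_slice] == totals[gated_slice]
    print(f"{gated_slice} slice: gate 100%: {'PASS' if gate_ok else 'FAIL'}")
    return 0 if gate_ok else 1
```

A script would typically pass the return value to `sys.exit()`, which is what lets CI block on a regression.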

What is intentionally not here

Retriever, embeddings, cross-encoder, SafetyGuard, decision engine, CLI wiring, and DB models/migrations are later weeks (W3–W8 per the proposal). The semantic path in the harness still predicts nothing — Week 3 plugs the retriever into the same predict() seam.

How to verify locally

# 66 tests
python3 -m unittest discover -s application/tests/librarian -p '*_test.py' -t .

# harness end-to-end: per-slice pass rate + explicit gate
python3 scripts/evaluate_librarian.py --dataset application/tests/librarian/fixtures/golden_dataset.json

# explicit slice only
python3 scripts/evaluate_librarian.py --dataset application/tests/librarian/fixtures/golden_dataset.json --slice explicit

🤖 Generated with Claude Code

dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.


coderabbitai · 2026-06-10T21:35:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9a4092a1-ee50-4ffc-8f75-36e7483fb84e

📥 Commits

Reviewing files that changed from the base of the PR and between b69a735 and b4a6aa2.

📒 Files selected for processing (5)

application/tests/librarian/explicit_link_resolver_test.py
application/tests/librarian/section_validator_test.py
application/utils/librarian/explicit_link_resolver.py
application/utils/librarian/section_validator.py
scripts/evaluate_librarian.py

🚧 Files skipped from review as they are similar to previous changes (5)

application/tests/librarian/section_validator_test.py
application/tests/librarian/explicit_link_resolver_test.py
application/utils/librarian/explicit_link_resolver.py
application/utils/librarian/section_validator.py
scripts/evaluate_librarian.py

Summary by CodeRabbit

New Features
- Deterministic explicit CRE reference fast-path, hub firewall, config loader, schema-backed input validator, scoring, dataset builder, and evaluation harness.
Tests
- New comprehensive test suites covering config, schemas, dataset integrity, explicit-reference behavior, hub firewall, scoring, and section validation.
Documentation
- Added RFC-compliant JSON schemas and module-level librarian documentation.
Chores
- Pydantic pinned to v2.

Walkthrough

Adds Module C (Librarian): RFC JSON schemas + Pydantic models, env-config loader, deterministic explicit-link resolver, hub firewall, scoring, C.0 input boundary, golden dataset builder, evaluation harness, and comprehensive unit tests.

Changes

Module C Librarian Implementation

| Layer / File(s) | Summary |
| --- | --- |
| RFC contracts and Pydantic models `application/utils/librarian/__init__.py`, `application/utils/librarian/_rfc_schemas/*`, `application/utils/librarian/schemas.py`, `requirements.txt` | Vendored JSON Schemas and Pydantic v2 models implement RFC `#734` envelopes (`KnowledgeItem`, `LinkProposal`, `ReviewItem`) and supporting types; module docstring and `pydantic>=2,<3` pin. |
| Configuration loader and validation `application/utils/librarian/config_loader.py`, `application/tests/librarian/config_loader_test.py` | Frozen `LibrarianConfig` and `load_config()` parse `CRE_LIBRARIAN_*` env vars, cast and validate numeric bounds/ordering; tests cover defaults, overrides, and invalid inputs. |
| Core utilities `application/utils/librarian/explicit_link_resolver.py`, `application/utils/librarian/hub_firewall.py`, `application/utils/librarian/knowledge_source.py`, `application/utils/librarian/scoring.py`, `application/tests/librarian/explicit_link_resolver_test.py`, `application/tests/librarian/hub_firewall_test.py`, `application/tests/librarian/scoring_test.py` | Deterministic `ddd-ddd` extraction and resolution with explicit outcomes, hub firewall to remove hub echoes, fixture-backed `KnowledgeSource`, and Jaccard-based scoring with unit tests for edge cases. |
| C.0 input boundary validation `application/utils/librarian/section_validator.py`, `application/tests/librarian/section_validator_test.py` | `section_from_queue_row` and `section_from_knowledge_item` convert upstream inputs into `Section` objects with synthesized IDs, language gating, and a controlled `SectionValidationError` hierarchy; tests validate happy and rejection paths. |
| Golden dataset construction and validation `scripts/build_golden_dataset.py`, `application/tests/librarian/fixtures/golden_dataset.schema.json`, `application/tests/librarian/fixtures/sample_knowledge_queue.jsonl`, `application/tests/librarian/dataset_test.py` | Deterministic builder produces golden dataset across slices (positive, multilink, hard_negative, explicit, update, ambiguous); JSON Schema fixture and tests assert shape, provenance, multi-link rows, ID uniqueness, and determinism via `--check`. |
| Evaluation harness and comprehensive tests `scripts/evaluate_librarian.py`, `application/tests/librarian/schemas_test.py` | Evaluation harness loads golden rows, synthesizes queue rows, applies deterministic explicit-only predictor, scores cases (with optional firewall), enforces explicit-slice gate, and `schemas_test.py` validates canonical schema round-trips and Pydantic model constraints. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the primary change: introducing C.0 input boundary validators (SectionValidator and ExplicitLinkResolver) for Module C Week 2. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the purpose, scope, design, and implementation details of the C.0 input boundary components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

application/utils/librarian/schemas.py (2)
272-290: 💤 Low value

Consider making the extra field handling explicit for clarity.

The docstring (lines 276-277) states that this model "tolerates extra fields so B can extend the row without breaking C," which relies on Pydantic v2's default of extra="ignore". This works correctly, but setting it explicitly would make the intent clear and guard against a future change in Pydantic's defaults.
📝 Suggested improvement
 class KnowledgeQueueItem(BaseModel):
     """Read-side mirror of Module B's `knowledge_queue` Postgres row.
 
     Per master guide §1.2: C reads these rows and synthesizes the RFC
     `KnowledgeItem` envelope from them. Not a wire contract; tolerates extra
     fields so B can extend the row without breaking C.
     """
 
+    model_config = ConfigDict(extra="ignore")
     id: str
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/schemas.py` around lines 272 - 290, The
KnowledgeQueueItem Pydantic model currently relies on Pydantic v2's default of
ignoring extra fields; make this explicit by adding an explicit model config to
the KnowledgeQueueItem class (e.g., set model_config = {"extra": "ignore"} or
the equivalent Config/ModelConfig pattern used across the codebase) so BaseModel
subclass KnowledgeQueueItem clearly documents and enforces tolerant handling of
unknown fields from Module B.
99-107: 💤 Low value

Consider aligning the language field default with the JSON schema.

The JSON schema (knowledge-item.json Line 43) specifies "language": { "type": "string", "default": "en" }, but the Pydantic model has language: Optional[str] = None. While JSON Schema default values are not enforced during validation (they're hints for tooling), this creates a semantic mismatch: the schema documents that language should default to "en", but the model defaults to None.

If the intent is for language to always be "en" when unspecified, consider:
language: str = "en"
If None is intentionally allowed (meaning "language unknown"), the current model is correct but consider updating the JSON schema to clarify this.

Since round-trip tests are passing (per PR description), this may be intentional or the tests may not validate default value behavior strictly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/schemas.py` around lines 99 - 107, The
KnowledgeContent Pydantic model's language field currently allows None but the
JSON schema documents a default of "en"; to align them, change the field in the
KnowledgeContent class from Optional[str] = None to a non-optional string with
default "en" (i.e., make language: str = "en") so the model provides the
documented default; if instead None is intended, update the JSON schema to
remove or change the default—refer to the KnowledgeContent class and the
language field to implement the change.
requirements.txt (1)
1-119: ⚖️ Poor tradeoff

Consider deduplicating dependency entries.

The file contains duplicate entries that may cause confusion:

compliance-trestle (lines 3, 35)

setuptools (lines 12, 33)

SQLAlchemy (lines 34, 100)

psycopg2-binary (lines 27, 61)

playwright (lines 26, 54)

scikit_learn/scikit-learn (lines 30, 94 — naming inconsistency)

When pip encounters duplicates, it uses the last occurrence, making earlier entries misleading. Consider consolidating these into single entries.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.txt` around lines 1 - 119, requirements.txt contains duplicate
and inconsistent package entries; remove redundant lines and consolidate into
single canonical entries for each package (e.g., keep one compliance-trestle,
one setuptools, one SQLAlchemy, one psycopg2-binary, one playwright) and
normalize scikit-learn to the correct package name (scikit-learn) so pip
behavior is deterministic; ensure any required version specifiers are preserved
when merging and remove exact duplicates (e.g., duplicate PyYAML/version lines)
so the final file lists each dependency only once.
scripts/build_golden_dataset.py (1)
261-261: ⚡ Quick win

Rename unused loop variable to follow convention.

The variable node_id is not used within the loop body. Per Python convention, prefix unused variables with underscore.
♻️ Proposed fix
-    for node_id, name, section_id, text, cre_concat in rows:
+    for _node_id, name, section_id, text, cre_concat in rows:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_golden_dataset.py` at line 261, The loop binding in the for
statement "for node_id, name, section_id, text, cre_concat in rows:" uses an
unused variable node_id; rename it to _node_id to follow Python convention for
unused variables and avoid linter warnings. Update the for header to "for
_node_id, name, section_id, text, cre_concat in rows:" and ensure there are no
references to node_id inside the loop (if any, replace them with the intended
variable or raise if needed).
application/utils/librarian/knowledge_source.py (2)
16-20: ⚡ Quick win

Simplify abstract method body.

The raise NotImplementedError in the abstract method body is redundant. The @abstractmethod decorator already enforces that subclasses must implement this method.
♻️ Proposed fix
 class KnowledgeSource(ABC):
     @abstractmethod
     def items(self) -> Iterator[KnowledgeQueueItem]:
         """Yield knowledge_queue rows awaiting classification."""
-        raise NotImplementedError
+        ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/knowledge_source.py` around lines 16 - 20, Remove
the redundant raise in the abstract method: in the KnowledgeSource class remove
the "raise NotImplementedError" from the items method (keep the `@abstractmethod`
decorator and the docstring or replace the body with a simple pass) so
subclasses are still required to implement KnowledgeSource.items without the
unnecessary explicit exception.
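The claim in this nitpick can be checked with a small standard-library sketch. The class names `ListSource` and `IncompleteSource` and the helper `can_instantiate` are hypothetical, and `str` stands in for `KnowledgeQueueItem` to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class KnowledgeSource(ABC):
    @abstractmethod
    def items(self) -> Iterator[str]:
        """Yield rows awaiting classification."""
        ...  # no explicit raise; the decorator alone blocks instantiation


class ListSource(KnowledgeSource):
    """Concrete subclass that implements the abstract method."""

    def __init__(self, rows: List[str]) -> None:
        self._rows = rows

    def items(self) -> Iterator[str]:
        return iter(self._rows)


class IncompleteSource(KnowledgeSource):
    """Subclass that forgets to implement items()."""


def can_instantiate(cls: type, *args: object) -> bool:
    """Return True if cls(*args) succeeds, False if the ABC machinery rejects it."""
    try:
        cls(*args)
        return True
    except TypeError:
        return False
```

One design consideration either way: with a `...` body, a subclass that calls `super().items()` gets `None` back silently, whereas an explicit `raise NotImplementedError` would surface that mistake, so the explicit raise is not entirely without effect.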
9-9: ⚡ Quick win

Remove unused import.

The json module is imported but never used. Line 34 uses KnowledgeQueueItem.model_validate_json, which is a Pydantic method that handles JSON parsing internally.
♻️ Proposed fix
-import json
 from abc import ABC, abstractmethod
 from typing import Iterator
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/knowledge_source.py` at line 9, Remove the unused
import of the json module at the top of knowledge_source.py; since the code uses
KnowledgeQueueItem.model_validate_json (a Pydantic method) for JSON parsing,
delete the "import json" line to avoid an unused-import warning and keep imports
minimal.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@requirements.txt`:
- Line 66: Update the pydantic version range in requirements.txt to exclude
vulnerable 2.x releases by changing the spec for "pydantic" from
"pydantic>=2,<3" to "pydantic>=2.4.0,<3" so that installations will use the
first patched 2.4.0+ release; locate the "pydantic>=2,<3" entry and replace it
with "pydantic>=2.4.0,<3".



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c707f462-c16f-4413-acca-0aa2f5c45f32

📥 Commits

Reviewing files that changed from the base of the PR and between d796ff5 and b69a735.

📒 Files selected for processing (28)

application/tests/librarian/__init__.py
application/tests/librarian/config_loader_test.py
application/tests/librarian/dataset_test.py
application/tests/librarian/explicit_link_resolver_test.py
application/tests/librarian/fixtures/golden_dataset.json
application/tests/librarian/fixtures/golden_dataset.schema.json
application/tests/librarian/fixtures/sample_knowledge_queue.jsonl
application/tests/librarian/hub_firewall_test.py
application/tests/librarian/schemas_test.py
application/tests/librarian/scoring_test.py
application/tests/librarian/section_validator_test.py
application/utils/librarian/__init__.py
application/utils/librarian/_rfc_schemas/knowledge-item.json
application/utils/librarian/_rfc_schemas/link-proposal.json
application/utils/librarian/_rfc_schemas/locator.json
application/utils/librarian/_rfc_schemas/proposed-link.json
application/utils/librarian/_rfc_schemas/review-item.json
application/utils/librarian/_rfc_schemas/source-ref.json
application/utils/librarian/config_loader.py
application/utils/librarian/explicit_link_resolver.py
application/utils/librarian/hub_firewall.py
application/utils/librarian/knowledge_source.py
application/utils/librarian/schemas.py
application/utils/librarian/scoring.py
application/utils/librarian/section_validator.py
requirements.txt
scripts/build_golden_dataset.py
scripts/evaluate_librarian.py

coderabbitai · 2026-06-10T21:46:04Z

 pycodestyle
 pycparser
-pydantic
+pydantic>=2,<3


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check for Pydantic security advisories and latest v2 version

# Check PyPI for latest Pydantic v2 version
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'

# Check GitHub security advisories for Pydantic
gh api graphql -f query='
{
  securityVulnerabilities(first: 10, ecosystem: PIP, package: "pydantic") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Repository: OWASP/OpenCRE

Length of output: 1343

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get latest Pydantic version (for context)
echo "Latest Pydantic version on PyPI:"
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'

echo
echo "GitHub advisories (more results), filtered to Pydantic 2.x ranges:"
gh api graphql -f query='
{
  securityVulnerabilities(first: 100, ecosystem: PIP, package: "pydantic") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}' | jq -r '
  .data.securityVulnerabilities.nodes[]
  | select((.vulnerableVersionRange|tostring) | test("2\\."))
  | "\nSummary: \(.advisory.summary)\nSeverity: \(.advisory.severity)\nPublishedAt: \(.advisory.publishedAt)\nVulnerableRange: \(.vulnerableVersionRange)\nFirstPatchedVersion: \(.firstPatchedVersion.identifier)"
'

Repository: OWASP/OpenCRE

Length of output: 1525

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Latest Pydantic version on PyPI:"
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'

echo
echo "GitHub advisories (up to 100), filtered to entries whose vulnerableVersionRange mentions '2.':"
gh api graphql -f query='
{
  securityVulnerabilities(first: 100, ecosystem: PIP, package: "pydantic") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}' | jq -r '
  .data.securityVulnerabilities.nodes[]
  | select((.vulnerableVersionRange|tostring) | test("2\\.";"i"))
  | [
      .advisory.severity,
      .advisory.publishedAt,
      .advisory.summary,
      ("vulnerableVersionRange=" + (.vulnerableVersionRange|tostring)),
      ("firstPatchedVersion=" + .firstPatchedVersion.identifier)
    ]
  | @tsv
'

Repository: OWASP/OpenCRE

Length of output: 1553

Update pydantic range to avoid known v2 ReDoS vulnerability (line 66 in requirements.txt)

GitHub advisory reports “Pydantic regular expression denial of service” affects pydantic >= 2.0.0, < 2.4.0 (first patched in 2.4.0).

pydantic>=2,<3 still allows those vulnerable versions; use pydantic>=2.4.0,<3.
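The range arithmetic behind this finding can be checked with a few lines of standard-library Python. This is a deliberately simplified sketch: `parse_release` and `allowed` are hypothetical helpers that handle plain numeric versions only, whereas real specifier matching follows PEP 440 and is normally delegated to the `packaging` library.

```python
from typing import Tuple


def parse_release(version: str, width: int = 3) -> Tuple[int, ...]:
    """Turn '2.4' into (2, 4, 0); pads so '2' and '2.0.0' compare equal."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def allowed(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, comparing numeric release tuples."""
    return parse_release(lower) <= parse_release(version) < parse_release(upper)


# Versions inside the advisory's range (>= 2.0.0, < 2.4.0) versus the first patched one.
for candidate in ("2.0.0", "2.3.0", "2.4.0"):
    print(candidate, allowed(candidate, "2", "3"), allowed(candidate, "2.4.0", "3"))
```

The first boolean column corresponds to the current pin and the second to the suggested one, so any row where they differ is a version the tighter pin excludes.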


…inkResolver

The C.0 deterministic input boundary, per the proposal's W2/W3 'data preparation layer' (named validator, not normalizer: the RFC assigns text normalization to Module A; C validates and adapts, never transforms text).

- section_validator.py: typed-error validation of both upstream shapes (B's knowledge_queue row + RFC KnowledgeItem envelope) into an internal Section; synthesizes RFC identity fields (chunk_id/artifact_id/source/locator) from B's reduced row; strips volatile audit metadata.
- explicit_link_resolver.py: deterministic ddd-ddd fast path, no ML. Fail-safe: only a single known reference auto-links; unknown or conflicting references route to review.
- evaluate_librarian.py: harness now runs every golden row through C.0, prints per-slice validation pass rate, and gates the explicit slice at 100% resolver correctness (exit 1 on regression). Gate: PASS 5/5.
- Table-driven tests for every rejection class and resolver outcome.

mypy --strict clean, black clean, 66 tests green.

PRAteek-singHWY added 2 commits June 9, 2026 17:17

week_1: address review — config validation, schema non-empty constrai…

e097016

…nts, fail-fast build, edge cases

coderabbitai Bot reviewed Jun 10, 2026


PRAteek-singHWY force-pushed the gsocmodule_C_week_2 branch from b69a735 to b4a6aa2 Compare June 10, 2026 21:56

PRAteek-singHWY wants to merge 3 commits into OWASP:main from PRAteek-singHWY:gsocmodule_C_week_2
