week_2: Module C (The Librarian) — C.0 input boundary: SectionValidator + ExplicitLinkResolver#925
week_2: Module C (The Librarian) — C.0 input boundary: SectionValidator + ExplicitLinkResolver#925PRAteek-singHWY wants to merge 3 commits into
Conversation
dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.
…nts, fail-fast build, edge cases
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (5)
Summary by CodeRabbit
WalkthroughAdds Module C (Librarian): RFC JSON schemas + Pydantic models, env-config loader, deterministic explicit-link resolver, hub firewall, scoring, C.0 input boundary, golden dataset builder, evaluation harness, and comprehensive unit tests. ChangesModule C Librarian Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
application/utils/librarian/schemas.py (2)
272-290: 💤 Low valueConsider making the
extrafield handling explicit for clarity.The docstring (Line 276-277) states this model "tolerates extra fields so B can extend the row without breaking C," relying on Pydantic v2's default behavior of
extra="ignore". While this works correctly, being explicit would improve code clarity and prevent confusion if Pydantic defaults change in the future.📝 Suggested improvement
class KnowledgeQueueItem(BaseModel): """Read-side mirror of Module B's `knowledge_queue` Postgres row. Per master guide §1.2: C reads these rows and synthesizes the RFC `KnowledgeItem` envelope from them. Not a wire contract; tolerates extra fields so B can extend the row without breaking C. """ + model_config = ConfigDict(extra="ignore") id: str🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/schemas.py` around lines 272 - 290, The KnowledgeQueueItem Pydantic model currently relies on Pydantic v2's default of ignoring extra fields; make this explicit by adding an explicit model config to the KnowledgeQueueItem class (e.g., set model_config = {"extra": "ignore"} or the equivalent Config/ModelConfig pattern used across the codebase) so BaseModel subclass KnowledgeQueueItem clearly documents and enforces tolerant handling of unknown fields from Module B.
99-107: 💤 Low valueConsider aligning the
languagefield default with the JSON schema.The JSON schema (
knowledge-item.jsonLine 43) specifies"language": { "type": "string", "default": "en" }, but the Pydantic model haslanguage: Optional[str] = None. While JSON Schemadefaultvalues are not enforced during validation (they're hints for tooling), this creates a semantic mismatch: the schema documents that language should default to"en", but the model defaults toNone.If the intent is for language to always be
"en"when unspecified, consider:language: str = "en"If
Noneis intentionally allowed (meaning "language unknown"), the current model is correct but consider updating the JSON schema to clarify this.Since round-trip tests are passing (per PR description), this may be intentional or the tests may not validate default value behavior strictly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/schemas.py` around lines 99 - 107, The KnowledgeContent Pydantic model's language field currently allows None but the JSON schema documents a default of "en"; to align them, change the field in the KnowledgeContent class from Optional[str] = None to a non-optional string with default "en" (i.e., make language: str = "en") so the model provides the documented default; if instead None is intended, update the JSON schema to remove or change the default—refer to the KnowledgeContent class and the language field to implement the change.requirements.txt (1)
1-119: ⚖️ Poor tradeoffConsider deduplicating dependency entries.
The file contains duplicate entries that may cause confusion:
compliance-trestle(lines 3, 35)setuptools(lines 12, 33)SQLAlchemy(lines 34, 100)psycopg2-binary(lines 27, 61)playwright(lines 26, 54)scikit_learn/scikit-learn(lines 30, 94 — naming inconsistency)When pip encounters duplicates, it uses the last occurrence, making earlier entries misleading. Consider consolidating these into single entries.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@requirements.txt` around lines 1 - 119, requirements.txt contains duplicate and inconsistent package entries; remove redundant lines and consolidate into single canonical entries for each package (e.g., keep one compliance-trestle, one setuptools, one SQLAlchemy, one psycopg2-binary, one playwright) and normalize scikit-learn to the correct package name (scikit-learn) so pip behavior is deterministic; ensure any required version specifiers are preserved when merging and remove exact duplicates (e.g., duplicate PyYAML/version lines) so the final file lists each dependency only once.scripts/build_golden_dataset.py (1)
261-261: ⚡ Quick winRename unused loop variable to follow convention.
The variable
node_idis not used within the loop body. Per Python convention, prefix unused variables with underscore.♻️ Proposed fix
- for node_id, name, section_id, text, cre_concat in rows: + for _node_id, name, section_id, text, cre_concat in rows:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/build_golden_dataset.py` at line 261, The loop binding in the for statement "for node_id, name, section_id, text, cre_concat in rows:" uses an unused variable node_id; rename it to _node_id to follow Python convention for unused variables and avoid linter warnings. Update the for header to "for _node_id, name, section_id, text, cre_concat in rows:" and ensure there are no references to node_id inside the loop (if any, replace them with the intended variable or raise if needed).application/utils/librarian/knowledge_source.py (2)
16-20: ⚡ Quick winSimplify abstract method body.
The
raise NotImplementedErrorin the abstract method body is redundant. The@abstractmethoddecorator already enforces that subclasses must implement this method.♻️ Proposed fix
class KnowledgeSource(ABC): `@abstractmethod` def items(self) -> Iterator[KnowledgeQueueItem]: """Yield knowledge_queue rows awaiting classification.""" - raise NotImplementedError + ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/knowledge_source.py` around lines 16 - 20, Remove the redundant raise in the abstract method: in the KnowledgeSource class remove the "raise NotImplementedError" from the items method (keep the `@abstractmethod` decorator and the docstring or replace the body with a simple pass) so subclasses are still required to implement KnowledgeSource.items without the unnecessary explicit exception.
9-9: ⚡ Quick winRemove unused import.
The
jsonmodule is imported but never used. Line 34 usesKnowledgeQueueItem.model_validate_json, which is a Pydantic method that handles JSON parsing internally.♻️ Proposed fix
-import json from abc import ABC, abstractmethod from typing import Iterator🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/knowledge_source.py` at line 9, Remove the unused import of the json module at the top of knowledge_source.py; since the code uses KnowledgeQueueItem.model_validate_json (a Pydantic method) for JSON parsing, delete the "import json" line to avoid an unused-import warning and keep imports minimal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@requirements.txt`:
- Line 66: Update the pydantic version range in requirements.txt to exclude
vulnerable 2.x releases by changing the spec for "pydantic" from
"pydantic>=2,<3" to "pydantic>=2.4.0,<3" so that installations will use the
first patched 2.4.0+ release; locate the "pydantic>=2,<3" entry and replace it
with "pydantic>=2.4.0,<3".
---
Nitpick comments:
In `@application/utils/librarian/knowledge_source.py`:
- Around line 16-20: Remove the redundant raise in the abstract method: in the
KnowledgeSource class remove the "raise NotImplementedError" from the items
method (keep the `@abstractmethod` decorator and the docstring or replace the body
with a simple pass) so subclasses are still required to implement
KnowledgeSource.items without the unnecessary explicit exception.
- Line 9: Remove the unused import of the json module at the top of
knowledge_source.py; since the code uses KnowledgeQueueItem.model_validate_json
(a Pydantic method) for JSON parsing, delete the "import json" line to avoid an
unused-import warning and keep imports minimal.
In `@application/utils/librarian/schemas.py`:
- Around line 272-290: The KnowledgeQueueItem Pydantic model currently relies on
Pydantic v2's default of ignoring extra fields; make this explicit by adding an
explicit model config to the KnowledgeQueueItem class (e.g., set model_config =
{"extra": "ignore"} or the equivalent Config/ModelConfig pattern used across the
codebase) so BaseModel subclass KnowledgeQueueItem clearly documents and
enforces tolerant handling of unknown fields from Module B.
- Around line 99-107: The KnowledgeContent Pydantic model's language field
currently allows None but the JSON schema documents a default of "en"; to align
them, change the field in the KnowledgeContent class from Optional[str] = None
to a non-optional string with default "en" (i.e., make language: str = "en") so
the model provides the documented default; if instead None is intended, update
the JSON schema to remove or change the default—refer to the KnowledgeContent
class and the language field to implement the change.
In `@requirements.txt`:
- Around line 1-119: requirements.txt contains duplicate and inconsistent
package entries; remove redundant lines and consolidate into single canonical
entries for each package (e.g., keep one compliance-trestle, one setuptools, one
SQLAlchemy, one psycopg2-binary, one playwright) and normalize scikit-learn to
the correct package name (scikit-learn) so pip behavior is deterministic; ensure
any required version specifiers are preserved when merging and remove exact
duplicates (e.g., duplicate PyYAML/version lines) so the final file lists each
dependency only once.
In `@scripts/build_golden_dataset.py`:
- Line 261: The loop binding in the for statement "for node_id, name,
section_id, text, cre_concat in rows:" uses an unused variable node_id; rename
it to _node_id to follow Python convention for unused variables and avoid linter
warnings. Update the for header to "for _node_id, name, section_id, text,
cre_concat in rows:" and ensure there are no references to node_id inside the
loop (if any, replace them with the intended variable or raise if needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: c707f462-c16f-4413-acca-0aa2f5c45f32
📒 Files selected for processing (28)
application/tests/librarian/__init__.pyapplication/tests/librarian/config_loader_test.pyapplication/tests/librarian/dataset_test.pyapplication/tests/librarian/explicit_link_resolver_test.pyapplication/tests/librarian/fixtures/golden_dataset.jsonapplication/tests/librarian/fixtures/golden_dataset.schema.jsonapplication/tests/librarian/fixtures/sample_knowledge_queue.jsonlapplication/tests/librarian/hub_firewall_test.pyapplication/tests/librarian/schemas_test.pyapplication/tests/librarian/scoring_test.pyapplication/tests/librarian/section_validator_test.pyapplication/utils/librarian/__init__.pyapplication/utils/librarian/_rfc_schemas/knowledge-item.jsonapplication/utils/librarian/_rfc_schemas/link-proposal.jsonapplication/utils/librarian/_rfc_schemas/locator.jsonapplication/utils/librarian/_rfc_schemas/proposed-link.jsonapplication/utils/librarian/_rfc_schemas/review-item.jsonapplication/utils/librarian/_rfc_schemas/source-ref.jsonapplication/utils/librarian/config_loader.pyapplication/utils/librarian/explicit_link_resolver.pyapplication/utils/librarian/hub_firewall.pyapplication/utils/librarian/knowledge_source.pyapplication/utils/librarian/schemas.pyapplication/utils/librarian/scoring.pyapplication/utils/librarian/section_validator.pyrequirements.txtscripts/build_golden_dataset.pyscripts/evaluate_librarian.py
| pycodestyle | ||
| pycparser | ||
| pydantic | ||
| pydantic>=2,<3 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check for Pydantic security advisories and latest v2 version
# Check PyPI for latest Pydantic v2 version
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'
# Check GitHub security advisories for Pydantic
gh api graphql -f query='
{
securityVulnerabilities(first: 10, ecosystem: PIP, package: "pydantic") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'Repository: OWASP/OpenCRE
Length of output: 1343
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get latest Pydantic version (for context)
echo "Latest Pydantic version on PyPI:"
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'
echo
echo "GitHub advisories (more results), filtered to Pydantic 2.x ranges:"
gh api graphql -f query='
{
securityVulnerabilities(first: 100, ecosystem: PIP, package: "pydantic") {
nodes {
advisory { summary severity publishedAt }
vulnerableVersionRange
firstPatchedVersion { identifier }
}
}
}'
| jq -r '
.data.securityVulnerabilities.nodes[]
| select((.vulnerableVersionRange|tostring) | test("2\\."))
| "\nSummary: \(.advisory.summary)\nSeverity: \(.advisory.severity)\nPublishedAt: \(.advisory.publishedAt)\nVulnerableRange: \(.vulnerableVersionRange)\nFirstPatchedVersion: \(.firstPatchedVersion.identifier)"
'Repository: OWASP/OpenCRE
Length of output: 1525
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Latest Pydantic version on PyPI:"
curl -s https://pypi.org/pypi/pydantic/json | jq -r '.info.version'
echo
echo "GitHub advisories (up to 100), filtered to entries whose vulnerableVersionRange mentions '2.':"
gh api graphql -f query='
{
securityVulnerabilities(first: 100, ecosystem: PIP, package: "pydantic") {
nodes {
advisory { summary severity publishedAt }
vulnerableVersionRange
firstPatchedVersion { identifier }
}
}
}'
| jq -r '
.data.securityVulnerabilities.nodes[]
| select((.vulnerableVersionRange|tostring) | test("2\\.";"i"))
| [
.advisory.severity,
.advisory.publishedAt,
.advisory.summary,
("vulnerableVersionRange=" + (.vulnerableVersionRange|tostring)),
("firstPatchedVersion=" + .firstPatchedVersion.identifier)
]
| `@tsv`
'Repository: OWASP/OpenCRE
Length of output: 1553
Update pydantic range to avoid known v2 ReDoS vulnerability (line 66 in requirements.txt)
- GitHub advisory reports “Pydantic regular expression denial of service” affects
pydantic >= 2.0.0, < 2.4.0(first patched in2.4.0). pydantic>=2,<3still allows those vulnerable versions; usepydantic>=2.4.0,<3.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@requirements.txt` at line 66, Update the pydantic version range in
requirements.txt to exclude vulnerable 2.x releases by changing the spec for
"pydantic" from "pydantic>=2,<3" to "pydantic>=2.4.0,<3" so that installations
will use the first patched 2.4.0+ release; locate the "pydantic>=2,<3" entry and
replace it with "pydantic>=2.4.0,<3".
…inkResolver The C.0 deterministic input boundary, per the proposal's W2/W3 'data preparation layer' (named validator, not normalizer: the RFC assigns text normalization to Module A; C validates and adapts, never transforms text). - section_validator.py: typed-error validation of both upstream shapes (B's knowledge_queue row + RFC KnowledgeItem envelope) into an internal Section; synthesizes RFC identity fields (chunk_id/artifact_id/source/ locator) from B's reduced row; strips volatile audit metadata. - explicit_link_resolver.py: deterministic ddd-ddd fast path, no ML. Fail-safe: only a single known reference auto-links; unknown or conflicting references route to review. - evaluate_librarian.py: harness now runs every golden row through C.0, prints per-slice validation pass rate, and gates the explicit slice at 100% resolver correctness (exit 1 on regression). Gate: PASS 5/5. - Table-driven tests for every rejection class and resolver outcome. mypy --strict clean, black clean, 66 tests green.
b69a735 to
b4a6aa2
Compare
week_2: Module C (The Librarian) — C.0 deterministic input boundary: SectionValidator + ExplicitLinkResolver
Overview
This is the Week 2 deliverable for Module C (The Librarian): C.0, the deterministic input boundary. Before any retrieval or ML, every incoming chunk passes two deterministic stages:
Sectionthe pipeline consumes, and rejects bad input with typed errors.ddd-ddd)? If exactly one known id is cited, link it directly with no ML. Unknown or conflicting references never auto-link — they route to human review (fail-safe).A naming note for reviewers: the plan documents called this component "SectionNormalizer", but the RFC (#734) assigns text normalization to Module A (harvest + normalize + chunk), with Module B's sanitizer on top. By the time text reaches Module C it is contractually clean, and re-cleaning it here would silently drift C from what A hashed and B classified. This component validates and adapts — it never transforms text — so it is named SectionValidator.
Scope: 4 new files + 1 updated (the eval harness). No frontend, no migrations, no DB access, no behaviour change to OpenCRE proper.
What changed
section_validator.pyknowledge_queuerow and the full RFCKnowledgeItemenvelope — into a frozen internalSection. Synthesizes the RFC identity fields from B's row (chunk_id = chk:{repo}@{sha}:{path},artifact_id = art:{repo}:{path},source,locator). Strips volatile audit metadata (llm_reasoning, filter stages, run timestamps) so downstream stages can never key decisions on it. Rejections are typed (MalformedKnowledgeItemError,EmptyTextError,UnsupportedLanguageError,NotKnowledgeError); raw PydanticValidationErrornever escapes the boundary.explicit_link_resolver.py\b\d{3}-\d{3}\b) extraction + resolution against an injected set of known CRE ids. Outcomes:resolved(single known id → auto-link),no_reference(continue to the semantic path, W3+),unknown_reference/conflicting_references(→ review, with the known ids preserved as suggestions). The known-id set is injected so this module stays dependency-free: the harness seeds it from the golden dataset today; the DB-backedcre.external_idregistry arrives with the retriever (W3).evaluate_librarian.pypredict()makes its first real predictions (explicit path only; semantic path still stubbed).section_validator_test.py,explicit_link_resolver_test.pyen-GBaccepted,frrejected), and every resolver outcome including pattern boundaries (027-5555andCVE-2024-1234-567must not match). One test asserts the boundary never leaks a raw Pydantic error.How the pieces connect
flowchart TB subgraph UPSTREAM["Upstream shapes (Week 1 contracts)"] row["KnowledgeQueueItem<br/>(Module B's reduced row)"] ki["KnowledgeItem<br/>(RFC envelope)"] end subgraph C0["C.0 — input boundary (this PR)"] val["section_validator.py<br/>validate · adapt · synthesize identity"] sec["Section<br/>(internal, frozen)"] res["explicit_link_resolver.py<br/>regex ddd-ddd, no ML"] end subgraph OUT["Outcomes"] link["resolved →<br/>deterministic link"] sem["no_reference →<br/>semantic path (W3+)"] rev["unknown / conflicting →<br/>human review"] err["typed errors:<br/>Malformed · EmptyText ·<br/>UnsupportedLanguage · NotKnowledge"] end subgraph HARNESS["Eval harness (updated)"] eval["evaluate_librarian.py<br/>pass rate per slice ·<br/>explicit gate 100% (exit 1 on fail)"] end row --> val ki --> val val --> sec val -. reject .-> err sec --> res res --> link res --> sem res --> rev sec --> eval res --> evalResults
mypy --strictclean on both new modules (repo flags)blackcleanWhat is intentionally not here
Retriever, embeddings, cross-encoder, SafetyGuard, decision engine, CLI wiring, and DB models/migrations are later weeks (W3–W8 per the proposal). The semantic path in the harness still predicts nothing — Week 3 plugs the retriever into the same
predict()seam.How to verify locally
🤖 Generated with Claude Code