feat: vulnerability scanning within git integration (IN-956) by epipav · Pull Request #3892 · linuxfoundation/crowd.dev

epipav · 2026-03-03T16:44:12Z

Adds automated vulnerability scanning for all git repositories using the Google OSV Scanner SDK. Runs on the first clone batch per repo and persists results directly to the insights database.

Architecture
Go binary wrapped in Python — OSV Scanner is a Go library with no Python bindings. We embed it as an SDK dependency and call it programmatically, following the same subprocess + JSON stdout pattern as the software-value service.

The binary exits with code 0 and communicates errors through the JSON payload, so the Python subprocess machinery never misinterprets a non-zero exit as a crash.

Design decisions
Vulnerability identity: (repo_url, vulnerability_id, package_name, source_path) — same CVE can appear in multiple packages and lockfiles

ID classification: primary ID + aliases sorted into cve_ids, ghsa_ids, other_ids arrays by prefix

Severity: derived from CVSS numeric score using standard thresholds (CRITICAL/HIGH/MEDIUM/LOW)

Status tracking: OPEN (no fix known), FIX_AVAILABLE (patch exists), RESOLVED (no longer detected)

Database strategy: upsert + mark-resolved (not delete + insert) — preserves full history of when vulnerabilities were first detected, last seen, and resolved

Transitive scanning: resolves full dependency graph by default; falls back to direct-only on timeout (3min) for first scans; subsequent scans reuse the previous mode

OOM handling: on any scanner crash, marks stale running scan records as failure; on OOM specifically (SIGKILL), retries with --no-transitive to skip the most memory-intensive part

Scan tracking: every invocation creates a vulnerability_scans row (running → success/failure/no_packages_found) with duration, counts, and errors

Note

Medium Risk
Adds a new repository-processing step that executes a new Go scanner binary and writes/upserts vulnerability data into the insights database, which could impact worker throughput and DB load if scanning is slow or failure handling is imperfect.

Overview
Adds automated vulnerability scanning to the git integration pipeline by introducing a new VulnerabilityScannerService that runs on the first clone batch and records execution metrics.

Implements a new Go vulnerability-scanner binary (built and shipped via Dockerfile.git_integration) that runs OSV scans, tracks scan runs in vulnerability_scans, and upserts findings into vulnerabilities using a resolve+upsert strategy with transitive-scanning mode selection and a 3-minute timeout.

Improves subprocess handling by streaming stderr logs and propagating process exit codes via CommandExecutionError, enabling the Python wrapper to mark stale running scans as failed and retry on OOM/timeout; updates env templates with INSIGHTS_DB_* settings needed for scanner DB access.

^{Written by Cursor Bugbot for commit 689d394. This will update automatically on new commits. Configure here.}

github-actions · 2026-03-03T16:44:21Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

github-actions · 2026-03-03T16:44:35Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

…ransitive deps

github-actions · 2026-03-05T14:51:42Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

github-actions · 2026-03-05T14:52:17Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

epipav · 2026-03-05T15:07:13Z

related: linuxfoundation/insights#1725

github-actions · 2026-03-05T15:09:21Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

github-actions · 2026-03-05T15:09:39Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

services/apps/git_integration/src/crowdgit/services/utils.py

...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

...es/apps/git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner.go

github-actions · 2026-03-05T15:17:37Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

github-actions · 2026-03-05T15:18:42Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

github-actions · 2026-03-05T15:32:03Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

cursor · 2026-03-05T15:33:14Z

services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/main.go

+		errorCode := getErrorCodeFromConfigError(err)
+		errorMessage := err.Error()
+		return StandardResponse{Status: StatusFailure, ErrorCode: &errorCode, ErrorMessage: &errorMessage}
+	}


Config error checked after config is already used

Medium Severity

getConfig returns (config, err), but the code iterates over os.Args and sets config.NoTransitive on lines 28-31 before checking err on line 33. If getConfig fails (e.g., target path doesn't exist), the partially-initialized config is modified before the error is handled. The error check needs to happen immediately after the getConfig call, before the --no-transitive flag parsing loop.

cursor · 2026-03-05T15:33:14Z

services/apps/git_integration/src/crowdgit/services/utils.py

+            coro = _run_with_stderr_logging()
+            stdout = await (asyncio.wait_for(coro, timeout=timeout) if timeout else coro)
+            stdout_text = _safe_decode(stdout).strip() if stdout else ""
+            stderr_text = "\n".join(stderr_lines)


stdin_input silently dropped when stderr_logger is provided

Medium Severity

When stderr_logger is provided, the _run_with_stderr_logging path reads stdout and streams stderr but never writes stdin_input to the process. The process.communicate(input=stdin_input) call only exists in the else branch. Any caller combining stderr_logger with input_text will silently lose the stdin data. While the vulnerability scanner doesn't currently use input_text, this is a general-purpose utility function shared across services.

cursor · 2026-03-05T17:50:44Z

...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

+                execution_status = ExecutionStatus.FAILURE
+                error_code = ErrorCode.UNKNOWN.value
+                error_message = repr(e)
+                output = None


Unhandled exception in stale scan cleanup bypasses retry

Medium Severity

_mark_stale_scans_failed is called inside the except block without its own error handling. If it throws (e.g., DB connection failure or missing env var via os.environ[]), the exception propagates out of the except block entirely, bypassing the OOM/timeout retry logic and preventing the service execution record from being saved at the end of run().

mbani01 · 2026-03-06T14:05:50Z

...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

+        conn = await asyncpg.connect(
+            user=os.environ["INSIGHTS_DB_USERNAME"],
+            password=os.environ["INSIGHTS_DB_PASSWORD"],
+            database=os.environ["INSIGHTS_DB_DATABASE"],
+            host=os.environ["INSIGHTS_DB_WRITE_HOST"],
+            port=int(os.environ.get("INSIGHTS_DB_PORT", "5432")),
+        )
+        try:


In case of connection errors, it won't be caught, shouldn't we include it in the try block ?

Suggested change

conn = await asyncpg.connect(

user=os.environ["INSIGHTS_DB_USERNAME"],

password=os.environ["INSIGHTS_DB_PASSWORD"],

database=os.environ["INSIGHTS_DB_DATABASE"],

host=os.environ["INSIGHTS_DB_WRITE_HOST"],

port=int(os.environ.get("INSIGHTS_DB_PORT", "5432")),

)

try:

try:

conn = await asyncpg.connect(

user=os.environ["INSIGHTS_DB_USERNAME"],

password=os.environ["INSIGHTS_DB_PASSWORD"],

database=os.environ["INSIGHTS_DB_DATABASE"],

host=os.environ["INSIGHTS_DB_WRITE_HOST"],

port=int(os.environ.get("INSIGHTS_DB_PORT", "5432")),

)

services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/go.mod

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

cursor · 2026-03-18T10:00:30Z

...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

+            error_message=error_message,
+            execution_time_sec=execution_time,
+        )
+        await save_service_execution(service_execution)


Service execution not recorded when exceptions escape run

Medium Severity

Unlike SoftwareValueService which places save_service_execution in a finally block, here it sits at the end of run() with no finally protection. Multiple failure paths — _mark_stale_scans_failed throwing, json.loads raising on malformed output — cause the method to exit without recording the service execution, silently losing operational tracking data.

Copilot

Pull request overview

Adds an automated vulnerability-scanning step to the git integration worker, implemented as a Go OSV-Scanner-based binary invoked from Python, with results persisted to the insights DB.

Changes:

Run a new VulnerabilityScannerService on the first clone batch and record an execution via OperationType.VULNERABILITY_SCAN.
Introduce a new Go-based vulnerability-scanner module/binary (OSV Scanner SDK) plus Docker build plumbing.
Extend run_shell_command to propagate return codes and optionally stream stderr.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
services/apps/git_integration/src/crowdgit/worker/repository_worker.py	Invokes vulnerability scan on first clone batch.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py	Python wrapper for scanner subprocess + execution tracking + stale scan cleanup.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner.go	Core Go scanning logic + normalization + DB persistence.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/types.go	Shared response / DB model types for scanner.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/main.go	CLI entrypoint + JSON stdout formatting.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/go.mod	Go module definition for scanner.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/go.sum	Go dependency lockfile for scanner.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/db.go	Insights DB connection + upsert/resolve strategy + scan tracking.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/config.go	Reads target path + insights DB env configuration.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/README.md	Design/behavior documentation for scanner component.
services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/.gitignore	Ignores local Go build artifacts.
services/apps/git_integration/src/crowdgit/services/utils.py	Adds stderr streaming + returncode propagation in `run_shell_command`.
services/apps/git_integration/src/crowdgit/services/init.py	Exposes `VulnerabilityScannerService`.
services/apps/git_integration/src/crowdgit/server.py	Wires scanner service into app lifecycle / worker init.
services/apps/git_integration/src/crowdgit/errors.py	Adds `returncode` field to `CommandExecutionError`.
services/apps/git_integration/src/crowdgit/enums.py	Adds `OperationType.VULNERABILITY_SCAN`.
scripts/services/docker/Dockerfile.git_integration	Builds + ships `vulnerability-scanner` binary in the image.
backend/.env.dist.local	Adds local insights DB env vars.
backend/.env.dist.composed	Adds composed insights DB host env var.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/README.md

+The rest of the git integration is Python, but the OSV Scanner is a Go library with no Python bindings. Rather than shelling out to the `osv-scanner` CLI (which is fragile and adds an extra process layer), we embed the scanner as a Go SDK dependency and call it programmatically. This gives us full control over scan parameters, result access, and error handling.
+
+The binary follows the same **subprocess + JSON stdout** pattern used by the `software-value` service. The Python wrapper calls it, reads the JSON response from stdout, and treats it as a black box. The binary always exits with code 0 — errors are communicated through the JSON payload — so the Python subprocess machinery never misinterprets a non-zero exit as a crash.
+


services/apps/git_integration/src/crowdgit/services/utils.py

+        if stderr_logger:
+            stderr_lines: list[str] = []
+
+            async def _run_with_stderr_logging() -> bytes:
+                async def _stream() -> None:
+                    async for raw_line in process.stderr:
+                        line = _safe_decode(raw_line).rstrip()
+                        if line:
+                            stderr_logger.info(line)
+                            stderr_lines.append(line)
+
+                stdout, _ = await asyncio.gather(process.stdout.read(), _stream())
+                await process.wait()
+                return stdout
+
+            coro = _run_with_stderr_logging()
+            stdout = await (asyncio.wait_for(coro, timeout=timeout) if timeout else coro)
+            stdout_text = _safe_decode(stdout).strip() if stdout else ""
+            stderr_text = "\n".join(stderr_lines)


services/apps/git_integration/src/crowdgit/services/utils.py

        input_text: Text (str) or bytes to send to stdin (will automatically append newline if not present)
+        stderr_logger: If provided, stderr lines are streamed in real-time and logged via this callable


...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

+            json_output = json.loads(output)
+            status = json_output.get("status")
+
+            if status == "success":
+                execution_status = ExecutionStatus.SUCCESS
+            elif status == "no_packages_found":
+                execution_status = ExecutionStatus.SUCCESS
+                self.logger.info(f"No scannable packages found for {repo_url}")
+            else:
+                execution_status = ExecutionStatus.FAILURE
+                error_code = json_output.get("error_code")
+                error_message = json_output.get("error_message")
+                self.logger.error(
+                    f"Vulnerability scan failed: {error_message} (code: {error_code})"
+                )
+


...git_integration/src/crowdgit/services/vulnerability_scanner/vulnerability_scanner_service.py

+        conn = await asyncpg.connect(
+            user=os.environ["INSIGHTS_DB_USERNAME"],
+            password=os.environ["INSIGHTS_DB_PASSWORD"],
+            database=os.environ["INSIGHTS_DB_DATABASE"],
+            host=os.environ["INSIGHTS_DB_WRITE_HOST"],
+            port=int(os.environ.get("INSIGHTS_DB_PORT", "5432")),


services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/config.go

+	config.InsightsDatabase.User = os.Getenv("INSIGHTS_DB_USERNAME")
+	config.InsightsDatabase.Password = os.Getenv("INSIGHTS_DB_PASSWORD")
+	config.InsightsDatabase.DBName = os.Getenv("INSIGHTS_DB_DATABASE")
+	config.InsightsDatabase.Host = os.Getenv("INSIGHTS_DB_WRITE_HOST")
+	if portStr := os.Getenv("INSIGHTS_DB_PORT"); portStr != "" {
+		if port, err := strconv.Atoi(portStr); err == nil {
+			config.InsightsDatabase.Port = port
+		}
+	}
+	config.InsightsDatabase.SSLMode = os.Getenv("INSIGHTS_DB_SSLMODE")
+	if poolMaxStr := os.Getenv("INSIGHTS_DB_POOL_MAX"); poolMaxStr != "" {
+		if poolMax, err := strconv.Atoi(poolMaxStr); err == nil {
+			config.InsightsDatabase.PoolMax = poolMax
+		}
+	}


feat: vulnerability scanning within git integration

803e49e

epipav marked this pull request as draft March 3, 2026 16:44

epipav added 9 commits March 3, 2026 17:46

chore: formatting

28a077b

Merge branch 'main' into feat/git-osv-vulnerabilities

bcf68d3

feat: improve logging, better statuses and transitive dep checks

39f8935

chore: remove unnecessary normalization

ed3308a

chore: update docs

a6c7fd4

chore: also return the exit code for debugging in run_shell_command

fda4ee8

fix: better error handling and visibility. OOM errors retry without t…

670e779

…ransitive deps

chore: formatting

1458167

refactor: simpler exception handling and docs

572449e

epipav marked this pull request as ready for review March 5, 2026 14:48

epipav requested a review from mbani01 March 5, 2026 14:52

chore: formatting

91403d5

cursor bot reviewed Mar 5, 2026

View reviewed changes

epipav added 2 commits March 5, 2026 16:29

refactor: prevent concurrent scans on retries

6c57c4a

fix: wait for process to exit before checking returncode

7ec4c04

epipav changed the title ~~feat: vulnerability scanning within git integration~~ feat: vulnerability scanning within git integration (IN-956) Mar 5, 2026

cursor bot reviewed Mar 5, 2026

View reviewed changes

fix: add raw cvss score to vulnerabilities

b4e9df5

cursor bot reviewed Mar 5, 2026

View reviewed changes

mbani01 reviewed Mar 6, 2026

View reviewed changes

mbani01 approved these changes Mar 6, 2026

View reviewed changes

github-license-compliance bot found potential problems Mar 17, 2026

View reviewed changes

Merge branch 'main' into feat/git-osv-vulnerabilities

689d394

Copilot AI review requested due to automatic review settings March 18, 2026 09:56

Copilot started reviewing on behalf of epipav March 18, 2026 09:57 View session

cursor bot reviewed Mar 18, 2026

View reviewed changes

Copilot AI reviewed Mar 18, 2026

View reviewed changes

		The rest of the git integration is Python, but the OSV Scanner is a Go library with no Python bindings. Rather than shelling out to the `osv-scanner` CLI (which is fragile and adds an extra process layer), we embed the scanner as a Go SDK dependency and call it programmatically. This gives us full control over scan parameters, result access, and error handling.

		The binary follows the same subprocess + JSON stdout pattern used by the `software-value` service. The Python wrapper calls it, reads the JSON response from stdout, and treats it as a black box. The binary always exits with code 0 — errors are communicated through the JSON payload — so the Python subprocess machinery never misinterprets a non-zero exit as a crash.

		input_text: Text (str) or bytes to send to stdin (will automatically append newline if not present)
		stderr_logger: If provided, stderr lines are streamed in real-time and logged via this callable

Conversation

epipav commented Mar 3, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

epipav commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

cursor bot Mar 5, 2026

Choose a reason for hiding this comment

Config error checked after config is already used

Uh oh!

cursor bot Mar 5, 2026

Choose a reason for hiding this comment

stdin_input silently dropped when stderr_logger is provided

Uh oh!

cursor bot Mar 5, 2026

Choose a reason for hiding this comment

Unhandled exception in stale scan cleanup bypasses retry

Uh oh!

mbani01 Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Mar 18, 2026

Choose a reason for hiding this comment

Service execution not recorded when exceptions escape run

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

epipav commented Mar 3, 2026 •

edited by cursor bot

Loading