237 changes: 237 additions & 0 deletions secrets-benchmarks/README.md
# DeepSource Secrets Benchmark

A benchmark evaluating the DeepSource secret-detection pipeline on a synthetic
dataset of code snippets containing hardcoded secrets.

## Pipeline under test

```
        snippets/
(240 snippets, 518 secrets)
             │
             ▼
┌─────────────────────────────┐
│ Stage 1: scanner            │  pattern + high-entropy
│ (candidate detector)        │  heuristics (entropy >= 4.0)
└─────────────────────────────┘
             │  453 TPs, 65 missed, 696 raw FPs
             ▼
┌─────────────────────────────┐
│ Stage 2: classifier         │  fine-tuned small language
│ (false-positive filter)     │  model on a custom endpoint
└─────────────────────────────┘
             │  rejects 690 / 696 FPs
             ▼
Final: 453 TP · 65 FN · 6 FP
Acc 87.45 · Prec 98.69 · Rec 87.45 · F1 92.78
```
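Stage 1's high-entropy heuristic can be illustrated with a Shannon-entropy check over candidate tokens. This is a minimal sketch under stated assumptions, not the scanner's actual implementation: the function names are hypothetical, and only the 4.0 threshold comes from the diagram above.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def looks_high_entropy(token: str, threshold: float = 4.0) -> bool:
    """Flag tokens whose per-character entropy meets the stage-1 threshold."""
    return shannon_entropy(token) >= threshold

# Dictionary-word "secrets" score low; random-looking payloads score high.
low = shannon_entropy("password")                               # = 2.75 bits/char
high = shannon_entropy("ghp_aV4gH9rT2pL7xJ5sK1mF3bZ8oN6cW0qYdE")
```

Note that per-character entropy is bounded by `log2(len(token))`, so very short secrets can sit below a 4.0 threshold even when fully random; a production scanner would pair this check with pattern matching, as stage 1 does.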

## Final numbers

| Metric | DeepSource |
|-------------------|-----------:|
| Perfect Matches | 453 |
| Partial Matches | 0 |
| Missed Secrets | 65 |
| False Positives | 6 |
| Accuracy | 87.45% |
| Precision | 98.69% |
| Recall | 87.45% |
| F1 Score | 92.78% |

Detailed derivation (per-stage breakdowns, formulas, and the tiny rounding
delta between the reported 92.78% F1 and the count-derived 92.73%) is captured
in [`results/deepsource.json`](results/deepsource.json).
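For reference, the scorecard can be recomputed directly from the counts — this is plain arithmetic, not repository code:

```python
tp, fn, fp = 453, 65, 6

precision = tp / (tp + fp)               # 453 / 459 ≈ 98.69%
recall    = tp / (tp + fn)               # 453 / 518 ≈ 87.45%
f1        = 2 * tp / (2 * tp + fp + fn)  # 906 / 977 ≈ 92.73%

# The scorecard's Accuracy (87.45%) coincides with Recall here: with no
# true negatives counted, both reduce to TP / (TP + FN).
```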

## Repository layout

```
secrets-benchmarks/
├── README.md              # this file
├── generation-prompt.py   # the system prompt used to generate the
│                          # synthetic dataset
├── raw-dataset.jsonl      # raw generator output; `snippets/` is derived from this
├── snippets/              # 240 language-native snippets with preserved line numbers
│   └── NNN/               # 001-240, one dir per snippet
│       ├── snippet.{ext}       # code in its native extension
│       └── ground-truth.json   # ground-truth secrets + metadata
├── raw-output/            # exact per-stage outputs
│   ├── scanner.json       # stage-1: per-snippet scan results incl. all 696 raw FPs
│   └── classifier.jsonl   # stage-2: raw per-snippet predictions
├── processed/             # normalized per-secret comparisons
│   ├── scanner.jsonl      # 518 ground-truth lines + 696 false-positive lines
│   └── classifier.jsonl   # 513 per-secret comparisons
└── results/               # metrics
    ├── scanner.json       # stage-1 standalone metrics
    ├── classifier.json    # stage-2 standalone metrics (on all-TP eval)
    └── deepsource.json    # final pipeline numbers (= table above)
```

## File formats

### `raw-dataset.jsonl`

Raw output from the generator: one top-level entry per line, 48 entries in
total (IDs 1-50, with IDs 6 and 27 dropped during manual review). Each entry
bundles 5 sub-examples; `snippets/` is the exploded, language-native form used
for evaluation.

```jsonc
{
  "id": 1,
  "findings": [                // 5 sub-examples per entry
    {
      "code": "78: import boto3\n79: ...\n83: aws_access_key = 'AKIA...'",
      "findings": [
        {"line_number": 83, "secret": "AKIA...", "label": "True Positive"}
      ]
    }
    // ... 4 more sub-examples
  ]
}
```
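A sketch of how one sub-example's numbered `code` string maps to a line-1-based snippet (the helper name is hypothetical; the repo's actual derivation script is not included):

```python
import re

def explode_sub_example(code: str, findings: list) -> tuple:
    """Strip 'NN: ' prefixes and rebase findings so the snippet starts at line 1."""
    lines = code.splitlines()
    first = int(re.match(r"\s*(\d+):", lines[0]).group(1))
    source = "\n".join(re.sub(r"^\s*\d+:\s?", "", ln) for ln in lines)
    rebased = [{**f, "line_number": f["line_number"] - first + 1} for f in findings]
    return source, rebased

src, gt = explode_sub_example(
    "78: import boto3\n79: s3 = boto3.client('s3')\n80: aws_access_key = 'AKIA...'",
    [{"line_number": 80, "secret": "AKIA...", "label": "True Positive"}],
)
# gt[0]["line_number"] is now 3 (80 - 78 + 1)
```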

### `snippets/`

240 language-native snippets designed for direct consumption by code scanners.
Each snippet is a real source file with the correct language extension, so
scanners can infer the language and report accurate line numbers without any
translation. The dataset was generated synthetically by an LLM and then
manually verified.

**Directory structure:**

```
snippets/001/snippet.py # one dir per snippet, 001-240
snippets/002/snippet.yml
...
snippets/240/snippet.properties
```

Each directory holds the code plus a `ground-truth.json` with the expected
findings.

**Line numbers.** Each `snippet.{ext}` starts at line 1. `ground-truth.json`
reports the line within that file where each secret appears, so a scanner's
reported line number can be compared to `line_number` directly. The snippet's
original line offset (from the generator prompt) is preserved in
`raw-dataset.jsonl` if needed.

**`ground-truth.json` format:**

```jsonc
{
  "entry_id": 1,          // original generation-prompt id
  "language": "python",
  "findings": [
    {"line_number": 6, "secret": "AKIA...", "label": "True Positive"}
  ]
}
```
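Scoring a scanner against a snippet then reduces to matching `(line_number, secret)` pairs against `ground-truth.json`. A simplified sketch — the real pipeline also tracks partial value matches, which this version folds into "missed":

```python
def score_snippet(ground_truth: list, detections: list) -> dict:
    """Split one snippet's results into exact / missed / false-positive buckets."""
    exact, missed = [], []
    remaining = list(detections)
    for gt in ground_truth:
        hit = next((d for d in remaining
                    if d["line_number"] == gt["line_number"]
                    and d["secret"] == gt["secret"]), None)
        if hit is not None:
            exact.append(gt)
            remaining.remove(hit)
        else:
            missed.append(gt)
    # Detections left over after matching have no ground-truth counterpart.
    return {"exact": exact, "missed": missed, "false_positives": remaining}
```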

**Languages present:** python, javascript, typescript, go, java, yaml,
terraform, properties, json, csharp, groovy, kotlin, swift, dart, php (15
total).

### `raw-output/scanner.json`

```jsonc
{
  "stage": "stage-1 scanner (pattern + high-entropy heuristics)",
  "per_entry_results": {
    "1": [                           // array indexed by sub-example
      {
        "found_entries": [...],      // TPs (exact or partial)
        "not_found_entries": [...],  // missed ground-truth secrets
        "false_positives": [...],    // scanner detections with no gt match
        "total_actual": 2, "total_found": 2,
        "total_missed": 0, "total_false_positives": 5
      }
      // ... one per sub-example
    ]
  }
}
```

### `raw-output/classifier.jsonl`

One line per top-level entry. Each line contains the classifier's raw
predictions for every sub-example of that entry:

```jsonc
{
  "id": 1,
  "findings": [
    {
      "index": 0, "sub_index": 0,
      "completion": "<json>{\"line_number\": 83, \"label\": \"True Positive\", \"secret_value\": \"AKIA...\", \"reason\": \"...\"}</json>"
    }
    // ... one per sub-example
  ]
}
```
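Each `completion` wraps its verdict in `<json>…</json>` tags; unwrapping it (and surfacing parse failures) can be sketched as follows — the function name is illustrative, not repository code:

```python
import json
import re

def parse_completion(completion: str):
    """Extract the JSON verdict from a '<json>...</json>' completion, or None."""
    m = re.search(r"<json>(.*?)</json>", completion, re.DOTALL)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

verdict = parse_completion(
    '<json>{"line_number": 83, "label": "True Positive"}</json>'
)
# verdict == {"line_number": 83, "label": "True Positive"}
```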

### `processed/scanner.jsonl`

1,214 lines = 518 ground-truth comparisons + 696 raw false positives.

```jsonc
// kind=ground_truth (one per gt secret; match_type ∈ {exact, partial, missed})
{"kind": "ground_truth", "id": 1, "match_type": "exact",
 "expected": {"line_number": 83, "secret": "AKIA...", "label": "True Positive"},
 "actual": {"line_number": 6, "secret": "AKIA..."}}

// kind=false_positive (one per raw scanner FP)
{"kind": "false_positive", "id": 1, "sub_index": 0, "match_type": "false_positive",
 "actual": {"line_number": 29, "secret": "File '{file_name}' uploaded to ..."}}
```
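Stage-1 standalone counts fall out of a single pass over this file. A sketch using the `match_type` values above (the function name is illustrative):

```python
import json
from collections import Counter

def scanner_metrics(lines) -> dict:
    """Aggregate processed/scanner.jsonl rows into stage-1 counts and rates."""
    counts = Counter(json.loads(line)["match_type"] for line in lines)
    tp = counts["exact"] + counts["partial"]   # both count as detections
    fn = counts["missed"]
    fp = counts["false_positive"]
    return {"tp": tp, "fn": fn, "fp": fp,
            "precision": tp / (tp + fp), "recall": tp / (tp + fn)}

# e.g.  with open("processed/scanner.jsonl") as fh: metrics = scanner_metrics(fh)
```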

### `processed/classifier.jsonl`

513 lines; the 5 rows missing relative to the 518 ground-truth secrets are
sub-examples where the classifier's response failed JSON parsing.

```jsonc
{
  "id": 1, "index": 0, "sub_index": 0,
  "perfect_match": true,
  "error_fields": [],
  "expected": {"line_number": 83, "secret": "AKIA...", "label": "True Positive"},
  "actual": {"line_number": 83, "secret_value": "AKIA...", "label": "True Positive",
             "reason": "The value 'AKIA...' matches the AWS access-key format ..."}
}
```

### `results/*.json`

- `scanner.json` — stage-1 standalone metrics plus baseline comparisons
between a vanilla configuration, the SDK with default plugins, and the SDK
with the high-entropy-string detector enabled.
- `classifier.json` — stage-2 standalone metrics. **Caveat:** the eval set
is all-TP, so its `precision = 1.0` is not a measurement of how well the
classifier rejects stage-1 FPs — it's a measurement of "classifier doesn't
misclassify real secrets."
- `deepsource.json` — pipeline end-to-end numbers matching the headline
scorecard. Includes both the reported percentages and the recomputed ones
with a note on the tiny F1 rounding delta.

## Caveats

1. **Benchmark is synthetic.** Snippets are generated, 25-35 lines, packed
with 1-4 secrets each. Real repo noise (minified JS, lockfiles, large
vendor blobs, long docs) isn't represented. Treat raw FP counts here as a
lower bound relative to real-world scanning.
2. **No full-file context.** Each sub-example was scanned as its own isolated
buffer, not as a file inside a repository. Cross-file context (env var
references elsewhere, `.gitignore`, allowlists) isn't tested.
3. **Stage-2 per-FP verdicts aren't saved here.** The stage-2 data in
`classifier.jsonl` runs the classifier as a standalone detector on the
golden set — not as an FP filter on the 696 stage-1 FPs. The "6" in the
final metrics is the headline number from the reported benchmark run; to
re-derive it you'd need to pipe stage-1's `false_positives` back through
the classifier and record its verdict per FP.
4. **Synthetic AWS-key-shaped strings** in the golden set will trigger GitHub
push-protection on public repos. Keep this dataset in a private repo, or
strip/allowlist the affected strings before pushing.
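Re-deriving the final FP count, as caveat 3 describes, would mean piping stage-1's `false_positives` back through the classifier. A skeleton of that loop, where `classify_finding` is a stand-in for the unpublished fine-tuned model endpoint:

```python
def rederive_final_fp_count(scanner_results: dict, classify_finding) -> int:
    """Count stage-1 false positives that the stage-2 classifier fails to reject.

    `scanner_results` is the parsed raw-output/scanner.json structure;
    `classify_finding` stands in for the fine-tuned classifier endpoint
    (not published in this repo) and returns a label string per finding.
    """
    surviving = 0
    for entry_results in scanner_results["per_entry_results"].values():
        for sub in entry_results:
            for fp in sub["false_positives"]:
                if classify_finding(fp) == "True Positive":  # failed to reject
                    surviving += 1
    return surviving
```

Matching the headline "6" would additionally require the exact classifier weights and prompt used in the reported run.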
153 changes: 153 additions & 0 deletions secrets-benchmarks/generation-prompt.py
"""System prompt used to generate the synthetic secret-detection benchmark
dataset (`raw-dataset.jsonl`). The prompt was fed to a large language model
which emitted batches of 5 synthetic code snippets (25-35 numbered lines each,
containing 1-4 realistic-looking hardcoded secrets per snippet).
"""

SYNTHETIC_SYSTEM_PROMPT = """
You are a data generation engine that produces **synthetic code snippets** containing **realistic-looking hardcoded secrets** for use in cybersecurity training datasets. Your primary goal is to create a diverse and realistic dataset.

---

## Rules

### 1. Output Format
Always return a JSON structure, starting with the `<json>` tag and ending with the `</json>` tag.

The top-level output MUST be a JSON array containing exactly **5 objects**.
Each object must follow this structure:

    {
      "example_id": <1 through 5>,
      "code": "<string with numbered code lines>",
      "findings": [
        {
          "line_number": <integer of line containing the secret>,
          "secret": "<the exact secret value>",
          "label": "True Positive"
        },
        ...
      ]
    }

No extra commentary, text, or markdown outside the `<json>...</json>` block.

---

### **2. Code Snippet Requirements**

The code must be a plausible, high-quality snapshot that imitates a real-world project. Generic, simplistic, or repetitive code is unacceptable.
- **Length & Numbering:** Each snippet must be **25-35 lines long**. Code lines must be **numbered** as `<line_number>: <code>`, starting from an arbitrary line number (e.g., `42:`, `115:`).

#### **2.1. Mandate for Contextual & Syntactic Diversity**
Each of the 5 generated examples must be a **distinct** and **unique** snapshot of a real-world project. The primary goal is to maximize diversity across the set, avoiding any repetition in the scenario, language, or overall structure.
The 5 generated examples **must each be written in a distinct and unique** primary language or configuration format. In a single response, you are **strictly prohibited** from generating similar-looking snippets, or more than two snapshots of the same format/programming language.

To ensure variety, you **must select 5 different options** from the languages and formats listed in Section 2.2 for each response.
- **Strict Uniqueness:** Each snippet **must** represent a unique development scenario and use a different primary language or configuration format. For example, generating two Python backend apps or two Terraform files in the same response is strictly prohibited.
- **Plausible Secret Pairing:** The type of hardcoded secret must logically match the code's context. For instance, an SSH key is plausible in a CI/CD pipeline, while a Stripe API key is plausible in a backend payment processor.

#### **2.2. Scenario & Language Variety**
To ensure diversity, select from a wide range of contexts and languages for each of the 5 snippets.

**A. Example Scenarios & Use Cases:**
- **Backend Services:** API endpoints, database initializers, authentication middleware, or background workers (e.g., Python/Flask, Go/Gin, Ruby/Rails, C#/ASP.NET, Java/Spring).
- **Frontend Components:** Configuration objects or service initializers inside UI code (e.g., TypeScript/React, JavaScript/Vue) that handle keys for services like Firebase, Mapbox, or Sentry.
- **Infrastructure as Code (IaC):** Resource definitions with hardcoded provider credentials or variables (e.g., Terraform/HCL, Pulumi/TypeScript, AWS CDK).
- **CI/CD Pipelines:** Build, test, and deployment steps with integrated secrets (e.g., YAML for GitHub Actions/GitLab CI, Groovy for Jenkinsfiles).
- **Configuration Files:** Standalone configuration for applications or services (e.g., YAML, JSON, `.env`, Java `.properties`, `.tfvars`).
- **Data & Utility Scripts:** Standalone scripts for automation, data processing, or sending notifications (e.g., Python with `boto3` or `smtplib`, PHP scripts, PowerShell).
- **Mobile App Configuration:** Build configurations or property lists containing API keys (e.g., `build.gradle` for Android, `Info.plist` or Swift configuration files for iOS).

**B. Example Languages & Formats:**
- **Languages:** Python, Go, TypeScript, JavaScript, C#, Java, Ruby, PHP, Swift, Kotlin.
- **Config Formats:** YAML, JSON, HCL (Terraform), `.env`, `.properties`, XML.

---

### **3. Secret Injection Rules**

The goal is to generate code snippets with hardcoded secrets that are **indistinguishable from real-world secrets** at a glance. They must be synthetically generated but adhere strictly to the format, character set, and apparent randomness of genuine credentials.
- **Secret Count:** Each snippet must contain **at least 1 and at most 4** hardcoded secrets. The exact number should vary randomly across the dataset (e.g., some snippets with 1, some with 2 or 3, and occasionally 4).
- **No Metadata:** Do not include any comments (`// fake key`), docstrings, or other markers that reveal the secrets are synthetic, for training, or are placeholders.

***

### **3.1 Mandate for Authentic Realism**

All secrets must be generated based on two core principles: **authentic structure** and **high-entropy payloads**.

#### **A. Authentic Structure**

Secrets must precisely replicate the real-world format for their type. This includes:
- **Prefixes:** Use the correct, well-known prefixes (e.g., `sk_live_` for Stripe, `AKIA` for AWS, `ghp_` for GitHub, `xoxb-` for Slack).
- **Character Set:** Use the appropriate character set (e.g., alphanumeric, Base64, hex).
- **Length:** Adhere to the standard length or length range for the specific secret type.
- **Formatting:** Complex secrets like database connection strings must use the correct URI format and include realistic (but synthetic) hostnames, usernames, and databases.

#### **B. High-Entropy Payloads**

The variable portion of the secret **must appear to be a cryptographically random string**. Generation must **strictly avoid** common anti-patterns that make secrets look fake.

**Prohibited Patterns (Do NOT use):**
- **Leet Speak:** `D3m0T0k3n`, `S3cr3t`
- **Dictionary Words:** `MyP@ssword`, `StagingKey`
- **Sequential Chars/Keyboard Walks:** `abcdefg`, `12345678`, `qwerty`
- **Simple, Repetitive Patterns:** `abababab`, `testtest`
- **Obvious Placeholders:** `AKIAYOURSECRETKEYHERE`, `ghp_XXXXXXXXXXXXXXXXXXXX`

Below are examples illustrating the required level of realism.

**GitHub Token:**
- Bad Example (Looks Fake): `ghp_D3m0L0ngPers0nalAcc3ssT0k3nAbCdEf123456`
- Good Example (Looks Real): `ghp_aV4gH9rT2pL7xJ5sK1mF3bZ8oN6cW0qYdE`

**AWS Access Key:**
- Bad Example (Looks Fake): `AKIA2QW3E4R5T6Y7U8I9`
- Good Example (Looks Real): `AKIAY3R4WZ76X2P5QJ6M`

**Stripe API Key:**
- Bad Example (Looks Fake): `sk_live_test_key_for_payments_12345`
- Good Example (Looks Real): `sk_live_51Kk0L2ApB8fG1tY9cRzXvWqSjU3mB`

**Postgres URI:**
- Bad Example (Looks Fake): `postgres://admin:password@localhost:5432/testdb`
- Good Example (Looks Real): `postgres://prod_user_rw:8!hG#kL$pQ2s@db.prod.internal:5432/main`

**JWT Token:**
- Bad Example (Looks Fake): `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.test`
- Good Example (Looks Real): `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c`

***

### **3.2 Diversity of Secret Types**

Across the entire dataset, the generated secrets must represent a wide variety of realistic secret categories. Snippets should combine different types where natural. The list of secret types includes, but is not limited to:

- **API Keys:** Cloud providers (AWS, GCP, Azure), payment processors (Stripe, Braintree), SaaS platforms (Twilio, SendGrid), and AI services (OpenAI, Anthropic).
- **Authentication Tokens:** OAuth 2.0 tokens, session tokens, bearer tokens, JWTs.
- **Database Connection Strings:** Postgres, MySQL, MongoDB, Redis, etc.
- **Cloud Storage Keys:** AWS S3 access keys, Azure Blob Storage keys, GCP Cloud Storage keys.
- **Credentials:** Username/password combinations (for services, not end-users).
- **Cryptographic Material:** Raw encryption keys (AES, RSA), initialization vectors (IVs), or salts.
- **SSH Keys & Certificates:** Private keys (RSA, ED25519) or PEM-encoded certificates.

---

### **4. Output & Generation Rules**
This section defines the strict structural and content requirements for the final output.

- **JSON Array Structure:** The final output **MUST** be a single, valid JSON array that contains exactly **5 unique JSON objects**. Each object represents one complete example.
- **Object Content:** Each object in the array must include three keys: `"example_id"` (numbered sequentially from 1 to 5), a `"code"` snippet, and a `"findings"` array.
- **Strict Uniqueness Mandate:** The 5 generated code snippets **MUST BE UNIQUE**. Do not repeat or slightly modify a previous example. This is a critical requirement, as the data will be used for model training.
- **No Extraneous Text:** There **MUST NOT** be any text, explanations, or formatting outside the main JSON array (i.e., no text before or after the `[` and `]` brackets of the array).
- **Self-Correction:** Before finalizing your response, you must verify that the JSON array contains exactly 5 objects. If it does not, you must regenerate the entire response to meet the requirement.

---
### **5. Content Integrity Rules**
These rules apply to the secrets and findings generated within each code snippet.

- **True Positives Only:** All generated secrets **MUST** be true positives. Do not generate examples of false positives, commented-out secrets, placeholders (e.g., `'YOUR_KEY_HERE'`), or other non-sensitive values.
- **Full-Length Secrets:** All secrets **MUST** be included in their entirety, without any truncation, ellipsis (`...`), or shortening. This rule applies to all secret types, including long JWTs, multi-line SSH private keys, or PEM certificates.

"""