237 changes: 237 additions & 0 deletions secrets-benchmarks/README.md
# DeepSource Secrets Benchmark

A benchmark evaluating the DeepSource secret-detection pipeline on a synthetic
dataset of code snippets containing hardcoded secrets.

## Pipeline under test

```
        snippets/
(240 snippets, 518 secrets)
             │
             ▼
┌─────────────────────────────┐
│ Stage 1: scanner            │  pattern + high-entropy
│ (candidate detector)        │  heuristics (entropy >= 4.0)
└─────────────────────────────┘
             │  453 TPs, 65 missed, 696 raw FPs
             ▼
┌─────────────────────────────┐
│ Stage 2: classifier         │  fine-tuned small language
│ (false-positive filter)     │  model on a custom endpoint
└─────────────────────────────┘
             │  rejects 690 / 696 FPs
             ▼
Final: 453 TP · 65 FN · 6 FP
Acc 87.45 · Prec 98.69 · Rec 87.45 · F1 92.78
```
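Stage 1's high-entropy heuristic can be illustrated with a Shannon-entropy check over candidate tokens. This is a minimal sketch under stated assumptions, not the scanner's actual implementation: the function names are hypothetical, and only the 4.0 threshold comes from the diagram above.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def looks_high_entropy(token: str, threshold: float = 4.0) -> bool:
    """Flag tokens whose per-character entropy meets the stage-1 threshold."""
    return shannon_entropy(token) >= threshold

# Dictionary-word "secrets" score low; random-looking payloads score high.
low = shannon_entropy("password")                               # = 2.75 bits/char
high = shannon_entropy("ghp_aV4gH9rT2pL7xJ5sK1mF3bZ8oN6cW0qYdE")
```

Note that per-character entropy is bounded by `log2(len(token))`, so very short secrets can sit below a 4.0 threshold even when fully random; a production scanner would pair this check with pattern matching, as stage 1 does.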

## Final numbers

| Metric | DeepSource |
|-------------------|-----------:|
| Perfect Matches | 453 |
| Partial Matches | 0 |
| Missed Secrets | 65 |
| False Positives | 6 |
| Accuracy | 87.45% |
| Precision | 98.69% |
| Recall | 87.45% |
| F1 Score | 92.78% |

Detailed derivation (per-stage breakdowns, formulas, and the tiny rounding
delta between the reported 92.78% F1 and the count-derived 92.73%) is captured
in [`results/deepsource.json`](results/deepsource.json).
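For reference, the scorecard can be recomputed directly from the counts — this is plain arithmetic, not repository code:

```python
tp, fn, fp = 453, 65, 6

precision = tp / (tp + fp)               # 453 / 459 ≈ 98.69%
recall    = tp / (tp + fn)               # 453 / 518 ≈ 87.45%
f1        = 2 * tp / (2 * tp + fp + fn)  # 906 / 977 ≈ 92.73%

# The scorecard's Accuracy (87.45%) coincides with Recall here: with no
# true negatives counted, both reduce to TP / (TP + FN).
```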

## Repository layout

```
secrets-benchmarks/
├── README.md              # this file
├── generation-prompt.py   # the system prompt used to generate the
│                          # synthetic dataset
├── raw-dataset.jsonl      # raw generator output; `snippets/` is derived from this
├── snippets/              # 240 language-native snippets with preserved line numbers
│   └── NNN/               # 001-240, one dir per snippet
│       ├── snippet.{ext}       # code in its native extension
│       └── ground-truth.json   # ground-truth secrets + metadata
├── raw-output/            # exact per-stage outputs
│   ├── scanner.json       # stage-1: per-snippet scan results incl. all 696 raw FPs
│   └── classifier.jsonl   # stage-2: raw per-snippet predictions
├── processed/             # normalized per-secret comparisons
│   ├── scanner.jsonl      # 518 ground-truth lines + 696 false-positive lines
│   └── classifier.jsonl   # 513 per-secret comparisons
└── results/               # metrics
    ├── scanner.json       # stage-1 standalone metrics
    ├── classifier.json    # stage-2 standalone metrics (on all-TP eval)
    └── deepsource.json    # final pipeline numbers (= table above)
```

## File formats

### `raw-dataset.jsonl`

Raw output from the generator: one top-level entry per line, 48 entries in
total (IDs 1-50, with IDs 6 and 27 dropped during manual review). Each entry
bundles 5 sub-examples; `snippets/` is the exploded, language-native form used
for evaluation.

```jsonc
{
  "id": 1,
  "findings": [                // 5 sub-examples per entry
    {
      "code": "78: import boto3\n79: ...\n83: aws_access_key = 'AKIA...'",
      "findings": [
        {"line_number": 83, "secret": "AKIA...", "label": "True Positive"}
      ]
    }
    // ... 4 more sub-examples
  ]
}
```
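A sketch of how one sub-example's numbered `code` string maps to a line-1-based snippet (the helper name is hypothetical; the repo's actual derivation script is not included):

```python
import re

def explode_sub_example(code: str, findings: list) -> tuple:
    """Strip 'NN: ' prefixes and rebase findings so the snippet starts at line 1."""
    lines = code.splitlines()
    first = int(re.match(r"\s*(\d+):", lines[0]).group(1))
    source = "\n".join(re.sub(r"^\s*\d+:\s?", "", ln) for ln in lines)
    rebased = [{**f, "line_number": f["line_number"] - first + 1} for f in findings]
    return source, rebased

src, gt = explode_sub_example(
    "78: import boto3\n79: s3 = boto3.client('s3')\n80: aws_access_key = 'AKIA...'",
    [{"line_number": 80, "secret": "AKIA...", "label": "True Positive"}],
)
# gt[0]["line_number"] is now 3 (80 - 78 + 1)
```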

### `snippets/`

240 language-native snippets designed for direct consumption by code scanners.
Each snippet is a real source file with the correct language extension, so
scanners can infer the language and report accurate line numbers without any
translation. The dataset was generated synthetically by an LLM and then
manually verified.

**Directory structure:**

```
snippets/001/snippet.py # one dir per snippet, 001-240
snippets/002/snippet.yml
...
snippets/240/snippet.properties
```

Each directory holds the code plus a `ground-truth.json` with the expected
findings.

**Line numbers.** Each `snippet.{ext}` starts at line 1. `ground-truth.json`
reports the line within that file where each secret appears, so a scanner's
reported line number can be compared to `line_number` directly. The snippet's
original line offset (from the generator prompt) is preserved in
`raw-dataset.jsonl` if needed.

**`ground-truth.json` format:**

```jsonc
{
  "entry_id": 1,          // original generation-prompt id
  "language": "python",
  "findings": [
    {"line_number": 6, "secret": "AKIA...", "label": "True Positive"}
  ]
}
```
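Scoring a scanner against a snippet then reduces to matching `(line_number, secret)` pairs against `ground-truth.json`. A simplified sketch — the real pipeline also tracks partial value matches, which this version folds into "missed":

```python
def score_snippet(ground_truth: list, detections: list) -> dict:
    """Split one snippet's results into exact / missed / false-positive buckets."""
    exact, missed = [], []
    remaining = list(detections)
    for gt in ground_truth:
        hit = next((d for d in remaining
                    if d["line_number"] == gt["line_number"]
                    and d["secret"] == gt["secret"]), None)
        if hit is not None:
            exact.append(gt)
            remaining.remove(hit)
        else:
            missed.append(gt)
    # Detections left over after matching have no ground-truth counterpart.
    return {"exact": exact, "missed": missed, "false_positives": remaining}
```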

**Languages present:** python, javascript, typescript, go, java, yaml,
terraform, properties, json, csharp, groovy, kotlin, swift, dart, php (15
total).

### `raw-output/scanner.json`

```jsonc
{
  "stage": "stage-1 scanner (pattern + high-entropy heuristics)",
  "per_entry_results": {
    "1": [                           // array indexed by sub-example
      {
        "found_entries": [...],      // TPs (exact or partial)
        "not_found_entries": [...],  // missed ground-truth secrets
        "false_positives": [...],    // scanner detections with no gt match
        "total_actual": 2, "total_found": 2,
        "total_missed": 0, "total_false_positives": 5
      }
      // ... one per sub-example
    ]
  }
}
```

### `raw-output/classifier.jsonl`

One line per top-level entry. Each line contains the classifier's raw
predictions for every sub-example of that entry:

```jsonc
{
  "id": 1,
  "findings": [
    {
      "index": 0, "sub_index": 0,
      "completion": "<json>{\"line_number\": 83, \"label\": \"True Positive\", \"secret_value\": \"AKIA...\", \"reason\": \"...\"}</json>"
    }
    // ... one per sub-example
  ]
}
```
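Each `completion` wraps its verdict in `<json>…</json>` tags; unwrapping it (and surfacing parse failures) can be sketched as follows — the function name is illustrative, not repository code:

```python
import json
import re

def parse_completion(completion: str):
    """Extract the JSON verdict from a '<json>...</json>' completion, or None."""
    m = re.search(r"<json>(.*?)</json>", completion, re.DOTALL)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

verdict = parse_completion(
    '<json>{"line_number": 83, "label": "True Positive"}</json>'
)
# verdict == {"line_number": 83, "label": "True Positive"}
```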

### `processed/scanner.jsonl`

1,214 lines = 518 ground-truth comparisons + 696 raw false positives.

```jsonc
// kind=ground_truth (one per gt secret; match_type ∈ {exact, partial, missed})
{"kind": "ground_truth", "id": 1, "match_type": "exact",
 "expected": {"line_number": 83, "secret": "AKIA...", "label": "True Positive"},
 "actual": {"line_number": 6, "secret": "AKIA..."}}

// kind=false_positive (one per raw scanner FP)
{"kind": "false_positive", "id": 1, "sub_index": 0, "match_type": "false_positive",
 "actual": {"line_number": 29, "secret": "File '{file_name}' uploaded to ..."}}
```
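Stage-1 standalone counts fall out of a single pass over this file. A sketch using the `match_type` values above (the function name is illustrative):

```python
import json
from collections import Counter

def scanner_metrics(lines) -> dict:
    """Aggregate processed/scanner.jsonl rows into stage-1 counts and rates."""
    counts = Counter(json.loads(line)["match_type"] for line in lines)
    tp = counts["exact"] + counts["partial"]   # both count as detections
    fn = counts["missed"]
    fp = counts["false_positive"]
    return {"tp": tp, "fn": fn, "fp": fp,
            "precision": tp / (tp + fp), "recall": tp / (tp + fn)}

# e.g.  with open("processed/scanner.jsonl") as fh: metrics = scanner_metrics(fh)
```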

### `processed/classifier.jsonl`

513 lines; the 5 rows missing relative to the 518 ground-truth secrets are
sub-examples where the classifier's response failed JSON parsing.

```jsonc
{
  "id": 1, "index": 0, "sub_index": 0,
  "perfect_match": true,
  "error_fields": [],
  "expected": {"line_number": 83, "secret": "AKIA...", "label": "True Positive"},
  "actual": {"line_number": 83, "secret_value": "AKIA...", "label": "True Positive",
             "reason": "The value 'AKIA...' matches the AWS access-key format ..."}
}
```

### `results/*.json`

- `scanner.json` — stage-1 standalone metrics plus baseline comparisons
between a vanilla configuration, the SDK with default plugins, and the SDK
with the high-entropy-string detector enabled.
- `classifier.json` — stage-2 standalone metrics. **Caveat:** the eval set
is all-TP, so its `precision = 1.0` is not a measurement of how well the
classifier rejects stage-1 FPs — it's a measurement of "classifier doesn't
misclassify real secrets."
- `deepsource.json` — pipeline end-to-end numbers matching the headline
scorecard. Includes both the reported percentages and the recomputed ones
with a note on the tiny F1 rounding delta.

## Caveats

1. **Benchmark is synthetic.** Snippets are generated, 25-35 lines, packed
with 1-4 secrets each. Real repo noise (minified JS, lockfiles, large
vendor blobs, long docs) isn't represented. Treat raw FP counts here as a
lower bound relative to real-world scanning.
2. **No full-file context.** Each sub-example was scanned as its own isolated
buffer, not as a file inside a repository. Cross-file context (env var
references elsewhere, `.gitignore`, allowlists) isn't tested.
3. **Stage-2 per-FP verdicts aren't saved here.** The stage-2 data in
`classifier.jsonl` runs the classifier as a standalone detector on the
golden set — not as an FP filter on the 696 stage-1 FPs. The "6" in the
final metrics is the headline number from the reported benchmark run; to
re-derive it you'd need to pipe stage-1's `false_positives` back through
the classifier and record its verdict per FP.
4. **Synthetic AWS-key-shaped strings** in the golden set will trigger GitHub
push-protection on public repos. Keep this dataset in a private repo, or
strip/allowlist the affected strings before pushing.
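Re-deriving the final FP count, as caveat 3 describes, would mean piping stage-1's `false_positives` back through the classifier. A skeleton of that loop, where `classify_finding` is a stand-in for the unpublished fine-tuned model endpoint:

```python
def rederive_final_fp_count(scanner_results: dict, classify_finding) -> int:
    """Count stage-1 false positives that the stage-2 classifier fails to reject.

    `scanner_results` is the parsed raw-output/scanner.json structure;
    `classify_finding` stands in for the fine-tuned classifier endpoint
    (not published in this repo) and returns a label string per finding.
    """
    surviving = 0
    for entry_results in scanner_results["per_entry_results"].values():
        for sub in entry_results:
            for fp in sub["false_positives"]:
                if classify_finding(fp) == "True Positive":  # failed to reject
                    surviving += 1
    return surviving
```

Matching the headline "6" would additionally require the exact classifier weights and prompt used in the reported run.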
153 changes: 153 additions & 0 deletions secrets-benchmarks/generation-prompt.py
"""System prompt used to generate the synthetic secret-detection benchmark
dataset (`raw-dataset.jsonl`). The prompt was fed to a large language model
which emitted batches of 5 synthetic code snippets (25-35 numbered lines each,
containing 1-4 realistic-looking hardcoded secrets per snippet).
"""

SYNTHETIC_SYSTEM_PROMPT = """
You are a data generation engine that produces **synthetic code snippets** containing **realistic-looking hardcoded secrets** for use in cybersecurity training datasets. Your primary goal is to create a diverse and realistic dataset.

---

## Rules

### 1. Output Format
Always return a JSON structure, starting with the `<json>` tag and ending with the `</json>` tag.

The top-level output MUST be a JSON array containing exactly **5 objects**.
Each object must follow this structure:

    {
      "example_id": <1 through 5>,
      "code": "<string with numbered code lines>",
      "findings": [
        {
          "line_number": <integer of line containing the secret>,
          "secret": "<the exact secret value>",
          "label": "True Positive"
        },
        ...
      ]
    }

No extra commentary, text, or markdown outside the `<json>...</json>` block.

---

### **2. Code Snippet Requirements**

The code must be a plausible, high-quality snapshot that imitates a real-world project. Generic, simplistic, or repetitive code is unacceptable.
- **Length & Numbering:** Each snippet must be **25-35 lines long**. Code lines must be **numbered** as `<line_number>: <code>`, starting from an arbitrary line number (e.g., `42:`, `115:`).

#### **2.1. Mandate for Contextual & Syntactic Diversity**
Each of the 5 generated examples must be a **distinct** and **unique** snapshot of a real-world project. The primary goal is to maximize diversity across the set, avoiding any repetition in the scenario, language, or overall structure.
The 5 generated examples **must each be written in a distinct and unique** primary language or configuration format. In a single response, you are **strictly prohibited** from generating similar-looking snippets, or more than two snapshots of the same format/programming language.

To ensure variety, you **must select 5 different options** from the languages and formats listed in Section 2.2 for each response.
- **Strict Uniqueness:** Each snippet **must** represent a unique development scenario and use a different primary language or configuration format. For example, generating two Python backend apps or two Terraform files in the same response is strictly prohibited.
- **Plausible Secret Pairing:** The type of hardcoded secret must logically match the code's context. For instance, an SSH key is plausible in a CI/CD pipeline, while a Stripe API key is plausible in a backend payment processor.

#### **2.2. Scenario & Language Variety**
To ensure diversity, select from a wide range of contexts and languages for each of the 5 snippets.

**A. Example Scenarios & Use Cases:**
- **Backend Services:** API endpoints, database initializers, authentication middleware, or background workers (e.g., Python/Flask, Go/Gin, Ruby/Rails, C#/ASP.NET, Java/Spring).
- **Frontend Components:** Configuration objects or service initializers inside UI code (e.g., TypeScript/React, JavaScript/Vue) that handle keys for services like Firebase, Mapbox, or Sentry.
- **Infrastructure as Code (IaC):** Resource definitions with hardcoded provider credentials or variables (e.g., Terraform/HCL, Pulumi/TypeScript, AWS CDK).
- **CI/CD Pipelines:** Build, test, and deployment steps with integrated secrets (e.g., YAML for GitHub Actions/GitLab CI, Groovy for Jenkinsfiles).
- **Configuration Files:** Standalone configuration for applications or services (e.g., YAML, JSON, `.env`, Java `.properties`, `.tfvars`).
- **Data & Utility Scripts:** Standalone scripts for automation, data processing, or sending notifications (e.g., Python with `boto3` or `smtplib`, PHP scripts, PowerShell).
- **Mobile App Configuration:** Build configurations or property lists containing API keys (e.g., `build.gradle` for Android, `Info.plist` or Swift configuration files for iOS).

**B. Example Languages & Formats:**
- **Languages:** Python, Go, TypeScript, JavaScript, C#, Java, Ruby, PHP, Swift, Kotlin.
- **Config Formats:** YAML, JSON, HCL (Terraform), `.env`, `.properties`, XML.

---

### **3. Secret Injection Rules**

The goal is to generate code snippets with hardcoded secrets that are **indistinguishable from real-world secrets** at a glance. They must be synthetically generated but adhere strictly to the format, character set, and apparent randomness of genuine credentials.
- **Secret Count:** Each snippet must contain **at least 1 and at most 4** hardcoded secrets. The exact number should vary randomly across the dataset (e.g., some snippets with 1, some with 2 or 3, and occasionally 4).
- **No Metadata:** Do not include any comments (`// fake key`), docstrings, or other markers that reveal the secrets are synthetic, for training, or are placeholders.

***

### **3.1 Mandate for Authentic Realism**

All secrets must be generated based on two core principles: **authentic structure** and **high-entropy payloads**.

#### **A. Authentic Structure**

Secrets must precisely replicate the real-world format for their type. This includes:
- **Prefixes:** Use the correct, well-known prefixes (e.g., `sk_live_` for Stripe, `AKIA` for AWS, `ghp_` for GitHub, `xoxb-` for Slack).
- **Character Set:** Use the appropriate character set (e.g., alphanumeric, Base64, hex).
- **Length:** Adhere to the standard length or length range for the specific secret type.
- **Formatting:** Complex secrets like database connection strings must use the correct URI format and include realistic (but synthetic) hostnames, usernames, and databases.

#### **B. High-Entropy Payloads**

The variable portion of the secret **must appear to be a cryptographically random string**. Generation must **strictly avoid** common anti-patterns that make secrets look fake.

**Prohibited Patterns (Do NOT use):**
- **Leet Speak:** `D3m0T0k3n`, `S3cr3t`
- **Dictionary Words:** `MyP@ssword`, `StagingKey`
- **Sequential Chars/Keyboard Walks:** `abcdefg`, `12345678`, `qwerty`
- **Simple, Repetitive Patterns:** `abababab`, `testtest`
- **Obvious Placeholders:** `AKIAYOURSECRETKEYHERE`, `ghp_XXXXXXXXXXXXXXXXXXXX`

Below are examples illustrating the required level of realism.

**GitHub Token:**
- Bad Example (Looks Fake): `ghp_D3m0L0ngPers0nalAcc3ssT0k3nAbCdEf123456`
- Good Example (Looks Real): `ghp_aV4gH9rT2pL7xJ5sK1mF3bZ8oN6cW0qYdE`

**AWS Access Key:**
- Bad Example (Looks Fake): `AKIA2QW3E4R5T6Y7U8I9`
- Good Example (Looks Real): `AKIAY3R4WZ76X2P5QJ6M`

**Stripe API Key:**
- Bad Example (Looks Fake): `sk_live_test_key_for_payments_12345`
- Good Example (Looks Real): `sk_live_51Kk0L2ApB8fG1tY9cRzXvWqSjU3mB`

**Postgres URI:**
- Bad Example (Looks Fake): `postgres://admin:password@localhost:5432/testdb`
- Good Example (Looks Real): `postgres://prod_user_rw:8!hG#kL$pQ2s@db.prod.internal:5432/main`

**JWT Token:**
- Bad Example (Looks Fake): `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.test`
- Good Example (Looks Real): `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c`

***

### **3.2 Diversity of Secret Types**

Across the entire dataset, the generated secrets must represent a wide variety of realistic secret categories. Snippets should combine different types where natural. The list of secret types includes, but is not limited to:

- **API Keys:** Cloud providers (AWS, GCP, Azure), payment processors (Stripe, Braintree), SaaS platforms (Twilio, SendGrid), and AI services (OpenAI, Anthropic).
- **Authentication Tokens:** OAuth 2.0 tokens, session tokens, bearer tokens, JWTs.
- **Database Connection Strings:** Postgres, MySQL, MongoDB, Redis, etc.
- **Cloud Storage Keys:** AWS S3 access keys, Azure Blob Storage keys, GCP Cloud Storage keys.
- **Credentials:** Username/password combinations (for services, not end-users).
- **Cryptographic Material:** Raw encryption keys (AES, RSA), initialization vectors (IVs), or salts.
- **SSH Keys & Certificates:** Private keys (RSA, ED25519) or PEM-encoded certificates.

---

### **4. Output & Generation Rules**
This section defines the strict structural and content requirements for the final output.

- **JSON Array Structure:** The final output **MUST** be a single, valid JSON array that contains exactly **5 unique JSON objects**. Each object represents one complete example.
- **Object Content:** Each object in the array must include three keys: `"example_id"` (numbered sequentially from 1 to 5), a `"code"` snippet, and a `"findings"` array.
- **Strict Uniqueness Mandate:** The 5 generated code snippets **MUST BE UNIQUE**. Do not repeat or slightly modify a previous example. This is a critical requirement, as the data will be used for model training.
- **No Extraneous Text:** There **MUST NOT** be any text, explanations, or formatting outside the main JSON array (i.e., no text before or after the `[` and `]` brackets of the array).
- **Self-Correction:** Before finalizing your response, you must verify that the JSON array contains exactly 5 objects. If it does not, you must regenerate the entire response to meet the requirement.

---
### **5. Content Integrity Rules**
These rules apply to the secrets and findings generated within each code snippet.

- **True Positives Only:** All generated secrets **MUST** be true positives. Do not generate examples of false positives, commented-out secrets, placeholders (e.g., `'YOUR_KEY_HERE'`), or other non-sensitive values.
- **Full-Length Secrets:** All secrets **MUST** be included in their entirety, without any truncation, ellipsis (`...`), or shortening. This rule applies to all secret types, including long JWTs, multi-line SSH private keys, or PEM certificates.

"""