Implement Code Block Failure Detection (Issue #5) and Extra CWE Detector (Issue #6) #289

Open
TheAuditorTool wants to merge 1 commit into OWASP-Benchmark:main from TheAuditorTool:feature/codeblock-analysis-issues-5-6

@TheAuditorTool
PR: Implement Code Block Failure Detection (Issue #5) and Extra CWE Detector (Issue #6)

Summary

This PR completes the CalculateToolCodeBlocksSupport tool (Issue #5) and adds a new DetectExtraCWEs tool (Issue #6). Both are standalone Java tools in the tools/ package that analyze which code constructs cause security tools to fail on OWASP Benchmark test suites.

What Changed

1. CodeBlockSupportResults.java (modified)

Added fields for Issue #5's isolation analysis:

  • Set<String> fnTestCases -- tracks which FN test cases use this snippet
  • Set<String> isolatedFnCause -- FNs where THIS snippet is the single unsupported one (isolated root cause)
  • Set<String> isolatedFpCause -- FPs where THIS snippet is the single safe snippet the tool fails to recognize
  • toIsolationString() -- formatted output for isolated root cause reporting

These track actual test case names (not just counts), enabling bidirectional lookup: given a snippet, find all test cases it causes to fail; given a test case, find which snippet is the likely root cause.
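
As a rough illustration of that bidirectional lookup, the sketch below keeps a per-snippet set of isolated FN test case names and a reverse index from test case to snippet. The class `SnippetIsolationIndex` and its methods are hypothetical and are not the PR's `CodeBlockSupportResults` API; only the name `isolatedFnCause` is taken from the description above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the PR's CodeBlockSupportResults class.
final class SnippetIsolationIndex {
    // snippet name -> FN test cases isolated to that snippet (mirrors isolatedFnCause)
    private final Map<String, Set<String>> isolatedFnCause = new HashMap<>();
    // reverse index: test case name -> snippet isolated as its likely root cause
    private final Map<String, String> rootCauseByTestCase = new HashMap<>();

    void recordIsolatedFn(String snippet, String testCase) {
        isolatedFnCause.computeIfAbsent(snippet, k -> new TreeSet<>()).add(testCase);
        rootCauseByTestCase.put(testCase, snippet);
    }

    // Forward lookup: given a snippet, which test cases does it cause to fail?
    Set<String> testCasesCausedBy(String snippet) {
        Set<String> cases = isolatedFnCause.getOrDefault(snippet, Collections.emptySet());
        return Collections.unmodifiableSet(cases);
    }

    // Reverse lookup: given a test case, which snippet is the likely root cause?
    // Returns null when no snippet was isolated for that test case.
    String likelyRootCauseOf(String testCase) {
        return rootCauseByTestCase.get(testCase);
    }
}
```

Storing names rather than counts is what makes the reverse lookup possible; a count alone could not say which test cases to re-examine.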

2. CalculateToolCodeBlocksSupport.java (modified)

Three additions to Dave's existing implementation:

A. Single-unknown isolation (Issue #5 Pass 2)

After Pass 1 marks supported snippets from TPs, a new pass iterates each FN test case and counts how many of its snippets are NOT supported:

  • 0 unknown: All snippets are individually supported, but the tool still misses the vulnerability. This is a "combination failure" -- the tool can't handle the specific combination of source + dataflow + sink. These are reported separately.
  • 1 unknown: That single unsupported snippet is the likely root cause. Added to isolatedFnCause on the snippet. These are the most actionable findings.
  • 2+ unknown: Ambiguous -- can't isolate which snippet is the problem. Skipped.
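
The three cases above can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not the code in `CalculateToolCodeBlocksSupport`: the class `FnIsolationSketch`, the `Outcome` enum, and the representation of a test case as a list of snippet names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the Pass 2 decision; names are illustrative only.
final class FnIsolationSketch {
    enum Outcome { COMBINATION_FAILURE, ISOLATED, AMBIGUOUS }

    // Snippets of one FN test case that Pass 1 did NOT mark as supported.
    static List<String> unsupported(List<String> testCaseSnippets, Set<String> supported) {
        List<String> unknown = new ArrayList<>();
        for (String snippet : testCaseSnippets) {
            if (!supported.contains(snippet)) {
                unknown.add(snippet);
            }
        }
        return unknown;
    }

    static Outcome classify(List<String> testCaseSnippets, Set<String> supported) {
        int unknownCount = unsupported(testCaseSnippets, supported).size();
        if (unknownCount == 0) {
            return Outcome.COMBINATION_FAILURE; // every part is supported, yet the tool missed it
        }
        if (unknownCount == 1) {
            return Outcome.ISOLATED;            // the single unknown is the likely root cause
        }
        return Outcome.AMBIGUOUS;               // 2+ unknowns: cannot attribute, skipped
    }
}
```

For an `ISOLATED` outcome, the single element returned by `unsupported(...)` is the snippet whose `isolatedFnCause` set would receive the test case name.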

B. FP isolation (Issue #5 Stage 2)

For each FP test case (tool flags a safe test case as vulnerable), identifies which snippet(s) make the test case safe using the truePositive metadata from the .xml files:

  • If exactly 1 snippet is the "safe" component and the tool still flags it, that snippet is reported as an isolated FP cause.
  • Sanity check: FP test cases where ALL snippets are individually "supported" from TP analysis are flagged. The tool understands each part but gets confused by the combination.
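
A minimal sketch of the two FP checks follows. It assumes each snippet of a test case carries a boolean taken from the truePositive metadata, with `false` meaning the snippet is what makes the test case safe; that interpretation, the class `FpIsolationSketch`, and the map-based representation are assumptions made for illustration, not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of Stage 2; names and data layout are illustrative only.
final class FpIsolationSketch {
    // snippets: snippet name -> truePositive flag (assumed: false = "safe" snippet).
    // Returns the snippet only when exactly one safe snippet exists in the FP test case.
    static Optional<String> isolatedFpCause(Map<String, Boolean> snippets) {
        List<String> safe = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : snippets.entrySet()) {
            if (!entry.getValue()) {
                safe.add(entry.getKey());
            }
        }
        return safe.size() == 1 ? Optional.of(safe.get(0)) : Optional.empty();
    }

    // Sanity check: true when every snippet of the FP test case was already
    // marked supported by the TP analysis, so the combination is the suspect.
    static boolean allSnippetsSupported(Map<String, Boolean> snippets, Set<String> supported) {
        return supported.containsAll(snippets.keySet());
    }
}
```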

C. Scorecard directory support

The -r parameter now accepts either a single CSV file (original behavior) or a directory path. When given a directory, it discovers all *Scorecard_for_*.csv files and processes each tool sequentially. Each tool gets its own analysis section in the output.
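
One way the file-or-directory handling could look is sketched below using a `java.nio.file` glob. The class `ScorecardDiscoverySketch` and both method names are hypothetical, and deriving the tool name from the text after `Scorecard_for_` is an assumption based on the file names shown in the usage examples, not confirmed behavior of the PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of resolving the -r argument; not the PR's code.
final class ScorecardDiscoverySketch {
    private static final String MARKER = "Scorecard_for_";

    // A plain file is returned as-is; a directory is scanned for scorecard CSVs.
    static List<Path> resolveScorecards(Path resultsArg) {
        if (!Files.isDirectory(resultsArg)) {
            return Collections.singletonList(resultsArg);
        }
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(resultsArg, "*" + MARKER + "*.csv")) {
            for (Path p : stream) {
                found.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(found); // deterministic processing order
        return found;
    }

    // Assumed convention: tool name is the text between "Scorecard_for_" and ".csv".
    static String toolName(Path scorecard) {
        String file = scorecard.getFileName().toString();
        int idx = file.indexOf(MARKER);
        if (idx < 0 || !file.endsWith(".csv")) {
            return file; // not a recognised scorecard name; fall back to the file name
        }
        return file.substring(idx + MARKER.length(), file.length() - ".csv".length());
    }
}
```

Sorting the discovered paths keeps the per-tool output sections in a stable order across runs, since directory iteration order is not guaranteed.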

3. DetectExtraCWEs.java (new)

Implements Issue #6. A standalone Java tool that detects CWE findings outside expected test cases.

How it works:

  1. Loads the expected results CSV to build a map of {test_number -> (name, category, cwe)}
  2. For each raw tool result file in the results directory, uses the existing Reader.allReaders() parsers (57 parsers covering FindBugs, PMD, ZAP, Semgrep, Checkmarx, Fortify, etc.)
  3. For each finding the tool reports, checks if the reported CWE matches the expected CWE for that test case
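
Step 1 could be sketched roughly as follows. The column order `name,category,real vulnerability,cwe` and the `#`-prefixed header line are assumptions about the expected results CSV, and `ExpectedResultsSketch` with its nested `Expected` type is hypothetical; the PR's actual loader may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of building {test_number -> (name, category, cwe)}.
final class ExpectedResultsSketch {
    static final class Expected {
        final String name;
        final String category;
        final int cwe;

        Expected(String name, String category, int cwe) {
            this.name = name;
            this.category = category;
            this.cwe = cwe;
        }
    }

    private static final Pattern TRAILING_DIGITS = Pattern.compile("(\\d+)$");

    // Assumed layout per line: name,category,realVulnerability,cwe
    static Map<Integer, Expected> parse(List<String> lines) {
        Map<Integer, Expected> byNumber = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and the header comment
            }
            String[] cols = trimmed.split(",");
            if (cols.length < 4) {
                continue; // malformed row
            }
            String name = cols[0].trim();
            Matcher m = TRAILING_DIGITS.matcher(name);
            if (!m.find()) {
                continue; // no test number in the name
            }
            try {
                int number = Integer.parseInt(m.group(1));
                int cwe = Integer.parseInt(cols[3].trim());
                byNumber.put(number, new Expected(name, cols[1].trim(), cwe));
            } catch (NumberFormatException e) {
                // non-numeric CWE column or oversized number: skip the row
            }
        }
        return byNumber;
    }
}
```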

Two modes:

  • Normal: Reports findings where a known benchmark CWE (one of the 11 categories) is detected in a test case for a different CWE. Example: tool reports CWE-89 (SQLi) in BenchmarkTest00003 which is actually a hash (CWE-328) test case.
  • Hard: Also reports findings where a CWE not in the benchmark's expected set is detected. Example: tool reports CWE-200 (Information Disclosure) which is not one of the 11 tested categories.
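
The distinction between the two modes comes down to a three-way classification of each finding, sketched below. `ExtraCweClassifierSketch` and its `Kind` enum are hypothetical names, and the set of benchmark CWEs is passed in by the caller rather than hard-coded, since it would presumably be derived from the expected results file.

```java
import java.util.Set;

// Hypothetical sketch of how a single finding could be bucketed.
final class ExtraCweClassifierSketch {
    enum Kind {
        EXPECTED,      // reported CWE matches the test case's expected CWE
        NORMAL_EXTRA,  // a benchmark CWE reported in a test case for a different CWE
        HARD_EXTRA     // a CWE outside the benchmark's expected set
    }

    static Kind classify(int reportedCwe, int expectedCwe, Set<Integer> benchmarkCwes) {
        if (reportedCwe == expectedCwe) {
            return Kind.EXPECTED;
        }
        if (benchmarkCwes.contains(reportedCwe)) {
            return Kind.NORMAL_EXTRA;
        }
        return Kind.HARD_EXTRA;
    }

    // Normal mode reports only NORMAL_EXTRA; hard mode also reports HARD_EXTRA.
    static boolean reportedIn(Kind kind, boolean hardMode) {
        return kind == Kind.NORMAL_EXTRA || (hardMode && kind == Kind.HARD_EXTRA);
    }
}
```

With the examples from the list above, CWE-89 reported in a CWE-328 test case would be `NORMAL_EXTRA`, and CWE-200 would be `HARD_EXTRA` as long as 200 is not in the supplied set.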

Known limitation: Existing Reader parsers filter findings at the parser level -- most parsers only retain findings in BenchmarkTest* files and discard findings in helper classes (e.g., DatabaseHelper.java). Detecting extra CWEs in non-test-case infrastructure files would require extending individual parsers, which is a separate effort.

Usage

Issue #5: Code Block Analysis

# Single tool
java -cp target/classes:... org.owasp.benchmarkutils.tools.CalculateToolCodeBlocksSupport \
  -f data/benchmark-attack-http.xml \
  -r scorecard/Benchmark_v1.2_Scorecard_for_FBwFindSecBugs.csv

# All tools in a directory
java -cp target/classes:... org.owasp.benchmarkutils.tools.CalculateToolCodeBlocksSupport \
  -f data/benchmark-attack-http.xml \
  -r scorecard/

# Via Maven
mvn benchmarkutils:calculate-codeblock-support \
  -DcrawlerFile=data/benchmark-attack-http.xml \
  -DresultsCSVFile=scorecard/

Issue #6: Extra CWE Detector

java -cp target/classes:... org.owasp.benchmarkutils.tools.DetectExtraCWEs \
  -e expectedresults-1.2.csv \
  -r results/ \
  -m both

# Modes: normal, hard, both (default: both)

Output Examples

Issue #5 Output (per tool)

=== Tool: FBwFindSecBugs ===

Always FN Codeblock type: SINK (cmdi), name: RuntimeExec-S^-S[]-F.code, ...
...

--- Pass 2: Isolated FN Root Causes ---
  [SINK] RuntimeExec-S^-S[]-F.code (cmdi) -- 6 FNs isolated to this snippet
  [DATAFLOW] Reflection.code -- 15 FNs isolated to this snippet
  Combination failures (all snippets supported, tool still misses): 3

--- Sanity Check: FPs where all snippets are supported (2) ---
  Testcase #: 1234, Category: weakrand, ...

--- Stage 2: FP Root Causes (safe snippets tool doesn't recognize) ---
  [SINK] MessageDigestGetInstance-S-P2.code (hash, safe) -- 22 FPs isolated to this snippet
  [DATAFLOW] SafeQuestionmarkConditional.code (safe) -- 18 FPs isolated to this snippet

Issue #6 Output (per tool)

=== FBwFindSecBugs v1.4.6 ===
  NORMAL: Known CWEs in wrong test cases (5 findings):
    CWE-89 found in 3 non-matching test cases:
      BenchmarkTest00442 (expected CWE-328)
      BenchmarkTest00891 (expected CWE-79)
      BenchmarkTest01234 (expected CWE-22)
  HARD: Non-benchmark CWEs detected (2 findings):
    CWE-200 found in 2 test cases:
      BenchmarkTest00100 (expected CWE-89)

Design Decisions

  1. Kept Dave's existing analysis passes intact. The new Pass 2 and Stage 2 are additive -- they run after the original analysis and produce additional output sections.

  2. Reused existing Reader parsers for Issue #6 rather than writing new XML parsers. This means all 57 tool formats are supported automatically. The tradeoff is that findings in non-test-case files are not captured (parsers filter them).

  3. Single-unknown isolation produces the most actionable results. When a tool misses a vulnerability and exactly one snippet in that test case is unsupported, that snippet is almost certainly the cause. This is the core insight from Dave's Issue #5 spec.

  4. Combination failures are reported separately. When all snippets are individually supported but the tool still fails, the specific combination is the problem -- not any individual snippet. These need manual investigation.

…extra CWE detector (OWASP-Benchmark#6)

Issue OWASP-Benchmark#5 - CalculateToolCodeBlocksSupport:
- Add single-unknown isolation (Pass 2): for each FN, if exactly 1 snippet
  is unsupported, isolate it as the root cause
- Add FP isolation (Stage 2): identify safe snippets that tools fail to
  recognize, with sanity check for FPs where all snippets are supported
- Add scorecard directory support: -r now accepts a directory to process
  all tool scorecards automatically
- Track actual test case names per snippet for bidirectional mapping

Issue OWASP-Benchmark#6 - DetectExtraCWEs (new):
- Detect CWE findings outside expected test cases using existing Reader parsers
- Normal mode: known CWEs in wrong test cases (e.g. CWE-89 in a hash test)
- Hard mode: any CWE not in the benchmark's expected set
@TheAuditorTool force-pushed the feature/codeblock-analysis-issues-5-6 branch from be7426e to f813694 on April 13, 2026 17:31