Merged
2 changes: 1 addition & 1 deletion Cargo.lock

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "fastaguard"
-version = "0.2.0"
+version = "0.3.0"
edition = "2021"
license = "MIT"
description = "FASTA preflight QC for assembly pipelines"
29 changes: 23 additions & 6 deletions README.md
@@ -2,6 +2,8 @@

FastaGuard is a fast, explainable FASTA QC tool for validating assembly FASTA files before expensive downstream analysis.

The assembly FASTA gate before expensive QC.

It is not intended to compete with QUAST, BUSCO, BlobToolKit, FastQC, or MultiQC. FastaGuard is the earlier preflight and triage layer: the first command that answers whether a FASTA file is valid, sane, interpretable, and ready for downstream tools.

```text
@@ -57,9 +59,15 @@ fastaguard sample.fa \
Pipeline gate example:

```bash
-fastaguard sample.fa --fail-on duplicate_ids,invalid_chars,high_n_rate
+fastaguard sample.fa --profile assembly --gate pipeline
```

The `pipeline` gate is the v0.3 assembly preset for workflow stop/go decisions.
It fails on duplicate IDs, invalid characters, invalid FASTA structure, and
high-N content. GC and length outliers remain advisory by default because they
are routing signals, not proof of contamination or misassembly. To make an
advisory finding block a pipeline, add it explicitly with `--fail-on`.
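
The blocking versus advisory split described above can be modeled in a few lines of Python. This is only an illustrative sketch of the policy, not FastaGuard's implementation. The IDs `duplicate_ids`, `invalid_chars`, and `high_n_rate` come from the earlier `--fail-on` example; `invalid_structure`, `gc_outlier`, and `length_outlier` are hypothetical names chosen for this example.

```python
# Sketch of the `pipeline` gate policy described in the README text.
# "invalid_structure", "gc_outlier", and "length_outlier" are hypothetical
# finding IDs; the real identifiers may differ.
PIPELINE_BLOCKING = frozenset(
    {"duplicate_ids", "invalid_chars", "invalid_structure", "high_n_rate"}
)


def blocking_findings(findings, fail_on=()):
    """Return the observed findings that would stop the pipeline, sorted."""
    blocking = PIPELINE_BLOCKING | set(fail_on)
    return sorted(set(findings) & blocking)


def gate_passes(findings, fail_on=()):
    """The gate passes only when no observed finding is blocking."""
    return not blocking_findings(findings, fail_on)


observed = ["gc_outlier", "length_outlier"]
print(gate_passes(observed))                          # True: advisory only
print(gate_passes(observed, fail_on=["gc_outlier"]))  # False: promoted to blocking
print(blocking_findings(["duplicate_ids", "gc_outlier"]))  # ['duplicate_ids']
```

Keeping advisory findings out of the default blocking set means a composition outlier routes a file to follow-up QC without stopping the workflow unless the operator opts in.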

Inspect the machine-readable contract:

```bash
@@ -80,7 +88,8 @@ docker run --rm -v "$PWD:/data" fastaguard:local /data/sample.fa \
--multiqc /data/fastaguard_mqc.json
```

-Use the generated BioContainers image in workflow engines:
+BioContainers currently publishes only the v0.2 image, which does not yet
+include v0.3 gate behavior:

```bash
docker pull quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0
@@ -116,6 +125,7 @@ FastaGuard is assembly-first.
```bash
fastaguard sample.fa \
--profile assembly \
--gate pipeline \
--out fastaguard_report.html \
--json fastaguard.json \
--tsv fastaguard.tsv \
@@ -147,6 +157,14 @@ v0.2 expands the assembly preflight layer with:
- richer provenance, taxonomy context, and routing hints
- hardened MultiQC and pipeline adoption material

v0.3 adds the assembly gate contract:

- `--gate pipeline` for default workflow blocking behavior
- `gate.blocking_findings` for machine stop/go decisions
- checksum provenance with `provenance.input_sha256`
- explicit advisory findings for evidence that should route follow-up QC rather
than stop a pipeline by default
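
A workflow step could consume these fields roughly as follows. This is a minimal sketch that assumes the dotted names map to nested JSON objects (`gate` → `blocking_findings`, `provenance` → `input_sha256`) and that `blocking_findings` is a list; the actual report schema may differ, and the function name `gate_decision` is made up for this example.

```python
import json


def gate_decision(report):
    """Turn a parsed FastaGuard JSON report into a stop/go decision.

    Assumes a nested layout for `gate.blocking_findings` and
    `provenance.input_sha256`. A missing `gate` section raises KeyError
    so that an unexpected report shape fails closed instead of passing.
    """
    blocking = report["gate"]["blocking_findings"]
    checksum = report.get("provenance", {}).get("input_sha256")
    return {
        "proceed": len(blocking) == 0,
        "blocking_findings": list(blocking),
        "input_sha256": checksum,
    }


# Hand-written illustrative report, not real FastaGuard output.
example = json.loads(
    '{"gate": {"blocking_findings": ["duplicate_ids"]},'
    ' "provenance": {"input_sha256": "placeholder-digest"}}'
)
print(gate_decision(example)["proceed"])  # False
```

In practice the report would be loaded from the file written by `--json`, for example with `json.load(open("fastaguard.json"))`.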

## Positioning

FastaGuard should recommend deeper tools when they are appropriate:
@@ -189,7 +207,6 @@ serves v0.2.0 for `linux-64`, `linux-aarch64`, `osx-64`, and `osx-arm64`.
BioContainers also publishes the pinned workflow image
`quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0`.

-The next internal milestone is the
-[v0.2 evidence pack](docs/evidence/fastaguard-v0.2-evidence.md): reproducible
-local and public FASTA runs that document runtime, verdicts, and top findings
-before new biological profiles are added.
+The current development milestone is v0.3: evidence, checksum provenance, and
+the assembly gate contract. Published Bioconda and BioContainers packages remain
+v0.2.0 until a v0.3 release is cut.
28 changes: 21 additions & 7 deletions docs/benchmarking.md
@@ -19,6 +19,12 @@ python3 scripts/benchmark_large_fasta.py \

This should finish quickly and produce `fastaguard.json`, `fastaguard.tsv`, `fastaguard_report.html`, and `fastaguard_mqc.json` in `target/bench-smoke/`.

For the v0.3 assembly gate contract, add the pipeline gate preset:

```bash
fastaguard sample.fa --profile assembly --gate pipeline
```

## Larger Local Benchmark

Build an optimized binary:
@@ -84,17 +90,21 @@ Do not use it to claim performance on contaminated assemblies, highly ambiguous
## v0.2 Evidence Targets

FastaGuard should prove four preflight categories with small reproducible
-fixtures:
+fixtures. For v0.3, the same evidence should also show whether each category
+blocks the pipeline gate:

-| Evidence case | What FastaGuard catches | Why it should run before heavier tools |
-| --- | --- | --- |
-| duplicate IDs | repeated FASTA identifiers | helps prevent workflow joins, indexes, and annotations from becoming ambiguous |
-| invalid characters | non-IUPAC sequence symbols | flags inputs that may trigger downstream parser and aligner failures |
-| high-N | ambiguous scaffolds and gap-heavy records | flags low-confidence mapping and annotation inputs before they are treated as clean |
-| GC outliers | composition-anomalous records | supports routing suspicious records to BlobToolKit, sourmash, Kraken, or other follow-up tools |
+| Evidence case | Gate behavior | What FastaGuard catches | Why it should run before heavier tools |
+| --- | --- | --- | --- |
+| duplicate IDs | blocking | repeated FASTA identifiers | helps prevent workflow joins, indexes, and annotations from becoming ambiguous |
+| invalid characters | blocking | non-IUPAC sequence symbols | flags inputs that may trigger downstream parser and aligner failures |
+| high-N | blocking | ambiguous scaffolds and gap-heavy records | flags low-confidence mapping and annotation inputs before they are treated as clean |
+| GC outliers | advisory by default | composition-anomalous records | supports routing suspicious records to BlobToolKit, sourmash, Kraken, or other follow-up tools |

FastaGuard should not replace QUAST, BUSCO, or BlobToolKit. It should make their
inputs safer and make obvious FASTA-level problems visible before those tools run.
For automated workflows, record `gate.blocking_findings` and
`provenance.input_sha256` alongside runtime and verdict so the gate decision can
be audited against exact input bytes.
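
One way to audit a recorded gate decision against the input is to recompute the checksum and compare it with the stored value. The sketch below assumes `provenance.input_sha256` is a hex-encoded SHA-256 digest of the input file's bytes as stored on disk (for a gzipped FASTA, the compressed bytes); that is an assumption about the contract, not something this document confirms. Both function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large FASTA files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_matches_report(path, recorded_sha256):
    """True when the file on disk still has the checksum recorded at gate time.

    Assumes the recorded value is a hex digest of the raw file bytes.
    """
    return sha256_of_file(path) == recorded_sha256.strip().lower()
```

If the comparison fails, the FASTA has changed since the gate ran and the recorded verdict no longer applies to it.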

## Evidence To Collect Next

@@ -115,13 +125,17 @@ For each run, record:
- peak memory if measured externally
- verdict and top findings
- whether downstream tools would have been blocked or recommended
- gate status and `gate.blocking_findings` when run with `--gate pipeline`
- `provenance.input_sha256`

This evidence matters more than synthetic speed alone because it shows the wedge: cheap FASTA preflight before expensive downstream QC.

## Evidence Pack Workflow

The v0.2 evidence workflow is documented in
`docs/evidence/fastaguard-v0.2-evidence.md`.
The published evidence document remains v0.2-focused; v0.3 gate evidence should
extend that workflow after the gate contract is released.

CI-safe local run:

105 changes: 105 additions & 0 deletions docs/evidence/fastaguard-v0.3-evidence.md
@@ -0,0 +1,105 @@
# FastaGuard v0.3 Evidence Pack

This page records the evidence workflow for FastaGuard v0.3. The purpose is to
make the assembly gate inspectable before expanding into broader biological
profiles.

FastaGuard is FASTA preflight QC. It is not biological completeness analysis,
not assembly correctness analysis, and not contamination confirmation. Passing
the v0.3 gate means the FASTA-level contract is sane enough to continue into
downstream tools such as QUAST, BUSCO, BlobToolKit, CheckM, seqkit, or
annotation.

## Local Evidence Run

Build the release binary:

```bash
cargo build --release --locked
```

Run the CI-safe local evidence path:

```bash
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/v0.3-local \
--local-only
```

Local-only mode does not require network access or the NCBI Datasets CLI. It
runs:

- a deterministic synthetic FASTA
- `testdata/problem_assembly.fa`
- a gzipped copy of `testdata/valid_assembly.fa`

The evidence command runs FastaGuard with `--profile assembly --gate pipeline`
and keeps `--min-contig-length 1` so tiny local fixtures remain useful for
contract testing.

## Public NCBI Evidence Run

Install the NCBI Datasets CLI, then run:

```bash
cargo build --release --locked
python3 scripts/collect_evidence.py \
--binary target/release/fastaguard \
--out-dir target/evidence/v0.3
```

The public workflow downloads genomic FASTA packages with commands shaped like:

```bash
datasets download genome accession GCF_000005845.2 --include genome --filename target/evidence/v0.3/ecoli_k12_mg1655/ncbi_dataset.zip
```

If `datasets` is not installed, use `--local-only` for offline smoke tests. The
default public manifest is:

```text
docs/evidence/public_assemblies.json
```

It currently includes:

- `GCF_000005845.2`: Escherichia coli K-12 MG1655
- `GCF_000182925.2`: Neurospora crassa OR74A

## Outputs

Each case writes FastaGuard artifacts under the selected output directory, for
example `target/evidence/v0.3-local/<case>/` or `target/evidence/v0.3/<case>/`:

- `fastaguard.json`
- `fastaguard.tsv`
- `fastaguard_report.html`
- `fastaguard_mqc.json`

The workflow also writes compact summaries:

- `evidence_summary.json`
- `evidence_summary.tsv`

The summaries include verdict, gate status, blocking findings, top findings,
runtime, input size, sequence counts, and `input_sha256`. Commit compact
summaries when useful. Do not commit downloaded FASTA files, NCBI zip archives,
or full generated per-case reports.

## Interpretation

Passing the v0.3 gate means the FASTA-level contract is sane enough to continue
through the pipeline. The gate checks validity, duplicate identifiers, invalid
characters, composition red flags, gap signals, and related FASTA-level evidence.

The gate does not prove biological completeness, does not prove assembly
correctness, and does not rule out contamination. Use QUAST, BUSCO,
BlobToolKit, CheckM, sourmash, Kraken, or other tools for deeper biological
interpretation after FastaGuard has checked the FASTA-level contract.

## References

- [NCBI Datasets genome download reference](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/download/genome/)
- [NCBI Datasets genome download guide](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/download-genome/)
- [Neurospora crassa OR74A BioProject](https://www.ncbi.nlm.nih.gov/bioproject/132)