Add spec-quality-first philosophy to autonomous workflow#43

Merged
TechNickAI merged 5 commits into main from spec-quality-autonomous-workflow
Feb 19, 2026

Conversation

@TechNickAI (Owner) commented Feb 19, 2026

Summary

Applies the "spec quality as bottleneck" insight from Dan Shapiro's 5 Levels of AI Coding framework to our autonomous workflow tooling. The core idea: an AI that spends more time understanding the problem than implementing it produces better outcomes than one that iterates on an ambiguous spec.

  • autonomous-development-workflow.mdc v2.0 → v3.0: Added "Spec Quality Is the Bottleneck" section, replaced "Self-Review" with "Evaluate Outcomes" (shifted from code approval framing to problem-solving framing), fixed /load-rules portability, made fallback path tool-agnostic
  • autotask.md v2.1 → v2.3: Added <task-preparation> section with complexity-scaled spec gates (quick/balanced/deep), added outcome check to <completion-verification>, applied prompt engineering review fixes (motivation-based language, real agent names, normalized terminology, removed redundant sections)
  • Marketplace v9.16.0 → v9.17.0: Bumped all three version locations

Design Decisions

Spec quality gate scales with complexity — demanding a full spec for a typo fix is a failure mode. Quick tasks with no behavioral impact proceed directly; balanced tasks verify expected behavior; deep tasks write a formal spec. Quick gate has explicit fallthrough: behavioral impact → treat as balanced.
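The gate-selection logic described above can be sketched as follows. This is purely illustrative: the actual gate lives as prose instructions in autotask.md, and every name here is hypothetical.

```python
from enum import Enum

class Gate(Enum):
    QUICK = "proceed without deep evaluation"
    BALANCED = "verify expected behavior before implementing"
    DEEP = "write a formal spec first"

def spec_gate(has_behavioral_impact: bool, is_localized: bool, complexity: str) -> Gate:
    """Hypothetical sketch of the complexity-scaled spec quality gate."""
    if complexity == "deep":
        return Gate.DEEP
    if complexity == "quick":
        # Explicit fallthrough: a quick-scoped task that has behavioral
        # impact (or isn't localized) still gets balanced-level evaluation.
        if has_behavioral_impact or not is_localized:
            return Gate.BALANCED
        return Gate.QUICK
    return Gate.BALANCED
```

The point of the sketch is the fallthrough branch: "quick" is a scope label, not a license to skip evaluation when behavior changes.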

Outcome-centric not implementation-centric — "does this changeset solve the problem cleanly?" vs "would we approve this in code review." The git diff check leads the Evaluate Outcomes section as the concrete behavioral anchor.

Principles over decision trees in error-recovery — the previous version over-prescribed specific failure sequences. The current version preserves the critical safety constraints (--no-verify prohibition explicit) and fallback categories (merge conflicts, auth failures, CI issues) while removing per-scenario procedural steps.

Removed Requirements/Configuration/Key Principles sections — these are agent-facing operational docs. Agents running /autotask read the workflow sections, not marketing bullets. The removed content duplicated what the workflow sections already express.

Accessibility removed from ux-designer agent description — project policy per CLAUDE.md is to decline accessibility suggestions. Listing it would cause agents to select ux-designer for accessibility reviews we don't want.

Complexity Level

balanced — multi-file rule changes with design decisions about how to express the spec-quality concept across two different document types (command vs. rule)

Validation Performed

  • validate CI check passing
  • Multi-review with prompt-engineer, architecture-auditor, style-reviewer agents pre-submission
  • Three rounds of claude[bot] review feedback incorporated

Inspired by the "5 Levels of AI Coding" framework, shift both the autonomous
development rule and autotask command from implementation-centric to
outcome-centric thinking. The key insight: spec quality is the bottleneck in
autonomous development, not implementation speed.

Changes:
- autonomous-development-workflow.mdc v3.0: Add spec quality framing, outcome
  evaluation, thin duplicated content to let autotask own operational detail
- autotask.md v2.2: Add spec quality gate in task-preparation scaled by
  complexity, outcome evaluation in completion-verification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Feb 19, 2026

Code Review

This PR makes a meaningful philosophical shift — from implementation-centric to outcome-centric autonomous workflow — and the execution is solid. Here's my assessment:


Strengths

"Spec Quality Is the Bottleneck" framing is well-done. It's behavioral, not motivational. Telling an LLM "a precise understanding of the problem produces better code than any amount of iteration on an ambiguous one" gives it a concrete mental model to act on — not just encouragement. This is the right kind of content for LLM-directed rules.

Complexity-scaled gates (quick/balanced/deep) are practical. Scaling spec evaluation to task complexity avoids the failure mode of demanding a full spec for a typo fix. The concrete definition for "quick" ("single file, no behavioral impact") is a real improvement over the old "self-evident."

"Evaluate Outcomes" is a genuine semantic upgrade over "Self-Review." The shift from "would we approve this in code review" to "does this changeset solve the problem cleanly" changes what an LLM optimizes for. Good.

Emoji removal is consistent with project conventions (only use when explicitly requested).

/load-rules portability fix is the right call — rules/ is a repo-local symlink, not a general pattern.

Consolidating repeated outcome evaluation (was in 3 places, now 1) is good DRY hygiene.


Issues

1. Escape hatch in the workflow rule may undermine the intent

In autonomous-development-workflow.mdc:

Load project standards via `/load-rules` or read rules in `.cursor/rules/`.

The or gives LLMs an easy off-ramp to skip /load-rules and just read rules directly. If /load-rules is the preferred path, make it prescriptive — drop the alternative or invert the phrasing: "Read rules via /load-rules; if not available, fall back to .cursor/rules/."

2. Trailing redundancy in <task-preparation>

The final two sentences of the new <task-preparation> section are now partially redundant:

If the task description is unclear or ambiguous, use AskUserQuestion to clarify
requirements. If clear, proceed to planning or implementation based on complexity level.

The "If unclear, ask" instruction is now covered by the detailed spec evaluation bullets above. The "if clear, proceed" part still earns its place, but consider trimming to just: "If ambiguity remains after evaluation, use AskUserQuestion before proceeding."

3. "Single file" quick gate may over-block trivial multi-file changes

Some genuinely trivial changes span files — renaming a constant used in two places, bumping a version number across package.json and a changelog. The intent is right (prevent premature implementation on behavioral changes), but "single file" is a proxy that will occasionally fire on non-behavioral edits. Something like "no behavioral impact, localized change" might be more durable than the file count.

4. "Document any assumptions" needs a destination

In autonomous-development-workflow.mdc:

Document any assumptions or choices the reviewer needs to understand.

Document where? PR description, code comments, inline markdown? Without a target, this instruction is vague enough that an LLM might satisfy it in unhelpful ways (a comment inside a function, for example). Adding "in the PR description" would make it actionable.

5. Version bump scale

2.0 → 3.0 for the workflow rule feels like major version inflation for what is refinement + one new section. Not blocking, but worth considering whether the project version semantics are intended to signal breaking changes (3.0 implies the interface is incompatible with 2.0).


Minor

  • The -- em-dash convention used in autotask.md ("not just passes tests -- this final check") is inconsistent with the --- separator style used elsewhere in the file. Small but noticeable.
  • The test plan items are all unchecked — expected for a workflow/rule change, but noting for completeness.

Overall

The core ideas in this PR are sound and the multi-review process clearly caught real issues before submission. The "Spec Quality Is the Bottleneck" addition + "Evaluate Outcomes" reframe are both net improvements. The issues above are mostly precision/clarity gaps, not conceptual problems. Addressing items 1 and 2 would tighten the PR meaningfully; items 3-5 are worth considering but not blocking.

🤖 Reviewed with Claude Code

- Fix /load-rules as prescriptive path with explicit fallback
- Document assumptions target clarified to PR description
- Quick gate: behavioral impact + localized (not file count)
- Trim redundant trailing sentences from task-preparation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TechNickAI (Owner, Author) commented:

Fixed (4):

  • Issue 1: /load-rules is now prescriptive with explicit fallback (If not available, fall back to .cursor/rules/)
  • Issue 2: Redundant trailing sentences in <task-preparation> trimmed to single line
  • Issue 3: Quick gate rewritten from file-count proxy to behavioral impact + locality
  • Issue 4: "Document assumptions" now targets "PR description" explicitly

WONTFIX (2):

  • Issue 5 (version bump scale): 3.0 is intentional — this is a philosophical reframe of what autonomous development means, not a refinement. The interface (what users pass in) is unchanged but the agent's orientation is fundamentally different. Treating that as minor felt wrong.
  • Minor em-dash: -- is the correct prose convention throughout this file; --- only appears in YAML frontmatter. No inconsistency.

Fixes landed in 7964929.

claude bot commented Feb 19, 2026

Code Review

Overall this is a solid improvement. The shift from implementation-centric to outcome-centric framing is well-executed, and the complexity-scaled spec gates are cleanly integrated into the existing quick/balanced/deep structure. The multi-review process the PR went through shows.

Issues Worth Addressing

1. Terminology imprecision in autotask.md:122

"Ambiguous specs produce software that fills gaps with AI guesses instead of customer-centric decisions."

"Customer-centric" assumes a commercial product context. This codebase is also used for internal tools, libraries, and scripts where "customer" is a stretch. "User-centric" or "intentional" would be more accurate across use cases.


2. Quick gate definition has an internal contradiction (autotask.md:134-135)

"If the task has no behavioral impact and is a localized change (typo fixes, comment updates, config tweaks)"

Config tweaks frequently do have behavioral impact (timeouts, feature flags, thresholds). The condition should either be:

  • no behavioral impact AND localized change — but then config tweaks shouldn't be listed as examples
  • OR just localized single-file change with clear requirements — consistent with the existing quick complexity level definition at line 46-47

3. Tension between the Cursor rule's absolute framing and autotask's nuanced gates

autonomous-development-workflow.mdc says:

"If the task description is ambiguous, clarify before spending tokens on implementation."

autotask.md correctly allows quick tasks to skip deep evaluation. When both are in context simultaneously, an agent following the Cursor rule literally would pause on quick tasks that autotask explicitly says should proceed. The rule could add "— see autotask for complexity-scaled evaluation" or soften from "clarify before spending tokens" to something that acknowledges the nuanced approach.


4. "Evaluate Outcomes" lost some behavioral specificity

The old Self-Review section had a concrete behavioral anchor:

"Run git diff and review every change, as a senior developer would."

The replacement retains git diff but buries it in more philosophical framing. The question "does this changeset solve the problem cleanly?" is good, but the section could lose one of the softer sentences to preserve the concrete action as the lead instruction.


Minor / Observations

  • <context-preservation> closing tag has no newline before it (line 118 in autotask.md) — pre-existing issue but easy fix while in the file
  • The -- em-dash substitution is consistent with existing conventions, not a concern
  • Version bump to 3.0.0 on the Cursor rule is justified given the conceptual reframe
  • Removing the emoji from the Success section is the right call

The spec quality gate addition in <completion-verification> ("The implementation solves the stated problem — not just passes tests") is the strongest addition. That framing should stick.

Autotask v2.3.0: Fix motivation-based language in bot-feedback-loop,
simplify error-recovery to boundaries, replace ghost code-reviewer
agent with real names, normalize sub-agents→agents terminology,
remove redundant Key Principles/Configuration/Notes sections.

Marketplace v9.17.0: Bump all three version locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude bot commented Feb 19, 2026

Code Review

Overall, this is a well-considered change that makes both documents more principled and outcome-focused. The "spec quality as bottleneck" framing is genuinely useful and the task-preparation expansion adds concrete criteria where the old version had vague process steps. A few things worth addressing:


Issues

Version inconsistency in autotask.md
The PR description says v2.1 → v2.2, but the file diff shows `version: 2.1.0` → `version: 2.3.0`. One of these is wrong. Either update the PR description or correct the version bump.

Error recovery information loss
The old <error-recovery> section gave agents specific handling strategies per failure type:

  • Git conflicts → pause for user resolution
  • Push rejected → pull and rebase if safe
  • Auth issues → run gh auth status
  • Rate limits → log and suggest waiting
  • Sub-agent failures → retry once with simplified scope

The new version replaces all of this with three sentences. For an autonomous agent, "retry once before escalating" is not the same as knowing when to retry vs. when to stop. The condensation aligns with the document's new principle-based style, but at the cost of actionable guidance that agents need. Consider whether this belongs in a referenced document or if the specific cases should be preserved inline.

Removed Requirements/Configuration sections
The deleted sections listed concrete dependencies (gh CLI, Node.js, npm) and auto-detection capabilities (detects hooks, test runners, lint configs). This is user-facing context that helps people know what's needed before running /autotask. If this info is moving elsewhere, the PR description should say where; if it's just being removed, that's a regression for first-time users.


Suggestions

autonomous-development-workflow.mdc — "Before Implementation" section
The old version said "Read all cursor rules in rules/." The new version says:

Load project standards via /load-rules. If that's not available, fall back to reading rules in .cursor/rules/ directly.

This is a good portability improvement. One nit: the fallback path (.cursor/rules/) is Cursor-specific, which is somewhat ironic given the motivation was cross-tool portability. Could say "fall back to reading applicable rules directly" without hardcoding the path.

autotask.md — spec quality gate wording
In <task-preparation>, the quick gate reads:

If the task has no behavioral impact and is a localized change (typo fixes, comment updates, config tweaks), proceed without deep evaluation.

This is good, but "config tweaks" could be ambiguous — a config change can absolutely have behavioral impact. Tightening to "single-file changes with no behavioral impact" (as noted in the PR description as the intent) would be more precise.

<completion-verification> — outcome check wording

If a user encounters this feature tomorrow, will it make sense?

This is the right instinct. The framing could be slightly tighter: "will it do what was asked" is more directly evaluable than "will it make sense," which could mean many things.


Positives

  • The "Spec Quality Is the Bottleneck" section is substantive, not motivational — good call distinguishing these in the PR description.
  • Complexity-scaled spec evaluation (quick/balanced/deep) maps cleanly to the existing complexity model. No new concepts introduced.
  • "sub-agents" → "agents" is a good consistency cleanup.
  • Adding architecture-auditor and security-reviewer as named agents instead of vague roles (code-reviewer) improves precision.
  • Removing the emoji from the Success section and elsewhere is consistent with the project's professional tone.
  • The bot-feedback-loop section now explains why it matters ("catches issues the author missed") rather than just asserting it's mandatory. More persuasive.

Summary

The philosophical direction is right and the execution is mostly clean. The version number discrepancy should be fixed before merge. The error-recovery and removed sections deserve a deliberate decision: either document where that info lives now, or accept that the loss is intentional and note it in the PR description.

- customer-centric → user-centric (broader than commercial context)
- Remove config tweaks from quick gate examples (config changes can have behavioral impact)
- Soften ambiguity gate to acknowledge quick path explicitly
- Lead Evaluate Outcomes with git diff (concrete action before philosophy)
- Fix missing newline before </context-preservation>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude bot commented Feb 19, 2026

Code Review

Overall, this is a well-reasoned PR. The shift from process-centric to outcome-centric thinking is the right direction, and the multi-review process clearly improved the content before submission. Here are my observations:

Strengths

  • "Spec Quality Is the Bottleneck" is strong framing. It changes the mental model from "follow a checklist" to "understand the problem first" — that's a meaningful behavioral difference.
  • Outcome-centric completion check ("does this solve the stated problem, not just pass tests") is valuable. The reflection prompt before reporting completion is particularly good.
  • Spec quality gate scaled by complexity is well-structured. The concrete criteria (problem clarity, acceptance criteria, edge cases, assumptions) give the AI something actionable to evaluate, not just "think about it."
  • agent rename (from sub-agents) is a clean simplification with no information loss.
  • Specific agent types in <implementation> (architecture-auditor, security-reviewer instead of generic code-reviewer) are more actionable.

Issues to Consider

1. Version number inconsistency in PR description
The PR summary says "autotask.md v2.1 → v2.2" but the actual diff shows v2.1.0 → v2.3.0. Minor, but the PR metadata is inaccurate.

2. <error-recovery> may be over-simplified
The old version had scenario-specific guidance that was genuinely useful:

  • "Merge conflicts → pause for user resolution"
  • "Hook failures → fix the issue, never use --no-verify"
  • "Push rejected → pull and rebase if safe, ask if not"

The new version distills this to two general principles. The --no-verify prohibition in particular was explicit and important — an AI under pressure to complete a task might not infer this from "Never bypass safety checks." Consider preserving at least the most critical failure-specific guidance.

3. Spec quality gate for quick tasks with behavioral impact
The quick gate says "no behavioral impact and localized change" → proceed. But the quick complexity level definition includes "single function" changes, which can have behavioral impact. There's a gap: what should an AI do when a task is quick scope but does have behavioral impact? The old code just fell through to AskUserQuestion; now there's no clear path for that case.

4. Removed sections lose useful context
The dropped "Requirements" (gh CLI, Node.js), "Configuration" (detect hooks, test runners), and "Key Principles" sections had operational value. An AI running autotask in a fresh environment no longer gets reminded to verify gh auth status before attempting PR creation. The error-recovery section previously covered auth failures explicitly ("Auth issues → run gh auth status, inform user"); now that's gone too.

5. Accessibility dropped from ux-designer bullet
`ux-designer: User-facing text, accessibility, UX consistency` became `ux-designer: User-facing text, UX consistency`. Removing "accessibility" from the stated scope may cause it to be overlooked when selecting agents for accessibility-relevant changes.

Neutral Observations

  • The Cursor rule's fallback ("If /load-rules not available, fall back to reading rules in .cursor/rules/ directly") is self-referential since the rule lives in .cursor/rules/ — but this is intentional and fine in practice.
  • The test plan items are all unchecked. These would be worth verifying before merge, especially the "no contradictions when both files are in context simultaneously" check.

Summary

The core idea is sound and the execution is mostly good. The spec quality gate and outcome-centric framing are genuine improvements. The main things worth addressing before merge are the <error-recovery> simplification (which loses concrete, safety-critical guidance) and the quick gate gap for tasks with behavioral impact.

- Add explicit fallthrough for quick gate with behavioral impact: "Otherwise, treat as balanced"
- Remove hardcoded .cursor/rules/ path from fallback (tool-agnostic now)
- Update PR description to correct version number (v2.1→v2.3 not v2.1→v2.2) and document design decisions

WONTFIX: error-recovery oversimplification (--no-verify still explicit, other condensation intentional per prompt engineering review), removed requirements/config sections (agent-facing not user-facing), accessibility removed from ux-designer (project policy), completion check wording (both angles covered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TechNickAI (Owner, Author) commented:

👎 WONTFIX on error-recovery, requirements sections, and accessibility:

Error-recovery: prohibition is still explicit in the current text ("Never use --no-verify. Never silently swallow errors."). The per-scenario guidance (push rejected → pull and rebase, sub-agent failures → retry with simplified scope) was intentionally condensed to principles per a prompt engineering review that found the old version over-prescribed specific failure sequences. The merge conflicts/auth failures/CI issues categories are preserved.

Removed requirements/configuration sections: These sections were agent-facing operational docs, not user-facing setup guides. Agents running /autotask read the workflow sections to understand the goal; the removed content duplicated what the workflow sections already express. First-time users learn from project setup, not from a command's internal config detection list.

Accessibility removed from ux-designer: Project policy per CLAUDE.md is to decline accessibility suggestions. Listing it as a ux-designer scope would cause agents to select that agent for reviews we don't want.

Fixed in commit 79f837e: quick gate fallthrough ("Otherwise, treat as balanced"), hardcoded path removed, PR description version corrected.

@TechNickAI TechNickAI merged commit c45f3da into main Feb 19, 2026
2 checks passed
@TechNickAI TechNickAI deleted the spec-quality-autonomous-workflow branch February 19, 2026 05:36
claude bot commented Feb 19, 2026

Code Review

Overall this is a solid improvement. The philosophical shift from implementation-centric to outcome-centric framing is well-executed, and the spec quality gate design is thoughtful. A few observations:


Strengths

Spec quality gate design — The quick/balanced/deep gates with explicit fallthrough ("Otherwise, treat as balanced") are well-reasoned. The most common failure mode in spec gates is treating too many things as "quick"; the explicit fallthrough guards against that.

Outcome-centric self-review — "Does this changeset solve the problem cleanly?" is a better anchor than "would we approve this in code review." The distinction matters: the former is user-value-oriented, the latter is approval-oriented. Good framing shift.

Agent name normalization — Splitting code-reviewer → architecture-auditor + security-reviewer aligns the doc with real agent names. Same with sub-agents → agents. Both are correctness fixes.

Removing emoji from operational prose — The removed emoji (the 🎉 among them) were noise in agent-facing docs. Right call.


Issues to Consider

Version jump in autotask.md: v2.1.0 → v2.3.0 skips v2.2.0. If this reflects intermediate work, fine — but if it's a typo, the version should be v2.2.0. Minor, but version numbers are how consumers track what changed.

<error-recovery> condensation trade-off: The original gave per-scenario guidance (git failures, GitHub CLI auth, rate limits, PR creation failures, sub-agent failures). The new version gives principles only:

Recover from failures without bypassing safety checks. Never use --no-verify. Never
silently swallow errors. Retry once before escalating.

The PR argues "principles over decision trees" — fair, but the original specificity had value for novel failure scenarios agents haven't seen. "Retry once before escalating" is underspecified: retry what? The command, the whole phase, or just the tool call? The original's pattern of "retry once with simplified scope" for sub-agent failures was actionable. Consider: does "retry once" apply to all failure types uniformly, or does that need a word or two of guidance?
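To make the ambiguity concrete, one scoped interpretation of "retry once before escalating" can be sketched like this. Everything here is hypothetical (the command names, the `simplify` hook); it is not part of the actual tooling, just an illustration of retrying a single step with optionally narrowed scope.

```python
import subprocess

def run_with_one_retry(cmd: list[str], simplify=None):
    """Run one step; retry it once, then escalate by raising.

    `simplify` optionally narrows the command on retry, mirroring the old
    "retry once with simplified scope" guidance for sub-agent failures.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return result
    retry_cmd = simplify(cmd) if simplify else cmd
    result = subprocess.run(retry_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Escalate: surface the failure instead of silently swallowing it,
        # and never bypass safety checks (e.g. no --no-verify).
        raise RuntimeError(f"step failed twice: {result.stderr.strip()}")
    return result
```

Whether the retry unit is the tool call (as above), the phase, or the whole command is exactly the word or two of guidance the condensed rule currently omits.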

Accessibility justification: The PR description states "project policy per CLAUDE.md is to decline accessibility suggestions" as justification for removing accessibility from ux-designer. That policy isn't visible in the CLAUDE.md files loaded in this session. Either the policy lives elsewhere (fine), or the justification is forward-referencing an unwritten policy. Worth verifying the referenced policy actually exists, or noting where it lives.

<task-preparation> and <planning> overlap: Both sections use the quick/balanced/deep scaffolding. <task-preparation> is about understanding the problem; <planning> is about approach. The distinction is real, but a reader new to the doc might not see why both sections ask them to assess complexity level. A one-line bridge between the two ("After completing spec evaluation, scale planning as follows") would clarify the sequence.


Removed Sections

The removed Requirements/Configuration/Key Principles/Notes sections: the justification holds for agent-facing docs. The Configuration section (auto-detection of hooks, test runners, linting) is useful context, but agents can observe this directly at runtime. No objection to the removal.


Minor

  • The <bot-feedback-loop> now has motivation framing ("Bot feedback catches issues the author missed...") before the mechanics. That's better than the previous "This phase is MANDATORY" imperative opener — motivation lands better than directives in agent prompts.
  • Version bumped consistently in all three locations (marketplace.json, plugin.json, autotask.md). No gaps.

Verdict: The design decisions are sound. The error-recovery condensation and the version jump are the two items worth a second look before merge. Neither is blocking if intentional.
