Add spec-quality-first philosophy to autonomous workflow #43
Conversation
Inspired by the "5 Levels of AI Coding" framework, shift both the autonomous development rule and autotask command from implementation-centric to outcome-centric thinking. The key insight: spec quality is the bottleneck in autonomous development, not implementation speed.

Changes:
- autonomous-development-workflow.mdc v3.0: Add spec quality framing, outcome evaluation, thin duplicated content to let autotask own operational detail
- autotask.md v2.2: Add spec quality gate in task-preparation scaled by complexity, outcome evaluation in completion-verification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review

This PR makes a meaningful philosophical shift — from implementation-centric to outcome-centric autonomous workflow — and the execution is solid. Here's my assessment:

Strengths

- "Spec Quality Is the Bottleneck" framing is well-done. It's behavioral, not motivational. Telling an LLM "a precise understanding of the problem produces better code than any amount of iteration on an ambiguous one" gives it a concrete mental model to act on — not just encouragement. This is the right kind of content for LLM-directed rules.
- Complexity-scaled gates (quick/balanced/deep) are practical. Scaling spec evaluation to task complexity avoids the failure mode of demanding a full spec for a typo fix. The concrete definition for "quick" ("single file, no behavioral impact") is a real improvement over the old "self-evident."
- "Evaluate Outcomes" is a genuine semantic upgrade over "Self-Review." The shift from "would we approve this in code review" to "does this changeset solve the problem cleanly" changes what an LLM optimizes for. Good.
- Emoji removal is consistent with project conventions (only use when explicitly requested).
- Consolidating repeated outcome evaluation (was in 3 places, now 1) is good DRY hygiene.

Issues

1. Escape hatch in the workflow rule may undermine the intent

2. Trailing redundancy in task-preparation

The final two sentences of the new section overlap with it: the "If unclear, ask" instruction is now covered by the detailed spec evaluation bullets above. The "if clear, proceed" part still earns its place, but consider trimming to just: "If ambiguity remains after evaluation, use AskUserQuestion before proceeding."

3. "Single file" quick gate may over-block trivial multi-file changes

Some genuinely trivial changes span files — renaming a constant used in two places, bumping a version number across several files.

4. "Document any assumptions" needs a destination

Document where? PR description, code comments, inline markdown? Without a target, this instruction is vague enough that an LLM might satisfy it in unhelpful ways (a comment inside a function, for example). Adding "in the PR description" would make it actionable.

5. Version bump scale

2.0 → 3.0 for the workflow rule feels like major version inflation for what is refinement + one new section. Not blocking, but worth considering whether the project version semantics are intended to signal breaking changes (3.0 implies the interface is incompatible with 2.0).

Minor
Overall

The core ideas in this PR are sound and the multi-review process clearly caught real issues before submission. The "Spec Quality Is the Bottleneck" addition + "Evaluate Outcomes" reframe are both net improvements. The issues above are mostly precision/clarity gaps, not conceptual problems. Addressing items 1 and 2 would tighten the PR meaningfully; items 3-5 are worth considering but not blocking.

🤖 Reviewed with Claude Code
- Fix /load-rules as prescriptive path with explicit fallback
- Document assumptions target clarified to PR description
- Quick gate: behavioral impact + localized (not file count)
- Trim redundant trailing sentences from task-preparation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed (4):
WONTFIX (2):
Fixes landed in 7964929.
Code Review

Overall this is a solid improvement. The shift from implementation-centric to outcome-centric framing is well-executed, and the complexity-scaled spec gates are cleanly integrated into the existing quick/balanced/deep structure. The multi-review process the PR went through shows.

Issues Worth Addressing

1. Terminology imprecision: "customer-centric"

"Customer-centric" assumes a commercial product context. This codebase is also used for internal tools, libraries, and scripts where "customer" is a stretch. "User-centric" or "intentional" would be more accurate across use cases.

2. Quick gate definition has an internal contradiction

Config tweaks frequently do have behavioral impact (timeouts, feature flags, thresholds). The condition should either be tightened or the "config tweaks" example dropped.

3. Tension between the Cursor rule's absolute framing and autotask's nuanced gates

4. "Evaluate Outcomes" lost some behavioral specificity

The old Self-Review section had a concrete behavioral anchor (re-reading the git diff); the replacement retains the outcome framing but loses that concrete first step.

Minor / Observations

- The spec quality gate addition in autotask.md
Autotask v2.3.0: Fix motivation-based language in bot-feedback-loop, simplify error-recovery to boundaries, replace ghost code-reviewer agent with real names, normalize sub-agents→agents terminology, remove redundant Key Principles/Configuration/Notes sections.

Marketplace v9.17.0: Bump all three version locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review

Overall, this is a well-considered change that makes both documents more principled and outcome-focused. The "spec quality as bottleneck" framing is genuinely useful and the task-preparation expansion adds concrete criteria where the old version had vague process steps. A few things worth addressing:

Issues

Version inconsistency in the PR description

Error recovery information loss

The new version replaces all of this with three sentences. For an autonomous agent, "retry once before escalating" is not the same as knowing when to retry vs. when to stop. The condensation aligns with the document's new principle-based style, but at the cost of actionable guidance that agents need. Consider whether this belongs in a referenced document or if the specific cases should be preserved inline.

Removed Requirements/Configuration sections

Suggestions

- On the /load-rules change: this is a good portability improvement. One nit: the fallback path hardcodes .cursor/rules/.
- On the quick gate: this is good, but "config tweaks" could be ambiguous — a config change can absolutely have behavioral impact. Tightening to "single-file changes with no behavioral impact" (as noted in the PR description as the intent) would be more precise.
- On the completion check: this is the right instinct. The framing could be slightly tighter: "will it do what was asked" is more directly evaluable than "will it make sense," which could mean many things.

Positives

Summary

The philosophical direction is right and the execution is mostly clean. The version number discrepancy should be fixed before merge. The error-recovery and removed sections deserve a deliberate decision: either document where that info lives now, or accept that the loss is intentional and note it in the PR description.
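The completion check discussed above ("will it do what was asked") is a judgment call, but part of outcome evaluation is mechanically checkable from the diff itself. A hypothetical sketch of the kind of cheap red-flag scan a git diff review could bolt on — `diff_red_flags` and its heuristics are illustrative inventions, not anything in the actual rule files:

```python
import re

def diff_red_flags(diff_text: str) -> list[str]:
    """Scan a unified diff for cheap outcome-evaluation red flags.

    Hypothetical helper: an empty diff means nothing was actually
    changed, and leftover debug prints suggest the changeset is not
    clean. Semantic questions still need an agent or human review.
    """
    flags = []
    if not diff_text.strip():
        flags.append("empty diff: no change was actually made")
    for line in diff_text.splitlines():
        # only inspect added lines, not context, removals, or headers
        if line.startswith("+") and not line.startswith("+++"):
            if re.search(r"\b(print|console\.log|debugger)\b", line):
                flags.append(f"possible debug leftover: {line[1:].strip()}")
    return flags
```

Checks like these complement, rather than replace, the "does this changeset solve the problem cleanly" question.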
- customer-centric → user-centric (broader than commercial context)
- Remove config tweaks from quick gate examples (config changes can have behavioral impact)
- Soften ambiguity gate to acknowledge quick path explicitly
- Lead Evaluate Outcomes with git diff (concrete action before philosophy)
- Fix missing newline before </context-preservation>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review

Overall, this is a well-reasoned PR. The shift from process-centric to outcome-centric thinking is the right direction, and the multi-review process clearly improved the content before submission. Here are my observations:

Strengths

Issues to Consider

1. Version number inconsistency in PR description

2. Error-recovery condensation

The new version distills this to two general principles.

3. Spec quality gate

4. Removed sections lose useful context

5. Accessibility dropped from ux-designer

Neutral Observations

Summary

The core idea is sound and the execution is mostly good. The spec quality gate and outcome-centric framing are genuine improvements. The main things worth addressing before merge are the version number inconsistency and the issues above.
- Add explicit fallthrough for quick gate with behavioral impact: "Otherwise, treat as balanced"
- Remove hardcoded .cursor/rules/ path from fallback (tool-agnostic now)
- Update PR description to correct version number (v2.1→v2.3 not v2.1→v2.2) and document design decisions

WONTFIX: error-recovery oversimplification (--no-verify still explicit, other condensation intentional per prompt engineering review), removed requirements/config sections (agent-facing not user-facing), accessibility removed from ux-designer (project policy), completion check wording (both angles covered)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
👎 WONTFIX on error-recovery, requirements sections, and accessibility:

Error-recovery: prohibition is still explicit in the current text ("Never use --no-verify. Never silently swallow errors."). The per-scenario guidance (push rejected → pull and rebase, sub-agent failures → retry with simplified scope) was intentionally condensed to principles per a prompt engineering review that found the old version over-prescribed specific failure sequences. The merge conflicts/auth failures/CI issues categories are preserved.

Removed requirements/configuration sections: These sections were agent-facing operational docs, not user-facing setup guides. Agents running /autotask read the workflow sections to understand the goal; the removed content duplicated what the workflow sections already express. First-time users learn from project setup, not from a command's internal config detection list.

Accessibility removed from ux-designer: Project policy per CLAUDE.md is to decline accessibility suggestions. Listing it as a ux-designer scope would cause agents to select that agent for reviews we don't want.

Fixed in commit 79f837e: quick gate fallthrough ("Otherwise, treat as balanced"), hardcoded path removed, PR description version corrected.
Code Review

Overall this is a solid improvement. The philosophical shift from implementation-centric to outcome-centric framing is well-executed, and the spec quality gate design is thoughtful. A few observations:

Strengths

- Spec quality gate design — The quick/balanced/deep gates with explicit fallthrough ("Otherwise, treat as balanced") are well-reasoned. The most common failure mode in spec gates is treating too many things as "quick"; the explicit fallthrough guards against that.
- Outcome-centric self-review — "Does this changeset solve the problem cleanly?" is a better anchor than "would we approve this in code review." The distinction matters: the former is user-value-oriented, the latter is approval-oriented. Good framing shift.
- Agent name normalization
- Removing emoji from operational prose

Issues to Consider

Version jump in autotask.md

Error-recovery condensation: The PR argues "principles over decision trees" — fair, but the original specificity had value for novel failure scenarios agents haven't seen. "Retry once before escalating" is underspecified: retry what? The command, the whole phase, or just the tool call? The original's pattern of "retry once with simplified scope" for sub-agent failures was actionable. Consider: does "retry once" apply to all failure types uniformly, or does that need a word or two of guidance?

Accessibility justification: The PR description states "project policy per CLAUDE.md is to decline accessibility suggestions" as justification for removing accessibility from ux-designer.

Removed Sections

The removed Requirements/Configuration/Key Principles/Notes sections: the justification holds for agent-facing docs. The Configuration section (auto-detection of hooks, test runners, linting) is useful context, but agents can observe this directly at runtime. No objection to the removal.

Minor

Verdict: The design decisions are sound.
Summary
Applies the "spec quality as bottleneck" insight from Dan Shapiro's 5 Levels of AI Coding framework to our autonomous workflow tooling. The core idea: an AI that spends more time understanding the problem than implementing it produces better outcomes than one that iterates on an ambiguous spec.
autonomous-development-workflow.mdc v2.0 → v3.0: Added "Spec Quality Is the Bottleneck" section, replaced "Self-Review" with "Evaluate Outcomes" (shifted from code approval framing to problem-solving framing), fixed /load-rules portability, made fallback path tool-agnostic

autotask.md v2.1 → v2.3: Added <task-preparation> section with complexity-scaled spec gates (quick/balanced/deep), added outcome check to <completion-verification>, applied prompt engineering review fixes (motivation-based language, real agent names, normalized terminology, removed redundant sections)

Design Decisions
Spec quality gate scales with complexity — demanding a full spec for a typo fix is a failure mode. Quick tasks with no behavioral impact proceed directly; balanced tasks verify expected behavior; deep tasks write a formal spec. Quick gate has explicit fallthrough: behavioral impact → treat as balanced.
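The gate scaling above is simple enough to express as a decision function. A minimal sketch under stated assumptions — the actual rule is prose in autotask.md, not code, and the parameter names here are hypothetical:

```python
def spec_gate(localized: bool, behavioral_impact: bool, novel_design: bool) -> str:
    """Classify a task into the quick/balanced/deep spec gates.

    Hypothetical encoding of the prose rule: localized tasks with no
    behavioral impact proceed directly; anything with behavioral
    impact falls through to balanced; novel design work goes deep.
    """
    if novel_design:
        return "deep"      # write a formal spec before implementing
    if localized and not behavioral_impact:
        return "quick"     # typo fixes, comment edits: proceed directly
    return "balanced"      # explicit fallthrough: verify expected behavior
```

Note the fallthrough is the default return, mirroring "Otherwise, treat as balanced": a task never lands in "quick" merely because it fails the "deep" test.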
Outcome-centric not implementation-centric — "does this changeset solve the problem cleanly?" vs "would we approve this in code review." The git diff check leads the Evaluate Outcomes section as the concrete behavioral anchor.

Principles over decision trees in error-recovery — the previous version over-prescribed specific failure sequences. The current version preserves the critical safety constraints (--no-verify prohibition explicit) and fallback categories (merge conflicts, auth failures, CI issues) while removing per-scenario procedural steps.

Removed Requirements/Configuration/Key Principles sections — these are agent-facing operational docs. Agents running /autotask read the workflow sections, not marketing bullets. The removed content duplicated what the workflow sections already express.

Accessibility removed from ux-designer agent description — project policy per CLAUDE.md is to decline accessibility suggestions. Listing it would cause agents to select ux-designer for accessibility reviews we don't want.
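The "retry once before escalating" principle retained in error-recovery can be sketched as a tiny wrapper. This is a hypothetical illustration, not the rule's actual mechanism — the prose leaves what counts as one retry to the agent:

```python
def with_single_retry(action, escalate):
    """Run an action; on failure retry once, then escalate.

    Hypothetical sketch of "retry once before escalating":
    `action` raises on failure; `escalate` receives the final
    exception so errors are surfaced, never silently swallowed.
    """
    try:
        return action()
    except Exception:
        try:
            return action()          # single retry, no backoff loop
        except Exception as exc:
            return escalate(exc)     # hand off; do not swallow the error
```

The design choice the sketch makes explicit: escalation is a call, not a dropped exception, which matches the "never silently swallow errors" constraint.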
Complexity Level
balanced — multi-file rule changes with design decisions about how to express the spec-quality concept across two different document types (command vs. rule)
Validation Performed