Html api docs improvement #61

Draft
sirreal wants to merge 216 commits into trunk from html-api-docs-improvement

sirreal commented Jun 12, 2026

Trac ticket:

Use of AI Tools


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 30 commits June 11, 2026 18:37
Scaffolding for the autonomous documentation-improvement loop:
- PLAN.md records the full agreed design (corpus, scoring, isolation,
  harness, round flow, revert and stopping rules).
- render-docs-markdown.py deterministically renders phpdoc-parser JSON
  to agent-readable markdown, excluding implementation leakage.
16 tasks (12 train + 4 held-out), each with a subagent-facing prompt,
a validated reference implementation, and frozen hidden test cases.
Expected outputs were generated from the references and cross-checked
against PHP's Dom\HTMLDocument where semantics overlap (text
extraction, links, tables, outlines) — all agree.

Harness executes candidates standalone (no WordPress boot) with shims
for the six WP functions the html-api files reference; each test case
runs in an isolated subprocess with a 10s timeout so parse errors,
fatals, and infinite loops are contained and reported.
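
The per-case isolation described above can be sketched in Python, the language of the loop's other scripts. This is a minimal illustration, not the harness from this branch: the function name `run_case`, the status labels, and the use of the Python interpreter as a stand-in for a PHP candidate are all assumptions.

```python
import subprocess
import sys


def run_case(argv, timeout_s=10.0):
    """Run one test case in its own process and return (status, stdout).

    Hypothetical helper: a crash or an infinite loop in the child process
    is contained and reported as a status instead of taking down the runner.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising.
        return ("timeout", "")
    if proc.returncode != 0:
        return ("error", proc.stdout)
    return ("ok", proc.stdout)


# Stand-in "candidate": a one-line script run by the current interpreter.
passing = run_case([sys.executable, "-c", "print('hi')"])
```

A non-zero exit code maps to `"error"` and a hung child to `"timeout"`, so one bad candidate cannot block the remaining cases.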
- stage-round.sh: regenerate JSON, render markdown, stage isolated
  scratch dir containing only the two markdown files.
- docs-only-guard.php: comment-stripped token-stream identity vs HEAD
  plus php -l, run before every round that follows doc edits.
- aggregate-round.py: trial/task/round scoring per PLAN.md formula.
- PROTOCOL.md: runbook with exact test-subagent and judge prompt
  templates, judge rubric, and results layout.
- docs-test-subject agent definition (Read+Grep only) for structural
  isolation in future sessions.

Pilot validated end-to-end: Sonnet test subject on T01 returned
well-formed output passing 8/8 hidden cases.
trials-workflow.js fans out one docs-only test subject per task-trial
with structured output; judge-workflow.js fans out one Opus judge per
task with the adherence rubric and doc-gap analysis; persist-trials.py
writes candidates to results/ and executes them against hidden tests.
48 Sonnet trials (16 tasks x 3) judged by 16 Opus judges.
TRAIN 93.57 / HELD-OUT 93.47. Dominant systematic failure: undocumented
closer-token depth semantics plus missing subtree-walk idiom (T03, T06,
H04). Secondary: get_modifiable_text() decoding unstated (T08, H04);
serialize_token() rewrite idiom undocumented (T12); misleading
tables-unsupported bullet (T08).
Round-0 failures in T03, T06, and held-out H04 shared one root cause:
nothing documents that a closing-tag token reports the PARENT's depth
(the element is already popped when matched on its closer). All three
T03 trials lost trailing text after nested elements by breaking their
walk loops at 'depth <= opener depth'.

get_current_depth(): state the closer rule explicitly, define depth as
breadcrumb count including non-element tokens, extend the existing
example through the closing tokens, and add the canonical
visit-every-token-inside-an-element loop (depth >= opener depth).
is_tag_closer() (HTML Processor): note that breadcrumbs and depth
reflect the parent context when matched on a closer.
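
To see why the guard must be `>=`, here is a toy Python model of the closer-depth rule stated above. It is not the HTML Processor: the token tuples, the function names, and depth counting from zero are assumptions made for illustration. Only the rule itself (a closer reports its parent's depth, and text counts as one breadcrumb) comes from the commit message.

```python
def with_depths(tokens):
    """Yield (kind, value, depth) for a flat token list (toy model)."""
    open_elements = 0
    for kind, value in tokens:
        if kind == "open":
            open_elements += 1
            depth = open_elements
        elif kind == "close":
            open_elements -= 1      # the element is already popped,
            depth = open_elements   # so the closer reports the parent's depth
        else:
            depth = open_elements + 1   # a text token sits one level below its parent
        yield kind, value, depth


def collect_text(tokens, keep_walking):
    """Collect the text inside the element opened by the first token."""
    stream = with_depths(tokens)
    _, _, opener_depth = next(stream)
    out = []
    for kind, value, depth in stream:
        if not keep_walking(depth, opener_depth):
            break
        if kind == "text":
            out.append(value)
    return "".join(out)


# Models <div>a<span>b</span>c</div>d
TOKENS = [
    ("open", "div"), ("text", "a"), ("open", "span"), ("text", "b"),
    ("close", "span"), ("text", "c"), ("close", "div"), ("text", "d"),
]

correct = collect_text(TOKENS, lambda depth, opener: depth >= opener)
buggy = collect_text(TOKENS, lambda depth, opener: depth > opener)
```

In this model the SPAN closer reports the same depth as the DIV opener, so the `>` guard stops there and loses the trailing `c`, while the `>=` guard continues until the DIV closer.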
…_token().

The docblock described the method as internal ('do not use') and steered
readers to the Tag Processor 'for access to the raw tokens' — the
opposite of the right guidance for structure-aware text collection,
which round-0 judges identified as a driver of the T06 failures (two of
three trials collected nothing).

Rewrite the description: define tokens, position next_token() as the
right tool when non-tag content matters alongside structure, document
that closers are visited for every opener (including implicit and
end-of-input closes), warn that text may split across consecutive #text
tokens, and add the canonical collect-text-of-an-element example in both
depth-guard and breadcrumbs-guard forms (both verified by execution).
@since history left as-is.
…coded text.

Round-0 judges (T08, H04) flagged that nothing states whether the
returned text has character references decoded — the single most
load-bearing fact for text extraction. Several subjects bolted on a
redundant html_entity_decode() pass, which double-decodes and corrupts
text like '&amp;amp;'.

State the decoding rule with its boundaries (decoded for #text and
RCDATA elements like TEXTAREA/TITLE; verbatim for raw text SCRIPT/STYLE
and comment interiors — all verified by execution), add a one-line
example, and note the set_modifiable_text() inverse so callers work in
decoded space on both sides.
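
The double-decoding failure is easy to reproduce outside PHP. The sketch below uses Python's `html.unescape()` as a stand-in for a second `html_entity_decode()` pass; the variable names are illustrative only.

```python
import html

# Markup whose rendered text is the literal five characters "&amp;".
source = "&amp;amp;"

decoded_once = html.unescape(source)         # what an already-decoding API returns
decoded_twice = html.unescape(decoded_once)  # the redundant second pass
```

Here `decoded_once` is the intended text `&amp;`, while `decoded_twice` collapses to a bare `&`, which is the corruption described above.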
TRAIN 98.78 (+5.21 vs baseline). 36/36 trials passed every hidden case.
T03 +13.95 (closer-depth rule + subtree-walk example), T06 +46.33
(next_token() rehabilitation), no regressions beyond judge noise.
Sonnet has plateaued >=90 for two consecutive rounds; next step per
plan is the Haiku re-baseline. Round-2 adherence targets logged.
Task-first rebalance: add six tasks forcing undercovered concepts
(class removal, contextual selection, truncated-input detection,
normalize() failure handling, full-document parsing, HTML-vs-SVG image
namespace). New held-out set: N01/N02/N05/H04; H01-H03 retired;
T01/T02 relabeled smoke. Every task now carries role/commonness/
concept/processor labels and the aggregator reports per-concept means.
All new references harness-validated; N02/N05/N06 cross-checked
against Dom\HTMLDocument (covering image->img conversion and
img-breaking-out-of-svg).
Two of three Haiku trials on the build-figure task produced correct
markup with src/alt swapped and scored 0/6 — the docs never explain
where set_attribute() puts attributes. Verified by execution: updates
replace in place keeping position; NEW attributes insert after the tag
name before existing ones; multiple new attributes sort by attribute
name regardless of call order. Document all three rules plus the
start-from-a-template idiom for when output order matters.

Also fixes a judge-discovered bug in the paused_at_incomplete_token()
example, which called the nonexistent get_next_tag() instead of
next_tag().
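
The three placement rules can be restated as a small Python model. This is only a paraphrase of the rules as worded in this commit message, not the Tag Processor's implementation; `apply_attribute_updates` and its list-of-pairs representation are hypothetical.

```python
def apply_attribute_updates(existing, updates):
    """Toy model of the three rules above.

    existing: list of (name, value) pairs in source order.
    updates:  dict of name -> value, in call order.
    """
    existing_names = {name for name, _ in existing}
    # New attributes are sorted by name regardless of call order ...
    new = sorted(
        (name, value) for name, value in updates.items()
        if name not in existing_names
    )
    # ... and updates to existing attributes keep their position.
    kept = [(name, updates.get(name, value)) for name, value in existing]
    # New attributes land before the existing ones.
    return new + kept


# Calling order src, alt on a tag that only has a class attribute:
swapped = apply_attribute_updates(
    [("class", "x")], {"src": "a.png", "alt": "A"}
)

# Start-from-a-template idiom: pre-seed the attributes in the desired order.
templated = apply_attribute_updates(
    [("src", ""), ("alt", "")], {"alt": "A", "src": "a.png"}
)
```

Under this model the first call yields `alt` before `src` even though `src` was set first, whereas the pre-seeded template keeps `src`, `alt` order because both writes are in-place updates.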
…claims.

The class docblock claimed the HTML Processor cannot process any
element inside a TABLE, any foreign content (SVG/MathML), or anything
outside the IN BODY insertion mode. All three claims are false on this
branch — round-2 trials parsed well-formed tables, SVG content, and
full documents with head content; judges traced T08's defensive
fallback code directly to this passage.

Replace with verified behavior: the processor parses these fine and
aborts only on specific constructs — foster-parented content (e.g. a
DIV directly inside TABLE) and mis-nested formatting requiring
advance-and-rewind reconstruction (e.g. '<b>one<i>two</b>three</i>'),
both confirmed by execution, with simple mis-nesting supported. Also
document how aborts surface: get_last_error(),
get_unsupported_exception(), and null from serialize()/normalize().
Round-1 judges (T12) flagged that nothing connects serialize_token()
to its purpose: subjects mixed token loops with whole-string
normalize(), unsure which was right. Document that concatenating
serialize_token() across a next_token() walk reproduces serialize(),
that the token-by-token form exists for selective rewriting (skip to
remove, emit around to wrap), and that closers of skipped elements
must be skipped too — with an execution-verified removal example.
Cross-reference guidance: serialize() for unchanged output, the loop
for transformations.
All-19 91.47 / core 90.47 / train 92.56 / held-out 87.38. Round-1
edits transfer to Haiku (T03, T06 perfect). Per-concept reporting
exposes the gaps the aggregate hides: attributes 72.2 (set_attribute
ordering), full-document 78.0 (held-out, no edit made), namespace
85.9. Round-3 hypothesis edits committed separately.
ingest-trials.py and ingest-judges.py condense per-round bookkeeping
(persist, execute, aggregate, compare, gap digest with held-out gaps
marked DO-NOT-ACT) into single commands, keeping orchestration
overhead low across the 100+ round goal.
…d edits.

Round 3 confirmed the serialize_token() idiom (round-3 H3) helped its
targets (T09 +8.6, T12 +2.2) but induced a T07 regression (-33.7): two
trials called serialize() after add_class(), got null (scanning had
begun), and returned the unmodified input. Refining rather than
reverting, disclosed in LOG: state the boundary explicitly on both
serialize() and serialize_token() — queued attribute/class/text
updates are read with the inherited get_updated_html(); serialize()
demands a fresh processor and returns null once scanning has begun;
serialization is for normalizing/rewriting, get_updated_html() for
edits.
Two of three round-3 trials on the build-figure task produced empty
captions: they matched the empty FIGCAPTION tag and called
set_modifiable_text(), which returns false there — ordinary container
elements carry no text of their own and an empty element has no #text
token to modify. Nothing documented this. State the eligible token
kinds, the empty-element limitation, the check-the-return-value rule,
and the placeholder-template idiom (verified by execution).
…X idiom.

T10 adherence sat at ~80 because the set_bookmark() docblock forbids
programmatic names without stating the supported alternative; subjects
hedged with bookmark-count workarounds. State explicitly that
re-setting an existing name MOVES the bookmark (no leak, no release
needed) and that same-name-per-match is the idiom for tracking the
last occurrence in one pass (verified by execution; the docblock's
own last-li example already relied on it silently).

Also state the documented default for next_tag()'s tag_closers option
('skip'), which round-3 judges flagged as unstated.
…, two new gaps.

All-19 87.41 / train 90.66 (-1.9) / held-out 75.22. Round-3 edits
helped their targets (T09 +8.6, T12 +2.2, N06 +10.7, N04 100) but the
serialization idiom induced T07 -33.7 (serialize() after mutations).
Refined rather than reverted, with the boundary now stated. Round-4
hypotheses committed separately.
T04 trials each absorbed exactly one of the two template-building facts
(pre-seeded attribute order in set_attribute(), placeholder text in
set_modifiable_text()) and failed on the other — the facts live in two
distant method docblocks. Add a 'Building markup from a template'
section to the class overview, where template builders first look,
stating both rules together with one execution-verified example using
a link template (deliberately unlike any corpus task).
…decoded reads; add_class idempotency.

Judges across four tasks flagged the same unstated guarantees subjects
kept inferring (correctly, but unguided):
- next_tag(): tag-name matching is ASCII case-insensitive with source
  casing preserved; comments/CDATA/rawtext can never match; truncated
  trailing tags are never matched or modified (cross-ref
  paused_at_incomplete_token()). Stated as a 'What this matches' block.
- get_attribute(): string values come back DECODED (don't decode
  again), inverse of set_attribute's encode-on-write.
- add_class(): creates/appends without disturbing existing classes;
  re-adding an existing class is a no-op with an exact byte-for-byte
  duplicate check (add 'NOTE' to class="note" appends — verified;
  an initial case-insensitive claim was caught wrong by probe before
  commit).
T08 judges noted the only depth-bounded walk example nests one level,
where >= and > behave identically, so readers can't learn which is
right. State the rule: >= is correct at any depth; > ends the walk at
the first direct-child closer (verified: with > the UL walk stops
after the first LI's contents).
…oundary, get_updated_html identity, >= warning placement.

Round-5's two single-trial collapses both trace to unstated boundaries:
a T06 trial attempted tree-aware work in the Tag Processor (whose docs
never say it lacks depth/breadcrumbs), and a T03 trial copied the
next_token() example but guessed '>' because the >= warning only
existed in get_current_depth().

- Tag Processor overview: 'Which processor should I use?' section
  stating it has NO tree awareness and where those methods live;
  HTML Processor overview gets the matching half.
- get_updated_html(): own description at last (was a copy of
  __toString's) — read-your-edits semantics, byte preservation,
  safe mid-scan.
- next_token() example now carries the >= warning inline where the
  failing trial actually read.
…ocessor, >= beside the operator, drain idiom, add_class return semantics.

Round-6 train gaps: the HTML Processor's own get_modifiable_text()
override never stated decoding or that SCRIPT/STYLE/TEXTAREA/TITLE
carry their text on the element token (no #text child) — stated now
with a verified full-parser TITLE example; the >= rule now sits beside
the operator in the get_current_depth() example with the
nested-closer/sibling-text explanation inline; the
paused_at_incomplete_token() example gains the drain-all-tokens idiom
its single-tag example obscured; add_class() return documented as
enqueued-not-applied (false only with no matched tag, verified).
…boundary rule.

Round-7's only functional miss (T05 5/9) sliced multibyte text without
an explicit encoding; the docs say UTF-8 is the only supported input
but never said the output of get_modifiable_text() is UTF-8 nor showed
the mb_* explicit-encoding idiom — stated on both classes now. T08's
recurring boundary confusion appears in break-form code that the
continue-form-only warning misses: stated the equivalence
(break at < depth, never <= depth).
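
The multibyte hazard can be shown by analogy in Python, where slicing `bytes` plays the role of a byte-oriented cut and slicing `str` plays the role of an `mb_*` call with an explicit UTF-8 encoding. The sample string is arbitrary.

```python
text = "na\u00efve caf\u00e9"   # "naïve café": 10 characters
raw = text.encode("utf-8")      # 12 bytes, since two characters take two bytes each

byte_slice = raw[:3]            # cuts through the two-byte "ï"
char_slice = text[:3]           # cuts on a character boundary

try:
    byte_slice.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False
```

The byte-based cut leaves an incomplete UTF-8 sequence, so `byte_cut_is_valid` is `False`, while the character-based slice returns the first three characters intact.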
…e last-X bookmark idiom.

T08's recurring failure class is nested walk loops double-advancing
the single cursor: the inner collect-until-close loop exits already
matched on the next region's boundary token, which the outer loop's
next_token() then skips. Document the one-cursor contract on
next_token() with the closer-driven single-pass state-machine shape
(verified DT example), noting it stays reliable on malformed input
because closers are always visited. Also surface the
re-set-the-same-bookmark-name idiom in the overview bookmarks
narrative where T10 trials kept missing it.
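
The double-advance failure can be reproduced with any shared iterator. The Python sketch below is a simplified analogy: the flat `("dt", ...)` / `("dd", ...)` tuples and both function names are assumptions, and it omits closer tokens, which the documented shape uses to drive state changes.

```python
TOKENS = [
    ("dt", "A"), ("dd", "1"), ("dd", "2"),
    ("dt", "B"), ("dd", "3"),
    ("dt", "C"), ("dd", "4"),
]


def nested_loops(tokens):
    """Buggy shape: inner and outer loops advance the same cursor."""
    cursor = iter(tokens)
    groups = {}
    for kind, value in cursor:
        if kind == "dt":
            groups[value] = []
            for inner_kind, inner_value in cursor:
                if inner_kind == "dt":
                    # The boundary token is consumed here; the outer loop's
                    # next advance then moves past it.
                    break
                groups[value].append(inner_value)
    return groups


def single_pass(tokens):
    """One loop, one advance per token, state carried in a variable."""
    groups = {}
    current = None
    for kind, value in tokens:
        if kind == "dt":
            current = value
            groups[current] = []
        elif kind == "dd" and current is not None:
            groups[current].append(value)
    return groups


lossy = nested_loops(TOKENS)
complete = single_pass(TOKENS)
```

The nested version drops the `B` group entirely because its `dt` token was consumed as the inner loop's exit condition; the single-pass version sees every token exactly once.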

sirreal commented Jun 15, 2026

I reviewed the branch docs, LOG.md, and source-doc commit history. I did not update files; the relevant source-doc changes are already represented in the log.

The four most impactful changes are:

  1. Canonical subtree traversal and depth-boundary guidance
    Source areas: src/wp-includes/html-api/class-wp-html-processor.php (Recipe: scan a region before editing its opener, Recipe: test subtree membership and direct children).
    This taught next_token(), closer-depth semantics, bounded region scans, and direct-child checks. Evidence: round 1 train moved 93.57 -> 98.78, with 36/36 trials passing hidden cases; T03 rose +13.95 -> 100 and T06 rose +46.33 -> 99.8. On the refreshed corpus, round 19 moved N03-first-list-count 85.07 -> 100.00, with all three trials passing 11/11.

  2. Processor-choice and tree-awareness boundary
    Source area: src/wp-includes/html-api/class-wp-html-tag-processor.php (Which processor should I use?).
    This made clear when to use the Tag Processor versus the HTML Processor, especially that the Tag Processor has no depth/breadcrumb/tree awareness. Evidence: round 6 recorded All-19 95.92 / train 97.84 (+3.1) / held-out 88.69, with T06 +24.5 and T08 +20.0; the log attributes those gains to the chooser/tree-awareness boundary landing.

  3. Text-extraction policy: ordinary #text versus special-element opener text
    Source areas: src/wp-includes/html-api/class-wp-html-processor.php (Recipe: collect DOM-style text from a subtree, Quick policy table).
    This prevented subjects from treating every modifiable-text token as ordinary DOM text. Evidence: round 47 rose 98.18 -> 99.55 train, all 45 trials passed hidden cases, and target tasks stayed strong: T03 100.00, T05 98.00, T06 99.40, T08 99.40, N06 99.20. Judges explicitly credited the promoted table and reminders.

  4. Output/rewrite guidance: templates, get_updated_html(), and serialize_token() fallback policy
    Source areas: src/wp-includes/html-api/class-wp-html-tag-processor.php (Building markup from a template, get_updated_html()), src/wp-includes/html-api/class-wp-html-processor.php (Recipe: rewrite while serializing tokens).
    This clarified how to build markup safely, how to read queued edits, and how not to discard token-loop rewrites. Evidence: round 4 fixed the serialization/readback regression with T07 +35.0 -> 100; round 5's template section drove T04 +49.2 -> 98.6 and attributes 74.7 -> 99.3; round 56 later confirmed the serialization fallback wording with train 99.51 -> 99.61, serialization 98.85 -> 99.35, T09 99.10 -> 99.40, and T12 98.60 -> 99.30.

I excluded later reduction/simplification attempts from the top four because the recorded evidence says most were not promotable, and round 80/81 were explicitly partial or losing results rather than clear impact wins.

sirreal commented Jun 17, 2026

HTML API docs improvement handoff

Scope reviewed: source docblock changes in src/wp-includes/html-api/class-wp-html-processor.php and src/wp-includes/html-api/class-wp-html-tag-processor.php, plus the experiment evidence in doc-experiment/LOG.md and persisted round summaries.

Criteria used here:

  • Count only documentation changes with explicitly recorded test evidence.
  • Do not infer impact from prose quality or from unmeasured intent.
  • Do not count scratch/reduction variants that the log says not to promote.
  • Held-out-only evidence is treated as sentinel signal, not as edit-driving evidence.
  • Mixed rounds are included only when the log states a target result clearly and the caveat is noted.

Recommended improvements to preserve or upstream

  1. Canonical subtree traversal and depth-boundary guidance
    Source areas: class-wp-html-processor.php (Recipe: scan a region before editing its opener, Recipe: test subtree membership and direct children).
    Why it matters: teaches next_token(), closer-depth semantics, bounded region scans, and direct-child checks.
    Evidence: round 1 train moved 93.57 -> 98.78, with 36/36 trials passing hidden cases; T03 rose +13.95 -> 100 and T06 rose +46.33 -> 99.8. On the refreshed corpus, round 19 moved N03-first-list-count 85.07 -> 100.00, with all three trials passing 11/11.

  2. Processor-choice and tree-awareness boundary
    Source area: class-wp-html-tag-processor.php (Which processor should I use?).
    Why it matters: makes clear when to use the Tag Processor versus the HTML Processor, especially that the Tag Processor has no depth/breadcrumb/tree awareness.
    Evidence: round 6 recorded All-19 95.92 / train 97.84 (+3.1) / held-out 88.69, with T06 +24.5 and T08 +20.0; the log attributes those gains to the chooser/tree-awareness boundary landing.

  3. Text-extraction policy: ordinary #text versus special-element opener text
    Source areas: class-wp-html-processor.php (Recipe: collect DOM-style text from a subtree, Quick policy table).
    Why it matters: prevents treating every modifiable-text token as ordinary DOM text.
    Evidence: round 47 rose 98.18 -> 99.55 train, all 45 trials passed hidden cases, and target tasks stayed strong: T03 100.00, T05 98.00, T06 99.40, T08 99.40, N06 99.20. Judges explicitly credited the promoted table and reminders.

  4. Output/rewrite guidance: templates, get_updated_html(), and serialize_token() fallback policy
    Source areas: class-wp-html-tag-processor.php (Building markup from a template, get_updated_html()), class-wp-html-processor.php (Recipe: rewrite while serializing tokens).
    Why it matters: clarifies how to build markup safely, how to read queued edits, and how not to discard token-loop rewrites.
    Evidence: round 4 fixed the serialization/readback regression with T07 +35.0 -> 100; round 5's template section drove T04 +49.2 -> 98.6 and attributes 74.7 -> 99.3; round 56 later confirmed serialization fallback wording with train 99.51 -> 99.61, serialization 98.85 -> 99.35, T09 99.10 -> 99.40, and T12 98.60 -> 99.30.

  5. WP_HTML_Processor::next_tag() cursor/search contract
    Source area: class-wp-html-processor.php method docs for next_tag().
    Why it matters: clarifies cursor-relative searches, no rewind after failed searches, and single-tag-name query behavior.
    Evidence: round 32 moved train 98.31 -> 99.67, all 45 trials passed hidden cases, and T07-nested-lists recovered 81.13 -> 99.30.

  6. RCDATA/RAWTEXT exception placed on the walk path
    Source area: class-wp-html-processor.php next_token() walk guidance.
    Why it matters: places the “special element text lives on the opener” warning where subtree-walk users actually read.
    Evidence: round 10 hit a new high, T08 +10.0 -> 96.8, with 8/8 in every trial.

  7. UTF-8 and explicit mb_* measurement guidance
    Source areas: get_modifiable_text() docs in both processor classes.
    Why it matters: states decoded text is UTF-8 and examples should pass explicit encodings to mb_strlen() / mb_substr().
    Evidence: round 8 recorded T05 +14.0 -> 99.3 and train 97.70, which was a new high at the time.

  8. Single shared cursor and single-pass state-machine guidance
    Source area: class-wp-html-processor.php next_token() docs.
    Why it matters: documents that nested next_token() loops consume the same cursor and can skip boundary tokens.
    Evidence: round 9 recorded train 98.66 (+1.0), T08 +8.7 -> 86.8 with no sub-50% trials, T10 +2.6, and 17/19 tasks functionally perfect.

  9. Corrected HTML Processor support/unsupported-markup claims
    Source area: class-wp-html-processor.php supported/unsupported markup section.
    Why it matters: replaces misleading unsupported-element claims with verified abort conditions and documents how aborts surface.
    Evidence: round 3 was mixed overall, but the log explicitly attributes N06 +10.7 to the support-claims rewrite, with N04 at 100. Caveat: the same round also recorded a separate serialization-induced T07 regression, so this is not a clean aggregate win.

  10. Construction asymmetry: Tag Processor uses new, HTML Processor uses factories
    Source areas: class overview and get_modifiable_text() placement reminders in both processor docs.
    Why it matters: addresses hallucinated WP_HTML_Tag_Processor::create_fragment() / wrong-class factory calls.
    Evidence: round 14 identifies the train failure as a T05 trial hallucinating WP_HTML_Tag_Processor::create_fragment(); after the source clarification, round 15 reports T05 back to 9/9x3 and T08 +15.5.

  11. Implied structure and opener-only next_tag() default
    Source areas: class-wp-html-processor.php next_tag() and next_token() docs.
    Why it matters: teaches that plain next_tag() visits openers only, and that parser-inserted structure like TBODY appears in walks.
    Evidence: round 13 was the first fully clean campaign round: 45/45 trials passed 343/343 hidden cases, with T08 +20.7 -> 96.9 and T06 +5.9 -> 99.6.

  12. Causal equality rule for >= subtree guards
    Source area: class-wp-html-processor.php get_current_depth() docs.
    Why it matters: makes explicit that a child closer can report the same depth as the matched ancestor opener, which is why >= is required.
    Evidence: round 11 records T03 +5.2 -> 98.9 and identifies the stated-causally equality rule as the reason.

  13. Drain-all-tokens idiom before checking paused_at_incomplete_token()
    Source area: class-wp-html-tag-processor.php paused_at_incomplete_token() docs.
    Why it matters: clarifies that truncation status is meaningful after scanning to the stopping point.
    Evidence: round 7 records N03 -> 100, failure-handling concept 100, and 13/15 tasks functionally perfect; the log names the “drain idiom” as the N03 fix.

  14. Bookmark re-set/last-match idiom
    Source area: class-wp-html-tag-processor.php bookmark docs.
    Why it matters: documents that setting the same bookmark name moves it, enabling a single-pass “last matching tag” pattern.
    Evidence: smaller but explicit: round 4 records T10 +2.5 after the bookmark reset change, and round 9 records T10 +2.6 after surfacing the last-X bookmark idiom.

Not recommended from the later reduction pass

Do not promote the later scratch/reduction variants from rounds 62-81 as positive improvements. The recorded evidence says the tested reductions generally lost against the comparable baseline or were explicitly not promotable. In particular:

  • Round 70’s private-helper pruning was promising but not clean, and round 78 failed to reproduce it.
  • Round 79 salvage improved some failures but still failed against the weak-tier source-doc baseline.
  • Round 80 bounded-subtree clarification only partially helped and scored below the round-56 weak-tier source-doc baseline.
  • Round 81 method-local boundary-warning scratch wording improved N03 but damaged other tasks badly, so it should not be promoted.

Handoff notes

  • The strongest source-doc value is in contract placement: facts worked best when stated where task authors naturally read (next_token(), next_tag(), get_current_depth(), get_modifiable_text(), get_updated_html()), not only in broad overviews.
  • Preserve the distinction between general API guidance and task-shaped examples. The experiment repeatedly rejected changes that looked concise but removed cues weaker subjects relied on.
  • When upstreaming, keep evidence-backed semantics even if phrasing is rewritten for WordPress documentation style. The evidence is for the contracts, placement, and examples, not necessarily every sentence verbatim.
