HTML API: Apply input preprocessing consistently at Tag Processor read boundaries#53

Open
sirreal wants to merge 20 commits into trunk from spec-compliant-getters

sirreal commented Jun 11, 2026

Owner

Summary

  • Apply HTML input preprocessing consistently at Tag Processor read boundaries.
  • Normalize source carriage returns in attributes and replace source NULL bytes in exposed attribute and tag names.
  • Keep API-supplied values verbatim while preserving lossless raw-span updates.

Testing

  • New input-preprocessing and decoder PHPUnit coverage with browser-verified expectations.
  • HTML API and html5lib PHPUnit groups and PHPCS pass.
  • codex review --base trunk.

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools

AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: PR description cleanup and code review.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 19 commits June 11, 2026 18:29
Red TDD step: browser-verified expectations for raw CR/CRLF/NUL in
attribute values; passing pins for encoded CR/NUL character references
and for verbatim pass-through of API-supplied values.

See #65372.
Attribute values read from the input document now normalize newlines
(CRLF/CR to LF) and replace U+0000 NULL bytes with U+FFFD before
decoding character references, matching what browsers produce for the
same markup. Values enqueued through set_attribute() are plaintext API
values and continue to pass through unchanged.

See #65372.
Red TDD step: flushing add_class()/remove_class() updates must read the
existing class attribute through the same input preprocessing as
get_attribute(), normalizing newlines and replacing NULL bytes.

See #65372.
class_name_updates_to_attributes_updates() reads the existing class
value through the same preprocessing helper as get_attribute(), so
add_class()/remove_class() no longer rebuild the attribute from raw
source bytes containing CR or NULL.

See #65372.
Red TDD step: browser-verified expectations that attribute names are
exposed and addressed with U+FFFD replacing NULL bytes, that names
collapsing after replacement behave as duplicates of one attribute,
and that attribute updates target the replaced name.

See #65372.
Attribute lookup keys are normalized where they are created, in
parse_next_attribute(): NULL bytes are replaced with U+FFFD before
lowercasing, as the tokenizer does in browsers. Names which collapse
to the same replaced name are duplicates of one attribute (first one
wins), lookups by the raw NULL spelling no longer match, and updates
or removals by the replaced name target the source attribute. Raw
document spans are untouched.

See #65372.
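
The name handling described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `sketch_attribute_key()` and `sketch_collect_attributes()` are hypothetical names, and the input is assumed to be a list of raw name/value pairs already split out of the tag.

```php
<?php
/**
 * Hypothetical helper: build the lookup key for a raw attribute name.
 * NUL bytes become U+FFFD first, then ASCII letters are lowercased.
 */
function sketch_attribute_key( string $raw_name ): string {
	$replaced = str_replace( "\0", "\u{FFFD}", $raw_name );

	// strtr() with explicit alphabets keeps the lowercasing ASCII-only.
	return strtr( $replaced, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

/**
 * Hypothetical helper: collect attributes keyed by normalized name.
 * Names that collapse to the same key are duplicates; the first wins.
 *
 * @param array $pairs Raw [ name, value ] pairs in source order.
 * @return array Values keyed by normalized name.
 */
function sketch_collect_attributes( array $pairs ): array {
	$attributes = array();
	foreach ( $pairs as list( $raw_name, $value ) ) {
		$key = sketch_attribute_key( $raw_name );
		if ( ! array_key_exists( $key, $attributes ) ) {
			$attributes[ $key ] = $value;
		}
	}
	return $attributes;
}
```

With this sketch, the pairs `[ "Attr\0", 'first' ]` and `[ "attr\u{FFFD}", 'second' ]` collapse to the single key `"attr\u{FFFD}"` holding `'first'`, and a lookup by the raw NUL spelling finds nothing.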
Red TDD step: tag names are exposed with U+FFFD replacing NULL bytes;
passing pins confirm NULL bytes never select rawtext parsing and never
appear in PI-lookalike comment tag names.

See #65372.
get_tag() (and get_token_name(), which delegates to it) returns tag
names with U+0000 NULL bytes replaced by U+FFFD, as the tokenizer does
in browsers. Internal token identification continues to compare raw
bytes: a NULL byte in a tag name already prevents rawtext detection,
matching browsers, where the replaced name likewise never equals
SCRIPT or the other special names.

See #65372.
Red TDD step: browser-verified expectation that classList-equivalent
reads preserve NULL bytes in values set through the API; the U+0000
replacement belongs to the tokenizer, and document-sourced values
already receive it in get_attribute().

See #65372.
class_list() received its NULL-byte replacement when reading raw class
values; that replacement now happens in get_attribute() for values
from the input document. Performing it on API-supplied values diverged
from browsers, where classList preserves NULL bytes in values set via
setAttribute().

See #65372.
Benchmark-guided: reading an attribute value applies up to three
str_replace() passes, which doubled the read cost for long values
containing no bytes needing replacement. Guarding with strpos() keeps
the common case at two fast scans; values are typically free of CR and
NULL.

Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents reading
3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms,
guarded 699ms. Reading a 10.8KB clean attribute value 200k times:
trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is
the unavoidable byte inspection.

See #65372.
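
As a rough illustration of the guard described above, the following sketch scans for CR and NUL with strpos() and only runs str_replace() when a scan finds something. `sketch_preprocess_value()` is a hypothetical name, and character reference decoding is deliberately left out, so this is not the PR's actual read path.

```php
<?php
/**
 * Hypothetical helper: newline normalization and NUL replacement,
 * each guarded by a strpos() scan so clean values are never copied.
 */
function sketch_preprocess_value( string $raw ): string {
	// Common case: no CR present, so this is one scan and no replacement.
	if ( false !== strpos( $raw, "\r" ) ) {
		// CRLF collapses first so it does not turn into two line feeds.
		$raw = str_replace( "\r\n", "\n", $raw );
		$raw = str_replace( "\r", "\n", $raw );
	}

	// Common case: no NUL present, so this is the second scan.
	if ( false !== strpos( $raw, "\0" ) ) {
		$raw = str_replace( "\0", "\u{FFFD}", $raw );
	}

	return $raw;
}
```

A value without CR or NUL costs two scans and is returned unchanged; only values that contain those bytes pay for the up to three replacement passes.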
Red TDD step from adversarial review: a named character reference
without a terminating semicolon must decode when followed by a NULL
byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding
fed the decoder a multi-byte follower whose classification by
ctype_alnum() depends on the process locale, suppressing valid decodes
in attribute values, diverging from browsers and from trunk.

See #65372.
The tokenizer replaces U+0000 NULL bytes as it consumes input, so a
character reference without a terminating semicolon sees the raw NULL
byte as its follower, which is unambiguous, and the reference decodes.
Replacing before decoding handed the decoder U+FFFD's lead byte, whose
ctype_alnum() classification depends on the process locale, wrongly
suppressing the decode under UTF-8 locales. No character reference
decodes into NULL, so replacing after decoding is equivalent for the
value's own bytes and faithful to the tokenizer's order.

See #65372.
Per the named-character-reference state, a semicolon-less reference is
ambiguous only when followed by an ASCII alphanumeric or equals sign.
ctype_alnum() classifies bytes 0x80 and above as alphanumeric under
UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII
byte and making decoding depend on the process locale.

See #65372.
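
A locale-independent version of the follower check described above might look like the sketch below. `sketch_is_ambiguous_follower()` is a hypothetical name; the function compares byte values directly instead of calling ctype_alnum(), so bytes 0x80 and above and the NULL byte are never treated as alphanumeric.

```php
<?php
/**
 * Hypothetical helper: does the byte at $at make a semicolon-less
 * named character reference ambiguous inside an attribute value?
 * Only "=" and ASCII alphanumerics do.
 */
function sketch_is_ambiguous_follower( string $document, int $at ): bool {
	if ( $at >= strlen( $document ) ) {
		return false; // End of input: nothing follows the reference.
	}

	$code = ord( $document[ $at ] );

	return (
		0x3D === $code ||                      // "="
		( $code >= 0x30 && $code <= 0x39 ) ||  // "0" to "9"
		( $code >= 0x41 && $code <= 0x5A ) ||  // "A" to "Z"
		( $code >= 0x61 && $code <= 0x7A )     // "a" to "z"
	);
}
```

For example, the follower in `"&not\0"` or `"&not\u{FFFD}"` is not ambiguous, so the reference would decode, while the follower in `"&notit"` or `"&not="` is ambiguous.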
Red TDD step from adversarial review: next_tag() must match tag names
in the same U+FFFD-replaced alphabet that get_tag() exposes, so the
getter round-trips into queries, raw NULL spellings match nothing, and
the Tag Processor agrees with the HTML Processor, whose queries
already compare against the replaced token name.

See #65372.
next_tag() compared sought tag names against raw document bytes while
get_tag() returns names with NULL bytes replaced by U+FFFD, breaking
the getter-to-query round trip and disagreeing with the HTML
Processor's queries. Matching now happens in the exposed alphabet; the
existing byte comparison is unchanged for names without NULL bytes, so
the hot path costs the same.

See #65372.
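
The matching rule described above could be sketched as follows. `sketch_tag_name_matches()` is a hypothetical name, and the real implementation compares bytes inside the document rather than extracted strings; the point here is only that the NUL-free path stays a plain comparison.

```php
<?php
/**
 * Hypothetical helper: compare a raw tag name from the document with a
 * sought name in the exposed (U+FFFD-replaced) alphabet.
 */
function sketch_tag_name_matches( string $raw_name, string $sought ): bool {
	// Hot path: names without NUL bytes use the plain comparison.
	if ( false === strpos( $raw_name, "\0" ) ) {
		return 0 === strcasecmp( $raw_name, $sought );
	}

	// Rare path: compare against what get_tag() would expose.
	$exposed = str_replace( "\0", "\u{FFFD}", $raw_name );
	return 0 === strcasecmp( $exposed, $sought );
}
```

Under this sketch a raw name of `"div\0"` matches a query for `"DIV\u{FFFD}"` but not a query for `"div\0"`.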
Red TDD step from adversarial review: get_attribute( 'CLASS' )
returned a stale value when class updates were pending, because the
flush guard compared the attribute name case-sensitively.

See #65372.
Attribute lookups are ASCII-case-insensitive, but the pending-class
flush in get_attribute() compared the requested name case-sensitively,
returning a stale value for spellings like "CLASS".

See #65372.
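
A minimal sketch of the corrected guard, assuming a hypothetical helper name `sketch_is_class_attribute()`: the requested name is compared with strcasecmp() so that every ASCII case spelling of "class" triggers the pending-class flush.

```php
<?php
/**
 * Hypothetical helper: does the requested attribute name refer to the
 * class attribute, ignoring ASCII case?
 */
function sketch_is_class_attribute( string $requested_name ): bool {
	return 0 === strcasecmp( 'class', $requested_name );
}
```

A case-sensitive `'class' === $requested_name` check would miss `'CLASS'` and `'Class'`, which is the stale-read situation described above.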
From adversarial review: pins for class helpers over replaced source
values, boolean attributes with NULL-byte names, verbatim prefix
matching in get_attribute_names_with_prefix(), and HTML Processor
end-tag matching across NULL and U+FFFD spellings (browser-verified:
both spellings tokenize to the same name). Documents the @since 7.1.0
behavior on indirectly-affected getters and the known asymmetry of
set_modifiable_text(), whose value reads back normalized unlike
attribute values, which round-trip verbatim.

See #65372.
github-actions Bot commented Jun 11, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

sirreal added a commit that referenced this pull request Jun 11, 2026
# Conflicts:
#	src/wp-includes/html-api/class-wp-html-tag-processor.php
sirreal commented Jun 17, 2026

Owner Author

This is an issue:

Input HTML (below "␀" === U+0000 NUL byte)

<div␀ attr␀ class='c␀'>

The NUL bytes appear largely preserved by the getters (depending on exactly which one is used). The actual HTML is more like the following:

<div� attr� class="c�">

Getter functions are inconsistent

$p = WP_HTML_Processor::create_fragment("<div\0 attr\0 class='c\0'>");
$p->next_tag();
echo "get_tag()\n";
print_r(unpack('C*', $p->get_tag()));
echo "\nget_attribute_names_with_prefix()\n";
print_r(unpack('C*', $p->get_attribute_names_with_prefix('')[0] ));
echo "\nget_attribute('class')\n";
print_r(unpack('C*', $p->get_attribute('class')));
echo "\nclass_list()\n";
print_r(unpack('C*', iterator_to_array($p->class_list())[0]));
echo "\nhas_class('c\u{FFFD}')\n";
var_export($p->has_class("c\u{FFFD}"));
echo "\nhas_class('c␀')\n";
var_export($p->has_class("c\x00"));

The NUL bytes are expected to be replaced with U+FFFD REPLACEMENT CHARACTER.

get_tag()
Array
(
    [1] => 68
    [2] => 73
    [3] => 86
    [4] => 0
)

get_attribute_names_with_prefix()
Array
(
    [1] => 97
    [2] => 116
    [3] => 116
    [4] => 114
    [5] => 0
)

get_attribute('class')
Array
(
    [1] => 99
    [2] => 0
)

class_list()
Array
(
    [1] => 99
    [2] => 239
    [3] => 191
    [4] => 189
)

has_class('c�')
true
has_class('c␀')
false

sirreal added this to the HTML API confirmed fuzz PRs milestone Jun 17, 2026

sirreal commented Jun 17, 2026

Owner Author

This is also relevant for \r (U+000D CARRIAGE RETURN): it is subject to normalization rules in input preprocessing, but it can also appear as an HTML character reference, in which case preprocessing does not remove it.

For example (␍ === U+000D CR byte, ␊ === U+000A LINE FEED byte):

<hr attr="&#x0D;&#x0A;"><hr attr="␍␊">

This reports the tree

├─HR attr="␍␊"
└─HR attr="␍␊"

But the expected tree is:

├─HR attr="␍␊"
└─HR attr="␊"

The ␍␊ byte sequence should become a single ␊, while the character references &#x0D;&#x0A; are literally ␍␊ in the decoded attribute value.

See #42
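
The expected behavior for the two attributes above can be sketched like this. Both function names are hypothetical, and the decoder is a toy that only handles hexadecimal numeric references for U+0001 through U+007F, which is all this example needs; it is not the HTML API's decoder.

```php
<?php
/**
 * Toy decoder (an assumption for this sketch): decodes hexadecimal
 * numeric references for U+0001 through U+007F and leaves the rest.
 */
function sketch_decode_hex_refs( string $text ): string {
	$decoded = preg_replace_callback(
		'/&#x([0-9a-f]+);/i',
		static function ( array $m ): string {
			$code_point = hexdec( $m[1] );
			return ( $code_point > 0 && $code_point < 0x80 ) ? chr( (int) $code_point ) : $m[0];
		},
		$text
	);

	return $decoded ?? $text;
}

/**
 * Hypothetical read path: preprocess raw source bytes first, then decode.
 */
function sketch_read_attribute( string $raw ): string {
	// 1. Source CRLF and lone CR both become a single LF.
	$normalized = str_replace( array( "\r\n", "\r" ), "\n", $raw );

	// 2. References decode afterwards, so a decoded CR is kept as-is.
	return sketch_decode_hex_refs( $normalized );
}
```

Here `sketch_read_attribute( '&#x0D;&#x0A;' )` yields the two bytes CR LF, while `sketch_read_attribute( "\r\n" )` yields a single LF, matching the expected tree.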

1 participant