HTML API: CSS selectors with token processor parsing#70
Draft
sirreal wants to merge 172 commits into
Draft
Conversation
The length guard before the attribute matcher required 4 remaining bytes where the minimum valid tail `=x]` is 3, so a valid exact-match attribute selector with a single-character unquoted value at the end of the selector string (e.g. `[a=b]`) was wrongly rejected as unparseable. Relax the guard from `>=` to `>`. All reads after the guard are bounded: the operator reads touch at most offset+1, and every later read re-checks the length itself. Adds the exact-fit valid case and invalid cases at the same boundary (`[a=]`, `[a~=]`, `[a==b]`, `[a=1]`) to the parse tests, plus an assertNotNull so parse failures report cleanly instead of erroring on a null property read. Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 3 in FINDINGS.md). (cherry picked from commit 16d03e2)
Per Selectors level 4, the substring attribute matchers with an empty value — [x^=""], [x$=""], [x*=""] — represent nothing and must never match. The matcher instead matched any element carrying the attribute (prefix and contains) or an element whose attribute value was exactly empty (suffix). Add an early return for the empty operand on those three matchers, before the case modifier and boolean-attribute normalization so [x^="" i] and valueless attributes are covered too. [x=""], [x|=""], and [x~=""] are unaffected and remain spec-correct: exact and hyphen matchers may match an empty value, and the one-of matcher already matched nothing because a whitespace-delimited list never yields an empty item. Tests pin all of these, including |= against a hyphen-prefixed value. https://www.w3.org/TR/selectors-4/#attribute-substrings Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 2 in FINDINGS.md). (cherry picked from commit 0cefeb2)
consume_escaped_codepoint() read the escaped codepoint for non-hex
(identity) escapes with mb_substr( $input, $offset, 1 ), but $offset is
a byte offset while mb_substr()'s second argument is a character index.
Any multibyte content earlier in the selector string shifts the read one
character right per continuation byte, decoding the wrong codepoint:
'Ü\sup' parsed as ident 'Üuup', and the corruption threads across an
entire selector list ('#ÜÜÜ,\sup #x' parsed the second selector's type
as ' up'). Depending on the mis-decoded codepoint this also caused
spurious parse failures of valid selectors. Hex escapes were already
byte-correct and are unaffected.
Read the codepoint from the byte offset instead. ASCII-only inputs are
byte-for-byte unchanged, and the returned codepoint's byte length keeps
the offset advancing exactly past it.
Adds parse_ident and parse_string cases pinning identity escapes after
multibyte characters, plus a hex-escape control.
Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 1 in
FINDINGS.md).
(cherry picked from commit 7419a9f)
Per CSS Syntax 3, a backslash followed by EOF is a valid escape in
ident context -- §4.3.8 rejects only a newline as the second code
point, and EOF is not a newline -- and consuming it returns U+FFFD
REPLACEMENT CHARACTER (§4.3.7). WP rejected the whole selector:
next_two_are_valid_escape() required a code point after the backslash,
so '.foo\' parsed to null instead of the class "foo\u{FFFD}".
Fix: consume_escaped_codepoint() returns U+FFFD at EOF without
advancing, and next_two_are_valid_escape() accepts a backslash as the
final byte. String context is unaffected: parse_string() guards EOF
itself before consuming an escape, preserving the §4.3.5 'do nothing'
EOF rule ('foo\ still parses to foo).
Review of the fix surfaced a second bug in the same family:
normalize_selector_input() trimmed *trailing* whitespace before
tokenizing, so '.foo\ ' (escaped space: the valid, unmatchable class
'foo ') and ".foo\\n" (invalid escape: must be rejected) both
collapsed to '.foo\' and matched elements with class "foo\u{FFFD}" --
a wrong-match-set bug, where before the EOF-escape fix the collapse
was a harmless fail-safe rejection. Now only leading whitespace is
stripped; the grammar already consumes insignificant trailing
whitespace via parse_whitespace() in both selector-list parsers.
Verified against lexbor: '.foo\' matches class "foo\u{FFFD}", lone
'\' parses as type U+FFFD and matches nothing, '.foo\ ' is valid and
matches nothing, and the LF/CR/FF escape variants are rejected --
exact agreement on all probes. (NEXT-STEPS.md 'candidate finding 4',
now confirmed and closed.)
(cherry picked from commit 203858b)
Per CSS Syntax 3 §5.4.8/§4.3.5, tokenization auto-closes unterminated
simple blocks and unterminated strings at EOF (a parse error, but the
block/string is returned), and the selector grammar then applies to the
block contents. So '[att=val' is the same selector as '[att=val]', and
'[att="a b' carries the string value 'a b'. WP rejected all of these
with null.
The attribute parser now treats the end of input like a closing ']' at
the two positions where the grammar is complete (after the name, and
after the value/modifier), and the early length guards that required
room for a closing bracket are relaxed accordingly. Truncation inside
the grammar itself is still invalid: '[', '[a=', '[a~', '[a=b x', and
a comma inside the open block ('[a=b, div') all stay null.
Escape interplay (verified per spec and in Chromium): '[a=b\' carries
the value "b\u{FFFD}" (escape at EOF in ident context), while
'[a="b\' carries 'b' (backslash-then-EOF in a string 'does nothing').
'[a\]' parses as a presence selector for the attribute 'a]' (the
escaped ']' joins the ident and EOF closes the block).
Chromium agrees with every accepted and rejected form above. lexbor
rejects all EOF-truncated forms (it does not implement the auto-close
rule) and diverges from browsers and the spec here; the fuzzer's lexbor
differential is unaffected because it compares canonical re-renders,
which always include the closing bracket.
(cherry picked from commit 5eea359)
HTML defines 46 attributes (type, rel, lang, dir, media, hreflang, http-equiv, ...) whose values must match ASCII case-insensitively in attribute selectors on an HTML element when the selector carries no i/s modifier: https://html.spec.whatwg.org/multipage/semantics-other.html#case-sensitivity-of-selectors WP honored only the explicit modifiers, so [type=TEXT] silently failed to match <input type="text"> — a wrong match set rather than a refusal, invisible to callers. The matcher now folds case when all three hold: no modifier on the selector, the element is in the html namespace (per the processor's get_namespace()), and the lowercased attribute name is in the list. An explicit s modifier still forces case-sensitive matching, per Selectors 4 §6.3: 'the UA must match the value case-sensitively ... regardless of document language rules.' All six matchers and |='s hyphen check honor the rule via the existing case-insensitive comparison branches. Namespace scoping follows the spec's 'on an HTML element' wording: SVG/MathML elements keep case-sensitive matching, while elements at HTML integration points (e.g. inside <svg><foreignObject>) fold, since they are html-namespace. Verified in Chromium, which agrees on the integration point but also folds plain SVG-namespace elements, diverging from the spec's scoping; WP follows the spec. The standalone Tag Processor tracks no namespaces and folds everywhere — the same class of approximation as its ancestor-blind matching. The review panel machine-diffed both list constants against the live spec (exact, in spec order). (cherry picked from commit 40640d1)
phpcbf reports 40 WordPress.Arrays.MultipleStatementAlignment warnings in this file's data providers, and the coding-standards workflow runs phpcs over the test suite without -n, so warnings fail CI. Pure whitespace; no test changes. (cherry picked from commit 3db43ea)
The identity arm of consume_escaped_codepoint() read one character via mb_substr( substr( $input, $offset ), 0, 1 ), copying the entire remaining input per escape: O(n^2) over selectors composed of escapes, plus an O(n) temporary allocation each time. Size the code point in place instead with the bounded scanner _wp_scan_utf8( $input, $at, $invalid_length, 4, 1 ) from compat-utf8.php (WP 6.9, loaded unconditionally before the HTML API), then copy at most 4 bytes. Escapes of invalid UTF-8 fall through to the literal previous mb_substr() line, so behavior is preserved by construction under every mb_substitute_character setting; that fallback remains O(tail) per call, accepted for developer-supplied selectors. _wp_utf8_codepoint_span() is deliberately not used: it leaves the scanner's ASCII fast-path unbounded, which is quadratic again (noted in-code). 200KB of repeated \g through parse_ident: 180 ms before, 45 ms after, with linear scaling after (47/90/180 ms at 200/400/800KB; previously ~4x per doubling) and half the peak memory. Escape pin coverage grows to 14 cases: 2/3/4-byte characters including at end of input, NUL, and each invalid-byte class (lone continuation, overlong lead, invalid lead, truncated 3/4-byte, encoded surrogate, above U+10FFFF), with expectations probe-verified against the pre-change implementation. Adversarial review: equivalence reviewer ran ~74M differential old-vs-new cases (exhaustive byte-class boundaries at every offset, random fuzz, non-default mb_substitute_character) with 0 mismatches; perf reviewer independently reproduced the quadratic-before / linear-after curves; integration reviewer verified load order (including SHORTINIT), private-function precedent, and phpcs. All approved. Gates: full html-api PHPUnit group green (1654 tests), fuzzer 5000 seeds 0 failures. (cherry picked from commit 9d82c1c)
Decoding an identity escape of invalid UTF-8 leaks the process-global mb_substitute_character() setting into parse results: the substitute character is returned and the offset advances by the byte length of the substitute, not of the invalid sequence. Under the default '?' this is nearly invisible; under a multibyte substitute it swallows following characters and can push the offset past the end of the input. Pin the setting to a distinctive canary -- U+2603 SNOWMAN -- in set_up()/tear_down() and rewrite the seven invalid-byte pins to the canary expectations, making the dependence unmistakable: five cases show the trailing 'z' being eaten, and a dedicated test asserts the offset overrun that the rest-of-input assertion cannot see (substr() returns '' both at and past the end). A differential run of all provider cases under canary/default/'none' confirms exactly these seven react to the setting; everything else is independent of it. These pins document the leak, not endorse it. They are the ready-made red suite for the planned fix: decoding invalid bytes to U+FFFD per maximal subpart (CSS Syntax 3 section 3.2 via the WHATWG Encoding Standard) makes the outputs setting-independent and flips every one of these expectations. Adversarial review approved; full html-api group green (1654 tests) with the substitute character verified restored after the run. (cherry picked from commit 9b0b1df)
Selector strings are UTF-8 text. from_selectors() now decodes the input byte stream before parsing: normalize_selector_input() replaces each maximal subpart of an ill-formed byte sequence with U+FFFD via wp_scrub_utf8() (WP 6.9), per the byte-decoding step CSS Syntax 3 section 3.2 defines through the WHATWG Encoding Standard's UTF-8 decoder. A replaced selector is almost always a developer mistake (mojibake, double encoding) that would otherwise yield a silently empty match set, so the replacement also reports _doing_it_wrong(), named "<called class>::from_selectors" via late static binding. The mb_substitute_character() leak in consume_escaped_codepoint() dies structurally: with all public input scrubbed, the identity arm's mb_substr() fallback became unreachable through from_selectors() and is replaced by a deterministic decode for direct parse() callers — consume the maximal subpart the existing _wp_scan_utf8() call already reported and return one U+FFFD, consistent with the scrub. This also removes the remaining O(tail)-per-escape copy for invalid bytes. Design decision: reject (wp_is_valid_utf8() -> null) and raw byte passthrough were both considered and discarded by a three-persona adversarial design review; scrub is the option that stays stable under both the current raw value getters and a future where the getters scrub their return values. An escape-arm-only U+FFFD decode was ruled out unanimously: it would break the identity property that escaping a non-special code point is equivalent to writing it unescaped. The known divergence is pinned in a test: a scrubbed selector cannot match raw invalid bytes in a document (the Tag Processor reports raw bytes); if the HTML API value getters are ever changed to scrub, that pin flips to a match and must be updated in the same change. The compound-list class docblock gains a "Text Encoding" section recording the contract. Tests: the seven U+2603-canary escape pins flip to maximal-subpart U+FFFD expectations, and the canary is retained permanently — its job inverted from documenting the leak to proving setting-independence (a reintroduced mb_substitute_character dependence fails eight tests). New coverage: scrub + notice through from_selectors() on both list classes (the complex-list test pins the late-static-binding notice name), a lone invalid byte parsing as a U+FFFD type selector, the notice firing even when the scrubbed selector is rejected by the grammar, string-token invalid-byte decode, identity-escape equivalence and U+FFFD matching through select(), and the raw-document-bytes no-match pin (deliberately unique 0xC1 byte: select() memoizes the last parsed selector string, so a unique selector guarantees the parse-time notice under any test order). Adversarial review: three hostile reviewers. The equivalence reviewer verified the decode against an independent reference WHATWG UTF-8 decoder (exhaustive 1-2-byte tails, ~204k boundary-alphabet 3-4-byte tails, 100k random; zero mismatches, all under a U+2603 canary), scrub idempotence and ordering-neutrality (200k cases), and the notice-name propagation. The test reviewer killed four core mutations against the suite and demonstrated two test defects (select()-cache coupling, an unpinned complex-list notice name), both fixed before commit. The integration reviewer verified worker-model equivalence and determinism (10000 fuzz seeds clean). All approved. Gates: full html-api group green (1665 tests), fuzzer 5000 seeds 0 failures, self-check OK, phpcs clean. (cherry picked from commit 598ed6f)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Trac ticket:
Use of AI Tools
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.