Commit 21c4b73

eendebakpt and claude authored
gh-152056: Compile single-category character sets to a bare CATEGORY opcode (GH-152057)
A character set containing exactly one category, e.g. [\d] or [^\s], now compiles to a single CATEGORY opcode (like \d or \S) instead of an IN block. The negated form maps to the complementary category. This speeds up matching and reduces the size of the compiled byte code.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent a46db4f commit 21c4b73

3 files changed

Lines changed: 17 additions & 4 deletions

Doc/whatsnew/3.16.rst

Lines changed: 6 additions & 4 deletions
@@ -288,10 +288,12 @@ re
 --
 
 * Character class escapes (``\d``, ``\D``, ``\s``, ``\S``, ``\w`` and ``\W``)
-  outside a character set are now compiled to a single ``CATEGORY`` opcode
-  instead of being wrapped in an ``IN`` block. This speeds up matching of
-  patterns such as ``\d+`` and reduces the size of the compiled byte code.
-  (Contributed by Serhiy Storchaka in :gh:`152033`.)
+  outside a character set, and character sets containing a single such escape
+  (such as ``[\d]`` or ``[^\s]``), are now compiled to a single ``CATEGORY``
+  opcode instead of being wrapped in an ``IN`` block. This speeds up matching
+  of patterns such as ``\d+`` and reduces the size of the compiled byte code.
+  (Contributed by Serhiy Storchaka in :gh:`152033` and Pieter Eendebak in
+  :gh:`152056`.)
 
 module_name
 -----------
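
One way to observe the change described in this entry is to look at the parser's output directly. The sketch below is an assumption-laden illustration, not part of the commit: it relies on the private `re._parser` module (an implementation detail that can change between releases), and the helper name `top_level_opcodes` is hypothetical. On an interpreter without this change, `[\d]` should report an `IN` node; with it, a `CATEGORY` node.

```python
# Private, version-dependent API: re._parser is a CPython implementation
# detail (3.11+); older versions expose the same module as sre_parse.
try:
    from re import _parser
except ImportError:  # Python < 3.11
    import sre_parse as _parser


def top_level_opcodes(pattern):
    """Return the names of the top-level opcodes in the parsed pattern."""
    # SubPattern.data is the list of (opcode, argument) pairs.
    return [op.name for op, _ in _parser.parse(pattern).data]


if __name__ == "__main__":
    # The result depends on the interpreter version, so nothing is asserted.
    for pattern in (r"\d", r"[\d]", r"[^\s]", r"[\d\s]"):
        print(f"{pattern!r:10} -> {top_level_opcodes(pattern)}")
```

Passing `re.DEBUG` to `re.compile()` prints similar information through a public flag, which may be preferable outside of quick experiments.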

Lib/re/_parser.py

Lines changed: 6 additions & 0 deletions
@@ -625,6 +625,12 @@ def _parse(source, state, verbose, nested, first=False):
                         subpatternappend((NOT_LITERAL, set[0][1]))
                     else:
                         subpatternappend(set[0])
+                elif _len(set) == 1 and set[0][0] is CATEGORY:
+                    # optimization: a lone category like [\d] or [^\d]
+                    if negate:
+                        subpatternappend((CATEGORY, CH_NEGATE[set[0][1]]))
+                    else:
+                        subpatternappend(set[0])
                 else:
                     if negate:
                         set.insert(0, (NEGATE, None))
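
The branch added in this hunk can be read as a small rewrite rule on the parsed set. The following is a standalone sketch of that rule, assuming simplified stand-ins: the string opcodes, the `COMPLEMENT` table (playing the role of `CH_NEGATE`) and the function `simplify_set` are all hypothetical and do not exist in `re`.

```python
# Hypothetical stand-ins for the parser's opcodes and category codes.
LITERAL, NOT_LITERAL, CATEGORY, IN, NEGATE = (
    "LITERAL", "NOT_LITERAL", "CATEGORY", "IN", "NEGATE")

# Each category maps to its complement, playing the role of CH_NEGATE.
COMPLEMENT = {
    "DIGIT": "NOT_DIGIT", "NOT_DIGIT": "DIGIT",
    "SPACE": "NOT_SPACE", "NOT_SPACE": "SPACE",
    "WORD": "NOT_WORD", "NOT_WORD": "WORD",
}


def simplify_set(items, negate):
    """Collapse a parsed character set to one node when it has one member."""
    if len(items) == 1 and items[0][0] == LITERAL:
        # [a] -> LITERAL a, [^a] -> NOT_LITERAL a
        return (NOT_LITERAL, items[0][1]) if negate else items[0]
    if len(items) == 1 and items[0][0] == CATEGORY:
        # [\d] -> CATEGORY DIGIT, [^\d] -> CATEGORY NOT_DIGIT
        return (CATEGORY, COMPLEMENT[items[0][1]]) if negate else items[0]
    # General case: keep the IN block, with a leading NEGATE marker if needed.
    body = ([(NEGATE, None)] if negate else []) + list(items)
    return (IN, body)
```

For example, `simplify_set([(CATEGORY, "SPACE")], negate=True)` yields `(CATEGORY, "NOT_SPACE")`, which corresponds to `[^\s]` becoming the same node as `\S`. A set with two or more members still falls through to the `IN` case.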
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+Optimize matching of a character set that contains a single character
+category, such as ``[\d]`` or ``[^\s]``: it is now compiled to a single
+``CATEGORY`` opcode, the same as the corresponding ``\d`` or ``\S`` escape,
+instead of being wrapped in an ``IN`` block. This speeds up matching and
+reduces the size of the compiled byte code. Patch by Pieter Eendebak.
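
The optimization is only valid if the set form and the bare escape accept exactly the same characters. The sketch below, using only the public `re` and `timeit` modules, checks that equivalence on a few illustrative inputs and prints rough timings. The sample strings, the third pattern pair and the helper name `same_matches` are assumptions made for illustration, and the timings will vary by machine and interpreter, so no figures are claimed here.

```python
import re
import timeit

# Pairs of (single-category set, equivalent bare escape). The first two come
# from the NEWS entry; the third is an extra pair added for illustration.
PAIRS = [(r"[\d]+", r"\d+"), (r"[^\s]+", r"\S+"), (r"[^\w]+", r"\W+")]

# Illustrative inputs; any strings would do.
SAMPLES = ["abc 123", "\t\n x", "", "42", "a_b-c"]


def same_matches(set_form, bare_form, samples=SAMPLES):
    """True if both patterns find identical matches in every sample."""
    return all(
        re.findall(set_form, s) == re.findall(bare_form, s) for s in samples
    )


if __name__ == "__main__":
    text = "0123456789 " * 1000
    for set_form, bare_form in PAIRS:
        print(set_form, "==", bare_form, same_matches(set_form, bare_form))
        for pattern in (set_form, bare_form):
            compiled = re.compile(pattern)
            seconds = timeit.timeit(lambda: compiled.findall(text), number=100)
            print(f"  {pattern:8} {seconds:.4f}s")
```

On an interpreter that includes this change, the two forms in each pair would be expected to time about the same, since they compile to the same opcode.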
