
Commit 8ca0ebe

gh-95555: Support Unicode property escapes \p{...} in regular expressions
Add support for \p{property} and \P{property} in Unicode (str) regular
expressions, for the properties the engine can resolve without the
unicodedata database. They are matched either as CATEGORY opcodes
(character predicates and combinations of them, see sre.c) or as fixed
sets of character ranges.

Supported properties:

* many General_Category values -- the groups L, N, Z, C and the values
  Lu, Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
* the binary properties Alphabetic, Lowercase, Uppercase, Numeric,
  Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
* the POSIX compatibility classes alpha, alnum, blank, cntrl, digit,
  graph, lower, print, space, upper, word and xdigit;
* the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point,
  Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 4de5683 commit 8ca0ebe
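
The commit message says some properties are matched as "fixed sets of character ranges". The following is only a rough sketch of that idea, not the code from this commit: the table names and the `in_ranges` helper are made up for illustration, and the ranges are the well-known definitions of ASCII, Join_Control and Noncharacter_Code_Point.

```python
from bisect import bisect_right

# Hypothetical tables: sorted, non-overlapping (low, high) code point
# ranges, both ends inclusive.
ASCII = [(0x00, 0x7F)]
JOIN_CONTROL = [(0x200C, 0x200D)]
# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
NONCHARACTER = sorted(
    [(0xFDD0, 0xFDEF)]
    + [(plane << 16 | 0xFFFE, plane << 16 | 0xFFFF) for plane in range(17)]
)


def in_ranges(ranges, ch):
    """Return True if the code point of ch falls inside one of the ranges."""
    cp = ord(ch)
    # Index of the last range whose low bound is <= cp (or -1 if none).
    i = bisect_right(ranges, (cp, 0x10FFFF + 1)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]
```

Such a table never changes with the Unicode version for stable properties like these, which may be why they can be resolved without the unicodedata database.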

9 files changed

Lines changed: 827 additions & 7 deletions


Doc/library/re.rst

Lines changed: 45 additions & 1 deletion
@@ -591,7 +591,7 @@ character ``'$'``.

       Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

-      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
+      __ https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-4/#G124142

    For 8-bit (bytes) patterns:
       Matches any decimal digit in the ASCII character set;
@@ -658,6 +658,50 @@ character ``'$'``.
    matches characters which are neither alphanumeric in the current locale
    nor the underscore.

+.. index:: single: \p; in regular expressions
+           single: \P; in regular expressions
+
+``\p{property=value}``, ``\p{value}``
+   Matches any character with the given Unicode property
+   (see `Unicode Technical Standard #18
+   <https://unicode.org/reports/tr18/>`_, requirement RL1.2 "Properties").
+   Property and value names are matched loosely:
+   case, whitespace, ``'-'`` and ``'_'`` are ignored.
+   The following properties are supported:
+
+   * The ``General_Category`` property (short name ``gc``),
+     spelled ``\p{Lu}``, ``\p{gc=Lu}`` or, for a one-letter group, ``\p{L}``.
+     The supported values are the groups ``L``, ``N``, ``Z`` and ``C`` and the
+     values ``Lu``, ``Lt``, ``Lm``, ``Nd``, ``Nl``, ``No``, ``Zs``, ``Zl``,
+     ``Zp``, ``Cc``, ``Cf``, ``Cs``, ``Co`` and ``Cn``.
+   * The binary properties ``XID_Start``, ``XID_Continue``, ``Alphabetic``,
+     ``Lowercase``, ``Uppercase``, ``Numeric``, ``Printable``, ``Cased`` and
+     ``Case_Ignorable``. A binary property may also be spelled
+     ``\p{name=yes}`` or ``\p{name=no}``.
+   * The POSIX compatibility classes ``alpha``, ``alnum``, ``blank``,
+     ``cntrl``, ``digit``, ``graph``, ``lower``, ``print``, ``space``,
+     ``upper``, ``word`` and ``xdigit``.
+   * The properties ``ASCII``, ``Any``, ``Assigned``,
+     ``Noncharacter_Code_Point``, ``Join_Control``, ``Pattern_Syntax`` and
+     ``Pattern_White_Space``.
+
+   Where a supported property corresponds to a :mod:`unicodedata` accessor or
+   :class:`str` method, the set of characters it matches is exactly the one
+   they report. For consistency with these, ``space`` follows
+   :py:meth:`str.isspace` (like ``\s``) and ``xdigit`` matches only the ASCII
+   hexadecimal digits.
+
+   This is only recognized in Unicode (str) patterns.
+   In bytes patterns it is an error.
+
+   .. versionadded:: next
+
+``\P{...}``
+   Matches any character which does *not* have the given Unicode property.
+   This is the opposite of ``\p``.
+
+   .. versionadded:: next
+
 .. index:: single: \z; in regular expressions
            single: \Z; in regular expressions
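
The documentation added above describes loose name matching and says that a property backed by a :mod:`unicodedata` accessor matches exactly what that accessor reports. A minimal sketch of both points on an interpreter without this change might look like the following; `loose_key` and `has_general_category` are hypothetical helpers written for this note, not functions from the commit, and the results depend on the Unicode version bundled with the running interpreter.

```python
import re
import unicodedata


def loose_key(name):
    """Fold a property or value name: drop whitespace, '-' and '_', ignore case."""
    return re.sub(r"[\s_-]", "", name).casefold()


def has_general_category(ch, value):
    """Emulate \\p{gc=value} with unicodedata; a one-letter value is a group."""
    category = unicodedata.category(ch)
    if len(value) == 1:
        # Groups such as L or N cover every two-letter value that starts with them.
        return category.startswith(value)
    return category == value
```

Under these assumptions `loose_key("General_Category")` and `loose_key("general category")` fold to the same key, and `has_general_category("A", "Lu")` is true because `unicodedata.category("A")` is `"Lu"`.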

Doc/whatsnew/3.16.rst

Lines changed: 11 additions & 0 deletions
@@ -142,6 +142,17 @@ os
   (Contributed by Maurycy Pawłowski-Wieroński in :gh:`149464`.)


+re
+--
+
+* Regular expressions now support Unicode property escapes ``\p{...}`` and
+  ``\P{...}``, which match a character by a Unicode property -- for example
+  ``\p{Lu}`` (an uppercase letter), ``\p{Cased}`` or ``\p{ASCII}``. See
+  :ref:`the regular expression syntax <re-syntax>` for the supported
+  properties.
+  (Contributed by Serhiy Storchaka in :gh:`95555`.)
+
+
 shlex
 -----
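
Code that has to run both with and without this change could probe for the feature at import time. This is a small sketch under the assumption that interpreters lacking the feature reject `\p` as a bad escape and raise `re.error`; the helper name and the fallback pattern are illustrative only.

```python
import re


def supports_property_escapes():
    """Return True if this interpreter's re module accepts \\p{...}."""
    try:
        re.compile(r"\p{Lu}")
    except re.error:
        # Without the feature, \p is reported as a bad escape.
        return False
    return True


# The fallback is an ASCII-only approximation, not an equivalent pattern.
UPPER = r"\p{Lu}" if supports_property_escapes() else r"[A-Z]"
```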

Lib/re/_constants.py

Lines changed: 63 additions & 1 deletion
@@ -13,7 +13,7 @@

 # update when constants are added or removed

-MAGIC = 20230612
+MAGIC = 20260628

 from _sre import MAXREPEAT, MAXGROUPS # noqa: F401

@@ -150,6 +150,35 @@ def _makecodes(*names):
     'CATEGORY_UNI_SPACE', 'CATEGORY_UNI_NOT_SPACE',
     'CATEGORY_UNI_WORD', 'CATEGORY_UNI_NOT_WORD',
     'CATEGORY_UNI_LINEBREAK', 'CATEGORY_UNI_NOT_LINEBREAK',
+
+    # Unicode property categories. These are not affected by the ASCII,
+    # LOCALE or UNICODE flags.
+    'CATEGORY_ALPHA', 'CATEGORY_NOT_ALPHA',
+    'CATEGORY_LOWER', 'CATEGORY_NOT_LOWER',
+    'CATEGORY_UPPER', 'CATEGORY_NOT_UPPER',
+    'CATEGORY_NUMERIC', 'CATEGORY_NOT_NUMERIC',
+    'CATEGORY_PRINTABLE', 'CATEGORY_NOT_PRINTABLE',
+    'CATEGORY_ALNUM', 'CATEGORY_NOT_ALNUM',
+    'CATEGORY_XID_START', 'CATEGORY_NOT_XID_START',
+    'CATEGORY_XID_CONTINUE', 'CATEGORY_NOT_XID_CONTINUE',
+    'CATEGORY_TITLE', 'CATEGORY_NOT_TITLE',
+    'CATEGORY_CASED', 'CATEGORY_NOT_CASED',
+    'CATEGORY_CASE_IGNORABLE', 'CATEGORY_NOT_CASE_IGNORABLE',
+    # Compound categories: Lu = uppercase letter, N = number.
+    'CATEGORY_LU', 'CATEGORY_NOT_LU',
+    'CATEGORY_N', 'CATEGORY_NOT_N',
+    'CATEGORY_LM', 'CATEGORY_NOT_LM',
+    'CATEGORY_NL', 'CATEGORY_NOT_NL',
+    'CATEGORY_NO', 'CATEGORY_NOT_NO',
+    'CATEGORY_CF', 'CATEGORY_NOT_CF',
+    'CATEGORY_Z', 'CATEGORY_NOT_Z',
+    'CATEGORY_ZS', 'CATEGORY_NOT_ZS',
+    'CATEGORY_C', 'CATEGORY_NOT_C',
+    'CATEGORY_CN', 'CATEGORY_NOT_CN',
+    'CATEGORY_ASSIGNED', 'CATEGORY_NOT_ASSIGNED',
+    'CATEGORY_BLANK', 'CATEGORY_NOT_BLANK',
+    'CATEGORY_GRAPH', 'CATEGORY_NOT_GRAPH',
+    'CATEGORY_PRINT', 'CATEGORY_NOT_PRINT',
 )


@@ -206,6 +235,39 @@ def _makecodes(*names):
     CATEGORY_NOT_LINEBREAK: CATEGORY_UNI_NOT_LINEBREAK
 }

+# The Unicode property categories are the same regardless of the flags.
+CH_PROPERTY = (
+    CATEGORY_ALPHA, CATEGORY_NOT_ALPHA,
+    CATEGORY_LOWER, CATEGORY_NOT_LOWER,
+    CATEGORY_UPPER, CATEGORY_NOT_UPPER,
+    CATEGORY_NUMERIC, CATEGORY_NOT_NUMERIC,
+    CATEGORY_PRINTABLE, CATEGORY_NOT_PRINTABLE,
+    CATEGORY_ALNUM, CATEGORY_NOT_ALNUM,
+    CATEGORY_XID_START, CATEGORY_NOT_XID_START,
+    CATEGORY_XID_CONTINUE, CATEGORY_NOT_XID_CONTINUE,
+    CATEGORY_TITLE, CATEGORY_NOT_TITLE,
+    CATEGORY_CASED, CATEGORY_NOT_CASED,
+    CATEGORY_CASE_IGNORABLE, CATEGORY_NOT_CASE_IGNORABLE,
+    CATEGORY_LU, CATEGORY_NOT_LU,
+    CATEGORY_N, CATEGORY_NOT_N,
+    CATEGORY_LM, CATEGORY_NOT_LM,
+    CATEGORY_NL, CATEGORY_NOT_NL,
+    CATEGORY_NO, CATEGORY_NOT_NO,
+    CATEGORY_CF, CATEGORY_NOT_CF,
+    CATEGORY_Z, CATEGORY_NOT_Z,
+    CATEGORY_ZS, CATEGORY_NOT_ZS,
+    CATEGORY_C, CATEGORY_NOT_C,
+    CATEGORY_CN, CATEGORY_NOT_CN,
+    CATEGORY_ASSIGNED, CATEGORY_NOT_ASSIGNED,
+    CATEGORY_BLANK, CATEGORY_NOT_BLANK,
+    CATEGORY_GRAPH, CATEGORY_NOT_GRAPH,
+    CATEGORY_PRINT, CATEGORY_NOT_PRINT,
+)
+for _cat in CH_PROPERTY:
+    CH_LOCALE[_cat] = _cat
+    CH_UNICODE[_cat] = _cat
+del _cat
+
 CH_NEGATE = dict(zip(CHCODES[::2] + CHCODES[1::2], CHCODES[1::2] + CHCODES[::2]))

 # flags
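
The new constants are added in positive/negative pairs, which is what the existing `CH_NEGATE` expression relies on. The sketch below replays that expression and the identity mapping on a toy list; plain strings stand in for the real opcode constants, and the shortened tables are assumptions made for the example.

```python
# Toy stand-ins for the opcode constants (strings instead of named ints).
CHCODES = [
    "CATEGORY_DIGIT", "CATEGORY_NOT_DIGIT",
    "CATEGORY_LU", "CATEGORY_NOT_LU",
    "CATEGORY_CASED", "CATEGORY_NOT_CASED",
]

# Even positions hold a category and odd positions its negation, so zipping
# the two slices in both orders maps every code to its opposite.
CH_NEGATE = dict(zip(CHCODES[::2] + CHCODES[1::2], CHCODES[1::2] + CHCODES[::2]))

# Flag-dependent categories are remapped; property categories map to themselves.
CH_UNICODE = {
    "CATEGORY_DIGIT": "CATEGORY_UNI_DIGIT",
    "CATEGORY_NOT_DIGIT": "CATEGORY_UNI_NOT_DIGIT",
}
CH_PROPERTY = (
    "CATEGORY_LU", "CATEGORY_NOT_LU",
    "CATEGORY_CASED", "CATEGORY_NOT_CASED",
)
for cat in CH_PROPERTY:
    CH_UNICODE[cat] = cat
```

A new entry that breaks the pairing (a category without its `NOT_` twin next to it) would silently shift every later pair in `CH_NEGATE`, so the ordering in the list is part of the contract.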

Lib/re/_parser.py

Lines changed: 26 additions & 4 deletions
@@ -309,6 +309,22 @@ def checkgroupname(self, name, offset):
             msg = "bad character in group name %r" % name
             raise self.error(msg, len(name) + offset)

+def _property_escape(source, escape, in_set=False):
+    # handle \p{...} and \P{...} (UTS #18 1.2.4, "Property Syntax")
+    from . import _properties
+    if not source.match('{'):
+        raise source.error("missing {, expected property name")
+    name = source.getuntil('}', 'property name')
+    code = _properties.parse_property(name, escape[1] == 'P')
+    if code is None:
+        raise source.error("unknown property name %r" % name,
+                           len(name) + len(r'\p{}'))
+    if in_set and code[1][0] == (NEGATE, None):
+        # A negated multi-range property cannot be a member of a set.
+        raise source.error("bad escape %s in character class" % escape,
+                           len(name) + len(r'\p{}'))
+    return code
+
 def _class_escape(source, escape):
     # handle escape code inside character class
     code = ESCAPES.get(escape)
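
The offset `len(name) + len(r'\p{}')` passed to `source.error` in `_property_escape` is the length of the whole escape, so the reported position points at the backslash. The sketch below checks that arithmetic on a plain string, assuming the tokenizer stands just past the closing brace when the error is raised and that `error()` subtracts the offset from that position; `escape_span` is a hypothetical helper, not part of the parser.

```python
def escape_span(pattern, name):
    """Return (start, end) of the \\p{name} escape inside pattern."""
    # Position just past the closing brace, where the tokenizer would stand.
    end = pattern.index("{" + name + "}") + len(name) + 2
    # Same arithmetic as in the diff: name plus the four characters \ p { }.
    start = end - (len(name) + len(r"\p{}"))
    return start, end


pattern = r"ab\p{Bogus}c"
start, end = escape_span(pattern, "Bogus")
```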
@@ -351,6 +367,8 @@ def _class_escape(source, escape):
                 raise source.error("undefined character name %r" % charname,
                                    len(charname) + len(r'\N{}')) from None
             return LITERAL, c
+        elif c in "pP" and source.istext:
+            return _property_escape(source, escape, in_set=True)
         elif c in OCTDIGITS:
             # octal escape (up to three digits)
             escape += source.getwhile(2, OCTDIGITS)
@@ -411,6 +429,8 @@ def _escape(source, escape, state):
                 raise source.error("undefined character name %r" % charname,
                                    len(charname) + len(r'\N{}')) from None
             return LITERAL, c
+        elif c in "pP" and source.istext:
+            return _property_escape(source, escape)
         elif c == "0":
             # octal escape
             escape += source.getwhile(2, OCTDIGITS)
@@ -591,8 +611,9 @@ def _parse(source, state, verbose, nested, first=False):
                                            source.tell() - here)
                     if that == "]":
                         if code1[0] is IN:
-                            code1 = code1[1][0]
-                        setappend(code1)
+                            set.extend(code1[1])
+                        else:
+                            setappend(code1)
                         setappend((LITERAL, _ord("-")))
                         break
                     if that[0] == "\\":
@@ -617,8 +638,9 @@ def _parse(source, state, verbose, nested, first=False):
                         setappend((RANGE, (lo, hi)))
                 else:
                     if code1[0] is IN:
-                        code1 = code1[1][0]
-                    setappend(code1)
+                        set.extend(code1[1])
+                    else:
+                        setappend(code1)

             set = _uniq(set)
             # XXX: <fl> should move set optimization to compiler!
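
The two `_parse` hunks change how an `IN` code returned by a class escape is merged into the enclosing set. A plausible reading is that existing escapes such as `\d` always carry a single item, while a property escape may expand to several ranges, so taking only `code1[1][0]` would drop the rest. The sketch below contrasts the two behaviours with plain strings standing in for the opcode constants; the two-range code is an assumed shape, not output of the real parser.

```python
# Plain strings stand in for the parser's opcode constants.
IN, CATEGORY, RANGE = "IN", "CATEGORY", "RANGE"


def add_member_old(items, code):
    """Pre-change behaviour: keep only the first item of an IN code."""
    if code[0] is IN:
        code = code[1][0]
    items.append(code)


def add_member_new(items, code):
    """Post-change behaviour: splice in every item of an IN code."""
    if code[0] is IN:
        items.extend(code[1])
    else:
        items.append(code)


digit = (IN, [(CATEGORY, "CATEGORY_DIGIT")])  # shaped like \d
# Assumed shape of a property that expands to more than one range.
multi_range = (IN, [(RANGE, (0xFDD0, 0xFDEF)), (RANGE, (0xFFFE, 0xFFFF))])

old_items, new_items = [], []
add_member_old(old_items, multi_range)
add_member_new(new_items, multi_range)
```

For a single-item code such as `digit` both helpers produce the same set, which suggests the change is backward compatible for the existing escapes.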
