Summary
A Constants instance does not survive a pickle round-trip. Any customizations (added/removed titles, prefixes, suffixes, etc.) are silently lost, every collection reverts to its module default, and prefixes is left holding garbage — the names of the config attributes rather than actual prefixes. No exception is raised, so callers get a silently-wrong parser.
This is pre-existing on master (unrelated to the _pst cache work in #166) but was noticed while reviewing that PR.
Reproduction
import pickle
from nameparser.config import Constants
c = Constants()
c.titles.add('customtitle')
c.prefixes.add('customprefix')
r = pickle.loads(pickle.dumps(c))
print('customtitle' in r.titles) # False (expected True)
print('customprefix' in r.prefixes) # False (expected True)
print(sorted(r.prefixes)[:5])
# ['capitalization_exceptions', 'capitalize_name', 'conjunctions',
# 'empty_attribute_default', 'first_name_titles'] <-- attribute names, not prefixes
print(len(r.prefixes)) # 15 (the number of public attrs, not the 36 default prefixes)
Root cause
nameparser/config/__init__.py:
def __setstate__(self, state: Mapping[str, Any]) -> None:
Constants.__init__(self, state)
def __getstate__(self) -> Mapping[str, Any]:
attrs = [x for x in dir(self) if not x.startswith('_')]
return dict([(a, getattr(self, a)) for a in attrs])
__getstate__ returns a dict of every public attribute. __setstate__ then passes that whole dict as the first positional argument to __init__, i.e. as prefixes. So:
prefixes becomes SetManager(state_dict) — iterating a dict yields its keys, which is why prefixes ends up containing attribute names like conjunctions, capitalize_name, …
- every other
__init__ parameter (titles, suffix_acronyms, conjunctions, …) falls back to its module default, discarding the pickled values entirely.
Impact
- Any workflow that pickles a customized
Constants (e.g. caching, passing config across process boundaries / multiprocessing, some serialization layers) gets a silently corrupted parser rather than the configured one.
- The failure is silent — no error, just wrong parsing results — which makes it easy to miss.
Suggested fix
__setstate__ should restore the saved attributes onto the instance rather than re-running __init__ with the dict as a positional arg. Note that __getstate__ mixes two kinds of attributes that need different handling:
__init__ collection parameters: prefixes, suffix_acronyms, suffix_not_acronyms, titles, first_name_titles, conjunctions, capitalization_exceptions, regexes
- plain class-level scalars that are not
__init__ parameters: string_format, initials_format, initials_delimiter, empty_attribute_default, capitalize_name, force_mixed_case_capitalization
So a simple Constants.__init__(self, **state) would raise TypeError on the scalar keys. Restoring via setattr(self, key, value) for each item in state is the more robust direction (and naturally re-wires the _pst cache invalidation for the four cached set attributes, since assignment goes through their descriptor). A round-trip test asserting customizations survive should accompany the fix.
A couple of related things worth deciding while in there:
copy.deepcopy(Constants()) currently raises TypeError: 're.Pattern' object is not callable via RegexTupleManager.__reduce__; the test suite's conftest.py already works around this by rebuilding managers instead of deep-copying. Worth confirming whether deepcopy should be supported too.
- Consider whether
Constants needs to be picklable at all, or whether documenting it as non-picklable is acceptable — but if it stays picklable, the current silent corruption should be fixed.
Summary
A
Constantsinstance does not survive apickleround-trip. Any customizations (added/removed titles, prefixes, suffixes, etc.) are silently lost, every collection reverts to its module default, andprefixesis left holding garbage — the names of the config attributes rather than actual prefixes. No exception is raised, so callers get a silently-wrong parser.This is pre-existing on
master(unrelated to the_pstcache work in #166) but was noticed while reviewing that PR.Reproduction
Root cause
nameparser/config/__init__.py:__getstate__returns a dict of every public attribute.__setstate__then passes that whole dict as the first positional argument to__init__, i.e. asprefixes. So:prefixesbecomesSetManager(state_dict)— iterating a dict yields its keys, which is whyprefixesends up containing attribute names likeconjunctions,capitalize_name, …__init__parameter (titles,suffix_acronyms,conjunctions, …) falls back to its module default, discarding the pickled values entirely.Impact
Constants(e.g. caching, passing config across process boundaries / multiprocessing, some serialization layers) gets a silently corrupted parser rather than the configured one.Suggested fix
__setstate__should restore the saved attributes onto the instance rather than re-running__init__with the dict as a positional arg. Note that__getstate__mixes two kinds of attributes that need different handling:__init__collection parameters:prefixes,suffix_acronyms,suffix_not_acronyms,titles,first_name_titles,conjunctions,capitalization_exceptions,regexes__init__parameters:string_format,initials_format,initials_delimiter,empty_attribute_default,capitalize_name,force_mixed_case_capitalizationSo a simple
Constants.__init__(self, **state)would raiseTypeErroron the scalar keys. Restoring viasetattr(self, key, value)for each item instateis the more robust direction (and naturally re-wires the_pstcache invalidation for the four cached set attributes, since assignment goes through their descriptor). A round-trip test asserting customizations survive should accompany the fix.A couple of related things worth deciding while in there:
copy.deepcopy(Constants())currently raisesTypeError: 're.Pattern' object is not callableviaRegexTupleManager.__reduce__; the test suite'sconftest.pyalready works around this by rebuilding managers instead of deep-copying. Worth confirming whether deepcopy should be supported too.Constantsneeds to be picklable at all, or whether documenting it as non-picklable is acceptable — but if it stays picklable, the current silent corruption should be fixed.