Skip to content

Fix leading space in surnames after capitalize() with empty middle name#164

Merged
derek73 merged 3 commits into
derek73:masterfrom
patchwright:fix/capitalize-leading-space
Jun 27, 2026
Merged

Fix leading space in surnames after capitalize() with empty middle name#164
derek73 merged 3 commits into
derek73:masterfrom
patchwright:fix/capitalize-leading-space

Conversation

@patchwright

@patchwright patchwright commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Summary

After capitalize() is called on a name with any absent attribute, the corresponding *_list attribute becomes [''] instead of []. The most visible symptom is a spurious leading space in surnames (e.g. ' Doe' instead of 'Doe'). The same class of bug also affects suffix_list.

Reproduction

from nameparser import HumanName

n = HumanName('john doe')
n.capitalize()

print(repr(n.surnames))     # ' Doe'   <- leading space
print(n.middle_list)        # ['']    <- should be []
print(n.title_list)         # ['']    <- should be []
print(n.suffix_list)        # ['']    <- should be [] (no suffix present)

The leading space is a real observable artifact, not just an empty-list detail: downstream code that joins or compares surnames silently breaks.

Cause

cap_piece(...) returns '' for an absent name part. Splitting an empty string with an explicit separator always preserves a spurious empty token:

  • ''.split(' ')['']
  • ''.split(', ')['']

whereas str.split() with no argument returns [] for an empty string.

Fix

Whitespace-delimited attributes (title, first, middle, last)

Drop the explicit separator — bare .split() correctly returns [] for an empty string:

-        self.title_list = self.cap_piece(self.title, 'title').split(' ')
-        self.first_list = self.cap_piece(self.first, 'first').split(' ')
-        self.middle_list = self.cap_piece(self.middle, 'middle').split(' ')
-        self.last_list = self.cap_piece(self.last, 'last').split(' ')
+        self.title_list = self.cap_piece(self.title, 'title').split()
+        self.first_list = self.cap_piece(self.first, 'first').split()
+        self.middle_list = self.cap_piece(self.middle, 'middle').split()
+        self.last_list = self.cap_piece(self.last, 'last').split()

Suffix

suffix_list must keep its comma delimiter (suffixes are stored as "Ph.D., M.D."), so bare .split() can't be used. Instead, filter empty tokens after splitting:

-        self.suffix_list = self.cap_piece(self.suffix, 'suffix').split(', ')
+        self.suffix_list = [s for s in self.cap_piece(self.suffix, 'suffix').split(', ') if s]

Tests

  • Split the original regression test into three focused methods (normal path, force=True path, list invariants)
  • Added test_capitalize_title_and_last_only_no_spurious_tokens — exercises empty first_list and middle_list simultaneously
  • Added three suffix tests: empty suffix → [], single suffix, multiple suffixes

Full suite: 686 tests OK (20 expected failures).

capitalize() split each attribute with str.split(' '), which returns ['']
(not []) for an empty string. cap_piece() returns '' for an empty part, so an
empty middle name produced middle_list = [''], which leaked into surnames_list
(middle_list + last_list) and yielded a leading space in the surnames property:

    >>> hn = HumanName('john doe'); hn.capitalize(); hn.surnames
    ' Doe'   # leading space (should be 'Doe')

The same spurious '' element also appeared in title_list/first_list/last_list
for empty attributes. Using str.split() instead returns [] for empty strings
and is equivalent for the already-whitespace-collapsed pieces cap_piece()
returns. The suffix split (', ') is intentionally left unchanged.

Added a regression test in HumanNameCapitalizationTestCase.
@patchwright patchwright force-pushed the fix/capitalize-leading-space branch from 50b95d5 to 7fbb73c Compare June 26, 2026 14:15
@patchwright

Copy link
Copy Markdown
Contributor Author

Rebased onto master (1f46568) to clear a merge conflict from the test-suite reorg in #165. The monolithic tests.py is gone, so I moved the regression test into tests/test_capitalization.py next to the other capitalize cases. The parser.py fix itself is unchanged (the four .split(" ") calls in capitalize() become .split()). Full suite green locally on 3.13: 670 passed, 4 skipped, 20 xfailed; the new test fails on the unpatched parser with Doe != Doe for john doe, passes with the fix.

derek73 and others added 2 commits June 27, 2026 12:03
- Split test_capitalize_empty_name_part_has_no_leading_space_in_surnames
  into three focused tests (normal path, force=True path, list invariants)
  so failures self-localize to the scenario
- Added test_capitalize_title_and_last_only_no_spurious_tokens covering
  a name with empty first and middle simultaneously
- Sharpened inline comments to accurately describe root cause and why
  force=True matters (early-return guard)
- Added comment to suffix_list line explaining the intentional split(', ')
  asymmetry vs the space-delimited attributes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
''.split(', ') returns [''] just like ''.split(' ') did for the other
attributes. Use a filtered list comprehension to preserve the comma
delimiter while dropping empty tokens, making suffix_list consistent
with the [] invariant the rest of the codebase relies on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@derek73

derek73 commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Additional fix included: suffix_list spurious '' token

While reviewing this PR, I noticed that suffix_list has the same class of bug: ''.split(', ') returns [''] just like the old ''.split(' ') did for the other attributes. An absent suffix was producing suffix_list == [''] instead of [].

Since .split() (no args) can't be used here — the comma delimiter is load-bearing for multi-suffix names like "Ph.D., M.D." — the fix uses a filtered list comprehension:

-        self.suffix_list = self.cap_piece(self.suffix, 'suffix').split(', ')
+        self.suffix_list = [s for s in self.cap_piece(self.suffix, 'suffix').split(', ') if s]

Three tests added covering the empty-suffix case (the bug), single suffix, and multiple suffixes. Full suite: 686 tests OK.

@derek73 derek73 merged commit 5c19547 into derek73:master Jun 27, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants