Skip to content

Allow writing StringDType variables to netCDF#11218

Open
kkollsga wants to merge 2 commits intopydata:mainfrom
kkollsga:fix-stringdtype-netcdf-11199
Open

Allow writing StringDType variables to netCDF#11218
kkollsga wants to merge 2 commits intopydata:mainfrom
kkollsga:fix-stringdtype-netcdf-11199

Conversation

@kkollsga
Copy link
Contributor

@kkollsga kkollsga commented Mar 8, 2026

Summary

  • Recognizes numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, so the encoding pipeline and backend dtype selection handle it correctly.
  • Converts StringDType arrays to object arrays in netCDF4 and h5netcdf backend prepare_variable methods, since neither C library supports StringDType natively.
  • Null values from StringDType(na_object=None) are replaced with empty strings on write, matching existing behavior for object-dtype string arrays with missing values.
  • The scipy backend already works because EncodedStringCoder(allows_unicode=False) encodes strings to bytes via encode_string_array, which handles StringDType.

Test plan

  • test_is_unicode_dtype_stringdtype — unit test for is_unicode_dtype with StringDType
  • test_roundtrip_stringdtype_data — roundtrip test in DatasetIOBase, runs across all backends (netCDF4, h5netcdf, scipy, zarr)
  • Manual verification of null handling with StringDType(na_object=None)
  • Pre-commit (ruff, formatting) passes
  • mypy passes (no new errors)

🤖 Generated with Claude Code

@kkollsga
Copy link
Contributor Author

kkollsga commented Mar 9, 2026

The mypy and test failures look related to numpy 2.4.2's stricter type stubs (#11183, #11204).

@kkollsga kkollsga mentioned this pull request Mar 9, 2026
2 tasks
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @kkollsga! Just one comment about adding some more tests. Also does the scipy backend need this same handling?

data = np.array(["ab", "cdef"], dtype=np.dtypes.StringDType())
expected = Dataset({"x": ("t", data)})
with self.roundtrip(expected) as actual:
assert_identical(expected, actual)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is great! Can you also add a test where there is a null value in the dataset and also one where you explicitly set StringDType(na_object="") or something to prove that the nulll handling works as expected?

kkollsga and others added 2 commits March 14, 2026 12:20
Recognize numpy.dtypes.StringDType (kind "T") as a unicode string type
in is_unicode_dtype, and convert StringDType arrays to object arrays
before passing to netCDF4/h5netcdf backends which don't support
StringDType natively. Null values from StringDType(na_object=None) are
replaced with empty strings on write.

Co-authored-by: Claude <noreply@anthropic.com>
- Handle StringDType null values in encode_string_array (scipy/nc3 path)
- Add roundtrip tests for StringDType with na_object=None and na_object=""
- Add unit test for encode_string_array with StringDType nulls

Co-Authored-By: Claude <noreply@anthropic.com>
@kkollsga kkollsga force-pushed the fix-stringdtype-netcdf-11199 branch from 4f7280c to 79b4ae1 Compare March 14, 2026 11:36
@kkollsga
Copy link
Contributor Author

Thanks for the review @jsignell ! I looked into scipy and the situation is a bit different there. The scipy backend goes through EncodedStringCoder(allows_unicode=False) which converts strings to bytes before prepare_variable ever sees the data, so it doesn't need the same prepare_variable change. However, encode_string_array (called by that coder) would crash on null values since None.encode("utf-8") raises AttributeError. So the fix needed to go in encode_string_array instead, using the same convert-to-object-and-replace-None pattern to keep null handling consistent across all paths.

I added these changes:

  • Null handling in encode_string_array for StringDType (covers scipy and nc3 format paths)
  • test_roundtrip_stringdtype_nulls — roundtrip with StringDType(na_object=None) containing a null, verifies it comes back as ""
  • test_roundtrip_stringdtype_with_na_object — roundtrip with StringDType(na_object="")
  • test_encode_string_array_stringdtype_nulls — unit test for the encode_string_array fix

All roundtrip tests are in DatasetIOBase so they run across every backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datasets concatenated along string dimension cannot write to netCDF

2 participants