
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 (#495)

Open

subrata-ms wants to merge 3 commits into main from subrata-ms/CP1252Encoding

Conversation

@subrata-ms
Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, lines 264-271)

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py)
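
The underlying mismatch can be reproduced in a few lines of plain Python. This is an illustration only, none of it is code from the driver; 0x92 is just one of the CP1252 bytes that are invalid as standalone UTF-8:

```python
# 0x92 is CP1252's right single quotation mark (U+2019); as a lone byte it
# is an invalid UTF-8 sequence, so the old narrow-char path failed on it.
raw = b"\x92"

failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    failed = True  # the pre-fix behavior on Windows
assert failed

# Decoding with the server's code page recovers the character...
assert raw.decode("cp1252") == "\u2019"
# ...and fetching as SQL_C_WCHAR moves that conversion into the driver,
# which hands the client unambiguous UTF-16LE instead of raw code-page bytes.
assert "\u2019".encode("utf-16le") == b"\x19\x20"
```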

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h)

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py)

@github-actions github-actions bot added the pr-size: large Substantial code update label Apr 3, 2026
@github-actions

github-actions bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

85%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6636 out of 8461
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (87.4%): Missing lines 3253-3258,3265-3276,3278-3282,4795-4797,5053-5054,5104-5106,5410-5411
  • mssql_python/pybind/ddbc_bindings.h (65.2%): Missing lines 830-832,838-842

Summary

  • Total: 285 lines
  • Missing: 41 lines
  • Coverage: 85%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3249-3262

  3249                                     "length=%lu",
  3250                                     i, (unsigned long)numCharsInData);
  3251                             } else {
  3252                                 // Buffer too small, fallback to streaming
! 3253                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3254                                     "truncated, using streaming LOB",
! 3255                                     i);
! 3256                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3257                                                               "utf-16le"));
! 3258                             }
  3259                         } else if (dataLen == SQL_NULL_DATA) {
  3260                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {

Lines 3261-3286

  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {
  3263                             row.append(py::str(""));
  3264                         } else if (dataLen == SQL_NO_TOTAL) {
! 3265                             LOG("SQLGetData: Cannot determine data length "
! 3266                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3267                                 "returning NULL",
! 3268                                 i);
! 3269                             row.append(py::none());
! 3270                         } else if (dataLen < 0) {
! 3271                             LOG("SQLGetData: Unexpected negative data length "
! 3272                                 "for column %d - dataType=%d, dataLen=%ld",
! 3273                                 i, dataType, (long)dataLen);
! 3274                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3275                                               "data length");
! 3276                         }
  3277                     } else {
! 3278                         LOG("SQLGetData: Error retrieving data for column %d "
! 3279                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3280                             i, ret);
! 3281                         row.append(py::none());
! 3282                     }
  3283                 } else {
  3284                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3285                     //
  3286                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4791-4801

  4791                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4792                 break;
  4793             default:
  4794                 std::ostringstream errorString;
! 4795                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4796                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4797                             << (i + 1);
  4798                 LOG(errorString.str().c_str());
  4799                 ThrowStdException(errorString.str());
  4800                 break;
  4801         }

Lines 5049-5058

  5049                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5050                                                  sizeof(DateTimeOffset),
  5051                                                  buffers.indicators[idxCol].data());
  5052                             if (!SQL_SUCCEEDED(ret)) {
! 5053                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5054                                     idxCol + 1);
  5055                                 return ret;
  5056                             }
  5057                             break;
  5058                         }

Lines 5100-5110

  5100                     nullCounts[idxCol] += 1;
  5101                     continue;
  5102                 } else if (indicator < 0) {
  5103                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5104                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5105                         "Length - %lld",
! 5106                         idxCol + 1, dataType, (long long)indicator);
  5107                     ThrowStdException("Unexpected negative data length.");
  5108                 }
  5109                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5406-5415

  5406         arrowSchemaBatchCapsule =
  5407             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5408                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5409                 if (arrowSchema->release) {
! 5410                     arrowSchema->release(arrowSchema);
! 5411                 }
  5412                 delete arrowSchema;
  5413             });
  5414     } catch (...) {
  5415         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  826             PyObject* pyStr =
  827                 PyUnicode_FromWideChar(reinterpret_cast<const wchar_t*>(wcharData), numCharsInData);
  828 #endif
  829             if (!pyStr) {
! 830                 PyErr_Clear();
! 831                 Py_INCREF(Py_None);
! 832                 PyList_SET_ITEM(row, col - 1, Py_None);
  833             } else {
  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {

  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {
  837             // LOB / truncated: stream with SQL_C_WCHAR
! 838             PyList_SET_ITEM(row, col - 1,
! 839                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 840                                 .release()
! 841                                 .ptr());
! 842         }
  843         return;
  844     }
  845 
  846     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10
Contributor

Copilot AI left a comment


Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • mssql_python/connection.py: Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
  • mssql_python/cursor.py: Passes updated decoding settings (encoding + ctype) into the native fetch functions.
  • mssql_python/pybind/ddbc_bindings.cpp: Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
  • mssql_python/pybind/ddbc_bindings.h: Extends per-column metadata and adds a wide-char branch in ProcessChar.
  • tests/test_013_encoding_decoding.py: Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.


Comment on lines 268 to 272

        self._decoding_settings = {
            ConstantsDDBC.SQL_CHAR.value: {
-               "encoding": "utf-8",
-               "ctype": ConstantsDDBC.SQL_CHAR.value,
+               "encoding": "utf-16le",
+               "ctype": ConstantsDDBC.SQL_WCHAR.value,
            },

Copilot AI Apr 3, 2026


Connection.__init__ now sets SQL_CHAR decoding defaults to encoding='utf-16le' with ctype=SQL_WCHAR, but setdecoding() still defaults SQL_CHAR to 'utf-8' when encoding is None. This makes it impossible to reliably "reset to defaults" via setdecoding(SQL_CHAR) and creates inconsistent documented behavior. Update setdecoding's default-encoding branch (and its docstring/examples) so SQL_CHAR defaults match the new connection defaults (utf-16le + SQL_WCHAR).
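
A minimal self-contained model of the inconsistency described above. The class and attribute names are illustrative only; just the constant values (SQL_CHAR = 1, SQL_WCHAR = -8) follow the ODBC definitions:

```python
SQL_CHAR, SQL_WCHAR = 1, -8  # ODBC type constants

class Conn:
    def __init__(self):
        # new defaults introduced by this PR
        self._decoding = {SQL_CHAR: {"encoding": "utf-16le", "ctype": SQL_WCHAR}}

    def setdecoding(self, sqltype, encoding=None, ctype=None):
        if encoding is None:
            # stale branch: still assumes the old utf-8 default for SQL_CHAR
            encoding = "utf-8" if sqltype == SQL_CHAR else "utf-16le"
        if ctype is None:
            ctype = SQL_CHAR if encoding == "utf-8" else SQL_WCHAR
        self._decoding[sqltype] = {"encoding": encoding, "ctype": ctype}

conn = Conn()
before = dict(conn._decoding[SQL_CHAR])
conn.setdecoding(SQL_CHAR)   # intended as "reset to defaults"
after = conn._decoding[SQL_CHAR]
assert before != after       # the reset does NOT restore the new defaults
```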

std::string utf8str = WideToUTF8(wstr);
row.append(py::str(utf8str));
#else
std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));

Copilot AI Apr 3, 2026


In the SQL_C_WCHAR path, this constructs a std::wstring from a null-terminated buffer (wstring(ptr)) even though SQLGetData provides an explicit length (numCharsInData). If the data contains embedded NULs, or if the driver fails to null-terminate as expected, the value may be truncated or read past the valid range. Prefer constructing wstr with the explicit length (e.g., wstring(ptr, numCharsInData)) and avoid relying on terminators.

Suggested change

-           std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
+           std::wstring wstr(
+               reinterpret_cast<wchar_t*>(dataBuffer.data()),
+               static_cast<size_t>(numCharsInData));
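
The terminator-vs-explicit-length distinction the review raises can be sketched in Python with ctypes (the buffer contents are invented for illustration):

```python
import ctypes

# A wide-char buffer holding "ab\0cd" plus a terminator, mimicking driver
# output that contains an embedded NUL character.
buf = ctypes.create_unicode_buffer(6)
for i, ch in enumerate("ab\x00cd"):
    buf[i] = ch

# Reading until the first NUL (what std::wstring(ptr) does) truncates:
assert ctypes.wstring_at(ctypes.addressof(buf)) == "ab"
# Passing the explicit length (std::wstring(ptr, n)) preserves the data:
assert ctypes.wstring_at(ctypes.addressof(buf), 5) == "ab\x00cd"
```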

Comment on lines +831 to +832
Py_INCREF(Py_None);
PyList_SET_ITEM(row, col - 1, Py_None);

Copilot AI Apr 3, 2026


ProcessChar’s wide-char decode failure path clears the exception and stores None in the result. This silently drops data and is inconsistent with the narrow-char path (which falls back to bytes on decode failure) and ProcessWChar (which falls back to empty string). Consider aligning behavior (e.g., return raw UTF-16 bytes or a deterministic fallback) rather than returning None.

Suggested change

-           Py_INCREF(Py_None);
-           PyList_SET_ITEM(row, col - 1, Py_None);
+           PyObject* pyBytes = PyBytes_FromStringAndSize(
+               reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
+           if (!pyBytes) {
+               Py_INCREF(Py_None);
+               PyList_SET_ITEM(row, col - 1, Py_None);
+           } else {
+               PyList_SET_ITEM(row, col - 1, pyBytes);
+           }

Comment on lines +7420 to +7421
for b in defined_bytes:
cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")

Copilot AI Apr 3, 2026


This test inserts 123 rows using a per-row execute() loop, which can add noticeable runtime on Windows CI (many round-trips). Consider batching these inserts (executemany, a single INSERT ... VALUES list, or inserting into a temp table) to keep the regression coverage without the per-row overhead.

Suggested change

-       for b in defined_bytes:
-           cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
+       cursor.executemany(
+           f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES (?, CHAR(?))",
+           [(b, b) for b in defined_bytes],
+       )

