
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 (#495)

Open

subrata-ms wants to merge 3 commits into main from subrata-ms/CP1252Encoding

Conversation

@subrata-ms
Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, lines 264-271)

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py)
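
The underlying mismatch can be reproduced in a few lines of plain Python. This is an illustration only, none of it is code from the driver; 0x92 is just one of the CP1252 bytes that are invalid as standalone UTF-8:

```python
# 0x92 is CP1252's right single quotation mark (U+2019); as a lone byte it
# is an invalid UTF-8 sequence, so the old narrow-char path failed on it.
raw = b"\x92"

failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    failed = True  # the pre-fix behavior on Windows
assert failed

# Decoding with the server's code page recovers the character...
assert raw.decode("cp1252") == "\u2019"
# ...and fetching as SQL_C_WCHAR moves that conversion into the driver,
# which hands the client unambiguous UTF-16LE instead of raw code-page bytes.
assert "\u2019".encode("utf-16le") == b"\x19\x20"
```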

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h)

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py)

@github-actions github-actions bot added the pr-size: large Substantial code update label Apr 3, 2026
@github-actions

github-actions bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

85%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6636 out of 8461
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (87.4%): Missing lines 3253-3258,3265-3276,3278-3282,4795-4797,5053-5054,5104-5106,5410-5411
  • mssql_python/pybind/ddbc_bindings.h (65.2%): Missing lines 830-832,838-842

Summary

  • Total: 285 lines
  • Missing: 41 lines
  • Coverage: 85%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3249-3262

  3249                                     "length=%lu",
  3250                                     i, (unsigned long)numCharsInData);
  3251                             } else {
  3252                                 // Buffer too small, fallback to streaming
! 3253                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3254                                     "truncated, using streaming LOB",
! 3255                                     i);
! 3256                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3257                                                               "utf-16le"));
! 3258                             }
  3259                         } else if (dataLen == SQL_NULL_DATA) {
  3260                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {

Lines 3261-3286

  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {
  3263                             row.append(py::str(""));
  3264                         } else if (dataLen == SQL_NO_TOTAL) {
! 3265                             LOG("SQLGetData: Cannot determine data length "
! 3266                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3267                                 "returning NULL",
! 3268                                 i);
! 3269                             row.append(py::none());
! 3270                         } else if (dataLen < 0) {
! 3271                             LOG("SQLGetData: Unexpected negative data length "
! 3272                                 "for column %d - dataType=%d, dataLen=%ld",
! 3273                                 i, dataType, (long)dataLen);
! 3274                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3275                                               "data length");
! 3276                         }
  3277                     } else {
! 3278                         LOG("SQLGetData: Error retrieving data for column %d "
! 3279                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3280                             i, ret);
! 3281                         row.append(py::none());
! 3282                     }
  3283                 } else {
  3284                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3285                     //
  3286                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4791-4801

  4791                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4792                 break;
  4793             default:
  4794                 std::ostringstream errorString;
! 4795                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4796                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4797                             << (i + 1);
  4798                 LOG(errorString.str().c_str());
  4799                 ThrowStdException(errorString.str());
  4800                 break;
  4801         }

Lines 5049-5058

  5049                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5050                                                  sizeof(DateTimeOffset),
  5051                                                  buffers.indicators[idxCol].data());
  5052                             if (!SQL_SUCCEEDED(ret)) {
! 5053                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5054                                     idxCol + 1);
  5055                                 return ret;
  5056                             }
  5057                             break;
  5058                         }

Lines 5100-5110

  5100                     nullCounts[idxCol] += 1;
  5101                     continue;
  5102                 } else if (indicator < 0) {
  5103                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5104                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5105                         "Length - %lld",
! 5106                         idxCol + 1, dataType, (long long)indicator);
  5107                     ThrowStdException("Unexpected negative data length.");
  5108                 }
  5109                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5406-5415

  5406         arrowSchemaBatchCapsule =
  5407             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5408                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5409                 if (arrowSchema->release) {
! 5410                     arrowSchema->release(arrowSchema);
! 5411                 }
  5412                 delete arrowSchema;
  5413             });
  5414     } catch (...) {
  5415         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  826             PyObject* pyStr =
  827                 PyUnicode_FromWideChar(reinterpret_cast<const wchar_t*>(wcharData), numCharsInData);
  828 #endif
  829             if (!pyStr) {
! 830                 PyErr_Clear();
! 831                 Py_INCREF(Py_None);
! 832                 PyList_SET_ITEM(row, col - 1, Py_None);
  833             } else {
  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {

  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {
  837             // LOB / truncated: stream with SQL_C_WCHAR
! 838             PyList_SET_ITEM(row, col - 1,
! 839                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 840                                 .release()
! 841                                 .ptr());
! 842         }
  843         return;
  844     }
  845 
  846     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10
Contributor

Copilot AI left a comment


Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • mssql_python/connection.py: Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
  • mssql_python/cursor.py: Passes updated decoding settings (encoding + ctype) into the native fetch functions.
  • mssql_python/pybind/ddbc_bindings.cpp: Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
  • mssql_python/pybind/ddbc_bindings.h: Extends per-column metadata and adds a wide-char branch in ProcessChar.
  • tests/test_013_encoding_decoding.py: Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.


Comment on lines 268 to 272

        self._decoding_settings = {
            ConstantsDDBC.SQL_CHAR.value: {
-               "encoding": "utf-8",
-               "ctype": ConstantsDDBC.SQL_CHAR.value,
+               "encoding": "utf-16le",
+               "ctype": ConstantsDDBC.SQL_WCHAR.value,
            },

Copilot AI Apr 3, 2026


Connection.__init__ now sets SQL_CHAR decoding defaults to encoding='utf-16le' with ctype=SQL_WCHAR, but setdecoding() still defaults SQL_CHAR to 'utf-8' when encoding is None. This makes it impossible to reliably "reset to defaults" via setdecoding(SQL_CHAR) and creates inconsistent documented behavior. Update setdecoding's default-encoding branch (and its docstring/examples) so SQL_CHAR defaults match the new connection defaults (utf-16le + SQL_WCHAR).
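
A minimal self-contained model of the inconsistency described above. The class and attribute names are illustrative only; just the constant values (SQL_CHAR = 1, SQL_WCHAR = -8) follow the ODBC definitions:

```python
SQL_CHAR, SQL_WCHAR = 1, -8  # ODBC type constants

class Conn:
    def __init__(self):
        # new defaults introduced by this PR
        self._decoding = {SQL_CHAR: {"encoding": "utf-16le", "ctype": SQL_WCHAR}}

    def setdecoding(self, sqltype, encoding=None, ctype=None):
        if encoding is None:
            # stale branch: still assumes the old utf-8 default for SQL_CHAR
            encoding = "utf-8" if sqltype == SQL_CHAR else "utf-16le"
        if ctype is None:
            ctype = SQL_CHAR if encoding == "utf-8" else SQL_WCHAR
        self._decoding[sqltype] = {"encoding": encoding, "ctype": ctype}

conn = Conn()
before = dict(conn._decoding[SQL_CHAR])
conn.setdecoding(SQL_CHAR)   # intended as "reset to defaults"
after = conn._decoding[SQL_CHAR]
assert before != after       # the reset does NOT restore the new defaults
```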

std::string utf8str = WideToUTF8(wstr);
row.append(py::str(utf8str));
#else
std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));

Copilot AI Apr 3, 2026


In the SQL_C_WCHAR path, this constructs a std::wstring from a null-terminated buffer (wstring(ptr)) even though SQLGetData provides an explicit length (numCharsInData). If the data contains embedded NULs, or if the driver fails to null-terminate as expected, the value may be truncated or read past the valid range. Prefer constructing wstr with the explicit length (e.g., wstring(ptr, numCharsInData)) and avoid relying on terminators.

Suggested change

-           std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
+           std::wstring wstr(
+               reinterpret_cast<wchar_t*>(dataBuffer.data()),
+               static_cast<size_t>(numCharsInData));
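
The terminator-vs-explicit-length distinction the review raises can be sketched in Python with ctypes (the buffer contents are invented for illustration):

```python
import ctypes

# A wide-char buffer holding "ab\0cd" plus a terminator, mimicking driver
# output that contains an embedded NUL character.
buf = ctypes.create_unicode_buffer(6)
for i, ch in enumerate("ab\x00cd"):
    buf[i] = ch

# Reading until the first NUL (what std::wstring(ptr) does) truncates:
assert ctypes.wstring_at(ctypes.addressof(buf)) == "ab"
# Passing the explicit length (std::wstring(ptr, n)) preserves the data:
assert ctypes.wstring_at(ctypes.addressof(buf), 5) == "ab\x00cd"
```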

Comment on lines +831 to +832
Py_INCREF(Py_None);
PyList_SET_ITEM(row, col - 1, Py_None);

Copilot AI Apr 3, 2026


ProcessChar’s wide-char decode failure path clears the exception and stores None in the result. This silently drops data and is inconsistent with the narrow-char path (which falls back to bytes on decode failure) and ProcessWChar (which falls back to empty string). Consider aligning behavior (e.g., return raw UTF-16 bytes or a deterministic fallback) rather than returning None.

Suggested change

-           Py_INCREF(Py_None);
-           PyList_SET_ITEM(row, col - 1, Py_None);
+           PyObject* pyBytes = PyBytes_FromStringAndSize(
+               reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
+           if (!pyBytes) {
+               Py_INCREF(Py_None);
+               PyList_SET_ITEM(row, col - 1, Py_None);
+           } else {
+               PyList_SET_ITEM(row, col - 1, pyBytes);
+           }

Comment on lines +7420 to +7421
for b in defined_bytes:
cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")

Copilot AI Apr 3, 2026


This test inserts 123 rows using a per-row execute() loop, which can add noticeable runtime on Windows CI (many round-trips). Consider batching these inserts (executemany, a single INSERT ... VALUES list, or inserting into a temp table) to keep the regression coverage without the per-row overhead.

Suggested change

-       for b in defined_bytes:
-           cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
+       cursor.executemany(
+           f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES (?, CHAR(?))",
+           [(b, b) for b in defined_bytes],
+       )

