Charset encoding fix by Ev3nt · Pull Request #8955 · FirebirdSQL/firebird

Ev3nt · 2026-03-26T06:56:16Z

Problem

When processing text data without an explicitly set encoding, non-ASCII characters were unconditionally replaced with '?'. This caused irreversible data loss — for example, usernames containing Cyrillic or other multibyte characters appeared corrupted in trace output.

Request

create user user1 password 'user1' firstname 'Абоба';

Configs

firebird.conf

AuditTraceConfigFile = fbtrace.conf
TracePlugin = fbtrace

fbtrace.conf

database
{
  enabled = true
  log_filename = fbtrace.log
  log_statement_finish = true
  time_threshold = 0
}

Before

create user user1 password 'user1' firstname '??????????'
...
param3 = varchar(128), "??????????"

After

create user user1 password 'user1' firstname 'Абоба'
...
param3 = varchar(128), "Абоба"

Fix

Instead of unconditionally replacing unrecognized characters with '?', the code now first attempts to convert the text to UTF-8. The '?' replacement is used only as a fallback if the conversion fails.

dyemanov · 2026-03-26T07:03:48Z

Just to clarify the description - we optimistically assume that the input is already in UTF8 and validate whether it's well-formed. If so, it's used "as is", otherwise the original replacement happens.

aafemt · 2026-03-26T08:19:00Z

Application that uses charset NONE rather send queries in ANSI on Windows and system locale on Linux (which nowadays is UTF-8 indeed), IMHO.

hvlad · 2026-03-26T08:18:43Z

src/dsql/dsql.cpp

+			static_assert(CS_METADATA == CS_UTF8);
+			const bool isConverted = DataTypeUtil::convertToUTF8(source, transformedText, charSetId, ERRD_post);
+			dsqlStatement->setSqlText(FB_NEW_POOL(*statementPool) RefString(*statementPool,
+				isConverted ? transformedText : source));


In which pool the source was allocated ?

As I understand from the code, in defaultMemoryPool.

I see that Parser and its transformedString allocated from scratchPool

Line 626 explicitly allocates the resulting string in statementPool, no?

transformedString uses a default constructor that calls getDefaultMemoryPool(), no?

RefString is not "a reference to string", it is "a reference-counted string". It is explicitly allocated in the statement pool and is given with the statement pool for copying content from transformedText or source, so it doesn't matter where Parser and transformedString is allocated, the text will end up in the statement pool.

Finally something meaningful...

We could save some time if type names in Firebird were meaningful.

We could save some time if type names in Firebird were meaningful.

Not in this case, for sure

BTW, I would reconsider needs to copy parser's transformedString into separate temporary string (transformedText) or, at least, allocate transformedText from short-lived scratchPool to not waste space in statementPool

hvlad · 2026-03-26T08:19:21Z

src/jrd/DataTypeUtil.cpp


 		dst.resize(length);
 	}
+	catch (const status_exception& e)


ex, please

Ok, i'll rename it in next commit.

src/jrd/intl.h

dyemanov · 2026-03-26T08:34:42Z

Application that uses charset NONE rather send queries in ANSI on Windows

And nothing will change for them.

and system locale on Linux (which nowadays is UTF-8 indeed)

And they will be printed correctly, instead of being replaced with question marks.

aafemt · 2026-03-26T08:58:42Z

Why not just replace the loop with ISC_SystemToUtf8 call? It would please users of both platforms.

Ev3nt · 2026-03-26T09:23:02Z

Why not just replace the loop with ISC_SystemToUtf8 call? It would please users of both platforms.

Isn't it too expensive to lock a mutex every time?

MutexLockGuard g(mtx, FB_FUNCTION);

asfernandes · 2026-03-26T12:21:05Z

Why the hell we should assume not explicitly defined charset as UTF8?
This makes no sense IMO.
This is also not used only for NONE charset. There is "introducer" syntax that could mix data of different encodings in the text.
This PR should be rejected IMO.

dyemanov · 2026-03-26T12:36:24Z

Why the hell we should assume not explicitly defined charset as UTF8?

Because this is the charset we're converting the SQL text into. And because this is user-friendly.

There is "introducer" syntax that could mix data of different encodings in the text.

Hmmm. And how do we currently convert WIN1251 statement with embedded _WIN1250 literals into UTF8? I guess this fails anyway?

asfernandes · 2026-03-26T12:45:10Z

Why the hell we should assume not explicitly defined charset as UTF8?

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

And because this is user-friendly.

It's not, we would masquerading problems in one plataform that appears in another. We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

There is "introducer" syntax that could mix data of different encodings in the text.

Hmmm. And how do we currently convert WIN1251 statement with embedded _WIN1250 literals into UTF8? I guess this fails anyway?

This case actually seems to use hex strings (see introducedMarks.

dyemanov · 2026-03-26T13:06:46Z

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

Sure. And of course it's pointless to find out what was the original charset. But we may at least check whether it looks like UTF8 (the one we need).

We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

Inside the engine, CS_NONE / CS_BINARY strings are converted (into whatever else) using a simple byte copy. We don't explicitly assume anything. But if the destination charset is UTF8, we assume the result will be in UTF8. And INTL_convert_bytes() we use for transliteration explicitly calls toCharSet->wellFormed() to ensure that. This means we implicitly assume the source string being in UTF8 too. I see zero difference with this PR.

aafemt · 2026-03-26T13:17:52Z

Isn't it too expensive to lock a mutex every time?

Perhaps it is cheaper than iconv_open but I'm not sure.

asfernandes · 2026-03-26T13:50:40Z

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

Sure. And of course it's pointless to find out what was the original charset. But we may at least check whether it looks like UTF8 (the one we need).

Every thing that looks like UTF-8 looks like NONE, as NONE have no limitations on its own.
It may looks like as others charsets too.

We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

Inside the engine, CS_NONE / CS_BINARY strings are converted (into whatever else) using a simple byte copy. We don't explicitly assume anything. But if the destination charset is UTF8, we assume the result will be in UTF8. And INTL_convert_bytes() we use for transliteration explicitly calls toCharSet->wellFormed() to ensure that. This means we implicitly assume the source string being in UTF8 too. I see zero difference with this PR.

This is different. It generally happens by individual things (a move from one place to another) and generally to an user defined thing (with user decided to create the target as UTF-8).

Now we have clear thing: you used non-ASCII characters and didn't defined the connection charset, you have question marks as output.

In this PR is a generalization for a whole statement text to different internal things (metadata text, trace output, monitoring) is done. And it does it using assumptions that may be wrong. A user in Windows using WIN1252 will think this is a bug and that we should convert it from WIN1252 to UTF8 instead of this.

dyemanov · 2026-03-26T14:04:26Z

Now we have clear thing: you used non-ASCII characters and didn't defined the connection charset, you have question marks as output.

Yes, it's clear. But in one particular case (input is in fact in UTF8) we could handle it better, but decided to not. This is what I call being not so friendly.

As for me, this clear statement won't become much harder if re-stated as "you used non-ASCII characters and didn't defined the connection charset and your actual encoding wasn't UTF8". But maybe this is just me.

asfernandes · 2026-03-26T14:38:57Z

Yes, it's clear. But in one particular case (input is in fact in UTF8) we could handle it better, but decided to not. This is what I call being not so friendly.

"It's wrong these question marks. Let me discover what is wrong. Ah, I'm not using the important connection charset. Now I have this problem and many others fixed."

aafemt · 2026-03-26T14:44:36Z

I would suggest to close this PR and try a different approach: in absence of explicit isc_dpb_lc_ctype in DPB, client library should add it with the value corresponding to current locale (ANSI codepage). This way the solution would be more general and handle cases of clients with different locales.

dyemanov · 2026-03-26T19:59:05Z

"It's wrong these question marks. Let me discover what is wrong. Ah, I'm not using the important connection charset. Now I have this problem and many others fixed."

Or they may not notice the problem immediately. And PSQL sources are stored as unrecoverably broken. And DBA cannot get anything meaningful from the trace logs. While it could be avoided.

All these use cases (stored PSQL text, monitoring, trace) are IMO expected to preserve as much original information as possible. Giving up at something "possibly unknown" without checking whether it could be actually processed is easy but somewhat underdone.

And honestly, I don't buy the Linux/Windows argument at all. It's not about the system locale per se, it's about the missing attachment charset. Data could be sent in UTF8 even on Windows (e.g. from a Java application), it's not necessarily comes from CLI tools like ISQL.

livius2 · 2026-03-26T21:58:45Z

Wouldn't it be better to throw an error that no charset was provided during the connection? And add compatibility flag in the firebird.conf to support NONE?

Charset encoding fix

f28c1a3

hvlad reviewed Mar 26, 2026

View reviewed changes

Fix integer truncation in convertLength buffer size calculation

dc22270

Ev3nt requested review from aafemt and hvlad March 26, 2026 12:00

Uh oh!

Conversation

Ev3nt commented Mar 26, 2026

Problem

Request

Configs

firebird.conf

fbtrace.conf

Before

After

Fix

Uh oh!

dyemanov commented Mar 26, 2026

Uh oh!

aafemt commented Mar 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aafemt Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ev3nt Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dyemanov commented Mar 26, 2026

Uh oh!

aafemt commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ev3nt commented Mar 26, 2026

Uh oh!

asfernandes commented Mar 26, 2026

Uh oh!

dyemanov commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asfernandes commented Mar 26, 2026

Uh oh!

dyemanov commented Mar 26, 2026

Uh oh!

aafemt commented Mar 26, 2026

Uh oh!

asfernandes commented Mar 26, 2026

Uh oh!

dyemanov commented Mar 26, 2026

Uh oh!

asfernandes commented Mar 26, 2026

Uh oh!

aafemt commented Mar 26, 2026

Uh oh!

dyemanov commented Mar 26, 2026

Uh oh!

livius2 commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

aafemt Mar 26, 2026 •

edited

Loading

Ev3nt Mar 26, 2026 •

edited

Loading

aafemt commented Mar 26, 2026 •

edited

Loading

dyemanov commented Mar 26, 2026 •

edited

Loading