Skip to content

Charset encoding fix#8955

Open
Ev3nt wants to merge 2 commits intoFirebirdSQL:masterfrom
Ev3nt:charset_encoding_fix
Open

Charset encoding fix#8955
Ev3nt wants to merge 2 commits intoFirebirdSQL:masterfrom
Ev3nt:charset_encoding_fix

Conversation

@Ev3nt
Copy link
Contributor

@Ev3nt Ev3nt commented Mar 26, 2026

Problem

When processing text data without an explicitly set encoding, non-ASCII characters were unconditionally replaced with '?'. This caused irreversible data loss — for example, usernames containing Cyrillic or other multibyte characters appeared corrupted in trace output.

Request

create user user1 password 'user1' firstname 'Абоба';

Configs

firebird.conf

AuditTraceConfigFile = fbtrace.conf
TracePlugin = fbtrace

fbtrace.conf

database
{
  enabled = true
  log_filename = fbtrace.log
  log_statement_finish = true
  time_threshold = 0
}

Before

create user user1 password 'user1' firstname '??????????'
...
param3 = varchar(128), "??????????"

After

create user user1 password 'user1' firstname 'Абоба'
...
param3 = varchar(128), "Абоба"

Fix

Instead of unconditionally replacing unrecognized characters with '?', the code now first attempts to convert the text to UTF-8. The '?' replacement is used only as a fallback if the conversion fails.

@dyemanov
Copy link
Member

Just to clarify the description - we optimistically assume that the input is already in UTF8 and validate whether it's well-formed. If so, it's used "as is", otherwise the original replacement happens.

@aafemt
Copy link
Contributor

aafemt commented Mar 26, 2026

Application that uses charset NONE rather send queries in ANSI on Windows and system locale on Linux (which nowadays is UTF-8 indeed), IMHO.

static_assert(CS_METADATA == CS_UTF8);
const bool isConverted = DataTypeUtil::convertToUTF8(source, transformedText, charSetId, ERRD_post);
dsqlStatement->setSqlText(FB_NEW_POOL(*statementPool) RefString(*statementPool,
isConverted ? transformedText : source));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which pool the source was allocated ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand from the code, in defaultMemoryPool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that Parser and its transformedString allocated from scratchPool

Copy link
Contributor

@aafemt aafemt Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 626 explicitly allocates the resulting string in statementPool, no?

Copy link
Contributor Author

@Ev3nt Ev3nt Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transformedString uses a default constructor that calls getDefaultMemoryPool(), no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RefString is not "a reference to string", it is "a reference-counted string". It is explicitly allocated in the statement pool and is given with the statement pool for copying content from transformedText or source, so it doesn't matter where Parser and transformedString is allocated, the text will end up in the statement pool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally something meaningful...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could save some time if type names in Firebird were meaningful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could save some time if type names in Firebird were meaningful.

Not in this case, for sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I would reconsider needs to copy parser's transformedString into separate temporary string (transformedText) or, at least, allocate transformedText from short-lived scratchPool to not waste space in statementPool


dst.resize(length);
}
catch (const status_exception& e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ex, please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i'll rename it in next commit.

@dyemanov
Copy link
Member

Application that uses charset NONE rather send queries in ANSI on Windows

And nothing will change for them.

and system locale on Linux (which nowadays is UTF-8 indeed)

And they will be printed correctly, instead of being replaced with question marks.

@aafemt
Copy link
Contributor

aafemt commented Mar 26, 2026

Why not just replace the loop with ISC_SystemToUtf8 call? It would please users of both platforms.

@Ev3nt
Copy link
Contributor Author

Ev3nt commented Mar 26, 2026

Why not just replace the loop with ISC_SystemToUtf8 call? It would please users of both platforms.

Isn't it too expensive to lock a mutex every time?

MutexLockGuard g(mtx, FB_FUNCTION);

@Ev3nt Ev3nt requested review from aafemt and hvlad March 26, 2026 12:00
@asfernandes
Copy link
Member

Why the hell we should assume not explicitly defined charset as UTF8?
This makes no sense IMO.
This is also not used only for NONE charset. There is "introducer" syntax that could mix data of different encodings in the text.
This PR should be rejected IMO.

@dyemanov
Copy link
Member

dyemanov commented Mar 26, 2026

Why the hell we should assume not explicitly defined charset as UTF8?

Because this is the charset we're converting the SQL text into. And because this is user-friendly.

There is "introducer" syntax that could mix data of different encodings in the text.

Hmmm. And how do we currently convert WIN1251 statement with embedded _WIN1250 literals into UTF8? I guess this fails anyway?

@asfernandes
Copy link
Member

Why the hell we should assume not explicitly defined charset as UTF8?

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

And because this is user-friendly.

It's not, we would masquerading problems in one plataform that appears in another. We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

There is "introducer" syntax that could mix data of different encodings in the text.

Hmmm. And how do we currently convert WIN1251 statement with embedded _WIN1250 literals into UTF8? I guess this fails anyway?

This case actually seems to use hex strings (see introducedMarks.

@dyemanov
Copy link
Member

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

Sure. And of course it's pointless to find out what was the original charset. But we may at least check whether it looks like UTF8 (the one we need).

We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

Inside the engine, CS_NONE / CS_BINARY strings are converted (into whatever else) using a simple byte copy. We don't explicitly assume anything. But if the destination charset is UTF8, we assume the result will be in UTF8. And INTL_convert_bytes() we use for transliteration explicitly calls toCharSet->wellFormed() to ensure that. This means we implicitly assume the source string being in UTF8 too. I see zero difference with this PR.

@aafemt
Copy link
Contributor

aafemt commented Mar 26, 2026

Isn't it too expensive to lock a mutex every time?

Perhaps it is cheaper than iconv_open but I'm not sure.

@asfernandes
Copy link
Member

Because this is the charset we're converting the SQL text into.

"Into" is different than "from". It's why a conversion exists.

Sure. And of course it's pointless to find out what was the original charset. But we may at least check whether it looks like UTF8 (the one we need).

Every thing that looks like UTF-8 looks like NONE, as NONE have no limitations on its own.
It may looks like as others charsets too.

We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.

Inside the engine, CS_NONE / CS_BINARY strings are converted (into whatever else) using a simple byte copy. We don't explicitly assume anything. But if the destination charset is UTF8, we assume the result will be in UTF8. And INTL_convert_bytes() we use for transliteration explicitly calls toCharSet->wellFormed() to ensure that. This means we implicitly assume the source string being in UTF8 too. I see zero difference with this PR.

This is different. It generally happens by individual things (a move from one place to another) and generally to an user defined thing (with user decided to create the target as UTF-8).

Now we have clear thing: you used non-ASCII characters and didn't defined the connection charset, you have question marks as output.

In this PR is a generalization for a whole statement text to different internal things (metadata text, trace output, monitoring) is done. And it does it using assumptions that may be wrong. A user in Windows using WIN1252 will think this is a bug and that we should convert it from WIN1252 to UTF8 instead of this.

@dyemanov
Copy link
Member

Now we have clear thing: you used non-ASCII characters and didn't defined the connection charset, you have question marks as output.

Yes, it's clear. But in one particular case (input is in fact in UTF8) we could handle it better, but decided to not. This is what I call being not so friendly.

As for me, this clear statement won't become much harder if re-stated as "you used non-ASCII characters and didn't defined the connection charset and your actual encoding wasn't UTF8". But maybe this is just me.

@asfernandes
Copy link
Member

Yes, it's clear. But in one particular case (input is in fact in UTF8) we could handle it better, but decided to not. This is what I call being not so friendly.

"It's wrong these question marks. Let me discover what is wrong. Ah, I'm not using the important connection charset. Now I have this problem and many others fixed."

@aafemt
Copy link
Contributor

aafemt commented Mar 26, 2026

I would suggest to close this PR and try a different approach: in absence of explicit isc_dpb_lc_ctype in DPB, client library should add it with the value corresponding to current locale (ANSI codepage). This way the solution would be more general and handle cases of clients with different locales.

@dyemanov
Copy link
Member

"It's wrong these question marks. Let me discover what is wrong. Ah, I'm not using the important connection charset. Now I have this problem and many others fixed."

Or they may not notice the problem immediately. And PSQL sources are stored as unrecoverably broken. And DBA cannot get anything meaningful from the trace logs. While it could be avoided.

All these use cases (stored PSQL text, monitoring, trace) are IMO expected to preserve as much original information as possible. Giving up at something "possibly unknown" without checking whether it could be actually processed is easy but somewhat underdone.

And honestly, I don't buy the Linux/Windows argument at all. It's not about the system locale per se, it's about the missing attachment charset. Data could be sent in UTF8 even on Windows (e.g. from a Java application), it's not necessarily comes from CLI tools like ISQL.

@livius2
Copy link

livius2 commented Mar 26, 2026

Wouldn't it be better to throw an error that no charset was provided during the connection? And add compatibility flag in the firebird.conf to support NONE?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants