Conversation
|
Just to clarify the description - we optimistically assume that the input is already in UTF8 and validate whether it's well-formed. If so, it's used "as is", otherwise the original replacement happens. |
|
Application that uses charset NONE rather send queries in ANSI on Windows and system locale on Linux (which nowadays is UTF-8 indeed), IMHO. |
| static_assert(CS_METADATA == CS_UTF8); | ||
| const bool isConverted = DataTypeUtil::convertToUTF8(source, transformedText, charSetId, ERRD_post); | ||
| dsqlStatement->setSqlText(FB_NEW_POOL(*statementPool) RefString(*statementPool, | ||
| isConverted ? transformedText : source)); |
There was a problem hiding this comment.
In which pool the source was allocated ?
There was a problem hiding this comment.
As I understand from the code, in defaultMemoryPool.
There was a problem hiding this comment.
I see that Parser and its transformedString allocated from scratchPool
There was a problem hiding this comment.
Line 626 explicitly allocates the resulting string in statementPool, no?
There was a problem hiding this comment.
transformedString uses a default constructor that calls getDefaultMemoryPool(), no?
There was a problem hiding this comment.
RefString is not "a reference to string", it is "a reference-counted string". It is explicitly allocated in the statement pool and is given with the statement pool for copying content from transformedText or source, so it doesn't matter where Parser and transformedString is allocated, the text will end up in the statement pool.
There was a problem hiding this comment.
We could save some time if type names in Firebird were meaningful.
There was a problem hiding this comment.
We could save some time if type names in Firebird were meaningful.
Not in this case, for sure
There was a problem hiding this comment.
BTW, I would reconsider needs to copy parser's transformedString into separate temporary string (transformedText) or, at least, allocate transformedText from short-lived scratchPool to not waste space in statementPool
src/jrd/DataTypeUtil.cpp
Outdated
|
|
||
| dst.resize(length); | ||
| } | ||
| catch (const status_exception& e) |
There was a problem hiding this comment.
Ok, i'll rename it in next commit.
And nothing will change for them.
And they will be printed correctly, instead of being replaced with question marks. |
|
Why not just replace the loop with |
Isn't it too expensive to lock a mutex every time? MutexLockGuard g(mtx, FB_FUNCTION); |
|
Why the hell we should assume not explicitly defined charset as UTF8? |
Because this is the charset we're converting the SQL text into. And because this is user-friendly.
Hmmm. And how do we currently convert WIN1251 statement with embedded _WIN1250 literals into UTF8? I guess this fails anyway? |
"Into" is different than "from". It's why a conversion exists.
It's not, we would masquerading problems in one plataform that appears in another. We should assume nothing about what character set user is using if he does not inform it in the way it should. NONE character never was there to assume the client charset.
This case actually seems to use hex strings (see |
Sure. And of course it's pointless to find out what was the original charset. But we may at least check whether it looks like UTF8 (the one we need).
Inside the engine, |
Perhaps it is cheaper than |
Every thing that looks like UTF-8 looks like NONE, as NONE have no limitations on its own.
This is different. It generally happens by individual things (a move from one place to another) and generally to an user defined thing (with user decided to create the target as UTF-8). Now we have clear thing: you used non-ASCII characters and didn't defined the connection charset, you have question marks as output. In this PR is a generalization for a whole statement text to different internal things (metadata text, trace output, monitoring) is done. And it does it using assumptions that may be wrong. A user in Windows using WIN1252 will think this is a bug and that we should convert it from WIN1252 to UTF8 instead of this. |
Yes, it's clear. But in one particular case (input is in fact in UTF8) we could handle it better, but decided to not. This is what I call being not so friendly. As for me, this clear statement won't become much harder if re-stated as "you used non-ASCII characters and didn't defined the connection charset and your actual encoding wasn't UTF8". But maybe this is just me. |
"It's wrong these question marks. Let me discover what is wrong. Ah, I'm not using the important connection charset. Now I have this problem and many others fixed." |
|
I would suggest to close this PR and try a different approach: in absence of explicit |
Or they may not notice the problem immediately. And PSQL sources are stored as unrecoverably broken. And DBA cannot get anything meaningful from the trace logs. While it could be avoided. All these use cases (stored PSQL text, monitoring, trace) are IMO expected to preserve as much original information as possible. Giving up at something "possibly unknown" without checking whether it could be actually processed is easy but somewhat underdone. And honestly, I don't buy the Linux/Windows argument at all. It's not about the system locale per se, it's about the missing attachment charset. Data could be sent in UTF8 even on Windows (e.g. from a Java application), it's not necessarily comes from CLI tools like ISQL. |
|
Wouldn't it be better to throw an error that no charset was provided during the connection? And add compatibility flag in the firebird.conf to support NONE? |
Problem
When processing text data without an explicitly set encoding, non-ASCII characters were unconditionally replaced with
'?'. This caused irreversible data loss — for example, usernames containing Cyrillic or other multibyte characters appeared corrupted in trace output.Request
Configs
firebird.conf
fbtrace.conf
database { enabled = true log_filename = fbtrace.log log_statement_finish = true time_threshold = 0 }Before
After
Fix
Instead of unconditionally replacing unrecognized characters with
'?', the code now first attempts to convert the text to UTF-8. The'?'replacement is used only as a fallback if the conversion fails.