MDEV-39970 Fix JSON_KEYS failing to deduplicate keys with escaped cha…#5245
MDEV-39970 Fix JSON_KEYS failing to deduplicate keys with escaped cha…#5245akshatnehra wants to merge 1 commit into
Conversation
…racters check_key_in_list() compares keys byte-by-byte looking for " as the key delimiter. When keys contain escaped characters (like \" or \\), the unescaped " inside the key is misidentified as the end-of-key delimiter, causing false mismatches and failing to deduplicate identical keys. Fix: handle escape sequences in both the comparison loop and the skip-to-end loop. When a backslash is encountered, consume it along with the following character as a single escaped unit, preventing the escaped character from being mistaken for a delimiter. All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.
There was a problem hiding this comment.
Code Review
This pull request addresses MDEV-39970, fixing an issue where JSON_KEYS fails to deduplicate keys containing escaped characters. It updates the key comparison logic in check_key_in_list within sql/item_jsonfunc.cc to correctly handle backslashes and escaped quotes, and adds comprehensive test cases. The review feedback highlights critical safety issues in the new logic, specifically pointing out potential out-of-bounds reads and false-positive matches during key comparison, as well as a potential infinite loop or out-of-bounds read when skipping non-matching keys in malformed JSON strings. Both comments suggest adding robust bounds checks to prevent these issues.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for (n_char=0; n_char < key_len; n_char++) | ||
| { | ||
| if (c[n_char] == '\\') | ||
| { | ||
| if (key[n_char] != '\\') | ||
| break; | ||
| n_char++; | ||
| if (n_char >= key_len || c[n_char] != key[n_char]) | ||
| break; | ||
| continue; | ||
| } | ||
| if (c[n_char] == '"') | ||
| break; | ||
| if (c[n_char] != key[n_char]) | ||
| break; | ||
| } |
There was a problem hiding this comment.
There are two issues here:
- Out-of-bounds read: If
key_lenis larger than the remaining characters inc(i.e.,c + n_char > end), accessingc[n_char]can lead to an out-of-bounds read. - False positive match: If
keyends with a backslash (e.g.,a\) andchas an escaped character at that position (e.g.,a\"),n_charis incremented tokey_lenand we break. However, sincec[n_char]is"(the escaped quote), the post-loop checkn_char == key_len && c[n_char] == '"'succeeds, leading to a false positive match.
We should add bounds checks and decrement n_char on premature break to prevent false matches.
for (n_char=0; n_char < key_len; n_char++)
{
if (c + n_char >= end)
break;
if (c[n_char] == '\\')
{
if (key[n_char] != '\\')
break;
n_char++;
if (n_char >= key_len || c + n_char >= end || c[n_char] != key[n_char])
{
n_char--;
break;
}
continue;
}
if (c[n_char] == '"')
break;
if (c[n_char] != key[n_char])
break;
}| else | ||
| { | ||
| while (c[n_char] != '"') | ||
| { | ||
| if (c[n_char] == '\\') | ||
| n_char++; | ||
| n_char++; | ||
| } | ||
| } |
There was a problem hiding this comment.
If the JSON string in res is somehow malformed or truncated, the loop while (c[n_char] != '"') can increment n_char past end, leading to an out-of-bounds read or infinite loop. We should add a bounds check c + n_char < end to ensure safety.
else
{
while (c + n_char < end && c[n_char] != '"')
{
if (c[n_char] == '\\')
{
n_char++;
if (c + n_char >= end)
break;
}
n_char++;
}
}
gkodinov
left a comment
There was a problem hiding this comment.
Thank you for your contribution! This is a preliminary review.
This needs a LOT more test cases covering various combinations. One such is mentioned below, but there's more to be sure.
Unless you're willing to adjust your ambition level, as suggested below.
| while (c < end) | ||
| { | ||
| int n_char; | ||
| for (n_char=0; c[n_char] != '"' && n_char < key_len; n_char++) |
There was a problem hiding this comment.
I believe this feels like a hack: it's not using the official json_read_keyname_chr() call to read a char from a string. And it's incomplete: see json_handle_esc for what is being missed: e.g. \u vs normal char etc.
I believe that a proper solution should be that a normalized set of search strings must be built. And that the incoming strings need to be normalized before being searched. If we want to preserve de-duplication that is. Read more on that below please.
The deduplication is not an official documented feature of JSON_KEYS. It's not mentioned neither in the MariaDB docs nor in the MySQL docs.
Theoretically, duplicate keys in a JSON object, i.e. {"a": 1, "a":2} is not valid JSON. So I honestly do not know why is the attempt made of processing that here.
I'd just extract the keys and be done with it: no duplicate checks. Why would one function check for invalid JSON in their arguments and the others wont?
This probably is something for the final reviewers to worry about though. The above is just my honest opinion.
| { | ||
| if (c[n_char] == '\\') | ||
| { | ||
| if (key[n_char] != '\\') |
There was a problem hiding this comment.
Why is a double quote char a "break"? It usually is considered a literal quote.
Please try {"a\\\\b": 1, "a\\\\c": 2}. I'd be surprised if it worked correctly.
There was a problem hiding this comment.
forgot to mention: this needs to be fixed in all affected versions (it's a bug). So please rebase to 10.11
Description
This PR fixes MDEV-39970 where
JSON_KEYS()fails to deduplicate keys that contain escaped characters (like\"or\\).check_key_in_list()compares keys byte-by-byte using"as the end-of-key delimiter. When keys contain escape sequences (e.g.,a\"b), the\"inside the key is misidentified as the closing delimiter, causing the comparison to mismatch and identical keys to appear as duplicates in the output.The fix involves:
\is encountered, the next character is consumed as part of the escape sequence rather than checked as a delimiterHow can this PR be tested?
Run the MTR test:
Or manually:
Results from my testing
Duplicate key
a"bappears twice — deduplication failed.Duplicate key correctly deduplicated to one entry.
Basing the PR against the correct MariaDB version
PR quality check
CODING_STANDARDS.mdfile and my PR conforms to this where appropriate.