Skip to content

fix regexp_count should count empty-pattern matches#22311

Open
xiedeyantu wants to merge 2 commits into
apache:mainfrom
xiedeyantu:regexp_count
Open

fix regexp_count should count empty-pattern matches#22311
xiedeyantu wants to merge 2 commits into
apache:mainfrom
xiedeyantu:regexp_count

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

regexp_count did not handle empty regular-expression patterns correctly. An empty pattern should be counted as valid matches instead of returning 0. This also affects calls that use the start argument and certain flag combinations.

What changes are included in this PR?

  • Fix regexp_count so empty-pattern matches are counted correctly.
  • Adjust start handling so character offsets are computed correctly.
  • Update unit tests and sqllogictest coverage for empty patterns, start, and flags.
  • Update expected results to match the corrected behavior.

Are these changes tested?

  • Yes. Rust unit tests were updated.
  • Yes. Sqllogictest coverage was added/updated.
  • I also ran:
    • cargo test -p datafusion-sqllogictest --test sqllogictests table_functions
    • cargo test -p datafusion-sqllogictest --test sqllogictests scalar

Are there any user-facing changes?

  • Yes. regexp_count now returns correct counts for empty patterns, so results may differ from the previous behavior.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels May 17, 2026
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiedeyantu
Thanks for the fix here. I walked through the updated behavior and everything looks good overall. I just have one small suggestion around expanding the SQL regression coverage for the user-visible paths.

----
4

query I
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition to cover regexp_count('abc', ''). Since this change also affects the start-position and flag handling paths, it would be great to add a couple more SQL-visible cases here as well, like regexp_count('abc', '', 2) and regexp_count('abc', '', 1, 'i'). It may also be worth covering the one-past-end or beyond-end boundary behavior if that is intentional. The Rust unit tests already exercise some of the internals, but adding these to the SLT suite would help lock in the user-facing behavior that motivated this PR.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review! I have added some tests in this slt file. Please take a look again.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can refer to the execution results of PGSQL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PostgreSQL compatibility: regexp_count should count empty-pattern matches

2 participants