Skip to content

fix(datafusion): keep hot runs in one session#833

Open
kumarUjjawal wants to merge 1 commit intoClickHouse:mainfrom
kumarUjjawal:fix/datafusion_hot_run
Open

fix(datafusion): keep hot runs in one session#833
kumarUjjawal wants to merge 1 commit intoClickHouse:mainfrom
kumarUjjawal:fix/datafusion_hot_run

Conversation

@kumarUjjawal
Copy link
Copy Markdown

Summary

This fixes the partitioned DataFusion runner so hot runs are really hot. See apache/datafusion#21696

Before this change, the script started a new datafusion-cli process for every try. That made each try use a cold process, so tries 2 and 3 did not keep the warmed metadata cache.

What changed

  • drop OS cache once before each query
  • build one SQL file with create.sql and the same query 3 times
  • run that file in one datafusion-cli session
  • skip the setup Elapsed lines from create.sql
  • keep failed later tries as null instead of mixing in setup timings

This follows the same main idea used by the DuckDB partitioned runner: one process per query, with repeated runs inside that process.

Thank you for review 🙏

Thank You for Your Contribution!

We appreciate your effort and contribution to the project. To ensure that your Pull Request (PR) adheres to our guidelines, please ensure to review the rules mentioned in our contribution guidelines:

ClickHouse/ClickBench Contribution Rules

Thank you for your attention to these details and for helping us maintain the quality and integrity of the project.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 20, 2026

CLA assistant check
All committers have signed the CLA.

@kumarUjjawal
Copy link
Copy Markdown
Author

@qoega If I could get your eyes on this?

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 27, 2026

Hi @kumarUjjawal -- thank you. How did you test this change?

Maybe it would be a good exercise to try and get DataFusion 53 numbers with this script? That way we both double check the script works and update the results for DF 53

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 27, 2026

I can try and help do this (run for DF 53) if needed

@kumarUjjawal
Copy link
Copy Markdown
Author

kumarUjjawal commented Apr 27, 2026

Hi @kumarUjjawal -- thank you. How did you test this change?

Maybe it would be a good exercise to try and get DataFusion 53 numbers with this script? That way we both double check the script works and update the results for DF 53

I asked Codex to build a small bash script for local test.

It used a fake datafusion-cli output:

  • one datafusion-cli session per query
  • setup timings from create.sql are skipped
  • if the shared session fails after try 1, the output is [t1, null, null]

If you could help run the DF 53 bench that would be great. 🙏

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 27, 2026

Will try later this week -- basically we need to run it on aws.

@kumarUjjawal
Copy link
Copy Markdown
Author

Thank you!

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 30, 2026

I am somewhat overwhelmed this week with other things. I'll see what I can do later.

Maybe @pmcgleenon has some time to review / rerun the numbers (he used to be our star number runner 🦾 )

@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented May 1, 2026

I'm running the benchmarks now, I'll test this change out.

@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented May 1, 2026

@kumarUjjawal this seems to also be true for the non-partitioned version.
FWIW, a lot of the difference an OS level cache that does persist between runs. running the current setup definitely shows a difference between the first run of every query and the other two.

[0.066, 0.001, 0.001],
[0.142, 0.036, 0.035],
[0.193, 0.060, 0.059],
...

@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented May 1, 2026

One weird thing that this change has caused consistently (for both partitioned and single) - the last iteration of q23 (which has a dynamic filter) becomes a lot slower! Something like [52.698, 8.065, 37.231] where the current submissions (and my updated ones with the current scripts) all have a similar patterns of [~55, ~9, ~9].

@kumarUjjawal
Copy link
Copy Markdown
Author

Is there some deeper issue here as q23 shouldn't behave like this right?

@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented May 1, 2026

yeah definitely, might be already fixed upstream. For now I'm still running everything with the existing scripts, but once I get through that I'll go back and check if it reproduces with DataFusion's latest main.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 1, 2026

One weird thing that this change has caused consistently (for both partitioned and single) - the last iteration of q23 (which has a dynamic filter) becomes a lot slower! Something like [52.698, 8.065, 37.231] where the current submissions (and my updated ones with the current scripts) all have a similar patterns of [~55, ~9, ~9].

I think they are very dependent on timing and when the file is opened vs when the top value is seen

If the last value is seen quickly enough then it can skip several files entirely, but if the top value isnt' seen fast enough then the files get processed

Copy link
Copy Markdown
Member

@rschu1ze rschu1ze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for symmetry, it would make sense to apply the same changes to datafusion/run.sh.


echo $1
cat queries.sql | while read -r query; do
while IFS= read -r query || [ -n "$query" ]; do
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what IFS means.

echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

echo "$query" > /tmp/query.sql
QUERY_FILE="${TMP_DIR}/query-${QUERY_NUM}.sql"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is too much copying going on here, at least for my taste, and there are no code comments that explain the rationale. Please try to simplify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants