Skip to content

init commit, tldr the cache_dir or files should probably indicate tab…#835

Open
jhnwu3 wants to merge 15 commits intomasterfrom
fix/base_dataset_cache_name
Open

init commit, tldr the cache_dir or files should probably indicate tab…#835
jhnwu3 wants to merge 15 commits intomasterfrom
fix/base_dataset_cache_name

Conversation

@jhnwu3
Copy link
Collaborator

@jhnwu3 jhnwu3 commented Feb 9, 2026

The problem here is it can be quite confusing for the user to know exactly what tables are cached inside each cache file when they want to specify a new location.

@jhnwu3
Copy link
Collaborator Author

jhnwu3 commented Feb 9, 2026

Don't know how to change this to a draft. But, will keep iterating on this as I move along.

@EricSchrock EricSchrock marked this pull request as draft February 9, 2026 23:15
@jhnwu3 jhnwu3 requested review from EricSchrock and Logiquo and removed request for Logiquo February 11, 2026 02:12
@EricSchrock EricSchrock marked this pull request as ready for review February 15, 2026 19:46
@EricSchrock
Copy link
Collaborator

EricSchrock commented Feb 15, 2026

@jhnwu3 and @Logiquo I believe the updates I pushed fix both concerns while passing existing tests. I also added new tests for the new caching behavior. Let me know what you think.

  1. Datasets with different configurations (root, tables, name, or dev value) get different caches.
  2. Tasks with different configurations (names, schemas, or arg values) get different caches. Also, since default task caches are nested in dataset caches, running the same task on datasets with different configurations will result in different task caches, unless you override the default task cache path.

Maybe we feed the dataset UUID into the task UUID generation so that if someone overrides the default task path, task caching will still be sensitive to dataset configuration?

Or maybe we should remove the ability for users to override the default task cache? Do they really need to be able to override both the dataset and task cache paths?

@EricSchrock EricSchrock requested a review from Logiquo February 15, 2026 20:51
Copy link
Collaborator

@EricSchrock EricSchrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the one thing left is to update the benchmarking examples. Without the ability to set the task cache, the math for calculating cache usage needs to be updated.

@EricSchrock EricSchrock requested a review from Logiquo February 16, 2026 04:23
Copy link
Collaborator

@Logiquo Logiquo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jhnwu3 @EricSchrock

@jhnwu3
Copy link
Collaborator Author

jhnwu3 commented Feb 16, 2026

@EricSchrock I've updated benchmark scripts to use the correct cache_dir/tasks/ layout for calculating cache sizes. Let me know if this works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants