Skip to content

HIVE-29465:Prevent excessive query results cache usage at runtime#6376

Open
ramitg254 wants to merge 1 commit intoapache:masterfrom
ramitg254:HIVE-29465
Open

HIVE-29465:Prevent excessive query results cache usage at runtime#6376
ramitg254 wants to merge 1 commit intoapache:masterfrom
ramitg254:HIVE-29465

Conversation

@ramitg254
Copy link
Copy Markdown
Contributor

@ramitg254 ramitg254 commented Mar 18, 2026

What changes were proposed in this pull request?

Introducing safe cache writing conf which is when enabled writing to cache directory should not happen dirctly and if the entry is valid then only that entry should be copied to cache directory.

Why are the changes needed?

spilling of cache directory was happening when query as cleanup is done in the post execution.

Does this PR introduce any user-facing change?

No

How was this patch tested?

locally and ci test

@ramitg254 ramitg254 changed the title [WIP]HIVE-29465:Prevent excessive query results cache usage at runtime [WIP] HIVE-29465:Prevent excessive query results cache usage at runtime Mar 18, 2026
@ramitg254 ramitg254 changed the title [WIP] HIVE-29465:Prevent excessive query results cache usage at runtime HIVE-29465:Prevent excessive query results cache usage at runtime Mar 23, 2026
@ramitg254 ramitg254 changed the title HIVE-29465:Prevent excessive query results cache usage at runtime [WIP]HIVE-29465:Prevent excessive query results cache usage at runtime Mar 23, 2026
@ramitg254 ramitg254 changed the title [WIP]HIVE-29465:Prevent excessive query results cache usage at runtime HIVE-29465:Prevent excessive query results cache usage at runtime Mar 23, 2026
@sonarqubecloud
Copy link
Copy Markdown

return false;
}

if (isSafeCacheWriteEnabled) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramitg254 : thanks for working on this so far
I'm not sure if the approach fully addresses what has been reported: as far as I can understand, there is a safe buffer directory, where the files are placed, and this safe folder is on the same storage, but that's not the only issue, it's rather that this doesn't prevent big files actually landing on the filesystem that holds the cache

the original report showed something like this:

du -h -d 1 /efs/tmp/hive/_resultscache_/results-9d89cc59-c99d-46a5-9d93-2b550576532012.0K ./66356edb-57a6-4f0a-90cd-7d14d9e2b739
...
1.1T ./0fe343fb-6a89-4d28-b2fd-caed2f2e42f6
...
1.1T .

I missed something to double-check before creating the jira: if the "0fe343fb-6a89-4d28-b2fd-caed2f2e42f6" folder belongs to a finished query result? if so - and given that it clearly exceeded the configured 2G max cache size - query results cache should have taken care of that, so I think the original problem/usecase should be investigated thoroughly first

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abstractdog I have written down my understanding of the problem below please validate:

Problem statement -> so when cache is enabled cache entry validation takes place at post execution causing cache entry already been written to cache directory in runtime exceeding max cache size which should not happen.

Replication of Problem Statement scenario -> I have added unit test which replicate this problem statemnt in TestCachedResults.java‎ in which for the same set of queries testUnsafeCacheWrite passes when at runtime cache directory size at runtime increases beyond max cache size allowed which is the current behaviour we have and the other is testSafeCacheWrite which is solving this issue for the same set of queries and not exceeding at any moment on runtime beyond max cache size allowed.

Solution: we are introducing safe cache write conf which is when enabled then query files for fetch work does not directly written to cache directory but we proceed it as normal query execution and since normal query execution also stores these files somewhere during runtime (like in local scratch dir in case of these unit tests) and by this way we are not maintaining any extra storage temporary buffer and if it fails then fails as normal query and if it succeeds then we perform validation checks for those files in normal query execution and if it is valid then we just copy those files to cache dir in post execution and if while moving these files if it fails in between then I am performing cleaanup as well for it.
And I made this configurable as there is overhead in copying files from location of normal query execution to cache directory

@ramitg254 ramitg254 requested a review from abstractdog March 23, 2026 15:47
@abstractdog abstractdog requested a review from Copilot April 14, 2026 12:54
private static final Logger LOG = LoggerFactory.getLogger(TestCachedResults.class);

@ClassRule
public static HiveTestEnvSetup envSetup = new HiveTestEnvSetup();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this, HiveConfForTest is sufficient in most of the cases

public static HiveTestEnvSetup envSetup = new HiveTestEnvSetup();

@Rule
public TestRule methodRule = envSetup.getMethodRule();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think don't need this either


@BeforeClass
public static void setUp() throws Exception {
conf = envSetup.getTestCtx().hiveConf;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new HiveConfForTest(...)

driver.run(sql);
}

private static long getFolderSize(File folder) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like reinventing the wheel, is there a chance we already have this implemented somewhere else?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a “safe cache write” mode for Hive query results cache to prevent oversized/invalid entries from being written directly into the cache directory (reducing runtime cache overspill), with accompanying tests.

Changes:

  • Introduces hive.query.results.safe.cache.write.enabled configuration flag.
  • Updates results destination selection so safe mode writes outside the cache and later conditionally copies into the cache.
  • Adds runtime copy logic in QueryResultsCache#setEntryValid and a new test suite validating safe vs unsafe behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File Description
ql/src/test/org/apache/hadoop/hive/ql/TestCachedResults.java Adds tests and a cache-size monitor to verify safe vs unsafe cache write behavior.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java Routes output to default destination in safe mode and records a “safe” source directory.
ql/src/java/org/apache/hadoop/hive/ql/cache/results/QueryResultsCache.java Implements safe-mode copy into cache on entry validation and cleanup helpers.
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java Adds the new configuration variable and description text.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


@Test
public void testSafeCacheWrite() throws Exception {
HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVE_QUERY_RESULTS_SAFE_CACHE_WRITE_ENABLED, true);
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests mutate a shared static conf but testUnsafeCacheWrite never explicitly sets HIVE_QUERY_RESULTS_SAFE_CACHE_WRITE_ENABLED back to false. If testSafeCacheWrite runs first (test order is not guaranteed), testUnsafeCacheWrite will execute in safe mode and assert the opposite behavior. Make each test explicitly set the flag to the intended value (or reset it in @Before / @After).

Copilot uses AI. Check for mistakes.
}

@Test
public void testUnsafeCacheWrite() throws Exception {
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests mutate a shared static conf but testUnsafeCacheWrite never explicitly sets HIVE_QUERY_RESULTS_SAFE_CACHE_WRITE_ENABLED back to false. If testSafeCacheWrite runs first (test order is not guaranteed), testUnsafeCacheWrite will execute in safe mode and assert the opposite behavior. Make each test explicitly set the flag to the intended value (or reset it in @Before / @After).

Suggested change
public void testUnsafeCacheWrite() throws Exception {
public void testUnsafeCacheWrite() throws Exception {
HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVE_QUERY_RESULTS_SAFE_CACHE_WRITE_ENABLED, false);

Copilot uses AI. Check for mistakes.
private static String cacheDir;

private ScheduledExecutorService scheduler;
private long maxCacheSize = 0;
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxCacheSize is written from the scheduler thread and read from the test thread without any memory-visibility guarantees, which can make assertions flaky (e.g., reading a stale 0). Use an AtomicLong (or make it volatile and update carefully) to ensure cross-thread visibility.

Suggested change
private long maxCacheSize = 0;
private volatile long maxCacheSize = 0;

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +247
private void startCacheMonitor(long intervalMs) {
scheduler.scheduleAtFixedRate(() -> {
long size = getFolderSize(new File(cacheDir));
maxCacheSize = Math.max(maxCacheSize, size);
}, 0, intervalMs, TimeUnit.MILLISECONDS);
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxCacheSize is written from the scheduler thread and read from the test thread without any memory-visibility guarantees, which can make assertions flaky (e.g., reading a stale 0). Use an AtomicLong (or make it volatile and update carefully) to ensure cross-thread visibility.

Copilot uses AI. Check for mistakes.
conf = envSetup.getTestCtx().hiveConf;

HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVE_QUERY_RESULTS_CACHE_ENABLED, true);
HiveConf.setVar(conf, HiveConf.ConfVars.HIVE_QUERY_RESULTS_CACHE_DIRECTORY, "/tmp/hive/cache");
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coding /tmp/hive/cache in a test can cause collisions across concurrent test runs and may fail on environments without a writable /tmp (or on non-Unix setups). Prefer using a per-test temporary directory (e.g., test framework temp dirs / java.nio.file.Files#createTempDirectory) and set the conf to that path.

Copilot uses AI. Check for mistakes.
Comment on lines +593 to +594
Path destFile = new Path(resultDir,
new Path(fs.getPath().toString().substring(safeDir.length() + 1)));
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building relative paths via substring(safeDir.length() + 1) is fragile: it assumes the string form of each path always starts with safeDir (same scheme/authority, same qualification, no trailing slash differences). If that assumption breaks, it can produce incorrect paths or throw StringIndexOutOfBoundsException. Prefer computing a proper relative path using qualified Path/URI comparison (e.g., qualify both paths and derive the relative suffix from URI paths) and validate that the source is inside safeDir before deriving the destination.

Copilot uses AI. Check for mistakes.
return false;
}
fetchWork.setFilesToFetch(cacheFilesToFetch);
fetchWork.setTblDir(new Path(resultDir, fetchWork.getTblDir().toString().substring(safeDir.length() + 1)));
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building relative paths via substring(safeDir.length() + 1) is fragile: it assumes the string form of each path always starts with safeDir (same scheme/authority, same qualification, no trailing slash differences). If that assumption breaks, it can produce incorrect paths or throw StringIndexOutOfBoundsException. Prefer computing a proper relative path using qualified Path/URI comparison (e.g., qualify both paths and derive the relative suffix from URI paths) and validate that the source is inside safeDir before deriving the destination.

Copilot uses AI. Check for mistakes.
Comment on lines +587 to +590
rwLock.writeLock().lock();
boolean succeeded = true;
try {
for (FileStatus fs : fetchWork.getFilesToFetch()) {
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache write lock is held while performing potentially slow filesystem copies. This can block unrelated cache operations for the duration of the copy and hurt concurrency under load. Consider copying files outside the cache rwLock, and only acquiring the lock for the minimal critical section where shared cache state is updated.

Copilot uses AI. Check for mistakes.
Path srcFile = fs.getPath();
Path destFile = new Path(resultDir,
new Path(fs.getPath().toString().substring(safeDir.length() + 1)));
succeeded = FileUtil.copy(srcFs, srcFile, cacheFs, destFile, false, conf);
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache write lock is held while performing potentially slow filesystem copies. This can block unrelated cache operations for the duration of the copy and hurt concurrency under load. Consider copying files outside the cache rwLock, and only acquiring the lock for the minimal critical section where shared cache state is updated.

Copilot uses AI. Check for mistakes.
Comment on lines +5674 to +5675
"If the query results safe cache is enabled. This will safely write to cache directory by first evaluating " +
"the cache entry is not overspilling the the cache directory before writing it to cache directory "),
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new conf var description has grammatical issues (e.g., duplicated 'the', awkward phrasing) and it doesn’t clearly describe the actual behavior implemented (write results to a non-cache destination first, then copy into cache only after the entry is deemed valid/acceptable). Please reword to be precise and user-facing.

Suggested change
"If the query results safe cache is enabled. This will safely write to cache directory by first evaluating " +
"the cache entry is not overspilling the the cache directory before writing it to cache directory "),
"If enabled, query results are written to a temporary non-cache location first and copied into the " +
"cache directory only after the cache entry is accepted as valid, for example after size checks."),

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@abstractdog abstractdog Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I was about to comment the same, configuration should be clear about the tradeoff which is present under normal circumstances

Comment on lines +5674 to +5675
"If the query results safe cache is enabled. This will safely write to cache directory by first evaluating " +
"the cache entry is not overspilling the the cache directory before writing it to cache directory "),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated "the the"

cacheFs.mkdirs(resultDir);

Set<FileStatus> cacheFilesToFetch = new HashSet<>();
rwLock.writeLock().lock();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that in other places in this class, this is handled via a separate variable:

    Lock writeLock = rwLock.writeLock();
    try{
        writeLock.lock();
    }

but that usage pattern is not consistent: feel free to pick one, and unify all the lock usages

return false;
}

if (isSafeCacheWriteEnabled) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe refactor this whole block to a separate method

"If the query results cache is enabled. This will keep results of previously executed queries " +
"to be reused if the same query is executed again."),

HIVE_QUERY_RESULTS_SAFE_CACHE_WRITE_ENABLED("hive.query.results.safe.cache.write.enabled", false,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for correct alphabetical order this should rather be hive.query.results.cache.safe.write.enabled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants