Fix for GPU accelerated vector indexing silently falling back to using CPU instead #4328
Open
rahulgoswami wants to merge 3 commits into apache:main
https://issues.apache.org/jira/browse/SOLR-18210
Description
GPU-accelerated vector indexing uses the Lucene99AcceleratedHNSWVectorsFormat from the cuvs-lucene library. It currently falls back to Lucene99HnswVectorsWriter (which runs on the CPU) because of a library loading issue.
Solution
Root Cause: Initialization Race Between Two Independent Code Paths
There are two independent code paths that access com.nvidia.cuvs.spi.CuVSServiceProvider$Holder.INSTANCE, but only one of them loads the required libcudart.so native library first. The wrong one wins the race.
Path A: GpuMetricsService (runs FIRST, does NOT load libcudart)
org.apache.solr.cuvs.GpuMetricsService is started on a ScheduledExecutorService during CoreContainer initialization. Its updateGpuMetrics() method directly calls:
GpuMetricsService.updateGpuMetrics() // scheduled executor thread
→ CuVSProvider.provider() // cuvs-java: CuVSProvider.java:159
→ CuVSServiceProvider$Holder.INSTANCE // triggers $Holder class init
This triggers the $Holder class initializer, which calls loadProvider() → builtinProvider() and eventually throws "UnsatisfiedLinkError: unresolved symbol: cudaMemcpyAsync", so $Holder initialization fails.
Once the class initializer has failed, every later access to $Holder throws NoClassDefFoundError.
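The poisoning behaviour described above can be reproduced without any GPU code. The following is a minimal sketch: HolderPoisoningDemo and its nested Holder are hypothetical stand-ins, not cuvs-java classes, and the thrown error only imitates the message reported above.

```java
// Hypothetical stand-ins for illustration; these are NOT cuvs-java classes.
final class HolderPoisoningDemo {

    /** Mimics a lazy holder whose static initializer fails on first use. */
    static final class Holder {
        static final Object INSTANCE = load();

        private static Object load() {
            // Stand-in for the native symbol lookup that fails when libcudart is not loaded.
            throw new UnsatisfiedLinkError("unresolved symbol: cudaMemcpyAsync");
        }
    }

    /** Touches Holder.INSTANCE and returns the simple class name of whatever is thrown. */
    static String access() {
        try {
            return String.valueOf(Holder.INSTANCE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("first access:  " + access());
        System.out.println("second access: " + access());
    }
}
```

In a fresh JVM the first access reports UnsatisfiedLinkError (an Error thrown from a static initializer is propagated as-is rather than wrapped) and the second reports NoClassDefFoundError, because the JVM marks the class as unusable after the failed initialization and never retries it.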
Path B: Utils.cuVSResourcesOrNull() (runs SECOND, DOES load libcudart)
The class initializer of Lucene99AcceleratedHNSWVectorsFormat calls Utils.cuVSResourcesOrNull() (class com.nvidia.cuvs.lucene.Utils).
This method correctly calls System.loadLibrary("cudart") before touching $Holder. If it ran first, cudaMemcpyAsync would be resolvable via SymbolLookup.loaderLookup() and GPU init would succeed. But by the time this path runs, $Holder has already been poisoned by Path A.
As a result, Lucene99AcceleratedHNSWVectorsFormat.supported() returns false and indexing silently falls back to Lucene99HnswVectorsWriter: indexing still succeeds, with only the log warning "GPU based indexing not supported, falling back to using the Lucene99HnswVectorsWriter".
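To illustrate why Path B cannot recover once Path A has run, here is a rough model of the ordering problem. Everything in it is assumed for illustration: FallbackOrderDemo, the cudartLoaded flag, metricsTick() and chooseWriter() are hypothetical and do not mirror the real Solr or cuvs-lucene code; only the two writer names are taken from the description.

```java
// Hypothetical model of the two code paths; not the real Solr or cuvs-lucene code.
final class FallbackOrderDemo {

    // Stand-in for "libcudart has been loaded"; not a real cuvs-java flag.
    static volatile boolean cudartLoaded = false;

    /** Stand-in for CuVSServiceProvider$Holder: initializes only if the runtime is loaded. */
    static final class Holder {
        static final String INSTANCE = init();

        private static String init() {
            if (!cudartLoaded) {
                throw new UnsatisfiedLinkError("unresolved symbol: cudaMemcpyAsync");
            }
            return "gpu-provider";
        }
    }

    /** Path A: touches the holder without loading the runtime first. */
    static void metricsTick() {
        try {
            System.out.println("metrics provider: " + Holder.INSTANCE);
        } catch (LinkageError e) {
            System.out.println("metrics tick failed: " + e);
        }
    }

    /** Path B: loads the runtime first, then touches the holder. */
    static boolean supported() {
        cudartLoaded = true;
        try {
            return Holder.INSTANCE != null;
        } catch (LinkageError e) {
            // The holder was already poisoned; loading the runtime now is too late.
            return false;
        }
    }

    /** Picks the writer name the way a silent fallback would. */
    static String chooseWriter() {
        return supported() ? "Lucene99AcceleratedHNSWVectorsWriter" : "Lucene99HnswVectorsWriter";
    }

    public static void main(String[] args) {
        metricsTick();                      // Path A wins the race
        System.out.println(chooseWriter()); // Path B arrives too late
    }
}
```

With metricsTick() called first, chooseWriter() returns the CPU writer name even though supported() sets the flag before touching the holder. Swapping the two calls in main would let the holder initialize successfully, which is the ordering the fix aims to guarantee.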
Fix:
Load the CUDA runtime library (libcudart) when GpuMetricsService initializes, so that it is already loaded before Path A first touches $Holder.
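One possible shape for such a preload step is sketched below. It is not the code in this PR: CudaRuntimePreload and tryLoad() are hypothetical names, and the only call taken from the description is System.loadLibrary("cudart"). The sketch assumes a failed load should be logged and tolerated rather than allowed to break startup.

```java
// Hypothetical helper; a sketch of a guarded preload, not the actual change in this PR.
final class CudaRuntimePreload {

    private CudaRuntimePreload() {}

    /**
     * Tries to load the named native library.
     * Returns false instead of throwing when the library cannot be loaded,
     * so a machine without a GPU runtime can still start up.
     */
    static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            System.err.println("Could not load native library '" + libraryName + "': " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this would run before the first metrics tick touches the provider.
        boolean loaded = tryLoad("cudart");
        System.out.println("cudart loaded: " + loaded);
    }
}
```

A caller could use the boolean result to skip GPU metrics collection entirely when the runtime is missing, which would also keep the metrics thread from touching $Holder on machines without CUDA.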
Tests
Built solr-cuvs.jar locally and placed it in WEB-INF/lib of the Solr web app, then ran vector indexing on an L40S GPU machine with the configuration described in https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#gpu-acceleration.
The log now prints "cuVS is supported so using the Lucene99AcceleratedHNSWVectorsWriter", emitted by cuvs-lucene's Lucene99AcceleratedHNSWVectorsFormat.