Ollama cold start and VRAM unloading in Olares (Kubernetes) when used with Dify #2451
-
Thank you for the suggestion. I'd like to confirm whether my understanding is correct: you're hoping that Ollama can respond promptly to Dify's first request, without having to wait for the model to load, and that the model remains loaded in GPU memory.

Our current approach is to package Ollama together with a specific model into a single application. With this setup, the application downloads the model at startup and then serves it. Whether the model stays resident in GPU memory is managed by Olares (via time slicing), rather than relying on Ollama's native multi-model scheduling.
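For context, a rough sketch of what such an all-in-one package does at startup; the model name and entrypoint script below are illustrative only, not the actual Olares packaging:

```bash
# Illustrative entrypoint for an "Ollama + model" all-in-one app
# (model name and script are examples, not the actual Olares packaging)
ollama serve &                                  # start the Ollama server in the background

# wait until the API is reachable before pulling the model
until curl -sf http://127.0.0.1:11434/api/version > /dev/null; do
  sleep 1
done

ollama pull gemma3:12b                          # download the packaged model at startup

wait                                            # keep the server process in the foreground
```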
-
Hi Peng Peng,

Thank you very much for these clarifications! It helps me understand how Olares manages GPU residency through time slicing rather than Ollama's native scheduling.

Since these "all-in-one" packages (like Gemma 3 12B or GPT-OSS 20B) seem to be the preferred way to ensure stability and performance on Olares, I have a suggestion: would it be possible to add a "Generic vLLM" module to the Market? A generic module where we could simply input a Hugging Face model ID would be a game changer. It would allow users to leverage vLLM's superior performance and static VRAM allocation for any model, without waiting for a specific pre-packaged app for every new LLM release.

Thanks again for your support and for this great platform!
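To make the idea concrete, this is roughly what such a generic module could wrap. It is just vLLM's standard OpenAI-compatible server; the image, flags, and model ID below are examples rather than an Olares implementation:

```bash
# Serve an arbitrary Hugging Face model with vLLM's OpenAI-compatible server
# (model ID, port, and flags are illustrative)
docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN=<token-if-the-repo-is-gated> \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.90   # vLLM pre-allocates GPU memory up front
```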
-
Hello Peng Peng,

I'm reporting an issue with the Gemma3 12B (vLLM) 1.0.9 app from Olares Market. The app installs correctly, but fails at runtime when started. This should not happen, since the model is delivered via Olares Market artifacts and should run offline after installation. Additionally, the referenced HuggingFace repository

Other vLLM apps work correctly in the same environment, confirming this is specific to vllmgemma312bitv2.

Expected behavior:

Olivier
-
Hi @Olivier38220,

From your screenshot, the error does not appear to be related to the repository itself. We've verified that the repository is accessible in a clean network environment on our side.

As a quick workaround, you may also try uninstalling the application and reinstalling it.
-
Hi harveyff,

I've run the requested checks in the same environment.

Connectivity test: `curl -Iv https://huggingface.co --connect-timeout 5` returns HTTP/2 200 with a valid TLS certificate for huggingface.co.

Proxy check: `env | grep -i proxy` returns no output (no proxy variables set).

I also reproduced the exact endpoint used by the Gemma module outside of Olares. The request returns HTTP 200 with a valid huggingface.co certificate when executed directly on the host. Inside Olares, the same request fails with an x509 error, where the TLS certificate is issued for mhradio.org instead of huggingface.co.

This confirms that general network connectivity to Hugging Face is working correctly, and that the issue is caused by HTTPS redirection or interception within the Olares environment, specific to the downloader path used by this module, rather than by the repository or the external network.

I've identified the exact root cause. The download-model container is forced to route all outbound traffic through the internal Olares proxy (VLLM_PROXY_TARGET). This proxy returns an invalid TLS certificate (mhradio.org) for some Hugging Face endpoints used by the Gemma module, causing the x509 failure. When the proxy is overridden to an empty value, the TLS error disappears, but the container loses external network access entirely ("Host is unreachable"), confirming that the proxy is also acting as the egress gateway.

This proves the issue is not related to Hugging Face, the repository, or general connectivity, but to the proxy's handling of specific Hugging Face endpoints. A proper fix would be to adjust the proxy configuration (TLS chain or NO_PROXY for huggingface.co) rather than disabling it.

Best
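For completeness, the checks above boil down to the following commands; the `openssl` step is just another way to see which certificate is actually presented:

```bash
# Connectivity and TLS checks, run on the host and then repeated inside the container

# 1) Reachability and certificate validation against Hugging Face
curl -Iv https://huggingface.co --connect-timeout 5

# 2) Confirm that no proxy variables are set in the current environment
env | grep -i proxy

# 3) Show which certificate is presented; it should be issued for huggingface.co,
#    whereas inside Olares it came back issued for mhradio.org
openssl s_client -connect huggingface.co:443 -servername huggingface.co </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
```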
-
Thanks for the clarification and for looking into this.

Based on the investigation and the reproducible behavior described above, I agree that the issue is related to how outbound HTTPS traffic is handled in the current environment rather than to the Hugging Face repository or the downloader logic itself.

I'll close this issue for now and keep an eye on future updates or improvements around the proxy / egress handling. Thanks again for your time and support.

-
Hi,
I'm running Ollama inside Olares (Kubernetes-based) on a bare-metal Linux server equipped with an NVIDIA RTX 3090 GPU.
Ollama is used as a shared LLM backend, mainly through Dify (workflow + chat apps), and potentially for voice-based use cases.
The setup works correctly from a functional standpoint, but I consistently observe the following behavior when using Dify + Ollama: after a period of inactivity the model is unloaded from GPU memory, so the first request has to wait for it to be loaded again.
This does not appear to be a performance issue (the RTX 3090 handles inference well once loaded), but rather a cold-start / model unloading behavior.
I understand this may be expected in Kubernetes environments, but I would like to confirm the recommended approach specifically in Olares for GPU-based conversational workloads.
Questions:

1. Is there a recommended way in Olares to keep the model loaded in GPU memory (e.g. via Helm values or environment variables like `OLLAMA_KEEP_ALIVE=-1`)?
2. Is pre-loading the model at startup (for example using a `postStart` hook running `ollama run <model>`) considered an acceptable or recommended solution?

My goal is to keep Ollama inside Olares (not bare metal), since it is shared by multiple applications, while avoiding cold starts for near-real-time conversational usage. A rough sketch of what I'm considering is included below.
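A minimal sketch, assuming Ollama's standard environment variables and API behave the same inside the Olares container (deployment, namespace, service, and model names are placeholders):

```bash
# 1) Ask Ollama to keep loaded models resident instead of unloading them after
#    the default idle timeout (deployment and namespace names are placeholders)
kubectl set env deployment/ollama -n <namespace> OLLAMA_KEEP_ALIVE=-1

# 2) Warm the model once so the first Dify request doesn't pay the load cost;
#    a generate call without a prompt only loads the model into memory
curl http://<ollama-service>:11434/api/generate \
  -d '{"model": "<model>", "keep_alive": -1}'
```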
Thanks for any guidance or best practices.