Ollama cold start and VRAM unloading in Olares (Kubernetes) when used with Dify #2451
-
Thank you for the suggestion. I'd like to confirm whether my understanding is correct: you're hoping that Ollama can respond promptly to Dify's first request, without having to wait for the model to load, and that the model remains loaded in GPU memory.

Our current approach is to package Ollama together with a specific model into a single application. With this setup, the application downloads the model at startup and then serves it. Whether the model stays resident in GPU memory is managed by Olares (via time slicing), rather than relying on Ollama's native multi-model scheduling.
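For context, a rough sketch of what such an all-in-one package does at startup; the model name and entrypoint script below are illustrative only, not the actual Olares packaging:

```bash
# Illustrative entrypoint for an "Ollama + model" all-in-one app
# (model name and script are examples, not the actual Olares packaging)
ollama serve &                                  # start the Ollama server in the background

# wait until the API is reachable before pulling the model
until curl -sf http://127.0.0.1:11434/api/version > /dev/null; do
  sleep 1
done

ollama pull gemma3:12b                          # download the packaged model at startup

wait                                            # keep the server process in the foreground
```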
-
Hi Peng Peng,

Thank you very much for these clarifications! It helps me understand how Olares manages GPU residency through time slicing rather than Ollama's native scheduling.

Since these "all-in-one" packages (like Gemma 3 12B or GPT-OSS 20B) seem to be the preferred way to ensure stability and performance on Olares, I have a suggestion: would it be possible to add a "Generic vLLM" module to the Market? A generic module where we could simply input a Hugging Face model ID would be a game changer. It would allow users to leverage vLLM's superior performance and static VRAM allocation for any model, without waiting for a specific pre-packaged app for every new LLM release.

Thanks again for your support and for this great platform!
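To make the idea concrete, this is roughly what such a generic module could wrap. It is just vLLM's standard OpenAI-compatible server; the image, flags, and model ID below are examples rather than an Olares implementation:

```bash
# Serve an arbitrary Hugging Face model with vLLM's OpenAI-compatible server
# (model ID, port, and flags are illustrative)
docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN=<token-if-the-repo-is-gated> \
  vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --gpu-memory-utilization 0.90   # vLLM pre-allocates GPU memory up front
```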
-
Hello Peng Peng,

I'm reporting an issue with the Gemma3 12B (vLLM) 1.0.9 app from Olares Market. The app installs correctly, but fails at runtime when started. This should not happen, since the model is delivered via Olares Market artifacts and should run offline after installation. Additionally, the referenced HuggingFace repository

Other vLLM apps work correctly in the same environment, confirming this is specific to vllmgemma312bitv2.

Expected behavior:

Olivier
-
Hi @Olivier38220,

From your screenshot, the error does not appear to be related to the repository itself. We've verified that the repository is accessible in a clean network environment on our side.

As a quick workaround, you may also try uninstalling the application and reinstalling it.
-
Hi harveyff,

I've run the requested checks in the same environment.

Connectivity test: `curl -Iv https://huggingface.co --connect-timeout 5` returns HTTP/2 200 with a valid TLS certificate for huggingface.co.

Proxy check: `env | grep -i proxy` returns no output (no proxy variables set).

I also reproduced the exact endpoint used by the Gemma module outside of Olares. The request returns HTTP 200 with a valid huggingface.co certificate when executed directly on the host. Inside Olares, the same request fails with an x509 error, where the TLS certificate is issued for mhradio.org instead of huggingface.co.

This confirms that general network connectivity to Hugging Face is working correctly, and that the issue is caused by HTTPS redirection or interception within the Olares environment, specific to the downloader path used by this module, rather than by the repository or the external network.

I've identified the exact root cause. The download-model container is forced to route all outbound traffic through the internal Olares proxy (VLLM_PROXY_TARGET). This proxy returns an invalid TLS certificate (mhradio.org) for some Hugging Face endpoints used by the Gemma module, causing the x509 failure. When the proxy is overridden to an empty value, the TLS error disappears, but the container loses external network access entirely ("Host is unreachable"), confirming that the proxy is also acting as the egress gateway.

This proves the issue is not related to Hugging Face, the repository, or general connectivity, but to the proxy's handling of specific Hugging Face endpoints. A proper fix would be to adjust the proxy configuration (TLS chain or NO_PROXY for huggingface.co) rather than disabling it.

Best
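For completeness, the checks above boil down to the following commands; the `openssl` step is just another way to see which certificate is actually presented:

```bash
# Connectivity and TLS checks, run on the host and then repeated inside the container

# 1) Reachability and certificate validation against Hugging Face
curl -Iv https://huggingface.co --connect-timeout 5

# 2) Confirm that no proxy variables are set in the current environment
env | grep -i proxy

# 3) Show which certificate is presented; it should be issued for huggingface.co,
#    whereas inside Olares it came back issued for mhradio.org
openssl s_client -connect huggingface.co:443 -servername huggingface.co </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
```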
-
Thanks for the clarification and for looking into this.

Based on the investigation and the reproducible behavior described above, I agree that the issue is related to how outbound HTTPS traffic is handled in the current environment rather than to the Hugging Face repository or the downloader logic itself.

I'll close this issue for now and keep an eye on future updates or improvements around the proxy / egress handling. Thanks again for your time and support.

-
Hi,
I'm running Ollama inside Olares (Kubernetes-based) on a bare-metal Linux server equipped with an NVIDIA RTX 3090 GPU.
Ollama is used as a shared LLM backend, mainly through Dify (workflow + chat apps), and potentially for voice-based use cases.
The setup works correctly from a functional standpoint, but I consistently observe the following behavior when using Dify + Ollama: after a period of inactivity the model is unloaded from GPU memory, so the first request has to wait for it to be loaded again.
This does not appear to be a performance issue (the RTX 3090 handles inference well once loaded), but rather a cold-start / model unloading behavior.
I understand this may be expected in Kubernetes environments, but I would like to confirm the recommended approach specifically in Olares for GPU-based conversational workloads.
Questions:

1. Is there a recommended way in Olares to keep the model loaded in GPU memory (e.g. via Helm values or environment variables like `OLLAMA_KEEP_ALIVE=-1`)?
2. Is pre-loading the model at startup (for example using a `postStart` hook running `ollama run <model>`) considered an acceptable or recommended solution?

My goal is to keep Ollama inside Olares (not bare metal), since it is shared by multiple applications, while avoiding cold starts for near-real-time conversational usage. A rough sketch of what I'm considering is included below.
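A minimal sketch, assuming Ollama's standard environment variables and API behave the same inside the Olares container (deployment, namespace, service, and model names are placeholders):

```bash
# 1) Ask Ollama to keep loaded models resident instead of unloading them after
#    the default idle timeout (deployment and namespace names are placeholders)
kubectl set env deployment/ollama -n <namespace> OLLAMA_KEEP_ALIVE=-1

# 2) Warm the model once so the first Dify request doesn't pay the load cost;
#    a generate call without a prompt only loads the model into memory
curl http://<ollama-service>:11434/api/generate \
  -d '{"model": "<model>", "keep_alive": -1}'
```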
Thanks for any guidance or best practices.