BoxedLLaMA is a Delphi toolkit that wraps llama.cpp into a managed, batteries-included package for local AI inference on Windows.
Most developers who want local AI face the same wall: download the right binary, figure out the command-line flags, spawn a process, parse HTTP responses, manage the lifecycle, and hope nothing breaks when llama.cpp ships a new release next Tuesday. BoxedLLaMA handles all of it. One toolkit. One managed subprocess. Zero DLL coupling.

    Server.AddMessage('user', 'Summarize the Delphi roadmap.');
    LResult := Server.ChatCompletionStream(CModelName, LChatConfig);

A few lines are enough to stream a response from a local model. Everything underneath -- the binary, the process, the HTTP plumbing, the response parsing -- is handled.

| Feature | What It Means |
|---|---|
| 🚀 Automatic server management | Downloads, installs, and auto-updates the llama.cpp server binary from GitHub releases. Version policies: auto, pinned, or manual. |
| 💬 Chat completions | Synchronous and streaming with full token callback support. Token counts, timing, and generation speed in one result record. |
| 🔧 Tool calling | Two-tier architecture: three meta-tools visible to the model, full catalog discovered at runtime. Agentic multi-round tool loop built in. |
| 📐 Embeddings | Single and batch generation with cosine similarity. TChat automatically enables embeddings when an embedding model is configured. |
| 🧠 Persistent memory | SQLite + FTS5 + HNSW vector index. Hybrid retrieval (keyword + semantic) with automatic recall injection per turn. |
| 📄 Document ingest | Paragraph-aware chunking with configurable overlap. Drop a file into memory and let retrieval find the right pieces. |
| 🧭 Session management | System prompt invariant, context-budget trimming, history compaction, and two-level persistence (JSON history + KV cache). |
| 📥 HuggingFace models | Download, delete, and track GGUF models directly through the server API with SSE progress tracking. |
| 🧪 Reasoning models | Configurable thinking tag display for chain-of-thought models. Show, hide, or replace with a placeholder. |
| ⚡ GPU offloading | Automatic, full, or manual GPU layer control with Vulkan backend. Quantized KV cache (Q4_0, Q8_0) to fit larger contexts in VRAM. |
| 🔄 Auto-updating | vpAuto checks GitHub on a configurable interval and updates the server binary silently. Your app always runs on the latest llama.cpp. |
| 🔌 Built on StdApp | Console UI, HTTP, JSON, VFS, crypto, and more. One dependency tree, no external packages. |
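
The Embeddings row above mentions cosine similarity, which is the usual way to compare two embedding vectors. The following is a minimal, standalone sketch of that calculation, not BoxedLLaMA's own implementation: `CosineSimilarity` is a hypothetical helper written for illustration, and the toy vectors stand in for real model output.

```delphi
program CosineSimilarityDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper for illustration; not part of the BoxedLLaMA API.
// Returns dot(A, B) / (|A| * |B|), or 0 when either vector has zero length.
function CosineSimilarity(const AVecA, AVecB: TArray<Double>): Double;
var
  LIndex: Integer;
  LDot, LNormA, LNormB: Double;
begin
  if Length(AVecA) <> Length(AVecB) then
    raise EArgumentException.Create('Vectors must have the same dimension');

  LDot := 0;
  LNormA := 0;
  LNormB := 0;
  for LIndex := 0 to High(AVecA) do
  begin
    LDot := LDot + AVecA[LIndex] * AVecB[LIndex];
    LNormA := LNormA + Sqr(AVecA[LIndex]);
    LNormB := LNormB + Sqr(AVecB[LIndex]);
  end;

  // Guard against division by zero for all-zero vectors.
  if (LNormA = 0) or (LNormB = 0) then
    Exit(0);
  Result := LDot / (Sqrt(LNormA) * Sqrt(LNormB));
end;

var
  LQuery, LDocA, LDocB: TArray<Double>;
begin
  // Toy 3-dimensional vectors; real embeddings have far more dimensions.
  LQuery := [1.0, 0.0, 1.0];
  LDocA := [2.0, 0.0, 2.0]; // same direction as the query
  LDocB := [0.0, 3.0, 0.0]; // orthogonal to the query

  Writeln(Format('query vs A: %.3f', [CosineSimilarity(LQuery, LDocA)]));
  Writeln(Format('query vs B: %.3f', [CosineSimilarity(LQuery, LDocB)]));
end.
```

Vectors that point the same way score close to 1 regardless of magnitude, and orthogonal vectors score 0, which is why the measure suits ranking text chunks against a query.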

                 Your Application
                         |
                         v
    +-----------------------------------------+
    |           BoxedLLaMA Toolkit            |
    |                                         |
    |  TChat -----> TSession -----> TServer   |
    |    |             |               |      |
    |    v             v               v      |
    | TConsoleChat  TMemory      llama-server |
    |  (frontend)   (SQLite+       (managed   |
    |              FTS5+HNSW)     subprocess) |
    |                                  |      |
    |  TToolRegistry <----tool---+     |      |
    |  TToolBuilder       calls        |      |
    +-----------------------------------------+
                                       |
                                       v
       Local GGUF Models (Vulkan GPU inference)

📖 Full Documentation -- configuration, API reference, code examples, and architecture details for every module.
- Clone the repository: `git clone https://github.com/tinyBigGAMES/BoxedLLaMA.git`
- Open `projects\Testbed\Testbed.dproj` in Delphi 12 or higher
- Build the Testbed project (Win64 target)
- Run -- the server binary downloads automatically on first launch
- Place your GGUF model files in `C:\Dev\LLM\GGUF` (or update the path in `projects\Testbed\UTestbed.Common.pas`). Single-file models go in the root; multimodal models with an `mmproj` file get their own subfolder. Reference local models by filename without the `.gguf` extension.
- (Optional) Set the `TAVILY_API_KEY` environment variable for web search tools. Get a free key at Tavily (1,000 credits/month).
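
Before running the demos, it can help to confirm what the setup steps above expect to find. The sketch below uses only the Delphi RTL, not the BoxedLLaMA API: it lists the single-file models in the root of the default folder under the names you would reference them by, and reports whether the optional `TAVILY_API_KEY` is set. `ModelReferenceName` is a hypothetical helper, and the sketch assumes the default path; it does not look into multimodal subfolders.

```delphi
program ListLocalModels;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.IOUtils;

const
  // Default folder from the setup steps; adjust to match your own layout.
  CModelDir = 'C:\Dev\LLM\GGUF';

// Hypothetical helper: derives the reference name by dropping the folder
// and the ".gguf" extension from a model file path.
function ModelReferenceName(const AFileName: string): string;
begin
  Result := TPath.GetFileNameWithoutExtension(AFileName);
end;

var
  LFile: string;
begin
  if TDirectory.Exists(CModelDir) then
  begin
    // Only the root folder is scanned, so this covers single-file models.
    for LFile in TDirectory.GetFiles(CModelDir, '*.gguf') do
      Writeln(ModelReferenceName(LFile));
  end
  else
    Writeln('Model folder not found: ', CModelDir);

  // The key is optional, so a missing value is reported as a notice only.
  if GetEnvironmentVariable('TAVILY_API_KEY') = '' then
    Writeln('TAVILY_API_KEY is not set (only needed for web search tools).');
end.
```

If the program prints nothing for the model folder, the files are probably in a different location than the path configured in `UTestbed.Common.pas`.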
These vetted models work out of the box with the testbed demos:
| Purpose | Model | Size | Download |
|---|---|---|---|
| 💬 Chat (multimodal) | Gemma 4 E4B Abliterated Q4_K | 5.3 GB | Download |
| 👁️ Vision projector | mmproj for Gemma 4 E4B (bf16) | 992 MB | Download |
| 📐 Embeddings | Qwen3 Embedding 0.6B Q8_0 | 639 MB | Download |
| Requirement | Details |
|---|---|
| 🖥️ Host OS | Windows 10/11 x64 |
| 🎮 GPU | Vulkan-capable GPU recommended |
| ⚙️ Building from source | Delphi 12.x or higher |
| 📦 Runtime dependencies | None -- server binary downloaded automatically |
Important: This repository is under active development. Follow the repo or join the Discord to track progress.
BoxedLLaMA is an open project. Whether you are fixing a bug, improving documentation, or proposing a feature, contributions are welcome.
- 🐛 Report bugs: Open an issue with a minimal reproduction
- 💡 Suggest features: Describe the use case first
- 🔧 Submit pull requests: Bug fixes, documentation improvements, and well-scoped features
Join the Discord to discuss development, ask questions, and share what you are building.
If BoxedLLaMA saves you time or sparks something useful:
- ⭐ Star the repo -- helps others find the project
- 🗣️ Spread the word -- write a post, mention it in a community
- 💬 Join us on Discord -- share what you are building
- 💖 Become a sponsor -- sponsorship directly funds development
- 🦋 Follow on Bluesky -- stay in the loop on releases
BoxedLLaMA is licensed under the Apache License 2.0. See LICENSE for details.
Apache 2.0 is a permissive open source license that lets you use, modify, and distribute BoxedLLaMA freely in both open source and commercial projects. You are not required to release your own source code. The license includes an explicit patent grant. Attribution is required -- keep the copyright notice and license file in place.
- 📖 Documentation
- 💬 Discord
- 🦋 Bluesky
- 🏠 tinyBigGAMES
BoxedLLaMA™ -- Put llama.cpp on rails.
Copyright © 2026-present tinyBigGAMES™ LLC
All Rights Reserved.
