tinyBigGAMES/BoxedLLaMA: BoxedLLaMA is a toolkit for integrating local AI inference into Windows applications. It wraps llama.cpp and automates server installation, version updates, and model management from Hugging Face.

Put llama.cpp on rails.

What is BoxedLLaMA?

BoxedLLaMA is a Delphi toolkit that wraps llama.cpp into a managed, batteries-included package for local AI inference on Windows.

Most developers who want local AI face the same wall: download the right binary, figure out the command-line flags, spawn a process, parse HTTP responses, manage the lifecycle, and hope nothing breaks when llama.cpp ships a new release next Tuesday. BoxedLLaMA handles all of it. One toolkit. One managed subprocess. Zero DLL coupling.

// Add the user's prompt to the conversation, then stream the reply.
Server.AddMessage('user', 'Summarize the Delphi roadmap.');
LResult := Server.ChatCompletionStream(CModelName, LChatConfig);

Two calls to stream a response from a local model. Everything underneath -- the binary, the process, the HTTP plumbing, the response parsing -- is handled.

Features

Feature	What It Means
🚀 Automatic server management	Downloads, installs, and auto-updates the llama.cpp server binary from GitHub releases. Version policies: auto, pinned, or manual.
💬 Chat completions	Synchronous and streaming with full token callback support. Token counts, timing, and generation speed in one result record.
🔧 Tool calling	Two-tier architecture: three meta-tools visible to the model, full catalog discovered at runtime. Agentic multi-round tool loop built in.
📐 Embeddings	Single and batch generation with cosine similarity. TChat automatically enables embeddings when an embedding model is configured.
🧠 Persistent memory	SQLite + FTS5 + HNSW vector index. Hybrid retrieval (keyword + semantic) with automatic recall injection per turn.
📄 Document ingest	Paragraph-aware chunking with configurable overlap. Drop a file into memory and let retrieval find the right pieces.
🧭 Session management	System prompt invariant, context-budget trimming, history compaction, and two-level persistence (JSON history + KV cache).
📥 HuggingFace models	Download, delete, and track GGUF models directly through the server API with SSE progress tracking.
🧪 Reasoning models	Configurable thinking tag display for chain-of-thought models. Show, hide, or replace with a placeholder.
⚡ GPU offloading	Automatic, full, or manual GPU layer control with Vulkan backend. Quantized KV cache (Q4_0, Q8_0) to fit larger contexts in VRAM.
🔄 Auto-updating	`vpAuto` checks GitHub on a configurable interval and updates the server binary silently. Your app always runs on the latest llama.cpp.
🔌 Built on StdApp	Console UI, HTTP, JSON, VFS, crypto, and more. One dependency tree, no external packages.
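
The embeddings row mentions cosine similarity. For readers new to the measure, here is a minimal, self-contained Delphi sketch of the formula itself. `CosineSimilarity` is a hypothetical helper written for illustration, not a BoxedLLaMA API; the toolkit's own routine may differ in name, signature, and element type.

```delphi
program CosineDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper, not part of the BoxedLLaMA API.
// Returns the cosine of the angle between two equal-length vectors,
// or 0 when either vector has zero magnitude.
function CosineSimilarity(const A, B: array of Double): Double;
var
  I: Integer;
  LDot, LNormA, LNormB: Double;
begin
  if Length(A) <> Length(B) then
    raise EArgumentException.Create('Vectors must have the same dimension');

  LDot := 0;
  LNormA := 0;
  LNormB := 0;
  for I := 0 to High(A) do
  begin
    LDot := LDot + A[I] * B[I];
    LNormA := LNormA + A[I] * A[I];
    LNormB := LNormB + B[I] * B[I];
  end;

  // Avoid dividing by zero for an all-zero vector.
  if (LNormA = 0) or (LNormB = 0) then
    Exit(0);

  Result := LDot / (Sqrt(LNormA) * Sqrt(LNormB));
end;

begin
  // Vectors pointing the same way score close to 1.
  Writeln(CosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]):0:4);
  // Orthogonal vectors score 0.
  Writeln(CosineSimilarity([1.0, 0.0], [0.0, 1.0]):0:4);
end.
```

Scores near 1 mean two embeddings point in nearly the same direction; scores near 0 mean the texts are unrelated as far as the embedding model can tell.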
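
The session management row lists a system prompt invariant and context-budget trimming. One plausible reading is sketched below under stated assumptions: the first message is the system prompt and is never dropped, and the oldest turns are discarded first. `TMsg`, `MakeMsg`, `EstimateTokens`, and `TrimToBudget` are hypothetical names, and the four-characters-per-token estimate is a placeholder. This is not TSession's actual algorithm.

```delphi
program TrimDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

type
  // Hypothetical message record, for illustration only.
  TMsg = record
    Role: string;
    Content: string;
  end;

function MakeMsg(const ARole, AContent: string): TMsg;
begin
  Result.Role := ARole;
  Result.Content := AContent;
end;

// Crude assumption: about four characters per token. A real session
// would rely on token counts reported by the server.
function EstimateTokens(const AText: string): Integer;
begin
  Result := (Length(AText) + 3) div 4;
end;

// Always keeps AHistory[0] (the system prompt), then keeps the newest
// messages that still fit within ABudget estimated tokens.
function TrimToBudget(const AHistory: TArray<TMsg>;
  ABudget: Integer): TArray<TMsg>;
var
  LKept: TList<TMsg>;
  LUsed, LCost, I: Integer;
begin
  if Length(AHistory) = 0 then
    Exit(nil);

  LKept := TList<TMsg>.Create;
  try
    LUsed := EstimateTokens(AHistory[0].Content);
    // Walk backwards from the newest message and stop at the first
    // one that would push the total over the budget.
    for I := High(AHistory) downto 1 do
    begin
      LCost := EstimateTokens(AHistory[I].Content);
      if LUsed + LCost > ABudget then
        Break;
      Inc(LUsed, LCost);
      LKept.Insert(0, AHistory[I]);
    end;
    LKept.Insert(0, AHistory[0]);
    Result := LKept.ToArray;
  finally
    LKept.Free;
  end;
end;

var
  LHistory: TArray<TMsg>;
  LMsg: TMsg;
begin
  LHistory := [
    MakeMsg('system', 'You are a helpful assistant.'),
    MakeMsg('user', 'First question, now quite old.'),
    MakeMsg('assistant', 'First answer, also old.'),
    MakeMsg('user', 'Latest question.')];

  for LMsg in TrimToBudget(LHistory, 12) do
    Writeln(LMsg.Role, ': ', LMsg.Content);
end.
```

With a budget of 12 estimated tokens, only the system prompt and the latest user message survive. The system prompt is kept even when it alone exceeds the budget, which is the simplest way to honor the invariant in a sketch like this.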

Architecture

Your Application
    |
    v
+-----------------------------------------+
|  BoxedLLaMA Toolkit                     |
|                                         |
|  TChat -----> TSession -----> TServer   |
|    |              |               |     |
|    v              v               v     |
|  TConsoleChat  TMemory    llama-server  |
|  (frontend)   (SQLite+    (managed      |
|               FTS5+HNSW)  subprocess)   |
|                               |         |
|  TToolRegistry <----tool---+  |         |
|  TToolBuilder    calls        |         |
+-----------------------------------------+
    |
    v
Local GGUF Models (Vulkan GPU inference)

📖 Full Documentation -- configuration, API reference, code examples, and architecture details for every module.

Getting Started

1. Clone the repository:

   git clone https://github.com/tinyBigGAMES/BoxedLLaMA.git

2. Open projects\Testbed\Testbed.dproj in Delphi 12 or higher.
3. Build the Testbed project (Win64 target).
4. Run it. The server binary downloads automatically on first launch.
5. Place your GGUF model files in C:\Dev\LLM\GGUF (or update the path in projects\Testbed\UTestbed.Common.pas). Single-file models go in the root; multimodal models with a mmproj file get their own subfolder. Reference local models by filename without the .gguf extension.
6. (Optional) Set the TAVILY_API_KEY environment variable for web search tools. Get a free key at Tavily (1,000 credits/month).
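
To make the naming rule for single-file models concrete, the sketch below lists the reference names for every .gguf file in the root of the model folder. `ModelRefFromFile` is a hypothetical helper, not a BoxedLLaMA API, and the sketch does not cover how multimodal subfolders are referenced.

```delphi
program ModelRefDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.IOUtils;

const
  // Default folder from the steps above; change it to match your setup.
  CModelDir = 'C:\Dev\LLM\GGUF';

// Hypothetical helper, not part of the BoxedLLaMA API: drops the folder
// and the trailing .gguf extension from a model path.
function ModelRefFromFile(const AFileName: string): string;
begin
  Result := TPath.GetFileNameWithoutExtension(AFileName);
end;

var
  LFile: string;
begin
  if not TDirectory.Exists(CModelDir) then
  begin
    Writeln('Model folder not found: ', CModelDir);
    Exit;
  end;

  // GetFiles searches the top-level folder only, so this lists the
  // single-file models stored in the root.
  for LFile in TDirectory.GetFiles(CModelDir, '*.gguf') do
    Writeln(ModelRefFromFile(LFile));
end.
```

Only the final extension is removed, so a file named with a quantization suffix before .gguf keeps that suffix in its reference name.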

Recommended Models

These vetted models work out of the box with the testbed demos:

Purpose	Model	Size	Download
💬 Chat (multimodal)	Gemma 4 E4B Abliterated Q4_K	5.3 GB	Download
👁️ Vision projector	mmproj for Gemma 4 E4B (bf16)	992 MB	Download
📐 Embeddings	Qwen3 Embedding 0.6B Q8_0	639 MB	Download

System Requirements

Component	Requirement
🖥️ Host OS	Windows 10/11 x64
🎮 GPU	Vulkan-capable GPU recommended
⚙️ Building from source	Delphi 12.x or higher
📦 Runtime dependencies	None -- server binary downloaded automatically

Important

This repository is under active development. Follow the repo or join the Discord to track progress.

Contributing

BoxedLLaMA is an open project. Whether you are fixing a bug, improving documentation, or proposing a feature, contributions are welcome.

🐛 Report bugs: Open an issue with a minimal reproduction
💡 Suggest features: Describe the use case first
🔧 Submit pull requests: Bug fixes, documentation improvements, and well-scoped features

Join the Discord to discuss development, ask questions, and share what you are building.

Support the Project

If BoxedLLaMA saves you time or sparks something useful:

⭐ Star the repo -- helps others find the project
🗣️ Spread the word -- write a post, mention it in a community
💬 Join us on Discord -- share what you are building
💖 Become a sponsor -- sponsorship directly funds development
🦋 Follow on Bluesky -- stay in the loop on releases

License

BoxedLLaMA is licensed under the Apache License 2.0. See LICENSE for details.

Apache 2.0 is a permissive open source license that lets you use, modify, and distribute BoxedLLaMA freely in both open source and commercial projects. You are not required to release your own source code. The license includes an explicit patent grant. Attribution is required -- keep the copyright notice and license file in place.

Links

📖 Documentation
💬 Discord
🦋 Bluesky
🏠 tinyBigGAMES

BoxedLLaMA™ -- Put llama.cpp on rails.
