feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache) #9532
spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds two independent features on top: DFlash block-diffusion speculative decoding (via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.

Follows the turboquant thin-wrapper pattern — reuses the backend/cpp/llama-cpp grpc-server sources verbatim and patches only the build copy, to extend the KV allow-list and to wire up the buun-exclusive tree_budget / draft_topk options. DraftModel is already wired end-to-end (proto field 39 → params.speculative), so DFlash activation only needs the existing options passthrough (spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown shows them — this benefits turboquant too (previously users had to type them into the YAML manually).

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
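To make the activation path above concrete, a hypothetical model YAML sketch follows. The field names (`backend`, `draft_model`, `cache_type_k`/`cache_type_v`, `options`) are assumptions modeled on typical LocalAI model configs; the values `turbo3_tcq`, `spec_type:dflash`, `tree_budget`, and `draft_topk` come from this PR's description, while the numeric values are purely illustrative.

```yaml
# Hypothetical sketch — field names assumed, not confirmed by this PR.
name: my-dflash-model
backend: buun-llama-cpp
parameters:
  model: target-model.gguf
draft_model: dflash-drafter.gguf   # DFlashDraftModel GGUF (proto field 39 → params.speculative)
cache_type_k: turbo3_tcq           # one of the five turbo* values now in CacheTypeOptions
cache_type_v: turbo3_tcq
options:
  - spec_type:dflash               # existing options passthrough, buun-exclusive value
  - tree_budget:64                 # illustrative number
  - draft_topk:8                   # illustrative number
```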
Two additions that pair with the new backend:

- An Import()-side case that asserts preference buun-llama-cpp produces backend: buun-llama-cpp in the emitted YAML (mirroring the existing ik-llama-cpp and turboquant cases).
- The AdditionalBackends() spec now asserts that all three drop-in replacements are advertised, and verifies buun-llama-cpp's Modality/Description alongside the other two.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI's shared grpc-server.cpp reaches ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl call sites). That accessor was added to server_context_meta upstream after buun's 2026-04-05 fork point, so compiling against buun fails with:

  'struct server_context_meta' has no member named 'logit_bias_eog'

Rewrite the call sites — only in the buun grpc-server.cpp copy — to source the vector from params_base.sampling.logit_bias_eog instead. That vector is the underlying data the upstream meta accessor eventually returns (buun still carries common_params_sampling::logit_bias_eog at common.h:280), so the substitution yields identical behavior on both trees.

The sed is guarded by a grep for the call site, so this patch is self-disabling once buun rebases past the upstream refactor.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The previous substitution kept the call at 5 args, but buun predates the upstream refactor that also *added* the logit_bias_eog parameter to params_from_json_cmpl — buun's signature is still the 4-arg form (const llama_vocab*, const common_params&, int, const json&), and it still derives logit_bias_eog internally from the common_params. Replace the substitution with a line-delete.

The guard matches both the original call (ctx_server.get_meta().logit_bias_eog) and the previously substituted form (params_base.sampling.logit_bias_eog), so the script stays safe across re-runs and whatever state the tree was left in.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
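The guarded line-delete described above can be sketched as follows. The stand-in file contents and path are made up for illustration (the real patch targets the buun build copy of grpc-server.cpp); only the two guard patterns come from the commit message.

```shell
#!/bin/sh
# Stand-in source file containing one of the two known call-site forms.
SRC=$(mktemp)
cat > "$SRC" <<'EOF'
    params_from_json_cmpl(vocab, params, n_ctx,
        ctx_server.get_meta().logit_bias_eog,
        body);
EOF

# Guard matches both the original accessor and the previously substituted
# form, so re-runs (and already-patched trees) are no-ops — and once buun
# rebases past the upstream refactor, neither pattern matches and the
# patch self-disables.
if grep -qE '(ctx_server\.get_meta\(\)|params_base\.sampling)\.logit_bias_eog' "$SRC"; then
  sed -i -E '/(ctx_server\.get_meta\(\)|params_base\.sampling)\.logit_bias_eog/d' "$SRC"
fi

remaining=$(grep -c logit_bias_eog "$SRC" || true)
echo "remaining: $remaining"   # prints: remaining: 0
```

Deleting the whole argument line collapses the call back to buun's 4-arg signature, which is exactly the shape its params_from_json_cmpl expects.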
Adds the buun-llama-cpp backend to the same CI pipelines that turboquant and sherpa-onnx already use:

- scripts/changed-backends.js: path resolution for Dockerfile.buun-llama-cpp, plus fork-of-fork detection (changes under backend/cpp/llama-cpp/ also retrigger the buun pipeline, mirroring how turboquant is handled).
- .github/workflows/test-extra.yml: a detect-changes output and a new tests-buun-llama-cpp-grpc job that runs make test-extra-backend-buun-llama-cpp (turbo3 V-cache, same rationale as tests-turboquant-grpc).
- .github/workflows/backend.yml: 9 matrix entries (CUDA 12/13, L4T CUDA 13 ARM64, ROCm, SYCL f32/f16, CPU, L4T ARM64, Vulkan), paired with each existing turboquant entry so image builds have platform parity.

Also updates .agents/ai-coding-assistants.md to clarify that AI agents operating under the human submitter's git identity SHOULD emit Signed-off-by via `git commit -s` (never inventing or guessing another identity) — documenting the workflow this PR uses.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
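The fork-of-fork retrigger rule can be sketched in shell (the real logic lives in scripts/changed-backends.js; the function name and the turboquant directory path here are illustrative, not taken from the repo):

```shell
#!/bin/sh
# Sketch: map changed file paths to the backend pipelines that must re-run.
# A change under the shared backend/cpp/llama-cpp/ tree retriggers the base
# backend AND both thin-wrapper forks, since they reuse its sources verbatim.
affected_backends() {
  for path in "$@"; do
    case "$path" in
      backend/cpp/buun-llama-cpp/*) echo buun-llama-cpp ;;
      backend/cpp/llama-cpp/*)
        echo llama-cpp
        echo turboquant        # existing fork-of handling, per the commit
        echo buun-llama-cpp    # new: mirrors the turboquant rule
        ;;
    esac
  done | sort -u
}

# A shared grpc-server change fans out to all three pipelines.
affected_backends backend/cpp/llama-cpp/grpc-server.cpp
```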
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls atomicAdd with a double* destination. Native double atomicAdd is only available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12 Docker image builds for the full published arch range (which includes sm_50/sm_52), so nvcc fails with:

  fattn.cu:812: error: no instance of overloaded function "atomicAdd" matches the argument list, argument types are: (double *, double)

Add the canonical CAS-loop shim from the CUDA C Programming Guide (B.15 Atomic Functions), guarded on __CUDA_ARCH__ < 600. On sm_60+ the guard is false and nvcc picks up the native intrinsic as before.

The patch file lives under backend/cpp/buun-llama-cpp/patches/ and is applied to the cloned fork tree by apply-patches.sh (the infrastructure already put in place for exactly this class of backport).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
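For reference, the shim in question is the standard snippet from the CUDA C Programming Guide's Atomic Functions appendix (exact placement within fattn.cu is up to the patch):

```cuda
// Double-precision atomicAdd fallback, compiled only for pre-sm_60 targets;
// on sm_60+ this block is skipped and nvcc resolves the native intrinsic.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
static __device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret bits, add in double, publish only if no other thread
        // raced us between the read and the CAS.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Note: comparing the raw integer bits (not doubles) also makes the
        // loop terminate correctly when the accumulator holds NaN.
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif
```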
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var, laneMask), omitting the optional width parameter. The hipification shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that requires all four arguments, so hipcc fails with:

  argmax.cu:265: too few arguments provided to function-like macro invocation
  note: macro '__shfl_xor_sync' defined here:
    #define __shfl_xor_sync(mask, var, laneMask, width) \
      __shfl_xor(var, laneMask, width)

Every other call in the same file already passes WARP_SIZE explicitly; aligning these two with that convention fixes the hipblas build without changing CUDA codegen (warpSize is the CUDA default).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
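Illustratively, the fix is just supplying the width argument (variable names here are made up; the real call sites are in argmax.cu's top-K merge):

```cuda
// 3-arg form: valid CUDA (width defaults to warpSize), but hip.h's
// function-like macro demands all four arguments, so hipcc rejects it:
//   val = __shfl_xor_sync(0xFFFFFFFF, val, lane_mask);
// 4-arg form matching the rest of the file; identical CUDA codegen:
val = __shfl_xor_sync(0xFFFFFFFF, val, lane_mask, WARP_SIZE);
```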
…ht128 shuffles

Two more hipblas-only build failures in buun's fattn.cu, fixed under the same patches/ infrastructure:

1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration + TCQ codebook upload paths call the symbol variants of cudaMemcpy. ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …), but the symbol pair was never added, yielding 15+ "use of undeclared identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases alongside the existing memcpy block.

2. __shfl_xor_sync fwht128 calls — the same 3-arg omission pattern as the earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width argument that hip.h:33 requires. Add WARP_SIZE.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
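A sketch of the two missing aliases for point 1, assuming the alias style matches the existing cudaMemcpy* entries in hip.h (hipMemcpyToSymbol / hipMemcpyFromSymbol are the corresponding HIP runtime calls):

```cpp
// ggml/src/ggml-cuda/vendors/hip.h — added next to the existing
// cudaMemcpy / cudaMemcpyAsync / cudaMemcpy2DAsync alias block.
#define cudaMemcpyToSymbol   hipMemcpyToSymbol
#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
```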