Freebird

A modern C++23 header-only thread pool with dependency-driven scheduling, priority domains, and lock-free SPSC buffering.

Features

  • Header-only — Drop the include/freebyrd/ directory into your project; no compiled library needed
  • Lock-free task queues — Bounded MPMC ring buffers using Vyukov's algorithm with per-slot sequence counters and cache-line padded head/tail atomics
  • Dependency gates — Atomic countdown barriers that auto-schedule a completion task when all dependencies are satisfied; chain gates for multi-phase pipelines
  • Priority domains — Named task categories with dedicated queues and configurable priority ordering; workers dequeue from higher-priority domains first
  • Per-worker local queues — Cache-affine task submission with work stealing across workers for load balancing
  • SPSC swath buffers — Double/triple/quad-buffered handoff for producer–consumer pipelines; all state packed into a single atomic<uint64_t> with C++20 wait/notify
  • Small-buffer optimized tasks — Type-erased callables with 48-byte SBO; lambdas with typical captures avoid heap allocation entirely
  • Clean lifecycle — std::jthread with std::stop_token for cooperative shutdown; no condition variables, no mutexes on the hot path

Requirements

  • C++23 compiler (GCC 13+, Clang 17+)
  • CMake 3.20+ (only needed for tests)
  • pthreads (Linux/macOS)

Integration

Option 1: Copy headers (recommended)

Copy the include/freebyrd/ directory into your project's include path. Then:

#include <freebyrd/freebyrd.h>

int main() {
    frb::thread_pool pool;
    pool.submit(frb::task([]{ /* work */ }));
    pool.wait_idle();
}

Compile with -std=c++23 -lpthread.

Option 2: CMake add_subdirectory

add_subdirectory(path/to/freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)

The CMake target freebyrd is an INTERFACE library — it only propagates include paths, the C++23 standard requirement, and the Threads dependency. No static/shared library is built.

Option 3: CMake FetchContent

include(FetchContent)
FetchContent_Declare(freebyrd
    GIT_REPOSITORY https://your-repo/freebyrd.git
    GIT_TAG main
)
FetchContent_MakeAvailable(freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)

Building the Tests

mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./freebird_tests

Run a subset of tests by name filter:

./freebird_tests pipeline       # runs only tests containing "pipeline"
./freebird_tests stress         # runs only stress tests

Architecture

Component Overview

| Header | Class | Purpose |
| --- | --- | --- |
| cache_line.h | cache_aligned&lt;T&gt; | Pads any type to a full cache line (64 bytes or hardware_destructive_interference_size) to prevent false sharing. Uses a variadic constructor so non-movable types like std::atomic can be constructed in-place. |
| task.h | task | Type-erased callable with 48-byte small-buffer optimization. Captures ≤ 48 bytes are stored inline (no heap allocation); larger captures fall back to new. Move-only; uses bitwise copy of the SBO storage for zero-overhead moves. |
| task_queue.h | task_queue | Bounded MPMC (multi-producer, multi-consumer) ring buffer based on Dmitry Vyukov's algorithm. Each slot has an atomic sequence counter that acts as both a ticket and a fence. Head and tail atomics are cache-line padded to eliminate false sharing. |
| dependency_gate.h | dependency_gate | Atomic dependency counter, constructed with a count and a completion task. Each call to arrive() decrements the counter; when it hits zero, the completion task is automatically submitted to the pool. Gates can be reused via reset(). |
| task_domain.h | task_domain | A named task category with its own task_queue and a sequence_index that determines dequeue priority (lower = higher priority). Workers sweep domains in priority order. |
| swath_buffer.h | swath_buffer&lt;T&gt; | SPSC (single-producer, single-consumer) double/triple/quad-buffered handoff. Packs the consumer-available count (M), producer-available count (D), and a wait flag (W) into a single atomic&lt;uint64_t&gt;. Uses C++20 atomic::wait()/notify_all() for blocking consumers; no condition variables. |
| thread_pool.h | thread_pool | The pool orchestrator. Manages std::jthread workers, each with a local task_queue. Supports global-queue fallback, domain-priority dequeue, and round-robin work stealing. Idle workers spin briefly, then sleep via atomic::wait(). wait_idle() blocks until all submitted tasks (including those spawned by running tasks) complete. |

Worker Dequeue Order

Each worker thread follows this priority when looking for work:

  1. Own local queue — best cache locality
  2. Preferred domain queue — the domain this worker is associated with
  3. Other domain queues — swept in sequence_index order (lower = higher priority)
  4. Global queue — catch-all fallback
  5. Steal from other workers — round-robin across peer local queues
  6. Back off — spin for 64 iterations (with CPU pause), then sleep via atomic::wait()

Dependency-Driven Scheduling

Traditional barrier-based parallelism (all tasks in phase N must complete before any task in phase N+1 can start) leaves cores idle at phase boundaries. Freebird uses dependency gates instead:

┌─────────────────────────────────────────────────────┐
│ Phase 1: Decode blocks (20,480 tasks)               │
│   Each block calls gate.arrive() on its strip's     │
│   dependency_gate when finished                     │
└───────┬─────────────────────────────────────────────┘
        │ (no barrier — strips start as soon as
        │  their blocks complete)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 2: Horizontal filter (640 tasks)              │
│   Each strip's gate fires a h_filter task when      │
│   all blocks in that strip are decoded.             │
│   Each h_filter task calls arrive() on the          │
│   resolution's v_gate.                              │
└───────┬─────────────────────────────────────────────┘
        │ (no barrier — resolutions start as soon
        │  as all their strips complete)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 3: Vertical filter (40 tasks)                 │
│   Fires when all h_filter strips for a resolution   │
│   are done.                                         │
└───────┬─────────────────────────────────────────────┘
        │ (v_filter completion submits disk write)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 4: Disk write + fdatasync (40 tasks)          │
│   Writes decoded strip data to the real filesystem  │
│   and forces to storage with fdatasync(). Runs      │
│   concurrently with decode/filter work on           │
│   remaining tiles.                                  │
└─────────────────────────────────────────────────────┘

This means Phase 2 tasks for early strips can overlap with Phase 1 tasks for later strips, and Phase 4 disk writes overlap with all earlier phases — the pipeline stays full and all cores stay busy.

Performance

On a 16-core machine (GCC 15, Release, -O3):

| Metric | Pipeline only | Full suite (33 tests) |
| --- | --- | --- |
| CPU utilization | 1465% (91.6% of 16 cores) | 1398% (87.4%) |
| Wall time | 2.25 s | 2.50 s |
| User+System time | 33.0 s | 34.8 s |
| Disk I/O (NVMe, fdatasync) | ~20 MB, 0% CPU impact | ~20 MB |

The disk writes (512 KB × 40 strips with fdatasync() to btrfs/NVMe) complete within the compute budget of remaining tiles — zero measurable impact on CPU utilization.

API Reference

thread_pool

frb::pool_config cfg{
    .num_threads = 0,                // 0 = hardware_concurrency()
    .global_queue_capacity = 65536,
    .local_queue_capacity = 1024,
};

frb::thread_pool pool(cfg);

// Submit a task (default domain).
pool.submit(frb::task([] { do_work(); }));

// Submit to a specific domain.
auto dom_id = pool.register_domain("decode", /*sequence_index=*/0);
pool.submit(frb::task([] { decode_block(); }), dom_id);

// Submit to a specific worker's local queue (cache affinity).
pool.submit_local(frb::task([] { hot_path(); }), worker_id, dom_id);

// Block until all pending tasks complete.
pool.wait_idle();

// Query current worker index from within a task (-1 if non-worker thread).
int32_t wid = frb::thread_pool::current_worker_id();

dependency_gate

frb::thread_pool pool;

// Create a gate that fires when 10 dependencies arrive.
frb::dependency_gate gate(pool, 10, frb::task([] {
    // Runs exactly once, when the 10th arrive() is called.
    std::printf("All dependencies satisfied!\n");
}));

// From various tasks:
gate.arrive();      // decrement by 1
gate.arrive(3);     // decrement by 3

// Reuse the gate for the next round:
gate.reset(pool, 5, frb::task([] { /* new completion */ }));

swath_buffer<T>

frb::swath_buffer<std::vector<float>> buf(3); // triple-buffered

// Producer thread:
auto idx = buf.try_acquire_producer();     // returns std::optional<uint32_t>
if(idx) {
    buf[*idx] = produce_data();
    buf.release_to_consumer();
}

// Consumer thread:
auto idx = buf.acquire_consumer_blocking(); // blocks until data available
process(buf[idx]);
buf.release_to_producer();

// Non-blocking consumer:
if(auto idx = buf.try_acquire_consumer()) {
    process(buf[*idx]);
    buf.release_to_producer();
}

task_domain

// Register domains with priority ordering (lower sequence_index = higher priority).
auto d_decode    = pool.register_domain("decode",    /*sequence_index=*/0);
auto d_transform = pool.register_domain("transform", /*sequence_index=*/1);
auto d_output    = pool.register_domain("output",    /*sequence_index=*/2);

// Submit to specific domains — workers service higher-priority domains first.
pool.submit(frb::task([] { decode(); }),    d_decode);
pool.submit(frb::task([] { transform(); }), d_transform);

Test Suite

33 tests covering unit, integration, and stress scenarios:

| Category | Tests | Description |
| --- | --- | --- |
| task | 4 | SBO path, heap path, move semantics, empty state |
| task_queue | 4 | Push/pop, empty pop, fill/drain, MPMC stress (100K tasks, 4P/4C) |
| thread_pool | 5 | Single task, 10K tasks, wait-idle-empty, submit-from-task chains, worker ID |
| domains | 1 | Multi-domain submit and completion |
| dependency_gate | 4 | Single dep, multi-dep, concurrent arrive (10K), 3-stage chain |
| swath_buffer | 3 | Basic handoff, double-buffer fill/drain, SPSC concurrent (50K items) |
| stress | 11 | Rapid submit/wait (100 rounds × 1K), fan-in (500-wide, 20 rounds), 200-deep gate chains, mixed domains (10K), pipeline simulation (512 blocks → 32 strips), recursive fan-out (4-way × 5 levels = 1365 tasks), contention hammer (100K tasks, 16 threads), local submit (10K), pool create/destroy (20 cycles), gate reset/reuse (50 rounds), 3-stage swath pipeline (50K items) |
| pipeline | 1 | Full-scale multi-tile multi-resolution codec pipeline with disk I/O: 8 tiles × 5 resolutions × 16 strips × 32 blocks = 20,480 decode tasks → 640 h_filter tasks → 40 v_filter tasks → 40 disk write+fdatasync tasks; 4 priority domains, parallel tile submission, real filesystem output |

All tests pass under:

  • Release (optimized, no sanitizers)
  • Debug + AddressSanitizer (memory safety)
  • RelWithDebInfo + ThreadSanitizer (data race detection)

Project Structure

freebyrd/
├── CMakeLists.txt              # INTERFACE library + test target
├── README.md
├── include/
│   └── freebyrd/
│       ├── freebyrd.h          # umbrella header (include this)
│       ├── cache_line.h        # cache_aligned<T>
│       ├── task.h              # task (type-erased callable, 48-byte SBO)
│       ├── task_queue.h        # task_queue (Vyukov MPMC ring buffer)
│       ├── task_domain.h       # task_domain (named priority queue)
│       ├── dependency_gate.h   # dependency_gate (atomic countdown)
│       ├── swath_buffer.h      # swath_buffer<T> (SPSC multi-buffered handoff)
│       └── thread_pool.h       # thread_pool (orchestrator + all inline impl)
├── tests/
│   └── test_pool.cpp           # 33 tests (unit, stress, full pipeline)
└── build/                      # out-of-source build directory

License

This project is licensed under the GNU General Public License v3.0.

Copyright (C) 2025-2026 Grok Image Compression Inc.
