Freebird

A modern C++23 header-only thread pool with dependency-driven scheduling, priority domains, and lock-free SPSC buffering.

Features

  • Header-only — Drop the include/freebyrd/ directory into your project; no compiled library needed
  • Lock-free task queues — Bounded MPMC ring buffers using Vyukov's algorithm with per-slot sequence counters and cache-line padded head/tail atomics
  • Dependency gates — Atomic countdown barriers that auto-schedule a completion task when all dependencies are satisfied; chain gates for multi-phase pipelines
  • Priority domains — Named task categories with dedicated queues and configurable priority ordering; workers dequeue from higher-priority domains first
  • Per-worker local queues — Cache-affine task submission with work stealing across workers for load balancing
  • SPSC swath buffers — Double/triple/quad-buffered handoff for producer–consumer pipelines; all state packed into a single atomic<uint64_t> with C++20 wait/notify
  • Small-buffer optimized tasks — Type-erased callables with 48-byte SBO; lambdas with typical captures avoid heap allocation entirely
  • Clean lifecycle — std::jthread with std::stop_token for cooperative shutdown; no condition variables, no mutexes on the hot path

Requirements

  • C++23 compiler (GCC 13+, Clang 17+)
  • CMake 3.20+ (only needed for tests)
  • pthreads (Linux/macOS)

Integration

Option 1: Copy headers (recommended)

Copy the include/freebyrd/ directory into your project's include path. Then:

#include <freebyrd/freebyrd.h>

int main() {
    frb::thread_pool pool;
    pool.submit(frb::task([]{ /* work */ }));
    pool.wait_idle();
}

Compile with -std=c++23 -lpthread.

Option 2: CMake add_subdirectory

add_subdirectory(path/to/freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)

The CMake target freebyrd is an INTERFACE library — it only propagates include paths, the C++23 standard requirement, and the Threads dependency. No static/shared library is built.

Option 3: CMake FetchContent

include(FetchContent)
FetchContent_Declare(freebyrd
    GIT_REPOSITORY https://your-repo/freebyrd.git
    GIT_TAG main
)
FetchContent_MakeAvailable(freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)

Building the Tests

mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./freebird_tests

Run a subset of tests by name filter:

./freebird_tests pipeline       # runs only tests containing "pipeline"
./freebird_tests stress         # runs only stress tests

Architecture

Component Overview

| Header | Class | Purpose |
| --- | --- | --- |
| cache_line.h | cache_aligned&lt;T&gt; | Pads any type to a full cache line (64 bytes or hardware_destructive_interference_size) to prevent false sharing. Uses a variadic constructor so non-movable types like std::atomic can be constructed in-place. |
| task.h | task | Type-erased callable with 48-byte small-buffer optimization. Captures ≤ 48 bytes are stored inline (no heap allocation); larger captures fall back to new. Move-only; uses bitwise copy of the SBO storage for zero-overhead moves. |
| task_queue.h | task_queue | Bounded MPMC (multi-producer, multi-consumer) ring buffer based on Dmitry Vyukov's algorithm. Each slot has an atomic sequence counter that acts as both a ticket and a fence. Head and tail atomics are cache-line padded to eliminate false sharing. |
| dependency_gate.h | dependency_gate | Atomic dependency counter, constructed with a count and a completion task. Each call to arrive() decrements the counter; when it hits zero, the completion task is automatically submitted to the pool. Gates can be reused via reset(). |
| task_domain.h | task_domain | A named task category with its own task_queue and a sequence_index that determines dequeue priority (lower = higher priority). Workers sweep domains in priority order. |
| swath_buffer.h | swath_buffer&lt;T&gt; | SPSC (single-producer, single-consumer) double/triple/quad-buffered handoff. Packs the consumer-available count (M), producer-available count (D), and a wait flag (W) into a single atomic&lt;uint64_t&gt;. Uses C++20 atomic::wait()/notify_all() for blocking consumers; no condition variables. |
| thread_pool.h | thread_pool | The pool orchestrator. Manages std::jthread workers, each with a local task_queue. Supports global-queue fallback, domain-priority dequeue, and round-robin work stealing. Idle workers spin briefly, then sleep via atomic::wait(). wait_idle() blocks until all submitted tasks (including those spawned by running tasks) complete. |

Worker Dequeue Order

Each worker thread follows this priority when looking for work:

  1. Own local queue — best cache locality
  2. Preferred domain queue — the domain this worker is associated with
  3. Other domain queues — swept in sequence_index order (lower = higher priority)
  4. Global queue — catch-all fallback
  5. Steal from other workers — round-robin across peer local queues
  6. Back off — spin for 64 iterations (with CPU pause), then sleep via atomic::wait()

Dependency-Driven Scheduling

Traditional barrier-based parallelism (all tasks in phase N must complete before any task in phase N+1 can start) leaves cores idle at phase boundaries. Freebird uses dependency gates instead:

┌─────────────────────────────────────────────────────┐
│ Phase 1: Decode blocks (20,480 tasks)               │
│   Each block calls gate.arrive() on its strip's     │
│   dependency_gate when finished                     │
└───────┬─────────────────────────────────────────────┘
        │ (no barrier — strips start as soon as
        │  their blocks complete)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 2: Horizontal filter (640 tasks)              │
│   Each strip's gate fires a h_filter task when      │
│   all blocks in that strip are decoded.             │
│   Each h_filter task calls arrive() on the          │
│   resolution's v_gate.                              │
└───────┬─────────────────────────────────────────────┘
        │ (no barrier — resolutions start as soon
        │  as all their strips complete)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 3: Vertical filter (40 tasks)                 │
│   Fires when all h_filter strips for a resolution   │
│   are done.                                         │
└───────┬─────────────────────────────────────────────┘
        │ (v_filter completion submits disk write)
        ▼
┌─────────────────────────────────────────────────────┐
│ Phase 4: Disk write + fdatasync (40 tasks)          │
│   Writes decoded strip data to the real filesystem  │
│   and forces to storage with fdatasync(). Runs      │
│   concurrently with decode/filter work on           │
│   remaining tiles.                                  │
└─────────────────────────────────────────────────────┘

This means Phase 2 tasks for early strips can overlap with Phase 1 tasks for later strips, and Phase 4 disk writes overlap with all earlier phases — the pipeline stays full and all cores stay busy.

Performance

On a 16-core machine (GCC 15, Release, -O3):

| Metric | Pipeline only | Full suite (33 tests) |
| --- | --- | --- |
| CPU utilization | 1465% (91.6% of 16 cores) | 1398% (87.4%) |
| Wall time | 2.25 s | 2.50 s |
| User+System time | 33.0 s | 34.8 s |
| Disk I/O (NVMe, fdatasync) | ~20 MB, 0% CPU impact | ~20 MB |

The disk writes (512 KB × 40 strips with fdatasync() to btrfs/NVMe) complete within the compute budget of remaining tiles — zero measurable impact on CPU utilization.

API Reference

thread_pool

frb::pool_config cfg{
    .num_threads = 0,                // 0 = hardware_concurrency()
    .global_queue_capacity = 65536,
    .local_queue_capacity = 1024,
};

frb::thread_pool pool(cfg);

// Submit a task (default domain).
pool.submit(frb::task([] { do_work(); }));

// Submit to a specific domain.
auto dom_id = pool.register_domain("decode", /*sequence_index=*/0);
pool.submit(frb::task([] { decode_block(); }), dom_id);

// Submit to a specific worker's local queue (cache affinity).
pool.submit_local(frb::task([] { hot_path(); }), worker_id, dom_id);

// Block until all pending tasks complete.
pool.wait_idle();

// Query current worker index from within a task (-1 if non-worker thread).
int32_t wid = frb::thread_pool::current_worker_id();

dependency_gate

frb::thread_pool pool;

// Create a gate that fires when 10 dependencies arrive.
frb::dependency_gate gate(pool, 10, frb::task([] {
    // Runs exactly once, when the 10th arrive() is called.
    std::printf("All dependencies satisfied!\n");
}));

// From various tasks:
gate.arrive();      // decrement by 1
gate.arrive(3);     // decrement by 3

// Reuse the gate for the next round:
gate.reset(pool, 5, frb::task([] { /* new completion */ }));

swath_buffer<T>

frb::swath_buffer<std::vector<float>> buf(3); // triple-buffered

// Producer thread:
auto idx = buf.try_acquire_producer();     // returns std::optional<uint32_t>
if(idx) {
    buf[*idx] = produce_data();
    buf.release_to_consumer();
}

// Consumer thread:
auto idx = buf.acquire_consumer_blocking(); // blocks until data available
process(buf[idx]);
buf.release_to_producer();

// Non-blocking consumer:
if(auto idx = buf.try_acquire_consumer()) {
    process(buf[*idx]);
    buf.release_to_producer();
}

task_domain

// Register domains with priority ordering (lower sequence_index = higher priority).
auto d_decode    = pool.register_domain("decode",    /*sequence_index=*/0);
auto d_transform = pool.register_domain("transform", /*sequence_index=*/1);
auto d_output    = pool.register_domain("output",    /*sequence_index=*/2);

// Submit to specific domains — workers service higher-priority domains first.
pool.submit(frb::task([] { decode(); }),    d_decode);
pool.submit(frb::task([] { transform(); }), d_transform);

Test Suite

33 tests covering unit, integration, and stress scenarios:

| Category | Tests | Description |
| --- | --- | --- |
| task | 4 | SBO path, heap path, move semantics, empty state |
| task_queue | 4 | Push/pop, empty pop, fill/drain, MPMC stress (100K tasks, 4P/4C) |
| thread_pool | 5 | Single task, 10K tasks, wait-idle-empty, submit-from-task chains, worker ID |
| domains | 1 | Multi-domain submit and completion |
| dependency_gate | 4 | Single dep, multi-dep, concurrent arrive (10K), 3-stage chain |
| swath_buffer | 3 | Basic handoff, double-buffer fill/drain, SPSC concurrent (50K items) |
| stress | 11 | Rapid submit/wait (100 rounds × 1K), fan-in (500-wide, 20 rounds), 200-deep gate chains, mixed domains (10K), pipeline simulation (512 blocks → 32 strips), recursive fan-out (4-way × 5 levels = 1365 tasks), contention hammer (100K tasks, 16 threads), local submit (10K), pool create/destroy (20 cycles), gate reset/reuse (50 rounds), 3-stage swath pipeline (50K items) |
| pipeline | 1 | Full-scale multi-tile multi-resolution codec pipeline with disk I/O: 8 tiles × 5 resolutions × 16 strips × 32 blocks = 20,480 decode tasks → 640 h_filter tasks → 40 v_filter tasks → 40 disk write+fdatasync tasks; 4 priority domains, parallel tile submission, real filesystem output |

All tests pass under:

  • Release (optimized, no sanitizers)
  • Debug + AddressSanitizer (memory safety)
  • RelWithDebInfo + ThreadSanitizer (data race detection)

Project Structure

freebyrd/
├── CMakeLists.txt              # INTERFACE library + test target
├── README.md
├── include/
│   └── freebyrd/
│       ├── freebyrd.h          # umbrella header (include this)
│       ├── cache_line.h        # cache_aligned<T>
│       ├── task.h              # task (type-erased callable, 48-byte SBO)
│       ├── task_queue.h        # task_queue (Vyukov MPMC ring buffer)
│       ├── task_domain.h       # task_domain (named priority queue)
│       ├── dependency_gate.h   # dependency_gate (atomic countdown)
│       ├── swath_buffer.h      # swath_buffer<T> (SPSC multi-buffered handoff)
│       └── thread_pool.h       # thread_pool (orchestrator + all inline impl)
├── tests/
│   └── test_pool.cpp           # 33 tests (unit, stress, full pipeline)
└── build/                      # out-of-source build directory

License

This project is licensed under the GNU General Public License v3.0.

Copyright (C) 2025-2026 Grok Image Compression Inc.
