Replace CFS/EEVDF with compact O(1) tiny scheduler#13

Merged
jserv merged 2 commits into main from sched
May 1, 2026
Conversation

jserv (Owner) commented May 1, 2026

No description provided.

jserv added 2 commits May 1, 2026 17:37
The first decompressor cleanup (e1849f1) deferred RD_LZMA, RD_BZIP2,
and RD_LZO with the note that LZO carried 728 measured .text bytes
while LZMA and BZIP2 had zero -- they were "default y" hygiene risks
rather than direct cost.  After RD_ZSTD/RD_LZ4/RD_XZ shipped, that
deferral is the last decompressor surface through which olddefconfig
can drift back: any future enabling of a selector (squashfs / btrfs /
jffs2 / f2fs / zram / crypto / lib) re-pulls LZO_DECOMPRESS or one of
the DECOMPRESS_* umbrella bools, and the size win evaporates silently.

Add explicit "# CONFIG_RD_LZMA / RD_BZIP2 / RD_LZO is not set" lines
to the inline kernel .config block, mirror them into the positive
olddefconfig-survivor verifier list, and extend the negative guard
loop with LZO_DECOMPRESS plus the three DECOMPRESS_LZMA /
DECOMPRESS_BZIP2 / DECOMPRESS_LZO umbrellas.  The guard now lists
LZO_DECOMPRESS on its own line because it has the widest set of
upstream selectors of the six; if any of them ever come back through
a different path the build aborts loudly instead of regressing.
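For reference, the three added lines use the standard Kconfig
"is not set" form (the same strings the olddefconfig-survivor
verifier then matches verbatim):

```
# CONFIG_RD_LZMA is not set
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZO is not set
```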

Result: linux.axf 1,212,960 -> 1,204,768 bytes (-8,192 / -0.68%);
vmlinux .init.text -5,744, .rodata -1,504, .text -736 (the bulk of
the drop is the boot-time decompressor-selection registration code
in .init.text, freed after init -- consistent with the pre-cut
rollup, which saw only lib/lzo's 728 resident bytes).  QEMU
MPS2-AN386 boots clean.

Linux 7.0's kernel/sched/fair.c carries no #ifdef CONFIG_SMP guard.
On a UP NOMMU image the SMP load-balancer (select_task_rq_fair 1,484,
sched_balance_rq 1,460, update_sd_lb_stats 912,
sched_balance_find_*_group 1,262, _nohz_idle_balance 424,
can_migrate_task 324, active_load_balance_cpu_stop 316, ~7.8KB total)
gets pinned by the sched_class callback table; --gc-sections cannot
reach it through the table.  Add the same kind of out-of-tree gate
that patches 0012/0013 used for debug.c and deadline.c, but this
time for the whole class.

CONFIG_SCHED_FAIR_TINY (default n) wraps fair.c body in #ifndef and
provides a three-priority O(1) class in the #else branch:

  - per-CPU bitmap + per-priority FIFO (HIGH/NORMAL/LOW)
  - O(1) pick:    find_first_bit(active) + list_first_entry
  - O(1) enqueue: list_add_tail + __set_bit
  - O(1) dequeue: list_del_init + __clear_bit when queue empties
  - cross-priority preemption at wakeup; round-robin within a
    priority via a fixed jiffies time-slice reset on set_next_task
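The data structure above can be sketched in user-space C.  This is a
minimal illustration of the bitmap + per-priority FIFO shape, not the
patch's code: the names (tiny_rq, tiny_enqueue, ...) are invented, ffs()
stands in for the kernel's find_first_bit, and a hand-rolled intrusive
list stands in for list_add_tail / list_del_init / list_first_entry.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* ffs() -- stand-in for find_first_bit */

enum { TINY_HIGH, TINY_NORMAL, TINY_LOW, TINY_NR_PRIO };

/* intrusive doubly-linked list node (kernel: &p->se.group_node) */
struct tiny_node {
    struct tiny_node *prev, *next;
};

struct tiny_rq {
    unsigned int active;                  /* per-CPU priority bitmap */
    struct tiny_node queue[TINY_NR_PRIO]; /* one FIFO head per priority */
};

static void tiny_rq_init(struct tiny_rq *rq)
{
    rq->active = 0;
    for (int i = 0; i < TINY_NR_PRIO; i++)
        rq->queue[i].prev = rq->queue[i].next = &rq->queue[i];
}

/* O(1) enqueue: list_add_tail + __set_bit */
static void tiny_enqueue(struct tiny_rq *rq, struct tiny_node *n, int prio)
{
    struct tiny_node *head = &rq->queue[prio];
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
    rq->active |= 1u << prio;
}

/* O(1) dequeue: list_del_init + __clear_bit when the queue empties */
static void tiny_dequeue(struct tiny_rq *rq, struct tiny_node *n, int prio)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
    if (rq->queue[prio].next == &rq->queue[prio])
        rq->active &= ~(1u << prio);
}

/* O(1) pick: find_first_bit(active) + list_first_entry */
static struct tiny_node *tiny_pick(struct tiny_rq *rq)
{
    if (!rq->active)
        return NULL;
    int prio = ffs((int)rq->active) - 1; /* lowest set bit = highest prio */
    return rq->queue[prio].next;
}
```

Every path is a constant number of pointer and bit operations, which is
why the real pick_task_fair can compile down to a handful of
instructions with no loops.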

Priority is a pure function of nice value: nice<0 -> HIGH,
nice==0 -> NORMAL, nice>0 -> LOW; SCHED_IDLE collapses to LOW;
SCHED_BATCH uses nice normally.  Tasks chain through the existing
&p->se.group_node (dead under !FAIR_GROUP_SCHED) so task_struct
stays unchanged.  The bucket index is recomputed from p->static_prio
on every callback; core.c's dequeue-modify-enqueue protocol (verified
across all four static_prio mutation sites: sched_fork at 4650/4653,
syscalls.c set_user_nice at 84 for RT/DL and at 89 for fair via
scoped_guard(sched_change, ...)) keeps the value stable across the
removal/insertion bracket.  RT preemption is unchanged: rt_sched_class
still preempts fair via the existing class chain walk in
pick_next_task_balance.
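The nice-to-bucket mapping is small enough to state directly.  A hedged
sketch (tiny_prio / tiny_nice are illustrative names, not the patch's;
120 is the kernel's DEFAULT_PRIO, so nice = static_prio - 120):

```c
#include <assert.h>

enum { TINY_HIGH, TINY_NORMAL, TINY_LOW };

/* nice value recovered from static_prio (kernel: static_prio - 120) */
static int tiny_nice(int static_prio)
{
    return static_prio - 120;
}

/* pure function of nice: nice<0 -> HIGH, nice==0 -> NORMAL,
 * nice>0 -> LOW; SCHED_IDLE collapses to LOW regardless of nice */
static int tiny_prio(int nice, int is_sched_idle)
{
    if (is_sched_idle)
        return TINY_LOW;
    if (nice < 0)
        return TINY_HIGH;
    if (nice == 0)
        return TINY_NORMAL;
    return TINY_LOW;
}
```

Because the bucket is recomputed from p->static_prio on every callback,
there is no cached index to keep in sync: core.c's
dequeue-modify-enqueue bracket guarantees the task is off the queues
while static_prio changes.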

Not the historical 2.6 O(1) scheduler -- no active/expired arrays,
no interactivity estimator (the gameable heuristic that motivated
CFS), no priority recalculation.  Just the priority bitmap + FIFO
data structure that O(1) got right, without the policy machinery that
O(1) got wrong.

The #else branch re-exports every symbol other TUs depend on:
update_curr_common (rt.c / deadline-class stub / stop_task.c / ext.c
runtime accounting), init_cfs_rq, fair_server_init,
init_sched_fair_class, sched_init_granularity, update_max_interval,
init_entity_runnable_average, post_init_entity_util_avg,
sched_balance_trigger, nohz_balance_{enter,exit}_idle,
nohz_run_idle_balance, update_group_capacity, __setparam_fair,
arch_asym_cpu_priority, plus sysctl_sched_base_slice and
sysctl_sched_migration_cost storage.  switched_to_fair and
prio_changed_fair filter on rq->donor->sched_class != fair_sched_class
to avoid spurious resched_curr when the runner is RT (mirrors mainline
behavior; the kernel dispatches both hooks unconditionally).
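The filtering shape those two stubs use can be mocked in isolation.
This is an assumption-laden sketch, not kernel code: the types and the
resched call are stand-ins, and the real check keys off
rq->donor->sched_class.

```c
#include <assert.h>
#include <stdbool.h>

struct sched_class { const char *name; };

static const struct sched_class fair_sched_class = { "fair_tiny" };
static const struct sched_class rt_sched_class   = { "rt" };

static bool resched_requested;

static void resched_curr_mock(void)
{
    resched_requested = true;
}

/* core.c dispatches this hook unconditionally; the stub only reschedules
 * when the current runner's class is the fair class itself, so an RT
 * runner never takes a spurious resched */
static void switched_to_fair_stub(const struct sched_class *runner_class)
{
    if (runner_class != &fair_sched_class)
        return;
    resched_curr_mock();
}
```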

pelt.c is left untouched.  With fair.c gated, its CFS-side entry
points lose their callers; rt.c keeps update_rt_rq_load_avg live.
The remaining PELT symbols are non-static so LTO largely cannot strip
them, but the cost is small (1.7KB).

Build wiring: build.sh adds the 0014 patch to the apply glob, sets
CONFIG_SCHED_FAIR_TINY=y in the inline kernel .config block, and
extends the post-olddefconfig verifier with a positive presence check.

Result: linux.axf 1,204,768 -> 1,188,352 bytes (-16,416 / -1.36%);
vmlinux .text 729,380 -> 713,412 (-15,968), .rodata -96, .init.text
-116, .bss -36, .data -32; kernel/sched/fair.c collapses from 16,782
bytes / 97 symbols to ~1,160 bytes / 22 symbols; pick_task_fair
compiles to 24 bytes of pure O(1) machine code (find_first_bit + list
head dereference + container_of, with no loops or rb-tree walks).

QEMU MPS2-AN386 boots clean to the BusyBox shell across three
back-to-back validate-qemu.sh runs against the full PGO workload
(17 fork/exec/wait sequences exercising hush spawn, cp /bin/busybox,
mv, ln, mkdir, rm, test pipelines).
jserv merged commit c9676f8 into main May 1, 2026
2 checks passed
jserv deleted the sched branch May 1, 2026 09:59