sched_ext: The BPF extensible scheduler class MC
Conveners
- Andrea Righi (NVIDIA)
- Daniel Hodges (Meta)
- Changwoo Min (Igalia)
- Joel Fernandes (NVIDIA)
Description
sched_ext[1] is a Linux kernel feature that allows safe task schedulers to be implemented in BPF and loaded dynamically at runtime. It enables safe and rapid iteration on scheduler implementations, radically widening the scope of scheduling strategies that can be experimented with and deployed, even in massive and complex production environments.
This MC is the space for the community to discuss developments in sched_ext, its impact on the community, and future strategies for improving integration with other Linux kernel subsystems.
Last year the sched_ext MC proved highly productive in facilitating coordination with distribution maintainers, allowing us to clarify their requirements and ease potential maintenance burdens. This collaboration directly contributed to upstream changes, including patches such as [2].
Topics for discussion include (but are not limited to):
- Use of BPF arenas for task/CPU context sharing between kernel, BPF, and user space
- Composable schedulers/scheduler libraries with BPF arenas
- Deadline server(s) for the SCHED_EXT class
- Integration with other scheduling-related features (RCU, proxy execution, PREEMPT_RT, etc.)
- Potential integration with other Linux subsystems (e.g., Rust-for-Linux)
- Scheduling for gaming and latency-sensitive workloads
- User-space scheduling
- Tickless scheduling
- Tools and benchmarks to analyze and understand scheduler activities
While we already have a tentative schedule with existing talk proposals covering the topics mentioned above, we also plan to open a public CFP to accept additional topics. Time permitting, we are open to readjusting the schedule to accommodate further discussions that are relevant to the Linux community.
[1] https://github.com/sched-ext/scx
[2] https://lore.kernel.org/all/20240921193921.75594-1-andrea.righi@linux.dev/
- Andrea Righi (NVIDIA), 12/12/2025, 15:00
This talk will kick off the sched_ext MC session with a brief overview of the project's current state: what features are available, what's missing and what remains under development.
We'll also look ahead to discuss gaps in the framework, ideas yet to be explored, and how we envision the sched_ext community growing.
The goal is to align contributors and spark discussions around...
- From Fragmentation to Integration: Enhancing sched_ext BPF Scheduler Interoperability with Linux
  Daniel Hodges (Meta), 12/12/2025, 15:18
In this talk, we will explore the challenges and opportunities in improving the interoperability of sched_ext BPF schedulers with the rest of Linux, in particular existing scheduler code and other subsystems. While sched_ext BPF schedulers offer powerful and flexible scheduling capabilities, their integration with other kernel components can often be fragmented and complex. This talk...
- John Stultz (Google), 12/12/2025, 15:36
Proxy Execution provides a generalized form of priority inheritance that leaves mutex-blocked tasks on the run-queue. If the scheduler then tries to run a mutex-blocked task, it instead runs the mutex owner on the blocked task's behalf, so the mutex can be released.
In order for this to work, we introduced the idea of split contexts (scheduler and execution), tracking both the task...
- Jake Hillion (Meta), 12/12/2025, 15:54
sched_ext has guardrails in the kernel and lots of examples in BPF for how to schedule tasks effectively. We use sensible defaults for idle tracking and NUMA-aware masks, and prevent you from losing track of tasks in BPF. But what happens when you try to schedule badly?
scx_chaos builds on top of scx_p2dq, another sched_ext scheduler. It adds options for introducing delays, randomly decreasing CPU...
- Emil Tsalapatis (Meta Platforms), 12/12/2025, 16:12
This talk will present our progress on arena-based data structures for quickly evolving scheduler abstractions (DSQs, CPU topology).
We currently write scheduling algorithms in terms of operations on primitives provided by the kernel (BPF hash maps/arrays, CPU bitmasks, DSQs). Adding new operations to these primitives is work-intensive because it requires modifying the underlying kernel...
- Patrick Lu (Meta), Valentin Andrei (Meta), Pat Somaru (Meta), 12/12/2025, 17:00
We present one of the first deployments of sched_ext to a large fleet of AI training hardware composed of multi-socket CPU systems with attached NVIDIA GPUs. GPU training workflows synchronize frequently across all the training processes, which makes them extremely sensitive to task scheduling micro-delays that prevent work from being dispatched to the GPUs. In addition, the training...
- Aniket Gattani (Google), Josh Don (Google), 12/12/2025, 17:36
Thread placement on machines with complex cache hierarchies (such as AMD CPU Core Complexes (CCXs)) requires careful management for optimal performance. Unlike NUMA domains, which are large enough that hard partitioning is a viable strategy, these chiplet domains are too small to schedule efficiently without a means of enforcing some degree of soft affinity. Spillover of threads to...
- Dr Changwoo Min (Igalia), 12/12/2025, 17:54
The LAVD scheduler is a sched_ext scheduler designed to optimize latency and energy efficiency, with an initial focus on gaming workloads. This talk will present the current state of LAVD development and explore its future roadmap. In particular, we will discuss how LAVD leverages heterogeneous CPU architectures (Intel P/E cores, ARM big.LITTLE) to improve performance per watt, along with...
- David Dai (Meta), Ryan Newton (Meta), 12/12/2025, 18:12
With the proliferation of sched_ext schedulers, including ones that cater to very specific workloads within Meta, there is a need for a "default" fleet scheduler that "just works" across a wide range of hardware and use cases. SCX_LAVD is one such candidate, as one of the more mature sched_ext schedulers, with various heuristics to favor latency-critical threads.
The talk...
- Neeraj Kumar (Meta)
Applications can greatly benefit from a workload-aware scheduling policy of worker threads that optimizes cache usage. For example, if a scheduling policy is aware of a workload's data access patterns, it can make informed decisions on how to schedule threads to cores to take advantage of cache locality. However, a key technical challenge is achieving this in a workload-agnostic manner.
The...
- Pat Somaru
Optimizing GPU bound workloads with sched_ext via scx_layered
In this talk, I will discuss how to optimize GPU-bound workloads using the sched_ext scheduler scx_layered, and how API changes could make this simpler.
I will use a well-understood open source GPU benchmark job (something like MNIST or ResNet) and a common CPU-bound open source workload (something like...