sched_ext: The BPF extensible scheduler class MC
Conveners
- Changwoo Min (Igalia)
- Daniel Hodges (Meta)
- Andrea Righi (NVIDIA)
- Joel Fernandes (NVIDIA)
Description
sched_ext[1] is a Linux kernel feature that makes it possible to implement safe task schedulers in BPF and load them dynamically at runtime. sched_ext enables safe and rapid iteration on scheduler implementations, radically widening the range of scheduling strategies that can be experimented with and deployed, even in massive and complex production environments.
This MC is the space for the community to discuss developments in sched_ext and its impact on the community, and to outline future strategies for improving integration with other Linux kernel subsystems.
Last year the sched_ext MC proved highly productive in facilitating coordination with distribution maintainers, allowing us to clarify their requirements and ease potential maintenance burdens. This collaboration directly contributed to upstream changes, including patches such as [2].
Topics for discussion include (but are not limited to):
- Use of BPF arenas for task/CPU context sharing between kernel, BPF, and user space
- Composable schedulers/scheduler libraries with BPF arenas
- Deadline server(s) for the SCHED_EXT class
- Integration with other scheduling-related features (RCU, proxy execution, PREEMPT_RT, etc.)
- Potential integration with other Linux subsystems (e.g., Rust-for-Linux)
- Scheduling for gaming and latency-sensitive workloads
- User-space scheduling
- Tickless scheduling
- Tools and benchmarks to analyze and understand scheduler activities
While we already have a tentative schedule with existing talk proposals covering the topics mentioned above, we are also planning to open a public CFP to accept additional topics for discussion. Time permitting, we are open to readjusting the schedule to accommodate further discussions that are relevant to the Linux community.
[1] https://github.com/sched-ext/scx
[2] https://lore.kernel.org/all/20240921193921.75594-1-andrea.righi@linux.dev/
In this talk, we will explore the challenges and opportunities in improving the interoperability of sched_ext BPF schedulers with the rest of the Linux kernel, in particular with existing scheduler code and with other subsystems. While sched_ext BPF schedulers offer powerful and flexible scheduling capabilities, their integration with other kernel components can often be fragmented and complex. This talk...
This talk will present our progress on arena-based data structures for quickly evolving scheduler abstractions (DSQs, CPU topology).
We currently write scheduling algorithms in terms of operations on primitives provided by the kernel (BPF hash maps/arrays, CPU bitmasks, DSQs). Adding new operations to these primitives is work-intensive because it requires modifying the underlying kernel...
Thread placement on machines with complex cache hierarchies (such as AMD CPU Core Complexes, or CCXs) requires careful management for optimal performance. Unlike NUMA domains, which are large enough that hard partitioning is a viable strategy, these chiplet domains are too small to schedule efficiently without a means of enforcing some degree of soft affinity. Spillover of threads to...
We present one of the first deployments of sched_ext to a large fleet of AI training hardware composed of multi-socket CPU systems with attached NVIDIA GPUs. GPU training workflows synchronize frequently across all the training processes, which makes them extremely sensitive to task scheduling micro-delays that prevent work from being dispatched to the GPUs. In addition, the training...
With the proliferation of sched_ext schedulers, including ones that cater to very specific workloads within Meta, there is a need for a "default" fleet scheduler that "just works" across a wide range of hardware and use cases. SCX_LAVD is one such candidate: it is one of the more mature sched_ext schedulers, with various heuristics that favor latency-critical threads.
The talk...