Linux Plumbers Conference 2025

Name: Linux Plumbers Conference 2025
Start: 2025-12-11T09:00:00+09:00
End: 2025-12-13T22:00:00+09:00

11–13 Dec 2025

Asia/Tokyo timezone

contact@linuxplumbersconf.org

Accelerating AI training fleets with sched_ext

12 Dec 2025, 17:00

36m

"Hall B2" (Toranomon Hills Mori Tower)

sched_ext: The BPF extensible scheduler class MC

Patrick Lu (Meta) Valentin Andrei (Meta) Pat Somaru (Meta)

We present one of the first deployments of sched_ext to a large fleet of AI training hardware composed of multi-socket CPU systems with attached NVIDIA GPUs. GPU training workflows synchronize frequently across all training processes, which makes them extremely sensitive to task-scheduling micro-delays that prevent work from being dispatched to the GPUs. In addition, the training systems run several components of the stack on the CPUs, such as data loading, preprocessing, and model checkpointing, which increases scheduling congestion. We used sched_ext together with a user-space scheduler (scx_layered) and deployed it to the entire Reality Labs GPU fleet of tens of thousands of GPUs. We improved the GPUs' compute unit utilization on certain model types by 9% and reduced the fleet training cost. The presentation describes our journey in identifying the latency-critical system tasks, developing the scheduler, ensuring resource isolation, debugging corner cases, and monitoring performance across the entire fleet.
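
As a rough illustration of why frequent synchronization makes training sensitive to scheduling micro-delays, the toy model below treats a synchronous step as finishing only when the slowest rank finishes. It is a sketch under stated assumptions, not the authors' measurement method: the function names, the per-rank timings, and the assumption that ranks are delayed independently are all hypothetical.

```python
def step_time(per_rank_times):
    """A synchronous step completes only when the slowest rank completes."""
    return max(per_rank_times)


def prob_step_delayed(p_rank_delayed, num_ranks):
    """Probability that at least one rank is delayed in a given step.

    Assumes (hypothetically) that each rank is delayed independently
    with probability p_rank_delayed.
    """
    return 1.0 - (1.0 - p_rank_delayed) ** num_ranks


if __name__ == "__main__":
    # Hypothetical numbers: 8 ranks with 10 ms of work each,
    # one of which is held off the CPU for an extra 2 ms.
    times_ms = [10.0] * 7 + [12.0]
    print("step time (ms):", step_time(times_ms))

    # The chance that some rank stalls the step grows with the rank count.
    for n in (8, 64, 512):
        print(n, "ranks:", prob_step_delayed(0.001, n))
```

In this model a single delayed rank sets the step time for every rank, and the likelihood of at least one such delay per step rises with the number of participating processes.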

[LPC] Accelerating_AI_Training_With_Sched_Ext.pdf
