With the ongoing work on RV and the deadline scheduler, coupled with timed automata, we introduced a practical way to validate timing properties in the kernel.
Now we can have models guaranteeing that tasks are throttled when consuming their runtime and don't miss their deadline.
The few models for the deadline scheduler are barely scratching the surface of what could be done to validate...
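As a rough illustration of the kind of timing property mentioned above, the following toy trace checker flags a task that runs longer than its runtime budget without being throttled. It is only a sketch: the event names, the `check_throttling` helper and the trace format are hypothetical and are not taken from the RV monitors or the deadline scheduler code.

```python
def check_throttling(events, runtime):
    """Return True if the task never runs longer than `runtime` per replenishment.

    `events` is a time-ordered list of (timestamp, name) tuples, where name is
    one of "replenish", "switch_in", "switch_out" or "throttle" (hypothetical
    event names chosen for this sketch).
    """
    consumed = 0          # runtime used since the last replenishment
    running_since = None  # timestamp at which the task last started running

    for ts, name in events:
        # Account for the time the task has been running up to this event.
        if running_since is not None:
            consumed += ts - running_since
            running_since = ts
        # The budget was exceeded before the task was stopped: property violated.
        if consumed > runtime:
            return False

        if name == "replenish":
            consumed = 0
        elif name == "switch_in":
            running_since = ts
        elif name in ("switch_out", "throttle"):
            running_since = None
    return True


# A trace where the task is throttled exactly when its budget of 5 is used up.
good_trace = [(0, "replenish"), (0, "switch_in"), (3, "switch_out"),
              (5, "switch_in"), (7, "throttle"), (10, "replenish")]
# A trace where the task keeps running for 8 time units with a budget of 5.
bad_trace = [(0, "replenish"), (0, "switch_in"), (8, "throttle")]

print(check_throttling(good_trace, runtime=5))
print(check_throttling(bad_trace, runtime=5))
```

A real monitor would react to events as they happen instead of scanning a recorded list, but the checked invariant has the same shape.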
CPU Isolation enables a system administrator to shield a subset of CPUs from
most kernel interference, but not all of it. Activity on the housekeeping CPUs
can still trigger IPIs targeting isolated CPUs, which defeats the requested
isolation.
At Red Hat, we've mostly observed IPIs caused by instruction patching
(e.g. static key updates) and TLB flushes (e.g. due to vmap'd stacks...
In para-virtualized environments, vCPU overcommit is a common configuration that helps customers make better use of CPU resources: since not all VMs are active at the same time, the underlying hypervisor can meet the CPU demand, and workloads running on VMs can benefit from the extra resources.
Acronyms:
vCPU - virtual CPU - CPU in VM
pCPU - physical CPU - CPU...
This talk will cover some of the boot time optimizations that we've found to be helpful on Android systems that should apply equally well to embedded systems. Most of these guidelines have been launched in a public product and have been shown to work well.
Zephyr RTOS offers a rich ecosystem, but embedded engineers often face the challenge of porting code across environments, from Zephyr to another RTOS or even to bare metal. This discussion offers a practical guide to extracting a “bare metal flavor” of code out of Zephyr so that it can run independently of Zephyr’s driver and subsystem layers.
NOTE: The HAL approach isn't the right...
In this session we will discuss how to improve the system stability of boards using fusb302 (or similar) chips for their USB-C port without any backup power source. This kind of setup is often found on Rockchip boards (e.g. Libre Computer ROC-RK3399-PC, Radxa ROCK 5B or ArmSoM Sige 5) and is quite a pain, because a hard reset effectively kills the board power.
The session starts with a short...
The Common Clk Framework (CCF) is expected to keep a clock’s rate stable after setting a new rate with "clk_set_rate(clk, NEW_RATE)". However, several longstanding issues affect how rate changes propagate through the clock tree when CLK_SET_RATE_PARENT is involved.
Current behavior allows a child clock to change its parent’s rate to satisfy its own request, but this adjustment happens...
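To make the idea of a child changing its parent's rate concrete, here is a toy model of a clock tree. It is not the CCF: the `Clk` class, the fixed dividers and the clock names are assumptions made for illustration, and the sketch simply shows the arithmetic effect on a sibling that shares the same parent.

```python
class Clk:
    """Toy clock node: a root with a stored rate, or a child with a fixed divider."""

    def __init__(self, name, rate=0, parent=None, div=1, set_rate_parent=False):
        self.name = name
        self._rate = rate
        self.parent = parent
        self.div = div
        # Loosely mimics the intent of CLK_SET_RATE_PARENT in this toy model.
        self.set_rate_parent = set_rate_parent

    @property
    def rate(self):
        # A child's rate is always derived from its parent's current rate.
        if self.parent is None:
            return self._rate
        return self.parent.rate // self.div

    def set_rate(self, new_rate):
        if self.parent is None:
            self._rate = new_rate
        elif self.set_rate_parent:
            # Satisfy the request by re-programming the parent.
            self.parent.set_rate(new_rate * self.div)
        else:
            raise ValueError(f"{self.name}: fixed divider, cannot change rate")


# Hypothetical tree: one PLL feeding two children.
pll = Clk("pll", rate=1_200_000_000)
uart = Clk("uart", parent=pll, div=12)
gpu = Clk("gpu", parent=pll, div=2, set_rate_parent=True)

uart_before = uart.rate
gpu.set_rate(400_000_000)   # propagates to the PLL
uart_after = uart.rate      # the sibling's rate moved as a side effect

print(uart_before, uart_after)
```

In this model the sibling is never consulted, which is the simplest way to see why a rate that was set earlier may not stay stable.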
One of the main selling points for Rust's inclusion in the kernel is safety, which is strongly associated with a reduction of runtime panics. Yet, in Rust an integer overflow or out-of-bounds array access translates into an implicit panic, inserted without any warning to the programmer.
The inability to easily identify where these implicit panic sites are introduced creates a blind spot...
In regulated industries, Linux is widely used due to its strong software capabilities in areas such as dependability, reliability, and robustness. These industries follow best practices in terms of processes for requirements, design, verification, and change management. These processes are defined in standards that are typically not accessible to the open source kernel community.
However,...
The open-source community is hard at work on building the framework
and mechanisms allowing the assignment of devices to a trusted virtual
machine (TVM), a process commonly known as device assignment (DA).
For the TVM to trust a device, the device must provide the TVM with
Evidence claims [[RFC9334]][1] confirming its identity, the state of its firmware and
its configuration. Since...
Some C kernel data structures exposed to Rust code apply internal
synchronization (XArray). Depending on the type of lock, such data structures
need to unlock locks when allocating memory. Sometimes it is beneficial to use a
single external lock to protect multiple such data structures.
In Rust this creates a problem that is not present in C. This is because
mutably borrowing...
This presentation aims to revive last [year's discussion][1] on PCIe device attestation. The first thing to understand is whether last year's consensus to use netlink sockets to convey device attestation information to user space still holds. The second thing to review is the device attestation workflow itself. Given the difference between the CMA and PCI/TSM scenarios, it may be better to build...
Last year in Vienna we held a session about "Improving kernel design documentation and involving experts".
Following that session, the ELISA Architecture working group drafted an initial template for the SW Requirements definition, started documenting the expected behaviour of different functions in the TRACING subsystem, made upstream contributions accordingly, and finally also started...
Summary
This talk is a follow-up of LPC'24, where the community had diverse opinions on the suitable approach of attested TLS protocols for confidential computing. Meanwhile, we have defended our position (cf. [expat BoF][1]) to standardize the protocol in the [IETF][2], and a new Working Group named [Secure Evidence and Attestation Layer (SEAL)][3] is being formed to exclusively tackle...
Showcase the current state of Tyr, a new Rust kernel driver for Arm Mali GPUs, briefly covering the driver and the associated Rust abstractions needed to support it, as well as the future plans for both upstream and Android.
The discussion should be centered on whether the current upstreaming plan makes sense to the DRM community, considering our efforts both upstream...
High-integrity applications require rigorous testing to ensure both reliability and compliance with industry standards. Current testing frameworks for the Linux kernel, such as KUnit, face challenges in scalability and integration, particularly in environments with strict certification requirements.
KUnit tests, which are currently the most widely accepted and used solution for testing...
The ELISA project currently works on bringing the Linux kernel closer to safety compliance by proposing enhancements to the kernel documentation. This includes a model expressed as requirement templates inlined to source code. At the same time, comparable efforts with a similar goal are also ongoing in the wider open-source ecosystem. For example, the Zephyr OS is using the FLOSS StrictDoc...
TDISP, designed to allow a confidential VM to establish a trust relationship with a PCI device, creates new headaches for the Linux PCI stack and for virtualization components:
- Evaluating whether a device is trustworthy.
- Establishing trust with the device.
- And in particular, re-establishing trust across a VM migration to a different physical device, without workload...
The Secure VM Service Module (SVSM) for Confidential VMs can expose multiple services and virtual devices to the Linux guest. To manage these, we need a proper bus in the kernel for discovery and enumeration.
So, what is the right architectural choice for this bus? Should we write a new, minimalist bus from scratch? Or should we adapt the standardized VIRTIO framework for its broad...
Open Discussion based on previous agenda items
TL;DR We propose to present the Rex project (Rust-based kernel extension) and discuss its integration with Rust for Linux.
Rex is a Rust-based kernel extension framework (https://github.com/rex-rs/rex). It offers similar safety guarantees as eBPF. Different from eBPF, which verifies the safety of extension code via an in-kernel verifier, Rex builds its safety guarantees atop the...
Today the kernel's UAPIs are tested through userspace testcases using the kselftests framework, which provides a uniform build system and output formatting infrastructure. However, it does not currently provide an out-of-the-box solution to run the tests against the current in-development kernel tree.
I am proposing a framework that allows building the test applications as part of the...
In this talk, we will explore the challenges and opportunities in improving the interoperability of sched_ext BPF schedulers with other parts of Linux, in particular existing scheduler code, as well as other subsystems. While sched_ext BPF schedulers offer powerful and flexible scheduling capabilities, their integration with other kernel components can often be fragmented and complex. This talk...
Fuzzing the Linux kernel with coverage-guided tools like syzkaller has proven to be an extremely effective method for finding kernel bugs. However, complex subsystems like KVM present unique and significant challenges that standard syscall fuzzing cannot easily address. Fuzzing KVM effectively requires managing complex state across both the host and the guest, and necessitates the coordinated...
This talk will present our progress on arena-based data structures for quickly evolving scheduler abstractions (DSQs, CPU topology).
We currently write scheduling algorithms in terms of operations on primitives provided by the kernel (BPF hash maps/arrays, CPU bitmasks, DSQs). Adding new operations to these primitives is work-intensive because it requires modifying the underlying kernel...
Since last LPC in Vienna, we have continued to explore how to add support for multiple system-wide low-power states to the Linux kernel. A series [1] has been posted that proposes adding a user-space interface to allow a system-wakeup latency constraint to be specified. The series also shows how the latency constraint can be taken into account during s2idle and especially...
Thread placement on machines with complex cache hierarchies (such as AMD CPU Core Complexes (CCX’es)) requires careful management for optimal performance. Unlike NUMA domains, which are large enough that hard partitioning is a viable strategy, these chiplet domains are too small to schedule efficiently without a means of enforcing some degree of soft affinity. Spillover of threads to...
We present one of the first deployments of sched_ext to a large fleet of AI training hardware composed of multi CPU socket systems with attached Nvidia GPUs. GPU training workflows run frequent synchronization across all the training processes which makes them extremely sensitive to task scheduling micro-delays that prevent work from being dispatched to the GPUs. In addition, the training...
The [kdevops][1] project automates complex Linux kernel development subsystem testing. Around Q3 we started evaluating advances in generative AI. Experimentation on kdevops shows that the project significantly enhances the speed and accuracy of generative AI for extending its features and adding new workflows. This capability was a core design principle. While generative [AI may not yet be optimal...
For DT-based platforms, fw_devlink allows us to track supplier/consumer dependencies, which helps to avoid drivers returning -EPROBE_DEFER while they probe their devices. Moreover, fw_devlink provides the so-called ->sync_state() support, allowing a driver for a supplier device to receive a notification through its ->sync_state() callback when all its consumer devices have been probed...
The LVFS Host Security ID (HSI) has become the de facto standard for measuring
platform security in Linux, with major distributions adopting it to present
security posture to end users. Designed primarily around proprietary UEFI
implementations, HSI may present edge cases for open-source firmware vendors
working with diverse firmware stacks like coreboot and edk2.
This session examines...
Android currently collects telemetry data from devices in the field. While these metrics are important and can indicate overall system health issues, they often lack the low-level system information necessary for finding root causes.
Android has been striving to improve BPF support to enable developers to extend the Linux kernel by creating BPF programs. This development...
Running Android on RISC-V platforms has been a long-standing goal, filled with technical hurdles and real-world economic evaluation.
Initially I proposed the idea at Andes Technology, but it didn't pan out, leading me to pursue its realization at SiFive. However, due to org restructuring, I was eventually laid off from SiFive. (Hi Samuel 👋)
Ultimately, I returned to Andes to finish what I...
Abstract
We have defended our position (cf. [expat BoF][1]) to standardize the attested TLS protocol in the [IETF][2], and a new Working Group named [Secure Evidence and Attestation Layer (SEAL)][3] is being formed to exclusively tackle this specific problem. We would like to present the work (candidate [draft][4] for standardization) and gather feedback from the security community on the...
The transition to a 16kB base page size creates a significant compatibility issue for legacy ELFs built with 4kB segment alignment. This misalignment can place Read-Execute (RX) and Read-Write (RW) segments within a single page, which would require insecure RWX mappings. While recompiling is the ideal fix, it is often impossible for apps that depend on **unmaintained, closed-source third-party...
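The page-sharing problem described above comes down to simple address arithmetic. The sketch below checks which pages two segments touch under 4 kB and 16 kB page sizes; the segment addresses and sizes are made up for illustration, and the helper names are hypothetical.

```python
PAGE_4K = 4 * 1024
PAGE_16K = 16 * 1024


def pages_touched(vaddr, memsz, page_size):
    """Return the set of page indices covered by [vaddr, vaddr + memsz), memsz > 0."""
    first = vaddr // page_size
    last = (vaddr + memsz - 1) // page_size
    return set(range(first, last + 1))


def shared_pages(seg_a, seg_b, page_size):
    """Return the page indices that both (vaddr, memsz) segments fall into."""
    return pages_touched(*seg_a, page_size) & pages_touched(*seg_b, page_size)


# Hypothetical legacy layout with 4 kB segment alignment:
rx = (0x0000, 0x1800)  # read-execute segment: starts at 0, 6 kB long
rw = (0x2000, 0x0800)  # read-write segment: starts at 8 kB, 2 kB long

print(shared_pages(rx, rw, PAGE_4K))   # no page holds both segments
print(shared_pages(rx, rw, PAGE_16K))  # both segments land in page 0
```

With 4 kB pages each segment gets its own pages and its own permissions, while with 16 kB pages the same layout places both segments in one page, so a single mapping would have to cover both.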
Android boot flow quick recap
- Current problems
- Fastboot
GBL proposal
- Android meets UEFI
- Existing protocols adoption
- GBL custom protocol for Android Boot
Android UEFI Upstreaming
- EFI implementation for LittleKernel
- GBL protocols (EDK2, LittleKernel, Uboot)
Android Adoption of DRTM - How could the ARM DRTM spec be updated to account for Android boot'isms in a HLOS...
Content:
Android's transition to 16kb page sizes necessitates that hardware components work seamlessly with 16kb page sizes in order to get optimal performance. This presentation will focus on hardware and software recommendations for devices running with 16kb page sizes.
This section will highlight the hardware design decisions that need to be made to support 16kb page sizes...
The great benefit of Devicetree bindings in the current DT schema format is the ability to validate the correctness of DTS (Devicetree sources) against those bindings. However, once validation was introduced, we discovered that many in-kernel DTS files simply did not pass.
A few years and thousands of commits later, we can now ask:
1. What is the current status of in-kernel DTS...
The high-level idea behind the Linux kernel GPIO consumer API is that lines are an exclusive resource - only one logical consumer can request and control a GPIO pin. This results naturally from the type of operations that a low-level user can perform on GPIO lines - after all: one user setting the line's direction to output while another sets it to input is an example of a very clear conflict...
On x86 / ACPI platforms, devices on enumerable busses can normally be seen directly by the OS. On DT platforms, these devices sometimes require extra power sequencing like toggling regulator supplies or GPIO lines. Over the years most of these cases have been solved, but there are still some gaps.
This talk will go through what already works, and attempts to identify:
- What's missing...
A large percentage of feedback when reviewing DTS changes ends up being style-related. This hurts both the reviewer and the submitter, as the former wants to maintain codebase coherence, while the latter wants their changes to get merged.
This session will be a discussion on where we currently stand, what functionality should be targeted and what are the obvious roadblocks.
On many device tree based devices, the device tree blobs are commonly shipped with the kernel or OS image, not the firmware. If the image is meant to be generic, it would include multiple DTBs and possibly many DTBO combinations. The bootloader selects a DTB and optionally applies overlays matching the hardware. Known image "standards" include:
- FIT image: maps a compatible string to a...
CFP ends on September 30th (CLOSED)
The Android Micro Conference brings the upstream community and Android systems developers together to discuss issues and changes to the Android platform and their dependencies and interactions with the Linux kernel, allowing for collaboration on solutions for upstream.
Some highlights of progress made since last year’s MC:
- Community...
Content:
Android already supports 16kb page sizes and the number of devices supporting 16kb page sizes will increase in the future. A key challenge with 16kb page sizes is their potential to increase the memory footprint. In this presentation, we will explore several memory optimization strategies that partners should consider to help mitigate this issue, focusing on areas such...
CFP ends on October 3rd (CLOSED)
The Linux ecosystem supports a diverse set of methods for assembling complete, bootable systems, ranging from binary distributions to source-based systems, embedded platforms, and container-native environments. Despite differences in tooling and architecture, all of these systems face shared challenges: managing build complexity, ensuring security and...
Problem Statement
The security and stability of the Linux kernel are paramount to the entire open-source ecosystem. A critical component in achieving this is the availability of debug kernels—builds specifically enabled with intensive debugging features like KASAN, UBSAN, and other sanitizers. While enterprise distributions such as RHEL, Fedora, and SUSE rely heavily on official...
When building a custom Linux OS, a pivotal decision involves selecting an appropriate build system from the available options within the ecosystem. The suitability of a particular build system may vary based on product requirements, constraints and development preferences, with kernel development and customization capabilities representing a key aspect for this decision.
This presentation...
When we designed the OpenWrt One, the OpenWrt build system allowed us to easily create a self-contained source tarball that included everything needed for GPL and other compliance purposes. Because of its history in supporting embedded OS deployment on a wide range of heterogeneous devices, the OpenWrt build system has a variety of features that lend themselves to this swift assemblage of the...
CFP ends on October 10th (CLOSED)
The Containers and Checkpoint/Restore micro-conference focuses on both userspace and kernel related work.
The micro-conference targets the wider container ecosystem ideally with participants from all major container runtimes as well as init system developers.
The microconference will be discussing recent advancements in container technologies with...
CFP ends on September 30th (CLOSED)
The Device and Specific Purpose Memory Microconference is proposed as a space to discuss topics that cross MM, Virtualization, and Memory device-driver boundaries. Beyond CXL this includes software methods for device-coherent memory via ZONE_DEVICE, physical memory pooling / sharing, and specific purpose memory application ABIs like device-dax,...
CFP ends on September 14th (CLOSED)
The Devicetree Microconference focuses on discussing and solving problems present in systems using Devicetree as firmware representation. This notably includes the Linux kernel and U-Boot, but can also cover topics relevant to Zephyr or System Devicetrees. Systems using Devicetree make up the majority of embedded boards, mobile devices and ARM64 laptops (and many...
CFP ends on October 8th (CLOSED)
The Gaming on Linux Microconference welcomes the community to discuss a broad range of topics around performance improvements for Gaming devices running Linux. Gaming on Linux has pushed the kernel to improve in several areas and has helped create new features for Linux, such as the futex_waitv() syscall, the Unicode subsystem, HDR support, and much more....
When device drivers reserve big blocks of MIGRATE_CMA pages, the underutilized MIGRATE_CMA pages can be used for MIGRATE_MOVABLE requests, and these pages can then be short-term pinned for DMA. As a result, when MIGRATE_CMA pages are later required, the allocations might fail.
This topic has been discussed...
ABSTRACT
Enabling cgroup-level control over swap devices
PROPOSAL
In certain restricted environments, there is a technical requirement to use otherwise idle devices as extended swap memory - including remote storage systems accessible over the network. A motivating scenario is to configure background processes to use these slower network-backed swap devices, while foreground...
The "zombie memory cgroup" problem is a long-standing issue in the Linux Kernel. It occurs when a memory cgroup is destroyed by users, but kernel metadata cannot be freed because its Least Recently Used (LRU) pages, particularly shared file pages, remain charged to it. These pages can outlive the cgroup that originally owned them, acting as a permanent pin. In environments where cgroups are...
CFP ends on October 10th (CLOSED)
The Linux System Monitoring and Observability Track brings together developers, maintainers, system engineers, and researchers focused on understanding, monitoring, and maintaining the health of Linux systems at scale. This track addresses the needs of engineers managing millions of Linux servers, where proactive monitoring, rapid problem detection, and...
CFP ends on September 10th (CLOSED)
Live Update is a specialized reboot process where selected devices are kept operational and kernel state is preserved and recreated across a kexec. For devices, DMA and interrupts may continue during the reboot.
The primary use-case of Live Update is to enable hypervisor updates in cloud environments with minimal disruption to running virtual...
Hello, I'd like to propose a discussion about KSTATE
as a solution for [de]serializing the kernel's state.
Thanks,
Andrey
At scale, virtualization uncovers hidden bottlenecks, including the cost of PCI configuration space accesses. In SR-IOV deployments with thousands of VFs, each configuration read triggers a hardware transaction. As VFs increase, these accesses scale linearly, leading to longer VM boot times, heavier bus contention, and noticeable startup delays.
The PCI subsystem today treats every...
Restarting a node running a stateful workload for an infrastructure software upgrade can be an extremely costly operation. Modern infrastructure software upgrades must also account for applications which are using accelerators such as GPUs, RDMA NICs and NIC stateful flow accelerators. While these workloads may typically run in isolated VMs, a hypervisor reboot for a kernel update can lead to...
In this talk, we'll discuss our work to support Live Update in VFIO for PCI devices.
Live Update is a mechanism to quickly update the kernel while running virtual machines using kexec. VFIO is a kernel module to allow devices to be controlled by userspace and virtual machines.
During our talk we will cover the problems that need to be solved to support Live Updates in VFIO, and our...
CFP ends on September 29th (CLOSED)
We’d like to propose bringing back the RISC-V Microconference at Linux Plumbers 2025. As the RISC-V ecosystem continues to grow, so does the importance of having a space where developers, hardware vendors, toolchain maintainers, and distro folks can come together to solve real-world problems. This microconference has always been a great venue for open,...
It’s the elephant in the room: in the Linux kernel, RV64 has become significantly more popular and better supported than its smaller sibling — the RISC-V 32-bit platform.
There have been multiple open discussions about dropping RISC-V 32 support to "liberate" kernel development.
However (and it’s a big however), many people actively use RISC-V 32 Linux in production, and some of...
CFP ends on September 30th (CLOSED)
Rust is a systems programming language that is making great strides in becoming the next big one in the domain. Rust for Linux is the project adding support for the Rust language to the Linux kernel.
Rust has a key property that makes it very interesting as the second language in the kernel: it guarantees no undefined...
CFP ends on October 5th (CLOSED)
As Linux continues to be deployed in systems with varying criticality constraints, progress needs to be made in establishing consistent linkage between code, tests, and requirements, to improve overall efficiency and ability to support necessary analysis.
This MC addresses critical challenges in expectation management (aka requirements tracking),...
CFP ends on September 30th (CLOSED)
sched_ext[1] is a Linux kernel feature which enables implementing safe task schedulers in BPF, and dynamically loading them at runtime. sched_ext enables safe and rapid iterations of scheduler implementations, thus radically widening the scope of scheduling strategies that can be experimented with and deployed, even in massive and complex production...
CFP ends on October 10th (CLOSED)
The goal of the Toolchains micro-conference is to hold discussions about toolchain related topics that are relevant to the Linux kernel. This covers both the GNU toolchain and the Clang/LLVM toolchain.
In recent years we have had either a micro-conference or a complete track to discuss toolchain topics during LPC, and along with LSFMMBPF they...
In 2014 I added the "WIP: Kernel syscalls wrappers" [1] item to the upstream glibc consensus documentation.
Over the last 11 years the idea that we should add C library wrappers for all Linux syscalls has waxed and waned, but I would like to revisit the idea with the help of the kernel community.
I want to look at the...
The goal of this activity is to go through a list of specific problems and issues concerning the BPF support in GCC.
TBD
CFP ends on September 30th (CLOSED)
The [PCI][1] interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:
- Address Translation Service (ATS)/Page Request Interface (PRI)
- [Single-root I/O...
Cloud workloads with strict performance needs (AI, HPC, large-scale data processing) frequently use PCIe device passthrough (e.g., via VFIO in Linux/KVM) to reduce latency and improve bandwidth. While effective for performance, this approach also exposes low-level device configuration interfaces directly to guest workloads, which may be malicious or running untrusted software.
In our...
Over multiple mainline iterations, a new CPUID API is slowly taking shape for both drivers and internal x86 architecture code.
This talk will show that new API, plus its benefits for call-sites.
The API's interaction with the existing X86_FEATURE mechanisms will be covered in the second half of the talk.