The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.
The Linux ecosystem supports a diverse set of methods for assembling complete, bootable systems, ranging from binary distributions to source-based systems, embedded platforms, and container-native environments. Despite differences in tooling and architecture, all of these systems face shared challenges: managing build complexity, ensuring security and reproducibility, maintaining cross-platform compatibility, and responding to increasing regulatory and supply chain scrutiny.
Building on the success of last year’s microconference, we invite the community to continue the conversation with a broadened scope in 2025. This year, we aim to explore the intersection of build systems with CI/CD pipelines, supply chain security, critical infrastructure, and secure development practices. With legislation such as the Cyber Resilience Act, rising expectations for Software Bill of Materials (SBOMs), and mandates for reproducible and auditable builds, collaboration across the ecosystem has never been more essential.
This microconference provides a venue for architects, maintainers, and practitioners from all facets of the Linux build and distribution ecosystem to come together and share ideas, discuss pain points, and identify potential shared solutions.
Target communities and projects include (but are not limited to):
Proposed discussion topics:
We welcome proposals beyond this list, particularly those that address emerging issues in the creation, validation, maintenance, and secure delivery of Linux-based software systems.
Improving coordination across build systems strengthens the foundations of the open-source ecosystem. Whether you’re maintaining a distro, building firmware, managing containers, or designing infrastructure for high-assurance or real-time systems, this microconference is your forum to advance the state of Linux software construction and security.
The Device and Specific Purpose Memory Microconference is proposed as a space to discuss topics that cross MM, Virtualization, and Memory device-driver boundaries. Beyond CXL this includes software methods for device-coherent memory via ZONE_DEVICE, physical memory pooling / sharing, and specific purpose memory application ABIs like device-dax, hugetlbfs, and guest_memfd. Some suggested topic areas include, but are not limited to:
- NUMA vs Specific Purpose Memory challenges
- Core-MM services vs page allocator isolation
- CXL use case challenges
- Hotness Tracking and Migration Offloads
- ZONE_DEVICE future for Accelerator Memory
- ZONE_DEVICE future for CXL Memory Expansion
- PMEM, NVDIMM, and DAX "legacy" challenges
- Memory hotplug vs Device Memory
- Memory RAS and repair gaps and challenges
- Dynamic Capacity Device ABI (sparse memfd?)
- Confidential Memory challenges
- DMABUF beyond DRM use cases
- virtiomem and virtiofs vs DAX and CXL challenges
- Peer-to-peer DMA challenges
- CXL Memory Pool Management
- Device Memory testing
Why not the MM uConf for these topics? One of the observations from the MM track at LSF/MM/BPF is that there is consistently an overflow of Device Memory topics that are of key interest to Memory device-driver developers, but of lower priority to core MM developers.
Rajneesh Bhardwaj
Terry Bowman
Davidlohr Bueso
John Groves
Jason Gunthorpe
David Hildenbrand
John Hubbard
Alistair Popple
Gregory Price
Contingent or unknown travel availability:
Jonathan Cameron
Dave Jiang
David Rientjes
Ira Weiny
Merged: CXL EDAC support for Memory Repair: http://lore.kernel.org/20250521124749.817-1-shiju.jose@huawei.com
Launched: CXL Management Library: https://github.com/computexpresslink/libcxlmi
Patches Available: FAMFS over FUSE: http://lore.kernel.org/20250703185032.46568-1-john@groves.net
Patches Available: Dynamic Capacity: http://lore.kernel.org/20250413-dcd-type2-upstream-v9-0-1d4911a0b365@intel.com
Patches Available: Type-2 CXL Accelerators: http://lore.kernel.org/20250624141355.269056-1-alejandro.lucero-palau@amd.com
"Device Memory" is a catch-all term for the collection of platform
technologies that add memory to a system outside of the typical "System RAM" default pool. Compute Express Link (CXL), a coherent interconnect that allows memory and caching-agent expansion over PCIe electricals, is one such technology. GPU/AI accelerators with hardware coherent memory, or software coherent memory (ZONE_DEVICE::DEVICE_PRIVATE), are another example technology.
In the Memory Management track of the 2025 LSF/MM/BPF Summit it became clear that CXL is one of a class of technologies putting pressure on traditional NUMA memory policy. While solutions like
memory-interleave-sysfs and device-dax mitigate some of the issues, there are still lingering concerns about memory of a certain performance class leaking into allocations that assume "default memory pool" performance.
The problem is how to keep Device / Specific Purpose memory contained to its specific consumers while also offering typical core-mm services. Solutions to that problem potentially intersect mechanisms like numactl, hugetlbfs, memfd, and guest_memfd. For example, guest_memfd is a kind of specific-purpose memory allocator.
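For context, the main containment tool available today is explicit NUMA policy. Below is a minimal userspace sketch using mbind(2); the node number is an assumption, standing in for whichever node firmware assigns to the device memory on a real system:

```c
/* build: cc -o bind bind.c -lnuma */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one 2 MiB chunk */
	unsigned long nodemask = 1UL << 1;	/* node 1: assumed device node */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* MPOL_BIND: allocations fail rather than spill to other nodes */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}
	return 0;
}
```

This keeps device memory out of default-pool allocations, but only for applications that opt in, which is exactly the gap the topics above aim to close.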
The last year has seen a massive amount of work in
our documentation build infrastructure and continuous
improvement in the docs themselves. This session will provide a brief update of what has happened recently, followed by discussion on what we want to do next. How would we like our documentation to evolve in the next year or three?
Despite all controversies, Rust has in recent times gained popularity as the Linux kernel's second high-level language. The author once decided to go with the flow. What started as an attempt to overcome zsmalloc inefficiency for large (16K+) pages became a broader initiative to rewrite parts of the swapping infrastructure in Rust, gaining better safety and reducing the code footprint in one of the most crucial Linux kernel subsystems. This talk will briefly cover this historical background, and then the focus of the discussion will be which parts of the swapping subsystem and related drivers (e.g. zram) are better off reimplemented in Rust, and why.
Following up on an ELC presentation about the future of 32-bit Linux, this is a discussion session specifically about highmem, which is a feature needed for a small number of 32-bit machines that regularly gets kernel developers upset.
The goal of this session is to find consensus on whether we can reasonably drop support for highmem in the next few years, what the best time for that would be, and what work has to be completed first in order to minimize the impact on current users.
https://lore.kernel.org/ksummit/4ff89b72-03ff-4447-9d21-dd6a5fe1550f@app.fastmail.com/t/#u
PREEMPT_RT was merged but the tree is still around. To further improve PREEMPT_RT in certain scenarios, local bottom-half locks were introduced, allowing locking to be removed elsewhere. Another improvement went into the futex code to allow a process-private hash. This change enabled a further improvement of the hash handling across NUMA nodes.
This talk presents the changes, why they were needed, and why the patch queue / devel tree is still around.
The Linux Foundation Technical Advisory Board (TAB) represents Linux Kernel project interests to the Linux Foundation. It also uses the pooled influence of its elected members to support the long term health of the project.
This open forum / panel discussion is an opportunity to learn about and discuss TAB initiatives and ongoing project needs.
A "classic" Use-After-Free (UAF) can occur when resources tied to hot-pluggable devices are accessed after the device has been removed. For example, an open file descriptor may hold references to such resources; if the device is unplugged, subsequent file operations on that descriptor can trigger an UAF. This talk, a follow-up to a previous presentation[1], explores an approach to this challenge.
We will present "revocable"[2], a new kernel mechanism for resource management. A revocable allows a resource provider (e.g., a device driver) to invalidate access to a resource from a consumer (e.g., a character device) when the underlying device is no longer available. Once a resource is revoked, any further attempts to use it will fail gracefully, thus preventing the UAF.
We will discuss the design and implementation of the revocable mechanism and its application in the ChromeOS Embedded Controller drivers to fix a real-world UAF bug. We hope to also start a discussion on how this generic mechanism could be adopted by other drivers to handle similar resource lifecycle issues.
[1] https://lpc.events/event/17/contributions/1627/
[2] https://lore.kernel.org/chrome-platform/20250820081645.847919-1-tzungbi@kernel.org/T/#u
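As a rough sketch of the consumer-side pattern (function and field names here are hypothetical stand-ins; see [2] for the actual proposed API):

```c
/*
 * Consumer-side pattern only; revocable_try_access()/revocable_release()
 * are hypothetical names illustrating the proposal in [2].
 */
static ssize_t ec_dev_read(struct file *filp, char __user *buf,
			   size_t count, loff_t *ppos)
{
	struct ec_dev *ec = filp->private_data;
	struct ec_regs *regs;

	regs = revocable_try_access(ec->rev);
	if (!regs)
		return -ENODEV;	/* device already gone: fail gracefully, no UAF */

	/* ... safe to touch *regs here; revocation is blocked ... */

	revocable_release(ec->rev);
	return count;
}
```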
This is Coly Li. I’ve been maintaining bcache for a while and have met Linus,
Greg, Ted, and other maintainers in person at many conferences. Yes, I am a
sustained and reliable kernel developer.
Recently, I joined a startup (https://fnnas.com) that provides AI/LLM
capabilities for personal or micro-enterprise storage. We help users share and
communicate AI/LLM-processed information from their stored data more
conveniently.
Our users can run highly compact LLMs on their own normal and inexpensive
hardware to process photos, videos, and documents using AI. Of course, it’s slow
but that’s expected and acceptable. They can even come back to check the results
weeks later.
In our use case, different people or roles store their personal and sensitive
data in the same storage pool, with different access controls granted to AI/LLM
processing tasks. When they share specific information or data with others
within the same machine or over the internet, the access control hierarchy or
rules become highly complicated and impossible to handle with POSIX ACLs.
We tried deferring access control to user space, which worked well except for
scalability and performance:
- As the number and size of files increase, storing all access control rules in
user space memory doesn’t scale—especially on normal machines without huge
memory resources.
- For some hot data sets (a group of files and directories), checking access
control rules in user space and hooking back to the kernel is highly
inefficient.
Therefore, the RichACL project comes back to mind. Of course, RichACL alone
isn’t enough. A high-level policy agent (in user space) is still needed for
task/session-oriented access and sharing policy control, but RichACL can help
implement file system-level access control. This would give us a context-aware
and highly efficient access control implementation.
What I’d like to discuss is:
- After almost 10 years, should we reconsider RichACL in the AI/LLM era?
- What are the major barriers or remaining work needed to get RichACLs merged
upstream?
Since our first public beta was released 13 months ago, we now have over one
million active installations running daily. This is a real workload for RichACL
and represents real feature demand from end users. If you’re interested in this
topic, we’d be happy to provide more details about the access control
requirements in AI workloads and even show a live demo of the use case.
Traditionally the drm subsystem has managed memory for GPU devices in its own space. Devices having large amounts of VRAM and mostly being client-focused meant there wasn't much attention paid to integrating with cgroup memory management.
However this is changing with more GPUs in servers/HPC scenarios and more shared memory devices without device RAM becoming available.
Providing an integration between the two worlds faces some challenges, and this talk will discuss those challenges and hopefully open up discussion on how best to solve them in a backwards-compatible and forward-looking way.
Modern Linux faces fundamental scaling challenges with shared resource contention, noisy neighbor effects, and monolithic kernel constraints. VMs provide isolation but impose significant hypervisor overhead, while containers share kernel vulnerabilities and lack performance isolation.
We propose the multikernel architecture enabling multiple isolated Linux kernel instances on a single machine. A privileged host kernel dynamically spawns independent kernel instances, each with dedicated CPU cores, memory regions, and I/O hardware resources. Our implementation extends existing kernel mechanisms: dynamic kernel spawning using enhanced kexec for on-demand instantiation without system reboot, hardware resource partitioning through fine-grained CPU/memory/device isolation, inter-kernel communication via IPI and shared memory regions, and live resource migration enabling runtime resource reassignment for zero-downtime upgrades.
Each spawned kernel runs a standard Linux userspace, delivering strong isolation at near-native performance while maintaining complete Linux compatibility. More importantly, compared with the KVM/virtio stack, multikernel is significantly simpler and thus has great potential.
Multikernel architecture enables seamless application deployment with complete workload isolation, specialized kernel optimization, and enhanced security boundaries suitable for cloud multi-tenancy and safety-critical systems.
Test coverage is a measurement of how much code is executed by a given test or test suite. Current implementations in the kernel are measured against source code with tools such as "gcov" or "llvm-cov". However, source-based coverage measurements are unable to account for additional code not present in the original source, such as code inserted by the build system (compiler, linker, build scripts, etc.). To supplement source-based coverage, The Boeing Company and the University of Illinois Urbana-Champaign (UIUC) have investigated the problem, created a tool, and measured coverage of the Linux kernel's object code.
Object code coverage is a requirement in several safety critical industries, such as Automotive and Aerospace software. In addition to certification, object code coverage is effective at identifying machine code which ends up in an executable but cannot be traced back to any source code. As an example, build toolchain exploits which maliciously insert object code (such as the XZ Utils backdoor) can be identified by object code coverage.
This presentation will cover:
- An overview of object code coverage and why it is useful
- The need for object code coverage in safety critical applications (e.g. Automotive, Aerospace, Medical)
- The development of an open-source object code coverage tool which works on the Linux kernel
- Approaches used to measure object code coverage on emulated targets (QEMU) and real hardware
- Measurement differences between x86-64 and ARM64 kernels
- Challenges collecting coverage of the Linux kernel and getting accurate results
- Object code coverage results of the kernel when run with existing test suites
Ken Thompson's lecture "Reflections on Trusting Trust" summarized the issue stating, "No amount of source-level verification or scrutiny will protect you from using untrusted code." Many issues can only be identified by inspecting object code, and object code coverage is one metric to assist object code inspection.
Rust for Linux is the project adding support for the Rust language to the Linux kernel. This talk will give a high-level overview of the status and the latest news around Rust in the kernel since LPC 2024.
About a year ago, I presented the initial efforts to upstream the fundamental infrastructure needed to enable complex Rust drivers in the Linux kernel -- work that started with the Nova GPU driver. At the same time, much of the discussion around Rust in the kernel centered on concerns about (long-term) maintenance and the potential burden of supporting a second language.
Today, that foundational infrastructure is upstream. This includes core pieces of driver support such as the generic device/driver model, PCI, platform and auxiliary bus support, device resource management, memory-mapped I/O, memory allocation primitives, and DMA. In this talk, I will focus on two aspects:
First, we will look at how this infrastructure - especially driver-core infrastructure - has turned out in practice. Using a real upstream driver as a reference, I will highlight the benefits Rust’s type system brings, walk through some interesting implementation details, and also discuss shortcomings that still require cooperation with the C side to address.
Second, we will revisit the maintenance concerns mentioned previously. Now that the corresponding infrastructure is upstream and active driver development in Rust is happening in-tree, we can make an initial assessment of how those concerns have played out in practice. As a maintainer of both Rust and C kernel infrastructure, I will share my experience from both directions: collaborating with C maintainers while maintaining Rust code, and working with Rust contributors in the context of C infrastructure that I maintain.
Runtime Verification (RV) was introduced in v6.0 of the Linux Kernel and
regained some traction recently with the integration upstream of the scheduler
and rtapp models.
RV was successfully employed to model and validate a subsystem like the
scheduler. With the ongoing work on timed automata, RV monitors get the ability
to verify timing requirements of a subsystem, let's use it to validate the
deadline scheduler.
Join us in this interactive session where we are going to dig into the
experience of supporting a new subsystem in RV, from the peculiarities that only
a subsystem maintainer can unveil, to new capabilities and paradigms from an RV
maintainer.
This work can highlight inconsistencies in the subsystem, improve its general
understanding and documentation as well as make the RV infrastructure more
robust.
Over the past decade (or three) of container runtimes on Linux, the attacks against container runtimes with the most bang-for-your-buck have generally been filesystem related—often in the form of a confused-deputy style attack. In particular, the past few years have seen quite a few security issues of this form, including a series of issues in runc (the most popular container runtime, used by Kubernetes and Docker).
However, this is far from a container-specific issue. Many Unix programs have historically suffered from similar issues, and the various attempts at resolving it have not really measured up.
This talk will go through the myriad issues involved in protecting user space programs against these kinds of attacks, completed and ongoing kernel work to make these problems easier to resolve, and our experience migrating a container runtime's codebase to a design which emphasises path-safety.
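One example of the completed kernel work in this space is openat2(2) with RESOLVE_IN_ROOT, which scopes "..", absolute symlinks, and rename races to a directory fd. A minimal sketch (the rootfs path is made up for the example):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical container rootfs */
	int rootfd = open("/var/lib/myruntime/rootfs", O_PATH | O_DIRECTORY);
	struct open_how how = {
		.flags   = O_RDONLY,
		/* ".." and symlink escapes are contained to rootfd */
		.resolve = RESOLVE_IN_ROOT,
	};
	long fd;

	if (rootfd < 0)
		return 1;
	/* no glibc wrapper, so go through syscall(2) */
	fd = syscall(SYS_openat2, rootfd, "etc/passwd", &how, sizeof(how));
	if (fd < 0)
		perror("openat2");
	return 0;
}
```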
In the Linux kernel, design intent is an emergent property. There are a number of well understood reasons for this, most notably the evolving needs of user space and the distributed nature of its development. It is also reasonable to suggest that, although remarkably complex, kernel design intent can be framed in rather simple high level terms: to behave as an arbitration layer between hardware and application software. Therefore, in practice the Linux kernel is said to “work properly” when user space behaves as expected in specified (e.g. POSIX) and unspecified (e.g. scheduler behavior) ways.
Although a reasonable explanation for the path taken from humble beginnings to where we find ourselves today, the Linux kernel development process is straining under the weight of its success. The community has responded with tools like syzkaller, kselftest, KUnit, and vendor-specific test suites to help the kernel stay aligned with expectations. In addition, there is a robust documentation set and a vast archive of community knowledge documenting the design and use of the Linux kernel. Despite all of this, a persistent gap separates detailed design intent from all forms of testing. This matters because tests risk misalignment without explicit developer signoff.
Perhaps in very narrow situations one might find evidence of a maintainer affirming that a test accurately reflects the details of their intended design. But on a widespread basis, there is no evidence that the intent of the designer can be unambiguously traced to any form of repeatable testing. Even among kernel maintainers this is the case, with PREEMPT_RT developers famously needing to reverse engineer kernel scheduling behavior to determine its fairness and latency properties. In effect, all kernel testing is largely based on a system of piecemeal reverse engineering and educated guesses. This does not discount the significant value of community driven testing, but it is important to acknowledge that no repository housing test code can lay claim to the usage of a metaphorical (or literal) Test-Reviewed-by-Maintainer signoff tag.
Setting aside, for a moment, the sheer size of the Linux kernel, we can reason about a solution to this problem and then work our way towards a sensible approach. Design expression is found at many levels, some of which become impractical to reason about at the kernel source code level because they rely on complex assumptions of use. Therefore we restrict ourselves to the lowest reasonable level of design intent, described in RTCA DO-178C as "software requirements from which source code can be directly implemented without further information". Best practices for developing these requirements, which we shall call "testable expectations", are out of scope for this presentation. Suffice to say, authorship of testable expectations is a specialized skill which forces one to reason very deeply about the intent of code.
With testable expectations in hand, we have the level of clarity required to apply a virtuous cycle to Linux kernel code that affirmatively ties code to design intent in a traceable way. With maintainer agreement (a "Signed-off-by" as it were), a test case can be derived from the testable expectation. When the test case passes, coverage tools like llvm-cov and gcov can be brought to bear to ensure that relevant code is not overlooked. With expectation, test, and coverage in agreement, we can affirm that code accurately reflects design intent, achieving the purpose of software verification.
We end by addressing the "elephant in the room", the sheer size of the Linux Kernel and potential for added burden to the maintainer community. As any maintainer knows, bug triage is a huge part of the job, and while bugs will never go away, this virtuous cycle will shift the burden away from verification ("is the code doing the thing right") to validation ("is the code doing the right thing"). Both verification and validation are necessary parts of a maintainer’s job, but this approach promises to alleviate a significant portion of the verification aspect. Thus the "juice is worth the squeeze", but significant skepticism is warranted and input from the kernel developer community is needed to help refine the approach.
This talk will cover the following topics:
1) The current limitations of Linux Kernel low-level design guidelines (as in Documentation/doc-guide/kernel-doc.rst)
2) Detailed examples of virtuous-cycle behavior and progress to date in filling the gap from point 1) (a proposal is already being discussed in [1])
[1] https://lore.kernel.org/all/20250910170000.6475-1-gpaoloni@redhat.com/
ABSTRACT
The lack of standardized documentation for the Linux kernel poses a
barrier to its adoption in safety-critical industries such as aerospace,
where compliance with standards like DO-178C is required. We explored
the use of locally trained Large Language Models (LLMs) to automatically
generate compliant documentation for kernel modules and tools. As a case
study, we applied this approach to the Linux kernel’s ftrace utility and
evaluated four LLM families—Meta Llama, StarCoder2, Mistral Devstral,
and Google Gemma—across documentation validity, Graphical Processing
Unit (GPU) utilization, and throughput. Results show that Mistral Devstral
produced the most accurate and standards-aligned documentation, demonstrating
that LLMs can provide an effective method for bridging the gap between
open-source software and regulated environments, enabling safer and broader
integration of the Linux kernel into aerospace and other compliance-driven
domains.
Reason Behind Effort
The broader goal is not to propose an immediate solution, but to present empirical results that raise questions for the community: What criteria should kernel-generated documentation meet? Can LLMs be integrated into existing toolchains (e.g., Sphinx, SPDX) to support compliance goals? What processes would allow reproducibility, traceability, and expert validation in a way that certification authorities might accept?
The Linux System Monitoring and Observability Track brings together developers, maintainers, system engineers, and researchers focused on understanding, monitoring, and maintaining the health of Linux systems at scale. This track addresses the needs of engineers managing millions of Linux servers, where proactive monitoring, rapid problem detection, and automated remediation are essential for operational success.
Track Objectives
This track aims to foster collaboration between kernel developers, system administrators, and monitoring tool creators to advance the state of Linux system monitoring and observability.
Target Audience
Key attendees:
Key Focus Areas
Kernel Health and Runtime Monitoring
- Real-time kernel health assessment techniques
- Early detection of system degradation
Hardware Integration and Error Detection
- Hardware error correlation and root cause analysis
- Integration between kernel monitoring and hardware telemetry
Problem Correlation
- Virtualization stack monitoring (hypervisor ↔ guest relationships)
- Container runtime observability
- Network stack performance and reliability monitoring
- Storage I/O path analysis and optimization
- BMC information
- Scheduler (sched_ext changes)
Memory Management and Analysis
- Runtime memory profiling techniques
- Out-of-memory (OOM) prediction and prevention
- Memory leak detection and mitigation
Automated Analysis and Remediation
- Automated problem categorization and triage
- Anomaly detection algorithms for system behavior
Visualization and Alerting Infrastructure
- Real-time dashboarding for large-scale deployments
- Historical trend analysis and capacity planning
Tools and frameworks that would fit here
- eBPF/BPF: Advanced tracing and monitoring programs
- ftrace/perf: Low-level kernel tracing infrastructure
- Runtime Sanitizers: KFENCE, KASAN, and similar detection tools
- Hardware Interfaces: EDAC, MCE, ACPI error reporting
System-Level Tools
* bpftrace: Dynamic tracing language and runtime
* systemd: Service monitoring and system state management
* netconsole: Network-based kernel logging
Crash Analysis and Post-Mortem
* kdump/crash/drgn: Kernel crash dump analysis
* Core dump analysis: Userspace failure investigation
* Live debugging: GDB integration and kernel debugging techniques
Performance Analysis
* perf: Hardware performance counter analysis
* Memory profilers: Heap analysis and memory usage optimization
* Runtime memory profilers, such as `below`, Strobelight, OpenTelemetry, etc.
Example Topics and Presentations (previous LPC presentations that align with this track)
Over the past decade, Brendan Gregg’s Flamegraph has become an indispensable tool for pinpointing performance bottlenecks. Based on the canonical Flamegraph, it has been evolving into various flavors tailored to address specific performance issues in production systems. We'll share how this novel approach generates Flamegraphs from latency, memory usage, crashdump, and kern.log traces. Furthermore, we'll discuss how to take it further with the collective user experience.
Methodology and Practice in Observing Kernel Networking
Blindly enumerating all counters extracted from the kernel and haphazardly monitoring every function in the hot path is hardly practical in production. Three key issues deserve greater attention: 1) performance degradation, 2) ineffective metrics, and 3) the prohibitive cost of massive data storage. Most of the time, system administrators are forced to spend excessive time repeatedly sifting through data to find just one or two useful historical entries to better understand what happens in the underlying layer - an exhausting ordeal. Why not identify the truly impactful key metrics, build a robust methodology around them, and then present them to admins in a user-friendly way? The topic will revolve around methodologies and real-world practices in the networking domain, though such methodologies should transcend this domain.
Steam Deck is a successful console from Valve that runs on top of FOSS, with Linux as its operating system.
For regular gamers, the user experience is smooth; they don’t even need to think about what’s going on under the hood to make that experience possible. In particular, interesting bits of the tracing system and in-kernel debug features are leveraged in order to achieve the smooth run.
In this talk, we’re going to dive into proactive mechanisms to detect how well the system is running and whether there are sub-optimal paths that could be improved, as well as tooling to collect logs in case of a more severe set of issues leading to kernel crashes.
All of that is then tied to an opt-in feature that sends information to Valve’s Sentry instance to be debugged by the SteamOS engineering team.
Context: https://lore.kernel.org/all/20251001045456.313750-1-inwardvessel@gmail.com/
Periodically reading data directly through memory.stat can be expensive across a large enough fleet. Most of the overhead lies in the string formatting and numeric conversion not only in the kernel but also by any user mode program parsing the data. I’d like to discuss the proposed kfuncs in the patch and how they can be paired with a BPF program to collect stats more efficiently. I will show a real world application of this concept in the tool called “below”.
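As a sketch of how such kfuncs could pair with a BPF cgroup iterator, the example below emits fixed-size binary records instead of formatted text; the kfunc name and signature are placeholders, not the ones from the patch:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* hypothetical kfunc: stands in for whatever the patch set actually names */
extern __u64 bpf_memcg_stat(struct cgroup *cgrp, int item) __ksym;

SEC("iter/cgroup")
int dump_memcg(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	__u64 rec[2];

	if (!cgrp)
		return 0;
	rec[0] = cgrp->kn->id;			/* cgroup identity */
	rec[1] = bpf_memcg_stat(cgrp, 0 /* e.g. anon bytes */);
	/* binary record: no string formatting in-kernel, no parsing in user space */
	bpf_seq_write(ctx->meta->seq, rec, sizeof(rec));
	return 0;
}
```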
This talk will cover the ongoing effort to evolve bpftrace from an observability tool into a flexible, composable framework that can be used to build many observability tools and drive the larger BPF observability ecosystem - instead of trailing behind it.
Over the past year, the bpftrace development team has focused on removing obstacles that hinder users from efficiently observing and debugging their systems. From a clunky type system that doesn’t play well with BTF, to one-off features that don’t compose with each other, to the slow process of adding access to new BPF features and kfuncs, bpftrace can be as frustrating as it is helpful. We’re working to change that by providing primitives for code reuse, developing a standard library, and adding the ability to interop with raw/custom BPF C code. In this session, we’ll discuss the technical hurdles we’ve encountered, share our progress so far, and outline our vision for the future of bpftrace as a composable, expressive toolbox for the BPF observability community.
DAMON simplifies the collection of system and workload data access patterns. However, interpreting this data and transforming it into actionable insights for humans remains a challenge. Representing the data in an actionable format is difficult. While efforts have been made to visualize this data, opinions vary on its accessibility. This session will review past attempts to make the data consumable and actionable, outline future plans, and foster a discussion with the audience on improved visualization methods.
We build robust kernel code by properly handling errors and recovering
gracefully. But many critical error conditions are hard to replicate
in testing, so error injection becomes essential for validation. Past
error injection approaches were often considered too intrusive and got
rejected [1].
This talk presents moderr, an eBPF tool using libbpf for error injection
in the kernel module subsystem. The tool targets paths that are
otherwise difficult to simulate and validates both code correctness and
memory leak detection in error scenarios.
The approach leverages eBPF's lightweight nature: you only need to
annotate the in-kernel functions that are error-injectable, while the
complex error injection logic lives in a separate standalone tool. This
addresses the intrusiveness problem that killed previous attempts.
I'll show the current implementation [2], share what we learned from
community feedback [3], and discuss how moderr could evolve into a
generic error injection framework for kernel developers.
Link: https://lore.kernel.org/all/20210512064629.13899-1-mcgrof@kernel.org/ [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/da.gomez/linux.git/?h=b4%2Fmodules-error-injection [2]
Link: https://lore.kernel.org/all/20250122-modules-error-injection-v1-0-910590a04fd5@samsung.com [3]
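For readers unfamiliar with the mechanism moderr builds on, here is a minimal sketch of BPF-based error injection; the function name is illustrative, and the kernel side must opt in (CONFIG_FUNCTION_ERROR_INJECTION and CONFIG_BPF_KPROBE_OVERRIDE assumed):

```c
/* Kernel side (in the module code): opt the function into error injection.
 * The function name is illustrative.
 *
 *     ALLOW_ERROR_INJECTION(load_module_example, ERRNO);
 *
 * BPF side: force the annotated function to fail. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/load_module_example")
int BPF_KPROBE(fail_load)
{
	/* only honored because the target is annotated as error-injectable */
	bpf_override_return(ctx, -12 /* -ENOMEM */);
	return 0;
}
```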
Monitoring the kernel on millions of servers in production poses significant problems in terms of scale and diversity of the environment, both in terms of software and hardware. An observability system should allow detecting, debugging and fixing a large number of issues, as well as allowing engineers to focus on the most important ones in terms of spread and severity. This is made challenging by trying to run the newest kernel of any of the hyperscalers, which usually means we find problems before many others. Meta uses commonly available tools (netconsole, kdump, drgn, etc) as well as other, internally developed tools that are able to fulfill these requirements. The session will cover: how kernel events and data are collected and aggregated, the tools and data sources being used, and finally how and where data is used in the release process. We'll also discuss the challenges of maintaining observability at Meta's scale, including performance overhead considerations, data volume management, and the balance between granularity and practicality. This talk is relevant for anyone interested in production kernel observability, or operating Linux at scale.
The existing page_owner debug feature tracks the stack trace of memory allocations in the system at the page level. It can answer questions like: 'What allocated this page?' and 'How many pages are allocated by what?' -- pointing right at the source code.
That allows for profiling and monitoring all of the system memory per allocation stack trace to identify trends, leaks, spikes, regressions, correlations with events and metrics/statistics, and so on; then validate whether code changes perform as expected.
While using page_owner, we have been working on support for pages in swap space and reducing the overhead of information and processing for userspace, in order to improve memory coverage and sampling frequency.
This talk discusses the related kernel work and some usage examples.
Guider is an open-source, self-contained performance monitoring and observability framework designed for embedded and custom Linux platforms such as AGL, Android, Tizen, webOS, and custom distros.
With over 180 built-in commands and support for TCP/UDS-based remote APIs, Guider provides a flexible yet lightweight system for real-time monitoring, profiling, and fault detection. It continuously watches system behavior, evaluates thresholds defined via JSON configurations, and autonomously generates structured reports when anomalies or degradations are detected. These reports include past runtime traces, flame graphs, peak analysis, and resource usage summaries.
Guider integrates tightly with existing kernel infrastructure—leveraging ftrace, atrace, ptrace, kprobe, uprobe, DWARF, debugfs, and procfs—to support low-overhead, extensible event capture and visualization. In addition to system-level metrics, it also parses logging from multiple sources (kernel, journal, Android, DLT, syslog), correlates them with event timings, and renders them into interactive outputs including flamegraphs, stacked graphs, and histograms.
For Linux Plumbers 2025, we propose a joint microconference for Real Time and Scheduler, as in the past. These two areas have always been tightly linked and continue to generate cross-functional changes, especially now that PREEMPT_RT has been merged. The scheduler is at the core of Linux performance; with different topologies and workloads, giving the user the best experience possible is challenging, from low latency to high throughput and from small power-constrained devices to HPC.
Since last year’s microconference, progress has been made on the following topics:
- Progress on proxy execution
- https://lore.kernel.org/all/20241011232525.2513424-1-jstultz@google.com/
- https://lore.kernel.org/all/20250602221004.3837674-1-jstultz@google.com/
- Defer throttle when task exits to user
- https://lore.kernel.org/all/20250520104110.3673059-1-ziqianlu@bytedance.com/
- The EEVDF scheduler responsiveness
- https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/
- https://lore.kernel.org/all/20250209235204.110989-1-qyousef@layalina.io/
Some topics also continued to be discussed at the OSPM conference: http://retis.sssup.it/ospm-summit/
Ideas of topics to be discussed include (but are not limited to):
- Improve responsiveness of fair tasks
- Improvements on EEVDF
- Adding more usages of push callback
- Improve PREEMPT_RT
- Defer throttle to user space
- IPI impact
- Improve Locking and priority inversion
- Proxy execution
- Impact of new topology, including hybrid or heterogeneous system
- Taking into account task profile
- Improvements on SCHED_DEADLINE and DL server
- Tooling for debugging low latency analysis
This is not an exhaustive list. We welcome all proposals related to process scheduling. The goal is to discuss open problems, preferably with patch set submissions already being discussed on the mailing list. Presentations are meant to be limited to 2 or 3 slides intended to seed a discussion and debate - allowing for high bandwidth discussion with key stakeholders in the same room.
Key attendees:
- Ingo Molnar
- Peter Zijlstra
- Juri Lelli
- Vincent Guittot
- Dietmar Eggemann
- Steven Rostedt
- Ben Segall
- Mel Gorman
- Valentin Schneider
- Thomas Gleixner
- John Stultz
- Sebastian Andrzej Siewior
- K Prateek Nayak
- Shrikanth Hegde
- Phil Auld
- Dhaval Giani
- Clark Williams
Scheduler Microconference Proposal
Title:
Cache Aware Scheduling
Presenters:
Tim Chen (tim.c.chen@linux.intel.com)
Chen Yu (yu.c.chen@intel.com)
We have proposed an RFC patch series that implements cache-aware scheduling.
The primary motivation is to keep threads sharing data together in the
same last level cache domain, to reduce cache bouncing.
We'd like community feedback on issues that we need to address to take
it upstream.
Discussion Areas:
1. Overview of the use cases that motivate this feature, and current performance numbers.
2. Whether the basic approach of current patches look good to everyone.
3. Whether this feature should be extended beyond aggregating threads in a single process,
to also aggregate processes communicating via pipes/sockets or sharing memory.
4. Would it be a good idea to use the memory scanning mechanisms in NUMA balancing
to get an idea of how many tasks are sharing data (from numa_group)? And
perhaps also estimate the extent of shared data?
5. Whether the load aggregation policy implemented in the current patch series can be improved.
Key Participants:
Peter Zijlstra (Intel)
Len Brown (Intel)
Prateek Nayak (AMD)
Madadi Vineeth Reddy (IBM)
Vincent Guittot (Linaro)
The Binder driver is a lightweight IPC that serves as the communication backbone between processes in Android. It implements a very peculiar Priority Inheritance model that has been rejected upstream. This presentation re-examines this design and presents a few upstream-friendly alternatives to the current model.
The wakeup path and the periodic load balance don’t cover all cases where we’d like to migrate a task to another CPU for the fair scheduling class. There are situations where we’d like to push tasks in a similar way to the wakeup path. EAS is one user that would benefit from a push mechanism, as it disables periodic load balance but wants to migrate tasks more often than at wakeup. Non-EAS systems would also take advantage of pushing tasks to idle CPUs. We will explore the current status of the push callback mechanism, how it could replace periodic load balance in some cases, and the related open questions.
During the discussion at OSPM ’25, the idea of using push-based load balancing as an alternative to idle and newidle balance was proposed. A prototype [1] was sent soon after OSPM to gather feedback from the community.
During the review, Peter mentioned optimizing the global nohz idle tracking to be reduced to per-LLC tracking to reduce the cost of access and update to this shared data being used for load balancing [2]. With the infrastructure to track the idle CPUs more efficiently in place [3], there are more fundamental challenges that require further discussion:
Reducing the overhead of pushing the task: Push is always done from a busy CPU’s context, adding latency to the runnable task’s execution. Is there a better mechanism to address this by offloading the push to an smp_call_function on the idle target?
Avoid selecting the same CPU for push: Multiple busy CPUs can observe the same CPU to be a potential idle target for push and can overload this single CPU. Is there an inexpensive indicator to ensure task pileup is avoided?
Beyond nohz idle tracking: CPUs can idle without disabling the periodic tick. Is there a potential to extend nohz idle tracking to all idle CPUs efficiently to truly replace newidle balance?
A larger prototype addressing some of the issues listed above will be posted close to the conference proceedings. We also seek to reach an agreement on the infrastructure for pushing fair tasks in order to align the efforts to enable it for both capacity-aware scheduling (CAS) and energy-aware scheduling (EAS) use cases.
[1] https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
[2] https://lore.kernel.org/lkml/20250410102945.GD30687@noisy.programming.kicks-ass.net/
[3] https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
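To make the first question above concrete, here is a sketch of what offloading the push to the idle target might look like; it is illustrative only, not taken from the posted patches:

```c
#include <linux/sched.h>
#include <linux/smp.h>

/* runs on the chosen idle CPU, in IPI context */
static void do_pull(void *info)
{
	struct task_struct *p = info;

	/* ... detach @p from the busy rq and enqueue it on this CPU's rq ... */
}

/* called from the busy CPU instead of migrating the task itself */
static void offload_push(struct task_struct *p, int idle_cpu)
{
	/* wait=0: the busy CPU returns to its runnable task immediately */
	smp_call_function_single(idle_cpu, do_pull, p, 0);
}
```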
PREEMPT_RT has finally been merged. The mainline Linux kernel is now capable of providing real-time guarantees. However, the kernel is only one piece of the puzzle: to achieve real-time behavior, the userspace counterpart must also be designed correctly.
Unfortunately, userspace applications often introduce undesirable latency due to incorrect design. The root cause is that it’s unclear what users should and shouldn’t do to guarantee a real-time response. For example, many users are unaware that they should call mlockall(), and they suffer unwanted latency from page faults. Another common issue is using mutexes without priority inheritance, which can lead to priority inversion.
Runtime verification (RV) monitors are a recent effort to help users detect these incorrect real-time design patterns. At present, the monitors warn users if one of the following occurs:
In this talk, we will discuss further enhancements to RV monitors:
With the ongoing work on RV and the deadline scheduler, coupled with timed automata, we introduced a practical way to validate timing properties in the kernel.
Now we can have models guaranteeing that tasks are throttled when consuming their runtime and don't miss their deadline.
The few models for the deadline scheduler are barely scratching the surface of what could be done to validate it.
A few questions come to mind:
CPU Isolation enables a system administrator to shield a subset of CPUs from
most kernel interference, but not all of it. Activity on the housekeeping CPUs
can still trigger IPIs targeting isolated CPUs, which defeats the requested
isolation.
At Red Hat, we've mostly observed IPIs caused by instruction patching
(e.g. static key updates) and TLB flushes (e.g. due to vmap'd stacks of
dying tasks). Work to mitigate these has been happening over the past few years;
this talk will present the current state of things and ongoing pain points, and
discuss future work.
In a para-virtualized environment, vCPU overcommit is a common configuration which helps customers make better use of CPU resources, since not all VMs will be active at the same time; hence the underlying hypervisor will be able to meet the CPU demand, and workloads running on the VMs can benefit from the extra resources.
Acronyms:
vCPU - virtual CPU - CPU in VM
pCPU - physical CPU - CPU governed by Hypervisor.
But when all or most of the VMs request CPU resources at the same time, the hypervisor won't be able to meet those requirements and has to context-switch among them to meet the CPU demand and be fair. This context switch is called vCPU preemption. vCPU preemption is much more expensive than task preemption within the VM. Workload performance degrades significantly as a result.
In such a situation, if the VMs and the hypervisor can co-operate and use a smaller number of vCPUs, overall performance improves. vCPUs which shouldn't be used at such times are called paravirt CPUs. We provide a framework in the Linux kernel to identify these paravirt vCPUs and consolidate the workload on non-paravirt vCPUs.
This is achieved by:
1. Not scheduling any new tasks on paravirt CPUs.
2. Making the load balancer aware of paravirt CPUs.
3. Pushing the workload away from paravirt CPUs if it is running on them.
Which vCPUs to mark as paravirt is left to the architecture. Experimentation was done with the help of a hint from a debugfs file. There is an ongoing effort to plug in steal-time values to decide the hint. It would be ideal if the hint were provided by the hypervisor.
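A minimal sketch of what point 1 above could look like in task placement; cpu_paravirt_mask and the helper names are hypothetical, not from the posted patches:

```c
#include <linux/cpumask.h>
#include <linux/sched.h>

extern struct cpumask *cpu_paravirt_mask;	/* hypothetical */

static inline bool cpu_paravirt(int cpu)
{
	return cpumask_test_cpu(cpu, cpu_paravirt_mask);
}

static int pick_non_paravirt_cpu(const struct cpumask *cpus)
{
	int cpu;

	for_each_cpu(cpu, cpus) {
		if (cpu_paravirt(cpu))
			continue;	/* point 1: place no new tasks here */
		if (idle_cpu(cpu))
			return cpu;
	}
	return -1;	/* fall back to the existing selection logic */
}
```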
Discussion Points:
1. Comparison of different methods (capacity, load balancer, offline)
2. Plugging in steal time to set paravirt CPUs
3. Subtle issues related to implementation.
4. Do all scheduler classes need to honor it?
5. How userspace such as irqbalance/SCHED_EXT can exploit it.
This is a follow up of discussion in Plumbers 2024.
http://www.youtube.com/watch?v=vMgTAdYAMeQ
http://www.youtube.com/watch?v=pZaO5TlzEjo
Patches: (Latest being first)
https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
https://lore.kernel.org/all/20250217113252.21796-1-huschle@linux.ibm.com/
The eBPF Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s eBPF subsystem and its surrounding user space ecosystem such as libraries, loaders, compiler backends, related system tooling as well as eBPF use cases.
The gathering is designed to foster collaboration and face-to-face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of the eBPF subsystem.
The track will be composed of talks, 30 minutes in length (including Q&A discussion).
eBPF Track's technical committee: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko and Martin Lau.
Writing non-trivial BPF programs presents a unique challenge because of the constraints enforced by the BPF verifier. If a program fails to load, the verifier emits a log containing a complete trace of its evaluation with various debugging information. Interpreting such a log to track down the root cause of a failure can be difficult, especially for developers new to BPF.
BPF Verifier Visualizer (https://github.com/libbpf/bpfvv) is a tool that aims to improve developer experience of debugging BPF verification failures. It provides a debugger-like UI to navigate and analyse BPF verification logs, available right in your browser.
In this presentation we will talk about the main features of bpfvv and how it has already been used by some developers at Meta.
State pruning allows the BPF verifier to mitigate the path explosion problem and scale to large programs. With its underlying algorithms, precision tracking, strongly connected components computation, and liveness analysis, state pruning accounts for around 15% of the verifier. Its many heuristics have been tuned over a decade of trial and error.
While state pruning inefficiencies can lead to programs being rejected, bugs can lead to bytecode being incorrectly identified as dead code and eliminated. Nevertheless, despite its key role and complexity, this critical pruning logic remains not widely understood and would benefit from broader reviews.
This talk therefore aims to demystify state pruning, covering:
- how state pruning works;
- the underlying algorithms, the heuristics and their evolution;
- limitations & shortcomings.
To conclude this talk, we will describe existing and ongoing works to test the pruning logic, and discuss our own propositions to improve its debuggability.
The BPF verifier has trouble verifying loops,
and we are slowly moving to address this.
In the talk I want to cover:
- historical evolution of loops handling by verifier;
- problems with current state of things (too crude widening,
no bounds for induction variables);
- describe DFA based liveness analysis that landed recently;
- describe further steps adding DFA-based value range analysis to
the verifier.
eBPF enables safely extending kernel functionality for various applications,
but its static verifier is overly restrictive, preventing many useful and
valid programs in practice from running. It can also miss safety violations
in complex conditions. Recent work proposes adding runtime checks to mitigate
these limitations, but they narrowly target specific cases. Their
instrumentation requires significant effort and is error-prone.
We present ePass, a framework that provides systematic and verifier-cooperative
runtime checking for enhancing eBPF flexibility and safety. ePass introduces a novel
Intermediate Representation (IR) that lifts eBPF bytecode into an SSA (Static Single Assignment) form,
enabling systematic instrumentation of runtime checks. It provides intuitive
APIs for developers to easily implement diverse transformation passes. ePass
ensures these passes preserve existing safety rules while enhancing runtime
safety.
To showcase ePass' versatility, we develop 12 passes that address different
verifier limitations and safety gaps, such as instruction limit enforcement,
memory sanitization, and helper function argument validation. Most of them take
under 100 lines of code. Our evaluation further shows that ePass enables
real-world programs that were previously rejected to execute safely, mitigates
known vulnerabilities, and incurs low overhead.
ePass's toolchain is completely open-source at https://github.com/OrderLab/ePass.
Several fuzzers are able to target the BPF verifier, some achieving high coverage. They are fairly efficient at uncovering deadlocks, unnecessary warnings, and memory errors, but struggle to uncover false negatives: cases where the verifier incorrectly accepts a program. Without a test oracle for these false negatives, fuzzers remain silent.
This talk proposes a new test oracle for the verifier, inspired by recent research results [1] and discussions at last year's Linux Plumbers [2]. When enabled, the oracle preserves the verifier's expectations on registers and stack slots at pruning points. Then, at runtime, the interpreter or JIT compilers check that concrete values are within the verifier's expectations, issuing a warning if any unexpected value is found.
The RFC patchset will be posted to the mailing list before the conference. It relies on a BPF map to save the verifier states (and potentially expose them to userspace for debugging) and focuses on scalar values for now.
1 - https://www.usenix.org/conference/osdi24/presentation/sun-hao
2 - https://lpc.events/event/18/contributions/1933/
The growing demand for sophisticated, high-performance eBPF programs on weakly-ordered architectures like arm64 necessitates finer-grained control over memory ordering, instead of relying on the default of full memory barriers. To that end, the eBPF ISA has been expanded by two new BPF_ATOMIC instructions that provide load-acquire and store-release semantics.
This talk introduces these new instructions, which are now supported in both mainline LLVM and the upstream Linux kernel. We will begin by detailing their encoding design, and demonstrating their usage via C intrinsics through concrete examples. We will then present a deep dive into the end-to-end implementation, covering the necessary changes to the LLVM BPF backend and the Linux kernel—mainly including the verifier and Just-In-Time (JIT) compilers for arm64, x86-64, and riscv64. Finally, we will explore future directions, including a proposal to implement explicit eBPF memory barriers using nocsr kfuncs.
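As a taste of the usage the talk will demonstrate, here is a minimal BPF C sketch of the intrinsics, assuming a compiler and kernel new enough to support the new instructions (e.g. building with -mcpu=v4):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

__u64 payload;
__u32 ready;

SEC("tc")
int producer(struct __sk_buff *skb)
{
	payload = 42;
	/* store-release: the payload write cannot be reordered past this */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
	return 0;
}

SEC("tc")
int consumer(struct __sk_buff *skb)
{
	/* load-acquire: later loads cannot be reordered before this */
	if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
		return payload == 42;	/* guaranteed by the acq/rel pairing */
	return 0;
}
```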
The goal is for every attendee to leave with a thorough understanding of these new additions to the eBPF ISA, ready to leverage them to build more featureful and performant eBPF applications.
With the initial implementation for BPF signing nearly merged, more advanced signing use-cases can be discussed. There are three key cases for signed BPF programs:
This initial implementation lays the foundation for signing eBPF programs, focussing on [1]. The talk is in an open discussion format, anchored on key ideas for building on security policy and signing approaches for [2] and [3], and any other use-cases.
At Meta, due to the proliferation of AI workloads, increased security was needed around key services. In particular, two use cases were jailing untrusted code, and preventing insider and attacker tampering with user data.
AI training and execution of prompts involves executing untrusted code. Meta's network is flat, leading to untrusted workloads operating in the same space as sensitive workloads. Meta built a jailed stack based on microVMs, with a key component being bpf-based jailing of the microVM. BpfJailer blocks access to the network, filesystem, IPC, etc. using bpf LSM, in a flexible way tailored to the needs of these workloads.
Another workload involves securing user data in Meta Private Processing. User data is processed in a CVM. BpfJailer is used in this case to prevent tampering with the binaries that can access the CVM, enforcing signed binaries as well as command line argument validation. BpfJailer prevents even root users from tampering with the processes interacting with the CVM, blocking debuggers, /proc access, etc.
This talk will focus on the challenges of supporting these two workloads with bpf LSM, as well as the unsolved problems in this space. Specific discussion will involve:
- Jailing techniques used with bpf LSM in BpfJailer
- Protecting bpf LSM programs and agents from tampering
- Implementing binary and integrity checks and protections
- Managing and orchestrating a bpf LSM agent at scale
- Integrating bpf based enforcement into containerized workloads
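To ground the discussion, here is a minimal, hypothetical flavor of a bpf LSM jailing hook in the BpfJailer style; the path-prefix policy is a stand-in, whereas the real system enforces signatures and integrity checks:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bprm_check_security")
int BPF_PROG(jail_exec, struct linux_binprm *bprm)
{
	char path[16];

	bpf_probe_read_kernel_str(path, sizeof(path), bprm->filename);
	/* stand-in policy: a real jailer verifies signatures/digests instead */
	if (__builtin_memcmp(path, "/usr/bin/", 9))
		return -1;	/* -EPERM: deny the exec */
	return 0;
}
```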
Given the increasing concerns around user data and AI model theft, we prioritized developing robust mechanisms to monitor critical files throughout the file system. Leveraging eBPF, we implemented real-time detection for the creation of sensitive files and established comprehensive tracking of their lifecycle events, including renames, moves, deletions, compression, decompression, and uploads. This security enhancement enables us to maintain a detailed lineage of each file, facilitating the identification of unauthorized access or sharing attempts.
A key challenge was designing a reliable method to tag files for persistent tracking, ensuring that identifiers remain consistent even as files are renamed or transformed. Developing heuristics to detect uploads and downloads proved complex, since these actions can occur through various system calls and network behaviors.
Looking ahead, there are several technical directions for expanding this capability. One promising approach is to extend file tracking to cover transfers between systems within our visibility scope, allowing us to monitor file movements not just locally but also across multiple hosts. This would further strengthen our ability to detect and respond to potential data exfiltration or misuse at scale.
Abstract: At Meta’s scale, high-signal telemetry competes with overwhelming noise. We present a pragmatic approach that pushes policy into the kernel to eliminate noise at the source and enforce controls before user space is involved. First, we show how we compile regex patterns into deterministic finite automata (DFAs) and execute them in eBPF at Linux Security Module (LSM) and fentry attach points, matching file paths and other identifiers in-kernel to short-circuit events that previously traversed user-space filters. The result is “no-emit” security: irrelevant events are never emitted, cutting CPU, I/O, and downstream processing costs and enabling faster responses.
Second, we connect this to BpfJailer, an eBPF-based enforcement framework used to dynamically sandbox processes (including internal AI agents). Beyond certificate lockdown, we show how the same kernel-first approach extends to executables and privilege boundaries, e.g. blocking execution of untrusted binaries, constraining unauthorized privilege escalation, and guarding sensitive kernel surfaces. For agentic workflows, we also explore DNS-aware enforcement: using in-kernel DNS inspection and network hooks to constrain name resolution to policy-bound allowlists, detect exfil-friendly domains, and coordinate with connection-level blocks to prevent prompt-injected egress. Together, these controls establish agent-specific, low-privilege identity and bounded egress across developer workstations, ephemeral on-demand compute (e.g., containerized batch jobs), and continuous integration (CI) environments.
We’ll share the engineering journey: a regex => AST => DFA pipeline for kernel-friendly execution; layered filtering and dynamic configuration; MetArmor’s orchestration (BpfHandler, map updaters, event buffers) for fleet rollouts; and response actions (file/process quarantine, targeted network blocks, isolation).
We conclude with operational wins and open questions around verifier limits, DFA size/memory trade-offs, path normalization across filesystems (e.g., btrfs device IDs, mounts, symlinks), uniform kernel/userspace filtering semantics, false positives management, and standardizing “kernel-first” controls for agentic workflows.
Container networking plugins for Kubernetes like Cilium currently implement Fully Qualified Domain Name (FQDN) based DNS network policies using a user-space DNS proxy to intercept the DNS to IP mappings and plumb CIDR based policy into bpf maps.
This architecture introduces some challenges, since any downtime of the userspace proxy would result in DNS resolution failure for all workloads on the node. Solutions like introducing another userspace proxy in a high-availability setup can improve reliability, but they come with the cost of introducing a protocol to communicate policy state and other metadata to the proxy. Thanks to some of the enhancements in eBPF, it is now possible to implement a DNS parser natively in bpf. Cilium already enforces CIDR-based policy in the kernel. Combining these two solutions, we were able to get rid of the userspace proxy completely, eliminating the dataplane and control plane coupling and resulting in improved tail latencies.
This system uses a set of stream parser and stream verdict bpf programs to support even DNS over TCP. The path to implementing this, however, was not easy. Features like DNS compression make implementing such a system tricky, requiring us to understand some of the internals of the verifier. In some scenarios, upgrading to a newer kernel would simply resolve the issue, but not before spending days, if not weeks, trying to reason about the verifier's behavior, and then some more time bisecting to understand which commits fixed the issue.
This talk will dive into the details of how the system was built, share our experience from the development process, and leave some room to discuss how we could improve the user experience (UX) for relatively new developers. The answer to these challenges could simply be documenting the behavior at the intersection of loops and other features, or even a better abstraction in bpf that simplifies the work the verifier needs to perform in these scenarios.
This talk explores the idea of capturing and identifying DNS requests with BPF and responding to them "in-place" with BPF.
DNS is a relatively simple UDP protocol, and a typical DNS query over UDP usually involves just one packet for the query and one for the response. If BPF parses the structure of the packet and is able to resolve the address from the request, e.g. from a hash map attached to the program, it is also able to rewrite the fields of the incoming packet with the resolved address and place the packet back in the queue as a response.
In the talk we will cover an actual implementation of TC/XDP BPF programs, potential performance benefits and real-world applications, as well as relevant topics such as DNS DDoS protection.
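A minimal sketch of the XDP-side skeleton for this idea; the map lookup and response rewrite are elided, and this is not the implementation the talk will cover:

```c
/* Recognize a DNS query over UDP in XDP; a real implementation would
 * look the name up in a BPF hash map, rewrite the packet into a
 * response, and return XDP_TX. IP options are assumed absent. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int dns_answer_in_place(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	struct iphdr *ip = (void *)(eth + 1);
	if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
		return XDP_PASS;

	struct udphdr *udp = (void *)(ip + 1);	/* no IP options assumed */
	if ((void *)(udp + 1) > data_end || udp->dest != bpf_htons(53))
		return XDP_PASS;

	/* Here: parse the QNAME, look it up in a hash map, and on a hit
	 * swap MACs/IPs/ports, append the answer, fix lengths and
	 * checksums, then return XDP_TX to bounce the response back. */
	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```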
The Embedded and IoT Micro-conference is a forum for developers to discuss all things Embedded and IoT. Topics include tools, telemetry, device drivers, protocols and standards in not only the Linux kernel but also Real-Time Operating Systems.
Current Problems that require attention (stakeholders):
We hope you will join us either in-person or remote for what is shaping up to be another great event full of collaboration, discussion, and interesting perspectives.
Proactive file prefetching has proven effective in reducing system boot times. This presentation details the evaluation of a prefetch solution for Android, inspired by its successful deployment on ChromeOS. We analyze its performance impact through Perfetto traces, confirming notable boot time reductions. The core of the implementation involves a two-phase "Record and Replay" process, and we will discuss the key integration challenges encountered on low memory devices.
Furthermore, we extend this investigation to virtualized environments, presenting I/O trace analysis for integrating the prefetch into a Debian OS running on the Android Virtualization Framework (AVF). We conclude by outlining next steps, focusing on continuous record/replay mechanisms and the utilization of new kernel tracepoints for enhanced observability.
This talk will cover some of the boot time optimizations that we've found to be helpful on Android systems that should apply equally well to embedded systems. Most of these guidelines have been launched in a public product and have been shown to work well.
Zephyr RTOS offers a rich ecosystem, but embedded engineers often face the challenge of porting code across environments—from Zephyr to another RTOS or even to bare metal. This discussion is about a practical guide to extracting a “bare metal flavor” of code out of Zephyr so that it can run independently of Zephyr’s driver and subsystem layers.
NOTE: The HAL approach isn't the right solution: it keeps the code out of the common code base, does not follow Zephyr standards or coding principles, and forgoes the advantage of open-source community review of the design architecture and the source code.
We will explore:
Structuring Zephyr projects so application and peripheral code can be isolated cleanly
Writing drivers and hardware access layers that can compile both inside and outside Zephyr
Techniques for minimizing dependencies on Zephyr APIs
A case study: extracting a Zephyr-based peripheral driver and reusing it in a bare-metal project
Beyond the mechanics, we’ll discuss the philosophy of portability—how to strike the right balance between leveraging Zephyr abstractions and keeping code flexible enough to live elsewhere.
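As a small illustration of the dependency-minimization point above, one common pattern is to hide the only OS touchpoints behind a tiny shim so the same driver file compiles both inside Zephyr and on bare metal. The non-Zephyr names here (bm_timer.h, bm_delay_ms) are hypothetical:

```c
/* Portable peripheral logic: only the shim knows about the OS. */
#ifdef __ZEPHYR__
#include <zephyr/kernel.h>
#define hal_sleep_ms(ms)  k_msleep(ms)
#else
#include "bm_timer.h"		/* your own bare-metal delay layer */
#define hal_sleep_ms(ms)  bm_delay_ms(ms)
#endif

/* Pure driver logic below depends on the shim only, not on Zephyr APIs. */
void sensor_warmup(void)
{
	hal_sleep_ms(10);
}
```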
In this session we will discuss how to improve the system stability of boards using fusb302 (or similar) chips for their USB-C port without any backup power source. This kind of setup is often found on Rockchip boards (e.g. Libre Computer ROC-RK3399-PC, Radxa ROCK 5B or ArmSoM Sige 5) and is quite a pain, because a hard reset effectively kills the board power.
The session starts with a short presentation of the problem(s) and work that has already been done to get it working to some degree with the upstream kernel. Afterwards the plan is to discuss how to improve the situation to avoid hard resets as much as possible.
The Common Clk Framework (CCF) is expected to keep a clock’s rate stable after setting a new rate with "clk_set_rate(clk, NEW_RATE)". However, several longstanding issues affect how rate changes propagate through the clock tree when CLK_SET_RATE_PARENT is involved.
Current behavior allows a child clock to change its parent’s rate to satisfy its own request, but this adjustment happens without considering sibling clocks. As a result:
A new parent rate may be incompatible with sibling or grandchild clocks, since no negotiation occurs across the tree.
Even when the parent rate is acceptable, sibling clocks can still have their rates changed unexpectedly. A proposed patch [1] to fix this cannot be merged without breaking boards that rely on the existing behavior.
Gate and mux scenarios raise further complications.
These limitations have been visible in bug reports and attempted fixes over the years [2]. Recently posted KUnit tests [3] highlight and reproduce the issues.
We'll discuss the problem, some possible ways to solve this, and suggestions for how to keep compatibility with older boards that rely on the existing behavior.
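A minimal sketch of the hazard, with all clock and consumer names hypothetical:

```c
#include <linux/clk.h>
#include <linux/device.h>
#include <linux/err.h>

/* Both clocks divide the same PLL; "uart" has CLK_SET_RATE_PARENT. */
static int demo_sibling_detune(struct device *dev)
{
	struct clk *audio = devm_clk_get(dev, "audio");
	struct clk *uart = devm_clk_get(dev, "uart");

	if (IS_ERR(audio) || IS_ERR(uart))
		return -ENODEV;

	clk_set_rate(audio, 24576000);	/* audio is now tuned to the PLL rate */

	/* May walk up and re-rate the shared PLL to satisfy the UART,
	 * silently detuning the audio sibling: no cross-tree negotiation
	 * happens today. */
	return clk_set_rate(uart, 48000000);
}
```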
[1] https://lore.kernel.org/linux-clk/20250528-clk-wip-v2-v2-2-0d2c2f220442@redhat.com/
[2]
- https://lore.kernel.org/lkml/20230825-pll-mipi_keep_rate-v1-0-35bc43570730@oltmanns.dev/
- https://lore.kernel.org/linux-kernel/20230807-pll-mipi_set_rate_parent-v6-0-f173239a4b59@oltmanns.dev/
- https://lore.kernel.org/all/20241114065759.3341908-1-victor.liu@nxp.com/
- https://lore.kernel.org/linux-clk/20241121-ge-ian-debug-imx8-clk-tree-v1-0-0f1b722588fe@bootlin.com/
[3] https://lore.kernel.org/linux-clk/20250812-clk-tests-docs-v3-0-054aed58dcd3@redhat.com/
Contemporary embedded systems increasingly come with bootloaders and firmware that expose some sort of ABI toward the Linux kernel, and the Linux kernel depends on that ABI to start other CPU cores and to configure clocks, power domains, pin multiplexing and other vital parts of the system.
With existing firmware interfaces like ACPI, the ABI stability is strictly enforced and ABI breakage seldom occurs. With newcomer firmware interfaces on contemporary embedded systems, commitment to ABI stability is left to vendors and ABI breakage is much more common. This leads to problems during kernel updates of such systems, which may mandate either a bootloader update, or special-case kernel fixes.
The gravity of this problem is further exacerbated by a growing number of closed-source implementations of such firmware ABI providers, either in the form of firmware running on a secure coprocessor, or similar.
In this proposal, the speaker explains the firmware ABI stability problem in detail, including examples present on contemporary SoCs, and a grim prediction of where firmware ABI development is likely to evolve. The talk will then offer multiple possible solutions for how to address the problem. One such option would be to use an open-source bootloader, like U-Boot, to start the additional firmware components from a fitImage bundled together with the Linux kernel, and possibly use U-Boot SPL to minimize the security-sensitive code footprint. Another alternative would be improved introspection of firmware ABI interfaces and proper versioning, which would allow the Linux kernel to work around bugs in specific firmware versions.
Proper discussion and feedback from audience on this topic would be highly appreciated.
Memory management keeps on being exciting. With a lot of activity on all different kinds of projects, some more controversial subjects that might be worth discussing this year:
The goal of the Toolchains micro-conference is to hold discussions about toolchain related topics that are relevant to the Linux kernel. This covers both the GNU toolchain and the Clang/LLVM toolchain.
In recent years we have had either a micro-conference or a complete track for discussing toolchain topics at LPC, and along with LSF/MM/BPF these have proven to be a quite effective way to keep track of, and bring to a satisfactory conclusion, problems that often take a long time to resolve and require a lot of interaction between the kernel and toolchain communities.
This is particularly important for topics that are maintenance in nature and evolve over time, but also for particular projects that cover some specific need of kernel hackers, or some desired functionality.
Some of the topics we are particularly interested in covering this year are:
As is intended with micro-conferences, the emphasis is on productive discussion, and the goal is to reach agreements and satisfactory solutions to particular issues. In particular, long talks are discouraged, and presentation material should be reduced to the minimum necessary to present the problem for discussion. Notepads should then be ready to document the discussion and the agreements reached.
Key participants:
Another year of work is behind us, with lots of progress across GCC, Clang, and Rust to provide the Linux kernel with a variety of security features. Let's review and discuss where we are with parity between toolchains, approaches to solving open problems, and exploring new features.
Parity reached since last year:
- counted_by attribute for Pointer Members (GCC, Clang)

Compiler-specific features landed since last year:

In progress:

- __strong typedef (Clang)

Stalled, needs driving:
I'd like to share some toolchain experiences encountered as part of my work on hardening the kernel running on Google's production servers.
I'll discuss "profile-guided hardening" (aka "selective sanitization"): how to make kernel cold paths extra hardened using -lower-allow-check-percentile-cutoff-hot and -fsanitize-ignorelist.
I'll also share my excitement about the recent Clang developments on the topic of slab isolation, using properties of the allocated types to help make memory-safety exploitation harder (e.g., the -fsanitize=alloc-partition RFC).
Further topics related to kernel security could be opened up for discussion as well, like the recent -fbounds-safety flag and strategies to progressively use it across the kernel.
We're going to talk about the work we've done to enable distributed ThinLTO builds for the kernel. We'll cover why we're doing this, how we implemented it, and how it compares to in-process builds. We'll also discuss the changes we made to other components, like livepatch.
Currently, a futex does not expose any information about its owner to the kernel. When a task blocks on a futex, it updates the futex's shared memory to indicate that it is waiting and enters the kernel to sleep until the owner wakes it up. For a futex that is held for a short time, this can cause a noticeable performance hit, because the time it takes for a blocked task to go to sleep and then be woken up can be much longer than the time the futex was held. Contention in these cases has a significant performance impact.
If the memory for the futex held ownership information, then instead of sleeping when a task can't get the futex, it could act like the adaptive mutexes in the kernel and spin if the owner is running on another CPU. When the owner releases the lock, the blocked task would detect the release and could exit out, or even take the futex from within the kernel. This could be a tremendous speedup for futexes.
To make this work, a contract is needed between the C library and the kernel to allow this information to be passed between user and kernel space.
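As a sketch of what such a contract could build on: PI futexes already store the owner's TID in the futex word, which is how the kernel identifies the owner on contention. The snippet below shows that existing convention; the adaptive-spin slow path is the part this proposal would add:

```c
/* Minimal sketch; real code also needs the FUTEX_WAITERS bit handling. */
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static _Atomic unsigned int lock_word;	/* 0 == unlocked, else owner TID */

static void futex_lock(void)
{
	unsigned int tid = (unsigned int)syscall(SYS_gettid);
	unsigned int expected = 0;

	/* Fast path: publish our TID as the owner. */
	if (atomic_compare_exchange_strong(&lock_word, &expected, tid))
		return;

	/* Slow path: the kernel can read the owner's TID from the word.
	 * Today FUTEX_LOCK_PI sleeps; with owner-runnability information
	 * it could adaptively spin first, as proposed in this session. */
	syscall(SYS_futex, &lock_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}
```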
In 2014 I added "WIP: Kernel syscalls wrappers" [1] item to the upstream glibc consensus documentation.
Over the last 11 years the idea that we should add C library wrappers for all Linux syscalls has waxed and waned, but I would like to revisit the idea with the help of the kernel community.
I want to look at the history of the policy, why it came about, and what work is required to add C library interfaces for some of the more complex syscalls we don't currently support.
We will visit the worst offender, futex(), and discuss how we might add a wait-wake primitive to userspace that all callers can use.
We will review the recent addition to glibc for SCHED_DEADLINE and how it came about and attempt to understand why we never had a C library interface for sched_setattr() and sched_getattr().
I'll close with a call to action again for always having the syscalls available as C library calls even if they could be used behind the back of the implementation.
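For illustration, this is roughly what callers have had to do in the absence of a wrapper, shown here for sched_setattr() (a sketch; error handling elided):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <linux/sched/types.h>	/* struct sched_attr (kernel UAPI) */

/* Hand-rolled wrapper: the raw syscall, nothing more. */
static int my_sched_setattr(pid_t pid, const struct sched_attr *attr,
			    unsigned int flags)
{
	return syscall(SYS_sched_setattr, pid, attr, flags);
}

int main(void)
{
	struct sched_attr attr = {
		.size = sizeof(attr),
		.sched_policy = SCHED_DEADLINE,
		.sched_runtime = 10 * 1000 * 1000,	/* 10 ms */
		.sched_deadline = 30 * 1000 * 1000,
		.sched_period = 30 * 1000 * 1000,
	};

	return my_sched_setattr(0, &attr, 0);
}
```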
Reference:
[1] https://sourceware.org/glibc/wiki/Consensus?action=diff&rev1=13&rev2=14
Performance Inversion (a more generalized form of Priority Inversion) is a common problem in the wild. RT tasks are not the only ones susceptible to it; SCHED_NORMAL tasks are prone to it too. Whether due to the use of nice values, running on a big.LITTLE system, or DVFS, lock holders can delay an important waiter, leading to performance problems.
The proxy execution effort led by John Stultz is generalizing a solution to this problem at the scheduler level. It is not connected to futex_pi yet, and once it is, it won't be effective unless userspace opts in.
Unfortunately, userspace code is too complex and layered today for any single user to identify which locks can be badly contended and cause inversion problems. Library dependencies are wide and outside any single owner's control, making an opt-in to PI locks hard to enforce. Ongoing QoS discussions for the scheduler will add more ways for inversion problems to affect performance.
For these reasons, making PI the default behavior at the lowest possible level is necessary to address this class of problems, hopefully once and for all. To my knowledge, non-Linux OSes have made choices like this.
What would it take to flip PTHREAD_PRIO_INHERIT to be the default on libc?
How can we deal with other forms of locks that are not pthread-based? Some languages, like Java, implement their own locks; how can we help them be PI by default?
Non-lock-based inheritance is an important topic too. What semantics need to be introduced to enable pthread condition variables to be PI-aware?
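For reference, opting in today looks like the following; the question above is whether this protocol should simply become the default:

```c
#include <pthread.h>

static pthread_mutex_t m;

/* Explicit opt-in to priority inheritance, per mutex. */
static int init_pi_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	return pthread_mutex_init(&m, &attr);
}
```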
The goal of this activity is to go through a list of specific problems and issues concerning the BPF support in GCC.
TBD
We have been using CONFIG_LTO_CLANG_THIN in our production kernel for a few years. While delivering non-trivial performance improvements, kernels with CONFIG_LTO_CLANG_THIN enabled also bring challenges to our work:
LTO causes confusion for tracing users. With LTO, the compiler is more likely to do selective inlining, i.e., inline a kernel function at some call sites but not at others. When such kernel functions are being traced, the user can easily miss a lot of events without realizing it.
LTO makes kernel live patching more difficult, especially with the kpatch-build toolchain. With LTO, individual object files are LLVM IR, so kpatch-build cannot easily perform a binary diff on these object files. LTO also makes it more challenging to verify the correctness of a live patch.
In this talk, we will share more information about these issues and discuss with the community potential ways to address them.
Syzbot is a continuous kernel fuzzing system that automatically uncovers and reports hundreds of Linux kernel findings each quarter.
The BoF session aims to facilitate open dialogue between the syzbot engineers and the kernel developers who receive the reports. We'll discuss what's working well, where more attention is needed, and how we can improve.
We'll start by highlighting the key changes over the past year, known problems, and future syzbot/syzkaller development plans.
The majority of the session will be dedicated to your experiences. In particular, we want to hear from you about:
- Debugging: What would make Syzbot reports easier for you to diagnose and fix?
- Maintainer Workflow: How do you manage and prioritize the influx of syzbot reports? What features or changes could help streamline this process?
- Contributing: Have you tried writing syzkaller descriptions for your subsystem? What was your experience, and how can we lower the barrier to entry?
- Anything else: other syzbot/syzkaller questions, ideas, or pain points.
We have had very productive syzbot BoF discussions at past LPCs, and we're looking forward to seeing you again!
The Linux kernel has numerous tools to detect bugs, among them a family of dynamic program analysis called "sanitizers": Kernel Address Sanitizer (KASAN), Kernel Memory Sanitizer (KMSAN), Kernel Concurrency Sanitizer (KCSAN), and the Undefined Behaviour Sanitizer (UBSAN).
Knowing when to apply which sanitizer in the kernel development process may not always be obvious: each sanitizer is dedicated to finding a different class of bugs, and each introduces some amount of performance and/or memory overhead. Not only that, each sanitizer also provides a range of options to tweak their abilities.
This session is dedicated to briefly introducing each kernel sanitizer, the bug classes they help detect, and important gotchas when using them.
The rest of the session is dedicated to answering questions about each of the sanitizers: KASAN, KMSAN, KCSAN, and UBSAN. Feel free to also share success stories that may give attendees who are just starting out with the sanitizers ideas on how best to apply them.
Attending: Geert (m68k arch maintainer), D. Jeff Dionne (Coresemi), Rob Landley (ToyBox), John Paul Adrian Glaubitz (SuperH), Ruinland (AndesTech)
It's like the Sword of Damocles: every so often, people push to deprecate support for 32-bit architectures, with or without an MMU.
This year, influential voices are saying it again. Yet developers are still manufacturing, working on, and selling 32-bit CPU intellectual property -- both as actual silicon and as RTL for FPGAs -- so the deprecation would cause a huge debacle.
Just in the past months of this year, 32-bit noMMU Linux has been configured and run on newly manufactured RV32 ICs that are available off the shelf:
- HPMicro: hpm6360, hpm6700, hpm6800
There are also freshly manufactured RV32 Linux MCUs:
- Allwinner: V821
This is not just hobby work; there is an entire market actively doing business on it.
If the concern is maintenance effort and contributions, Andes and CoreSemiconductor are here with a commitment to the RISC-V 32-bit variants and SuperH architectures.
We would like to invite interested developers to discuss this in this BoF and see whether we can work out a path forward together, as a community with diverse needs.
While we have argued that keeping 32-bit and noMMU support is beneficial for business, we also highly value retro-computing, hobbyist work and, moreover, educational use.
For instance, the end-to-end open-source-manufactured kianV-uLinux IC runs RV32 noMMU Linux, showcasing the elegance of configuring mainline Linux down to the simplest hardware design.
And several Mackerel boards are still booting Linux, shining with the glory of m68k beauty.
Monitoring NVMe devices and paths in production is currently limited to static snapshots via nvme-cli. While sufficient for basic inspection, this model falls short in NVMe-oF (fabrics) deployments, where path conditions can change dynamically due to fluctuating network latency, congestion, or link failures. Administrators troubleshooting fabric multipath environments often need continuous visibility into path state, ANA status, queue depth, link speed, and error counters, but today they are forced to repeatedly run commands or rely on ad-hoc tooling.
This motivates the idea of nvme-top, a tool providing real-time monitoring of NVMe fabrics paths and devices, similar in spirit to iotop or top. The goal is to give administrators a continuously updating view of device and path health, enabling faster detection of link degradation, imbalances in multipath I/O, or transient failures.
Today, nvme-cli builds a static in-memory tree from sysfs. This works for one-off queries but does not update dynamically. To enable real-time monitoring, we need mechanisms to refresh the topology at regular intervals. For an initial proof of concept, a simple polling-based refresh (e.g., once per second) may be sufficient to demonstrate the value of a real-time monitor; a sketch of this flavor follows. Longer term, community input will be needed on whether kernel-assisted notification mechanisms (e.g. inotify/fanotify or uevents) are desirable and maintainable.
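A minimal flavor of such a polling loop; the sysfs path is illustrative, since real topologies vary:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char state[64];

	for (;;) {
		/* Re-read one path attribute per refresh tick. */
		FILE *f = fopen("/sys/class/nvme-subsystem/nvme-subsys0/"
				"nvme0c0n1/ana_state", "r");
		if (f) {
			if (fgets(state, sizeof(state), f))
				printf("ana_state: %s", state);
			fclose(f);
		}
		sleep(1);	/* crude once-per-second refresh */
	}
}
```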
This BoF aims to:
1. Identify the most useful attributes for monitoring in real time (e.g., ANA status, path state, queue depth, error statistics).
2. Explore trade-offs between polling and kernel notification mechanisms.
3. Discuss expectations around responsiveness, overhead, and presentation (e.g., text vs. rich TUI like btop).
During the session, we also plan to show pre-built mockups of a potential nvme-top dashboard to make the discussion concrete and gather input on usability and presentation. By focusing on NVMe-oF multipath monitoring, this discussion seeks to shape a tool that improves troubleshooting and operational visibility in fabric-based NVMe deployments.
Link: https://github.com/linux-nvme/nvme-cli/issues/2904
Laptops with ARM processors have become more common over recent years, starting with Apple Silicon and, more recently, X Elite/Plus laptops.
While solid basic support exists for many of them, the user experience is still lacking, and in-kernel as well as userspace support has major gaps compared to an x86_64 system.
Topics we want to address and discuss in this BoF are:
A Trusted Execution Environment (TEE) is an isolated execution environment running alongside the rich operating system. It provides the capability to isolate security-critical or trusted code and corresponding resources like memory, devices, etc. The isolation is backed by hardware security features such as Arm TrustZone, AMD Secure Processor, RISC-V TEE, etc.
This BoF will provide a platform to discuss topics related to the ongoing evolution of the kernel TEE subsystem with support for new drivers coming up like Trusted Services TEE, Qualcomm TEE, or any other future TEE drivers. Along with that, we will see how the recently merged RPMB subsystem in the kernel helped the easier enablement of OP-TEE based fTPM in-kernel use cases. The next big feature up for discussion is protected DMA-Bufs managed by a TEE looking for real-world upstream user-space use cases like DRM protected media pipelines, TEE protected crypto accelerator keys, secure user interfaces, etc.
The Confidential Computing microconferences of the past years have been a significant catalyst for better supporting trusted execution workloads in the Linux virtualization and general software stack. Since the last occurrence of the microconference, AMD SEV-SNP and Intel TDX support for KVM has been merged into the mainline Linux kernel, as well as support for the Linux kernel running in Arm CCA environments.
But the open source software stack for confidential computing is still far from being complete. There remain many problems to be solved and functionality to enable. Some of the most important ongoing developments are:
- Support for large-page backing of confidential virtual machines (CVM).
- Privilege separation features in KVM via VM planes.
- Live migration of CVMs.
- Secure VM Service Module architecture and Linux support.
- Trusted I/O software architecture.
Further topics to discuss are:
- Possible solutions for the full CVM (remote) attestation problem.
- Linux as a CVM operating system across hypervisors.
- Performance of CVMs.
The Confidential Computing microconference of 2025 wants to bring open source developers working on these topics together into productive discussions and to collaborate on solutions for the open problems.
Key attendees:
Ashish Kalra ashish.kalra@amd.com
Borislav Petkov bp@alien8.de
Dan Williams dan.j.williams@intel.com
Daniel P. Berrangé berrange@redhat.com
Dr. David Alan Gilbert dgilbert@redhat.com
David Hansen dhansen@linux.intel.com
David Kaplan David.Kaplan@amd.com
David Rientjes rientjes@google.com
Dhaval Giani dhaval.giani@amd.com
Dionna Amalie Glaze dionnaglaze@google.com
Elena Reshetova elena.reshetova@intel.com
James Bottomley James.Bottomley@HansenPartnership.com
Joerg Roedel joro@8bytes.org
Kirill A. Shutemov kirill.shutemov@linux.intel.com
Michael Roth michael.roth@amd.com
Mike Rapoport rppt@kernel.org
Paolo Bonzini pbonzini@redhat.com
Peter Fang peter.fang@intel.com
Peter Gonda pgonda@google.com
Sean Christopherson seanjc@google.com
Stefano Garzarella sgarzare@redhat.com
Tom Lendacky thomas.lendacky@amd.com
HugeTLB support in guest_memfd is making steady progress, and has also led to some new problems that come with huge page support. HugeTLB support currently relies on runtime folio restructuring (split/merge) for accurate refcount tracking that integrates well with other users of struct folio.
Current support ends up introducing significant cost in terms of conversion performance. Folio restructuring also means extra memory has to be kept around to support (temporarily) undoing HugeTLB vmemmap optimization.
I would like to get the community’s opinions on the possible optimizations:
The open-source community is hard at work on building the framework
and mechanisms allowing the assignment of devices to a trusted virtual
machine (TVM), a process commonly known as device assignment (DA).
For the TVM to trust a device, the device must provide the TVM with
Evidence claims [RFC9334] confirming its identity, the state of its firmware and
its configuration. Since Evidence claims can be consumed by 3rd
party attestation services external to the TVM, there is a need to
standardise the representation of Evidence to ensure interoperability.
This session is about introducing the current proposal and an open discussion with the audience on what needs to be modified, the pieces that may be missing and the way forward for this specification.
This presentation is to revive last year's discussion on PCIe device attestation. The first thing to understand is if last year's consensus to use netlink sockets to convey device attestation information to user space still holds. The second thing to review is the device attestation workflow itself. Given the difference between the CMA and PCI/TSM scenarios, it may be better to build an attestation workflow that fits PCI/TSM and see what can be reused for CMA rather than last year's direction to finish CMA before extending to PCI/TSM.
This talk is a follow-up of LPC'24, where the community had diverse opinions on the suitable approach of attested TLS protocols for confidential computing. Meanwhile, we have defended our position (cf. expat BoF) to standardize the protocol in the IETF, and a new Working Group named Secure Evidence and Attestation Transport (SEAT) has been formed to exclusively tackle this specific problem. We would like to present updates to the work since last year (candidate draft for standardization) and gather feedback from the community on the desired security goals, so that it can be accommodated in the standardization.
Transport Layer Security (TLS) is a widely used protocol for secure channel establishment. However, it lacks an inherent mechanism for validating the security state of the workload and its platform. To address this, remote attestation can be integrated into TLS, which is named attested TLS protocol. At LPC'24, we presented an overview of the three approaches for this integration, namely pre-handshake attestation, intra-handshake attestation, and post-handshake attestation. We also presented insights from the Formal Verification using the state-of-the-art symbolic security analysis tool ProVerif to provide high confidence for use in security-critical applications.
Current project partners include Arm, Linaro, Siemens, Huawei, Intuit, Axis, Bonn-Rhein-Sieg University of Applied Sciences, and Barkhausen Institut. By this talk, we hope to inspire more open-source contributors to this project.
The attendees will gain technical insights into the latest developments of standardization of attested TLS protocols in the IETF and will be able to provide feedback on the requirements for their use cases of attestation for confidential computing.
Our thorough analysis shows that pre-handshake attestation is potentially vulnerable to replay, relay, and diversion attacks. On the other hand, intra-handshake attestation is potentially vulnerable to relay and diversion attacks. While post-handshake attestation results in slightly higher latency, it offers better security properties than the other two options, forming a valuable contribution to the TEE attestation ecosystem.
In a nutshell, to provide more robust security guarantees, all applications can replace standard TLS with attested TLS.
TDISP, designed to allow a confidential VM to establish a trust relationship with a PCI device, creates new headaches for the Linux PCI stack and for virtualization components:
Solving these problems natively in the Linux PCI stack comes with one set of challenges. Solving this underneath Linux in a trusted paravisor comes with a different set of tradeoffs.
We propose to guide a discussion around different solutions to this to determine what's most acceptable for the Linux community.
Potential interested stakeholders:
* Joerg Roedel
* Dan Williams
The Secure VM Service Module (SVSM) for Confidential VMs can expose multiple services and virtual devices to the Linux guest. To manage these, we need a proper bus in the kernel for discovery and enumeration.
So, what is the right architectural choice for this bus? Should we write a new, minimalist bus from scratch? Or should we adapt the standardized VIRTIO framework for its broad ecosystem support? Is a hybrid approach possible, giving us the best of both worlds? This talk will explore these questions, aiming to discuss the trade-offs of each path.
To protect SEV-SNP guests against malicious injection attacks, the SEV-SNP
Alternate Injection feature facilitates the services of a Secure VM Service
Module (SVSM) and its APIC emulation to secure interrupt delivery into an
SEV-SNP guest.
This session will explore the lessons learned during enabling Alternate
Injection, including KVM, SVSM, OVMF and the guest kernel. It will cover how
we bypass the KVM APIC emulation layer with a doorbell page and also do
complex VMPL switches when the interrupt for VMPL2 is delivered into the SVSM
with Restricted Injection.
Furthermore, the Restricted Injection handler and APIC emulation in the SVSM,
and SVSM interaction with KVM, OVMF and the guest kernel, will be quickly
illuminated.
Finally, a by-product of the enablement work - one useful trace tool (or
a hack, depending on the beholder) for combining KVM and the guest code
traces - will be shown.
The Containers and Checkpoint/Restore micro-conference focuses on both userspace and kernel related work.
The micro-conference targets the wider container ecosystem ideally with participants from all major container runtimes as well as init system developers.
The microconference will be discussing recent advancements in container technologies with some of the usual candidates being:
On the checkpoint/restore front, some of the potential topics include:
And quite likely a variety of other container and checkpoint/restore topics as things evolve between now and the event.
Past editions of this micro-conference have been the source of many developments in the Linux kernel, including:
While the "new" mount API has been a massive improvement in the flexibility of mount infrastructure on Linux (and has allowed us to develop all sorts of new features over the past 7 years) there are still a handful of usability issues which should be addressed.
Container runtimes in particular would probably like to be able to use the completely-unused FSCONFIG_SET_PATH{,_EMPTY} to avoid race attacks, but the infrastructure for using it is quite baroque. There have also been some recurring issues around the uAPI surrounding singleton superblocks (FSCONFIG_CMD_CREATE_EXCL no longer makes this implicit, but doesn't provide much help to userspace to know what is going on).
This talk will go over these and a few more issues I've found so far (while working on the long-awaited and finally merged man pages for the "new" mount API) and open into a discussion of any other pain points folks have had which should also be addressed.
Memory pages typically represent the largest component of a checkpoint, and handling this data efficiently is crucial for reducing the performance overhead of CRIU. Checkpoint compression is often used to minimize the storage requirements for container snapshots and to accelerate live migration by minimizing the amount of data that must be transferred over the network. However, existing approaches implement compression as a subsequent operation after the whole checkpoint has been created. This approach introduces additional I/O overhead and increases the storage footprint, as the uncompressed memory pages must first be written and then read again for compression.
To address these challenges, we have been exploring a new approach that integrates built-in memory page compression directly into CRIU. This approach eliminates intermediate I/O operations by compressing the data in-flight before it is written to CRIU images. This is particularly important for enabling efficient end-to-end encryption of container checkpoints in multi-tenant Kubernetes clusters. In this talk, we are going to discuss the set of changes and image formats that implement this functionality, as well as the associated trade-offs and common use-cases, such as maximizing compression ratio for fault-tolerance with periodic checkpointing and minimizing checkpoint latency during live migration.
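A sketch of the in-flight idea using zstd's streaming API; this is illustrative, not CRIU's actual code or image format:

```c
#include <zstd.h>
#include <unistd.h>

/* Compress one dumped page and append it to the image fd, so the
 * uncompressed page never touches storage. The caller finishes the
 * stream with ZSTD_e_end once all pages have been fed through. */
static int write_page_compressed(ZSTD_CCtx *cctx, int img_fd,
				 const void *page, size_t len)
{
	char out[4096];
	ZSTD_inBuffer in = { page, len, 0 };

	while (in.pos < in.size) {
		ZSTD_outBuffer ob = { out, sizeof(out), 0 };

		if (ZSTD_isError(ZSTD_compressStream2(cctx, &ob, &in,
						      ZSTD_e_continue)))
			return -1;
		if (write(img_fd, out, ob.pos) != (ssize_t)ob.pos)
			return -1;
	}
	return 0;
}
```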
Shadow stacks are a key security feature to guard against ROP attacks. Mike Rapoport has worked on enabling checkpoint/restore support for CET-based shadow stacks.
This talk extends that work to arm64, specifically the Guarded Control Stack (GCS) Arm extension. I'll present the process of adding GCS support to CRIU, including how process state is detected, dumped and restored, and what changes were required in the parasite code.
I'll cover a key challenge which was meeting the kernel’s sigframe expectations for GCS tokens, a critical part of getting reliable restore. I’ll also discuss the debugging process that led to identifying and understanding gaps in the kernel’s GCS support during dump and restore.
EROFS is a modern, high-performance, block-based Linux image filesystem with an advanced on-disk format (e.g., separated layouts for (un)compressed data, (optional) external data blobs, (optional) data compression supporting multiple algorithms within a single filesystem, fine-grained data deduplication and (optional) metadata compression) and a highly optimized runtime implementation (e.g., fast decompression subsystem, FSDAX support, etc.).
Originally well-known to be used in Android system firmware, EROFS is now deployed on billions of devices and addresses various target scenarios. Nowadays, almost all mainstream Linux distributions support and utilize EROFS, including enterprise distributions like RHEL 9/10 and popular cloud distributions like Amazon Linux, Alibaba Cloud Linux, Azure Linux and Oracle Linux.
In recent years, we have focused on improving EROFS for container-related use cases, such as container images, immutable OS images (e.g. AWS bottlerocket), and even application sandboxes. Note that EROFS has already been adopted by projects such as composefs, containerd and Kata containers.
In this talk, we again concentrate on container use cases. We will recap the highlight features of EROFS, summarize its benefits, and discuss ongoing features and scenarios, such as page cache sharing, filesystem (e.g. container rootfs) data integrity verification, unprivileged mounts and efficient remote storage passthrough (e.g., S3 object storage) for AI infrastructure.
For CRIU to successfully checkpoint/restore a process, files must be opened correctly at the correct mounts.
Generally, we get mnt_id for the mount from /proc/<pid>/<fd>/fdinfo and information about the mount from /proc/<pid>/mountinfo.
But, if a file is open on an "unmounted" mount, i.e., a mount that has been unmounted using MNT_DETACH (we still have access to fds),
CRIU can't do checkpoint/restore, since it can't get mountinfo for these "unmounted" mounts.
They are not present in /proc/<pid>/mountinfo, and statmount does not work on them.
This talk will discuss potential ways to extract mountinfo for these mounts.
One such approach is to add "unmounted" mounts to a new "umount" mount namespace, along with adding support to statx for exporting mnt_ns_id, which we implemented here:
https://github.com/bsach64/linux/commits/umount-mnt-ns-plus-statx/
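For context, statx() can already report the mount ID of an open fd (kernel 5.8+, recent glibc headers); the series above extends this direction so that the mount namespace of a detached mount can be identified too:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
	struct statx stx;
	int fd = open("/tmp", O_RDONLY | O_DIRECTORY);

	/* AT_EMPTY_PATH: operate on the fd itself, not a path. */
	if (fd >= 0 &&
	    statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID, &stx) == 0)
		printf("mnt_id: %llu\n",
		       (unsigned long long)stx.stx_mnt_id);
	return 0;
}
```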
Currently, seccomp listeners (created via SECCOMP_FILTER_FLAG_NEW_LISTENER [1]) are limited to a single listener per process [2]. This becomes problematic in nested container scenarios -- for example, when an outer LXC runtime intercepts the mknod syscall while an inner container runtime needs to hook sysinfo. Today, container runtimes often work around this by disabling seccomp listeners when they detect confinement (see [3]). I propose discussing possible approaches to support multiple or nested listeners, user-space API design, and their kernel-level implications.
[1] https://github.com/seccomp/libseccomp/blob/9b9ea8e7a173b96e59fb21e8d461365110e7b4ef/src/system.c#L405C13-L405C45
[2] https://github.com/torvalds/linux/blob/fd94619c43360eb44d28bd3ef326a4f85c600a07/kernel/seccomp.c#L1926
[3] https://github.com/lxc/lxc/blob/faefb7b82878bec2398f52d8bbb78272d0f50dc5/src/lxc/seccomp.c#L1198
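For context, this is roughly how the single listener is created today (simplified; the task needs no_new_privs or CAP_SYS_ADMIN, and the filter program is supplied by the caller):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <linux/filter.h>

/* Returns a listener fd whose events are consumed with the
 * SECCOMP_IOCTL_NOTIF_* ioctls. Installing a second USER_NOTIF
 * listener in the same task tree is where today's limitation bites. */
static int install_notify_filter(struct sock_fprog *prog)
{
	return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
}
```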
Coming up with a complex architecture to enable the Linux kernel test/CI ecosystem is no easy task. Last year, KernelCI launched its new system. Now, one year later, we want to deliver a progress report tailored for the Linux Plumbers audience.
We want to share our progress in how we are delivering results and value to our users, and what is coming up next. There has been notable progress in improving the quality and stability of the tests, as well as the intelligence behind them, allowing us to automatically triage some regressions.
In this talk, we will detail how kernel developers, test maintainers, and organizations can get involved with KernelCI — be it by setting up tests and following results, or by connecting their own test systems to it.
Lockdep is a tool in the Linux kernel designed to detect potential deadlocks by tracking the order in which locks are acquired. However, deadlocks can occur not only due to incorrect lock acquisition order, but also from waits that cannot be resolved. For more effective deadlock detection, it is crucial to track the waits and events themselves, rather than focusing on lock acquisition order. This is where DEPT (DEPendency Tracker) comes in. DEPT accurately identifies conditions that can lead to deadlocks by tracking waits and events. Let me introduce DEPT and explain how it works.
[limitation of lockdep]
https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SAKURA.ne.jp/
https://lore.kernel.org/all/20250513175633.85f4e19f4232a68ab04c8e41@linux-foundation.org/
[dept playing role in practice]
https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.park@lge.com/
https://lore.kernel.org/all/b6e00e77-4a8c-4e05-ab79-266bf05fcc2d@igalia.com/
[dept series]
https://lore.kernel.org/all/20250519091826.19752-1-byungchul@sk.com/
Modern workloads and systems demand more of the scheduler. A server no longer
runs one type of workload, and general purpose interactive systems like Android
and Desktop are on the rise. The same workload runs on a large variety of hardware and architectures that have different capabilities in terms of
performance and power. And the end user for the same workload and hardware
might have different expectations on perf and power compared to another end
user.
Introducing a Quality of Service into the scheduler will help address all of
these issues. But it is a long and hard challenge that requires addressing
a number of problems in tandem to produce the desired final outcome of
a portable and flexible QoS that gives the users what they need and expect.
To reach a fully functioning QoS we need to look at the following problems:
What QoS plumbing does the scheduler need to expose, and how can it be wrapped with higher-level QoS for the end application user?
Our wakeup and load balancer paths need to be multimodal to cater for the
fact that task placement decisions now can be impacted by a user choice.
Admin vs App vs User controls and precedence mechanisms. The user might want
something, the app might request something and the admin might want to
enforce something else.
The adoption problem. Rewriting the whole userspace is not practical and
will take a long time. We need a simple mechanism to allow applying QoS
without requiring applications to opt-in.
Performance and general QoS inheritance support, without which effectiveness could be considerably impacted, as performance and priority inversion problems are common in practice.
QoS is not an alternative to improving the scheduler behavior. We need to
look at default values and how they might need to be different for different
types of systems/expectations. We need better debuggability as well to
distinguish between cases that require scheduler fixes and cases that need
better or new QoS mechanisms.
We shall present a plan for how these problems can potentially be addressed, in the hope of carving a path forward.
The Livepatch consistency model [1] requires the kernel to provide reliable stacktraces in order to be fully supported. On x86, the ORC unwinder provides these reliable stacktraces. However, arm64 misses the required support from objtool: it cannot generate ORC unwind tables for arm64. Prior RFCs have proposed adding this support to objtool, but feedback from the upstream community has indicated that a solution using data produced directly by the compiler would be preferred [2, 3].
In the v6.17 release, the arm64 kernel is expected to gain livepatch support, but without fully reliable stacktraces [4]. With this partial solution, interrupt stacks cannot be reliably traced.
SFrame provides a generalized, compiler-based solution, inspired by the ORC unwinder [5, 6]. There is already an SFrame unwinder proposed for userspace [7]. We would like to propose similar functionality to provide reliable stacktraces within the arm64 kernel [8]. This would enable the same level of support for reliable stacktrace as provided for x86 by the ORC unwinder.
TPMs have been present in modern laptops and servers for some time now, but their adoption is quite low. While operating systems do provide some security features based on TPMs (think of BitLocker on Windows or dm-verity on Linux) third party applications or libraries usually do not have TPM integrations.
One of the main reasons for low TPM adoption is that interfacing with TPMs is quite hard: there are competing TPM software stacks (Intel vs IBM), a lack of key format standardization (currently being worked on), and many operating systems are not set up from the start to make the TPM easily available (the TPM device file is owned by root or requires a privileged group for access). Even with a proper software stack, the application may have to deal with low-level TPM communication protocols, which are hard to get right.
In this presentation we will explore a better integration of TPMs with some Linux Kernel subsystems, in particular: kernel keystore and cryptographic API. We will see how it allows the Linux Kernel to expose hardware-based security to third party applications in an easy to use manner by encapsulating the TPM communication complexities as well as providing higher-level use-case based security primitives.
It's been exactly 10 years since I decided to toss in the towel on finding more co-maintainers and instead added 10 committers to the drm/i915 tree. Since then, commit rights have been a great success story in DRM, and this talk will dig into some of the lessons learned and the reasons why that is the case. But there are still challenges and pain points left to sort out for the future, some of which have only finally been tackled this year. Those will be covered too.
DAMON, a Linux kernel subsystem, enables efficient data access monitoring and access-aware system operations. Meta has been utilizing DAMON for data access observability to enhance fleet-wide memory management efficiency. This project, however, highlighted DAMON's limitations, such as the absence of page-level information observability and the need for manual parameter tuning per workload.
To address these limitations, several DAMON features have been developed and integrated upstream. These features have allowed Meta to implement fleet-wide data access monitoring and collect data that confirms its functionality and offers further insights.
This session will delve into the project in detail. We will begin by outlining the long-term, high-level objective of Meta's fleet-wide data access pattern observability project and the inherent limitations of DAMON in achieving this goal. We will then introduce the new DAMON features developed to overcome these challenges, covering their design, usage, and evaluation results. Furthermore, we will present the design of Meta's fleet-level monitoring system, built upon the enhanced DAMON, and share key findings from the collected fleet-wide data access patterns. The session will conclude by discussing the remaining limitations in DAMON that hinder more efficient and useful fleet-wide access monitoring.
Abstract:
In modern Linux video applications—including players, conferencing tools, and editors—GPU-based video post-processing is the norm. Tasks such as scaling, color conversion, rotation/flipping, and blending are typically offloaded to the GPU for performance reasons. While efficient in terms of speed, this approach often comes with a higher power cost compared to using fixed-function 2D hardware acceleration blocks.
This presentation showcases our efforts to enable 2D hardware acceleration for video post-processing in Linux, using AMD’s VPE (Video Processing Engine). We will walk through the kernel driver support, the integration with user-space components including VA-API and Mesa, and the plumbing required to bridge these layers together.
We’ll also share practical results from modifying open-source video applications to take advantage of this 2D pipeline, and demonstrate measurable improvements in power consumption. This talk aims to provide insight into both the technical implementation and the potential benefits of moving away from general-purpose GPU usage in specific video use cases.
Presentation outline:
Speaker Bio:
Solomon Chiu is a software developer at AMD, working on display and multimedia enablement for Linux. His current focus is on bringing up 2D fixed-function hardware acceleration for video processing pipelines, optimizing power efficiency, and improving integration across the Linux graphics stack.
At AMD, he collaborates across kernel, Mesa, and user-space layers to enable features of video post-processing hardware. He is actively involved in enabling open-source video APIs, including VA-API and Vulkan, to make these capabilities accessible to Linux applications.
He is passionate about low-level systems development and believes in contributing to open standards that benefit the broader Linux ecosystem. This is his first time presenting at LPC, where he looks forward to engaging with the community and sharing lessons from AMD’s recent work on video acceleration.
Rust is a systems programming language that is making great strides in becoming the next big one in the domain. Rust for Linux is the project adding support for the Rust language to the Linux kernel.
Rust has a key property that makes it very interesting as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc. It also provides other important benefits, such as improved error handling, stricter typing, sum types, pattern matching, privacy, closures, generics, etc.
This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics.
Possible Rust for Linux topics:
Possible Rust topics:
- rustc improvements, LLVM and Rust, rustc_codegen_gcc, gccrs...
- bindgen, Compiler Explorer, Cargo, Clippy, Miri...

Last year was the third edition of the Rust MC and the focus was on discussing the ongoing efforts by different parties that are upstreaming new Rust abstractions and drivers (Giving Rust a chance for in-kernel codecs, hrtimer Rust Abstractions, Atomics and memory model for Rust code in kernel). We also had a topic related to improving the ergonomics and tooling around Rust in the kernel (Coccinelle for Rust) and a tutorial session to help others learn Rust (Introduction to Rust: Quality of Life Beyond Memory Safety), as well as a "Birds of a Feather" slot for open discussion on other topics.
Since the MC last year, it is easy to notice how Rust work around the kernel is accelerating: patches and contributors keep growing, new MAINTAINERS entries have been created and new maintainers have stepped up, new use cases have been developed and/or merged (e.g. the AMCC QT2025 PHY driver and Android ashmem), new projects have been announced (e.g. Tyr)... Even some end users of Linux distributions have already interacted with Rust kernel code in the wild (via the QR code panic screen). This all signifies success, but also poses new challenges ahead.
Note that this year submissions for the Rust MC should aim to be more discussion oriented.
Suggested attendees: the Rust for Linux team (Miguel Ojeda, Boqun Feng, Gary Guo, Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross, Danilo Krummrich), Abdiel Janulgue, Alexandre Courbot, Alistair Francis, Arnaldo Carvalho de Melo, Bjorn Helgaas, Burak Emir, Christian Brauner, Christian Schrefl, Daniel Almeida, Dave Airlie, David Gow, Dirk Behme, Fiona Behrens, Frederic Weisbecker, FUJITA Tomonori, Greg Kroah-Hartman, Igor Korotin, Ingo Molnar, Jocelyn Falempe, Joel Fernandes, Kees Cook, Liam R. Howlett, Lorenzo Stoakes, Luis Chamberlain, Lyude Paul, Masahiro Yamada, Matthew Maurer, Nathan Chancellor, Paolo Bonzini, Paul E. McKenney, Peter Zijlstra, Remo Senekowitsch, Rob Herring, Robin Murphy, Sami Tolvanen, Stephen Boyd, Tamir Duberstein, Tejun Heo, Thomas Gleixner, Viresh Kumar, Will Deacon, Yury Norov...
One of the main selling points for Rust's inclusion in the kernel is safety, which is strongly associated with a reduction of runtime panics. Yet, in Rust an integer overflow or out-of-bounds array access translates into an implicit panic, inserted without any warning to the programmer.
The inability to easily identify where these implicit panic sites are introduced creates a blind spot when writing critical kernel code. Code that processes untrusted user-space data is particularly vulnerable to hitting these conditions, and a formal guarantee that e.g. a given function is panic-free would be extremely helpful.
This topic aims to describe the conditions under which such panic sites can be inserted, review existing solutions from the user-space ecosystem (like the no-panic crate), and discuss how this issue can be mitigated in the kernel.
Potential solutions include new tooling or compiler support to flag potential panic sites, or establishing stricter coding rules to forbid them altogether through e.g. checked variants of potentially panicking operations.
This activity comprises two parts. The first is a short update on the development of language features as we close the chapter on the Rust project goal 2025H2, covering features like arbitrary_self_types and trait evolution, the projected availability of these features on stable Rust releases, and how this enables a better kernel developer experience when working on Linux kernel code.
Second, we will hold a panel discussion on two important language features that we aim to bring to completion in 2026. The first is in-place initialisation, continuing the in-place initialisation programme from the project goal 2025H2. We will briefly present three proposals from the Rust-for-Linux team and other Rust language team members: "init" expressions, out-pointers and guaranteed emplacement. We will then collect feedback on the ergonomics of these three proposals with worked examples derived from existing in-tree Rust code; concerns about uncovered use cases and safety issues; as well as suggestions on the language designs and potentially new approaches.
If time permits, we would also like to discuss trait evolution as the second part, and how it would help with refactoring the kernel crate's trait hierarchy. We will first present a worked example of this language feature using existing code in the kernel crate, and collect feedback on its ergonomics and use cases from kernel developers.
Some C kernel data structures exposed to Rust code apply internal synchronization (e.g. XArray). Depending on the type of lock, such data structures need to drop their locks while allocating memory. Sometimes it is beneficial to use a single external lock to protect multiple such data structures.
In Rust this creates a problem that is not present in C: mutably borrowing data through a lock guard takes out a mutable borrow on the guard itself, making the guard unavailable, so the lock cannot be released momentarily while allocating memory.
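A minimal illustration of the conflict, using the user-space std::sync::Mutex as a stand-in for the kernel abstractions:

    use std::sync::Mutex;

    fn update(m: &Mutex<Vec<u32>>) {
        let mut guard = m.lock().unwrap();
        // Borrowing data through the guard mutably borrows the guard itself...
        let slot = guard.last_mut();
        // ...so while `slot` is alive, the guard cannot be dropped to release
        // the lock momentarily (e.g. around a memory allocation):
        // drop(guard); // ERROR: `guard` is still borrowed via `slot`
        if let Some(v) = slot {
            *v += 1;
        }
    }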
At Kangrejos (the Rust for Linux annual workshop), we presented a Rust API that allows the use of an external lock while still allowing the data structure to momentarily drop locks for the purpose of allocating memory.
In this session we address concerns that were raised at Kangrejos: how we handle races that can occur while the lock is dropped for memory allocation purposes, and the effect of the locking pattern on data structures applying the Entry pattern. We also aim to present benchmark results collected from real kernel code, rather than user-space toy examples.
We aim to use the session to collect input from the community to iron out any potential pain points of the external locking API. We hope that the session will spark discussion and awareness in the community, such that the basic shape of the API is widely accepted when patches hit the list.
Rust in the Linux kernel uses the pin-init library for initialization. This library handles ergonomic and safe initialization of address-sensitive types such as Mutex<T> (the abstraction of struct mutex).
Since address sensitivity is an inherited property (a type containing an address-sensitive type also becomes one), lots of types require using the pin-init API to initialize them. Thus knowing how to use pin-init is required in order to write Rust code in the kernel.
This talk will explain how to use pin-init from two perspectives: first as a consumer of an API that uses pin-init, and second as the creator of such an API.
The talk is based on a chapter in the upcoming "Rust in the Linux Kernel" book and feedback is greatly appreciated.
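As a taste of the consumer perspective, here is a minimal sketch using the kernel crate's macros; the names follow current in-tree code (new_mutex!, pin_init!, KBox::pin_init) and may differ slightly from the book chapter:

    use kernel::prelude::*;
    use kernel::sync::{new_mutex, Mutex};

    // Contains a Mutex<u64>, so Counter is itself address-sensitive and
    // must be initialized in place via pin-init.
    #[pin_data]
    struct Counter {
        #[pin]
        value: Mutex<u64>,
    }

    impl Counter {
        fn new() -> impl PinInit<Self> {
            // This only describes the initialization; nothing runs until
            // the value is placed at its final address, e.g. with
            // KBox::pin_init(Counter::new(), GFP_KERNEL).
            pin_init!(Self {
                value <- new_mutex!(0),
            })
        }
    }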
We will showcase the current state of Tyr, a new Rust kernel driver for Arm Mali GPUs, briefly covering the driver's status and the associated Rust abstractions needed to support it, as well as the future plans for both upstream and Android.
The discussion should be centered on whether the current upstreaming plan makes sense to the DRM community, considering our efforts both upstream and downstream. We also aim to discuss how to combine efforts with the current JobQueue abstraction that is being worked on. This includes possibly using the downstream Tyr branch as a testbed for the design, as well as discussing in what other ways the Tyr team can contribute to this work.
As HID support in Rust is being developed, a number of challenges are arising during the development process. Binding the C API that represents different HID structures will grow verbose as more HID logic is supported in Rust. Specialized HID device drivers will want to make use of APIs from other subsystems, such as input and DRM. Otherwise, such drivers tend to be isolated to HID report fixups. Ideally, Rust HID drivers would avoid doing pure HID device fixups since there are other mechanisms to perform such fixups besides a device driver.
Problems are surfacing even with the initial review for Rust HID support.
Examples:
- Converting between kernel::hid::Group and the HID_GROUP_* define equivalents
- Various manually implemented "getter" functions
- Defining HID report descriptor item attributes
This discussion is intended to find ways to improve such areas in the HID driver abstractions.
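As an illustration of the first item, the conversion boilerplate under discussion looks roughly like this (a hypothetical sketch; kernel::hid is still under review, so all names here are assumptions):

    use kernel::bindings;
    use kernel::error::{code::EINVAL, Error};

    // Hypothetical Rust-side mirror of the C HID_GROUP_* defines.
    pub enum Group {
        Generic,
        Multitouch,
        // ... one variant per HID_GROUP_* define ...
    }

    impl TryFrom<u16> for Group {
        type Error = Error;

        fn try_from(raw: u16) -> Result<Self, Error> {
            // Manual mapping from the bindgen-generated constants; this is
            // exactly the kind of verbosity the discussion wants to reduce.
            match u32::from(raw) {
                bindings::HID_GROUP_GENERIC => Ok(Group::Generic),
                bindings::HID_GROUP_MULTITOUCH => Ok(Group::Multitouch),
                _ => Err(EINVAL),
            }
        }
    }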
RCU (Read-Copy-Update) is a great mechanism in the Linux kernel for read-mostly situations. However, it is used almost exclusively on the C side, and there's virtually no support for it on the Rust side. While there are plans to implement RCU in Rust using projections, those are largely still in the making, and the author believes there are some special cases worth addressing now, for instance the ones related to RCU lists. This talk is basically an invitation to a discussion on how these may be addressed, with one real-life example the author would specifically like to bring up.
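For context, the Rust-side support that exists today is essentially just a read-side critical section guard; a minimal sketch, assuming the kernel crate's kernel::sync::rcu module as found in recent kernels:

    use kernel::sync::rcu;

    fn reader() {
        // Equivalent to rcu_read_lock(); the critical section ends when
        // the guard is dropped (rcu_read_unlock()).
        let guard = rcu::read_lock();
        // ... read RCU-protected data here; note there is no Rust
        // equivalent yet of rcu_dereference()-style pointers or RCU lists.
        drop(guard);
    }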
TL;DR We propose to present the Rex project (Rust-based kernel extension) and discuss its integration with Rust for Linux.
Rex is a Rust-based kernel extension framework (https://github.com/rex-rs/rex). It offers similar safety guarantees to eBPF. Unlike eBPF, which verifies the safety of extension code via an in-kernel verifier, Rex builds its safety guarantees atop the language-based safety of Rust, combined with lightweight runtime protection. Specifically, Rex requires extension programs to be written in a safe subset of Rust; the Rex compiler performs safety checks and generates native code directly. This approach avoids the overly restrictive verification requirements (e.g., program complexity constraints) and the resulting arcane verification errors of eBPF. Rex implements its own kernel crate that offers a safe kernel interface, wrapping the existing eBPF interface with safe Rust wrappers and bindings. At the same time, Rex employs a lightweight runtime that implements graceful Rust panic handling with resource cleanup, kernel stack usage checks, and extension program termination.
We plan to first go over the design of Rex, and then collect feedback and answers to the following questions:
- How does the Rust-for-Linux community feel about the idea of a new kernel extension mechanism that sits in the middle ground between Rust kernel modules and eBPF?
- Are there aspects of Rex that are also useful for Rust-for-Linux, and how can we contribute?
- Does the trust we put in the Rust toolchain make sense, and how can we potentially make it more trustworthy?
As Linux continues to be deployed in systems with varying criticality constraints, progress needs to be made in establishing consistent linkage between code, tests, and requirements, to improve overall efficiency and the ability to support the necessary analysis.
This MC addresses critical challenges in expectation management (aka requirements tracking), documentation, testing, and artifact sharing within the Linux kernel ecosystem. While tests are contributed alongside the code, the underlying requirements that those tests satisfy are traditionally not documented in a structured manner. This has resulted in a large amount of "tribal knowledge" associated with subsystems, which turns into technical debt when maintainers stop working on those subsystems.
Taking in the feedback from last year's "Safe Systems with Linux" miniconference [1] on how we can improve the documentation of the kernel's design [1a], the ELISA (Enabling Linux in Safety Applications) community has focused on prototyping a template for capturing requirements with volunteer Linux kernel subsystem maintainers. The ELISA architecture team [2] has been meeting weekly and has developed a structured approach for documenting testable expectations, with a template that allows embedding requirements directly with the relevant code (as requested in the initial workshop) while maintaining machine readability and forming a base for improving testing with initiatives like KernelCI. The prototype format got initial review and feedback in December at the ELISA workshop at Goddard [3] and, after incorporating that feedback, again at the workshop in Lund in May [4].
Initial pilots in the TRACING subsystem [5] have demonstrated the value of this approach, even resulting in the identification and fixing of previously unknown issues [6,7].
Building on last year's discussions, the goal of this miniconference is to get wider feedback from additional maintainers and developers of different subsystems on the approach being proposed.
Progress on Linux Kernel Requirements Framework
Discussing the SPDX-based template for low-level requirements, lessons learned from initial pilots, and plans for wider adoption.
Technical Debt Reduction
How documented requirements capture the understanding of the original functionality, and can be leveraged for verification when code needs to be rewritten (i.e., C to Rust), etc.
Requirements-Driven Testing
How documented requirements can drive test case development and validation. Connecting relevant test cases with specific requirements and code should yield more efficient testing.
Semantic Aspects of Kernel Requirements
Exploring how to properly document expected behaviors with consideration for design elements that impact or are impacted by these behaviors.
Practical Implementation Challenges
Addressing the balance between detailed requirements documentation and maintaining kernel development velocity.
Required tools for automation
Progress on tools to generate, validate, and track work products, increasing dependability throughout the kernel development process.
Industry Adoption
How safety-critical industries are beginning to leverage these developments for certification and compliance purposes. How their safety engineers can participate in contributing formalized requirements to the kernel and providing linkage.
Requirements as an Education Tool
How Linux kernel documentation can mine the requirements to help new contributors understand kernel functionality and design intent, and attract new upstream developers.
Last year, we established the need for better documentation of requirements in safe systems. This year, we will showcase concrete progress, including a working framework for Linux kernel low-level requirements with practical implementations in the TRACING subsystem. This MC aims to bring together kernel maintainers and developers with safety architects and industry stakeholders to expand adoption of these practices and address the remaining challenges in building safe systems with Linux. It should engage with testing- and documentation-centric activities and how all the parts can link together.
Steve Rostedt
Greg Kroah-Hartman
Thomas Gleixner
Jonathan Corbet
Tim Bird
Shuah Khan
Gustavo Padovan
Gabrielle Paoloni
Chuck Wolber
Luigi Pellecchia
Alessandro Carminati
Wolfram Sang (Renesas BSP)
Vincent Mailhol (CAN Subsystem)
Kate Stewart
Philipp Ahmann
Nicole Pappler
[1] LPC 2024 Safe Systems with Linux Miniconf: https://lpc.events/event/18/sessions/187/#20240920
[1a] Kernel design documentation improvement: https://www.youtube.com/watch?v=stqGiy85s_Y
[2] ELISA Architecture meetings: https://lists.elisa.tech/g/safety-architecture
[3] NASA workshop: https://www.youtube.com/watch?v=_N3l_EEV8uM
[4] Link to Lund workshop session: https://drive.google.com/file/d/1--e82k80_D79ycJdFbEwLQ0kpbD0pMR9/view?usp=sharing
[5] Drafting requirements in tracing subsystem: https://github.com/torvalds/linux/compare/master...elisa-tech:linux:linux_requirements_wip
[6] Patch on LKML: https://lkml.org/lkml/2025/3/21/1128
[7] Patch on LKML: https://lkml.org/lkml/2025/5/21/850
In regulated industries, Linux is widely used due to its strong software capabilities in areas such as dependability, reliability, and robustness. These industries follow best practices in terms of processes for requirements, design, verification, and change management. These processes are defined in standards that are typically not accessible to the open source kernel community.
However, since these standards represent best practices, they can be incorporated into structured development environments like the Linux kernel even without knowledge of the standards themselves. The kernel development process is trusted in critical infrastructure systems as it already covers many process elements, directly or indirectly.
The purpose of this session is to initiate a discussion on what is currently available and what may be missing in order to enhance the dependability and robustness of Linux kernel-based systems. How can the artifacts be connected? Where are the appropriate places to maintain them? And who is best placed to be responsible for each element of the development lifecycle?
Unlike the typical path chosen for attempting to use Linux in safety applications, the approach developed by NVIDIA strives to avoid placing any burden on upstream maintainers and developers.
Upstream maintainers should not have to become safety experts, nor should the Linux kernel become encumbered by verbose descriptions of what the code does, in order to achieve safety.
We want to start a discussion about how we achieve this, and how it can coexist with upstream processes.
To maintain software safety, defining specifications and ensuring that implementations meet them are both important. The former has become popular in the Linux kernel in various ways [1,2], while the latter still depends on developers' manual effort. Recent advances in techniques and tools, however, have made it feasible to systematically apply program verification to Linux kernel code.
In this talk, we share our experience and practices from ongoing work on verifying task-scheduler code of the Linux kernel. We illustrate the challenges we encountered and how verification can be effectively applied in practice, through case studies (e.g., [3,4]) where proving the correctness of certain kernel features resulted in uncovering real bugs (e.g., [5,6]). Furthermore, we present our work to automate this process as much as possible, making verification more practical and scalable. Our goal is to explore how verification can be made a practical part of the Linux kernel development process.
[1] https://lore.kernel.org/all/20250614134858.790460-1-sashal@kernel.org/ "Kernel API Specification Framework"
[2] https://lore.kernel.org/all/20250910170000.6475-1-gpaoloni@redhat.com/ "Add testable code specifications"
[3] Julia Lawall, Keisuke Nishimura, and Jean-Pierre Lozi. 2024. Should We Balance? Towards Formal Verification of the Linux Kernel Scheduler. SAS 2024.
[4] Julia Lawall, Keisuke Nishimura, and Jean-Pierre Lozi. 2025. Understanding Linux Kernel Code through Formal Verification: A Case Study of the Task-Scheduler Function select_idle_core. OLIVIERFEST '25.
[5] https://lore.kernel.org/all/20231030172945.1505532-1-keisuke.nishimura@inria.fr/ "sched/fair: Fix the decision for load balance"
[6] https://lore.kernel.org/all/20231214175551.629945-1-keisuke.nishimura@inria.fr/ "sched/fair: take into account scheduling domain in select_idle_smt()"
Last year in Vienna we held a session about "Improving kernel design documentation and involving experts".
Following that session, the ELISA Architecture working group drafted an initial template for SW requirements definition, started documenting the expected behaviour of different functions in the TRACING subsystem, made upstream contributions accordingly, and finally also started reviewing and adopting a framework for formally specifying kernel APIs (developed and proposed by Sasha Levin [1]).
This session aims to present the latest updates and to involve the experts in defining the best next steps toward a path for introducing and maintaining requirements in the kernel (and linking them to tests and other verification measures).
[1] https://lore.kernel.org/all/20250711114248.2288591-1-sashal@kernel.org/
High-integrity applications require rigorous testing to ensure both reliability and compliance with industry standards. Current testing frameworks for the Linux kernel, such as KUnit, face challenges in scalability and integration, particularly in environments with strict certification requirements.
KUnit tests, which are currently the most widely accepted and used solution for testing Linux kernel source code, have a number of drawbacks in the way they are built and executed. The KUnit framework lacks feature parity with other modern unit test frameworks, making it difficult to achieve comprehensive and robust low-level test coverage. Furthermore, the way KUnit tests are built and executed limits the scalability that is necessary to create and maintain the many thousands of tests that will be required to verify functionality in a robust, complete, and automatable way.
KUnit tests are integrated into the Linux binary. This requires building the kernel and running it to execute the tests. Additionally, the high volume of tests needed for adequate coverage would increase the size of the kernel beyond what is manageable. This makes it necessary to divide the tests so that multiple kernels with different sets of tests are built. This, in turn, necessitates additional analysis to prove that the changes made in each of these individual kernel builds do not negatively impact the fidelity of the tests for the targeted features. Considering the lengthy build and execution times, along with the need to build and analyze multiple kernels, it is evident that scaling up to the creation and maintenance of thousands of tests poses significant challenges.
In addition to the scalability issues, KUnit lacks essential features needed for testing highly reliable systems. It does not provide a consistent, maintainable, and automatable way to isolate small sections of code. This is crucial for low-level testing and coverage. For example, consider a function A that has dependencies on three other functions, B, C, and D. When testing function A, the results of functions B, C, and D may influence the execution path in function A. However, it is not desirable to actually test the implementation of functions B, C, and D. In most common unit test environments, it is straightforward to create either a fake or mock implementation of those functions in the test file, or to link to an object file that contains a fake or mock of those functions. In KUnit, achieving “mocking” (or at least a similar effect) requires creating a patch that modifies the kernel. The simplest approach is to use conditional compilation macros that are controlled through kernel configuration to generate mock versions of the tested function’s dependencies. Every time a mock or fake is used, it must be manually created and maintained through patches. When this effort is multiplied by thousands of tests, the maintenance burden becomes evident.
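The shape of that workaround, sketched in Rust with cfg for brevity (the C/KUnit equivalent uses #ifdef blocks keyed off a Kconfig option; CONFIG_TEST_MOCK_B is a made-up name):

    // Conditional-compilation "mocking": the mock only exists in a special
    // test build, so it must be created and maintained as a kernel patch.
    #[cfg(not(CONFIG_TEST_MOCK_B))]
    fn b(x: i32) -> i32 {
        x * 2 // real implementation, used in production builds
    }

    #[cfg(CONFIG_TEST_MOCK_B)]
    fn b(_x: i32) -> i32 {
        0 // mock result, selected through kernel configuration
    }

    // Function under test: its execution path depends on b()'s result,
    // which is why tests want to control b() without testing it.
    fn a(x: i32) -> i32 {
        if b(x) > 0 { x } else { -x }
    }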
Topics for discussion:
The ELISA project currently works on bringing the Linux kernel closer to safety compliance by proposing enhancements to the kernel documentation. This includes a model expressed as requirement templates inlined into the source code. At the same time, comparable efforts with a similar goal are ongoing in the wider open-source ecosystem. For example, the Zephyr OS is using the FLOSS StrictDoc model and tool to capture and process requirements. Another example is Linutronix, which reached IEC 62443 4-2 security certification by using StrictDoc to trace requirements, design and tests for their Debian-based IGLOS Secure Beacon embedded system.
This talk picks up the work of ELISA and compares it to a typical StrictDoc setup, with the intention of showing that both efforts could be joined. While ELISA focuses on the model and assumes tools will emerge from the community, StrictDoc both defines a similar model and provides mature tooling to validate and render requirements, design and tests. We'll see that the majority of the needs set by ELISA are already fulfilled by StrictDoc. Notably, ELISA's SPDX-REQ-* tags can be represented and parsed with StrictDoc's "Parsing SDoc nodes from source code comments" feature. The remaining gap is drift detection, i.e. storing a hash over the project ID, source file, requirement text and proximal code, with the intention of signalling a human to check whether requirement and code still align when any of these criteria change. StrictDoc knows metadata, content and proximal code by function name, class name and line ranges, but has no hash generation built in yet. However, StrictDoc is advanced in defining language constructs as part of the model (functions, classes, test cases). It is also advanced in applicability, where for example OEM requirements and OSS project requirements can be linked together in one compatible model and, in consequence, can be processed and validated by the same tool in a single run. Various input/output format converters exist, and customization of validation is next on the roadmap.
The talk concludes with the proposal that StrictDoc could close its gaps by implementing hash generation, optimizing ELISA requirement template parsing and by setting up conformity tests to maintain compatibility. ELISA could in turn list StrictDoc as one of their reference tools, and kernel developers will be invited to try it in their workflow.
Ensuring traceability and compliance remains a major challenge in the development of safety-critical systems. BASIL — The FuSa Spice — is an open-source tool from the ELISA Project that helps engineers manage traceability across documentation, requirements, implementation, testing, and safety evidence.
Designed to integrate seamlessly with existing tools and test infrastructures, BASIL offers a practical and extensible solution to support functional safety assessments and audits.
This talk introduces BASIL’s architecture and key features, illustrates its application in real-world safety workflows, and highlights opportunities for community collaboration. Attendees will discover how BASIL brings openness, automation, and transparency to traceability management in safety-critical software projects.
Open Discussion based on previous agenda items
The eBPF Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s eBPF subsystem and its surrounding user space ecosystem such as libraries, loaders, compiler backends, related system tooling as well as eBPF use cases.
The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of the eBPF subsystem.
The track will be composed of talks, 30 minutes in length (including Q&A discussion).
eBPF Track's technical committee: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko and Martin Lau.
Writing basic BPF-based profilers nowadays isn't too hard. The challenges start when one needs to go a step further beyond just capturing a bunch of stack traces. When a profiler needs to capture user data reliably, bypassing the restrictions of NMI and/or non-sleepable context, that's when the real fun begins.
This talk will describe recent advancements in BPF tracing domain which now allow BPF programs to escape limitations of NMI context and perform reliable user memory reading in "benign" sleepable BPF mode. We'll also demonstrate how it's now possible to fetch the contents of thread-local variables at runtime (among other things), with no upfront preprocessing required, just by fetching ELF binary contents on demand.
For now, tracing-type BPF programs and BPF trampolines adopt a per-function design, requiring the creation of independent instances for each kernel function to be traced. This leads to significant inefficiencies in large-scale tracing scenarios (e.g., monitoring hundreds or thousands of kernel functions): not only do redundant instances consume substantial additional memory, but the program attachment process also takes an excessively long time, severely limiting overall efficiency. The tracing multi-link feature draws inspiration from the design of kprobe multi-link, enabling efficient binding of a single BPF program to multiple target functions and fundamentally addressing this critical pain point.
To address this, the BPF global trampoline is introduced. Unlike BPF trampolines, which hardcode program addresses into instructions (relying on direct calls and thus resulting in per-function binding), the global trampoline constructs a function metadata hash table using rhashtable. This table centrally stores key information such as BPF programs, cookies, and the number of function parameters. When triggered, it queries the hash table in real time based on the target function address to retrieve the corresponding metadata and invokes the associated program. The entire implementation is built using naked functions and C code, greatly simplifying the logic. Practical tests show that this feature reduces the loading time in large-scale tracing scenarios from tens of minutes to within 1 second (even when tracing 50k kernel functions), delivering a significant performance improvement. For detailed information, please refer to: https://lore.kernel.org/bpf/20250703121521.1874196-1-dongml2@chinatelecom.cn/
I'd like to discuss two uprobe related topics:
uprobe optimization - I'd like to give an update on what's been pushed/accepted and what the leftovers are for backward compatibility with old kernels, and discuss whether it makes sense to continue with further optimizations.
uprobe override - We have had several requests for eBPF being able to force a user-space function override, in a similar way to what the kprobe override does. It looks like no solution will be great, but I'd like to present the use case and discuss options for the best way to do that.
I would like to discuss my ongoing work to supply true signatures for the functions available in kallsyms. The changes will span clang, pahole, libbpf and the kernel.
For clang, the goal is to add additional functions to DWARF where these functions are not in the current DWARF or their signatures have changed. Examples include:
- an original func 'void foo(int a, int b)' becoming 'void foo(int b)', so a new 'void foo(int b)' will be encoded in DWARF;
- functions like bar.llvm.<hash> in LTO mode;
- functions with struct/union arguments where these arguments become two (or more?) separate arguments (e.g., a 16-byte struct becomes two actual arguments on x86_64), or become a pointer to the struct/union.
In pahole, the new functions will be added to vmlinux.BTF. If an existing func signature changed, the signature-changed func is added to vmlinux.BTF and the old one is discarded. The goal is to represent actual function signatures in vmlinux.
With the new vmlinux.BTF, we can have true signatures available to users for fentry etc. We can also handle functions like bar(...) (bar.llvm.<hash> in vmlinux.BTF) properly in libbpf (e.g., through regex matching etc.). gcc functions like bar.isra, bar.constprop, etc. can be handled as well if the true signature is available.
This talk will introduce task local data, an abstract storage type built on top of task local storage map for sharing thread-specific data between user space and BPF programs. A motivational use case is allowing user space programs to pass hints to sched_ext scheduler for better scheduling decisions.
Task local storage map supports a special field, UPTR, to allow sharing user memory with BPF programs efficiently. However, using this feature directly presents challenges, especially when multiple user space and BPF programs are involved. For example, how to manage the layout of the storage without tightly coupling these components? How to manage the lifecycle of the memory? Task local data was developed to address these challenges. In this talk, I will discuss its design and share some use cases demonstrating the benefits.
The page cache is central to the performance of many applications. However, its one-size-fits-all eviction policy may perform poorly for many workloads. While the systems community has experimented with new and adaptive eviction policies in non-kernel settings (e.g., key-value stores, CDNs), it is very difficult to implement such policies in the kernel. We design a flexible eBPF-based framework for the Linux page cache, called cache_ext, that allows developers to customize the page cache without modifying the kernel.
We demonstrate the flexibility of cache_ext's interface by using it to implement eight different policies, including sophisticated eviction algorithms. Our evaluation finds that it is indeed beneficial for applications to customize the page cache to match their workloads' unique properties, and that they can achieve up to 70% higher throughput and 58% lower tail latency.
The world urgently needs better AI analysis tools to find AI datacenter cost reductions. eBPF has been used for a decade to help find compute performance wins, and various companies have now been building eBPF tools for AI analysis. This session discusses one such tool: the open source AI flame graphs built by Intel (by us, the talk presenters), which use eBPF for kernel driver instrumentation and the Intel eustall hardware profiler for GPU instruction pointer sampling. The resulting flame graph spans all CPU code (including the OS kernel) and code running on the GPU or AI accelerator. (Prior AI flame graphs have focused on CPU code paths only.) We will explain how it works and the challenges involved, solicit feedback, and discuss possible future kernel work to support this kind of observability. CPU flame graphs are widely adopted and have over 100 implementations, so an AI version was the first tool Brendan wanted to get fully working for AI, but it's just the start.
Training large models requires significant resources, and the failure of any GPU or host can significantly prolong training times. At Meta, we observed that 17% of our jobs fail due to RDMA-related syscall errors that arise from bugs in the RDMA driver code. Unlike other parts of the kernel, RDMA-related syscalls are opaque, and the errors create a mismatched application/kernel view of hardware resources. As a result of this opacity and mismatch, existing observability tools provided limited visibility and DevOps found it challenging to triage – we required a new scalable framework to analyze kernel state and identify the cause of this mismatch.
Direct approaches like tracing the kernel calls and capturing the metadata involved in the systems turned out to be prohibitively expensive. In this talk, we will describe the set of optimizations used to scale tracking kernel state and the map-based systems designed to efficiently export relevant state without impacting production workloads.
Bridging the Observability Gap: Using eBPF for GPU Workload Identification
Modern computing workloads are increasingly offloaded to GPUs, yet our ability to observe and understand the specific tasks running on these accelerators from the host kernel remains limited. This fundamental lack of visibility hinders system administrators, security engineers, and resource schedulers. While existing solutions often rely on application-level telemetry or proprietary vendor tools, they fail to provide a holistic, kernel-level view of GPU activity.
This paper introduces a novel solution that leverages eBPF to gain unprecedented insight into GPU workloads. By attaching eBPF tracepoints to the NVIDIA kernel driver, we can non-intrusively monitor and profile the sequence of driver function calls for any given task. This method captures a rich set of metrics—including the frequency and timing of calls — to build a unique "execution fingerprint" for the workload.
We demonstrate that these profiles are distinct and reproducible. Using a variety of real-world use cases, including machine learning training, inference, and cryptocurrency mining, we show that our eBPF-based approach can reliably identify the underlying workload with high accuracy.
Our findings highlight the power of eBPF as a versatile and potent tool for bridging the critical observability gap between the host kernel and accelerated compute devices.
Widely used for ML workloads, GPUs are typically SIMT accelerators with threads in warps on SMs, organized into blocks, launched as kernels, using multi-level memory hierarchies (registers, shared/LDS, L2, device memory) and limited preemption. This complexity creates rich but challenging behavior patterns for observability and customization. Today, many tracing tools for GPU workloads sit at the CPU boundary (e.g. probes on CUDA userspace libraries or kernel drivers), which gives you host-side events but treats the device as a black box: little visibility inside a running kernel, weak linkage to stalls or memory traffic, and no safe way to adapt behavior in-flight. GPU-specific profilers (e.g. CUPTI, GTPin, NVBit, Neutrino) provide device-side visibility, but they are often siloed from eBPF pipelines, making it harder to correlate with events on CPUs.
We prototype offloading eBPF into GPU device contexts by defining GPU-side attach points (CUDA device function entry/exit, thread begin/end, barrier/sync, memory ops, etc) and compiling eBPF programs into device bytecode (PTX/SPIR-V), with verifier, helper, and map support for on-device execution. Built on top of bpftime, this approach can be 3-10x faster than NVBit, is not vendor-locked, and works with Linux kernel eBPF programs like kprobes and uprobes. This enables GPU extensions like fine-grained profiling at the GPU thread, warp or instruction level, adaptive GPU kernel memory management, and programmable scheduling across SMs with eBPF. It can also help accelerate some existing eBPF applications.
The goal of this talk is to explore the usecases, challenges and lessons learned from extending eBPF's programming model to GPUs.
https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu
This talk aims to introduce the audience to Python-BPF, a project that enables developers to write eBPF programs in pure Python. We allow a reduced Python grammar to be used for the eBPF-specific parts of code.
This improves several aspects of the eBPF ecosystem.
This project is in its early stages, and work is being actively done to integrate most features required for observability use cases into Python-BPF right now.
Python-BPF currently supports a subset of map types and eBPF helpers; however, this list is expanding every week. Most examples given in the BCC tutorial have been ported to Python-BPF, and in some cases have been modified slightly (such as using matplotlib and seaborn to create better visualizations than ASCII histograms). Python-BPF can also run in Jupyter notebooks.
eBPF has emerged as a foundational technology for building observability, networking, and security tooling across modern Linux systems. However, in long-term supported (LTS) and embedded environments—such as automotive or industrial platforms—the deployment of eBPF-based software remains fraught with challenges. These range from kernel version divergence and verifier incompatibilities to inconsistent user-space toolchains and even organizational security policies that prohibit the use of bpf() altogether.
I will explore the real-world pain points of using eBPF in such constrained contexts. Key technical topics will include:
Beyond technical aspects, I will examine cultural and organizational mismatches—such as differing definitions of "backport", lifecycle misalignments between kernels and eBPF tooling, and CI burden across multiple kernel versions.
As part of this discussion, I will introduce a proposed initiative: the "eBPF Embedded Profile" — a structured subset of APIs, helpers, kfuncs, and tooling practices aimed at enabling safe and predictable eBPF use in embedded or production-grade LTS environments. This profile would also include guidance for runtime stability, security vetting, and deployment feasibility and may serve as a stepping stone toward future standardization or conformance efforts.
The eBPF Foundation is rethinking what an open source foundation can be by shifting from simply stewarding projects to actively building an ecosystem around a powerful enabling technology like eBPF.
This session will highlight how investments like security audits, research grants, funding for directed development, face-to-face meeting sponsorship, and documenting the ecosystem's evolution (like the eBPF Documentary) strengthen both upstream kernel development and downstream adoption. Lessons learned from balancing upstream needs with enterprise adoption will be shared to spark discussion on where foundation-led efforts can have the greatest impact and what we should focus on next.
The Kernel Testing & Dependability Micro-Conference (a.k.a. Testing MC) focuses on advancing the current state of testing of the Linux Kernel and its related infrastructure.
Building upon the momentum from previous years, the Testing MC's main purpose is to promote collaboration between all communities and individuals involved with kernel testing and dependability. We aim to create connections between folks working on related projects in the wider ecosystem and foster their development. This should serve applications and products that require predictability and trust in the kernel.
We ask that all discussions focus on identified issues, aiming to find potential solutions or alternatives for resolving them. The Testing MC is open to all topics related to testing on Linux, not necessarily in the kernel space.
In particular, here are some popular topics from past editions:
- KernelCI: Maestro, kci-dev, kci-deploy, kci-gitlab, new dashboard, KCIDB
- Improve sanitizers: KFENCE, KCSAN, KASAN, UBSAN
- Using Clang for better testing coverage: Now that the kernel fully supports building with Clang, how can all that work be leveraged into using Clang's features?
- Consolidating toolchains: reference collection for increased reproducibility and quality control.
- How to spread KUnit throughout the kernel?
- Building and testing in-kernel Rust code.
- Identify missing features that will provide assurance in safety critical systems.
- Which test coverage infrastructures are most effective at providing evidence for kernel quality assurance? How should coverage be measured?
- Explore ways to improve testing framework and tests in the kernel with a specific goal to increase traceability and code coverage.
- Regression Testing for safety: Prioritize configurations and tests critical and important for quality and dependability.
- Transitioning to test-driven kernel release cycles for mainline and stable: How to start relying on passing tests before releasing a new tag?
- Explore how SBOMs figure into dependability.
- kernel benchmarking and kernel performance evaluation
Things accomplished from last year:
- progress on Rust testing
- kci-dev is currently used in production for interacting with KernelCI results
- Follow up discussions on kselftest mailing list about "Adding benchmarks results support to KTAP/kselftest"
- Proposal of kci-gitlab sent to kselftest mailing list
Today the kernel's UAPIs are tested through userspace testcases using the kselftests framework, which provides a uniform build system and output-formatting infrastructure. However, it does not currently provide an out-of-the-box solution to run the tests against the current in-development kernel tree.
I am proposing a framework which allows building the test applications as part of the kernel build ("userprogs") and then running them as part of regular KUnit, reusing the existing tooling. The structured TAP output from the selftests is preserved and seamlessly nested within the regular KUnit output. For developers and CI systems these testcases behave exactly the same as regular kernel-only KUnit tests.
In addition the tests can make use of the nolibc library which enables the compilation of simple test programs even with only a kernel toolchain.
The idea is to share testcases between the new framework and regular kselftests.
Discussion topics:
* What is preventing the merge to mainline?
(In case it is not merged yet by then)
* Which features are missing?
* How exactly to coexist with tools/testing/selftests/?
* How to handle libgcc dependencies?
* How to handle different UAPI ABIs per architecture?
* How to handle complex test programs?
Patch series: https://lore.kernel.org/lkml/20250717-kunit-kselftests-v5-5-442b711cde2e@linutronix.de/
LWN article: https://lwn.net/Articles/1029077/
The kernel testing ecosystem is roughly split into kernelspace tests, commonly implemented using the KUnit framework, and an assortment of userspace tests, the most representative of which are kselftests.
Both types have different goals and are used differently: while KUnit tests are meant to be run in a known scenario on a freshly booted kernel, with little or no interaction and no communication with entities outside the kernel, kselftests are generally more loosely defined and more varied in scope and target, and their interactions with the kernel are mostly restricted to the interfaces it provides, such as system calls.
In this talk I propose a way to fill in the gap between both test types by providing a simple framework that allows any kernel code to define and register kernelspace tests at runtime, and also publishes an interface that lets userspace query these tests, trigger them and collect their results and logs in a safe way.
This opens new possibilities for kselftests and other userspace tests and lets them perform more specialized and advanced tests by having the kernel run specific code that doesn't have a user interface and by performing pre-defined kernel operations before or after an existing test to exercise specific scenarios.
Fuzz testing the Linux kernel with system-call fuzzers has been highly effective, but this approach struggles to reach and test deeply nested internal kernel functions. This leaves significant parts of the kernel’s logic, particularly complex data parsers, under-tested and potentially vulnerable. We introduce KFuzzTest, a novel framework aiming to bridge this gap by directly exposing stateless or low-state internal kernel functions to a userspace fuzzer.
The KFuzzTest architecture is designed to be both powerful and developer-friendly. It provides a simple macro-based API that allows kernel developers to define fuzz test targets alongside their functions, specifying input domain constraints and type annotations. This metadata is compiled into dedicated ELF sections within the vmlinux binary, enabling automatic discovery by the userspace fuzzer. Communication between the fuzzer and the in-kernel test harness is facilitated via debugfs entries.
We have successfully integrated a proof-of-concept with syzkaller, demonstrating how this framework can perform coverage-guided fuzzing on internal kernel functions. This approach empowers developers to write effective, targeted tests for their own code. This presentation will cover the framework’s design, implementation details, and a path forward for upstreaming the work to the Linux community.
Fuzzing the Linux kernel with coverage-guided tools like syzkaller has proven to be an extremely effective method for finding kernel bugs. However, complex subsystems like KVM present unique and significant challenges that standard syscall fuzzing cannot easily address. Fuzzing KVM effectively requires managing complex state across both the host and the guest, and necessitates the coordinated execution of code in both environments.
The biggest hurdle has been the generation of meaningful guest-side code. Randomly generated instruction blobs are difficult to create, trivial to break through mutation, and nearly impossible to reliably test. This fragility has limited the fuzzer's ability to explore deep, guest-driven functionality and device interactions.
This talk introduces SYZOS, a novel framework designed to overcome these challenges. Initially prototyped on ARM64 and now being extended to x86, SYZOS reframes the problem: instead of fuzzing raw instructions, we fuzz higher-level operations within the guest. It consists of a small, immutable C library loaded into the guest that exposes a simple, fuzzer-friendly API. The fuzzer generates a sequence of calls to this API, providing stable, high-level building blocks for complex guest-side actions like interrupt controller setup, privileged register manipulation, and triggering controlled VM exits.
This 15-minute presentation will detail the SYZOS architecture, demonstrate how it enables deeper and more meaningful KVM fuzzing, and share key findings from our work on both ARM64 and x86.
For the past 9 years, syzbot has reported more than 13,000 findings to the Linux kernel mailing lists by continuously fuzzing upstream Linux trees. However, a notable latency often exists between the introduction of a bug and its discovery, complicating and delaying its resolution. Many regressions, including build/boot failures and shallow bugs, can stall the broader fuzzing effort once they land.
To address this, we introduce syzbot ci, a proof-of-concept system engineered to shift fuzzing left in the kernel development cycle. syzbot ci monitors a number of Linux kernel mailing lists, automatically applies incoming patch series to an automatically determined base tree, and initiates a targeted fuzzing campaign specifically on the code paths modified by the patches. The core assumption is that this focused approach can uncover bugs significantly faster than broad, continuous fuzzing.
As of September 2025, the system has already sent reports with findings to over 50 patch series during their review on the mailing lists.
This talk will cover the functionality of syzbot ci and share the preliminary results and insights. The goal is to gather initial feedback from kernel developers and to discuss the future direction and areas of focus for the project.
Manual management of resources, from locks to reference counts, is a persistent source of bugs, resource leaks, and reduced code robustness. Scope-based resource management offers a far more reliable approach by automatically releasing resources when they fall out of scope.
This session will demonstrate the practical application of Coccinelle to automate the transition to scope-based cleanup. We will present examples of our semantic patches (SmPL).
This technique has already been successfully applied to numerous locking and reference counter patterns, improving code clarity and removing a substantial number of manual cleanup blocks across the kernel. While effective, the approach is not without its challenges.
We will discuss current limitations, including how to handle complex nested cleanup scenarios and resources with non-trivial teardown logic.
We invite developers to collaborate on refining these SmPL patterns and extending their capabilities, with the goal of expanding this automated cleanup strategy across the kernel.
The kdevops project automates complex Linux kernel development subsystem testing. Around Q3 we started evaluating advances in generative AI. The experimentation on kdevops shows the project significantly enhances the speed and accuracy of generative AI when extending its features and adding new workflows. This capability was a core design principle. While generative AI may not yet be optimal for all Linux kernel development, we lower the barrier to its use on kdevops by adopting a structured, declarative approach to defining and implementing Linux workflows.
The kdevops project fundamentally acts as a Software 3.0 enabler for Linux kernel development. New workflows are now being added with generative AI, and the entire kdevops dashboard is fully generated by generative AI. Recent efforts also enable CI integration without developers needing to touch any CI files, which was a deliberate design choice given the inherent complexity and debugging challenges of continuous integration configurations.
We will review the lessons learned so far, our progress, and why we can verify that the singularity is here, at least for complex testing.
KernelCI has become a backbone for Linux kernel review across diverse hardware labs.
kci-dev aims to bring that same power directly into developers' workflows, both pre- and post-merge.
kci-dev can also be useful for doing analysis and validation from your terminal.
This testing discussion topic tries to draw a concrete plan to align kci-dev and the KernelCI APIs with kernel developers' and maintainers' workflows.
Possible discussion topics are:
- Where kci-dev sits in the flow
- Features requests/wanted
- kci-dev future planning
- APIs and events (what we have, what is still needed)
- pre-merge and post-merge
- Labs realities
Most of our upstream efforts around kernel quality have thus far tended to focus on functional testing, but performance is also critical to actual user experiences. There are a large number of benchmarks out there, but not much shared tooling or common practice around what or how we benchmark. How can we do better here?
To start the discussion, this presentation introduces Fastpath, a tool Arm has recently open sourced which aims to make it straightforward to generalise benchmarks and provide a front end for data storage and analysis, with an initial focus on identifying and bisecting regressions over a starting collection of benchmarks.
The Power Management and Thermal Control microconference is about all things related to saving energy and managing heat. Among other things, we care about thermal control infrastructure, CPU, and device power-management mechanisms, energy models, and power capping.
This year has mainly been focused on the maintenance of the frameworks, resulting in cleanups that set the scene for more improvements and connect user space to the kernel:
Thermal thresholds have been finished: https://patch.msgid.link/20240923100005.2532430-2-daniel.lezcano@linaro.org
Thermal library and thermal engine skeleton taking benefit of thresholds: https://patch.msgid.link/20241022155147.463475-5-daniel.lezcano@linaro.org
Performance QoS RFC has been posted: https://lore.kernel.org/all/20250505161928.475030-1-daniel.lezcano@linaro.org/
Embedded systems with very high performance and higher integration, and automotive systems with lifespan and reactivity constraints, push the PM frameworks to their limits.
Big topics to be addressed:
A performance QoS framework has been sent as an RFC. It apparently needs more discussion, as the approach does not seem to satisfy all parties. This framework is needed to allow user space to interact with the performance of different devices via a unified interface and to co-exist with kernel decisions.
The generalization of telemetry on embedded systems to capture energy, power, and temperature at a high rate. How can the kernel use these data, and for what?
In the context of electric vehicles, power management is important to ensure the longest lifespan of the hardware. Some operation modes, like sentry mode, challenge several building blocks of power management, such as the resume-from-suspend time.
The energy model can no longer be static in the kernel and must be dynamically adjusted. Depending on the running scenario, there can be a significant gap between the computed and real power consumption, leading to inappropriate kernel decisions.
Embedded systems are looking for a multi-suspend-state mechanism, like a system-wide C-state. It requires more discussion, which was already initiated at the last LPC.
Sensor aggregation did not reach a consensus because there are different perceptions of the goal of the aggregation and its usage, resulting in different implementations.
Key attendees:
Even though the thermal framework has evolved significantly over the past five years, several limitations and open issues remain. The step_wise governor still struggles to strike a better trade-off between performance and avoiding interrupt storms, and thermal-zone mitigations tend to stop too early during system suspend. Moreover, newer hardware provides fine-grained temperature telemetry that the framework does not yet leverage.
This talk will illustrate these limitations with simple diagrams and propose a direction to improve the thermal framework.
The same processor can be made available with different thermal junction (Tj) temperature thresholds by changing packaging characteristics.
Just like voltage and frequency operating points are supported by the opp-supported-hw property in DT, we propose introducing a similar property for differences in the thermal properties of an SoC.
The Energy Model (EM) framework is capable of updating the model's information. The thermal framework is aware of the temperature of the SoC and its internals (CPUs, GPU, etc.). We can create a solution which updates the power values in the EM based on the increased static power (leakage) caused by the heat. The presentation will go through a proposed design for that.
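For intuition, a common first-order model (an assumption for illustration, not necessarily the one the proposed design uses) splits power into a dynamic part and a strongly temperature-dependent static part:

    P(f, V, T) = C_{eff} \cdot f \cdot V^2 + V \cdot I_{leak}(T), \qquad I_{leak}(T) \approx I_0 \, e^{k (T - T_0)}

so as the thermal framework reports higher temperatures, the per-OPP power values in the EM would be rescaled upward accordingly.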
At the last LPC, we discussed PM QoS. However, the implementation proposed this year did not reach consensus; the semantics of PM QoS are perceived differently.
The proposal was for in-kernel actors and userspace to vote on a constraint. Once a constraint is set, if the userspace process exits, the constraint is automatically removed provided no other actors hold the same constraint. When the last actor releases the constraint, it is removed automatically.
Rather than debating PM QoS itself, it may be preferable to step back and revisit the approach: provide a generic mechanism for voting on a specific value. PM QoS could then be built on top of that and define its own units for the voted value (watts, kbps, performance level, etc.).
The talk and subsequent discussion are about defining the semantics of such a voting mechanism—for example, does a kernel vote have higher priority than a userspace vote, or do we want to support out-of-band votes?
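As a strawman for that discussion, such a generic mechanism might look like the following (entirely hypothetical, written as user-space Rust for illustration): each actor votes a value, the effective constraint is an aggregate of the active votes, and releasing the last vote removes the constraint.

    use std::collections::HashMap;

    // Hypothetical generic vote aggregator; the units of the value
    // (watts, kbps, performance level, ...) would be defined by the
    // layer built on top, e.g. PM QoS.
    struct VoteSet {
        votes: HashMap<u64, u32>, // actor id -> voted value
    }

    impl VoteSet {
        fn new() -> Self {
            Self { votes: HashMap::new() }
        }

        fn vote(&mut self, actor: u64, value: u32) {
            self.votes.insert(actor, value);
        }

        fn release(&mut self, actor: u64) {
            // When the last actor releases, the constraint disappears.
            self.votes.remove(&actor);
        }

        fn effective(&self) -> Option<u32> {
            // The aggregation policy is part of the semantics to be
            // defined; here the most restrictive (minimum) vote wins.
            self.votes.values().copied().min()
        }
    }

Questions like kernel-vs-userspace priority would then map to how votes are weighted or ordered inside the aggregator.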
Since the last LPC in Vienna, we have continued to explore how to add support for multiple system-wide low-power states to the Linux kernel. A series [1] has been posted that suggests adding a user-space interface to allow a system-wakeup latency constraint to be specified. The series also shows how the latency constraint can be taken into account during s2idle, especially with regard to CPU/CPU-cluster wakeup latencies for low-power states.
Let's discuss and try to conclude how to move forward!
[1]
https://lore.kernel.org/all/20250716123323.65441-1-ulf.hansson@linaro.org/
Last time at LPC in Vienna, we discussed whether it would be feasible to try to evolve the support for s2idle, allowing a legacy platform/FW that supports only s2ram to make use of s2idle too. To be clear, in this context we are not able to update the FW (it's not always possible to convince vendors to make an update), so the target has been to make adjustments on the Linux kernel side, to trick the FW into seeing what looks like s2ram, while it's (almost) s2idle from the Linux point of view.
Since Vienna I have explored this approach and I have managed to get it working on an ARM64 Hikey board. Assuming there is some interest, I intend to share some of my findings and in particular we could discuss the most controversial parts for getting acceptance of the needed changes in the upstream Linux kernel.
While many Linux distributions don't officially support hibernation, OEMs must validate thousands of successful hibernation cycles during hardware certification. This creates significant testing bottlenecks, particularly on multi-CPU systems with complex device configurations where failures can occur intermittently, requiring days of continuous testing to reproduce edge cases.
Some Optimization Areas:
1. CPU Management Efficiency
All CPUs are brought online after hibernation snapshot creation, before saving the image. This can be optimized: limiting the number of CPUs brought back online reduces hibernation time, especially on systems where CPU initialization is slow.
2. Enhanced Fault Tolerance
Any single device failure during resume immediately discards the hibernation image by unmarking the swap signature. There is a possibility to delay signature unmarking to allow retry attempts. I have also experimented with backup-device fallback mechanisms.
Questions for Discussion:
Would addressing these optimization areas provide meaningful value?
On battery-driven platforms, flash-based storage devices like NVMe/UFS/eMMC/SD are used in combination with carefully designed support for platform power management. Yet, a flash-based storage device typically contributes significantly to the energy budget of a platform. That means it's highly important to manage them in an optimized way; otherwise we may waste a lot of energy or, in the worst case, if we get things wrong, we could even damage the device.
In this regard, it's highly problematic that we lack a common policy for how to deal with low-power states for these storage devices, especially since they share the same kinds of problems while their respective subsystems treat them quite differently. Some tend to pick the deepest possible low-power state, while others prefer leaving the device fully powered on, even during a system-wide sleep.
Let's discuss these problems in more detail and in particular let's see if we can find a way to start moving things into a more common ground.
For DT-based platforms, fw_devlink allows us to track supplier/consumer dependencies, which helps drivers avoid returning -EPROBE_DEFER while probing their devices. Moreover, fw_devlink provides the so-called ->sync_state() support, allowing the driver for a supplier device to receive a notification through its ->sync_state() callback once all of its consumer devices have been probed successfully.
The ->sync_state() support has recently been added to the generic PM domain (aka genpd) and is available since v6.17-rc1. However, several problems have been reported since its introduction, and some still remain to be addressed by more general improvements to fw_devlink.
In parallel, we are also working on adding ->sync_state() support to other subsystems, such as the common clock framework. Hence it's becoming increasingly important to address these generic problems in fw_devlink, to avoid issues similar to those we now have with genpd.
Let's discuss these problems in more detail and try to find solutions for how to fix them.
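As background, here is a minimal sketch of a supplier driver wiring up ->sync_state(); the driver name and cleanup action are illustrative. The callback fires once all consumers described by firmware have probed, which is when boot-time constraints, such as keeping all power domains on, can be dropped safely.

#include <linux/platform_device.h>

static int foo_pd_probe(struct platform_device *pdev)
{
	/* Register the power domains this device supplies. */
	return 0;
}

static void foo_pd_sync_state(struct device *dev)
{
	/* All consumers have probed: release boot-time constraints. */
}

static struct platform_driver foo_pd_driver = {
	.probe = foo_pd_probe,
	.driver = {
		.name = "foo-power-controller",
		.sync_state = foo_pd_sync_state,
	},
};
module_platform_driver(foo_pd_driver);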
We’d like to propose bringing back the RISC-V Microconference at Linux Plumbers 2025. As the RISC-V ecosystem continues to grow, so does the importance of having a space where developers, hardware vendors, toolchain maintainers, and distro folks can come together to solve real-world problems. This microconference has always been a great venue for open, technical discussions that help move RISC-V Linux support forward — from core architecture work to platform enablement and userspace coordination. In general, anything that touches both Linux and RISC-V is fair game, but the conversation often centers around a few common themes we have been following:
Supporting new RISC-V ISA features in Linux, especially vendor-specific or upcoming standardized extensions
Enabling RISC-V-based SoCs, which often involves working across various Linux subsystems in addition to the core arch/riscv code
Coordinating with distributions and toolchains to ensure consistent and correct userspace-visible behavior
Possible Topics
It’s still early to lock down a full agenda, but a number of topics that have been circulating on the various mailing lists lately are:
The issues with WARL discovery
Why CFI is not merged yet (if not merged by then)
ACPI enablement progress and remaining gaps (e.g., ACPI WDAT watchdog)
How to handle RVA profiles in the Linux kernel?
QoS
Key Stakeholders
Sorry if I missed anyone, but here’s a quick list of folks who’ve regularly shown up and helped keep the conversation moving at past RISC-V MC:
RISC-V contributors: Palmer, Atish, Anup, Conor, Sunil, Charlie, Bjorn, Alex, Clément, Andrew, Deepak — and probably a few more I’m forgetting.
SoC folks: Arnd, Conor, Heiko, Emil, Drew — with more SoC families bringing up RISC-V support, this group keeps expanding.
We’ve also consistently had great input from contributors across other architectures like arm, arm64, ppc, mips, and loongarch. Since we all end up touching shared code (especially in areas like drivers or platform support), these cross-arch discussions have been super valuable — and we expect even more of that this year as shared SoC platforms become more common.
Accomplishments post 2024 Microconference
Ftrace improvements [1]
System MSI support [2]
RIMT patches available on lore[3]
CFI series is at v17 [4]
[1] https://lore.kernel.org/linux-riscv/174890236299.925497.5731685320676689356.git-patchwork-notify@kernel.org/#r
[2] https://lore.kernel.org/linux-riscv/20250611062238.636753-13-apatel@ventanamicro.com/
[3] https://lore.kernel.org/linux-riscv/20250610104641.700940-1-sunilvl@ventanamicro.com/
[4] https://lore.kernel.org/linux-riscv/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@rivosinc.com/#r
The number of RISC-V extensions is ever increasing. To manage the wide variety of extensions that are available to hardware vendors, RISC-V International (RVI) has introduced "profiles" that define groupings of extensions for different classes of hardware.
The currently relevant profile is named RVA23 and specifies a set of extensions that supervisor and userspace software vendors can rely on for general use on performance focused hardware. Key aspects of RVA23 are:
The RISC-V kernel has thus far conformed to the idea that it should be possible to build a single kernel, supporting every extension, that runs on any RISC-V hardware. To accomplish this:
The C code in the kernel is compiled with minimal extensions, namely rv64imac. In order to use any other extension, the routine must be written in assembly and selectively enabled at runtime based on whether the extension is present in the ISA string passed into the kernel by the firmware. Select C files may be compiled with other extensions, such as fd.
The assembly code in the kernel is assembled with rv64imafdvc. Similarly to C, any other extensions need to be selectively enabled at runtime. For code that executes without knowledge of what is in the ISA string, the misa CSR is read to determine whether the relevant extensions, namely fdv, are supported by the hardware.
One of the primary issues with this is that the kernel C code is not compiled with any of the performance-enhancing extensions, such as bitmanip. Another issue is that the kernel has $2^n$ runtime configurations depending on which extensions are in the ISA string, where $n$ is the number of runtime extensions. For each of these configurations, either a static branch is set, an alternative flipped, or the ISA string checked -- introducing more performance penalties and extra developer overhead. For more invasive extensions that change the userspace ABI, dynamic patching is not always feasible, so separate kernels must be compiled.
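For illustration, the runtime-gating pattern looks roughly like the sketch below, using the kernel's extension-query helper available in recent kernels; the Zbb check and the op functions are illustrative. Every such site costs a static branch or alternative, and together these sites make up the $2^n$ configuration space.

#include <asm/cpufeature.h>
#include <asm/hwcap.h>

extern unsigned long do_op_zbb(unsigned long x);	/* assembly variant */

static unsigned long do_op_generic(unsigned long x)	/* rv64imac-only C */
{
	return x;
}

static unsigned long do_op(unsigned long x)
{
	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
		return do_op_zbb(x);
	return do_op_generic(x);
}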
Adapting the kernel to support RVA23, and future profiles, can alleviate many or all of these issues. When RVA23 can be assumed, the kernel can be compiled with performance enhancing instructions, skip checking misa or the isa string for the large majority of extensions, and confidently support extensions that change userspace ABI.
This talk will explain how this is possible and the proposed patches to achieve this goal of a RVA23 compatible kernel that is able to take advantage of all of the benefits.
A RISC-V ISA has a lot of variables, and the ISA string describes only a small subset of them, so some of the remaining ones are currently discovered by directly interacting with the ISA implementation through trial and error (WARL).
WARL hinders virtualization because the discovery is done through registers that we don't want to trap and emulate for performance reasons, and there is otherwise no architectural control over the WARL behavior, essentially throwing the point of hardware-accelerated virtualization out of the window.
It is a bit late to get rid of WARL behavior in the ISA, and there wasn't any interest in adding ISA control over the behavior...
I think the best we can do is to tactfully ignore that WARL exists, and only write known legal values, but that requires a more complete ISA description.
The Unified Database project should eventually describe the whole ISA, so we could derive a new DT structure for the ISA from its YAML.
I'd like to have a discussion about the desired DT format, and the direction in general, as we can also choose to discover the ISA through SBI (cpuid style) or even more outlandish methods.
It’s the elephant in the room: in the Linux kernel, RV64 has become significantly more popular and better supported than its smaller sibling — the RISC-V 32-bit platform.
There have been multiple open discussions about dropping RISC-V 32 support to "liberate" kernel development.
However -- and it’s a big however -- many people actively use RISC-V 32 Linux in production, and some of them urgently need features like HIGHMEM.
Moreover, many RV32 Linux users operate without an MMU, and even without the Atomic extension: use cases that would greatly benefit from mechanisms like the good old kuser_helpers on pre-v5 ARM and SuperH.
Unfortunately, such proposals are often shrugged off in community discussions.
This has become a source of agony for companies that actually manufacture RV32 Linux ICs, for example HPMicro’s HPM63xx series and Allwinner’s V821.
In this session, I will introduce a proof of concept showing that splitting arch/riscv into arch/riscv32 and arch/riscv64, just like arm and arm64, would benefit everyone involved.
Kernel control-flow integrity RFC patches [1] are out. They use the existing hooks from the shadow call stack config for RISC-V hardware-assisted shadow stacks. Forward CFI is finer-grained CFI using a toolchain that matches landing pad labels between call sites and taken targets. The talk will focus on the following emerging challenges, proposed solutions, and further discussion/comments on them.
Forward cfi
- How to co-exist with execution contexts sharing S-mode without awareness of landing pad. Two examples here are UEFI runtime services and loadable kernel modules.
Backward cfi
- How to do faster shadow stack allocations. Kernel shadow stack creation requires that direct mappings also be unmapped, so that an attacker doesn't get an alternate way of writing to the shadow stack. This means TLB shootdowns on conversions of vmalloced memory <--> shadow stack. Similarly, returning shadow stack memory back to vmalloc requires it to become RW again. Changing permissions and TLB shootdowns will lead to a slower fork path.
Common topic to both fcfi and bcfi
- eBPF, tracing and probes
- policy on enabling and lockdown
[1] https://lore.kernel.org/all/20250724-riscv_kcfi-v1-0-04b8fa44c98c@rivosinc.com/
Some of the next generation of RISC-V SoCs are expected to have QoS (Quality-of Service) functionality to control and monitor the usage of resources such as cache capacity and memory bandwidth. The RISC-V Quality-of-Service Identifiers (Ssqosid) extension [1] adds the srmcfg CSR to configure a hart with two identifiers: a Resource Control ID (RCID) and a Monitoring Counter ID (MCID). These identifiers accompany each request issued by the hart to shared resource controllers. RISC-V Capacity and Bandwidth Controller QoS Register Interface (CBQRI) specification [2] allows resource allocation to be controlled and monitored.
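For a feel of what hart configuration involves, a sketch of programming the srmcfg CSR is shown below. The CSR address and field offsets follow our reading of the v1.0 spec cited above, but treat them as assumptions to be checked against the spec.

#include <asm/csr.h>

#define CSR_SRMCFG		0x181	/* assumption: check the Ssqosid spec */
#define SRMCFG_RCID_SHIFT	0	/* Resource Control ID field */
#define SRMCFG_MCID_SHIFT	16	/* Monitoring Counter ID field */

static inline void set_qos_ids(unsigned long rcid, unsigned long mcid)
{
	unsigned long v = (rcid << SRMCFG_RCID_SHIFT) |
			  (mcid << SRMCFG_MCID_SHIFT);

	csr_write(CSR_SRMCFG, v);	/* tags subsequent requests from this hart */
}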
Intel and AMD have had QoS features on x86 for many years, and there is an existing user interface in Linux: the resctrl virtual filesystem [3]. There was a discussion session on resctrl at LPC 2023 led by Peter Newman [4][5].
ARM has also introduced a similar QoS specification called MPAM (Memory System Resource Partitioning and Monitoring) [6], and it is now present in some ARM64 server chips. resctrl had the historical problem of having been implemented in arch/x86. ARM kernel developer James Morse led a multi-year effort to decouple resctrl from x86 and move it into fs; the final move happened in 6.16 when fs/resctrl was created. James is now working on upstreaming support for MPAM [7].
CBQRI and MPAM are much more flexible than the existing AMD and Intel QoS capabilities. For example, the resctrl MB resource implicitly assumes that memory bandwidth domains are the same as the L3 domains. Both James and I have tried to take the path of only implementing the aspects of MPAM and CBQRI that 'look like a Xeon' [7]. This did present an awkward situation for the CBQRI proof of concept [8], where extra L3 domains are created for each memory controller so that its bandwidth can be monitored.
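To illustrate the coupling: in a resctrl group's schemata file, the MB (memory bandwidth) resource is keyed by the same domain IDs as the L3 resource, as in this example for a hypothetical two-socket x86 system:

# /sys/fs/resctrl/<group>/schemata
L3:0=7fff;1=7fff    # cache portion bitmask per L3 cache domain
MB:0=100;1=100      # bandwidth percentage, keyed by the same L3 domain IDs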
Resource types in resctrl
- Should a new resource type be added to resctrl to better fit CBQRI bandwidth controllers?
- Anything beyond cache and memory bandwidth? PCI bus, etc?
DT Bindings
- What should the DT bindings look like for capacity controllers and bandwidth controllers?
- The proof of concept had bindings that allowed a cache controller driver and a memory controller driver as a temporary measure to allow the CBQRI code to be exercised
- Does anyone who is implementing CBQRI have any input on what DT bindings should be?
- Bigger question: do people actually care about DT?
- There is now support for the ACPI RQSC [9] table so it may be the case that everyone expects to use that instead of DT?
References
1. https://github.com/riscv/riscv-ssqosid/releases/tag/v1.0
2. https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
3. https://docs.kernel.org/filesystems/resctrl.html
4. https://lpc.events/event/17/contributions/1567/
5. https://www.youtube.com/watch?v=j6SB_-CeFHo
6. https://developer.arm.com/documentation/107768/0100/Overview
7. https://lore.kernel.org/all/ee08ba7e-2669-447f-ae04-5a6b00a16e77@arm.com/
8. https://lore.kernel.org/linux-riscv/20230419111111.477118-1-dfustini@baylibre.com/
9. https://lf-rise.atlassian.net/wiki/spaces/HOME/pages/433291272/ACPI+RQSC+Proof+of+Concept
Upstreaming kernel support traditionally happens only after silicon becomes available, but this approach often delays software enablement and ecosystem readiness. For the first time in the RISC-V world, we are tackling the challenge of pre-silicon kernel upstreaming—enabling Linux kernel features ahead of actual chip availability.
In this session, we will share the methodology, toolchains, and collaborative workflows that make this possible, including the use of simulation platforms, pre-silicon verification environments, and CI/CD integration for early kernel testing. Attendees will learn how these efforts accelerate software-hardware co-design, reduce bring-up cycles, and ensure that by the time silicon arrives, the kernel is already upstream-ready.
This pioneering approach not only shortens time-to-market but also sets a new model for open source hardware-software collaboration in the RISC-V ecosystem.
Key Takeaways:
- Why pre-silicon kernel upstreaming is a game-changer for RISC-V.
- The tools and processes used to validate and upstream before silicon.
- Lessons learned and best practices for collaborating with the open source community.
Background
When we bring up a RISC-V board from a chipset vendor, the kernel log cannot give us enough detail about what happens inside the kernel; it simply does not contain sufficient debugging information. Because of this, a vmcore is necessary to understand what is really happening inside the kernel.
For binary analysis, many Linux developers use the vmcore file. They usually enable KDUMP, then extract the vmcore from the device, and finally analyze it. The vmcore is a kernel dump that preserves a memory snapshot from the exact moment the crash happened.
It is more important to show the process of enabling KDUMP and extracting vmcore on real RISC-V hardware boards, such as VisionFive2, than only demonstrating it on QEMU. QEMU is useful as a reference board, but it is not the same as the real boards that engineers use in production.
Challenge
In this session, we will first introduce the method and patches for extracting a vmcore from a RISC-V device (e.g., VisionFive2). We will also provide a demo of analyzing the vmcore with the crash utility.
When we enabled KDUMP, the vmcore was not extracted immediately. We had to apply specific patches to solve this problem, which will be discussed.
Another issue appeared when loading the vmcore into the crash utility: some important information was not printed correctly. We developed and applied a patch set to make the debugging information display properly.
Future Work
We plan to extend this work for wider usage. Many silicon vendors face unexpected issues when they develop RISC-V devices. In such cases, the vmcore is critical because it provides meaningful signatures such as CSRs and kernel data structures.
As our future tasks:
sched_ext[1] is a Linux kernel feature which enables implementing safe task schedulers in BPF, and dynamically loading them at runtime. sched_ext enables safe and rapid iterations of scheduler implementations, thus radically widening the scope of scheduling strategies that can be experimented with and deployed, even in massive and complex production environments.
This MC is the space for the community to discuss the developments of sched_ext, its impact on the community, and to outline future strategies aimed at improving the integration with the other Linux kernel subsystems.
Last year the sched_ext MC proved highly productive in facilitating coordination with distribution maintainers, allowing us to clarify their requirements and ease potential maintenance burdens. This collaboration directly contributed to upstream changes, including patches such as [2].
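For orientation, a minimal global-FIFO sched_ext scheduler in BPF looks roughly like the sketch below, in the style of the scx_simple example from the scx repository [1]; helper names follow the upstream tree at the time of writing and may differ across kernel versions.

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

void BPF_STRUCT_OPS(fifo_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Queue everything on the built-in global DSQ with the default slice. */
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops fifo_ops = {
	.enqueue = (void *)fifo_enqueue,
	.name = "minimal_fifo",
};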
Ideas of topics to be discussed include (but are not limited to):
- Use of BPF arenas for task/CPU context sharing between kernel, BPF, and user space
- Composable schedulers/scheduler libraries with BPF arenas
- Deadline server(s) for the SCHED_EXT class
- Integration with other scheduling-related features (RCUs, proxy execution, PREEMPT_RT, etc.)
- Potential integration with other Linux subsystems (e.g., Rust-for-Linux)
- Scheduling for gaming and latency-sensitive workloads
- User-space scheduling
- Tickless scheduling
- Tools and benchmarks to analyze and understand scheduler activities
While we already have a tentative schedule with existing talk proposals to cover the topics mentioned above, we are also planning to open a public CFP to accept additional topics to discuss. Time permitting, we are open to readjust the schedule to accommodate further discussions that are relevant to the Linux community.
[1] https://github.com/sched-ext/scx
[2] https://lore.kernel.org/all/20240921193921.75594-1-andrea.righi@linux.dev/
This talk will kick off the sched_ext MC session with a brief overview of the project's current state: what features are available, what's missing and what remains under development.
We'll also look ahead to discuss gaps in the framework, ideas yet to be explored, and how we envision the sched_ext community growing.
The goal is to align contributors and spark discussions around priorities for the next development phase.
In this talk, we will explore the challenges and opportunities in improving the interoperability of sched_ext BPF schedulers with the rest of Linux, in particular the existing scheduler code as well as other subsystems. While sched_ext BPF schedulers offer powerful and flexible scheduling capabilities, their integration with other kernel components can often be fragmented and complex. This talk will cover existing methods for interacting with the kernel, such as kfuncs, and provide practical examples of how schedulers and the kernel can be extended for better interoperability.
Proxy Execution provides a generalized form of priority inheritance, which leaves mutex-blocked tasks on the run-queue. Then if the scheduler tries to run a mutex-blocked task, it will instead run the mutex owner on the blocked task's behalf, so the mutex can be released.
In order for this to work, we introduced the idea of split contexts (scheduler and execution), tracking both the task that was initially selected to run (the donor - scheduler context), and the task that actually runs (the execution context).
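Conceptually, and this is a sketch only (task_is_blocked() and mutex_owner_of() are illustrative names, not the kernel's actual API), the selection step becomes:

static struct task_struct *resolve_exec_context(struct rq *rq,
						struct task_struct *donor)
{
	/* The donor keeps its runqueue slot and scheduler accounting. */
	if (!task_is_blocked(donor))
		return donor;
	/* Execution context: run the mutex owner on the donor's behalf. */
	return mutex_owner_of(donor);
}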
However, running tasks on behalf of other tasks has the potential to confuse sched_ext schedulers. So currently they are mutually exclusive via Kconfig.
I'm currently looking into what promises are made to sched_ext schedulers, and what might be needed to ensure sched_ext schedulers can be used along with Proxy Execution. In the talk, we'll discuss what I have found and, hopefully, potential solutions.
Discussion Bait: I'm very much hoping that by merging sched_ext, and exposing its interfaces to BPF, we haven't effectively frozen the scheduler's pre-Proxy Exec behavior as a user ABI.
sched_ext has guardrails in the kernel and lots of examples in BPF for how to schedule tasks effectively. We use sensible defaults for idle tracking and NUMA-aware masks, and prevent you from losing track of tasks in BPF. But what happens when you try to schedule badly?
scx_chaos builds on top of scx_p2dq, another sched_ext scheduler. It adds options for introducing delays, randomly decreasing CPU frequency, inverting mutexes, and slowing down kprobes. This helps with hunting down race conditions or points of contention, and reproducing the conditions that create them. In this MC session we will discuss, with examples:
This talk will present our progress on arena-based data structures for quickly evolving scheduler abstractions (DSQs, CPU topology).
We currently write scheduling algorithms in terms of operations on primitives provided by the kernel (BPF hash maps/arrays, CPU bitmasks, DSQs). Adding new operations to these primitives is work-intensive because it requires modifying the underlying kernel data structure. It also possibly changes the sched_ext BPF API, breaking backwards compatibility. The workaround has been to open-code the functionality in BPF, often at a performance cost. For example, finding and removing a task from a DSQ requires linearly traversing it first.
BPF arenas provide a framework for developing and evolving scheduler abstractions. They enable writing recursive data structures directly as BPF libraries and lift the limitations imposed on developers by regular BPF code. These arena-based data structures let us introduce and evolve scheduler abstractions without kernel modifications.
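As a hedged sketch of the approach, assuming the BPF_MAP_TYPE_ARENA map type and the bpf_arena_alloc_pages() kfunc available in recent kernels (the node type is illustrative): pointers into the arena are valid in both BPF and user space, which is what makes shared, evolvable data structures possible without new kernel primitives.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define __arena __attribute__((address_space(1)))

void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr,
				    __u32 page_cnt, int node_id,
				    __u64 flags) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE);
	__uint(max_entries, 1024);	/* arena size in pages */
} arena SEC(".maps");

struct node {
	struct node __arena *next;
	__u64 tid;
};

/* Allocate an arena-backed node; the pointer is also usable from user space. */
static struct node __arena *node_alloc(void)
{
	return bpf_arena_alloc_pages(&arena, NULL, 1, -1 /* NUMA_NO_NODE */, 0);
}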
This talk will go over:
Specific instances of arena-based abstractions
Thread placement on machines with complex cache hierarchies (such as AMD CPU Core Complexes (CCX’es)) requires careful management for optimal performance. Unlike NUMA domains, which are large enough that hard partitioning is a viable strategy, these chiplet domains are too small to schedule efficiently without a means of enforcing some degree of soft affinity. Spillover of threads to remote CCXs should be a managed exception, permitted only under specific load conditions, and the resulting spread must be minimized and carefully managed.
This presentation will detail an application of extensible scheduling that leverages BPF and a userspace agent to achieve this fine-grained control. Our policy implements a per-CCX runqueue and utilizes an asynchronous bin-packing algorithm to dynamically assign and manage the soft affinity of thread groups. Additionally, the scheduler also employs heuristics to intelligently decide when and where threads should spill out from their preferred CCX during load spikes. While the concepts here are largely generic to any workload, we will primarily consider how it intersects with VM scheduling, which is our current application.
We will demonstrate how the userspace component enables sophisticated policy management through tunable parameters and complex accounting. The architecture ensures fairness among competing threads while maintaining high throughput and cache locality. We will discuss the design, the BPF-userspace interaction, and the performance benefits of this approach.
Finally, we wish to open a discussion with the community on best practices for tuning such policies and to explore potential improvements to our design.
We present one of the first deployments of sched_ext to a large fleet of AI training hardware composed of multi-CPU-socket systems with attached NVIDIA GPUs. GPU training workflows run frequent synchronization across all the training processes, which makes them extremely sensitive to task-scheduling micro-delays that prevent work from being dispatched to the GPUs. In addition, the training systems run several components of the stack, like data loading, preprocessing, and model checkpointing, on the CPUs, which increases scheduling congestion. We used sched_ext with a user-space scheduler (scx_layered) and deployed it to the entire Reality Labs GPU fleet with tens of thousands of GPUs. We were able to improve the GPUs’ compute-unit utilization on certain model types by 9% and reduce fleet training cost. The presentation describes our journey in identifying the latency-critical system tasks, developing the scheduler, ensuring resource isolation, debugging corner cases, and monitoring performance across the entire fleet.
Optimizing GPU bound workloads with sched_ext via scx_layered
In this talk, I will discuss how to optimize GPU-bound workloads through the use of the sched_ext scheduler scx_layered, and how API changes could make this simpler.
I will use a well-understood open source GPU benchmark job (something like mnist or resnet) and a common CPU-bound open source workload (something like compiling the Linux kernel, or Chromium) to demonstrate how workload-customized scheduling policies, such as those scx_layered enables, can be leveraged to optimize workload run time and system resource utilization.
After this brief overview, I will highlight the challenges encountered while optimizing such workloads with scx_layered, and ways in which improved APIs and/or tooling could simplify use cases like this.
Some particulars I would like to discuss with kernel developers are the following:
If there could be a better way to confirm the verifiability of scheduler code across a range of kernels/hardware types than running the scheduler code on each hardware/kernel combination.
If there could be APIs enabling easier association of TIDs with GPU devices or NUMA nodes.
If there could be a kernel-side API enabling the scheduler to set mempolicy.
The LAVD scheduler is a sched_ext scheduler designed to optimize latency and energy efficiency, with an initial focus on gaming workloads. This talk will present the current state of LAVD development and explore its future roadmap. In particular, we will discuss how LAVD leverages heterogeneous CPU architectures (Intel P/E cores, ARM big.LITTLE) to improve performance per watt, along with optimization techniques that balance responsiveness and energy consumption. We will also consider how LAVD’s mechanisms can be applied or generalized to other sched_ext schedulers, and outline potential directions for expanding its applicability beyond gaming.
With the proliferation of sched_ext schedulers, including ones that cater to very specific workloads within Meta, there exists a need for a "default" fleet scheduler that "just works" for a wide range of hardware and use cases. SCX_LAVD is one such candidate, being one of the more mature sched_ext schedulers out there, with various heuristics to favor latency-critical threads.
The talk will focus on various challenges and strategies in bringing in SCX_LAVD and trying to run it on large production workloads and large topologies:
How do we handle the large and varied topologies and cache hierarchies that exist in the fleet to take optimal advantage of the hardware?
How do we tune LAVD such that it performs well in throughput bound use cases without sacrificing its latency advantages?
How do we test and stress schedulers to prevent regressions from reaching production?
The Android Micro Conference brings the upstream community and Android systems developers together to discuss issues and changes to the Android platform and their dependencies and interactions with the Linux kernel, allowing for collaboration on solutions for upstream.
Some highlights of progress made since last year’s MC:
Potential discussion topics for this year include:
MC leads:
- Lukasz Luba lukasz.luba@arm.com
- Amit Pundir amit.pundir@linaro.org
- Mostafa Saleh smostafa@google.com
- Sumit Semwal sumit.semwal@linaro.org
- John Stultz jstultz@google.com
- Karim Yaghmour karim.yaghmour@opersys.com
Android boot flow and GBL quick recap
- Current problems
- GBL updates
Android meets FIT (Flattened Image Tree)
- Existing boot headers structures vs FIT
- Adoption proposal
- Expected problems
Android + EFIStub
- UKI (Unified Kernel Image) adoption
- GBL as an EFIStub proposal
In an increasing number of scenarios, S2D (Suspend to Disk) functionality is required or expected on mobile. For example, when a mobile device is running low on battery, it can use the S2D suspend process to save the user’s current state and then enter a power-down mode. Once the battery level is sufficient again, the device can quickly resume the previous state through the S2D resume process. Additionally, some mobile devices may aim to support multiple operating systems, and S2D can help quickly restore the last state of a system when switching between them.
The Linux kernel already supports S2D, but there are still some limitations and impacts when it is used in real-world scenarios. For example, storage-related drivers need to be loaded early during the Linux kernel boot process in order to restore the suspend image from storage. In practice, however, due to system design constraints, storage drivers may be loaded relatively late, making it impossible to restore the suspend image. Additionally, restoring a suspended image with the Linux kernel requires booting the kernel to a specific stage where the environment is sufficient to perform the restore, and the image must be restored page by page, which may result in failing to meet system boot KPI requirements.
To address these issues, we designed a bootloader-side restore mechanism for the suspended image. The basic idea is that the Linux kernel completes the system suspend and saves the image to storage during the S2D process; then, upon a cold reboot, the bootloader is responsible for restoring the S2D image. During this process, the image data can be loaded in blocks rather than page by page, thereby maximizing system boot KPI performance and avoiding issues related to delayed loading of storage drivers.
Running a full-featured Linux VM on Android has been a long-standing desire for developers and power users. This presentation details Ferrochrome, a project that leverages the Android Virtualization Framework (AVF) to run a guest Debian OS with deep integration into the host Android environment.
We will discuss recent advancements, including the implementation of hardware-accelerated graphics based on gfxstream, which now supports an Android host for Ferrochrome. This work also covers seamless sharing of various input devices from Android to the Linux guest. Other key integration points include bidirectional storage sharing, dynamic resource management via storage and memory ballooning, and network port forwarding and tunneling.
In this session, we will explore the technical approach behind these features, present our findings, and invite feedback on the future of virtualization on Android.
In virtualization environments, system reliability and efficient debugging are essential, especially for memory-sensitive trusted virtual machines (VMs). This presentation introduces a ramdump-based solution that captures key memory data when a VM crashes, without rebooting the device. It helps developers quickly analyze system states and resolve issues, improving reliability and maintainability.
Android currently collects telemetry data from devices in the field. While these metrics are important and can indicate overall system health issues, they often lack the low-level system information necessary for finding root causes.
Android has been striving to improve BPF support to enable developers to extend the Linux kernel by creating BPF programs. This development is particularly beneficial in the kernel diagnostics and telemetry domains.
This talk focuses on how BPF programs can be used as lightweight tools to collect more detailed telemetry about the kernel on Android. It also covers the use of BPF as a cheaper replacement for existing dma-buf accounting (CONFIG_DMABUF_SYSFS_STATS) and wakeup source statistics (/sys/kernel/debug/wakeup_sources).
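As a sketch of the idea, assuming the power:wakeup_source_activate tracepoint as the attach point (per-source accounting and the user-space loader are omitted for brevity):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

u64 wakeup_activations;	/* read from user space via the BPF skeleton */

SEC("tracepoint/power/wakeup_source_activate")
int count_wakeup(void *ctx)
{
	__sync_fetch_and_add(&wakeup_activations, 1);
	return 0;
}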
Running Android on RISC-V platforms has been a long-standing goal, filled with technical hurdles and real-world economic evaluation.
I initially proposed the idea at Andes Technology; it didn't pan out, leading me to pursue its realization at SiFive. However, due to org restructuring, I was eventually laid off from SiFive. (Hi Samuel 👋)
Ultimately, I returned to Andes to finish what I started: porting and booting Android on a real silicon-hardened RISC-V test chip platform, Qilai. And we, Andes, showcased it in various RISC-V conventions.
However, Qilai came into this world well before the ratification of the RVA22 and RVA23 profiles, which Google is relentlessly pushing forward as the latest standards with its Cuttlefish.
In this talk, I will run through what we need to do nowadays to squeeze AOSP onto a platform that its upstream shuns.
Additionally, I will explore the question: Do we really need all those fancy extensions for running AOSP?
This talk dives into the progress Google and Linaro have made upstreaming Pixel 6 and the hurdles we’ve faced including:
Lastly, we will talk about how we are integrating the Pixel 6 upstream drivers into android-mainline to improve uprev sustainability.
KUnit is the only unit testing framework in the Linux kernel, yet Android kernel changes are rarely accompanied by KUnit tests. Aside from the relative monotony associated with writing tests, one of the main barriers to more widespread KUnit testing seems to be its inability (perceived or actual) to accommodate the complex use cases that we are developing features for.
When test code doesn't fit neatly around these changes, sometimes that's a limitation of the tooling, but other times it may signal a need to restructure the feature (or even the whole subsystem) under test. Refactoring an existing feature purely to accommodate testing may seem misguided, especially if doing so increases the memory footprint of the associated structures. Still, in many areas of software engineering, code's "testability" is one measure of its quality.
The goal of this talk will be to cover some patterns where KUnit falls short to facilitate a discussion about when the "correct" solution is to add functionality to KUnit and when it's to modify the code under test. Finally, I hope that having people talk about this will trigger them to consider how they might test their in-flight features, and maybe improve kernel test coverage as a result.
MPAM enables fine-grained control over shared resources such as CPU caches, memory bandwidth, and interconnect bandwidth. In a typical memory hierarchy, the data path looks like this:
CPU(L2/L3)/GPU/NPU/... ⇄ NoC ⇄ SLC ⇄ DDR
This structure introduces several components, including a System Level Cache (SLC), between clients and DDR memory. This raises an important question: how can CPU caches and system-level caches be coordinated to optimize overall data path efficiency?
Modern hardware supports tagging each transaction with a client-specific identifier. These tags allow the monitoring unit to track transaction behavior across the system. A userspace daemon can read this monitoring data and, based on current workload scenarios, dynamically adjust cache policies using MPAM. For example, in gaming scenarios where GPU performance is critical, we can create a dedicated MPAM partition for game-related threads, increase the cache allocation for these threads in the CPU cache to prevent interference from other processes and adjust system-level cache usage to prioritize GPU-related data flows.
This can help to achieve better performance and lower power cost. In general, monitoring data plays a crucial role in informing cache policy decisions, enabling adaptive and scenario-specific optimizations across the memory hierarchy.
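In resctrl terms, the daemon's steering action is little more than filesystem writes. A minimal sketch follows; the group name is illustrative.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void isolate_game_thread(pid_t tid)
{
	FILE *f;

	mkdir("/sys/fs/resctrl/game", 0755);	/* create the MPAM partition */
	f = fopen("/sys/fs/resctrl/game/tasks", "w");
	if (f) {
		fprintf(f, "%d\n", tid);	/* move the thread into the group */
		fclose(f);
	}
}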
Android relies on the anonymous shared memory allocator, ashmem, to allocate anonymous (i.e. not file-backed) buffers that can easily and quickly be shared between processes via file descriptors.
Ashmem is an Android specific memory allocator that is implemented on top of the Linux kernel’s shmem subsystem, which allows for fast memory sharing, as all processes that refer to a shmem buffer are able to map and access the same regions of memory.
However, there are several drawbacks to using the ashmem memory allocator:
The ashmem kernel driver is not part of the upstream Linux kernel, therefore does not benefit from any upstream usage/test coverage, and is considered technical debt.
The upstream Linux kernel already exposes a similar facility that allows anonymous memory buffers to be shared via file descriptors known as memfd.
Given that, Android has been making an effort to migrate all existing ashmem usage over to memfd.
This talk is centered around the approach that Android is taking to seamlessly move away from ashmem to memfd, as well as challenges, solutions, and forward facing work, such as:
Maintaining compatibility with older applications and devices.
SELinux support to eventually allow fine-grained access control for memfds.
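For reference, the upstream pattern that replaces a typical ashmem allocation is a sealed memfd; the buffer name and sealing policy below are illustrative.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int alloc_shared_buffer(size_t size)
{
	int fd = memfd_create("my-buffer", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	/* Like ashmem's set-size-once semantics: forbid future resizing. */
	fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW);
	return fd;	/* share over binder/unix socket; consumers mmap() it */
}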
The transition to a 16kB base page size creates a significant compatibility issue for legacy ELFs built with 4kB segment alignment. This misalignment can place Read-Execute (RX) and Read-Write (RW) segments within a single page, which would require insecure RWX mappings. While recompiling is the ideal fix, it is often impossible for apps that depend on unmaintained, closed-source third-party libraries. Consequently, these applications fail to load, presenting an open ecosystem challenge that requires a robust compatibility solution.
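The check at the heart of the problem is small. Here is a sketch that walks a binary's program headers and flags segments whose alignment is too small for 16kB pages (ELF loading and error handling trimmed):

#include <elf.h>
#include <stddef.h>

static int aligned_for_16k(const Elf64_Phdr *phdrs, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (phdrs[i].p_type == PT_LOAD && phdrs[i].p_align < 16384)
			return 0;	/* built for 4kB pages; needs a workaround */
	return 1;
}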
This talk presents an in-depth analysis of this problem and explores the design space for potential solutions. It will discuss:
Content:
Android's transition to a 16kB page size requires that hardware components work seamlessly with 16kB pages in order to get optimal performance. This presentation will focus on hardware and software recommendations for devices running with a 16kB page size.
This section will highlight the hardware design decisions that need to be made to support 16kB page sizes on:
In addition to this, the talk will also cover filesystem recommendations for the various different partitions on an Android device.
Target audience: Android partners aiming to launch devices with 16kB page size support.
(1) A description of the overall topic
The number of areas where Machine Learning (ML) approaches are applied is growing every day. There are already research works and industry efforts to employ ML approaches for the configuration and optimization of the Linux kernel. The use of ML approaches in the Linux kernel is an inevitable step, but there are multiple unanswered questions about how ML can be used in the Linux kernel efficiently. This microconference has the goal of discussing:
1. How can we use ML approaches in Linux kernel?
2. Which infrastructure do we need to employ in Linux kernel for adoption of ML methods?
3. How can we use ML to automate testing, bugs detection, and bugs isolation in Linux kernel?
4. Can we use ML for automated refactoring and bug fix in Linux kernel code?
5. Can we use the whole fleet of Linux kernel deployments for massive and distributed testing of Linux kernel functionality?
6. How can we optimize the Linux kernel by using ML techniques?
The main discussion point is how to make ML techniques work in the Linux kernel without affecting the performance and efficiency of Linux kernel operations.
(2) The list of sub-topics that would be appropriate for the MC
(3) List of key attendees
After last year's BoF on Linux CVEs, several organizations have formed a workgroup aimed at sharing CVE assessments to reduce the toil of isolated triaging (https://github.com/cloud-lts/linux-cve-analysis). In this session we would like to present the progress we have made so far, go over the current challenges, and gather input from the community. This is also the occasion to present the initiative to those who may be interested in joining it, showing the value it can bring.
A space for folks looking for solutions to achieve the missing e2e test coverage for their Linux distributions: a Hardware-in-the-Loop (HiTL) pipeline that can test changes directly on the hardware and provide instant feedback to any CI/CD loop. The system I intend to present can:
Run tests on actual hardware DUTs (Devices Under Test).
Be controlled from the outside using a device/jig running standardized open-source tools/test frameworks.
Scale by default and integrate with CI/CD tools.
And it is completely open source: hardware, firmware, and software.
The system enables us to remove humans from the loop and let machines automate the testing process, so software is released not as fast as possible but as confidently as possible. This is what a Hardware-in-the-Loop (HiTL) pipeline can do for you.
Anyone planning to start a jig, or who has already built a HiTL pipeline, can join the discussion: sharing what not to do, the pitfalls to avoid, and, crucially, reusing the work we have already done to build better, more scalable test systems.
This BoF session will kick off with the work we have done up to now to fully automate releases of an embedded OS for more than 30 boards. It will follow up with feedback and a call for folks to help us build this further.
The aim: Each and every Linux developer can be enabled to develop, test, and patch their distributions at home and contribute back.
The Kernel Development Learning Pipeline (KDLP) project has grown significantly since its inception, successfully addressing the critical shortage of qualified entry-level kernel engineers. This talk will provide a quick background on KDLP, showcase its substantial growth, and outline future plans, including the creation of an open, community-driven textbook and resources for teaching and contribution. We will issue a call to action, inviting the community to participate in shaping the future of Linux kernel education.
KDLP originated from the recognition of a significant business risk: the aging of senior kernel engineers and the lack of a robust pipeline for new talent. KDLP was created and is primarily maintained by Joel Savitz and Charles Mirabile, who are responsible for the core course infrastructure, with significant contributions from the open source community. We aim to develop a comprehensive Linux kernel education, mentoring, and talent development pipeline. This initiative began as a pilot program offering an "Introduction to Linux Kernel Development" course at the University of Massachusetts Lowell, delivered by Red Hat associates and interns. The program has since expanded, aiming to democratize access to kernel engineering knowledge and provide hands-on experience to students.
Rado Vrbovsky's Iteration of the Course
The KDLP course has evolved with each iteration, continually improving curriculum and delivery methods. Rado's iteration of the course marked a significant step forward, incorporating updated materials and enhanced practical exercises. This iteration showcased the program's adaptability and commitment to providing the most relevant and effective training.
New University Partnerships
We have successfully established new university partnerships, extending the reach of the KDLP program. These partnerships allow us to engage with a broader range of students and introduce kernel development to more academic communities. Collaborations with universities in Brno, Israel, and the USA have demonstrated the global impact of KDLP.
Proof of Concept: Success Stories
KDLP has a proven track record of developing students into successful professionals. A total of seven individuals started as students in the program and have transitioned into internships and full-time employment. These success stories highlight the effectiveness of our program in preparing individuals for real-world kernel engineering roles, and demonstrate the value of this program to future engineers who want to contribute to the Linux Kernel community.
Continued Delivery in Partner Universities
We are committed to continuing the delivery of the KDLP course in our partner universities. We aim to deepen these relationships and establish long-term educational programs that foster ongoing development and growth. Future plans also include expanding to other colleges and universities to increase the program's accessibility and scale.
Writing a Textbook for the Course
A significant future endeavor is the creation of a comprehensive textbook for the KDLP course. This textbook will serve as a central resource for instructors, students, and contributors, providing structured learning materials and a guide for understanding and working with the Linux kernel. The goal is to create a living document with the community that will grow with time.
Generating Contribution and Teaching Documents
We will generate documents on how to contribute to the program, via the textbook or course infrastructure, and how to teach the course. These resources will ensure the sustainability and scalability of the program by empowering community members to participate in content creation and knowledge dissemination. These guidelines will enable domain experts to become teachers and share their knowledge effectively, ensuring consistency and clarity in instructional materials.
We invite the community to participate in shaping the future of KDLP. We seek feedback on our educational materials, contributions to the textbook, and participation in developing teaching resources. Your support and involvement are crucial to the continued success of KDLP and the growth of the Linux kernel community. By working together, we can address the skills gap in kernel development and ensure the future of Linux.
resctrl provides a Linux infrastructure to allocate and monitor shared memory and cache resources for higher performance and better QoS. It was first introduced into the kernel for x86 Intel Resource Director Technology (RDT). Later, resctrl accepted AMD QoS. Over the years, resctrl has been widely adopted by CSPs and client customers. There is an ongoing effort for resctrl to accept new RDT and AMD QoS features. Currently, resctrl is working to integrate ARM Memory System Resource Partitioning and Monitoring (MPAM).
The microconference will discuss current development of resctrl including ARM MPAM, Intel memory region based Memory Bandwidth Allocation and Monitoring (MBA/MBM), AMD Assignable Bandwidth Monitoring Counters (ABMC), Cache Monitoring Technology (CMT) based sched_ext, and other related on-going work.
This would be the first microconference for resctrl. With more and more developers from different companies working on resctrl, this microconference would be a good opportunity for the developers to discuss key technical challenges and next steps.
The key people to attend:
Tony Luck tony.luck@intel.com
Reinette Chatre reinette.chatre@intel.com
Dave Martin Dave.Martin@arm.com
James Morse james.morse@arm.com
Babu Moger babu.moger@amd.com
Fenghua Yu fenghuay@nvidia.com
This microconference will center on Nova, the upstream Rust-based kernel driver for NVIDIA GPUs.
Discussion topics will include the design and evolution of the firmware API exposed by the GPU System Processor (GSP), user-space submission interfaces and compute APIs, and interactions with the core kernel (device / driver APIs; locking and lifetimes; memory management APIs). NVIDIA engineers will share their experience around userspace submission interfaces, compute APIs, and the associated (architectural) challenges. This opens the floor for comparing and discussing design trade-offs across existing and emerging drivers, as well as opportunities afforded by Rust in this context.
Potential key participants are members of the Nova development and maintenance team at NVIDIA and Red Hat, contributors from the DRM and Rust-for-Linux communities, and developers working on parallel efforts such as Tyr, the Rust-based GPU driver initiative from ARM and Collabora.
The microconference aims to ensure that Nova remains closely tied to the needs and expectations of the wider graphics / compute stack in the Linux ecosystem, while fostering collaboration around shared challenges in GPU driver design.
By bringing together diverse stakeholders in an open forum, this microconference will encourage meaningful discussions that can lead to actionable outcomes for the future of GPU drivers in the Linux kernel.
BSPs are the essential starting point for new silicon: they demonstrate that hardware works and provide enough software to get products moving.
But BSPs rarely evolve.
Once the kernel they are based on ages, their value declines rapidly unless their features are carried upstream.
The ultimate objective is to bring an entire BSP upstream: many BSPs contain IP blocks already supported in mainline through other SoCs, and the SoC itself may already be partially upstream.
By identifying which parts overlap with existing upstream code, which parts are already integrated, and which remain vendor-only, stakeholders gain a clear map of what is still missing and what must be done to achieve full integration.
This talk explores strategies to analyze and track a BSP and make it evolve.
We will look at source and history comparisons with mainline, using code fingerprinting and fuzzy matching to distinguish true features from backports and refactors.
We will also consider build artifact and ABI checks: by comparing images, modules, and DTBs across BSP and upstream builds, it is possible to filter out unused code and focus on the parts that actually matter.
The goal is not only to determine whether a BSP feature has landed upstream, but also to evaluate how BSPs can be sustained, reused, and fully integrated beyond their initial demo role.
The last several years have seen the creation or revival of many compact debuginfo formats. DWARF debuginfo is large and detailed, and it is usually stripped and shipped separately from the kernel and applications. This means that applications cannot expect their debuginfo to be present at runtime. Compact formats, on the other hand, tend to fit a narrow use-case. Their data are compact enough that they can be shipped alongside, or even within, the kernel and applications. BTF and CTF represent C type definitions, ORC and SFrame allow reliable stack unwinding, Kallsyms can serve as a kernel symbol table, and .gnu_debugdata serves as an expanded userspace symbol table.
While this has implications for kernel debuggers like GDB, Crash, and drgn, it can apply to other areas of Linux. Many tools are designed with the assumption that reliable debuginfo is not usually available, or if it is, it's large and inefficient. This is changing now. For example, perf will soon be using the SFrame format to allow efficient unwinding of userspace stacks, instead of relying on frame pointers or DWARF. What other parts of the Linux ecosystem can be rethought with compact debuginfo formats?
In this session we will go through several ideas and prototypes including debugging tools, makedumpfile improvements, and improvements to userspace APIs. We'll also spend some time discussing and brainstorming other ideas. The goal of the session is not to promote a specific format or application, but instead to explore the possibilities and trigger new ideas.
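As one concrete example of a compact format already consumable at runtime, resolving a kernel symbol needs nothing more than the kallsyms table. A sketch (addresses read as zero unless kptr_restrict permits):

#include <stdio.h>
#include <string.h>

static unsigned long lookup_ksym(const char *want)
{
	char name[256], type;
	unsigned long addr = 0, a;
	FILE *f = fopen("/proc/kallsyms", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%lx %c %255s", &a, &type, name) == 3) {
		if (!strcmp(name, want)) {
			addr = a;
			break;
		}
	}
	fclose(f);
	return addr;
}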
This talk will cover various new features and enhancements in the perf tools including:
as well as ongoing issues on each subject, and more. There are many topics for discussion on improving the profiling and analysis experience, and we also want to discuss kernel interfaces that could help the tooling.
We would like to be able to select chunks of kernel memory for inspection or debug dumping, but only the essentials, so that we can create a core image and inspect the status of the kernel and vital information without creating a full core dump, due to memory and time constraints.
Multiple patch series have been proposed in which the meminspect (previously named kmemdump) API can be used to select parts of kernel memory for generic inspection table generation. How can that be done in such a way that it does not add overhead, does not clutter the kernel code, and is lightweight and easy to enable/disable, while still achieving its goal?
This talk brings forth the pros and cons of different types of approaches, considering different types of memories that the kernel uses: static variables, global exported symbols, or dynamically allocated memory. The discussion then would circle around possible solutions to integrate the annotation of such memory in the kernel, with the aim of finding consensus with the kernel maintainers for the best possible option to achieve the desired goals.
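To make the trade-off concrete, one purely hypothetical shape for such an annotation (the names below are invented for illustration and are not from the posted series) is a declarative macro that compiles away entirely when the feature is disabled:

struct meminspect_entry {
	const void *addr;
	unsigned long size;
};

#ifdef CONFIG_MEMINSPECT
#define MEMINSPECT_STATIC(var)						\
	static const struct meminspect_entry __meminspect_##var	\
	__section(".meminspect") = { .addr = &(var), .size = sizeof(var) }
#else
#define MEMINSPECT_STATIC(var)	/* zero cost when disabled */
#endif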
As the hardware and firmware evolve, the kernel evolution must match the pace, but the selected solution needs to engulf all the requirements from existing users and devised for the future users as well.
Ftrace, the tracing framework in the Linux kernel, has gained useful features and improvements recently. These include a persistent ring buffer, a function-graph tracer with arguments/return values, BTF integration, a remote ring buffer, function/tracepoint probes, watch probes, and many performance enhancements to uprobes and fprobes. In this talk, we will explain these features and show how to use them in practice.
Following up on the initial hazard pointer demonstration [1], the in-kernel hazard pointers have been significantly improved based on extensive feedback and discussions. A status update is therefore warranted.
Besides some improvements/fixes to the normal hazard pointer implementation, we will demonstrate a variant of hazard pointers -- simple hazard pointers (or shazptr) -- that offers an intuitive API and can outperform RCU (including expedited RCU) in scenarios prioritizing updater wait times or memory footprint.
Shazptr achieves simplicity through a shared per-CPU hazard pointer slot and a "wildcard" mechanism, eliminating the need to allocate hazard pointer slots prior to use.
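A conceptual sketch of the reader side follows (this is not the actual kernel API; the wildcard value and slot handling are simplified for illustration):

#include <linux/percpu.h>
#include <asm/barrier.h>

DEFINE_PER_CPU(void *, shazptr_slot);
#define SHAZPTR_WILDCARD	((void *)1UL)

static inline void *shazptr_protect(void *ptr)
{
	/* One shared slot per CPU; nested users fall back to the wildcard. */
	this_cpu_write(shazptr_slot, ptr);
	smp_mb();	/* order the slot store against the protected reads */
	return ptr;
}

/*
 * The updater side scans every CPU's slot and waits until none of them
 * holds the pointer being retired (or the wildcard).
 */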
Applying simple hazard pointers to lockdep accelerates dynamic key unregistration (~20x speed-up [2]) while avoiding additional IPIs. Micro-benchmarks further indicate that shazptr surpasses RCU in updater wait times while maintaining reasonable overhead on the reader side.
We will also cover the future potential of merging shazptr and the normal hazard pointers.
Neural Processing Units (NPUs) are becoming as common as GPUs in embedded SoCs, but Linux lacks a unified NPU subsystem. Current drivers are fragmented, vendor-specific, and often only tuned for vision inference (YOLO, ResNet). At the same time, newer workloads such as LLMs and multimodal models demand more flexible memory management, scheduling, and runtime integration.
This talk demystifies how NPUs work at the subsystem level — from DMA mapping and SRAM partitioning to command queue management. It will walk through case studies of deploying both vision models (YOLOv8) and LLMs (LLaMA3-tiny) on NPUs, highlighting where current Linux subsystems (DRM, V4L2, accel, AI/ML proposals) succeed and where they fall short.
Finally, it shows how Edgeble has applied these learnings in real deployments, on SoC- and PCIe-based NPU engine drivers, by adding quantization and model scheduling.
The session aims to start a discussion around a more unified Linux NPU subsystem, drawing parallels to the GPU evolution, and inviting collaboration with kernel developers, hardware vendors, and OSS communities.
In Linux, testing the power management (PM) subsystem, particularly during suspend crashes, presents unique challenges. Logging mechanisms are often suspended during these events, making it difficult to capture critical information about system behavior. To address this gap, we developed a fuzzing system that combines LibAFL for fuzzing, S2E for external basic block coverage, and an input processor to synchronize and manage events across a virtual machine (VM) and an external board emulating a USB keyboard device passed through to the VM.
Unlike existing USB fuzzing tools like Syzkaller, which focus primarily on the enumeration phase, our system provides a persistent USB connection and allows deep coverage of the kernel during periods when self-coverage is unavailable. While this setup is currently focused on discovering races related to power management events such as suspend, it is designed with modularity in mind, making it extendable to other functionality or to fuzzing other subsystems where a persistent USB device is beneficial.
In development, the first challenge we encountered was ensuring coverage information could be safeguarded during suspend crashes. We initially explored solutions like kernel-level logging and QEMU’s QMP socket-based communication, but these approaches failed as they lost functionality during suspend. We also considered using the Real-Time Clock (RTC) value to retain coverage information, as this was thought to be the only data recoverable. However, this method did not work in a virtual machine (VM) with KVM, where the RTC value is not persistent as it is on a real host.
To address this, we turned to external tools capable of extracting coverage from a running VM kernel. After evaluating options such as Unicorn and kAFL, we chose S2E due to its extensive documentation, ease of modification with plugins, and its support for basic block coverage. S2E allowed us to track kernel execution flow during suspend crashes, enabling us to gather coverage of the kernel even when the system could not self-monitor.
We then configured an S2E project to run a custom kernel image that supports power management events, USB passthrough, and remote suspension commands via SSH. A custom S2E plugin was developed to monitor user-specified program counter addresses and return a coverage map. This map informs the mutator about the executed program counters, helping determine the next sequence of events to generate.
For the fuzzing component, we evaluated multiple options. We first considered using uEmu, which provides a fuzzing plugin for S2E. However, we found it to be quite specific to certain use cases. Instead, we opted for LibAFL due to its flexibility and ease of integration with our modular fuzzing setup, which also facilitates future modifications to the system. Currently, we are using the mutator to generate a simple byte array that the system interprets as different events, such as USB keypresses and suspend events.
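To make the encoding concrete, here is a minimal, hypothetical sketch of how such a byte array could be decoded into input-processor events; the actual encoding used in the setup may differ.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical event encoding: each fuzzer-generated byte selects an
     * event type, with the remaining bits used as a parameter (e.g. which
     * key to press). */
    enum event_type { EV_KEYPRESS, EV_SUSPEND, EV_WAKEUP };

    static void dispatch_events(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            switch (buf[i] % 3) {
            case EV_KEYPRESS:
                printf("send keypress, scancode=%u\n", (unsigned)(buf[i] >> 2));
                break;
            case EV_SUSPEND:
                printf("suspend the VM over SSH\n");
                break;
            case EV_WAKEUP:
                printf("trigger wakeup keypress from the USB gadget\n");
                break;
            }
        }
    }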
To trigger these events and ensure they are properly executed within the system, another process, the input processor, is used. The input processor is responsible for taking the generated input sequence and distributing it appropriately across the VM and the external USB device emulator, connecting to both systems through SSH. Additionally, the S2E plugin and the LibAFL processor utilize shared status flags, allowing events to be synchronized with the input processor.
For the USB device emulation, we chose the ROCKPI 4B board due to its support for USB 3.0 On-the-Go (OTG), enabling the board to act as both a host and a device. The device is emulated using the raw-gadget framework, with modifications to its keyboard example. Although this setup reliably sends keypresses and a remote wakeup signal via a keypress, the remote wakeup from USB is not currently functional.
The remote wakeup issue mainly stems from QEMU’s limited support for advanced power management functionality, since it is normally not needed. This results in missing ACPI tables and an emulated XHCI hub without wakeup functionality. QEMU could potentially be modified to support this, but we currently work around it by using the QEMU QMP socket shell, even though this bypasses the USB channel and thus prevents proper coverage of a remote wakeup event. This is currently the primary limitation of the setup for fuzzing PM events.
Our framework includes a Coverage Tool to configure the coverage guidance of the fuzzing campaign for a specific part of the kernel. The Coverage Tool is a Java-based static analysis tool designed to analyze XML-format representations of source code modules and generate relationships between functions within a given module. Given a directory of XML files, the tool extracts information including caller/callee relationships, line range coverage, and callback associations.
Future work on this solution will focus on resolving the remote wakeup issue and expanding the fuzzing setup with more configurable options, such as specifying suspend timing (e.g., how long to wait after a keypress before suspending) and extending the system to fuzz additional subsystems beyond power management. Additionally, we would like to test different USB devices, such as a mass storage device, and explore whether a fully emulated device on the host at the file system level could be used instead of a physical board.
This work presents a unique fuzzing setup that uses a persistent USB keyboard for deep kernel coverage during suspend crashes and other USB and power management interactions. The modular system integrates S2E for basic block coverage, LibAFL for fuzzing, and an input processor for event handling. This approach provides a valuable tool for uncovering and fixing USB-related bugs that are otherwise difficult to cover, especially during power management events. This solution can also be extended to test other subsystems, devices, and features in the future due to its modular design.
The System Boot and Security Microconference has been a critical platform for
enthusiasts and professionals working on firmware, bootloaders, system boot,
and security. This year, once again, we want to focus on the challenges that
arise when upstreaming boot process improvements to the Linux kernel and
bootloaders. Our experience shows that the introduction of new and/or not
well-known technologies into the kernel is especially difficult. The
TrenchBoot project is a very good example here, but we think others are also
impacted. So, it would be good to get all project stakeholders in one room
and think about what does not work, what can be improved, etc. We are also
happy to hear and discuss what is currently happening in other areas related
to platform initialization and OS boot. In particular, discussing
upstreaming obstacles, not only technical ones, and finding solutions during
the MC can be very valuable for various projects and the audience.
We welcome talks on the following things that can help achieve the goals mentioned above:
- TrenchBoot, tboot,
- TPMs, HSMs, secure elements,
- Roots of Trust: SRTM and DRTM,
- Intel TXT, SGX, TDX,
- AMD SKINIT, SEV,
- ARM DRTM,
- Growing Attestation ecosystem,
- IMA,
- TianoCore EDK II (UEFI), SeaBIOS, coreboot, U-Boot, LinuxBoot, hostboot,
- Measured Boot, Verified Boot, UEFI Secure Boot, UEFI Secure Boot Advanced Targeting (SBAT),
- shim,
- boot loaders: GRUB, systemd-boot/sd-boot, network boot, PXE, iPXE,
- UKI,
- u-root,
- OpenBMC, u-bmc,
- legal, organizational, and other similar issues relevant to people interested
in the system boot and security.
We’re seeing increased adoption of boot security technologies in Linux and utilization of platform root-of-trust mechanisms. There’s also been significant progress in open community efforts around image-based systems, where typically the root partition and/or the usr partition are implemented as signed DM-Verity volumes.
We’d like to demonstrate how to extend the integrity chain from boot to the OS in the general case, using Integrity Policy Enforcement (IPE), DM-Verity, mandatory access control, and a mostly immutable filesystem layout. This is a policy-driven approach to code integrity, drawing upon our experiences with special-purpose hyperscale systems, but aimed at more generalized scenarios and broader community adoption.
We’ll utilize ParticleOS (systemd’s “customizable immutable distribution”) as the base for the Ultraviolet model, extending it with emerging features (like script execution control) to provide an end-to-end measured and attestable code integrity system. We’ll discuss technical challenges in adapting existing workloads to image-based systems and propose some ideas for solving them, ideally without overlays and other hacks.
The intention is to share both an architectural model and a reference implementation, firstly so others can benefit from what we’ve learned, and secondly to foster discussions about how we can work together on raising the security bar in the broader ecosystem.
The LVFS Host Security ID (HSI) has become the de facto standard for measuring
platform security in Linux, with major distributions adopting it to present
security posture to end users. Designed primarily around proprietary UEFI
implementations, HSI may present edge cases for open-source firmware vendors
working with diverse firmware stacks like coreboot and edk2.
This session examines platform security measurement approaches across operating
systems and explores opportunities to enhance the Linux implementation. We'll
discuss potential kernel API extensions to simplify and unify the assessment
of advanced security features, such as SRTM or DRTM.
When booting Linux on PowerVM LPARs, there are unique characteristics vs UEFI-based systems that developers must consider. Expanding on last year's talk that discussed booting without a bootloader, this talk explores additional aspects of booting Linux on Power, current platform status, and potential future enhancements. Topics include verified and measured boot with PQC, proposals for bootloader-less boot support, thoughts on QEMU secure boot, and more. Attendees will learn about some problems and solutions that, although particular to PowerVM LPARs, may provide more general cross-architecture insights.
We have defended our position (cf. expat BoF) to standardize the attested TLS protocol in the IETF, and a new Working Group named Secure Evidence and Attestation Transport (SEAT) has been formed to exclusively tackle this specific problem. We would like to present the work (candidate draft for standardization) and gather feedback from the security community on the desired security goals, so that feedback can be accommodated in the standardization.
Transport Layer Security (TLS) is a widely used protocol for secure channel establishment. However, it lacks an inherent mechanism for validating the security state of the workload and its platform. To address this, remote attestation can be integrated into TLS, yielding what is called the attested TLS protocol. In this talk, we present an overview of the three approaches for this integration, namely pre-handshake attestation, intra-handshake attestation, and post-handshake attestation. We also present insights from formal verification using ProVerif, a state-of-the-art symbolic security analysis tool, to provide high confidence for use in security-critical applications.
Current project partners include Arm, Linaro, Siemens, Huawei, Intuit, Axis, Bonn-Rhein-Sieg University of Applied Sciences, and Barkhausen Institut. With this talk, we hope to inspire more open-source contributions to this project.
The attendees will gain technical insights into the latest developments of standardization of attested TLS protocols in the IETF and will be able to provide feedback on the requirements for their use cases of attestation for confidential computing.
Our thorough analysis shows that pre-handshake attestation is potentially vulnerable to replay, relay, and diversion attacks. On the other hand, intra-handshake attestation is potentially vulnerable to relay and diversion attacks. While post-handshake attestation results in slightly higher latency, it offers better security properties than the other two options, forming a valuable contribution to the TEE attestation ecosystem.
In a nutshell, to provide more robust security guarantees, all applications can replace standard TLS with attested TLS.
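As a rough illustration of how post-handshake attestation can bind evidence to an established channel, the sketch below derives a nonce from the TLS session via the standard exporter interface, shown here with OpenSSL's SSL_export_keying_material(); the exporter label is made up, and the actual mechanics in the SEAT draft may differ.

    #include <string.h>
    #include <openssl/ssl.h>

    /* Both peers can derive the same exporter value from the session
     * secrets; embedding it in the attestation evidence ties the evidence
     * to this particular TLS channel, countering replay and relay of
     * evidence across different channels. */
    static int derive_attestation_nonce(SSL *ssl, unsigned char nonce[32])
    {
        static const char label[] = "EXPORTER-attested-tls-demo"; /* illustrative */

        return SSL_export_keying_material(ssl, nonce, 32,
                                          label, strlen(label),
                                          NULL, 0, 0);
    }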
Oak stage0 is a VM firmware, mainly targeting QEMU microvm and Q35 machines (and compatible VMMs), that is simpler (and less featureful) than the traditional choices of EDK2/OVMF and SeaBIOS. The main purpose of stage0 is to provide a smaller and simpler method of booting confidential virtual machines to reduce the TCB. To that end, stage0 supports AMD SEV-SNP and Intel TDX; stage0 is the first step in how Project Oak provides falsifiable claims about confidential workloads via remote attestation.
We go over the basic design decisions behind stage0, what features it has and equally importantly what features it doesn’t have. We will also touch on issues that we’ve encountered in the Linux kernel, as stage0 is not an EFI firmware.
Stage0 is written in Rust and is available in the Oak repository, https://github.com/project-oak/oak/
Who authenticates Linux? In the age of Azure Entra ID, Okta, Google Workspace, and beyond, the answer is increasingly "not your local LDAP or Kerberos realm." Modern identity providers rely on OAuth2, device compliance, and custom multi-factor authentication (MFA) flows that are fundamentally browser-centric — which sits at odds with how a Linux login works.
PAM was designed decades ago for direct credential checks (think passwords). It's poorly suited to today's delegated, web-driven identity. Windows gets around this by invoking MFA flows in a desktop context, popping up a browser window for the user. Linux can't currently (and arguably shouldn't) run a browser at the login prompt. So how do we integrate modern cloud identity — with MFA, device trust, policy enforcement — into the Linux login experience?
In this talk, I'll demo Himmelblau, an open-source project that glues Azure Entra ID into PAM and NSS. We'll explore the tricky hacks needed to make this work. But this is a bespoke solution for one IdP. Each provider implements their own authentication extensions on top of OAuth2. There's no standard for non-browser-based MFA integration in a PAM/NSS world. Should we start drafting one? Let's discuss what a sane future might look like — and how Linux systems could better support cloud-native identity out of the box.
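For readers unfamiliar with the constraint, here is a minimal, hypothetical sketch of what a device-code style flow looks like when forced through PAM's text-only conversation model. This is not Himmelblau's actual code; poll_token_endpoint() stands in for the IdP-specific polling logic, and the URL and code are illustrative.

    #include <security/pam_modules.h>
    #include <security/pam_ext.h>

    /* Hypothetical helper: polls the IdP's token endpoint until the user
     * completes MFA in a browser on another device; returns 0 on success. */
    extern int poll_token_endpoint(const char *user);

    int pam_sm_authenticate(pam_handle_t *pamh, int flags,
                            int argc, const char **argv)
    {
        const char *user;

        if (pam_get_user(pamh, &user, NULL) != PAM_SUCCESS)
            return PAM_AUTH_ERR;

        /* No browser at the login prompt: print a URL and one-time code,
         * then block until the user finishes the flow elsewhere. */
        pam_info(pamh, "%s", "Visit https://microsoft.com/devicelogin "
                             "and enter code ABCD-1234");

        return poll_token_endpoint(user) == 0 ? PAM_SUCCESS : PAM_AUTH_ERR;
    }

    int pam_sm_setcred(pam_handle_t *pamh, int flags, int argc, const char **argv)
    {
        return PAM_SUCCESS;
    }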
Android boot flow quick recap
- Current problems
- Fastboot
GBL proposal
- Android meets UEFI
- Existing protocols adoption
- GBL custom protocol for Android Boot
Android UEFI Upstreaming
- EFI implementation for LittleKernel
- GBL protocols (EDK2, LittleKernel, U-Boot)
Android Adoption of DRTM - How could the ARM DRTM spec be updated to account for Android boot'isms in a HLOS agnostic manner?
- ARM DRTM adoption of DT? Or transition Android to ACPI?
- What a proposed Android DRTM flow may look like
- What pitfalls does the community see in ARM implementations of DRTM and adoption of ACPI?
Android + EFIStub
- Proposal to adopt UKI now that Android has its own EFI Stub (GBL)
- What concerns / pitfalls are there?
This talk will provide an overview of the current Android boot flow and introduce Google’s GBL initiative for further standardization, along with key technical details. Our team will be happy to discuss the existing Android boot process and potential Mobile/UEFI challenges with the audience.
The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:
- Address Translation Service (ATS)/Page Request Interface (PRI)
- Single-root I/O Virtualization (SR-IOV)/Process Address Space ID (PASID)
- Shared Virtual Addressing (SVA)
- Remote Direct Memory Access (RDMA)
- Peer-to-Peer DMA (P2PDMA)
- Cache Coherent Interconnect for Accelerators (CCIX)
- Compute Express Link (CXL)/Data Object Exchange (DOE)
- Component Measurement and Authentication (CMA)
- Integrity and Data Encryption (IDE)
- Security Protocol and Data Model (SPDM)
These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualisation, and ubiquitous IoT devices.
The kernel code that enables these new system features focuses on coordination between the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), with related kernel interfaces and userspace APIs to be designed in sync and in a clean way across all three subsystems.
The VFIO/IOMMU/PCI MC focuses on the kernel code that enables these new system features, often requiring coordination between the VFIO, IOMMU and PCI subsystems.
Following the success of the LPC 2017, 2019, 2020, 2021, 2022, 2023 and 2024 VFIO/IOMMU/PCI MCs, the Linux Plumbers Conference 2025 VFIO/IOMMU/PCI track will focus on promoting discussions on the PCI core and current kernel patches aimed at the VFIO/IOMMU/PCI subsystems. Specific sessions will focus on discussions that require coordination between the three subsystems.
See the following video recordings from 2024: LPC 2024 - VFIO/IOMMU/PCI MC.
Older recordings are available through the official YouTube channel of the Linux Plumbers Conference and the archived LPC 2017 VFIO/IOMMU/PCI MC web page at Linux Plumbers Conference 2017, where the audio recordings from the MC track and links to presentation materials are available.
The tentative schedule will provide an update on the current state of VFIO/IOMMU/PCI kernel subsystems, followed by a discussion of current issues related to the proposed topics.
The following was a result of last year's successful Linux Plumbers MC:
- The first version of work on solving the complex and pressing issue of secure device assignment, which spans the PCI, IOMMU, and CXL subsystems, has been completed, and a series of patches has been sent out for review to spark more discussion and debate about how to solve this challenging problem.
Tentative topics that are under consideration for this year include (but are not limited to):
- PCI
- Cache Coherent Interconnect for Accelerators (CCIX)/Compute Express Link (CXL) expansion memory and accelerators management
- Data Object Exchange (DOE)
- Integrity and Data Encryption (IDE)
- Component Measurement and Authentication (CMA)
- Security Protocol and Data Model (SPDM)
- I/O Address Space ID Allocator (IOASID)
- INTX/MSI IRQ domain consolidation
- Gen-Z interconnect fabric
- PCI error handling and management, e.g., Advanced Error Reporting (AER), Downstream Port Containment (DPC), ACPI Platform Error Interface (APEI) and Error Disconnect Recovery (EDR)
- Power management and devices supporting Active-state Power Management (ASPM)
- Peer-to-Peer DMA (P2PDMA)
- Resources claiming/assignment consolidation
- DMA ownership models
- Thunderbolt, DMA, RDMA and USB4 security
- VFIO
- I/O Page Fault (IOPF) for passthrough devices
- Shared Virtual Addressing (SVA) interface
- Single-root I/O Virtualization (SR-IOV)/Process Address Space ID (PASID) integration
- PASID in SR-IOV virtual functions
- TDISP/TSM Device assignment/sub-assignment
- IOMMU
- /dev/iommufd development
- IOMMU virtualisation
- IOMMU drivers SVA interface
- DMA-API layer interactions and the move towards generic dma-ops for IOMMU drivers
- Possible IOMMU core changes (e.g., better integration with the device-driver core, etc.)
If you are interested in participating in this MC and have topics to propose, please use the Call for Proposals (CfP) process. More topics might be added based on CfP for this MC.
Otherwise, join us in discussing how to help Linux keep up with the new features added to the PCI interconnect specification. We hope to see you there!
Key Attendees:
- Alex Williamson
- Benjamin Herrenschmidt
- Bjorn Helgaas
- Dan Williams
- Ilpo Järvinen
- Jacob Pan
- James Gowans
- Jason Gunthorpe
- Jonathan Cameron
- Jörg Rödel
- Kevin Tian
- Lorenzo Pieralisi
- Lu Baolu
- Manivannan Sadhasivam
Contacts:
- Alex Williamson (alex.williamson@redhat.com)
- Bjorn Helgaas (helgaas@kernel.org)
- Jörg Roedel (joro@8bytes.org)
- Lorenzo Pieralisi (lpieralisi@kernel.org)
- Krzysztof Wilczyński (kwilczynski@kernel.org)
With required updates to the PCI core, device core, CPU arch, KVM, VFIO, IOMMUFD, and DMABUF, the TEE I/O effort has a significant amount of work to do to reach the starting line of the race to address Confidential Device use cases. Then, the mechanisms for devices to enter the locked state, the attestation and policy infrastructure for deploying secrets to TEE VMs, and the ability to recover a Trusted Computing Base (TCB) when errors inevitably occur are all follow-on work to that initial base.
This discussion will quickly review what has happened since last Plumbers and then open a discussion on the remaining challenges. Particular focus to be paid to the host side challenges (VFIO, IOMMUFD, DMABUF, KVM) as those are likely to still be open well into the new year.
Review the current state of the page table consolidation project.
Depending on progress in the next months this may be a primer on the design of the consolidated page table system to help reviewers, or a discussion on the next steps to land along the project.
https://patch.msgid.link/r/0-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com
Additionally, any iommufd-related topics that people may wish to bring up.
Hello, I'm planning to attend LPC in person this year, and am interested in presenting our learnings related to running user space drivers built on top of VFIO in production, specifically related to orchestrating access to VFIO-bound devices from multiple processes.
The presentation would cover
- Our current usage patterns.
- Benefits of being able to deploy updates to device policy by shipping user space binaries.
- Proposals for how to augment kernel interfaces to enable our class of use case more securely.
- How the proposals could intersect with ongoing VFIO and IOMMUFD work streams.
Some of the core topics are already being discussed in the tech topic thread [1] that I started. I think a presentation which summarizes the main points could be a useful catalyst for further in-person discussions at the conference.
[1] https://lore.kernel.org/all/20250918214425.2677057-1-amastro@fb.com/
Cloud workloads with strict performance needs (AI, HPC, large-scale data processing) frequently use PCIe device passthrough (e.g., via VFIO in Linux/KVM) to reduce latency and improve bandwidth. While effective for performance, this approach also exposes low-level device configuration interfaces directly to guest workloads, which may be malicious or running untrusted software.
In our experiments across multiple device types and vendors, we observed that legal but unexpected writes to the PCIe configuration space of a passthrough device can trigger PCIe errors that lead to host-level failures, posing a risk to system reliability, availability, and serviceability (RAS). While it is well known that PCIe errors can cause host-level failures, our observations are noteworthy for two main reasons:
To begin addressing this issue, we prototyped a VFIO kernel patch that blocks guest accesses to unassigned regions in the PCIe configuration space. The patch allows two coarse-grained policies that can be enforced through module parameters: 1) Reads and writes to unassigned regions can be independently blocked (e.g., block only writes, block only reads, or block both). 2) Specific devices can be whitelisted to allow accesses to unassigned regions if required.
Looking forward, we propose a more fine-grained model analogous to how the IOMMU validates IOVA mappings. In this design, VFIO would maintain a table of “valid” config-space regions per device, supplied by the device driver (most realistic) or potentially extracted from firmware/device descriptors (longer-term). VFIO would consult this table on each guest config access, permitting safe operations while blocking or emulating unsafe ones.
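A minimal sketch of what that per-device validation could look like follows; the struct and function names are invented for illustration and are not taken from the prototype patch.

    #include <linux/types.h>

    struct cfg_region {
            u16  offset;    /* start of a config-space region the device implements */
            u16  len;
            bool write_ok;  /* whether guest writes to this region are permitted */
    };

    /* Consulted on each guest config access, analogous to the IOMMU
     * validating IOVA mappings before allowing DMA through. */
    static bool cfg_access_allowed(const struct cfg_region *tbl, size_t n,
                                   u16 off, size_t len, bool is_write)
    {
            size_t i;

            for (i = 0; i < n; i++) {
                    if (off >= tbl[i].offset &&
                        off + len <= (size_t)tbl[i].offset + tbl[i].len)
                            return !is_write || tbl[i].write_ok;
            }
            return false;   /* unassigned region: block or emulate */
    }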
In this talk, we aim to discuss:
- The implications of exposing PCIe configuration space in passthrough setups.
- Whether VFIO (or related subsystems) should adopt a stricter access model.
- Potential directions for long-term solutions, involving kernel, firmware, and hardware changes.
Our goal is to engage with the community to shape a path forward that preserves both the performance benefits of passthrough and the robustness required for large-scale, multi-tenant deployments.
We present a Hyper-V based pvIOMMU implementation for Linux guests, built upon the community-driven Generic I/O Page Table framework. Our approach leverages stage-1 page tables in the guest (with nested translation) to drive DMA remapping (including vSVA). This also eliminates the need for complex device-specific emulation and map/unmap overhead, while staying scalable across architectures.
This session will cover:
We seek feedback on integration boundaries and the minimal common infrastructure required to upstream the Hyper-V pvIOMMU.
The Smart Data Accelerator Interface (SDXI) is a new SNIA standard that extends traditional DMA engines with support for multiple address spaces, user-space ownership, and extensible offloads such as memory data movement. This talk reports on the progress of Linux enablement in two phases: an initial DMA-engine integration already posted upstream for review, and a full SDXI 1.0 implementation with a user ABI and supporting library. We will outline the current IOCTL-based UAPI, discuss key design trade-offs in ABI shaping, kernel/user coordination, and address space isolation, and demonstrate early user-space workloads such as inter-VM memory copy. Preliminary findings and evaluation methodology will be presented. The session will highlight open issues around subsystem placement, security, and virtualization support, and invite discussion on integration and optimization strategies, UAPI evolution, and emerging SDXI use cases across Linux subsystems.
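Since the UAPI is still being shaped, the following is a purely illustrative sketch of the general shape an IOCTL-based data-mover interface can take; none of these names come from the actual proposed SDXI ABI.

    #include <linux/ioctl.h>
    #include <linux/types.h>

    /* Hypothetical descriptor for a single memory-move operation. */
    struct sdxi_copy_args {
            __u64 src;      /* source buffer (user VA or handle) */
            __u64 dst;      /* destination buffer */
            __u64 len;      /* bytes to move */
            __u32 pasid;    /* address space the operation targets */
            __u32 flags;
    };

    #define SDXI_IOC_MAGIC 'x'  /* hypothetical */
    #define SDXI_IOC_COPY  _IOW(SDXI_IOC_MAGIC, 0, struct sdxi_copy_args)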
AMD’s Smart Data Cache Injection (SDCI) leverages PCIe TLP Processing Hints (TPH) to steer DMA write data directly into the target CPU's L2 cache to reduce latency, improve throughput, and reduce DRAM bandwidth consumption. This talk covers the details of the AMD SDCI design, outlines the Linux kernel support we have developed - including a new ACPI _DSM interface in the PCI root complex and extensions providing a TPH API - and demonstrates how driver developers can adopt these features to unlock performance gains. We will present results using two open-source network drivers showing measurable improvements in latency and bandwidth efficiency on AMD SDCI-enabled SoCs, and conclude with lessons learned, practical considerations for driver adoption, and design implications under virtualized environments.
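A sketch of how a driver might opt in, assuming TPH helpers along the lines of those added to the PCI core for this work; the exact names and signatures here are indicative only, so check <linux/pci-tph.h> in a current tree.

    #include <linux/pci.h>
    #include <linux/pci-tph.h>

    static int nic_setup_tph(struct pci_dev *pdev, unsigned int cpu_uid)
    {
            u16 tag;
            int ret;

            /* Enable TPH on this function (interrupt-vector mode; the
             * mode constant name is indicative). */
            ret = pcie_enable_tph(pdev, PCI_TPH_ST_IV_MODE);
            if (ret)
                    return ret;

            /* Resolve the steering tag for the target CPU's cache, backed
             * by the new ACPI _DSM in the root complex... */
            ret = pcie_tph_get_cpu_st(pdev, TPH_MEM_TYPE_VM, cpu_uid, &tag);
            if (ret)
                    return ret;

            /* ...and program it into the ST table entry used by the RX
             * queue's DMA writes. */
            return pcie_tph_set_st_entry(pdev, 0, tag);
    }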
On non-ACPI systems, such as those using DeviceTree for hardware description, the PCI host bridge drivers were responsible for managing the endpoint power supplies. While this worked for some simple use cases, like endpoints requiring 12V or 3.3V supplies, it didn't work for the complex supplies required by some endpoint devices, such as integrated WLAN/BT devices.
The PCI Pwrctrl framework [1] was created to solve these issues by moving the power supply handling away from the PCI host bridge drivers. While it solved the intended issue, i.e., managing the supplies of a single WLAN/BT endpoint connected to a Root Port, it fell short of being a generic PCI framework for managing endpoint supplies. For instance, it doesn't support handling PERST#, which needs to be toggled on some form factors, or managing the supplies of multiple endpoints under a Root Port with different power supply requirements. So developers still cannot use it on all platforms.
This session outlines the issues with the pwrctrl framework and presents proposals on addressing them.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pwrctrl
x86-focused material has historically been spread out at Plumbers. This will be an x86-focused microconference. Broadly speaking, anything that might affect arch/x86 is on topic, except where there may be a more focused discussion occurring, like around Confidential Computing or KVM.
This microconference would look at how to address new x86 processor features and also look back at how older issues might be made less painful. For new processor features like APX, what is coming? Are the vendors coordinating, and are they coordinating enough that Linux engineers can talk to each other, and that the features are compatible? For older issues like hardware security vulnerabilities, is the current approach working? If not, how should they be dealt with differently? Can new hardware features or vendor policies help?
As always, the microconference will be a great place for coordination among distributions, toolchains and users up and down the software stack. All the way from guest userspace to VMMs.
Potential Problem Areas to Address:
Key Attendees
The Devicetree Microconference focuses on discussing and solving problems present in systems using Devicetree as the firmware representation. This notably includes the Linux kernel and U-Boot, but it can also cover topics relevant to Zephyr or System Devicetrees. Systems using Devicetree make up the majority of embedded boards, mobile devices and ARM64 laptops (and many other ARM/ARM64/RISC-V machines).
Current problems:
Key attendees:
Abel Vesa, AngeloGioacchino Del Regno, Arnd Bergmann, Bartosz Golaszewski, Bjorn Andersson, Casey Connolly, Geert Uytterhoeven, Hervé Codina, Luca Ceresoli, Michal Simek, Neil Armstrong, Nishanth Menon, Thierry Reding, Wolfram Sang
(Confirmed attendance: Abel Vesa, Bartosz Golaszewski, Bjorn Andersson, Casey Connolly, Neil Armstrong; most likely Arnd Bergmann as well)
The great benefit of Devicetree bindings in the current DT schema format is the ability to validate the correctness of DTS (Devicetree sources) against those bindings. However, once validation was introduced, we discovered that many in-kernel DTS files simply did not pass.
A few years and thousands of commits later, we can now ask:
1. What is the current status of in-kernel DTS validation?
2. Which architectures among the main Devicetree users are free of warnings (spoiler: there is at least one)?
3. Which SoC subsystems are doing well and have compliant DTS files?
4. What do the community and SoC maintainers need in order to improve this situation for their platforms?
This talk will address questions 1–3, and we hope the discussion will generate ideas to answer question 4.
MIPI specifications define standard device properties that end up in ACPI tables. However, Devicetree bindings evolve in a very different way.
Even though all three describe hardware, there is no consideration of MIPI or ACPI while defining the bindings.
Where do we draw the line?
Is there some consolidation that needs to happen?
How can drivers written for ACPI be usable on Devicetree-based platforms?
The high-level idea behind the Linux kernel GPIO consumer API is that lines are an exclusive resource - only one logical consumer can request and control a GPIO pin. This results naturally from the type of operations that a low-level user can perform on GPIO lines - after all: one user setting the line's direction to output while another sets it to input is an example of a very clear conflict that cannot easily be solved without involving a higher-level abstraction wrapping the low-level GPIO primitives.
In real life however, nothing stops hardware designers from connecting the same IO pin to multiple devices and this is what we - as kernel engineers - later see modeled in device-tree as phandles referencing the same GPIO assigned to different device nodes.
In some cases, we have infrastructure in place to handle this correctly. An example would be the semi-standardized reset-gpios DT property, which is interpreted by the reset core as a simple reset provider with a shared GPIO that is driven "high" when the reset is asserted. The reset core handles multiple references to the same "logical" reset.
Device-tree, however, is supposed to model the real hardware, so as soon as the same GPIO is used by multiple devices in a different context, we are in a situation where two separate drivers need to request the same line. We currently only have a hack in place that allows drivers to basically "fight" over the GPIO, with no concurrency protection and no actual logic behind it.
That's not all: the problem can become even more complex than simple reference counting of value settings. An example of such a problem is the PERST# signal in PCI, shared between endpoints, where the expected (logical) behavior is: physically drive the line high only once ALL the users want it driven high.
This talk will describe the problems in detail, describe existing solutions and the need for better ones, what has been done so far and how much we can do in C code with the information contained in device-tree before we have to resort to quirks.
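As a toy model of the PERST#-style semantics described above - physically drive the line high only once all users vote high - consider the following sketch. This is not an existing kernel API, just an illustration of the reference counting involved.

    #include <linux/gpio/consumer.h>
    #include <linux/mutex.h>

    struct shared_gpio {
            struct gpio_desc *desc;
            struct mutex lock;
            unsigned int users;      /* consumers sharing the line */
            unsigned int want_high;  /* consumers currently voting high */
    };

    /* Each consumer keeps its own vote; the physical line only goes high
     * when every user wants it high, so any single "assert reset" wins. */
    static void shared_gpio_set(struct shared_gpio *sg, bool *my_vote, bool high)
    {
            mutex_lock(&sg->lock);
            if (high && !*my_vote)
                    sg->want_high++;
            else if (!high && *my_vote)
                    sg->want_high--;
            *my_vote = high;
            gpiod_set_value_cansleep(sg->desc, sg->want_high == sg->users);
            mutex_unlock(&sg->lock);
    }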
On x86 / ACPI platforms, devices on enumerable busses can normally be seen directly by the OS. On DT platforms, these devices sometimes require extra power sequencing like toggling regulator supplies or GPIO lines. Over the years most of these cases have been solved, but there are still some gaps.
This talk will go through what already works, and attempts to identify:
- What's missing from the device tree
- What's defined in the device tree, but doesn't really work in the kernel
- What kludges we have accumulated and haven't cleaned up
fw_devlink currently parses more than 20 properties to track dependencies between DT nodes. While this is not terribly slow, it's still an overhead and needs to be maintained as more common DT properties are added.
There are also a ton of other bespoke, device-specific DT properties that aren't supported today.
This talk will be about the various ways the DTC could make this a lot more optimal and generalized.
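For context, the current approach is table-driven; the sketch below is loosely modeled on the supplier-binding tables in drivers/of/property.c (abbreviated and simplified, not the exact kernel code).

    /* Each property fw_devlink understands needs an entry like this, which
     * is the parse cost and maintenance burden discussed above. */
    struct supplier_binding {
            const char *prop;   /* consumer property, e.g. "clocks" */
            const char *cells;  /* supplier's "#...-cells" property */
    };

    static const struct supplier_binding bindings[] = {
            { "clocks",        "#clock-cells" },
            { "interconnects", "#interconnect-cells" },
            { "iommus",        "#iommu-cells" },
            { "phys",          "#phy-cells" },
            /* ...and a couple dozen more, plus bespoke ones not handled. */
    };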
A large percentage of feedback when reviewing DTS changes ends up being style-related. This hurts both the reviewer and the submitter, as the former wants to maintain codebase coherence, while the latter wants their changes to get merged.
This session will be a discussion on where we currently stand, what functionality should be targeted and what are the obvious roadblocks.
On many device tree based devices, the device tree blobs are commonly shipped with the kernel or OS image, not the firmware. If the image is meant to be generic, it would include multiple DTBs and possibly many DTBO combinations. The bootloader selects a DTB and optionally applies overlays matching the hardware. Known image "standards" include:
- FIT image: maps a compatible string to a kernel image, a DTB and zero or more DTBOs
- Android's DTBO partition format: allows up to six 32-bit unsigned integers for DTB/DTBO matching, but the mechanism is entirely implementation defined
Elliot's "Shipping Multiple Devicetrees: How to Identify Which DTB Is for My Board?" [1] talk from ELC 2024 touched on the need for some generic property (or properties) for identification. Amrit proposed a generic "board-id" property [2] for this purpose.
Putting the board identifiers in the device tree itself only solves half the problem. Device tree overlays cannot reuse the same identifiers, as properties cannot be stacked or appended together. The overlays have extra selection criteria that are often tied to the firmware. The base DT needs to be identified. These bits of "metadata" are what is needed to put together a fully usable image for the firmware to load. The author's own session "Second Source Component Probing on Device Tree Platforms" [3] at ELC 2024 briefly mentioned this.
This talk aims to start a discussion on what bits are needed for various bootloaders and images, and how to store that metadata close to if not within the source files.
[1] https://sched.co/1aBFy
[2] https://lore.kernel.org/all/1710418312-6559-1-git-send-email-quic_amrianan@quicinc.com/
[3] https://sched.co/1aBGe
Devicetree evolved from systems that were tightly vertically integrated, and this has been a continuing trend throughout its early adoption in Linux. This is not a problem when the same organisation is responsible for building the bootloader, the kernel, and the OS - they can just do whatever they want, often with little care given to forward compatibility, or compatibility with custom software at all.
The last 5 years have seen a drastic shift: with ARM-powered SBCs and laptops in the hands of consumers, the “Devicetree Selection” problem has hit us full force in several ways:
So how do we as bootloader, kernel, middleware and distro developers deal with this and start bringing the "typical Linux" experience to more devices?
In this talk Casey will cover her experiences working with upstream Devicetree, she will explore the problem space and surprising complexity of loading a suitable Devicetree for your hardware and kernel, the problem of overlays and hardware variants, and the different ways that distros, bootloaders, and the kernel community are trying to improve the situation. She will focus on specific issues with the recent Snapdragon laptops, smartphones, and dev boards, with insight from her experience as a distro developer and U-Boot custodian. Finally, she will describe her proposal to improve the situation today, providing a way forward for distros and users to get up and running on devices that are already out in the wild by standardising on a way for users and firmware to express a devicetree preference to the OS loader.
The Gaming on Linux Microconference welcomes the community to discuss a broad range of topics around performance improvements for Gaming devices running Linux. Gaming on Linux has pushed the kernel to improve in several areas and has helped create new features for Linux, such as the futex_waitv() syscall, the Unicode subsystem, HDR support, and much more. Although they were initially created for gaming use cases, now they are used in different scenarios.
The potential topics for this year are around a lot of subsystems in the kernel, including:
Steam Deck is a successful console from Valve that runs on top of FOSS, with Linux as its operating system.
For regular gamers, the user experience is smooth and they don’t even need to think about what’s going on under the hood to make such a good experience possible. In particular, there are interesting bits from the tracing system and in-kernel debug features that are leveraged to achieve the smooth run.
In this talk, we’re going to dive into proactive mechanisms to detect how well the system is running and whether there are sub-optimal paths that could be improved, as well as tooling to collect logs in the case of more severe issues leading to kernel crashes.
All of that is then tied to an opt-in feature that sends information to Valve’s Sentry instance to be debugged by the SteamOS engineering team.
Perfetto is a powerful instrumentation-based tool that enables deep insights into the behavior of computing platforms. In this talk, we’ll demonstrate how Perfetto can be leveraged to analyze the performance of mobile and VR games, focusing on their interactions with the Linux kernel.
We’ll present real-world examples illustrating how Perfetto helps us understand the complex relationships between the kernel scheduler, memory management, GPU drivers, interrupt handlers, and game workloads. Since games are semi-realtime applications, they must complete all processing within strict frame time budgets. The kernel scheduler, in particular, plays a pivotal role in ensuring games meet their frame scheduling deadlines.
By examining Perfetto traces, we’ll showcase key metrics that reveal performance bottlenecks and highlight opportunities for optimization. These insights are invaluable for improving game performance on Linux and Android platforms, ultimately leading to smoother and more responsive gaming experiences.
Emulators and translation layers have been pushing the limits of the existing syscalls and Linux APIs, creating the need for new interfaces. One such interface is the get/set_robust_list() syscall.
This syscall takes as an argument a user pointer to a linked list in user memory, and it assumes that the pointer size is the native one for the kernel build. This doesn't work when running an x86 32-bit application on an ARM64 kernel, which doesn't have the compat entry point that x86-64 does.
Also, only one list per task is allowed, so any emulator that wants to support robust lists needs to give up either its own list or the emulated app's list.
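The UAPI structures involved (from include/uapi/linux/futex.h) show why the pointer size matters:

    /* Every field is a native-sized pointer or long: a 32-bit x86 binary
     * registers a list of 4-byte pointers, but an arm64 kernel without a
     * compat entry point walks it assuming 8-byte pointers. */
    struct robust_list {
            struct robust_list __user *next;
    };

    struct robust_list_head {
            struct robust_list list;
            long futex_offset;
            struct robust_list __user *list_op_pending;
    };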
The goal of this session is to share a proposal for a new interface for this syscall, as shared on the LKML:
https://lore.kernel.org/lkml/20250626-tonyk-robust_futex-v5-0-179194dbde8f@igalia.com/
The CPU scheduler plays a decisive role in the Linux gaming experience. By controlling which task runs first, for how long, and on which CPU, the scheduler directly impacts stutter, latency, energy efficiency, and overall performance.
This talk asks whether a gaming-optimized scheduler is feasible, and if so, what fundamental properties it should preserve. We will outline potential optimization areas specific to gaming workloads, highlight trade-offs against general-purpose scheduling goals, and explore how the scheduler might balance throughput with responsiveness under highly interactive and latency-sensitive conditions.
As a case study, we will share insights from developing LAVD, a sched_ext-based scheduler designed with gaming workloads in mind. The session aims to spark discussion on what an optimized scheduler for gaming could look like, how it might integrate with the broader Linux gaming stack, and what steps could bring us closer to that goal.
This is an open time slot to continue discussions with a more informal format.
Live Update is a specialized reboot process where selected devices are kept operational and kernel state is preserved and recreated across a kexec. For devices, DMA and interrupts may continue during the reboot.
The primary use-case of Live Update is to enable hypervisor updates in cloud environments with minimal disruption to running virtual machines. During a Live Update, a VM can pause and its state is stored to memory while the hypervisor reboots. PCIe devices attached to those VMs (such as GPUs, NICs, and SSDs), are kept running during the Live Update. After the reboot, VMs are recreated and restored from memory, reattached to devices, and resumed. The disruption is limited to the time it takes to complete this entire process.
With Live Update infrastructure in place, other use-cases may emerge, for example preserving the state of a GPU running LLM workloads, freezing running containers with CRIU, and preserving large in-memory databases.
Live Update and state persistence functionality touches on different parts of the kernel, and this microconference aims to bring together people from different subsystems. Upstream support for Live Update is still in its infancy and there are a lot of unsolved aspects that will benefit from direct communication.
Key problems that will be discussed:
Support for memfd/guest_memfd/hugetlb/tmpfs
Preserving the state of VFIO, IOMMUFD, and IOMMU drivers.
Kernel <-> userspace interaction during Live Update
Integration of Live Update with PCI and Device Model
Persistence of movable memory
Leveraging suspend/resume functionality for device state preservation
Optimizing kernel shutdown and boot times
Automated Testing of Live Updates
Key attendees:
Pasha Tatashin
David Matlack
David Rientjes
Chris Li
Bjorn Helgaas
Samiullah Khawaja
Vipin Sharma
Josh Hilke
Changyuan Lyu
Alex Graf
David Woodhouse
James Gowans
Pratyush Yadav
Jason Gunthorpe
Mike Rapoport
Alex Williamson
This workshop will center on Nova, the upstream Rust-based kernel driver for NVIDIA GPUs.
Discussion topics will include the design and evolution of the firmware API exposed by the GPU System Processor (GSP), user-space submission interfaces and compute APIs, and interactions with the core kernel (device / driver APIs; locking and lifetimes; memory management APIs). NVIDIA engineers will share their experience around userspace submission interfaces, compute APIs, and the associated (architectural) challenges. This opens the floor for comparing and discussing design trade-offs across existing and emerging drivers, as well as opportunities afforded by Rust in this context.
Potential key participants are members of the Nova development and maintenance team at NVIDIA and Red Hat, contributors from the DRM and Rust-for-Linux communities, and developers working on parallel efforts such as Tyr, the Rust-based GPU driver initiative from ARM and Collabora.
The workshop aims to ensure that Nova remains closely tied to the needs and expectations of the wider graphics / compute stack in the Linux ecosystem, while fostering collaboration around shared challenges in GPU driver design.
By bringing together diverse stakeholders in an open forum, this workshop will encourage meaningful discussions that can lead to actionable outcomes for the future of GPU drivers in the Linux kernel.