Linux Plumbers Conference 2022

Europe/Dublin
Description

12-14 September, Dublin, Ireland

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.

    • BOFs Session: Birds of a Feather (BoF)
    • Kernel Memory Management MC
      • 1
        Copy On Write, Get User Pages, and Mysterious Counters

        As we learned throughout the last decade (!), Copy On Write (COW) paired with Get User Pages (GUP) can be harder than it seems. Fortunately, it looks like we might soon have both mechanisms working completely reliably in combination -- at least for most types of anonymous memory.

        In this talk, I'll explain recent changes to our GUP and COW logic for anonymous memory, how they work, where we stand, what the tradeoffs are, what we're missing, and where to go from here.

        I will also talk about which mysterious counters we are using nowadays in our COW logic(s), what their semantics are, what options we might have for simplifying one of them (hint: mapcount), and what the tradeoffs might be.

        But also, what about the shared zeropage, private mappings of files, KSM ... ?

        Speaker: Mr David Hildenbrand (Red Hat)
      • 2
        Next approach to solve mmap_lock scalability - per-VMA lock

        At this year's LSFMM conference we discussed the mmap_lock scalability issue and the current approaches to solving it. The main issue is the process-wide scope of mmap_lock, which prevents handling page faults in one virtual memory area (VMA) of a process while another VMA of the same process is being modified.
        A recently posted respin of the Speculative Page Faults patchset was deemed too complex to be accepted, and the discussion concluded with a suggestion that "a reader/writer semaphore could be put into the VMA as a sort of range lock". This talk will focus on the per-VMA lock patchset, which implements this approach, its pros/cons, and benchmark results.
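
        A minimal sketch of the idea (the field and helper names below are hypothetical, not the patchset's actual API): the page fault path first tries a per-VMA reader/writer semaphore and falls back to the mmap_lock path only when that fails, so faults in one VMA no longer wait on modifications to another.

          /* Hypothetical sketch only -- not the real patchset API. */
          struct vm_area_struct {
                  /* ... existing fields ... */
                  struct rw_semaphore vm_lock;    /* per-VMA range lock */
          };

          /* Fault path: try to handle the fault under the VMA lock alone. */
          static bool fault_under_vma_lock(struct vm_area_struct *vma)
          {
                  if (!down_read_trylock(&vma->vm_lock))
                          return false;   /* contended: fall back to mmap_lock */
                  /* ... handle the page fault against this VMA only ... */
                  up_read(&vma->vm_lock);
                  return true;
          }

          /* Writers (mmap, munmap, mprotect, ...) still take mmap_lock for
           * write and additionally take each affected VMA's lock, so they
           * exclude concurrent faults only in the ranges they modify. */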

        Speakers: Liam Howlett (Oracle), Suren Baghdasaryan, Michel Lespinasse (Facebook)
      • 3
        Memory tiering

        CXL enables exploration of a more diverse range of memory technology beyond the DDR supported by the CPU. Those memory technologies come with different performance characteristics from a latency & bandwidth point of view. This means the memory topology of platforms becomes even more complex.

        There is a broad design space for how to leverage tiered memory, from letting the end user control placement to trying to automatically place memory on behalf of the end user. This presentation intends to review the choices and technologies that are in development (memory monitoring, NUMA, ...) and to try to identify roadblocks.

        Speaker: Mr Jerome Glisse (Google)
      • 11:30 AM
        Break
      • 4
        Multi-Gen LRU: Current Status & Next Steps
        • Latest performance benchmark results on ARM64 servers and POWER9
        • How to make MGLRU the default for everyone
        • How to use MGLRU page table scanning to reclaim unused page-table pages
        • A BPF program built on top of MGLRU to create per-process (access) heat maps
        Speakers: Jesse Barnes (Google), Rom Lemarchand (Google)
      • 5
        Preserving guest memory across kexec

        Live update is a mechanism to support deploying updates to a running hypervisor in a way that has limited impact to virtual machines. This is done by pausing the virtual machines, stashing KVM state, kexecing into a new kernel, and restarting the VMM process. The challenge is guest memory: how can it be preserved and restored across kexec?

        This talk describes a solution to this problem: moving guest memory out of the kernel managed domain, and providing control of memory mappings to userspace. Userspace is then able to restore the memory mappings of the processes and virtual machines via a FUSE-like interface for page table management.

        We describe some requirements, options, why the FUSE-style option was chosen, and an overview of the work-in-progress implementation. Opinions are collected around other use cases this functionality could support.
        Next steps around finalising the design and working to get this included upstream are discussed.

        This is a follow-on to the initial RFC presented at LSF-MM a few months ago: https://lwn.net/SubscriberLink/895453/71c46dbe09426f59/

        Speaker: James Gowans (Amazon EC2)
      • 6
        Low-overhead memory allocation tracking

        Tracking memory allocations for leak detection is an old problem with
        many existing solutions such as kmemleak and page_owner. However, these
        solutions have relatively high performance overhead, which limits their
        use. This talk will present a memory allocation tracking implementation
        based on the code tagging framework. It is designed to minimize
        performance overhead while capturing enough information to discover
        kernel memory leaks.
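
        A minimal sketch of the code tagging idea (struct, section, and macro names here are illustrative, not the actual patchset API): every allocation call site gets a static tag emitted into a dedicated section, and the allocation wrapper charges bytes to that tag, so a leak shows up as a call site whose counter only grows.

          /* Illustrative only -- names do not match the real patchset. */
          struct alloc_tag {
                  const char *file;
                  int line;
                  atomic64_t bytes;       /* live bytes charged to this site */
          };

          #define DEFINE_ALLOC_TAG(tag)                                    \
                  static struct alloc_tag tag __section("alloc_tags") = {  \
                          .file = __FILE__, .line = __LINE__ }

          /* Allocation wrapper charging the bytes to the calling site. */
          #define kmalloc_tagged(size, gfp)                                \
          ({                                                               \
                  DEFINE_ALLOC_TAG(__tag);                                 \
                  void *__p = kmalloc(size, gfp);                          \
                  if (__p)                                                 \
                          atomic64_add(size, &__tag.bytes);                \
                  __p;                                                     \
          })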

        Speakers: Kent Overstreet, Suren Baghdasaryan
      • 7
        The slab allocators of past, present, and future
        • A summary of how we got to have SLAB, SLOB and SLUB.
        • The strengths and weaknesses of each - performance, debugging, memory overhead.
        • The issues with having three implementations.
        • Code complexity and bitrot
        • Other features that have to be implemented for each variant, or that limit the choice (kmemcg, PREEMPT_RT...)
        • Imperfect common code, recent attempts to unify it more
        • API improvement issues - we would like kfree() to work on kmem_cache_alloc() objects, but SLOB would have to adapt and increase memory overhead.
        • Can we drop SLOB and/or SLAB? What changes would SLUB need in order to replace their use cases?
        Speaker: Vlastimil Babka (SUSE Labs)
    • LPC Refereed Track
      • 8
        PREEMPT_RT - how not to break it.

        At the time of writing, the PREEMPT_RT patch set has only a handful of
        patches left until it can be enabled on the x86 architecture.
        The work will not be finished once the patches are fully merged. A new
        issue is how not to break parts of PREEMPT_RT in future development by
        making assumptions which are not compatible with it or which lead to
        large latencies.
        Another problem is how to address limitations of PREEMPT_RT, such as
        the big softirq/bottom-halves lock, which can lead to high latencies.

        Speaker: Sebastian Siewior
      • 9
        Launching new processes with `io_uring_spawn` for fast builds

        io_uring allows running a batch of operations fast, on behalf of the current process. As the name suggests, this works exceptionally well for I/O workloads. However, one of the most prominent workloads in software development involves executing other processes: make and other build systems launch many other processes over the course of a build. How can we launch those processes faster?

        What if we could launch other processes, and give them initial work to do using io_uring, ending with an exec? What if we could handle the pre-exec steps for a new process entirely in the kernel, with no userspace required, eliminating the need for fork or even vfork, and eliminating page-table CoW overhead?

        In this talk, I'll introduce io_uring_spawn, a mechanism for launching empty new processes with an associated io_uring. I'll show how the kernel can launch a blank process, with no initial copy-on-write page tables, and initialize all of its resources from an io_uring. I'll walk through both the successful path and the error-handling path, and show how to get information about the launched process. Finally, I'll show how existing userspace can take advantage of io_uring_spawn to speed up posix_spawn, and provide performance numbers for common workloads, including kernel compilation.
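
        To make the flow concrete, here is a hypothetical usage sketch (io_uring_spawn is not a merged kernel API, and the io_uring_prep_clone()/io_uring_prep_dup2()/io_uring_prep_exec() helpers are invented here purely for illustration; only the queue/submit liburing calls are real, and pipefd/argv/envp stand for whatever the parent has prepared): linked SQEs create the empty task, perform the pre-exec setup, and end with the exec.

          /* Hypothetical sketch -- the clone/dup2/exec prep helpers below do
           * not exist in liburing today. */
          struct io_uring ring;
          struct io_uring_sqe *sqe;

          io_uring_queue_init(8, &ring, 0);

          sqe = io_uring_get_sqe(&ring);          /* 1. create an empty task */
          io_uring_prep_clone(sqe);               /*    (hypothetical)       */
          sqe->flags |= IOSQE_IO_LINK;

          sqe = io_uring_get_sqe(&ring);          /* 2. pre-exec setup       */
          io_uring_prep_dup2(sqe, pipefd, 1);     /*    (hypothetical)       */
          sqe->flags |= IOSQE_IO_LINK;

          sqe = io_uring_get_sqe(&ring);          /* 3. exec in the new task */
          io_uring_prep_exec(sqe, "/usr/bin/cc", argv, envp); /* hypothetical */

          io_uring_submit(&ring);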

        Speaker: Josh Triplett
      • 11:30 AM
        Break
      • 10
        Exercising the Linux scheduler with Yogini

        Introducing "yogini", a flexible Linux tool for stretching the Linux scheduler and measuring the result.

        Yogini includes an extensible catalogue of simple workloads, including execution-bound, cache-bound and memory-bound work, as well as workloads using advanced (Intel) ISAs. The workloads are assigned to threads, which can be run at prescribed rates at prescribed times.

        At the same time, yogini can run a periodic system monitor, which tracks frequency, power, sched stats, temperature and other hardware and software metrics. Since yogini tracks both power and performance, it can combine them to report energy efficiency.

        Measurement results are buffered in memory and dumped to a .TSV file upon completion -- to be read as text, imported to your favorite spreadsheet, or plotted via script.

        As the workloads are well controlled, yogini lends itself well to be used for creating Linux regression tests -- particularly those relating to scheduler-related performance and efficiency.

        Yogini is new. The goal of this session is to let the community know that it is available, and hopefully useful, and to solicit ideas for making it even more useful for improving Linux.

        Speaker: Len Brown (Intel Open Source Technology Center)
      • 11
        OS Scheduling with Nest: Keeping Tasks Close Together on Warm Cores

        To best support highly parallel applications, Linux's CFS scheduler tends to spread tasks across the machine on task creation and wakeup. It has been observed, however, that in a server environment, such a strategy leads to tasks being unnecessarily placed on long-idle cores that are running at lower frequencies, reducing performance, and to tasks being unnecessarily distributed across sockets, consuming more energy. In this talk, we propose to exploit the principle of core reuse, by constructing a nest of cores to be used in priority for task scheduling, thus obtaining higher frequencies and using fewer sockets. We implement the Nest scheduler in the Linux kernel. While performance and energy usage are comparable to CFS for highly parallel applications, for a range of applications using fewer tasks than cores, Nest improves performance 10%-2x and can reduce energy usage.

        Speaker: Julia Lawall (Inria)
      • 1:30 PM
        Lunch
      • 12
        RV: where are we?

        Over the last few years, I've been exploring the possibility of verifying the Linux kernel's behavior using Runtime Verification.

        Runtime Verification (RV) is a lightweight (yet rigorous) method that complements classical exhaustive verification techniques (such as model checking and theorem proving) with a more practical approach for complex systems.

        Instead of relying on a fine-grained model of a system (e.g., an instruction-level re-implementation), RV works by analyzing the trace of the system's actual execution, comparing it against a formal specification of the system behavior.

        The research has become reality with the proposal of the RV interface [1]. At this stage, the proposal includes:

        • An interface for controlling the verification;
        • A tool and set of headers that enable the automatic code generation of the RV monitor (Monitor Synthesis);
        • Sample monitors to evaluate the interface;
        • A sample monitor developed in the context of the Elisa Project demonstrating how to use RV in the context of safety-critical systems.

        In this discussion, we can talk about the steps missing for an RV merge and what the next steps for the interface are. We can also discuss the needs of the safety-critical and testing communities, to better understand what kinds of models and new features they need.

        [1] https://lore.kernel.org/all/cover.1651766361.git.bristot@kernel.org/

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 13
        Modularization for Lockdep

        Lockdep is a powerful tool for developers to uncover lock issues. However, there are things that still need to be improved:

        • The error messages are sometimes confusing and difficult to understand, and require experts to decode them. This not only makes reading deadlock scenarios challenging, but also makes internal bugs hard to debug.

        • Once one lock issue is reported, all the lockdep functionality is turned off. This is reasonable, because once a lock issue is detected the whole system is subject to lock bugs and it's pointless to keep checking
          the system until the bugs are fixed. However, this is frustrating for developers when they hit lock issues that happen in other subsystems: they cannot test their own code for lock issues until the existing ones are fixed.

        • Detection takes time to run and creates extra synchronization points compared to production environments. It's not surprising that lockdep uses an internal lock to protect the data structures for lock issue detection. However, this lock creates
          synchronization points and may make some issues difficult to detect (because the issues may only happen for a particular event sequence, and the extra synchronization points may prevent such a sequence from happening).

        This session will show some modularization effort for lockdep. The modularization uses a frontend-backend design: the frontend tracks the currently held locks for every task/context and reports lock dependencies to the backend, and the backend maintains the lock dependency graph and detects lock issues based on what the frontend reports.

        Along with the design, a draft implementation will be shown in the session too, providing something concrete to discuss regarding the design and future work.

        Speaker: Boqun Feng
      • 4:30 PM
        Break
      • 14
        Make RCU do less (save power)!

        On battery-powered systems, RCU can be a major consumer of power. Different strategies can be tried to mitigate this, which we will present along with power data. I have also been working on some patches to further reduce RCU activity in frequently called paths like file close. This presentation will discuss test results, mostly on the power consumption side, on battery-powered Android and ChromeOS systems, as well as the new work to delay RCU processing.

        Speakers: Joel Fernandes, Rushikesh Kadam, Uladzislau Rezki
      • 15
        Once upon an API

        Even a seemingly simple API can turn out to have complex and surprising behaviors, as I illustrate by telling the story of an API feature that was added to Linux in 1997 and looking how it interacts (and has evolved) with other parts of the Linux API. These kinds of complexities and surprises of course create pain for user-space programmers, and so I also muse about some of the reasons that these API design problems occur and a few things we can do to reduce the likelihood of such problems reoccurring in the future.

        Speaker: Michael Kerrisk (man7.org Training and Consulting)
    • System Boot and Security MC
      • 16
        Secure bootloader for Confidential Computing

        Confidential computing (CC) provides a solution for data protection with a hardware-based Trusted Execution Environment (TEE) such as Intel TDX, AMD SEV, or ARM RME. Today, Open Virtual Machine Firmware (OVMF) and shim+grub provide the necessary initialization for a confidential virtual machine (VM) guest. More importantly, they act as the chain of trust for measurement to support TEE attestation. In this talk, we would like to introduce the CC measurement infrastructure in OVMF together with shim and grub, and how the VM guest uses the measurement information to support TEE runtime attestation. Finally, we would like to discuss the attestation-based disk encryption solution in CC and compare the options in the pre-boot phase (OVMF), the OS loader phase (grub), and the kernel early boot phase (initrd), along with related cloud use cases.

        Speakers: Ken Lu (Intel), Jiewen Yao
      • 17
        Secure Boot auto enrollment

        Based on a current systemd PR (https://github.com/systemd/systemd/pull/20255) that I submitted, I would like to talk about auto enrollment of Secure Boot.

        I would be especially glad to have feedback on any unanticipated issues. Although it is a systemd PR, I think it fits the System Boot and Security microconference as it deals with Secure Boot.

        One major issue already identified is proprietary signed option ROMs and the rather low deployment of UEFI audit mode.

        Speaker: Vincent Dagonneau
      • 18
        Kernel TEE subsystem evolution

        A Trusted Execution Environment (TEE) is an isolated execution environment running alongside an operating system. It provides the capability to isolate security-critical or trusted code and corresponding resources like memory, devices, etc. This isolation is backed by hardware security features such as Arm TrustZone, AMD Secure Processor, etc.

        This session will focus on the evolution of the TEE subsystem within the kernel, shared memory management between the Linux OS and the TEE, and the concept of the TEE bus. Later, we'll look at its current applications, which include firmware TPM, HWRNG, Trusted Keys, and a PKCS#11 token. Along with this, we will brainstorm on its future use-cases as a DRTM for remote attestation, among others.

        Speaker: Sumit Garg
      • 11:30 AM
        Break
      • 19
        Remote Attestation of IoT devices using a discrete TPM 2.0

        There are billions of networked IoT devices and most of them are vulnerable to remote attacks. We are developing EnactTrust, a remote attestation solution for Arm-based IoT devices. The project started with a PoC for a car manufacturer in 2021.

        Today, we have an open-source agent at GitHub[1] that performs attestation. The EnactTrust agent leverages a discrete TPM 2.0 module and has some unique IoT features like attestation of the TPM’s GPIO for safety-critical embedded systems.

        Currently, we are working on integrating our open-source agent with Arm’s open-source Trusted Firmware implementation. We are targeting both TF-A and TF-M.

        Our goal is to demonstrate bootloader attestation using EnactTrust. Bootloader candidates are TrenchBoot, Tboot, and U-Boot. Especially interesting is the case of U-Boot since it does not have the same level of security capabilities as TrenchBoot and Tboot.

        EnactTrust consists of an agent application (running on the device) and a connection to a private or public cloud[2]. We believe that the security of ARM-based IoT devices can be greatly improved using attestation.

        [1] https://github.com/EnactTrust/enact
        [2] https://a3s.enacttrust.com

        Speakers: Mr Dimitar Tomov (TPM.dev), Mr Svetlozar Kalchev (EnactTrust)
      • 20
        TrenchBoot Update

        Presented here will be an update on TrenchBoot development, with a focus on the Linux Secure Launch upstream activities and the building of the new late launch capability, Secure ReLaunch. The coverage of the upstream activities will focus on the redesign of the Secure Launch start up sequence to accommodate efi-stub's requirement to control Linux setup on EFI platforms. This will include a discussion of the new Dynamic Launch Handler (dl-handler) and the corresponding Secure Launch Resource Table (SLRT). The talk will then progress into presenting the new Secure ReLaunch capability and its use cases. The conclusion will be a short roadmap discussion of what will be coming next for the launch integrity ecosystem.

        Speaker: Daniel Smith (Apertus Solutions, LLC)
    • VFIO/IOMMU/PCI MC
      • 11:30 AM
        Break
    • eBPF & Networking

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 21
        The journey of BPF from restricted C language towards extended and safe C.

        BPF programs can be written in C, Rust, Assembly, and even in Python. The majority of programs are in C. The subset of C usable to write BPF programs was never strictly defined. It started very strict: loops were not allowed, global variables were not available, etc. As the BPF ecosystem grew, the subset of C became bigger. But then something interesting happened: the C language itself became a limiting factor. Compile Once - Run Everywhere technology required new language features, and intrinsics were added. Then type tagging was added. More compiler and language extensions are being developed. BPF programs are now written in what can be considered a flavor of C, and that C is becoming a safe C. Where other languages rely on garbage collection (like golang) or don't prevent memory leaks (like C or C++), this extended and safe C addresses not only that concern but also other bugs typical of C code. This talk will explore whether BPF's safe C will one day become the language of choice for the core kernel and user-space programs.
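
        As a small illustration of the extensions the talk refers to, the BPF C fragment below (a generic example, not taken from the talk) uses a global variable as a user-settable tunable and a CO-RE relocation via BPF_CORE_READ(), so field offsets are fixed up at load time against the running kernel's BTF:

          #include "vmlinux.h"
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_core_read.h>

          pid_t target_pid = 0;   /* global variable, set from user space */

          SEC("tracepoint/syscalls/sys_enter_openat")
          int trace_open(void *ctx)
          {
                  struct task_struct *task;
                  pid_t ppid;

                  task = (struct task_struct *)bpf_get_current_task();
                  /* CO-RE: offsets of real_parent/tgid relocated at load time */
                  ppid = BPF_CORE_READ(task, real_parent, tgid);

                  if (target_pid && ppid != target_pid)
                          return 0;
                  bpf_printk("openat from a child of %d", ppid);
                  return 0;
          }

          char LICENSE[] SEC("license") = "GPL";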

        Speaker: Alexei Starovoitov (Meta)
      • 22
        HID-BPF

        HID (Human Interface Device) is an old protocol which handles input devices. It is supposed to be standard and to allow devices to work without the need for a driver. Unfortunately, it is not standard, merely “standard”.

        The HID subsystem has roughly 80 drivers; half of them fix only one tiny bit, either in the protocol of the device or in the key mapping, for instance.

        Historically, fixing such devices requires users to submit a patch to the kernel. The process of updating the kernel has greatly improved over the past few years, but still, we can not safely fix those devices in-place (without rebooting and risking messing up the system).

        But here is the future: eBPF. eBPF allows loading kernel-space code from user-space.

        Why not apply that to HID devices too? This way we can change the small part of the device that is not working while still relying on the generic processing of the kernel to support it.

        In this talk, we will outline this new feature that we are currently upstreaming, its advantages and why this is the future, for HID at least.
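
        As a rough sketch of what this looks like in practice (the interface is still being upstreamed, so the names below are approximate rather than a stable API), a small BPF program attached to the device's event hook can patch a report in place before the generic HID code processes it:

          /* Approximate shape of the proposed HID-BPF interface. */
          SEC("fmod_ret/hid_bpf_device_event")
          int BPF_PROG(fix_report, struct hid_bpf_ctx *hctx)
          {
                  /* grab the first 4 bytes of the incoming report */
                  __u8 *data = hid_bpf_get_data(hctx, 0, 4);

                  if (!data)
                          return 0;       /* nothing to fix, pass it through */

                  data[1] ^= 0x01;        /* e.g. invert a mis-reported button */
                  return 0;
          }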

        Speaker: Benjamin Tissoires (Red Hat)
      • 23
        eBPF Kernel Scheduling with Ghost

        Ghost is a kernel scheduling class that allows userspace and eBPF programs, called the "agent", to control the scheduler.

        Following up on last year's LPC talk, I'll cover:
        - How BPF works in Ghost
        - An agent that runs completely in BPF: no userspace scheduling required!
        - Implementation details of "Biff": a bpf-hello-world example scheduler.
        - Future work, including CFS-in-BPF, as well as a request for new MAP_TYPEs!

        Speaker: Barret Rhoden (Google)
      • 11:30 AM
        Break
      • 24
        Tuning Linux TCP for data-center networks

        For better or worse, TCP remains the main transport of many hyperscale data-center networks. Optimizing TCP has been a hot topic in both academic research and industry R&D. However, individual research papers often focus on solving a specific problem (e.g. congestion control for data-center incast), and industry solutions are often not public or not generically applicable. Since Linux TCP's default configuration is more or less tuned for wide-area Internet settings, it’s not easy to tune Linux TCP for low-latency data-center environments. For example, simply switching to the well-known “dctcp” congestion control may not fully deliver all the benefits Linux TCP can provide.

        In this talk, we’d like to share our knowledge and best practices from a decade-long experience of tuning TCP for data-center networks and applications, covering congestion control, protocol, and IO enhancements. We will discuss the trade-offs among latency, utilization, CPU, memory, and complexity. In addition, we’ll present inexpensive instrumentation to trace application frame-aware latency beyond general flow-level statistics. It’s worth emphasizing that the goal is not to promote the authors’ own work but to help spark interesting discussions with other data-center networking developers and guide newcomers. After the meeting we hope to synthesize our recommendations into Documentation/networking/tcp_datacenter.txt
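
        As one concrete example of the kind of knob involved (this uses the existing socket API, not anything new from the talk), an application can opt a single connection in to DCTCP instead of changing the system-wide default congestion control:

          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <string.h>
          #include <sys/socket.h>

          /* Switch one socket to DCTCP; the kernel needs the dctcp module
           * (CONFIG_TCP_CONG_DCTCP) available for this to succeed. */
          static int use_dctcp(int fd)
          {
                  const char cc[] = "dctcp";

                  return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                                    cc, strlen(cc));
          }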

        Speaker: Yuchung Cheng (Google)
      • 25
        Can the Linux networking stack be used with very high speed applications?

        Ethernet networking speeds continue to increase – 100G is common today for both NICs and switches, 200G has been available for years, 400G is the current cutting edge with 800G on the horizon. As the speed of the physical layer increases how does S/W scale - specifically, the Linux IP/TCP stack? Is it possible to leverage the increasing line-rate speeds for a single flow? Consider a few example data points about what it means to run at 400 Gbps speeds:

        1. the TCP sequence number wraps 12.5 times a second - i.e., wrapping every 80 msec, and

        2. at an MTU of 1500B, to achieve 400G speeds the system needs to handle 33M pps - i.e., a packet arrives every 30 nsec (for reference, an IPv4 FIB lookup on a modern Xeon processor takes ~25nsec).

        We used an FPGA based setup with an off-the-shelf server and CPU to investigate the Linux networking stack to determine how fast it can be pushed and how well it performs at high rates for a single flow. With this setup tailored specifically to push the kernel’s TCP/IP stack, we were able to achieve a rate of more than 670 Gbps (application data rate) and more than 31 Mpps (different tests) for a single flow. This talk discusses how we achieved those rates, the lessons learned along the way and what it suggests are core requirements for deploying very high speed applications that want to use the Linux networking stack. Some of the topics are well established such as the need for GRO, TSO, zerocopy and a reduction of system calls; others are not so prominent. This talk presents a systematic and comprehensive review of the effect of variables involved and serves as a foundation for future work.

        Speaker: David Ahern
      • 26
        Overview of the BPF networking hooks and user experience in Meta

        BPF has grown rapidly. In the networking stack, a BPF program can do much more than it could a few years ago. It can be overwhelming to figure out which BPF hook should be used, what is available at a particular layer, and why. This talk will go through some of the BPF hooks in the networking stack with use cases from Meta. The talk will also cover some common questions and confusions that users have, and how they could be addressed in the future.

        Speaker: Martin Lau (Meta)
      • 1:30 PM
        Lunch
      • 27
        BPF Signing and IMA integration

        Signing BPF programs has been a long ongoing discussion and there has been some more concrete work and discussions since the BPF office hours talk in June.

        There was a BoF session at the Linux security summit in Austin between BPF folks (KP and Florent) and IMA developers (Mimi, Stefan and Elaine) to agree on a solution to have IMA use BPF signatures.

        The BPF position is to provide maximum flexibility to the user on how the programs are signed. For this, the way the programs are signed (format, kind of hash) and the way the signature is verified should be up to the user. IMA is one of the users of BPF signatures.

        The goal of this session is to discuss a gatekeeper and signing implementation that works with IMA and the options that are available for IMA and agree on a solution to move forward.

        The current kernel convention where IMA hard codes a callback into the security_* hooks is at odds with the BPF philosophy of providing flexibility to the user. But, we do see a common ground that can work the best for BPF, IMA and most importantly, the users.

        Speaker: KP Singh (Google)
      • 28
        Revisiting eBPF Seccomp Filters

        Seccomp, the widely used system-call security module in Linux, is among the few that still exposes classic BPF (cBPF) as the programming interface, instead of the modern eBPF. Due to the limited programmability of cBPF, today's Seccomp filters mostly implement static allow-deny lists. The only way to implement advanced policies is to delegate them to user space (e.g., Seccomp Notify); however, such an approach is error prone due to time-of-check time-of-use issues and costly due to the context switch overhead.
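
        For reference, this is roughly what such a static allow-deny list looks like with today's classic-BPF interface (a deliberately minimal example that omits the usual architecture check):

          #include <stddef.h>
          #include <sys/prctl.h>
          #include <sys/syscall.h>
          #include <linux/filter.h>
          #include <linux/seccomp.h>

          static int install_allowlist(void)
          {
                  struct sock_filter filter[] = {
                          /* load the syscall number */
                          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                   offsetof(struct seccomp_data, nr)),
                          /* allow write(), kill on anything else */
                          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
                          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
                  };
                  struct sock_fprog prog = {
                          .len = sizeof(filter) / sizeof(filter[0]),
                          .filter = filter,
                  };

                  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                          return -1;
                  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
          }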

        Over the past several years, supporting eBPF filters in Seccomp has been brought up (e.g., by Dhillon [1], Hromatka [2], and our team [3]) and has raised many offline discussions on the mailing lists [4]. However, the community has not been convinced that eBPF for Seccomp is either 1) necessary or 2) safe, with opinions like "Seccomp shouldn't need it..." and "rather stick with cBPF until we have an overwhelmingly good reason to use eBPF..." preventing its inclusion.

        We have developed a full-fledged eBPF Seccomp filter support and systematically analyzed its security [5]. In the proposed presentation, using the insight from our system, we will (1) summarize and refute concerns on supporting eBPF Seccomp filters, (2) present our design and implementation with a holistic view, and (3) open the discussion for the next steps.

        Specifically, to show that eBPF for Seccomp is necessary, we describe several security features we built using eBPF Seccomp filters, the integration with container runtimes like crun, and performance benchmark results. To show that it is safe, we further describe the use of root-only eBPF Seccomp in container-based use cases, which strictly obeys current kernel security policies and still improves the usefulness of Seccomp. Further, we will go over the key designs for security, including protecting kernel data, maintaining consistent helper function capability, and the potential integration with IMA (the integrity measurement architecture).

        Finally, we will discuss future opportunities and concerns with allowing unprivileged eBPF Seccomp and possible avenues to address these concerns.

        Reference:
        [1] Dhillon, S., eBPF Seccomp filters. https://lwn.net/Articles/747229/
        [2] Hromatka, T., [RFC PATCH] all: RFC - add support for ebpf.
        https://groups.google.com/g/libseccomp/c/pX6QkVF0F74/m/ZUJlwI5qAwAJ
        [3] Zhu, Y., eBPF seccomp filters, https://lwn.net/Articles/855970/
        [4] Corbet J.,eBPF seccomp() filters, https://lwn.net/Articles/857228/
        [5] https://github.com/xlab-uiuc/seccomp-ebpf-upstream/tree/v2

        Speakers: Jinghao Jia (University of Illinois Urbana-Champaign), Prof. Tianyin Xu (University of Illinois at Urbana-Champaign)
      • 29
        State of kprobes/trampolines batch attachment

        There's an ongoing effort to speed up attaching multiple probes,
        which resulted in the new bpf 'kprobe_multi' link interface. This
        allows fast attachment of many kprobes (thousands) and is now
        supported, for example, in bpftrace.

        A similar interface is also being developed for trampolines, but it's
        a bit more of a bumpy road than it was for kprobes, for various reasons.

        I'll briefly sum up the multi-kprobe interface and some of its current
        users, and mainly focus on the state of the trampoline multi-attach
        API changes.
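
        For reference, the BPF side of a multi-kprobe attachment looks roughly like this with current libbpf conventions (section name and glob are illustrative): a single program attached through one kprobe_multi link to every symbol matching the pattern.

          #include "vmlinux.h"
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_tracing.h>

          SEC("kprobe.multi/tcp_*")
          int trace_tcp(struct pt_regs *ctx)
          {
                  /* bpf_get_func_ip() reports which attached symbol fired */
                  bpf_printk("hit %lx", bpf_get_func_ip(ctx));
                  return 0;
          }

          char LICENSE[] SEC("license") = "GPL";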

        Speaker: Jiri Olsa (Isovalent)
      • 4:30 PM
        Break
      • 30
        Developing eBPF profiler for polyglot cloud-native applications

        One of the important jobs of system-wide profilers is to capture stack traces without requiring recompilation or redeployment of profiled applications. This becomes difficult when the profiler has to deal with binaries compiled from different languages. The heavy lifting for stack unwinding is done by the kernel if frame pointers are present or if ORC (the in-kernel debug information format) data is available. However, most modern compilers have an option to omit frame pointers for a performance gain.

        In this talk, we will describe how we are experimenting with using eBPF to extend the existing stack unwinding facility in the Linux kernel. We will discuss how we walk the stacks of interpreted languages, such as Ruby, as well as runtimes with JITs, like the JVM, and how extending the current stack unwinding facility can be useful for such cases.
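
        The existing in-kernel facility referred to above is also reachable from BPF: a perf_event program can ask the kernel's unwinder for the interrupted task's stack via bpf_get_stackid() on a BPF_MAP_TYPE_STACK_TRACE map (which only helps when frame pointers or equivalent unwind data exist), roughly as sketched below (generic example, not code from the talk).

          #include "vmlinux.h"
          #include <bpf/bpf_helpers.h>

          struct {
                  __uint(type, BPF_MAP_TYPE_STACK_TRACE);
                  __uint(max_entries, 16384);
                  __uint(key_size, sizeof(__u32));
                  __uint(value_size, 127 * sizeof(__u64)); /* max stack depth */
          } stacks SEC(".maps");

          SEC("perf_event")
          int sample_stack(void *ctx)
          {
                  /* record the user-space stack of the interrupted task */
                  bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
                  return 0;
          }

          char LICENSE[] SEC("license") = "GPL";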

        Speakers: Vaishali Thakkar, Javier Honduvilla Coto
      • 31
        Performance insights into eBPF step by step

        Having full visibility throughout the system you build is a
        well-established best practice. Usually one knows which metrics to
        collect, and how and what to profile or instrument, to understand why
        the system exhibits a given level of performance. All of this becomes
        more challenging as soon as an eBPF layer is included.

        In this talk, Dmitrii will shed some light on those bits of your
        service that use eBPF, step by step, with topics such as:

        • How to collect execution metrics of eBPF programs?
        • How can we profile these eBPF programs?
        • What are the common pitfalls to avoid?

        The talk will provide attendees with an approach to analyze and
        reason about the performance of eBPF programs.

        Speaker: Dmitrii Dolgov (Red Hat)
      • 32
        Live coding eBPF with streaming abstractions

        eBPF gives us an extraordinary amount of power, allowing us to attach custom programs to many subsystems in the kernel. We use it to build lots of helpful observability and security tools, but there is a problem: eBPF is hard and nonintuitive for developers who are not familiar with low-level programming concepts.

        In this talk, we will discuss a novel approach to writing programs for the eBPF virtual machine, building on a new set of abstractions borrowed from streaming databases, functional programming, and visual live coding environments. We will see how such abstractions can help us to simplify our development workflows, allowing us to build tools from a set of visual composable blocks. A prototype of a new open source integrated development environment for eBPF, Metalens, will be demonstrated.

        Speaker: Nikita Baksalyar
    • 1:30 PM
      Lunch
    • Kernel Testing & Dependability MC
      • 33
        Integrating testing with maintainer flows

        Currently we often have a fairly big disconnect between generic testing and quality efforts and the work of developers and maintainers. There is a lot of testing done by people working on the kernel that is not covered by the general testing efforts, and conversely it is often hard for developers and maintainers to access broader community resources when extra testing is needed. This impacts both development and downstream users like stable kernels.

        How can we join these efforts together?

        Areas with successful reuse include:

        • kselftest
        • igt
        • v4l-compliance

        Other testing efforts are more confined to their domains:

        • Intel's audio testing setup
        • Filesystem coverage (xfstests and so on)

        Ideas/topics for discussion:

        • Improved tooling
        • Testsuite/system configuration
        • Proposals from Cyril for describing hardware for tests
        • Test interface standards
        • ...
        Speakers: Mark Brown, Veronika Kabatova
      • 34
        Checking your work: Linux kernel testing and CI

        There are a number of tools available for writing tests in the kernel. One
        of them is kselftest, which is a system for writing end-to-end tests. Another
        is KUnit, which runs unit tests directly in the kernel.

        These testing tools are very useful, but they lack the ability for maintainers
        to configure how the tests should be run. For example, patches submitted to the
        RCU tree and branch should run a quick subset of the full gamut of rcutorture
        tests, whereas it is prudent to run heavyweight and comprehensive rcutorture
        tests ~once / day on the linux-rcu tree, as well as various mainline trees,
        etc. Similarly, cgroup tests can be run on every patch sent to the cgroup tree, but
        certain testcases have known flakiness that could be signaled by the developer.

        Maintainers and contributors would benefit from being able to configure their test
        suites to illustrate the intent of individual testcases, and the suite at large, to
        signal both to contributors and to CI systems, how the tests should be run and
        interpreted. This MC discussion would ask the question of whether we should implement
        this, and if so, what it would look like.

        Key problems:
        - Updating the kernel test subsystem structure (kselftest, KUnit) to allow maintainers to express configurations for test suites. The idea here is to avoid developers having to include patches to each separate CI system to include and configure their test, and instead have all of that located in configuration files that are included in the test suite, with CI systems consuming and using this information as necessary.
        - Discuss whether we should bring xfstests into the kernel source tree, and whether we could make it a kselftest.
        - Discuss whether we can and should include coverage information in KernelCI.

        Key people:
        - Guillaume Tucker and other KernelCI maintainers
        - Ideally Shuah Khan as kselftest maintainer, and Brendan Higgins as KUnit maintainer
        - Anyone interested in testing / CI signal (Thorsten Leemhuis perhaps, given his KR talk about how to submit an actionable kernel bug report?)

        Speaker: David Vernet (Meta)
      • 35
        Making syzbot reports more developer-friendly

        Since 2017, syzbot (powered by syzkaller - a coverage-guided kernel fuzzer) has already reported thousands of bugs to the Linux kernel mailing lists and thousands have already been fixed.

        However, as our statistics show, a lot of reported issues get fixed only after a long delay or don't get fixed at all. That means we could still do much better in addressing the needs of the community than we currently do.

        This talk will summarize and present the changes that have been made to syzbot over the last year. Also, we want to share and discuss with the audience our further plans related to making our reports and our tool more developer-friendly.

        Speaker: Dmitry Vyukov (Google)
      • 36
        Designing UAPI for Fuzz-ability

        Fuzzing (randomized testing) has become an important part of kernel quality assurance. syzkaller/syzbot report hundreds of bugs each month. However, the fuzzer's coverage of the kernel code is far from complete, and some subsystems are easier to fuzz/reach, while others are harder or impossible to fuzz/reach.
        In this talk Dmitry will talk about patterns and anti-patterns of UAPI/subsystem design with respect to fuzz-ability:

        • what makes it impossible to fuzz a subsystem
        • what leads to unreproducible crashes
        • why a subsystem may be excluded from fuzzing
        • what makes a perfect interface/subsystem for fuzzing
        Speaker: Dmitry Vyukov (Google)
      • 4:30 PM
        Break
      • 37
        The emergence of a virtual QA team for the Linux kernel

        The Linux kernel community has gradually formed a virtual QA team and testing process, from developing unit tests, to testing services (various CIs covering build, fuzzing and runtime), to result consolidation (KCIDB) and bug scrubbing (regzbot), which largely formalizes the community-wide testing effort. 0-Day CI is glad to be part of this progress.

        In this topic, we will talk about the status of this trend and how each part works together, along with a few words regarding 0-Day CI’s current efforts in this area. Then we want to exchange ideas and discuss enhancements or missing parts of this virtual team, such as:

        • Common testing methodology, such as bisection optimization, selective testing to reduce overall tests based on what is really changed.
        • A test matrix mapping features to tests, so as to know whether a feature is covered well enough
        • Connections with OSVs, users, and feature developers to convert their tests into a common test pool such as kselftests
        • Reduce breakage between different architectures to avoid test bias
        • Roadmap to adopt new test tools such as kismet
        • Extending shift-left testing to detect issues earlier, which should reduce the overall number of Reported-by tags on mainline

        We look forward to having more collaboration with other players in the community to jointly move this trend forward.

        Speaker: Philip Li
      • 38
        KUnit: Function Redirection and More

        Despite everyone's efforts, there's still more kernel to test. One problem area that keeps popping up is the need to replace functions with 'fake' or 'mock' equivalents in order to test hardware or less-self-contained subsystems. We will discuss two methods of replacing functions: one based on ftrace, and another based on "static stubbing" using a function prologue.

        We will also provide a brief "KUnit year in review" retrospective, and a prospective look on what we are doing/what we hope to achieve in the coming year.

        Speakers: Brendan Higgins (Google LLC), David Gow (Google)
      • 39
        How to introduce KUnit to physical device drivers?

        Unit testing is a great way to ensure code reliability, leading to organic improvements, as it's often possible to integrate it with developers' workflows. It is also of great help when refactoring, which should be an essential task in large code bases. When it comes to the Linux kernel, the KUnit framework looks very promising, as it works natively from inside the kernel, and provides an infrastructure for running tests easily.
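
        For readers unfamiliar with the framework, a KUnit test is just kernel code: a suite of small functions that make assertions, compiled into the kernel or a module and executed by KUnit itself, along these lines (a generic example, not tied to any particular driver):

          #include <kunit/test.h>

          static void example_math_test(struct kunit *test)
          {
                  /* a trivial assertion standing in for real driver logic */
                  KUNIT_EXPECT_EQ(test, 4, 2 + 2);
          }

          static struct kunit_case example_test_cases[] = {
                  KUNIT_CASE(example_math_test),
                  {}
          };

          static struct kunit_suite example_test_suite = {
                  .name = "example",
                  .test_cases = example_test_cases,
          };
          kunit_test_suite(example_test_suite);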

        We are seeing a growing interest in unit testing on the DRM subsystem, with amazing initiatives to add KUnit tests to the DRM API. Moreover, three GSoC projects under the X.Org Foundation umbrella target unit tests for AMDGPU display drivers, as it is currently the largest one in the kernel. It is, thus, of great importance to discuss problems and possible solutions regarding the implementation of KUnit tests, especially for hardware drivers.

        Bearing this in mind, and as part of our GSoC projects [1], we introduce unit testing to the AMDGPU driver, starting with the Display Mode Library (DML), a library focused on mathematical calculations for DCN (Display Core Next); we also explore the addition of new tests to DCE (Display Controller Engine). Since AMD's CI already relies on IGT GPU Tools (a test suite for DRM drivers), we also propose an integration between it and KUnit which allows DRM KUnit tests to be run through IGT as well.

        In this talk, we present the tests' development process and the current state of KUnit in GPU drivers. We discuss the obstacles we faced during the project, such as generating coverage reports, mocking a physical device, and especially in regards to the implementation of tests for the AMDGPU driver stack, with the additional difficulties associated with making them IGT compatible. Finally, we want to discuss with the community lessons learned using KUnit in GPU drivers and how to reuse these strategies for other GPU drivers and also drivers in other subsystems.

        [1] https://summerofcode.withgoogle.com/programs/2022/organizations/xorg-foundation

        Speakers: Isabella Basso, Magali Lemes, Maíra Canal, Tales da Aparecida
      • 40
        Simple KernelCI Labs with Labgrid

        While most current KernelCI labs use Lava to deploy and test kernels
        on real hardware, other approaches are supported by KernelCI's design.

        To allow using boards connected to an existing labgrid installation,
        Jan built a small adapter from KernelCI's API to labgrid's hardware
        control Python API.

        As labgrid has a way to support board-specific deployment steps,
        this should also make it easier to run tests on boards which are not
        easily supported in Lava (such as boards without Ethernet) or which
        require special button settings.

        The main goal of the discussion is to collect feedback from the MC
        participants, on how to make this adapter most useful for the KernelCI
        community.

        Speaker: Jan Lübbe (Pengutronix)
    • Rust MC

      Rust is a systems programming language that is making great strides in becoming the next big one in the domain.

      Rust for Linux aims to bring it into the kernel since it has a key property that makes it very interesting to consider as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc.

      This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics.

      Possible Rust for Linux topics:
      - Bringing Rust into the kernel (e.g. status update, next steps...).
      - Use cases for Rust around the kernel (e.g. drivers, subsystems...).
      - Integration with kernel systems and infrastructure (e.g. wrapping existing subsystems safely, build system, documentation, testing, maintenance...).

      Possible Rust topics:
      - Language and standard library (e.g. upcoming features, memory model...).
      - Compilers and codegen (e.g. rustc improvements, LLVM and Rust, rustc_codegen_gcc, gccrs...).
      - Other tooling and new ideas (Cargo, Miri, Clippy, Compiler Explorer, Coccinelle for Rust...).
      - Educational material.
      - Any other Rust topic within the Linux ecosystem.

      • 3:00 PM
        Welcome to the Rust MC
      • 41
        Rust GCC Front-end

        Toolchain support for the Rust language is a question central to adopting Rust in the Linux kernel. So far, the LLVM-based rustc compiler has been the only option for Rust language compilers. GCC Rust is a work-in-progress project to add a fully-featured front-end for Rust to the GNU toolchain. As a part of GCC, this compiler benefits from the common GCC flags, optimizations, and back-end targets.

        As work on the project continues, supporting Linux kernel development and the adoption of Rust in the kernel has become an essential guiding target. In this discussion, we would like to introduce the project's current state and consult with Rust-for-Linux developers about their needs from the toolchain; for example, how to prioritize work in Rust GCC or how we handle language versioning. Some particular topics for discussion:

        • Procedural macros
        • libcore, liballoc
        • Language versioning
        • Debug integration
        • Unstable language features
        • Bindings and FFI
        Speakers: Philip Herron (Embecosm), David Faust (Oracle)
      • 42
        rustc_codegen_gcc: A gcc codegen for the Rust compiler

        The Rust programming language is becoming more and more popular: it's even being considered as another language allowed in the Linux kernel.
        That brought up the question of architecture support, as the official Rust compiler is based on LLVM.
        This project, rustc_codegen_gcc, is meant to plug the GCC backend into the Rust compiler frontend with relatively low effort: it's a shared library reusing the same API that the Rust compiler provides to the cranelift backend.
        As such, it could be used by some Linux projects as a way to provide their Rust software on more architectures.
        This talk will present the project and its progress, and will feature a discussion about what needs to be done to start using it for projects like Rust for Linux.

        Speaker: Antoni Boucher
      • 43
        Rust for Linux: Status Update

        Rust is a systems programming language with desirable properties in the context of the Linux kernel, such as no undefined behavior in its safe subset (when unsafe code is sound), including memory safety and the absence of data races.

        Rust for Linux is a project that aims to bring Rust support to the Linux kernel as a first-class language. This means providing support for writing kernel modules in Rust, such as drivers or filesystems, with as little unsafe code as possible (potentially none). That is, it prevents misusing kernel APIs with respect to memory-safety.

        This session will give a status update on the project:

        • What features are currently supported.
        • Infrastructure improvements.
        • Rust unstable (nightly) features status.
        • Rust ecosystem news: language, toolchains, etc.
        • Planned features and future.
        Speakers: Miguel Ojeda, Wedson Almeida Filho
      • 4:35 PM
        Coffee Break
      • 44
        Linux Rust NVMe Driver Status Update

        Status update from ongoing work on a Rust NVMe driver for Linux. Benchmark numbers, architectural challenges, etc.

        Speaker: Andreas Hindborg (Western Digital)
      • 45
        The Integration of Rust with Kernel Testing Service

        Rust for Linux aims to bring Rust into the kernel as the second programming language. With the great progress toward this target, a corresponding testing service for Rust is becoming a potential requirement.

        The 0-Day CI team has been working closely with the maintainers of Rust for Linux to integrate Rust into the kernel test robot. We'd like to share our experience of enabling Rust testing. Here is some of the progress we have made:

        • The kernel test robot is a bisection-driven CI: we not only scan for build errors, but also run bisections to look for the first bad commits which introduced them. To maintain the bisection capability, we set up automatic upgrades and adaptive selection of the Rust toolchain, so as to match the toolchain version required by different commits during bisection.

        • We provide both random configs and a specific config with all Rust samples enabled, to get different levels of code coverage for Rust in the kernel.

        Most of the work we have done so far is about building the kernel with Rust enabled, and we are considering runtime tests as the next step. We are also interested in various topics which may help to enhance Rust testing. Some further work we are looking forward to:

        • Boot/fuzzing testing for Rust code such as leveraging syzkaller.

        • Functional testing for core Rust code and modules, which could be part of a common framework like KUnit/kselftests so it can easily be used by CI services.

        • Collect and aggregate Rust code coverage data in kernel to better design and execute tests.

        • Wrapping a tool to set up the Rust environment based on min-tool-version.sh, for consistent compilation and issue reproduction.

        • Testing the potential impact of different Rust compile options, such as the optimization level and the build assert config.

        We hope that our work can inspire other CIs wishing to integrate Rust, and help facilitate the development of Rust for Linux.

        Speaker: Mr Yujie Liu (Intel)
      • 46
        Rust in the Kernel (via eBPF)

        We are very excited (and impatient) to have Rust supported in the Kernel. In fact we are so impatient we decided to develop a means of getting Rust in the Kernel today, using eBPF!

        Aya is an eBPF library built with a focus on operability and developer experience. It allows both user-land and kernel-land programs to be written in Rust - and even allows for sharing of code between the two! It has minimal dependencies and supports BPF Compile Once - Run Everywhere (CO-RE). When linked with musl, it creates a truly portable, self-contained binary that can be deployed on many Linux distributions and kernel versions.

        In this talk we would like to deep dive into the present state of Aya, with focus on:

        • How it works
        • Currently supported features
        • How Rust for Linux and Aya can benefit from each other
        • Our future plans, which include changes in Rust ecosystem
        Speakers: Mr Dave Tucker (Red Hat), Michal Rostecki (Deepfence Inc)
      • 6:20 PM
        Buffer & Farewell from the Rust MC
    • Service Management and systemd MC
      • 47
        systemd cgroup delegation and control processes

        systemd manages the cgroup hierarchy from the root.
        This is considered an exclusive operation and it is sufficient when system
        units don't encompass any internal cgroup structure.
        To facilitate arbitrary needs of units, it is possible to delegate the subtree
        to the unit (a necessity for such units executing as unprivileged users).
        However, the unified cgroup hierarchy comes with a so-called internal node
        constraint that prevents hosting processes in internal nodes of the cgroup tree
        (when controllers are enabled).

        This creates a potential conflict between processes of the delegated unit and
        processes that systemd needs to run on behalf of the unit (e.g. ExecReload=).
        Currently, it is avoided by putting systemd control processes into an auxiliary
        child cgroup directly under delegated subtree root.
        This approach is broken when the subtree delegation is used to enable threaded
        cgroups since those require explicit setup and the auxiliary cgroup would miss
        that.
        Generally, this is a problem of placing the control and payload processes
        within the cgroup hierarchy.

        I'm putting forward a few patches that allow per-unit configuration of target
        cgroup of control and payload processes for units that have delegated
        subtrees.
        This is a generic approach that keeps a backwards compatible default, avoids
        creation of unnecessary wrap cgroups and additionally allows new customization
        of control process execution.

        It is a simple idea to present; it brings the topic up for discussion and
        comparison with similar situations that are affected by the internal node
        constraint too (e.g. joining a container). The goal is to come up with a
        consensus, or at least a direction, on how to structure cgroup trees for
        delegated units that work well both for controller and threaded delegation.

        This presentation and discussion will fit in a slot of 20 minutes.

        Speaker: Michal Koutný
      • 48
        #snapsafe: restoring uniqueness in Virtual Machine clones

        short version

        When a virtual machine gets cloned, it still contains old data that it believes is unique - random number generation seeds, UUIDs, etc. Linux recently included support for VMGenID to reseed its in-kernel PRNG, but all other RNGs and UUIDs are still identical after a clone.

        In this session, we will discuss approaches to solve this and present experiments we have worked on, such as creating a user-space-readable system generation counter and going through a systemd inhibitor list for pre-snapshot/post-snapshot phases.

        long(er) version

        Linux recently added support for the Virtual Machine Generation ID
        (VMGenID) feature, an emulated device that informs the guest kernel about VM
        restore events by exposing a 128-bit UUID which changes every time a VM is
        restored from a snapshot. The kernel uses the UUID to reseed its PRNG, thus
        de-duplicating the PRNG state across VMs.

        Although VMGenID is definitely a step in the right direction, it does
        not provide a mechanism for notifying user-space applications of VM restore
        events. In this presentation, we introduce Virtual Machine Generation Counter,
        an extension to vmgenid which provides a low-latency and race-free mechanism
        for communicating restore events to user-space. Moreover, we will speak about
        why VM Generation Counter is not enough for ensuring across-the-stack snapshot
        safety. We will present an effort which builds on top of Systemd inhibitor
        locks to make snapshot-restore cycle a first-class citizen in the life-cycle of
        a system, achieving end-to-end snapshot safety
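
        As a purely illustrative sketch of how user space could consume such a
        counter (the sysfs path and record format below are assumptions made for the
        example, not the proposed uAPI), a library handing out UUIDs or randomness
        would recheck the generation before each use and reseed on change:

          /* Illustrative only: poll a VM generation counter and reseed when the
           * VM has been restored.  The path below is hypothetical. */
          #include <stdio.h>

          static unsigned int cached_gen;

          static void check_generation(void)
          {
                  unsigned int gen = cached_gen;
                  FILE *f = fopen("/sys/devices/virtual/misc/vmgenid/generation", "r");

                  if (f) {
                          if (fscanf(f, "%u", &gen) != 1)
                                  gen = cached_gen;
                          fclose(f);
                  }
                  if (gen != cached_gen) {
                          cached_gen = gen;
                          /* reseed user-space PRNGs, regenerate cached UUIDs, ... */
                  }
          }

          int main(void)
          {
                  check_generation();     /* call before producing randomness/UUIDs */
                  return 0;
          }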

        Speaker: Babis Chalios (Amazon Web Services)
      • 49
        Slimming down the journal

        In this talk, I'll discuss the new proposed compact mode for systemd-journald. Via a number of different optimizations, we can substantially reduce the disk space used by systemd-journald. I'll discuss each of the optimizations that were implemented, as well as potential improvements that might further reduce disk usage but haven't been implemented yet.

        Accompanying PR: https://github.com/systemd/systemd/pull/21183

        Speaker: Daan De Meyer
      • 50
        New design for initrds

        Distributions ship signed kernels, but initrds are generally built locally. Each machine gets a "unique" initrd, which means they cannot be signed by the distro, the QA process is hard, and development of features for the initrd duplicates work done elsewhere.

        Systemd has gained "system extensions" (sysexts, runtime additions to the root file system), and "credentials" (secure storage of secrets bound to a TPM). Together, those features can be used to provide signed initrds built by the distro, like the kernel. Sysexts and credentials provide a mechanism for local extensibility: kernel-commandline configuration,
        secrets for authentication during emergency logins, additional functionality to be included in the initrd, e.g. an sshd server, other tweaks and customizations.

        Mkosi-initrd is a project to build such initrds directly from distribution rpms (with support for dm-verity, signatures, sysexts). We think that such an approach will be more maintainable than the current approaches using dracut/mkinitcpio/mkinitramfs. (It also assumes we use systemd to the full extent in the initrd.)

        During the talk I want to discuss how the new design works at the technical level, but also how distros can use it to provide more secure and more manageable initrds, and the security assumptions and implications.

        Speaker: Zbigniew Jędrzejewski-Szmek (Red Hat)
      • 51
        Towards Secure Unified Kernel Images for Generic Linux Distributions and Everyone Else

        In this talk we'll have a look at:

        • systemd-stub (the UEFI stub for the Linux kernel shipped with systemd)
        • unified kernels (i.e. kernel images glued together from systemd-stub, the kernel itself, an initrd, and more)
        • systemd-sysext (an extension mechanism for initrd images and OS images)
        • systemd service credentials (a secure way to pass authenticated and encrypted bits of information to services, possibly stored on untrusted media)
        • systemd's Verity support (i.e. setup logic for file system images authenticated by the kernel on IO, via dm-verity)
        • systemd's TPM2 support (i.e. ability to lock credentials or disks to TPM2 devices and software state)
        • systemd's LUKS support (i.e. ability to encrypt disks, possibly locked to TPM2)

        And all that with the goal of providing a conceptual framework for implementing simple unified kernel images that are immutable yet extensible and parameterizable, are fully authenticated and measured, and that allow binding the root fs encryption or verity to them, in a reasonably manageable way.

        The intention is to show a path for generic distributions to make use of UEFI SecureBoot and actually provide useful features for a trusted boot, putting them closer to competing OSes such as Windows, MacOS and ChromeOS, without losing too much of the generic character of the classic Linux distributions.

        Speaker: Lennart Poettering
    • BOFs Session: Birds of a Feather (BoF)
    • CPU Isolation MC
      • 52
        CPU isolation tuning through cpuset

        A long term project for CPU isolation is to allow its features to be enabled and disabled through cpusets. This includes nohz_full, unbound load affinity involving kthreads, workqueues and timers, managed IRQs, RCU nocb mode, etc... These behaviors are currently fixed in stone at boot time and can't be changed until the next reboot... The purpose is to allow tuning these at runtime, which happens to be very challenging.

        Let's explore the current state of the art!

        Speaker: Frederic Weisbecker (Suse)
      • 53
        Isolation aware smp_call_function/queue_work_on APIs

        Changes to smp_call_function/queue_work_on style APIs to take CPU isolation
        into consideration; more specifically, we would like these APIs to possibly
        return errors to callers that can handle them.
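
        A hedged sketch of what the caller side could look like with such an API; the function name queue_work_on_housekeeping() and the -EPERM convention are placeholders invented for this example, not an existing or agreed-upon interface:

          #include <linux/workqueue.h>

          /* Placeholders for the sketch -- not existing kernel APIs. */
          extern int queue_work_on_housekeeping(int cpu, struct workqueue_struct *wq,
                                                struct work_struct *work);
          extern void fold_stats_locally(int cpu);

          /* A per-CPU statistics flush that can tolerate being refused. */
          static void flush_stats_on_cpu(int cpu, struct work_struct *work)
          {
                  int ret = queue_work_on_housekeeping(cpu, system_wq, work);

                  if (ret == -EPERM) {
                          /* CPU is isolated: fall back instead of disturbing it,
                           * e.g. fold the data on the local CPU. */
                          fold_stats_locally(cpu);
                  }
          }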

        Speaker: Marcelo Tosatti (Red Hat)
      • 54
        Make RCU do less (and disturb CPUs less)!

        CPUs can be disturbed quite easily by RCU. This can hurt power, especially on battery-powered systems, where RCU can be a major consumer of power. Different strategies can be tried to mitigate this, which we will show along with power data. Also, I have been working on some patches to further reduce RCU activity in frequently-called paths like file close. This presentation will discuss test results, mostly on the power consumption side, on battery-powered Android and ChromeOS systems, as well as the new work mentioned above to delay RCU processing.

        Speakers: Joel Fernandes, Mr Rushikesh Kadam, Uladzislau Rezki
      • 11:50 AM
        Break
      • 55
        CPU isolation vs jailbreaking IPIs

        CPU isolation comes with a handful of cpumasks to help determine which CPUs can
        sanely be interrupted, but those are not always checked when sending an IPI, nor
        is it always obvious whether a given cross-call could be omitted (or delayed) if
        targeting an isolated CPU.

        [1] (with [2] and [3] as required foundations) shows a way to defer cross-call
        work targeting isolated CPUs to the next kernel entry, but still requires
        manual patching of the actual cross-call.

        A grep for "on_each_cpu()" and "smp_call()" on a mainline kernel yields about
        350 results. This slot will be about discussing ways to detect and classify
        those (remote data retrieval, system wide synchronization...), if and how to
        patch them and where to draw the line(s) on system administrator configuration
        vs what is expected of the kernel.
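
        For cross-calls that are best-effort (statistics, cache trimming and the
        like), one pattern available today is to target only housekeeping CPUs, as in
        the rough kernel-side sketch below; which housekeeping type is appropriate,
        and whether skipping or deferring is correct, is exactly the per-callsite
        classification question raised above:

          #include <linux/sched/isolation.h>
          #include <linux/smp.h>

          /* Hypothetical best-effort callback, used only to illustrate the pattern. */
          static void drain_local_caches(void *info)
          {
                  /* per-CPU work that isolated CPUs can afford to skip or defer */
          }

          static void drain_caches_on_housekeeping_cpus(void)
          {
                  /* Only interrupt CPUs not isolated from unbound work; HK_TYPE_MISC
                   * is a placeholder -- the right type depends on the cross-call. */
                  on_each_cpu_mask(housekeeping_cpumask(HK_TYPE_MISC),
                                   drain_local_caches, NULL, true);
          }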

        Speaker: Valentin Schneider (Red Hat)
      • 56
        rtla osnoise: what is missing?

        The osnoise tracers enable the simulation of a common HPC workload while tracing all the external sources of noise in an optimized way. This was discussed two years ago. The rtla osnoise tool adds an easy-to-use interface for osnoise, bringing the tracer to the masses. rtla was discussed last year. These tools are now available and in use by members of this community in their daily activities.

        But that is just the minimum implementation, and there is lots of work to do. For example:

        • The addition of other types of workload - not only reading time
        • Include information about processor power usage
        • Usage of other clock sources
        • Inclusion of features to identify the source of IPIs

        And so on.

        In this discussion, the community is invited to share ideas, propose features and prioritize the TODO list.

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
    • Confidential Computing MC
      • 11:30 AM
        Break
    • LPC Refereed Track
      • 57
        How I started chasing speculative type confusion bugs in the kernel and ended up with 'real' ones

        This talk will illustrate my journey in kernel development as a PhD student in Computer Systems Security. I've started with Kasper, a tool I have co-designed and implemented, that finds speculative vulnerabilities in the Linux kernel. With the help of compilers Kasper emulates speculative execution to apply sanitizers on the speculative path.
        Building a generic vulnerability scanner allows finding gadgets that were previously undiscovered by pattern matching with static analysis. Spectre is not limited to a bounds check bypass! Kasper tries to find speculative gadgets and present them in a web UI for developers to analyse. I will also discuss ongoing efforts to improve the precision of the analysis and reason over practical exploitability.

        After we found a speculative type confusion within the list iterator macros, I posted a patch set with a suggested mitigation strategy. By looking at different uses of the list iterator variable after the loop, I entered the territory of actual type confusions. I will also discuss ongoing efforts in building an automatic tool for the Linux kernel to detect invalid downcasts with container_of, since they otherwise stay completely undetected. We would also like to open a discussion with the audience and welcome feedback from the community.
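
        For reference, the post-loop pattern in question looks roughly like the sketch below (names are made up); when the loop finds no match, the iterator variable is derived from the list head via container_of() and no longer points at an object of the expected type:

          #include <linux/list.h>

          struct foo {
                  int key;
                  struct list_head entry;
          };

          static struct foo *find_foo(struct list_head *head, int key)
          {
                  struct foo *iter;

                  list_for_each_entry(iter, head, entry)
                          if (iter->key == key)
                                  return iter;
                  /* Past this point 'iter' is computed via container_of() from the
                   * list head itself -- using it here (architecturally or
                   * speculatively) is the type confusion discussed in the talk;
                   * returning a sentinel avoids it. */
                  return NULL;
          }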

        Speaker: Jakob Koschel (VUSec Amsterdam)
      • 58
        Profiling data structures

        The Linux perf tools show where, in terms of code, a myriad of events take place (cache misses, CPU cycles, etc), resolving instruction pointer addresses to functions in the kernel, BPF or user space.

        There are tools such as 'perf mem' and 'perf c2c' that help translate data addresses where events take place into variables. Both aspects will be described: where the data comes from, such as AMD IBS, Intel PEBS and similar facilities in ARM that are now being enabled in perf, as well as how these perf tools use that data to provide 'perf report' output.

        The open space is data structure profiling and annotation, that is, to print a data structure and show how data accesses cause cache activity, and in what order, mapping back not just to a variable but to its type, to help in reorganizing data structures in an optimal fashion to avoid false sharing and maximize cache utilization.
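
        As a trivial, made-up example of the kind of layout problem such profiling is meant to expose: two hot counters that happen to share a cache line will bounce that line between CPUs, and the fix is simply to separate them:

          /* Illustrative only: rx_packets and tx_packets are updated by different
           * CPUs but share a cache line, so each update invalidates the other
           * CPU's copy (false sharing). */
          struct counters {
                  unsigned long rx_packets;       /* written by the receive path  */
                  unsigned long tx_packets;       /* written by the transmit path */
          };

          /* One possible fix: put the hot writers on separate cache lines. */
          struct counters_fixed {
                  unsigned long rx_packets __attribute__((aligned(64)));
                  unsigned long tx_packets __attribute__((aligned(64)));
          };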

        The talk will try to show recent efforts into bringing together the Linux perf tools and pahole and the problems that remain in mapping back from things like cache misses to variables and types.

        Speaker: Arnaldo Carvalho de Melo (Red Hat Inc.)
      • 11:30 AM
        Break
      • 59
        Inside the Linux Kernel Random Number Generator

        Over the last year, the kernel’s random number generator has seen significant changes and modernization. Long a contentious topic, filled with all sorts of opinions on how to do things right, the RNG is now converging on a particular threat model, and makes use of cryptographic constructions to meet that threat model. This talk will be an in depth look at the various algorithms and techniques used inside of random.c, its history and evolution over time, and ongoing challenges. It will touch on issues such as entropy collection, entropy estimation, boot time blocking, hardware cycle counters, interrupt handlers, hash functions, stream ciphers, cryptographic sponges, LFSRs, RDRAND and related instructions, bootloader-provided randomness, embedded hardware, virtual machine snapshots, seed files, academic concerns versus practical concerns, performance, and more. We’ll also take a look at the interfaces the kernel exposes and how these interact with various userspace concerns. The intent is to provide an update on everything you’ve always wondered about how the RNG works, how it should work, and the significant challenges we still face. While this talk will address cryptographic issues in depth, no cryptography background is needed. Rather, we’ll be approaching this from a kernel design perspective and soliciting kernel-based solutions to remaining problems.
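
        For reference, the simplest of those userspace-facing interfaces is the getrandom(2) system call; a minimal usage sketch:

          #include <stdio.h>
          #include <sys/random.h>

          int main(void)
          {
                  unsigned char key[32];

                  /* With flags == 0 this blocks only until the kernel RNG is
                   * initialized, and never again afterwards. */
                  if (getrandom(key, sizeof(key), 0) != sizeof(key)) {
                          perror("getrandom");
                          return 1;
                  }
                  printf("got %zu random bytes\n", sizeof(key));
                  return 0;
          }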

        Speaker: Jason Donenfeld
      • 60
        Live in a world with multiple memory types

        Initially, all memory was DRAM; then came graphics memory, PMEM,
        CXL, ... The Linux kernel has recently gained the basic support to
        manage systems with multiple memory types and memory tiers, and the
        ability to optimize performance by demoting/promoting memory between
        the tiers. And we are working on enhancing Linux's capabilities further.

        In this talk, we will discuss the current development and future
        direction for managing and optimizing these systems, including:

        • Explicit memory tiers and user space interface
        • Support complex memory topology with help of firmware and device drivers
        • Use NUMA memory policy and cpusets to help manage memory types
        • Possible improvement of demoting with MGLRU
        • Further optimize page promoting with hot page selection and alternatives
        • Control the thrashing among memory types
        • Possible user space based demoting/promoting

        We also want to discuss the possible solution choices and
        interfaces in kernel and user space.
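
        As one concrete example for the NUMA memory policy point above, placement on
        a particular tier can already be expressed with the existing NUMA APIs; a
        rough sketch using libnuma, where node 2 standing in for a slower (e.g.
        CXL-attached) node is an assumption made for the example:

          /* Sketch: explicitly place an allocation on a specific memory node.
           * Node 2 is an arbitrary assumption; build with -lnuma. */
          #include <numa.h>
          #include <stdio.h>

          int main(void)
          {
                  size_t sz = 64UL << 20;
                  void *cold_buf;

                  if (numa_available() < 0) {
                          fprintf(stderr, "NUMA not available\n");
                          return 1;
                  }
                  cold_buf = numa_alloc_onnode(sz, 2);   /* "cold" data on the far tier */
                  if (!cold_buf)
                          return 1;
                  /* ... use cold_buf ... */
                  numa_free(cold_buf, sz);
                  return 0;
          }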

        Speaker: Mr Ying Huang
      • 1:30 PM
        Lunch
      • 61
        Kernel Live Patching at Scale

        Kernel live patching (KLP) makes it possible to apply quick fixes to a live Linux kernel, without having to shut down the workload to reboot a server. The kpatch tool chain and the livepatch infrastructure generally work well. However, using them on a closely monitored fleet with several million servers uncovers many corner cases. During the deployment of KLP at Meta, we ran into issues, including performance regressions, conflicts with tracing & monitoring tools, and KLP transitions sporadically failing depending on the behavior of the kernel at the time the patch is applied. In this presentation, we will share our experiences working with KLP at scale, describe the top issues we are facing, and discuss some ideas for future improvements.

        First, we would like to briefly introduce how we build, deploy, and monitor KLPs at scale. We will then present some recent work to improve KLP infrastructure, including: eliminating performance hit when applying KLPs; making sure KLP works well with various tracing mechanisms; and fixing various corner cases with kpatch-build tool chain and livepatch infrastructure. Finally, we would like to discuss the remaining issues with KLP at scale, and how to address them. Specifically, we will present different reasons for KLP transition errors, and a few ideas/WIPs to address these errors.

        Speakers: Song Liu (Meta), Rik van Riel (Meta), David Vernet (Meta)
      • 62
        TCP memory isolation on multi-tenant servers

        On Linux, the tcp_mem sysctl is used to limit the amount of memory consumed by active TCP connections. However, that limit is shared between all the jobs running on the system. Potentially, a low priority job can hog all the available TCP memory and starve the high priority jobs collocated with it. Indeed, we have seen production incidents of low priority jobs negatively impacting the network performance of collocated high priority jobs.

        Through cgroups, Linux does provide TCP memory accounting and isolation for the jobs running on the system but that comes with its own set of challenges which can be categorized into two buckets:

        1. New and unexpected semantics of memory pressure and OOM for cgroup based TCP memory accounting.
        2. Logistical challenges related to resource and quota management for large infrastructures running millions of jobs.

        This is ongoing work and new challenges keep popping up as we expand cgroup based TCP memory accounting in our infrastructure. In this presentation we want to share our experience in tackling these challenges, and we would love to hear how others in the community have approached the problem of TCP memory isolation on their infrastructure.

        Speakers: Christian Warloe (Google), Shakeel Butt (Google), Wei Wang (Google)
      • 4:30 PM
        Break
      • 63
        Meta’s CXL Journey and Learnings in Linux

        Compute Express Link (CXL) is a new open interconnect technology built on top of PCIe.
        Among other features, it enables memory expansion, unified system address space and cache
        coherency. It has the potential to enable SDM (Software Defined Memory) and emerging
        usage models of accelerators.

        Meta has been working on CXL with current focus on memory expansion. This presentation
        will discuss Meta's experiences, learnings, pain points and expectations for Linux
        kernel/OS to support CXL's value proposition and at-scale data center deployment. It
        touches upon aspects such as transparent memory expansion, device management at scale,
        RAS, etc. Meta looks forward to further collaboration with the Linux community to improve CXL
        technology and to enable the CXL ecosystem.

        Speaker: Jonathan Zhang
      • 64
        nouveau in the times of nvidia firmware and open source kernel module

        This talk will look at the recent NVIDIA firmware release and open source kernel module contents, describe what exists, and what can happen.

        It will then address the nouveau project and what this means to it, and what sort of plans are in place to use what NVIDIA has provided to move the project forward.

        It will also discuss possible future projects in the area.

        Speaker: David Airlie
    • eBPF & Networking

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 65
        Machine readable description for netlink protocols (YAML?)

        Netlink is a TLV based protocol we invented and use in networking for most of our uAPI needs. It supports seamless extensibility, feature discovery and has been hardened over the years to prevent users from falling into uAPI extensibility gotchas.

        Nevertheless, netlink remains very rarely used outside of networking. It's considered arcane and too verbose (it requires defining operations, policies, parsers). (The fact that it depends on CONFIG_NET doesn't help either, but that's probably just an excuse most of the time.)

        In an attempt to alleviate those issues I have been working on creating a netlink protocol description in YAML. A machine readable netlink message description should make it easy for language bindings to be automatically generated, making netlink feel much more like gRPC, Thrift or just a function call in the user space. Similarly on the kernel side the YAML description can be used to generate the op tables, policies and parsers.
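
        For context, below is the flavour of hand-written kernel boilerplate (attribute enum plus policy table) that such a machine readable spec could generate for each family instead; the attribute names are made up for the example:

          #include <net/netlink.h>

          /* Made-up attributes: today every netlink family hand-writes an enum,
           * a policy table and parsing code along these lines -- exactly the
           * boilerplate a YAML description could generate for kernel and user space. */
          enum {
                  DEMO_ATTR_UNSPEC,
                  DEMO_ATTR_IFINDEX,      /* u32 */
                  DEMO_ATTR_NAME,         /* NUL-terminated string */
                  __DEMO_ATTR_MAX,
          };
          #define DEMO_ATTR_MAX (__DEMO_ATTR_MAX - 1)

          static const struct nla_policy demo_policy[DEMO_ATTR_MAX + 1] = {
                  [DEMO_ATTR_IFINDEX] = { .type = NLA_U32 },
                  [DEMO_ATTR_NAME]    = { .type = NLA_NUL_STRING, .len = 31 },
          };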

        In this talk I'll cover the basics of netlink (which everyone claims to know but doesn't), compare it to Thrift/gRPC, and present the YAML work.

        Speaker: Jakub Kicinski (Meta)
      • 66
        Cilium's BPF kernel datapath revamped

        Since the early days of eBPF, Cilium's core building block for its datapath has been tc BPF. With more adopters of eBPF in the Kubernetes landscape, there is a growing risk from a user perspective that Pods orchestrating tc BPF programs might step on each other, leading to hard-to-debug problems.

        We dive into a recently experienced incident, followed by our proposal of a revamped tc ingress/egress BPF datapath for the kernel which incorporates lessons learned from production use, lowers overhead as a framework, and supports BPF links for tc BPF programs in a native, seamless manner (that is, not conflicting with tc's object model). In particular, the latter solves the program ownership problem and allows for better debuggability through a common interface for BPF. We also discuss our integration approach for libbpf and bpftool, dive into the uapi extensions and next steps.

        Speaker: Daniel Borkmann (Isovalent)
      • 67
        A BPF map for online packet classification

        There is a growing need in online packet classification for BPF-based networking solutions. In particular, in cilium we have two use cases: the PCAP recorder for the standalone XDP load balancer [1] and the k8s network policies. The PCAP recorder implementation suffers from slow and dangerous updates due to runtime recompilation, and both use cases require specifying port ranges in rules, which is currently not supported.

        At the moment there are two competing algorithms for online packet classification: Tuple Merge [2] and Partition Sort [3]. The Tuple Merge algorithm uses hash tables to store rules, and Partition Sort uses multi-dimensional interval trees. Thus, both algorithms are [nearly?] impossible to implement in "pure" BPF due to lack of functionality and also due to verifier complexity limits.

        We propose a new BPF map for packet classification and an API which can be used to adapt this map to different practical use cases. The map is not tied to the use of a specific algorithm, so any of brute force, tuple merge, partition sort or a future state-of-the-art algorithm can be used.

        [1] https://cilium.io/blog/2021/05/20/cilium-110/#pcap
        [2] https://nonsns.github.io/paper/rossi19ton.pdf
        [3] https://www.cse.msu.edu/~yingchar/PDF/sorted_partitioning_icnp2016.pdf

        Speaker: Anton Protopopov (Isovalent)
      • 11:30 AM
        Break
      • 68
        How to share IPv4 addresses by partitioning the port space

        When establishing connections, a client needs a source IP address. For better or worse, network and service operators often assign traits to client IP addresses such as a reputation score, geolocation or traffic category, e.g. mobile, residential, server. These traits influence the way a service responds.

        Transparent Web proxies, or VPN services, obfuscate true client IPs. To ensure a good user experience, a transparent proxy service should carefully select the egress IPs to mirror the traits of the true-client IP.

        However, this design is hard to scale in IPv4 due to the scarcity of IP addresses. As the price of IPv4 addresses rises, it becomes important to make efficient use of the available public IPv4 address pool.

        The limited pool of IPv4 addresses, coupled with a desire to express traits known to be used by other services, presented Cloudflare with a challenge: the number of server instances in a single Point of Presence exceeds the number of IPv4 egress addresses available -- a disconnect that is exacerbated by the need to partition available addresses further according to traits.

        This has led us to search for ways to share a scarce resource. The result is a system where a single egress IPv4 address, with given traits, is assigned to not one, but multiple hosts. We make it possible by partitioning ephemeral TCP/UDP port space and dividing it among the hosts. Such a setup avoids use of stateful NAT, which is undesirable due to scalability and single-point-of-failure concerns.

        From previous work [1] we know that the Linux Sockets API is poorly suited to a task of establishing connections from a given source port range. Opening a TCP connection from a port range is only possible if the user re-implements the free port search - a task that the Linux TCP stack already performs when auto-binding a socket.
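
        To make that concrete, connecting from a caller-chosen source port range today means probing ports one by one from user space, roughly as in this sketch (error handling trimmed; the helper is illustrative, not production code):

          #include <arpa/inet.h>
          #include <errno.h>
          #include <netinet/in.h>
          #include <stdint.h>
          #include <sys/socket.h>
          #include <unistd.h>

          /* Connect from a source port within [lo, hi]: user space has to
           * re-implement the free-port search that the kernel already has. */
          int connect_from_range(struct sockaddr_in *local, struct sockaddr_in *remote,
                                 uint16_t lo, uint16_t hi)
          {
                  for (unsigned int port = lo; port <= hi; port++) {
                          int fd = socket(AF_INET, SOCK_STREAM, 0);

                          if (fd < 0)
                                  return -1;
                          local->sin_port = htons((uint16_t)port);
                          if (bind(fd, (struct sockaddr *)local, sizeof(*local)) == 0 &&
                              connect(fd, (struct sockaddr *)remote, sizeof(*remote)) == 0)
                                  return fd;      /* success: caller owns the socket */
                          close(fd);              /* port taken (or connect failed): retry */
                  }
                  errno = EADDRINUSE;
                  return -1;
          }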

        On UDP sockets, selecting source port range for a connected socket turns out to be very difficult. Correctly dealing with connected sockets is important because they are a desirable tool for egress traffic, despite their memory overhead. Currently, the Linux API forces the user to choose: Either use a single connected UDP socket owning a local port, which greatly limits the number of concurrent UDP flows; or, alternatively, somehow detect a connected-socket conflict when creating connected UDP sockets, which share the local address.

        We previously built a detection mechanism with a combination of querying sock_diag and toggling the port sharing on and off after binding the socket [1]. Depending on perspective, the process might be described by some as arduous, or by others as an ingenious hack that works.

        Recent innovations such as these demonstrate that sharing the finite set of ports and addresses among larger sets of distributed processes is a problem not yet completely solved for the Linux Sockets API. At Cloudflare we have come up with a few different ideas to address the shortcomings of the Linux API. Each of them makes the task of sharing an IPv4 address between servers and/or processes easier, but the degree of user-friendliness varies.

        In no particular order, the ideas we have evaluated are:

        1. Introduce a per-socket configuration option for narrowing down the IP ephemeral port range.

        2. Introduce a flag to enforce unicast semantics for connected UDP sockets, when sharing the local address (SO_REUSEADDR). With the flag set, it should not be possible to create two connected UDP sockets with conflicting 4-tuples ({local IP, local port, remote IP, remote port}).

        3. Extend the late-bind feature (IP_BIND_ADDRESS_NO_PORT) to UDP sockets, so that dynamically-bound connected UDP sockets can share a local address as long as the remote address is unique.

        4. Extend Linux sockets API to let the user atomically bind a socket to a local and a remote address with conflict detection. Akin to what the Darwin connectx() syscall provides.

        5. Introduce a post-connect() BPF program to allow user-space processes to prevent creation of connected UDP sockets with conflicting 4-tuples.

        During the talk, we will go over the challenges of designing a distributed proxy system that mirrors client IP traits, which led us to look into IP sharing and port space partitioning.

        Then, we will briefly explain a production-tested implementation of TCP/UDP port space partitioning using only existing Linux API features.

        Finally, we will describe the proposed API improvement ideas, together with their pros and cons and implementation challenges.

        We will accompany the most promising ideas, according to our judgment, with a series of RFC patches posted prior to the talk for upstream community consideration.

        [1] https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

        Speakers: Jakub Sitnicki (Cloudflare), Marek Majkowski (Cloudflare)
      • 69
        Networking resource control with per-cgroup LSM

        Google's container management system runs different workloads on the same host. To effectively manage networking resources, the kernel has to apply different networking policies to different containers.

        Historically, most of the networking resource control happened inside proprietary Google networking cgroup. That cgroup is an interesting cross between upstream net_cls and net_prio, has a lot of Google-specific business logic and has no chance of being accepted upstream.

        In this talk I'm going to cover what we'd like to manage on the networking resource side and which BPF mechanisms were added to achieve this (lsm_cgroup).

        Speaker: Stanislav Fomichev (Google)
      • 70
        eBPF Standardization

        At LSF/MM/BPF, the topic was raised of better documenting eBPF and producing "standards"-like documentation, especially since there are now runtimes other than just Linux supporting eBPF.

        This presentation will summarize the current state of the eBPF Foundation effort on these lines, how it is organized, and invite discussion and feedback on this topic.

        Speaker: Dave Thaler (Microsoft)
      • 1:30 PM
        Lunch
      • 71
        Bringing packet queueing to XDP

        Packet forwarding is an important use case for XDP, however, XDP currently offers no mechanism to delay, queue or schedule packets. This limits the practical uses for XDP-based forwarding to those where the capacity of input and output links always match each other (i.e., no rate transitions or many-to-one forwarding). It also prevents an XDP-based router from doing any kind of traffic shaping or reordering to enforce policy.

        Our proposal for adding a programmable queueing layer to XDP was posted as an RFC patch set in July[0]. In this talk we will present the overall design for a wider audience, and summarise the current state of the work since the July series. We will also present open issues, in the hope of spurring discussion around the best way of adding this new capability in a way that is as widely applicable as possible.

        [0] https://lore.kernel.org/r/20220713111430.134810-1-toke@redhat.com

        Speaker: Toke Høiland-Jørgensen (Red Hat)
      • 72
        XDP gaining access to NIC hardware hints via BTF

        The idea for XDP-hints, which is XDP gaining access to HW offload hints, dates back to Nov 2017. We believe the main reason the XDP-hints work has stalled is that upstream we couldn't get consensus on the layout of the XDP metadata. BTF was not ready at that time.

        We believe the flexibility of BTF can resolve the layout issues, especially since BTF has evolved to include support for kernel modules.

        This talk is for hashing out upstream XDP-hints discussions and listening to
        users/consumers of this facility.

        There are multiple users of this facility that all need to be satisfied:

        1. BPF-progs are the first obvious consumer (either XDP or TC hooks)
        2. XDP to SKB conversion (in veth and cpumap) for traditional HW offloads
        3. AF_XDP can consume BTF info in userspace to decode metadata area
        4. Chained BPF-progs can communicate state via metadata

        Speaker: Jesper Dangaard Brouer (Red Hat)
      • 73
        FW centric devices, NIC customization

        For a long time now the industry has been building programmable
        processors into devices to run firmware code. This is a long standing
        design approach going back decades at this point. In some devices the
        firmware is effectively a fixed function and has little in the way of
        RAS features or configurability. However, a growing trend is to push
        significant complexity into the processors on these devices.

        Storage has been doing FW centric devices for a long time now, and we
        can see some evolution there where standards based channels exist that
        carry device specific data. For instance, looking at nvme-cli we can
        see a range of generic channels carrying device specific RAS or
        configuration (smart-log, fw-log, error-log, fw-download). nvme-cli
        also supports entire device specific extensions to access unique
        functionality (nvme-intel-*, nvme-huawei-*, nvme-micro-*).

        https://man.archlinux.org/man/community/nvme-cli/nvme.1.en

        This reflects the reality that standardization can only go so far.
        The large amount of FW code still needs RAS and configuration unique
        to each device's design to expose its full capability.

        In the NIC world we have been seeing FW centric devices for a long
        time, starting with MIPS cores in early Broadcom devices, entire Linux
        OS's in early "offload NICs", to today's highly complex NIC focusing on
        complicated virtualization scenarios.

        For a long time our goal with devlink has been to see a similarly
        healthy mix of standards based multi-vendor APIs side by side with
        device specific APIs, similar to how nvme-cli is handling things on
        the storage side.

        In this talk, we will explore options, upstream APIs and mainstream
        utilities for FW-centric NIC customizations.

        We are focused on:

        1) non-volatile device configuration and firmware update - static and
        preserved across reboots

        2) Volatile device global firmware configuration - runtime.

        3) Volatile per-function firmware configuration (PF/VF/SF) - runtime.

        4) RAS features for FW - capture crash/fault data, read back logs,
        trigger device diagnostic modes, report device diagnostic data,
        device attestation

        Speakers: Saeed Mahameed (Nvidia), Mark Bloch (Nvidia)
      • 4:30 PM
        Break
      • 74
        Socket termination for policy enforcement and load-balancing

        Cloud-native environments see a lot of churn where containers can come and go. We have compelling use cases like eBPF enabled policy enforcements and socket load-balancing, where we need an effective way to identify and terminate sockets with active as well as idle connections so that they can reconnect when the remote containers go away. Cilium [1] provides eBPF based socket load-balancing for containerized workloads, whereby service virtual ip to service backend address translation happens only once at the socket connect calls for TCP and connected UDP workloads. Client applications are likely to be unaware of the remote containers that they are connected to getting deleted. Particularly, long running connected UDP applications are prone to such network connectivity issues as there are no TCP RST like signals that the client containers can rely on in order to terminate their sockets. This is especially critical for Envoy-like proxies [2] that intercept all container traffic, and fail to resolve DNS requests over long-lived connections established to stale DNS server containers. The other use case for forcefully terminating sockets is around policy enforcement. Administrators may want to enforce policies on-the-fly whereby they want active client applications traffic to be redirected to a subset of containers, or optimize DNS traffic to be sent to node-local DNS cache containers [3] for JVM-like applications that cache DNS entries.

        As we researched ways to filter and forcefully terminate sockets with active as well as idle connections, we considered various solutions involving the recently introduced BPF iterator, the sock_destroy API, and VRFs, which we plan to present in this talk. Some of these APIs are network namespace aware, which needs some book-keeping in terms of storing container metadata, and we plan to send kernel patches upstream in order to adapt them for container environments. Moreover, the sock_destroy API was originally introduced to solve similar problems on Android, but it's behind a special config that's disabled by default. With the VRF approach to terminating sockets, we faced issues with sockets ignoring certain error codes. We hope our experiences, and the discussion around the proposed BPF kernel extensions to address these problems, help the community.

        [1] https://github.com/cilium/cilium
        [2] https://github.com/envoyproxy/envoy
        [3] https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

        Speaker: Aditi Ghag (Isovalent)
      • 75
        MPTCP: Extending kernel functionality with eBPF and Netlink

        Multipath TCP (MPTCP) was initially supported in v5.6 of the Linux kernel. In subsequent releases, the MPTCP development community has steadily expanded from the initial baseline feature set to now support a broad range of MPTCP features on the wire and through the socket and generic Netlink APIs.

        With core MPTCP functionality established, our next goal is to make MPTCP more extensible and customizable at runtime. The two most common tools in the kernel's networking subsystem for these purposes are generic Netlink and BPF. Each has tradeoffs that make them better suited for different scenarios. Our choices for extending MPTCP show some of those tradeoffs, and also leave our community with some open questions about how to best use these interfaces and frameworks.

        This talk will take MPTCP as a use-case to illustrate questions any network subsystems could have when looking at extending kernel functionality and controls from the userspace. Two main examples will be presented: one where BPF seems more appropriate and one where a privileged generic Netlink API can be used.

        As one example, we are extending the MPTCP packet scheduler using BPF. When there are multiple active TCP subflows in a MPTCP connection, the MPTCP stack must decide which of those subflows to use to transmit each data packet. This requires low latency and low overhead, and direct access to low-level TCP connection information. Customizable schedulers can optimize for latency, redundancy, cost, carrier policy, or other factors. In the past such customization would have been implemented as a kernel module, with more compatibility challenges for system administrators. We have patches implementing a proof-of-concept BPF packet scheduler, and hope to discuss with the netdev/BPF maintainers and audience how we might best structure the BPF/kernel API -- similar to what would be done for a kernel module API -- to balance long-term API stability, future evolution of MPTCP scheduler features, and usability for scheduler authors.

        The next customization feature is the userspace path manager added in v5.19. MPTCP path managers advertise addresses available for multipath connections, and establish or close additional TCP subflows using the available interfaces. There are a limited number of interactions with a path manager during the life of a MPTCP connection. Operations are not very sensitive to latency, and may need access to a restricted amount of data from userspace. This led us to expand the MPTCP generic Netlink API and update the Multipath TCP Daemon (mptcpd) to support the new commands. Generic Netlink has been a good fit for path manager commands and events, the concepts are familiar and the message format makes it possible to maintain forward and backward compatibility between different kernel versions and userspace binaries. However the overhead of userspace communication does have tradeoffs, especially for busy servers.

        MPTCP development for the Linux kernel and mptcpd are public and open. You can find us at mptcp@lists.linux.dev, https://github.com/multipath-tcp/mptcp_net-next/wiki (soon via https://mptcp.dev), and https://github.com/intel/mptcpd

        Speaker: Matthieu Baerts (Tessares)
      • 76
        Percpu hashtab traversal measurement study

        As platforms grow in CPU count (200+ CPUs), using per-CPU data structures is becoming more and more expensive. Copying the per-CPU data from the bpf hashtab map to userspace buffers can take up to 22 us per entry on a platform with 256 cores.

        This talk presents a detailed measurement study of the cost of percpu hashtab traversal, covering various methods and systems with different core counts.
        We will discuss how the current implementation of this data structure makes it hard to amortize cache misses, and solicit proposal for possible enhancements.
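
        For reference, the traversal being measured is the standard libbpf pattern below, in which every lookup copies one value per possible CPU out of the kernel; the u32/u64 key and value types and the already-opened map_fd are assumptions for the sketch:

          #include <bpf/bpf.h>
          #include <bpf/libbpf.h>
          #include <stdint.h>
          #include <stdlib.h>

          /* Walk a BPF_MAP_TYPE_PERCPU_HASH map from user space. */
          static void dump_percpu_map(int map_fd)
          {
                  int ncpus = libbpf_num_possible_cpus();
                  uint64_t *values = calloc(ncpus, sizeof(*values));
                  uint32_t key, next_key, *prev = NULL;

                  while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
                          key = next_key;
                          /* Copies ncpus * sizeof(u64) bytes per entry. */
                          bpf_map_lookup_elem(map_fd, &key, values);
                          /* ... aggregate values[0..ncpus-1] ... */
                          prev = &key;
                  }
                  free(values);
          }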

        Speaker: Brian Vazquez (Google)
    • linux/arch MC

      The linux/arch MC aims to bring architecture maintainers into one room to discuss how we can improve architecture-specific code and its integration with the generic kernel.

      • 77
        High memory management API changes

        There was a time when the Linux kernel was 32bit but hardware systems had much
        more than 1GB of memory. A solution was developed to allow the use of high
        memory (HIGHMEM). High memory was excluded from the kernel direct map and was
        temporarily mapped into and out of the kernel as needed. These mappings were
        made via kmap_*() calls.

        With the prevalence of 64bit kernels the usefulness of this interface is
        waning. But the idea of memory not being in the direct map (or having
        permissions beyond the direct map mapping) has brought about the need to
        rethink the HIGHMEM interface.

        This talk will discuss the changes to the kmap_*() API and the motivations
        driving them. This includes the status of a project to rework the HIGHMEM
        interface as of the LPC conference.
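
        In code terms, the direction of the rework is roughly the move sketched
        below (illustrative kernel-style snippet, not a specific patch):

          #include <linux/highmem.h>
          #include <linux/string.h>

          /* Older pattern: kmap() hands out a long-lived mapping from a small
           * global pool, kunmap() returns it. */
          static void copy_from_page_old(struct page *page, void *dst, size_t len)
          {
                  void *src = kmap(page);

                  memcpy(dst, src, len);
                  kunmap(page);
          }

          /* Newer pattern: kmap_local_page() gives a cheap, strictly nested,
           * CPU-local mapping valid only in the current context. */
          static void copy_from_page_new(struct page *page, void *dst, size_t len)
          {
                  void *src = kmap_local_page(page);

                  memcpy(dst, src, len);
                  kunmap_local(src);
          }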

        Finally how does HIGHMEM affect the modern architectures in use? Is it finally
        time to remove CONFIG_HIGHMEM? Or is there still a need for 32 bit systems to
        support large amounts of memory in production environments?

        Speaker: Ira Weiny
      • 78
        Mitigating speculative execution attacks with ASI - follow up

        In this talk we will argue the case for adopting ASI in upstream Linux.

        Speculative execution attacks, such as L1TF, MDS, LVI, (and many others) pose significant security risks to hypervisors and VMs, from neighboring malicious VMs. The sheer number of proposed patches/fixes is quite high, each with its own non-trivial impact on performance. A complete mitigation for these attacks requires very frequent flushing of several buffers (L1D cache, load/store buffers, branch predictors, etc. etc.) and halting of sibling cores. The performance cost of deploying these mitigations is unacceptable in realistic scenarios.

        Two years ago, we presented Address Space Isolation (ASI) - a high-performance security-enhancing mechanism to defeat these speculative attacks. We published a working proof of concept in https://lkml.org/lkml/2022/2/23/14. ASI, in essence, is an alternative way to manage virtual memory for hypervisors, providing very strong security guarantees at a minimal performance cost.

        In the talk, we will discuss what new vulnerabilities have been discovered since our previous presentation, what the existing approaches are, and their estimated costs. We will then present our performance estimation of ASI, and argue that ASI can mitigate most of the speculative attacks as is, or with a small modification to ASI, at an acceptable cost.

        Speakers: Junaid Shahid (Google), Ofir Weisse (Google)
      • 79
        Consolidating representations of the physical memory

        We have several coarse representations of the physical memory consisting of
        [start, end, flags] structures per memory region. There is memblock that
        some architectures keep after boot, there is iomem_resource tree and
        "System RAM" nodes in that tree, there are memory blocks exposed in sysfs
        and then there are per-architecture structures, sometimes even several per
        architecture.

        The multiplication of such structures and the lack of consistency between
        some of them do not help maintainability and can be a reason for subtle
        bugs here and there.

        The layout of the physical memory is defined by hardware and firmware and
        there is not much room for its interpretation; single abstraction of the
        physical memory should suffice and a single [start, end, flags] type should
        be enough. There is no fundamental reason why we cannot converge
        per-architecture representations of the physical memory, like e820,
        drmem_lmb, memblock or numa_meminfo into a generic abstraction.

        I suggest using memblock as the basis for such an abstraction. It is already
        supported on all architectures and it is used as the generic representation
        of the physical memory at boot time. Closing the gaps between per
        architecture structures and memblock is anyway required for more robust
        initialization of the memory management. Addition of simple locking of
        memblock data for memory hotplug, making the memblock "allocator" part
        discardable and a mechanism to synchronize "System RAM" resources with
        memblock would complete the picture.
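
        For reference, the abstraction in question already boils down to a simple
        walk over [start, end] ranges (kernel-style sketch):

          #include <linux/init.h>
          #include <linux/memblock.h>
          #include <linux/printk.h>

          /* Walk the generic boot-time view of physical memory: each region is a
           * [start, end) range plus flags -- the representation the talk proposes
           * to keep as the single source of truth. */
          static void __init dump_phys_memory(void)
          {
                  phys_addr_t start, end;
                  u64 i;

                  for_each_mem_range(i, &start, &end)
                          pr_info("RAM: %pa - %pa\n", &start, &end);
          }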

        Speaker: Mike Rapoport (IBM)
      • 11:30 AM
        Break
      • 80
        LoongArch: What we will do next
        • cleaner way forward for compatibility with the "old-world" (the
          earlier in-house MIPS-clone firmware and kernel ABI of LoongArch), if
          possible;
        • cleaner approach to support both standard UEFI and the Loongson-custom
          boot protocols, if possible;
        • way forward for supporting zboot in EFI stub boot flow.
        Speakers: Huacai Chen, Jianmin Lv (Loongson), Xuerui WANG
      • 81
        Extending EFI support in Linux to new architectures

        An overview will be presented of recent work in the Linux/EFI
        subsystem and associated projects (u-boot, Tianocore, systemd), with a
        focus on generic support for the various new architectures that have
        adopted EFI as a supported boot flow. This includes UEFI secure boot
        and/or measured boot on non-Tianocore based EFI implementations,
        generic decompressor support in Linux and early handling of RNG seeds
        provided by the bootloader.

        Note that topics related to confidential computing (TDX, SEV) will not
        be covered here: there are numerous other venues at LPC and the KVM
        Forum that already cover this in more detail.

        Speakers: Ard Biesheuvel (Google), Ilias Apalodimas
      • 82
        Make LL/SC architectures provide a strict forward progress guarantee

        For architectures that use load-link/store-conditional (LL/SC) to implement atomic semantics, LL/SC can effectively reduce the complexity and cost of embedded processors and is very attractive for products with up to two cores in a single cluster. However, compared with the AMO architecture, it may not provide a sufficient forward progress guarantee, creating a risk of livelock. Therefore, CPUs based on LL/SC architectures such as csky, openrisc, riscv, and loongarch haven't met the requirements for using qspinlock in NUMA scenarios. In this presentation, we will introduce how to give LL/SC a strict forward progress guarantee, solve the mixed-size atomic & dcas problem along the way, and discuss the hardware solution's advantages and disadvantages. I hope this presentation will help LL/SC architectures solve the NUMA series of issues in Linux.
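
        The forward progress problem is visible even at the C level: on an LL/SC machine a weak compare-and-swap can fail whenever the reservation is lost, so a retry loop like the minimal sketch below only terminates if the hardware guarantees that some store-conditional eventually succeeds:

          #include <stdatomic.h>

          /* On LL/SC machines the weak CAS below maps to a load-linked /
           * store-conditional pair.  Without a hardware forward progress
           * guarantee, CPUs hammering the same line can keep cancelling each
           * other's reservations and this loop may livelock. */
          static void atomic_add_slow(_Atomic unsigned long *counter, unsigned long val)
          {
                  unsigned long old = atomic_load_explicit(counter, memory_order_relaxed);

                  while (!atomic_compare_exchange_weak_explicit(counter, &old, old + val,
                                                                memory_order_relaxed,
                                                                memory_order_relaxed))
                          ;       /* 'old' was refreshed with the current value; retry */
          }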

        Speaker: Mr Ren Guo
    • 1:30 PM
      Lunch
    • 1:30 PM
      Lunch
    • 1:30 PM
      Lunch
    • Android MC

      Continuing in the same direction as last year, this year's Android microconference will be an opportunity to foster collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that Android and Linux development move forward in lockstep.

      Projected talks:
      - io_uring in Android (Akilesh Kailash)
      - MGLRU results on Android (Kalesh Singh or Yu Zhao presenting over VC)
      - Hermetic builds with Bazel (Matthias Männich)
      - Android kernel testing updates (Steve Muckle)
      - pKVM (Quentin Perret)
      - erofs as a replacement for f2fs and the deprecation of ext4 (David Anderson)
      - eBPF-based FUSE (Paul Lawrence)
      - stgdiff tools (Giuliano Procida)
      - Technical debt (Lee Jones)
      - Parallelized suspend/resume (Saravana Kannan)
      - CPU DVFS for guest thread migrations (Saravana Kannan)

      Accomplishments since the last Android MC:
      - fw_devlink: Fixed the correctness of sync_state() callbacks when simple-bus devices are involved
      - Implemented a prototype for the cgroup-based accounting of DMA-BUF allocations -- current review doc: https://patchwork.kernel.org/project/linux-media/patch/20220328035951.1817417-2-tjmercier@google.com/
      - Other dependencies for tracking shared gfx buffers now merged
      - Improved community collaboration:
      - Collaboration page set up: https://aosp-developers-community.github.io/
      - Integrating v4l2_codec2 HAL on v4l2-compliant upstream codecs WIP

      MC leads:
      Karim Yaghmour karim.yaghmour@opersys.com
      Suren Baghdasaryan surenb@google.com
      John Stultz jstultz@google.com
      Amit Pundir amit.pundir@linaro.org
      Sumit Semwal sumit.semwal@linaro.org

      • 3:00 PM
        Intro
      • 83
        GKI experience

        Qualcomm will provide an update on the commercialization of a GKI-based target. This short talk will discuss the benefits of adopting the GKI model (LTS intake frequency, upstream adoption) and some of the challenges we faced. Finally, we will discuss future challenges for GKI products with respect to upstream kernel development.

        Speaker: Elliot Berman
      • 84
        Technical debt

        For various reasons, the Android Common Kernel (ACK) requires functionality that is not suitable for upstream. This talk will explore the reasons why this delta must exist, how it is maintained & managed and the steps taken to ensure that it remains as small as possible.

        Speaker: Matthias Männich (Google)
      • 85
        Hermetic builds with Bazel

        Starting with Android 13, Android Kernels can be built with Bazel. While under the hood this still uses KBuild as the authoritative build system, the guarantees a Bazel build provides are very valuable for building GKI kernels and modules. This talk will explore the Bazel based kernel build and the Driver Developer Kit (DDK) that provides a convenient way to create kernel modules in compliance with GKI.

        Speaker: Matthias Männich (Google)
      • 86
        STG for ABI monitoring

        ABI monitoring is an important part of the stability and upgrade-ability promises of the GKI project. In the latest generation of our tooling, we applied concepts from graph theory to the problem space and gained high confidence in the correctness and reliability of the analysis results. What else can we fit into this model and what would be most useful?

        Speaker: Giuliano Procida
      • 88
        Virtualization in Android

        In this presentation we will talk about Protected KVM and the new virtualization APIs introduced with Android 13. You'll find out more about some of the key Protected KVM design decisions, its upstream status and how we plan to use protected virtualization for enabling a new set of use cases and better infrastructure for device vendors.

        Speakers: David Brazdil (Google), Serban Constantinescu
      • 89
        Cuttlefish

        Cuttlefish is an Android based VM that can be used for kernel hacking amongst other things. We'll chat about how to set one up, put a mainline kernel on it, and utilize the devices it supports.

        Speaker: Ram Muthiah
      • 4:40 PM
        Break
      • 90
        eBPF-based FUSE

        The filesystem in userspace, or FUSE filesystem, is a long-standing filesystem in Linux that allows a file system to be implemented in user space. Unsurprisingly, this comes with a performance overhead, mostly due to the large number of context switches from the kernel to the user space daemon implementing the file system.
        bpf, or Berkeley Packet Filter, is a mechanism that allows user space to put carefully sanitized programs into the kernel, initially as part of a firewall, but now for many uses.
        fuse-bpf is thus a natural extension of fuse, adding support for backing files and directories that can be controlled using bpf, thus avoiding context switches to the user space daemon. This allows us to use fuse in many more places in Android, as performance is very close to the native file system.

        Speaker: Paul Lawrence (Google Inc)
      • 91
        EROFS as a replacement for EXT4 and Squashfs

        EROFS is a readonly filesystem that supports compression. It is rapidly becoming popular in the ecosystem. This talk will explore its performance implications and space-saving benefits on the Android platform, as well as ideas for future work.

        Speaker: David Anderson (Google)
      • 92
        MGLRU results on Android

        Multigenerational LRU (MGLRU) is a rework of the kernel’s page reclaim mechanism where pages are categorized into generations representative of their age. It provides a finer granularity aging than the current 2-tiered active and inactive LRU lists, with the aim to make better page reclaim decisions.

        MGLRU has shown promising improvements on various platforms and from various parties. This presentation will present the results of evaluating the patchset [1] on Android.

        [1] https://lore.kernel.org/r/20220309021230.721028-1-yuzhao@google.com/

        Speaker: Kalesh Singh (Google Inc)
      • 93
        io_uring in Android

        This presentation will talk about the usage of io_uring in Android OTA and present performance results. Android OTA uses dm-user, which is an out-of-tree user space block device.

        We plan to explore io_uring by evaluating the RFC patchset: https://lore.kernel.org/io-uring/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/T/#m013adcbd97cc4c4d810f51961998ba569ecc2a62

        Speaker: Akilesh Kailash
      • 94
        (Impact of) Recent CPU topology changes

        Acting on the expectation that both device-tree and ACPI enabled systems must present a consistent view of the CPU topology, Sudeep submitted at [1] (currently v6) a series of patches that firstly fix some discrepancies in the CPU topology parsing from the DT /cpu-map node, as well as improve detection of last level cache sharing. The conference topic aims to discuss the impact of these changes on userspace-facing CPU topology information and on the use of more unconventional DT topology descriptions ("phantom domains" - u-arch or frequency subgrouping of CPUs) present in Android systems.

        [1] https://lore.kernel.org/lkml/20220704101605.1318280-1-sudeep.holla@arm.com/

        Speakers: Dietmar Eggemann, Ionela Voinescu
      • 95
        Dynamic Energy Model to handle leakage power

        The Energy Model (EM) framework describes the CPU power model and is used for task scheduling decisions or thermal control. It is set up during boot in one of the supported ways and is not modified during normal operation. However, not every chip has the same power characteristics, and the cores inside might be sensitive to temperature changes in different ways.
        To better address the variety of silicon fabrication, we want to allow modifications of the EM at runtime. The EM runtime modification would introduce new features:
        - allow providing (after boot) the total power values for each OPP, not limited to any formula or DT data
        - allow providing power values proper for a given manufactured SoC - with different binning and read from FW or a kernel module
        - allow modifying power values at runtime according to the current temperature of the SoC, which might increase leakage and shift power-performance curves for big cores more than for other cores

        Speaker: Lukasz Luba
    • RISC-V MC
      • 4:30 PM
        Break
    • Real-time and Scheduling MC
      • 96
        rtla: what is next?

        Presented last year, RTLA made its way to the kernel set of tools.

        In its current state, RTLA includes an interface for the timerlat and osnoise tracers. However, the idea is to expand RTLA to include a vast set of ... real-time Linux analysis tools, combining tracing and methods to stimulate the system.

        In this discussion, we can talk about ways to extend tracers and rtla, including:

        • Supporting histogram for ftrace tracers (timerlat events)
        • Adding new interface for tracers, like hwlat
        • Adding trace-cmd file support to allow running ftrace & osnoise/timerlat/hwlat in parallel
        • Adding smi counters for timerlat
        • Adding other tools
        • Adding pseudo-random task simulator

        But the main idea is to hear what people have to say about how to make the tool even better!

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 97
        Bringing Energy-Aware Scheduling to x86

        Energy-Aware Scheduling (EAS) is not a straight fit for x86 hybrid processors. Thus, x86 hybrid processors do not make use of EAS yet. A large range of turbo frequencies, inter-CPU dependencies, simultaneous multithreading, and instruction-specific differences in throughput make it difficult to feed the scheduler with a simple, timely, accurate model of CPU capacity.
        Dependencies between CPUs and other on-chip components make it difficult to create an energy model. The widespread use of hardware-controlled frequency scaling on systems based on Intel processors needs to be reconciled with a model in which the kernel controls the operating point of the CPU.
        The goal of this talk is to discuss the level of support from hardware, the challenges of EAS on x86, and proposed solutions to provide simple capacity and energy models that are sufficiently accurate for the scheduler to use.

        Speakers: Len Brown (Intel Open Source Technology Center), Ricardo Neri (Intel Corporation)
      • 98
        Latency hints for CFS task

        RT schedulers are traditionally used for everything concerned with latency, but it is sometimes not possible to use RT for all parts of the system, for example because of runtime variance or because some parts are not trusted. At the opposite end, some apps don't care about latency at all and would rather not preempt the running task, preferring to let it move forward.
        The latency nice priority aims to let userspace set such latency requirements for CFS tasks, but one difficulty is finding a suitable interface that stays meaningful for users while not being tied to one particular implementation. [1] has resurrected the latency_nice interface with an implementation that is system agnostic.
        This talk will present the current status and how to move forward on the interface.
        [1] https://lore.kernel.org/lkml/20220512163534.2572-7-vincent.guittot@linaro.org/T/
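
        For illustration only, here is a minimal sketch of how such a hint could be set through the existing sched_setattr(2) entry point; the SCHED_FLAG_LATENCY_NICE flag, the sched_latency_nice field, its position in the structure and its range are assumptions modeled on the patch series above, not an established ABI:

        #include <stdint.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Local copy of the sched_attr UAPI layout, extended with the *proposed* field. */
        struct sched_attr_latency {
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime, sched_deadline, sched_period;
                uint32_t sched_util_min, sched_util_max;
                int32_t  sched_latency_nice;    /* hypothetical: -20 (latency sensitive) .. 19 */
        };

        int main(void)
        {
                struct sched_attr_latency attr = {
                        .size = sizeof(attr),
                        .sched_policy = 0,                 /* SCHED_OTHER */
                        .sched_flags = 1ULL << 7,          /* hypothetical SCHED_FLAG_LATENCY_NICE */
                        .sched_latency_nice = -10,         /* "I care about wakeup latency" */
                };

                /* sched_setattr() has no glibc wrapper, so call the syscall directly;
                 * a kernel without the feature will simply reject the attribute. */
                if (syscall(SYS_sched_setattr, 0, &attr, 0))
                        perror("sched_setattr");
                return 0;
        }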

        Speaker: Vincent Guittot (Linaro)
      • 4:30 PM
        Break
      • 99
        Linux Kernel Scheduling and split-LLC architectures: Overview, Challenges and Opportunities

        The Linux task scheduler has seen several enhancements to make task scheduling better and smarter for split last level cache (split-LLC) environments. With the wider adoption of chiplet-like technology in current and future processors, these continued efforts become key to squeezing the most out of the silicon.

        Work has already gone in to accurately model the domain topology for split-LLC architectures: optimizing task wakeups to target cache-hot LLCs and reducing cross-LLC communication. NUMA imbalance metrics have been reworked to enable better task distribution across NUMA nodes with multiple LLCs. These enhancements have enabled several workloads to benefit from the architectural advantages of split-LLCs. That being said, there is still a lot of performance left on the table.

        In this talk we provide an overview of recent scheduler changes that have benefitted workloads in a split-LLC environment. We will describe challenges, opportunities and some ambitious ideas to make the Linux Scheduler more performant on split-LLC architectures.

        Speakers: Gautham R Shenoy (AMD Inc.), Prateek Nayak (AMD Inc. )
      • 100
        Limit the idle CPU search depth and use CPU filter during task wake up

        When a task is woken up in a last level cache (LLC) domain, the scheduler tries to find an idle CPU for it. But when the LLC domain is fully busy, the search may be in vain, adding long latency to the wakeup without yielding an idle CPU. The latency gets worse as the number of CPUs in the LLC increases, which will be the case for future platforms.
        During LPC 2021 there was a discussion on how to find an idle CPU effectively; this talk is an extended discussion of that. We will illustrate how we encountered the issue and how to debug it on a high core count system. We'll introduce a proposal that leverages the util_avg of the LLC to decide how much effort is spent scanning for idle CPUs, and also present a proposal to filter out busy CPUs so as to speed up the scan. We'll share current test data for these mechanisms, and we hope to get advice/feedback on tuning the scan policy and making it viable for upstreaming.
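
        As a rough sketch only (the actual patches differ in detail and live in the scheduler; the names below are made up), the first idea can be illustrated as deriving a scan budget that shrinks as the LLC's average utilization grows:

        /* Illustrative only: derive an idle-CPU scan budget from LLC utilization. */
        static int idle_scan_budget(unsigned long llc_util, unsigned long llc_capacity,
                                    int nr_llc_cpus)
        {
                /* Fraction of the LLC that is busy, in 1/128 units (llc_capacity > 0 assumed). */
                unsigned long busy = llc_util * 128 / llc_capacity;

                if (busy >= 110)        /* LLC nearly saturated: scanning is likely futile */
                        return 0;

                /* Otherwise scan a number of CPUs proportional to the idle headroom. */
                return nr_llc_cpus * (128 - busy) / 128;
        }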

        Speakers: Chen Yu, Yun Wu (Bytedance)
      • 101
        Linux needs a Scheduler QOS API -- and it isn't nice(2)

        Optimal task placement decisions and hardware operating points impact application performance and energy efficiency.

        The Linux scheduler and the hardware export low level knobs that allow an expert to influence these settings. But that expert needs to know details about the hardware, about the Linux scheduler, and about every (other) task that is running on the system.

        This is not a reasonable demand for multi-platform applications. Here we look at what, say Chromium, must do to run on Linux, Windows, and MacOS; and how we can make it easier for apps to run more optimally on Linux.

        Speaker: Len Brown (Intel Open Source Technology Center)
      • 102
        PREEMPT_RT Q&A with tglx

        In this topic, Thomas Gleixner will answer questions about the present and future of PREEMPT_RT, mainly about the status of the merge and how things will work after the merge.

        Speaker: Thomas Gleixner
    • IoTs a 4-Letter Word MC
      • 9:30 AM
        Break
      • 103
        Putting firmware on the device: a Zephyr+Yocto+Mender hike

        One of the biggest real-life challenges for embedded developers is putting the various bits and pieces of technology together to form an actual product. Usually, each component offers good documentation and resources to get started, but documentation examples that encompass bigger, interconnected parts of a pipeline are often hard to come by.

        In this presentation, we will start by building a firmware binary to be run on a coprocessor in an NXP i.MX7-based AMP system. The resulting artifact will be included in a subsequent build process which generates a full Linux distribution. To facilitate this, a Yocto Project feature called “multiconfig” will be harnessed to orchestrate the successive tasks and integrate the results in a single artifact. This will constitute the actual, complete application binary that the device hardware will run.

        Still, a real product needs more than this…

        It’s not enough to have the binary somewhere on the development machine - chances are that it also needs to be deployed as an update to devices in the field multiple times during the lifecycle of the product. At that point, Mender provides an OTA solution which can directly integrate into the Yocto Project based build process, helping to streamline the last step of distributing the generated binary image to a fleet of devices.

        Speaker: Mr Josef Holzmayr
      • 104
        Open source FPGA NVMe accelerator platform for BPF driven ML processing with Linux/Zephyr

        The talk will describe an open source NVMe development platform developed by Western Digital and Antmicro for server-based AI applications. The system combines an FPGA SoC with programmable logic and an AMP CPU, running Zephyr on the Cortex-R cores to handle NVMe transactions and Linux on the Cortex-A cores in an OpenAMP setup.

        The system uses Zephyr RTOS to perform time-critical tasks, including handling the base set of NVMe commands, while all custom commands are passed to the Linux system for further processing. The Linux system runs a uBPF virtual machine allowing users to upload and execute custom software that processes the data stored on the NVMe drive.

        The platform (custom hardware from Western Digital and open source software and FPGA firmware from Antmicro) was designed to enable users to run ML pipelines designed in TensorFlow. To make this possible, the uBPF virtual machine has been extended with functionality to delegate certain processing to external native libraries, interfacing the BPF code with hardware ML accelerators.

        The platform includes an example showcasing a TensorFlow AI pipeline executed via the uBPF framework accelerating the AI model inference on an accelerator implemented in FPGA using TVM/VTA.

        The platform is intended as a development platform for edge, near-data-processing acceleration research.

        Speaker: Karol Gugala (Antmicro)
      • 105
        Abusing zephyr and meta-zephyr

        This talk will cover the work done to switch from cmake to west in meta-zephyr, and how I leveraged this work to do bad things with zephyr and meta-zephyr in order to generate Yocto Project machine definitions for meta-zephyr. We'll discuss why these patches are not upstreamable to zephyr and why autogenerated machine definitions are not included in meta-zephyr.

        Speaker: Eilís Ní Fhlannagáin (Oniro Project)
      • 106
        libgpiod V2: New Major Release with a Ton of New Features

        The Linux GPIO subsystem exposes a character device to user-space that provides a certain level of control over GPIO lines. A companion C library (along with command-line tools and language bindings) is provided for easier access to the kernel interface. The character device interface was rebuilt last year with a number of new ioctl()s and data structures that improve the user experience, based on feedback and feature requests received since the first release in 2016. Now libgpiod has been entirely rewritten to leverage the new kernel features and fix issues present in the previous API. The new interface breaks compatibility and requires a different programmatic approach, but we believe it is a big improvement over v1. The goal of this talk is to present the new version of the library, the reworked command-line tools and the high-level language bindings. We will go over the software concepts used in the new architecture and describe new features that provide both more fine-grained control over GPIOs and expose more detailed information about interrupts.

        Speaker: Bartosz Golaszewski (BayLibre)
      • 107
        Linux IEEE 802.15.4 MLME improvements

        As of today, Linux has relatively poor support for 802.15.4 MLME operations such as scanning and beaconing. These operations are the basis of beacon-enabled PANs (Personal Area Networks), where devices can dynamically discover each other, associate to a PAN and maintain it as the devices move relative to each other.

        While some embedded RTOSes like Zephyr already have quite featureful support for these commands, Linux is lagging a bit behind. This talk will be an occasion to present the work done, and still ongoing, to fill these gaps in the Linux kernel 802.15.4 stack.

        Speaker: Miquèl Raynal
      • 108
        All types of wireless in Linux are terrible and why the vendors should feel bad

        The wireless experience in Linux is terrible, whether it be 802.11, Bluetooth or one of the other random standards we support. Why is it so bad? One word... vendors! Vendors do the bare minimum, regress for "stable" users, rarely or never update or even ship appropriate firmware, and expect us to just accept it! This is the perspective of a linux-firmware maintainer for a distribution that has a key role working on Edge and IoT. How can we help vendors (or require them) to improve the wireless firmware user experience in Linux?

        Speaker: Peter Robinson (Red Hat)
      • 109
        Libre Silicon in IoT

        Few have achieved what many would have thought impossible: bringing together a distributed community of engineers, then designing, prototyping, and fabricating a custom RISC-V SoC. The project was largely a success - in the first revision no less!

        Designated PyFive, the intent was a libre-silicon MCU capable of easily running MicroPython and CircuitPython. It was designed and tested from the ground up using open-source design & synthesis tools as well as an open-source PDK (Process Design Kit). PyFive was one of 40 designs selected for MPW-1 in 2020, the first run of the Google-sponsored eFabless and SkyWater foundry collaboration. There is now a GroupFund campaign which is truly the first of its kind.

        Since then, the community has created a follow-up project called ICE-V Wireless, which targets IoT. This board pairs an ESP32-C3 with an iCE40 FPGA (notably with OSS CAD Suite support from YosysHQ). The ESP32 BLE5/WiFi module is fully capable of standing on its own two legs without the FPGA. However, having fabric capable of hosting a soft-core CPU with peripherals next to the radio opens a world of possibilities for the average SoC designer on a budget.

        This talk will go into detail on the successes and challenges encountered along the way, interfacing & tooling between Linux and a custom ASIC, and bringing up a custom Wireless device with Linux and Zephyr.

        With platforms like PyFive and ICE-V, what future doors might be opened with libre silicon in the Linux IoT space?

        Speaker: Michael Welling
    • Toolchains
      • 110
        Toolchain Track Welcome

        Welcome to the toolchain track from the organizers.

        Speakers: Jose E. Marchesi (GNU Project, Oracle Inc.), Nick Desaulniers (Google)
      • 111
        Where are we on security features?

        There has been tons of work across both GCC and Clang to provide the Linux kernel with a variety of security features. Let's review and discuss where we are with parity between toolchains, approaches to solving open problems, and exploring new features.

        Parity reached since last year:

        • zero call-used registers
        • structure layout randomization

        Needs work:

        • stack protector guard location
        • Link Time Optimization
        • forward edge CFI
        • backward edge CFI
        • array bounds checking
        • -fstrict-flex-arrays
        • __builtin_dynamic_object_size
        • C language extension for bounded flexible arrays
        • builtin for answering "does this object end with a flexible array?"
        • -fsanitize=bounds
        • integer overflow protection
        • Spectre v1 mitigation
        Speakers: Kees Cook (Google), Qing Zhao
      • 112
        Status Report: Broken Dependency Orderings in the Linux Kernel

        Potentially broken dependency orderings in the Linux kernel have been a recurring theme on the Linux kernel mailing list and even Linux Plumbers Conference. The Linux kernel community fears that with ever-more sophisticated compiler optimizations, it would become possible for modern compilers to undermine the Linux kernel memory consistency model when optimizing code for weakly-ordered architectures, e.g. ARM or POWER.

        Specifically, the community was worried about address and control dependencies being broken, with the latter having already seen several unfruitful [PATCH RFC]’s on LKML.
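
        As a reminder of what is at stake, here is a textbook-style sketch (not one of the instances found by this work) of an address dependency and how an optimizing compiler could, in principle, remove it:

        /* Reader side, relying on an address dependency for ordering. */
        int reader(int **shared_ptr)
        {
                int *p = READ_ONCE(*shared_ptr);  /* load of the pointer               */
                return *p;                        /* dependent load, ordered through p */
        }

        /*
         * If the compiler can prove that *shared_ptr only ever holds &x, it may
         * rewrite the dependent load as "return x;", dropping the address dependency
         * and, on weakly-ordered CPUs, the ordering the kernel memory model expects
         * from it.
         */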

        This “fear” of optimizing compilers eventually led to READ_ONCE() accesses being promoted to memory-order-acquire semantics for arm64 kernel builds with link-time optimization (LTO) enabled, leaving valuable performance improvements on the table, as this imposes ordering constraints on non-dependent instructions.

        However, the severity of this problem had not yet been investigated, with previous discussions lacking evidence of concrete instances of broken dependency orderings.

        We are pleased (or not so pleased) to report that, based on our work, we have indeed found broken dependency orderings in the Linux kernel. We would now like to open the discussion about, but not limited to, the severity of the broken dependencies we found thus far, whether they warrant dedicating more attention to this problem, and potential (lightweight or heavyweight) fixes.

        Speakers: Marco Elver (Google), Paul Heidekrüger (Technical University of Munich)
      • 11:30 AM
        Break
      • 113
        Programmable debuggers and the Linux kernel (drgn, GDB+poke)

        This activity is about programmable debuggers and their usage in the Linux kernel. By "programmable debugger" we mean debuggers that are able to understand the data structures handled by the target program, and to operate on them guided by user-provided scripts or programs.

        First we will give a very brief presentation of two of these debuggers, drgn and GDB+poke, highlighting what these tools provide on top of the more traditional debuggers.

        Then we will discuss how these tools (and the new style of debugging they introduce) can successfully be used to debug the Linux kernel. The main goal of the discussion is to collect useful feedback from kernel hackers, with the aim of making the tools as useful as possible for real needs; for example, we would be very interested in figuring out which kernel areas/structures/abstractions would benefit most from support in the tools.

        Speakers: Elena Zannoni, Jose E. Marchesi (GNU Project, Oracle Inc.), Stephen Brennan (Oracle)
      • 114
        Toolchain support for objtool in the Linux kernel

        The Linux kernel relies on objtool for performing a host of validations, metadata generation, and other fixups and annotations. One of objtool's features is stack metadata validation and generation, which forms the backbone of the kernel's reliable stack unwinding needs.

        In this session, we will discuss which components of objtool, in general, could get some help from the toolchain. We will also discuss what assistance can be provided for the use case of reliable stack tracing. For reliable stack traces, correct and complete metadata is only one of the pillars; we will discuss the additional components that are required. At LPC 2021, we talked about the proposal to define and generate CTF Frame unwind information in the GNU toolchain. There was also a discussion on objtool on arm64. In this session, we plan to converge these discussions with a perspective on what toolchain support can be provided for objtool in the Linux kernel.

        Speakers: Indu Bhagat, Josh Poimboeuf (Red Hat)
      • 1:30 PM
        Lunch
      • 115
        GCC's -fanalyzer and the Linux kernel

        I'm the author of GCC's -fanalyzer option for static analysis.

        I've been working on extending this option to better detect various kinds of bugs in the Linux kernel (infoleaks, use of attacker controlled values, etc).

        I've also created antipatterns.ko, a contrived kernel module containing examples of the bugs that I think we ought to be able to detect statically.

        In this session I will:

        • present the current status of -fanalyzer on the Linux kernel, and
        • ask a bunch of questions about how this GCC option and the kernel should
          best interact.

        I have various ideas on ways that we can extend C via attributes, named address spaces, etc for marking up the expected behavior of kernel code in a way that I hope is relatively painless. I'm an experienced GCC developer, but a relative newcomer to the kernel, so I'm keen on having a face-to-face discussion with kernel developers and other toolchain maintainers on how such an analyzer should work, and if there are other warnings it would be useful to implement.

        Speaker: David Malcolm (Red Hat)
      • 116
        Kernel ABI Monitoring and Toolchain Support

        The new CTF (Compact C Type Format) support in libabigail is able to extract a corpus representation from the debug information of the kernel binary and its modules, i.e. an entire kernel release (kernel + modules). Using the CTF reader improves the time needed to extract and build the corpus compared with the DWARF reader: for example, extracting ABI information from the Linux kernel takes up to ~4.5 times less time. This was measured using a kernel compiled by GCC; LLVM does not currently support generating binaries with CTF debug info, and it would be nice to have that.

        But what about modules inserted (loaded) at runtime into the kernel image? The comparison currently relies on kABI scripts, which are useful, among other things, to load modules with a compatible kABI; this mechanism allows a module to be used with a kernel version different from the one it was built for. So what about using a single notion of ABI (libabigail) for the module loader as well?

        Since we added support for CTF in libabigail, a patch is needed in the upstream kernel configuration to allow building the kernel with CTF enabled. Also, some GCC attributes that affect the ABI and are used by kernel hackers, like noreturn, interrupt, etc., are not represented in the DWARF/CTF debug formats and therefore are not present in the corpus.

        Stricter conformance to the DWARF standards would also be nice: full DWARF 5 support, getting things like ARM64 ABI extensions (e.g. for HWASAN) into tools like elfutils at the same time as the compile/link toolchain, more consistency between Clang and GCC debug info for the same sources, and likewise between Clang and Clang with full LTO. Also of interest is extending ABI monitoring coverage beyond just architecture, symbols and types, and dealing with header constants, macros and more.

        Finally, there is interest in discussing ways to standardize ABI and type information so that it can be embedded into binaries less ambiguously. In other words, what can we do to avoid relying entirely on intermediate formats like CTF or DWARF to make sense of an ABI? Maybe CTF is already a good starting point, yet some additions are needed (e.g. for other language features, such as those of C++)?

        Speakers: Mr Dodji Seketeli, Mr Giuliano Procida, Mr Guillermo E. Martinez, Mr Matthias Männich
      • 4:30 PM
        Break
      • 117
        Linux Kernel Control-Flow Integrity Support

        Control-Flow Integrity (CFI) is a technique used to ensure that indirect branches are not diverted from a pre-defined set of valid targets, ensuring, for example, that a function pointer overwritten through an exploited memory corruption bug cannot be used to arbitrarily redirect the control-flow of the program. The simplest way to achieve CFI is by instrumenting the binary code being executed with checks that verify the sanity of indirect branches whenever they happen. To help with this goal, some CPU vendors enhanced their hardware with extensions that make these checks simpler and faster. Currently there are 4 different instrumentation setups being more broadly discussed for upstream: kCFI, which is a full software instrumentation that employs a fine-grained policy to validate indirect branch targets; ARM's BTI, which is an ARM hardware extension that achieves CFI in a coarse-grained, more relaxed form; Intel's IBT, which is an x86 hardware extension that, similarly to BTI, also achieves coarse-grained CFI, but with the benefit of also enforcing it over speculative paths; and FineIBT, which is a software/hardware hybrid technique that combines Intel's IBT with software instrumentation to make it fine-grained without losing its good performance, while still adding resiliency against speculative attacks.
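
        To make the granularity discussion concrete, a fine-grained scheme such as kCFI conceptually reduces to a type check before every indirect call, while the coarse-grained hardware schemes only check for a valid landing pad. The snippet below is a simplified conceptual sketch with invented helper names, not the code any of the implementations actually emit:

        typedef long (*handler_t)(void *arg);

        long call_handler(handler_t fn, void *arg)
        {
                /* Fine-grained (kCFI-style): compare a hash of the expected function
                 * type against a hash stored alongside the target; a mismatch means
                 * the pointer was corrupted. */
                if (__cfi_type_hash(fn) != __CFI_HASH(handler_t))   /* illustrative helpers */
                        __cfi_panic();

                /* Coarse-grained (BTI/IBT): the check instead reduces to "does the
                 * target start with a landing-pad instruction?", done in hardware. */
                return fn(arg);
        }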

        In this session, kernel developers and researchers (Sami Tolvanen, Mark
        Rutland, Peter Zijlstra, Joao Moreira) will provide an overview on the
        different implementations, their upstream enablement and discuss the
        contrast in approaches such as granularity or implications of design
        differences such as callee/caller-side checks.

        Speakers: Joao Moreira (Intel Corporation), Mark Rutland (Arm Ltd), Peter Zijlstra (Intel OTC), Sami Tolvanen (Google)
    • Containers and Checkpoint/Restore MC
      • 118
        Opening session
        Speaker: Stéphane Graber (Canonical Ltd.)
      • 119
        Tracer namespaces

        There are various use-cases related to tracing which could benefit from introducing a notion of "tracer namespace" rather than playing tricks with ptrace. This idea was introduced in the LPC 2021 Tracing MC.

        For instance, it would be interesting to offer the ability to trace system calls, uprobes, and user events using a kernel tracer controlled from within a container. Tracing a hierarchy consisting of a container and its children would also be useful. Runtime and post-processing trace filtering per container also appears to be a relevant feature, in addition to allowing events to be dispatched into a hierarchy of active tracing buffers (from the leaf going upwards to the root).

        It would be preferable if this namespace hierarchy were separate from pid namespaces, to allow use-cases similar to "strace" to trace a hierarchy of processes without requiring them to be in a separate pid namespace.

        Introduce the idea of "tracer namespaces" and open the discussion on what would be needed to make it a reality.

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 120
        Restoring process trees with child-sub-reapers, nested pid-namespaces and inherit-only resources.

        Re-parenting may put processes sharing the same inherit-only resource into completely different and far-away locations in the process tree, so that they no longer have ancestor/descendant relations with each other.

        In mainstream CRIU we currently have no support for nested pid-namespaces or for re-parenting to a child-sub-reaper. We just handle the most common case, where a task was re-parented to the container init. To handle this case, CRIU simply checks all children of the container init for non-session-leaders which cannot have inherited their session from init. We "fix" the original tree by moving such children into the session leader's sub-tree, connecting them via a helper task. After that we restore tasks based on the "fixed" tree and kill the helpers, so that we get the right tree, which we check against the dumped one.

        In this talk I first want to cover how we handle, in Virtuozzo, more complex cases with child-sub-reapers [1], nested pid-namespaces [2], and cases where re-parenting breaks longer branches in the process tree [2].

        Second, I want to shed some light on a problem which we can't handle easily in CRIU because of the lack of information from the kernel; this problem has been known since the early days of CRIU development, and it is still present and vital for supporting arbitrary process trees.

        I also want to present one possible solution to the problem - "CABA" [3] - and hope to get some feedback on it.

        Links:
        https://src.openvz.org/projects/OVZ/repos/criu/commits/70eee0613acf [1]
        https://src.openvz.org/projects/OVZ/repos/criu/commits/aa77967c2f6c [2]
        https://lore.kernel.org/lkml/20220615160819.242520-1-ptikhomirov@virtuozzo.com/ [3]

        Speaker: Pavel Tikhomirov (Virtuozzo)
      • 121
        How can we make procfs safe?

        Thanks to openat2(2), it is now possible for a container runtime to be absolutely sure that they are accessing the procfs path they intended by using RESOLVE_NO_XDEV|RESOLVE_NO_SYMLINKS (the main limitation before this was the fact that there was no way to safely do the equivalent of RESOLVE_NO_XDEV in userspace on Linux, and implementing the necessary behaviour in userspace was expensive and bug-prone).
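
        For reference, the kind of lookup hardening described above looks roughly like this from a runtime's point of view (sketch; error handling trimmed):

        #include <fcntl.h>
        #include <linux/openat2.h>     /* struct open_how, RESOLVE_* */
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Open a path under procfd without ever leaving procfs or following symlinks. */
        static int open_in_procfs(int procfd, const char *path)
        {
                struct open_how how = {
                        .flags   = O_RDONLY | O_CLOEXEC,
                        .resolve = RESOLVE_NO_XDEV | RESOLVE_NO_SYMLINKS,
                };

                /* openat2(2) has no glibc wrapper, so it is called directly. */
                return syscall(SYS_openat2, procfd, path, &how, sizeof(how));
        }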

        However, this method does not help if you need to access magiclinks in procfs (RESOLVE_NO_XDEV blocks all magiclinks and even if we allowed magiclink-jumps within the same vfsmount this wouldn't help with any of the magiclinks we care about since they all cross the vfsmount boundary). Of particular concern are:

        • /proc/self/fd/*
        • /proc/self/exe
        • When introspecting other processes, /proc/<pid>/ns/*, /proc/<pid>/cwd and /proc/<pid>/root.

        The primary attack scenario comes from attacks we have seen where not-obviously-malicious Kubernetes configurations were used to get the container runtime to silently create unsafe containers (we need to access several procfs files when setting up a container, and if any of those paths are redirected to "fake" procfs files, we would silently be creating insecure containers) -- ideally it should be possible to detect these kinds of attacks and refuse to create containers in such an environment.

        In this talk, we will discuss proposed patches to fix some of these endpoints (primarily /proc/self/fd/* through open(fd, "", O_EMPTYPATH)) and open up to a general discussion about how we might be able to solve the rest of them.

        Speaker: Aleksa Sarai (SUSE LLC)
      • 11:30 AM
        Break
      • 122
        cgroup rstat's advanced adoption

        rstat is the framework through which generic hierarchical stats collection is implemented for cgroups. It is light on the writer (update) side, since it works (mostly) with per-cgroup per-cpu structures only. It is quick on the reader side, since it aggregates only the cgroups active since the previous read in a given subtree. It is used for accounting CPU time on the unified hierarchy, and for blkcg and memcg stats. Readers of the first two are user space queriers; the memcg stats are additionally used by MM code internally, and hence memcg builds some optimizations on top of rstat. Despite that, there have been reports of readers being negatively affected by occasionally too-long stats retrieval. This is suspected to be caused by some shared structures within rstat, and their effect may get worse as more subsystems (or even BPF) start building upon rstat.
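
        Very roughly, and glossing over all locking, the mechanism can be sketched as follows (pseudocode with invented names, not the actual kernel functions):

        /* Writer side: cheap, touches only this CPU's state for this cgroup. */
        void rstat_updated(struct cgroup *cgrp, int cpu)
        {
                /* Link cgrp (and any not-yet-linked ancestors) into this CPU's
                 * "updated since the last flush" tree; nothing else is touched. */
        }

        /* Reader side: per CPU, walk only the cgroups updated since the last
         * flush under the subtree of interest, folding per-cpu deltas upwards. */
        void rstat_flush(struct cgroup *root)
        {
                for_each_possible_cpu(cpu)
                        for_each_updated_descendant(cgrp, root, cpu)
                                propagate_deltas(cgrp, cpu);
        }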

        This talk describes how rstat currently works and then analyzes the time complexity of updates and reads depending on the number of active use sites. The result could already be a basis for discussion, and we will further consider some approaches to keep rstat durations under control with more new adopters, and also how such methods affect the error of the collected stats (when the tolerance is limited, e.g. for the VM reclaim code).

        This presentation and discussion will fit in a slot of 30 minutes (give or take).

        Speaker: Michal Koutný
      • 123
        Unprivileged CRIU

        This talk will discuss on-going changes to CRIU to introduce an "unprivileged" mode, utilizing a minimal set of Linux capabilities that allow for non-root users to checkpoint and restore processes.

        It will also touch on a particularly motivating use-case: improving JVM start-up time.

        Speaker: Younes Manton
      • 124
        Restartable Sequences: Scaling Per-Core Shared Memory Use in Containers

        Introducing per-memory-space virtual CPU ID allocation domains helps solve user-space per-core data structure memory scaling issues, as long as the data structure is private to a memory space (typically a single process). However, this does not help in use-cases where the data structure sits in shared memory used across processes.
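
        To illustrate the scaling problem (the names below are illustrative, not the proposed ABI): with plain CPU numbers, a per-core structure in shared memory has to be sized for every possible CPU in the machine, whereas a virtual CPU ID bounded by the number of concurrently running threads keeps it small:

        #define NR_POSSIBLE_CPUS        512   /* placeholder: machine-wide CPU count */
        #define MAX_CONCURRENT_THREADS   16   /* placeholder: threads in this domain */

        struct slot { long counter; /* ... per-"CPU" data ... */ };

        /* Today: one slot per possible CPU, even if the process or container
         * only ever runs on a handful of them. */
        struct slot pool_by_cpu[NR_POSSIBLE_CPUS];

        /* With virtual CPU IDs (see the LWN reference below): the index is bounded
         * by the number of threads concurrently running in the domain, so the pool
         * can be sized to that much smaller bound. */
        struct slot pool_by_vcpu[MAX_CONCURRENT_THREADS];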

        In order to address this part of the problem, a per-container virtual CPU ID domain would be useful. This raises some practical questions about where this belongs: either an existing namespace or a new "vcpu domain" namespace, and whether this type of domain should be nestable or not.

        Reference: "Extending restartable sequences with virtual CPU IDs", https://lwn.net/Articles/885818/

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 125
        Bringing up FUSE mounts C/R support

        Intro

        Each filesystem supported in CRIU brings its own problems. Block-device based filesystems are comparably easy to handle: we just need to save the mount options and use them at the restore stage, and it is also possible to provide such filesystems as external mounts. Some virtual filesystems have to be handled specially; for instance, for tmpfs we take care of saving the entire fs content, and for overlayfs we have to do some special processing to resolve source directory paths. But NFS and FUSE filesystems are a totally different story. This talk aims to cover and discuss the approaches to, and problems connected with, FUSE filesystem support. There are some parallels with the NFS support (which is present in the CRIU OpenVZ fork), but the general approach is different. Right now we don't have a ready-to-go solution for FUSE C/R support; this work was started by Vitaly Ostrosablin and me this year. We have ideas and PoC solutions for some of the most important technical problems that come to mind, but we also have things to discuss with the community.

        Plan

        Intro

        The main problem with FUSE filesystem support is that FUSE ties together several different kernel objects (the FUSE mount, the FUSE daemon task, the FUSE control device, FUSE file descriptors, FUSE file memory mappings). This is very challenging from the CRIU side because we restore kernel resources in a specific order - and this order is not a matter of our choice.

        How does CRIU handle files C/R?

        First of all, CRIU restores all the mounts. Tasks are restored later. Why?
        1. to have the ability to restore file memory mappings at the same time as we restore the process tree (to get VMAs inherited) [2]
        2. to restore memory mappings for files we need to have the mount root descriptors ready to use

        Finally, we have a strict order: mounts -> tasks and mappings. But FUSE breaks this logic completely. We need to have a FUSE daemon ready at the same time as we create the mount. But we can't do that, because the FUSE daemon task may use some external resources like network sockets, pipes, or file descriptors opened from other mounts.

        What can we do about that?

        The idea is fairly simple and elegant: create a fake FUSE daemon and use it when mounting FUSE, then, once we are ready, replace the fake daemon with the original one. The good news here is that the kernel allows doing that without any changes.

        TBD

        Next challenge: dumping FUSE file descriptor info with a frozen network

        TBD

        References

        • [1] Original issue https://github.com/checkpoint-restore/criu/issues/53
        • [2] https://github.com/checkpoint-restore/criu/blob/7d7d25f946e10b00c522dc44eb9c60d9eba2e7a0/criu/files-reg.c#L2372
        Speaker: Alexander Mikhalitsyn (Virtuozzo)
      • 126
        Closing session
        Speaker: Christian Brauner
    • Kernel Summit
      • 127
        Regression tracking & fixing: current state, problems, and next steps

        This session will provide a very quick and brief overview about Thorsten’s recent regression tracking efforts, which are performed with the help of the regression tracking bot “regzbot”. Thorsten afterwards wants to outline and discuss a few oddities and problems in Linux development he noticed during his work that plague users – for example how bugzilla.kernel.org is badly handled and how some regressions are resolved only slowly, as the fixes take a long time to get mainlined.

        In addition to that he will also outline some of the problems that make regression tracking hard for him in the hope a discussion will find ways to improve things. The session is also meant as a general forum to provide feedback to Thorsten about his work and discuss the further direction.

        Speaker: Thorsten Leemhuis
      • 128
        Modernizing the kdump dump tools

        kdump is a mechanism to create dump files after kernel panics for later analysis. It is particularly important for distributions, as kdump is often the only way to debug problems reported by customers. Internally, kdump uses two user space tools: makedumpfile, for dump creation, and crash, for dump analysis.

        For both makedumpfile and crash to work they need to parse and interpret kernel-internal, unstable data structures. This is problematic, as both tools claim to be downward compatible. Over the decades of their existence this has led to more and more history accumulating, up to the point that it often takes hours of git archaeology to find out why the code is the way it is. This is not only time consuming but also leads to many bugs that could be prevented.

        This talk shows how moving makedumpfile and crash to the tools/ directory in the kernel tree can help to simplify the code and thus reduce the maintenance needed for both tools. It also shows what consequences this move has for downstream partners and how these consequences can be minimized.

        Speaker: Philipp Rudo
      • 11:30 AM
        Break
      • 129
        Why is devm_kzalloc() harmful and what can we do about it

        devm_kzalloc() was introduced more than 15 years ago and its usage has grown steadily through the kernel sources (more than 6000 calls and counting). While it has helped lower the number of memory leaks, it is not the magic tool that many seem to think it is.

        The devres family of functions ties the lifetime of the resources they allocate to the lifetime of a struct device bound to a driver. This is the right thing to do for many resources: for instance, MMIO or interrupts need to be released when the device is unbound from its driver at the latest, and the corresponding devm_* helpers ensure this. However, drivers that expose resources to userspace have, in many cases, to ensure that those resources can be safely accessed after the device is unbound from its driver. A particular example is character device nodes, which userspace can keep open and close after the device has been unbound from the driver. If the memory region that stores the struct cdev instance is allocated by devm_kzalloc(), it will be freed before the file release handler gets to run.
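
        A condensed sketch of the problematic pattern (illustrative driver code, not taken from any real driver; my_fops and my_devt are placeholders):

        #include <linux/cdev.h>
        #include <linux/platform_device.h>
        #include <linux/slab.h>

        struct my_dev {
                struct cdev cdev;
                /* ... */
        };

        static int my_probe(struct platform_device *pdev)
        {
                /* Freed automatically when the device is unbound from the driver... */
                struct my_dev *md = devm_kzalloc(&pdev->dev, sizeof(*md), GFP_KERNEL);

                if (!md)
                        return -ENOMEM;

                cdev_init(&md->cdev, &my_fops);
                return cdev_add(&md->cdev, my_devt, 1);
                /* ...but userspace may still hold the character device open at that
                 * point, so the eventual release handler touches freed memory. */
        }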

        Most kernel developers are not aware of this issue, which affects an ever-growing number of drivers. The problem has been discussed in the past ([1], [2]) - interestingly in the context of Kernel Summit proposals, though never scheduled there - but never addressed.

        This talk proposal aims at raising awareness of the problem, presenting a possible solution that has been proposed as an RFC ([3]), and discussing what we can do to solve the issue. Solutions at the technical, community and process levels will be discussed, as addressing the devm_kzalloc() harm also requires a plan to teach the kernel community and to catch new offending code when it gets submitted.

        [1] https://lore.kernel.org/all/2111196.TG1k3f53YQ@avalon/
        [2] https://lore.kernel.org/all/YOagA4bgdGYos5aa@kroah.com/
        [3] https://lore.kernel.org/linux-media/20171116003349.19235-1-laurent.pinchart+renesas@ideasonboard.com/

        Speaker: Laurent Pinchart (Ideas on Board Oy)
      • 130
        Current Status and Future Plans of DAMON

        DAMON[1] is the Linux kernel's data access monitoring framework, providing best-effort accuracy within a user-specified overhead range. It has been about one year since it was merged into the mainline. Meanwhile, we have received a number of new requests for DAMON from users and made efforts to answer them. Nevertheless, many things remain to be done.

        This talk will share what requests we have received, what patches have been developed or are under development for them, which requests are still only at the planning stage, and what the plans are. With that, hopefully we will have discussions that will be helpful for improving and prioritizing the plans and specific tasks, and for finding new requirements.

        Specific sub-topics will include, but not limited to:

        • Making the DAMON ABI more stable and flexible
        • Extending DAMON for more usages
        • Improving DAMON accuracy
        • DAMON-based kernel/user space optimization policies
        • Making user space DAMON policies more efficient
        • Making kernel space DAMON policies just work (auto-tuning)

        [1] https://damonitor.github.io

        Speaker: SeongJae Park
      • 1:30 PM
        Lunch
      • 131
        What kernel documentation could be

        The development community has put a lot of work into the kernel's documentation directory in recent years, with visible results. But the kernel's documentation still falls far short of the standard set by many other projects, and there is a great deal of "tribal knowledge" in our community that is not set down. In this talk, the kernel documentation maintainer will look at the successes and failures of the work so far, but will focus on what our documentation should be and what we can do to get it there.

        Speaker: Jonathan Corbet (Linux Plumbers Conference)
      • 132
        Rust

        The effort to add Rust support to the kernel is ongoing. There has been progress in different areas during the last year, and there are several topics that could benefit from discussion:

        • Dividing the kernel crate into pieces, dependency management between internal crates, writing crates in the rest of the kernel tree, etc.

        • Whether to allow dependencies on external crates and vendoring of useful third-party crates.

        • Toolchain requirements in the future and status of Rust unstable features.

        • The future of GCC builds: upcoming compilers, their status and ETAs, adding the kernel as a testing case for them...

        • Steps needed for further integration in the different kernel CIs, running tests, etc.

        • Documentation setup on kernel.org and integration between Sphinx/kernel-doc and rustdoc (this can be part of the documentation tech topic submitted earlier by Jon).

        • Discussion with prospective maintainers that want to use Rust for their subsystem.

        Speakers: Miguel Ojeda, Wedson Almeida Filho
      • 4:30 PM
        Break
      • 133
        Zettalinux: It's Not Too Late To Start

        The current trend in memory sizes leads me to believe that we'll need 128-bit pointers by 2035. Hardware people are starting to think about it [1a] [1b] [2]. We have a cultural problem in Linux where we believe that all pointers (user or kernel) can be stuffed into an unsigned long, and newer C solutions (uintptr_t) are derided as "userspace namespace mess".

        The only sane way to set up a C environment for a CPU with 128-bit
        pointers is sizeof(void *) == 16, sizeof(short) == 2, sizeof(int) == 4,
        sizeof(long) == 8, sizeof(long long) == 16.

        That means that casting a pointer to a long will drop the upper 64 bits, and we'll have to use long long for uintptr_t on 128-bit. Fixing Linux to be 128-bit clean is going to be a big job, and I'm not proposing to do it myself. But we can at least start by not questioning people who use uintptr_t inside the kernel to represent an address.
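
        A trivial sketch of the pattern in question, written against the (hypothetical, since no 128-bit ABI exists yet) data model described above:

        #include <stdint.h>

        /* Under that data model: long is 8 bytes, long long and void * are 16. */
        uintptr_t keep_address(void *p)
        {
                unsigned long truncated = (unsigned long)p;  /* silently drops the upper 64 bits        */
                (void)truncated;                             /* ...yet this is the pattern Linux is full of */

                return (uintptr_t)p;                         /* uintptr_t (unsigned long long here) keeps them */
        }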

        Getting the userspace API fixed is going to be the most important thing
        (e.g. io_uring was just added and is definitely not 128-bit clean).
        Fortunately, no 128-bit machines exist today, so we have a bit of time
        to get the UAPI right. But if not today, then we should start soon.

        There are two purposes for this session:

        • Agree that we do need to start thinking about 128-bit architectures
          (even if they're not going to show up in our offices tomorrow)
        • Come to terms with needing to use uintptr_t instead of unsigned long

        [1a] https://github.com/michaeljclark/rv8/blob/master/doc/src/rv128.md
        [1b] https://github.com/riscv/riscv-opcodes/blob/master/unratified/rv128_i
        [2] https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

        Speaker: Matthew Wilcox (Oracle)
      • 134
        The Maple Tree

        The maple tree is a kernel data structure designed to handle ranges. Originally developed to track VMAs, it found new users before inclusion in mainline, and the tree has many uses outside of the MM subsystem. I would like to talk about the current use cases that have arisen and find out about any other uses that could be integrated into future plans.

        Speaker: Liam Howlett (Oracle)
    • Zoned Storage Devices (SMR HDDs & ZNS SSDs) MC
      • 135
        SSDFS: ZNS SSD ready file system with zero GC overhead

        SSDFS is an LFS file system whose architecture can: (1) exclude GC overhead, (2) prolong the lifetime of NAND flash devices, (3) achieve a good performance balance even when the NAND flash device's lifetime is a priority. The fundamental concepts of SSDFS are: (1) logical segments, (2) the migration scheme, (3) background migration stimulation, (4) diff-on-write. Every logical block is described by {segment_id, block_index_inside_segment, length}. This concept completely excludes block mapping metadata structure updates, which decreases the write amplification factor. The migration scheme implies that, after erase block exhaustion, every update of a logical block results in storing the new state in the destination erase block and invalidating the logical block in the exhausted one. Regular I/O operations are able to completely invalidate the exhausted erase block in the case of "hot" data (no need for GC operations). SSDFS uses the migration stimulation technique as a complement to the migration scheme: if some LEB is under migration, a flush thread checks for the opportunity to add some additional content into the log under commit. SSDFS uses inline techniques to combine metadata/data pieces into one I/O request, decreasing the write amplification factor. The SSDFS architecture is ZNS SSD friendly and it can run efficiently even with a limited number of active/open zones (14 active zones, for example). Preliminary benchmarking and estimations on conventional SSDs have shown the ability of SSDFS to decrease write amplification 2x - 10x and prolong SSD lifetime 2x - 10x for real-life use-cases compared with other file systems (ext4, xfs, btrfs, f2fs, nilfs2).
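
        In other words, an on-disk reference to a logical block can be thought of as the following triple (a conceptual sketch of the idea described above, not the actual SSDFS on-disk layout):

        #include <stdint.h>

        /* Conceptual extent descriptor: which segment, where inside it, and how long. */
        struct logical_extent {
                uint64_t segment_id;                  /* logical segment holding the data        */
                uint32_t block_index_inside_segment;  /* offset of the block within the segment  */
                uint32_t length;                      /* number of logical blocks                */
        };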

        Speaker: Viacheslav Dubeyko (ByteDance)
      • 136
        Improving data placement for Zoned Linux File systems

        In this talk I'll present what I've learned from building ZenFS, a user-space
        RocksDB file system for zoned block devices, and what features could be transferable to kernel file systems.

        I'll go over the goals and high-level design of ZenFS, focusing on the extent allocator, present the performance gains we've measured [2], and discuss the trade-offs involved in constructing a file system for zoned block devices.

        Finishing up, I'd like to open a discussion on how to enable similar levels of performance in POSIX-compliant, general purpose file systems with zone support. Btrfs would be a good first target, but bcachefs could also benefit from this.

        Unless we do data separation (separating files into different zones) we will not reap the full benefits of zoned storage.

        [1] https://github.com/westerndigitalcorporation/zenfs/
        [2] https://www.usenix.org/conference/atc21/presentation/bjorling

        Speaker: Hans Holmberg
      • 137
        Object caching on Zoned Storage

        Object caching is a great use case for SSDs, but it comes with a big device write amplification penalty - often as much as 50% [1] of the capacity is reserved as over-provisioning to reduce its impact on conventional SSDs.

        There is a great opportunity to address this problem using zoned storage, as garbage collection can be co-designed with the cache eviction policy.

        Objects stored in flash caches have a limited lifetime, and a common approach to eviction is to simply throw out the oldest objects in the cache to free space. Conventional drives have no notion of how old objects are, and are not allowed to simply drop data from erase units on the drive to reclaim space. If the garbage collection of the drive data is done cooperatively with a ZNS cache FTL on the host, however, objects can be chosen to be evicted instead of relocated when space is reclaimed.

        To get there, we will need a ZNS cache FTL and an interface between the FTL and the cache implementation for indicating which LBAs should be relocated or invalidated to minimize write amplification.

        How could this be implemented? What options do we have? A ZNS Cache userspace library? A cache block device?
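
        One possible shape for such an interface, purely as a strawman with invented names: when the FTL reclaims a zone, it asks the cache which of the still-valid LBAs are worth keeping, and only those get relocated:

        #include <stdint.h>

        /* Strawman callback between a ZNS cache FTL and the cache implementation. */
        enum reclaim_action {
                RECLAIM_RELOCATE,    /* object still hot: copy it into a new zone         */
                RECLAIM_INVALIDATE,  /* object cold or expired: evict instead of copying  */
        };

        struct zns_cache_ops {
                /* Called by the FTL for every valid LBA range in a zone being reclaimed. */
                enum reclaim_action (*classify)(void *cache, uint64_t lba, uint32_t len);
        };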

        The user/integration point I have in mind for this is CacheLib [2]; what other potential users do we have?

        This is a great opportunity to work together on a common solution for several use cases and vendors, pushing the ecosystem forward!

        [1] https://research.facebook.com/file/298328535465775/Kangaroo-Caching-Billions-of-Tiny-Objects-on-Flash.pdf
        [2] https://github.com/facebook/CacheLib

        Speaker: Hans Holmberg
      • 11:15 AM
        Break
      • 138
        Supporting non-power of 2 zoned devices

        The zoned storage implementation in Linux, introduced in v4.10, first targeted SMR drives with a power-of-2 (po2) zone size alignment requirement. The newer NAND-based zoned storage devices do not naturally align to po2, so the po2 requirement introduces a gap in each zone between its actual capacity and its size. This talk explores the various efforts [1] that have been going on to allow non-power-of-2 (npo2) zoned devices so that LBA gaps in each zone can be avoided. The main goal of this talk is to raise community awareness and get feedback from current/future adopters of zoned storage devices.
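
        The po2 requirement exists largely because it turns zone arithmetic into shifts and masks; supporting npo2 zone sizes means accepting a division per lookup instead, but removes the LBA gap, as the sketch below illustrates (illustrative helpers, not the actual block layer code):

        #include <stdint.h>

        /* Power-of-2 zone size: cheap shift arithmetic, but capacity < size leaves a gap. */
        static uint64_t zone_no_po2(uint64_t sector, unsigned int zone_size_shift)
        {
                return sector >> zone_size_shift;
        }

        /* Non-power-of-2 zone size: a division per lookup, but no gap between
         * zone capacity and zone size. */
        static uint64_t zone_no_npo2(uint64_t sector, uint64_t zone_size)
        {
                return sector / zone_size;
        }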

        [1]https://lore.kernel.org/linux-block/20220615101920.329421-1-p.raghav@samsung.com/

        Speaker: Pankaj Raghav
      • 139
        Zonefs: Features Roadmap

        This presentation will discuss planned new features and improvements for the zonefs file system: asynchronous zone append IOs, relaxing of O_DIRECT write constraint and memory consumption reduction. Feedback from the audience will also be welcome to discuss other ideas and performance enhancements.

        Speaker: Damien Le Moal (Western Digital)
      • 140
        Btrfs RAID on zoned devices

        Currently there is no possibility to use btrfs' builtin RAID feature with zoned block-devices, for a variety of reasons.

        This talk gives a status update on my work on this subject's matter and possibly a roadmap for further development and research activities.

        Speaker: Johannes Thumshirn (Western Digital Corporate)
      • 141
        Experiences implementing zonefs support in ZenFS

        The talk will cover the main challenges in porting a zoned-block-device-aware application using raw block device access (ZenFS using libzbd) to zonefs. In addition, a performance comparison between ZenFS on libzbd and ZenFS on zonefs will be presented.

        Speaker: Jorgen Hansen (WDC)
    • eBPF & Networking

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

    • 1:30 PM
      Lunch
    • 1:30 PM
      Lunch
    • 1:30 PM
      Lunch
    • Open Printing MC

      In this session we will first finalize the features and roadmap for CUPS 2.5 which has a focus on OAuth and new container/sandbox/distribution-independent-packaging support. Then we will discuss the features and roadmap for CUPS 3.0 which implements a new simplified architecture for driverless printing.

      https://github.com/OpenPrinting/cups

      • 149
        CUPS 2.5 and 3.0 Development

        In this session we will first finalize the features and roadmap for CUPS 2.5 which has a focus on OAuth and new container/sandbox/distribution-independent-packaging support. Then we will discuss the features and roadmap for CUPS 3.0 which implements a new simplified architecture for driverless printing.

        https://github.com/OpenPrinting/cups

        Speaker: Michael Sweet (Lakeside Robotics Corporation)
      • 150
        3D Printing

        3D printing is getting more and more popular, not only in industry but also in DIYers' garages. Especially for consumers, but also for professional users, it would be great to have an easy workflow, so that 3D printing "just works" like conventional 2D printing: one simply clicks "Print" in the 3D design/CAD/CAM software and the object gets printed. As with 2D printing, where CUPS filters convert the data formats, the data would in the end be converted to the 3D printer's native language. We also need some standard format for applications to send, like the PDF we have for 2D printing. We will discuss here what such a 3D printing workflow could look like.

        Speakers: Michael Sweet (Lakeside Robotics Corporation), Till Kamppeter (OpenPrinting / Canonical)
      • 4:30 PM
        Break
      • 151
        Testing and CI for OpenPrinting projects

        cups-filters (and also other projects under OpenPrinting) get larger and more and more complex over time. It gets ever harder to keep an overview of the code and to predict the exact effects of a change; when adding a feature or fixing a bug, one can easily cause a regression. One tests the code, but has one really tested all types of input, all settings, …? As human beings easily forget, we need some automated testing: useful things being done when running "make check", and tests being triggered on each Git commit. Here we will discuss strategies for automatic testing. We will also take CUPS' testing as an example and see whether we can proceed similarly with cups-filters.

        https://github.com/OpenPrinting/, https://github.com/OpenPrinting/cups

        Speaker: Till Kamppeter (OpenPrinting / Canonical)
      • 152
        Restricting access to IPP printers with OAuth2 framework

        Native printing in Linux leverages the Internet Printing Protocol (IPP), the standard supported by the vast majority of printers available on the market. While it is quite sufficient for personal use, it has some drawbacks for enterprise customers, such as the lack of standard, OAuth2-based user authorization mechanisms necessary for print management systems. We have tried to address this issue by developing a standard solution that can be implemented in various IPP-based systems.
        The problem can be defined as a general protocol between an IPP client and a printing system consisting of IPP printers and an authorization server. To get access to the printer's resources, the IPP client redirects the user to the authentication webpage provided by the authorization server. When the user authenticates successfully, the server issues an access token that must be used by the IPP client during communication with the printer. The printer uses the access token to verify the user's access rights.
        We would like to discuss security-related issues around this problem and propose a general protocol working for printing systems with different architectures. Other possible solutions will also be discussed.

        Speaker: Piotr Pawliczek (Google)
      • 153
        Documentation for OpenPrinting projects

        Good documentation is something neglected a lot in the free software world. One drives the coding of a project quickly forward to get something which actually works and can be tried out; one wants to finally get one's new library released. But how should people know how to use it? Documentation! CUPS is well documented, but cups-filters (and pappl-retrofit) lack API documentation. Also, the user documentation on the sites of distributions like Debian or SUSE is often much better than our upstream documentation. Here we will discuss how to solve this. API documentation generators for libraries? Upstreaming documentation from distros to OpenPrinting? …?

        http://www.openprinting.org/
        https://github.com/OpenPrinting/openprinting.github.io

        Speakers: Till Kamppeter (OpenPrinting / Canonical), Aveek Basu
      • 154
        Sandboxing/Containerizing alternatives to Snap for Printer Applications or CUPS

        There are Snaps of CUPS and 5 Printer Applications, but Snap also has disadvantages, most prominently the one-and-only Snap Store, and also that some desktop apps start up slowly. Are there alternatives for creating distribution-independent packages, especially of Printer Applications? Docker? Flatpak? AppImage? …?

        https://github.com/OpenPrinting/
        https://openprinting.github.io/OpenPrinting-News-March-2022/#flatpak-and-printing
        https://openprinting.github.io/OpenPrinting-News-April-2022/#appimage-and-printing
        https://openprinting.github.io/OpenPrinting-News-May-2022/#official-docker-image-of-cups-and-printer-applications

        Speaker: Till Kamppeter (OpenPrinting / Canonical)
    • Power Management and Thermal Control MC
      • 155
        Frequency-invariance gaps in current kernel

        The kernel's load tracking scales the observed load by the frequency the CPU is running at; this scaled value is used to determine how loaded a CPU truly is and how its frequency should change.
        Currently, on x86, the four-core turbo level is used as the maximum ratio for every CPU. However, Intel client hybrid platforms have P-cores and E-cores, and Intel server platforms with Intel Speed Select Technology enabled have high-priority cores and low-priority cores.
        The P-cores/high-priority cores can run at a higher maximum frequency, while the remaining cores can only run at a lower maximum frequency.
        In these cases, a unified maximum ratio for every CPU does not reflect reality and introduces unfairness into load balancing.
        Also, the current code doesn't handle special cases where the frequencies of one or more CPUs are clamped via sysfs.
        We would like to demonstrate the impact of these issues for further discussion; a rough sketch of the frequency-invariance scaling is appended below.

        Speaker: Rui Zhang
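
        A rough sketch (illustration only, not the kernel code) of the frequency-invariance scaling mentioned above: the raw running time tracked by PELT is scaled by the ratio between the frequency the CPU actually ran at and an assumed per-CPU maximum. The issue described in the abstract is that this maximum is currently the same (four-core turbo) for every CPU, which misrepresents hybrid and Speed-Select parts.

            /* Illustration of frequency-invariant utilization scaling. */
            #include <stdint.h>

            #define SCHED_CAPACITY_SCALE 1024

            /*
             * raw_util: time-based utilization accumulated while running.
             * cur_khz:  frequency the CPU actually ran at (from APERF/MPERF on x86).
             * max_khz:  assumed maximum frequency of this CPU; using one value
             *           for all CPUs is what skews the result on hybrid parts.
             */
            static inline uint64_t freq_invariant_util(uint64_t raw_util,
                                                       uint64_t cur_khz,
                                                       uint64_t max_khz)
            {
                uint64_t scale = cur_khz * SCHED_CAPACITY_SCALE / max_khz;

                return raw_util * scale / SCHED_CAPACITY_SCALE;
            }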
      • 156
        Unified structure for thermal zone device registration

        To register a thermal zone device, the number of required parameters has increased from 4, when the interface was first introduced, to 8, and people still want to add more.
        This is hard to maintain because every time a new parameter is needed, either a new wrapper is added or all current thermal zone drivers need to be updated.
        In addition, there is already a structure, “struct thermal_zone_params”, which is already used for registration-time configuration.
        Here, I propose using a single structure for registration-phase configuration, or combining it with the existing struct thermal_zone_params, for better maintainability (an illustrative sketch is appended below).

        Speaker: Rui Zhang
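
        An illustrative sketch of the proposal; the structure and function below are hypothetical and do not exist in the kernel today. The idea is that gathering all registration-time parameters into one structure means a new parameter only adds a field, not another wrapper or a tree-wide driver update.

            #include <linux/thermal.h>

            /* Hypothetical: all registration-time configuration in one place. */
            struct thermal_zone_device_args {
                const char *type;
                int num_trips;
                int writable_trips_mask;
                void *devdata;
                struct thermal_zone_device_ops *ops;
                struct thermal_zone_params *tzp;  /* could also be folded in here */
                int passive_delay_ms;
                int polling_delay_ms;
                /* future parameters are appended here; 0 keeps the default */
            };

            /* Hypothetical replacement for the current 8-argument registration call. */
            struct thermal_zone_device *
            thermal_zone_device_register_args(const struct thermal_zone_device_args *args);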
      • 157
        Combining DTPM with the thermal control framework

        The DTPM framework and the thermal control framework use the same algorithm and mechanism when power numbers are involved, which results in duplicated code.
        The DTPM framework interacts with user space, but nothing prevents providing an in-kernel API that power-based cooling devices can act on directly. That would result in simpler code and very explicit use of power values. In addition, if SCMI is supported by DTPM, no changes will be needed in the thermal cooling devices. The result would be one generic power-based cooling device supporting any device (devfreq, cpufreq, ...) with an energy model (DT- or SCMI-based). A purely hypothetical sketch of such an API is appended below.

        Speaker: Daniel Lezcano (Linaro)
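
        A purely hypothetical sketch of the in-kernel API idea mentioned above; neither dtpm_set_power_limit() nor the cooling-device glue below exist today, and the state-to-power mapping is invented. The point is only that a generic power-based cooling device could hand its power budget straight to a DTPM node instead of duplicating the power/state conversion logic.

            #include <linux/dtpm.h>
            #include <linux/thermal.h>

            /* Hypothetical: clamp a DTPM node (and its children) to power_uw. */
            int dtpm_set_power_limit(struct dtpm *dtpm, u64 power_uw);

            /* Hypothetical generic power-based cooling device callback. */
            static int power_cdev_set_cur_state(struct thermal_cooling_device *cdev,
                                                unsigned long state)
            {
                struct dtpm *dtpm = cdev->devdata;
                u64 power_uw = (u64)state * 1000;   /* assumed mapping: 1 state = 1 mW */

                return dtpm_set_power_limit(dtpm, power_uw);
            }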
      • 4:15 PM
        Break
      • 158
        Energy model accuracy

        Energy-aware scheduling (EAS) introduced a simple, yet at the time effective, energy model to help guide task scheduling decisions and DVFS policies. As CPU core micro-architecture has evolved, the error bars on the energy model have grown, potentially leading to sub-optimal task placement. Are we getting to the point where we need to enhance the energy model, or look at new ways to bias task placement decisions? (A deliberately simplified sketch of the energy estimate is appended below.)

        Speaker: Morten Rasmussen (Arm)
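
        A deliberately simplified sketch of the kind of estimate the EAS energy model makes (illustration only, not the kernel code): the energy of a performance domain is approximated from the active power of the operating point it would run at, scaled by how busy its CPUs are. Effects such as static/leakage power or shared resources are not captured, which is the sort of simplification whose growing error bars this talk is about.

            #include <stdint.h>

            /* Simplified stand-in for one entry of the energy-model table. */
            struct perf_state {
                uint64_t freq_khz;
                uint64_t power_uw;   /* active power at this operating point */
            };

            /*
             * Estimated energy (arbitrary units) of a performance domain:
             * active power of the selected operating point times the busy
             * fraction of the domain (sum of CPU utilizations / capacity).
             */
            static uint64_t estimate_energy(const struct perf_state *ps,
                                            uint64_t sum_util,
                                            uint64_t max_capacity)
            {
                return ps->power_uw * sum_util / max_capacity;
            }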
      • 159
        A generic energy model description

        The energy model is currently described through implicit values in the device tree, and the power values are deduced by the in-kernel energy model from the formula P = C×F×V².
        Unfortunately, the description becomes fuzzy if the device uses Adaptive Voltage Scaling or is not performance-based, such as a battery or a backlight.
        On the other hand, complex energy models exist in out-of-tree kernels such as Android's, which shows there is a need for such a description.
        A generic energy model description would give a clear view of the power contributors for thermal management and of the power consumers for accounting and performance. (A small sketch of the P = C×F×V² computation is appended below.)

        Speaker: Daniel Lezcano (Linaro)
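
        A small sketch of the formula mentioned above (illustration only): the coefficient corresponds to what the device tree expresses with a property such as "dynamic-power-coefficient", and the units below are the ones commonly used for it.

            #include <stdint.h>

            /*
             * P = C * f * V^2
             *   coeff     : C, in uW/MHz/V^2 (the DT dynamic power coefficient)
             *   freq_mhz  : f, in MHz
             *   millivolts: V, in mV (hence the / 1000000 to convert mV^2 to V^2)
             * Returns dynamic power in microwatts.
             */
            static uint64_t dynamic_power_uw(uint64_t coeff,
                                             uint64_t freq_mhz,
                                             uint64_t millivolts)
            {
                return coeff * freq_mhz * millivolts * millivolts / 1000000;
            }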
      • 160
        CPUfreq/sched and VM guest workload problems

        Running a workload in a VM results in very different CPUfreq/sched behavior compared to running the same workload on the host. This difference in CPUfreq/sched behavior can cause significant power/performance regressions (on top of the virtualization overhead) for a workload when it is run in a VM instead of on the host.

        This talk will highlight some of the CPUfreq and scheduler load tracking issues/questions both at the guest thread level and at the host vCPU thread level and explore potential solutions.

        Speaker: Saravana Kannan
      • 5:40 PM
        Break
      • 161
        Linux per cpu idle injection

        Per-core/per-CPU idle injection is very effective at controlling thermal conditions without resorting to CPU offlining, which has its own drawbacks. Since CPU temperature ramps up and down very quickly, idle injection provides a fast entry and exit path.

        Linux has had support for per-core idle injection for a while (https://www.kernel.org/doc/html/latest/driver-api/thermal/cpu-idle-cooling.html).
        But this solution has some limitations: it blocks soft IRQs and has a negative effect on pinned timers. I am working on a solution for the soft-IRQ issue, but there is no good solution for pinned timers yet (a usage sketch of the in-kernel idle-injection API is appended below).

        The purpose of this discussion is to find possible solutions for the above issues.

        Speaker: Srinivas Pandruvada
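
        A minimal usage sketch of the in-kernel idle-injection API referenced above (include/linux/idle_inject.h, on which the cpuidle cooling device is built). The module name, CPU selection and durations are made up for illustration.

            #include <linux/cpumask.h>
            #include <linux/errno.h>
            #include <linux/idle_inject.h>
            #include <linux/module.h>

            static struct idle_inject_device *ii_dev;
            static cpumask_t demo_cpus;

            static int __init idle_demo_init(void)
            {
                /* Inject idle on CPUs 0-3 (arbitrary choice for this example). */
                cpumask_clear(&demo_cpus);
                cpumask_set_cpu(0, &demo_cpus);
                cpumask_set_cpu(1, &demo_cpus);
                cpumask_set_cpu(2, &demo_cpus);
                cpumask_set_cpu(3, &demo_cpus);

                ii_dev = idle_inject_register(&demo_cpus);
                if (!ii_dev)
                    return -ENOMEM;

                /* Run 10 ms, then force 5 ms idle: roughly one third forced idle. */
                idle_inject_set_duration(ii_dev, 10000, 5000);

                return idle_inject_start(ii_dev);
            }

            static void __exit idle_demo_exit(void)
            {
                idle_inject_stop(ii_dev);
                idle_inject_unregister(ii_dev);
            }

            module_init(idle_demo_init);
            module_exit(idle_demo_exit);
            MODULE_LICENSE("GPL");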
      • 162
        Fine grain frequency control with kernel governors

        We introduced the AMD P-State kernel CPUFreq driver [1] early this year; it uses ACPI CPPC-based fine-grain frequency control instead of legacy ACPI P-States and was merged into kernel 5.17 [2]. AMD P-State will be used on most Zen2/Zen3 and future AMD processors.

        There are two types of hardware implementations: the “full MSR solution” and the “shared memory solution”. The “full MSR solution” provides the architected MSR set of CPPC registers to manage the performance hints, which is the fast path for frequency updates. In the “shared memory solution”, the CPUs only support a mailbox model for the CPPC registers in system memory; we have to map the system memory shared with CPPC and use kernel RCU locks for synchronization, as done by the in-kernel ACPI CPPC library.

        The initial driver was developed on “full MSR solution” processors and achieves better performance-per-watt scaling in some CPU benchmarks. However, we see performance drops [3] compared with the legacy ACPI CPUFreq driver on “shared memory solution” processors. Traditional kernel governors such as ondemand and schedutil might not be fully suitable for fine-grain frequency control, because AMD P-State exposes 166~255 performance states compared with only 3 ACPI P-States, so a CFS-scheduler-driven governor may request performance changes much more frequently.

        Going forward, we will support more features, including Energy-Performance Preference, which balances performance against energy, and Preferred Core, which designates the best-performing single core/thread in a package. We want to discuss how to refine the CPU scheduler or kernel governors to allow the platform to specify an order of preference for the cores that processes should be scheduled on.

        In this session, we would like to discuss how to improve the kernel governors to achieve better performance-per-watt scaling with fine-grain frequency control, and how to leverage the new Energy-Performance Preference and Preferred Core features to improve Linux kernel performance and power efficiency. (A small sketch of mapping a target frequency to a CPPC performance level is appended below.)

        For details of AMD P-State, please see [4].

        References:
        [1] https://lore.kernel.org/lkml/20211224010508.110159-1-ray.huang@amd.com/
        [2] https://www.phoronix.com/scan.php?page=news_item&px=AMD-P-State-Linux-5.17
        [3] https://lore.kernel.org/linux-pm/a0e932477e9b826c0781dda1d0d2953e57f904cc.camel@suse.cz/
        [4] https://www.kernel.org/doc/html/latest/admin-guide/pm/amd-pstate.html

        Speaker: Ray Huang (AMD)
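
        A small sketch of the mapping problem that fine-grain control creates (illustration only, not the amd-pstate driver code): a governor's target frequency has to be translated into a CPPC "desired performance" level somewhere in [lowest_perf, highest_perf], potentially hundreds of levels, rather than snapped to one of three ACPI P-states.

            #include <stdint.h>

            /* Clamp-and-scale a target frequency onto a CPPC performance level. */
            static uint32_t freq_to_desired_perf(uint32_t target_khz,
                                                 uint32_t max_khz,     /* freq at highest_perf */
                                                 uint32_t lowest_perf,
                                                 uint32_t highest_perf)
            {
                uint64_t perf = (uint64_t)target_khz * highest_perf / max_khz;

                if (perf < lowest_perf)
                    perf = lowest_perf;
                if (perf > highest_perf)
                    perf = highest_perf;
                return (uint32_t)perf;
            }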
      • 163
        Isolation for broken hardware during system suspend

        When a device is broken and returns a failure during suspend, the whole system is blocked from entering low-power system states.
        Thus users lose the most important power-saving feature on their systems because of device failures that are non-fatal for their usage.
        In this case, making system suspend tolerate device failures is a gain. This may be achieved by a) disabling the device on behalf of the BIOS, b) unbinding the device's driver upon suspend, c) skipping the device's suspend callback, or d) ignoring suspend callback failures, etc. It also helps when debugging reported device-related suspend issues. (A purely hypothetical sketch of option (d) is appended below.)

        Speaker: Rui Zhang
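
        A purely hypothetical sketch of option (d) above; no such per-device tolerance flag exists today, and the real PM core suspend path (which also consults PM domains, device types, classes and buses) is more involved. This is only meant to show where a per-device failure-tolerance policy could hook in.

            #include <linux/device.h>
            #include <linux/pm.h>

            /* tolerate_failure would come from a hypothetical per-device sysfs knob. */
            static int device_suspend_tolerant(struct device *dev, bool tolerate_failure)
            {
                int error = 0;

                if (dev->driver && dev->driver->pm && dev->driver->pm->suspend)
                    error = dev->driver->pm->suspend(dev);

                if (error && tolerate_failure) {
                    dev_warn(dev, "suspend failed (%d), continuing as requested\n", error);
                    error = 0;
                }

                return error;
            }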
    • LPC Refereed Track