Linux Plumbers Conference 2023

Description

13-15 November, Richmond, VA, USA. 

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.

    • eBPF & Networking "James River Salon C" (Omni Richmond Hotel)

      "James River Salon C"

      Omni Richmond Hotel

      225

      For the fourth year in a row, the eBPF & Networking Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s networking stack as well as the BPF subsystem and their surrounding user-space ecosystems, such as libraries, loaders, compiler backends, and other related system tooling.

      The gathering is designed to foster collaboration and face-to-face discussion of ongoing development topics, as well as to encourage bringing new ideas into the development community for the advancement of both subsystems.

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will relate to advanced Linux networking and/or BPF.

      eBPF & Networking Track's technical committee: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann (chair), Andrii Nakryiko and Martin Lau.

      • 1
        Opening session
        Speakers: Daniel Borkmann (Isovalent), Jakub Kicinski (Facebook)
      • 2
        Evolving the BPF Type Format

        This talk focuses on a number of issues and suggested solutions around the BPF Type Format (BTF). BTF has become more and more central not just for core BPF features, but also for other subsystems such as ftrace. The goal explored here is to facilitate - as much as possible - the various feature requests that have been expressed around BTF support and that will benefit BTF adoption.

        These include:

        • supporting separating kernel BTF into a dedicated vmlinux BTF module so that small embedded systems can take advantage of BTF
        • supporting explicit matching between kernel and module BTF to identify mismatches; currently mismatches have to be detected implicitly
        • supporting standalone BTF to allow modules to include BTF that does not have to change every time the core kernel does
        • auto-detecting which BTF features are available in a kernel so that we do not encode BTF features in kernel BTF that are not available in the kernel itself; this would be useful for a newer pahole when run on an older kernel
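
        As a side note not taken from the abstract, the kind of feature probing touched on in the last item can already be approximated from user space with libbpf; a minimal sketch (the type name probed is an arbitrary example):

        #include <stdio.h>
        #include <bpf/btf.h>

        int main(void)
        {
                struct btf *vmlinux_btf = btf__load_vmlinux_btf();

                if (!vmlinux_btf) {
                        fprintf(stderr, "no vmlinux BTF available (CONFIG_DEBUG_INFO_BTF?)\n");
                        return 1;
                }

                /* Example probe: does the kernel BTF describe 'struct task_struct'? */
                if (btf__find_by_name_kind(vmlinux_btf, "task_struct", BTF_KIND_STRUCT) < 0)
                        printf("task_struct not found in vmlinux BTF\n");
                else
                        printf("task_struct found in vmlinux BTF\n");

                btf__free(vmlinux_btf);
                return 0;
        }
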
        Speaker: Alan Maguire (Oracle)
      • 3
        Exceptions in BPF

        There has been recent work on adding the notion of exceptions to the BPF runtime in the Linux kernel. In this presentation, we will explore the necessary changes made to the BPF subsystem to fulfill this. We will also explore various implementation choices, reasons for making the feature as generic as possible, and the possibility of integrating similar features found in other languages (C++, Rust, etc.) in the future.

        Finally, we discuss the value proposition of exceptions, how their careful and creative use can simplify writing BPF programs, and how they allow us to make guarantees about program behavior that are difficult to enforce through the verifier's static analysis.
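
        As a rough illustration (not code from the talk), a program could use the bpf_throw() kfunc from the proposed series to abort on a violated invariant instead of propagating an error value through every caller; the declaration below is a sketch:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        /* kfunc added by the BPF exceptions series; declaration sketched here. */
        extern void bpf_throw(u64 cookie) __ksym;

        SEC("tc")
        int drop_on_bad_input(struct __sk_buff *skb)
        {
                /* Assert an invariant; if it does not hold, abort the whole
                 * program rather than threading an error through every caller. */
                if (skb->len < sizeof(struct ethhdr))
                        bpf_throw(0 /* cookie passed to the exception callback */);

                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";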

        Speaker: Kumar Kartikeya Dwivedi (EPFL)
      • 4
        When BPF programs need to die: exploring the design space for early BPF termination

        In the rapidly evolving landscape of BPF as kernel extensions, the need for early termination is becoming increasingly critical, whether due to kernel stalling or the need to enforce execution-time restrictions on critical hook points:

        • The recently added bpf_loop helper can be used to attach a very long-running BPF program to the kernel, which has been demonstrated to stall the kernel.
        • Feature additions like BPF exceptions [1], which allow a programmer to throw errors and terminate the BPF program, are still limited by their inability to automatically clean up resources that were allocated at runtime.
        • Using BPF orchestrators [2], operators want to impose execution-time restrictions on critical hook points so that overall system performance is not hindered by kernel extensions. Resource contention and other factors can severely affect the runtime of a BPF program, which, depending on system load and demand, will call for immediate termination before throttling subsequent requests.

        Thus, the situation now demands a mechanism to allow abrupt termination of any running BPF program, as well as to clean up allocated kernel resources to avoid resource leaks.

        In this talk, we will provide an overview of the design space for BPF termination, discussing the advantages and limitations of each approach. Then, we will introduce a fast-path approach to allow abrupt termination of kernel extensions which leverages the verifier's knowledge to achieve zero-bookkeeping safe cleanup. To create the fast path, all helper function calls are patched to halt further resource allocation or costly function calls. The patching mechanism makes exceptions for helpers which release resources, so that all existing locks and references can be efficiently released. Thanks to the verifier guarantees based on range analysis and branch traversal, resources are guaranteed to be cleaned up regardless of runtime branch decisions.

        Towards the end of this talk, we will discuss some known limitations and assumptions in our work. Most importantly, we invite the community to give feedback on how we can refine this work for a contribution towards Linux upstream.

        [1] : https://lwn.net/Articles/944372/
        [2] : Sahu, Raj, and Dan Williams. "Enabling BPF Runtime policies for better BPF management." Proceedings of the 1st Workshop on eBPF and Kernel Extensions. 2023.

        Speakers: Dan Williams (Virginia Tech), Raj Sahu (Virginia Tech)
      • 11:00
        Break
      • 5
        Verifying the Verifier: eBPF Range Analysis Verification

        This talk will present our automated tool, Agni, to check the correctness of range analysis in the Linux kernel’s eBPF verifier. Agni automatically extracts the semantics of the verifier's range analysis in logic (SMT) from the kernel's C source code. We use abstract interpretation theory to provide a formal specification of the soundness of range analysis. Our tool checks the verifier's range analysis against this specification. When the soundness checks fail, it implies a possibility that a register's actual value during program execution can deviate from the verifier's beliefs regarding that register's value. Agni further synthesizes an eBPF program that can trick the verifier into constructing a range for a register that deviates from its actual concrete value.

        We ran Agni on 16 kernel versions, starting from 4.14 (the earliest we support). We were able to show that the range analysis in kernel versions 5.13 through 5.19 (the latest we checked) is sound. In older kernels where our soundness checks failed, in ~97% of the cases Agni was able to synthesize eBPF programs that manifest bugs in the verifier.

        The talk will consist of:
        - An overview of why the eBPF verifier implements a non-standard range analysis.
        - How we extract the semantics of the verifier's range analysis code into logic.
        - How one goes about proving the soundness of range analysis, and our soundness specification.
        - An overview of how Agni constructs (synthesizes) eBPF programs that manifest bugs in the verifier.
        - The results of our experiments on the 16 kernel versions we tested.
        - A demo of the entire toolchain in action: how we go from kernel source code to actual eBPF programs that manifest bugs in the verifier.
        - A discussion about the future direction with potential for checking other parts of the eBPF verifier analyses for soundness.

        Some useful links:
        - Our tool Agni: https://github.com/bpfverif/agni
        - A collection of eBPF programs synthesized by Agni that manifest bugs in the eBPF verifier's range analysis, with instructions on how to run them: https://github.com/bpfverif/ebpf-verifier-bugs
        - Our recent post about Agni to the kernel mailing list: https://lore.kernel.org/bpf/SJ2PR14MB6501E906064EE19F5D1666BFF93BA@SJ2PR14MB6501.namprd14.prod.outlook.com/T/#u

        Speaker: Harishankar Vishwanathan (Rutgers University)
      • 6
        BPF Memory Model, Two Years On

        What has happened with the BPF Memory Model, two years after the presentation [1] at the Networking and BPF Summit held at the 2021 Linux Plumbers Conference?

        Until recently, not much!

        But that has changed, so much so that this presentation will cover a more detailed proposal for a BPF memory model.

        [1] https://lpc.events/event/11/contributions/941/

        Speaker: Paul McKenney (Facebook)
      • 7
        Overflowing the kernel stack with BPF

        eBPF is accelerating waves of innovation allowing applications to enhance the kernel’s capabilities at runtime, while guaranteeing stability and security. Such guaranteed safety is made possible by the verifier engine which statically verifies BPF code. However, the verifier implicitly makes assumptions about the runtime execution environment, which must hold for safety to be upheld. One such component of the execution environment is the availability of stack space for use by the BPF program. In this talk, we highlight two fundamental problems in the setup of the BPF runtime environment that allowed us to overflow the kernel stack.

        First, the BPF program, when attached to the kernel, often inherits or reuses the kernel stack, which is limited in size. Depending on the attachment point, the kernel stack may already be approaching the limit, and a BPF program can overflow the stack, despite being verified to use less than 8KB of stack space. The verifier makes an incorrect assumption about the runtime execution environment: that the kernel stack will always have 8KB stack space available.

        Second, while in most cases the BPF execution environment restricts nesting of BPF programs to limit the resultant stack depth, it is incomplete. We find that, by hooking BPF programs on helper functions, we can nest multiple BPF programs in the kernel in a way that inherits the same stack state, ultimately exhausting the stack. The verifier once again makes an incorrect assumption about the runtime execution environment: that BPF program nesting is disallowed or carefully controlled in all cases.

        In this talk we intend to touch upon:

        • BPF program attachment and its interaction with the stack.
        • an overview of the well-known difficulties encountered due to the limited kernel stack, especially with paths in the kernel, like XFS, that can bloat the stack in some cases.
        • a demonstration of stack overflow due to BPF program attachment on an otherwise-innocuous attachment point.
        • all methods of BPF program nesting, and their effects on stack reuse.
        • a demonstration of stack overflow due to uncontrolled BPF program nesting.
        • a discussion of how to mitigate these and future problems that are caused by implicit verifier assumptions about the runtime execution environment.
        Speakers: Dan Williams (Virginia Tech), Sai Roop Somaraju (Virginia Tech), Siddharth Chintamaneni (Virginia Tech)
      • 13:00
        Lunch
      • 8
        BPF for Security and LSM updates

        As the BPF LSM matures and gains adoption, there is more need to pick up some of the left-over patches and have them submitted to the kernel:

        • LSM static calls: share an update on how this series is progressing and the latest performance benchmarking results.
        • bpf_get_xattr and bpf_set_xattr: reach consensus on how to get these merged into the kernel.

        The talk will also share an update on how BPF (and LSM) is being used for security, the challenges and the future work that comes out of these challenges.

        Speaker: KP Singh (Google)
      • 9
        BPF_LSM + fsverity for Binary Authorization

        Overview

        Binary authorization is a common security requirement for modern systems. Fundamentally, only securely authorized binaries are allowed to perform certain risky operations. For example, only an authenticated sshd binary is allowed to bind port 22, or only a limited set of authorized binaries should write to raw block devices with critical data. Many proposals have sought to solve this problem, namely fsverity, IMA, etc. However, existing solutions often fail to provide enough flexibility and fine-grained control with reasonably low overhead. In this talk, we present a flexible and low-overhead solution based on BPF_LSM and fsverity.

        Design

        In this solution, we use:

        • fs-verity for file integrity checksums
        • A secure binary signing service to compute and sign fs-verity hashes
        • Xattrs to store fs-verity root hash signatures
        • BPF_LSM to enforce access control
        • A user-space daemon to manage keyrings and BPF_LSM programs

        Kernel Work

        We will need the following kfuncs to enable this work:
        • bpf_fsverity_get_digest() to get the fs-verity root hash
        • bpf_vfs_getxattr() to get the xattr, which contains the signature
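
        A rough sketch of how these could fit together in a BPF_LSM program (not the actual patchset; the hook choice and control flow are illustrative only, and the kfunc calls are only described in comments since their final signatures are not fixed here):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("lsm.s/bprm_check_security")
        int BPF_PROG(check_signed_binary, struct linux_binprm *bprm)
        {
                /*
                 * 1. bpf_fsverity_get_digest() would fetch the fs-verity root
                 *    hash of bprm->file;
                 * 2. bpf_vfs_getxattr() would fetch the signature stored in the
                 *    file's xattr;
                 * 3. the signature is checked against a keyring managed by the
                 *    user-space daemon (elided here);
                 * 4. return -EPERM if any step fails.
                 */
                return 0; /* allow in this skeleton */
        }

        char LICENSE[] SEC("license") = "GPL";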

        Note: We will have a patchset and/or a PoC for review before LPC 2023.

        Speakers: Song Liu (Meta), Boris Burkov (Meta)
      • 10
        Sysarmor: Meta's eBPF Security Detection and Enforcement Tool

        Sysarmor is a security daemon used to detect possible threats and enforce security rules at Meta. Sysarmor is deployed to higher-threat environments such as colocated hosts, Meta network appliances, development servers, Meta cloud gaming, and the public cloud (AWS/GCP). Sysarmor has over 40 BPF-based detections, covering areas such as networking, privilege escalation, hardware attacks, rootkits, unknown executables, container creation, and container escape.

        The main differentiator between Sysarmor and similar BPF-based security tools is that Sysarmor evaluates its rules inside the BPF program, instead of dispatching events to user-space logic. This allows Sysarmor to use BPF-LSM to enforce these rules if desired. Sysarmor rules can make use of process information or container information, as well as hook-specific arguments, to decide whether or not an action is allowed.

        The talk will provide an overview of Sysarmor and some of the hooks used, then will discuss several challenging areas the Sysarmor team has worked on:

        1. Efficiently and accurately gathering process information such as executable filename and using it for filtering events.
        2. Gathering container information such as ids and image ids, and associating that information with kernel information such as namespace ids.
        3. Using uprobes in system executables effectively, and associating information from system logs with BPF events.
        4. Using BPF iterators to recreate context after a service restart.
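
        For readers unfamiliar with the in-kernel rule evaluation mentioned above, the general pattern (illustrative only, not Sysarmor code; the toy policy below is made up) looks roughly like this:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        #define EPERM 1

        SEC("lsm/socket_connect")
        int BPF_PROG(restrict_connect, struct socket *sock,
                     struct sockaddr *address, int addrlen)
        {
                struct task_struct *task = bpf_get_current_task_btf();

                /* The rule is evaluated entirely in the kernel: here, a toy
                 * policy denies connect() for everything except PID 1. A real
                 * policy would consult process/container state kept in maps. */
                if (task->tgid != 1)
                        return -EPERM;

                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";
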
        Speakers: Liam Wisehart (Meta), Shankaran Gnanashanmugam (Meta)
      • 16:00
        Break
      • 11
        Extending Non-Repudiable Logs with eBPF

        The Linux kernel uses non-repudiable logging to attest to system integrity. Non-repudiation ensures that the validity of the log cannot be disputed, even in the presence of an untrusted actor. We present an extensible interface for user-defined programs to leverage TPM-based non-repudiable logging of any kernel data accessible to eBPF programs. With the large variety of eBPF hook locations, our approach allows system integrity to be verified with greater granularity than previously possible. We have used this technique to measure and store container image digests when containers are run, to verify and attest container integrity. The variety of use cases presents an exciting future for eBPF in security and trust.

        Speakers: Avery Blanchard (Duke University), George Almasi (IBM)
      • 12
        Advancing Kernel Control Flow Integrity with eBPF

        We explore the use of eBPF for kernel security, specifically in the context of enforcing kernel control flow integrity (kCFI). CFI is an effective way to defend against control hijack attacks. However, current CFI implementation in the kernel is imprecise and suffers from deployment challenges, resulting in it being underused. We believe eBPF's intrinsic strengths (safety, access to runtime state, dynamicity) can address both the imprecision and deployment issues of kCFI. In this talk, we will discuss the challenges of using eBPF to enforce fine-grained and precise kCFI. We will also discuss techniques to reduce the eBPF invocation cost while maintaining the flexibility of eBPF, a key challenge of this approach. We will present the detailed workings of our eBPF-based kCFI implementation and the evaluation of its performance overhead.

        From this talk, audiences will understand the current limitations of kernel CFI, opportunities/challenges of using eBPF for kCFI, and approaches to overcome those challenges. This discussion will help highlight issues and ways of using eBPF not only for kCFI, but overall kernel security and spur further discussion about the feasibility of such an approach.

        Speaker: Jinghao Jia (UIUC)
      • 13
        Modernizing Android BPF and the Android Security Model

        Android’s support for BPF is currently very limited and does not include many modern upstream features, namely CO-RE and iterators. The Android ecosystem’s security requirements and device lifecycles make integrating libbpf workflows and enabling partner access to BPF difficult.

        The goal of this discussion is to detail and explain the challenges for Android and explore options to enable modern BPF features securely in Android while aligning future development with upstream.

        Speaker: Neill Kapron (Google)
      • 14
        Buzzing Across Space: The Illustrated Children’s Guide to eBPF

        Bonus/fun evening session:

        eBPF has gained widespread adoption, and it is relatively easy for people in tech nowadays to find tutorials or blog posts to get started with eBPF and to understand how it works. Other people hear about eBPF but are less familiar with the related concepts, and they struggle more to understand what it is about and how it changes system programming.

        Based on ”The Illustrated Children’s Guide to Kubernetes” [0], we have created a simple book to introduce the origins, the basics, and the main use cases of eBPF in a simple way. Meet Captain Tux at the helm of his starship, enrolling eBee and her fellows to help boost the engines and various other components of the vessel, in this new story: “Buzzing Across Space: The Illustrated Children’s Guide to eBPF”.

        This session is not about presenting the basics of eBPF, but instead should be a relaxed time to go through the story and illustrations of the book, just to have fun!

        Note: A PDF version of the book is freely available online [1].

        [0] https://www.cncf.io/phippy/
        [1] https://ebpf.io/books/buzzing-across-space-illustrated-childrens-guide-to-ebpf.pdf

        Speaker: Quentin Monnet (Isovalent)
    • Kernel Testing & Dependability MC "Potomac G" (Omni Richmond Hotel)

      "Potomac G"

      Omni Richmond Hotel

      80
      • 15
        Welcome!
        Speakers: Sasha Levin, Shuah Khan
      • 16
        The path to achieving a bug-free build on mainline

        There is a lot of focused testing effort across the Linux kernel community to guarantee the quality of the kernel from build to runtime. Nowadays, not only has the test process moved towards formalization, but test coverage has also been increased to discover more issues earlier. On the other hand, some issues still escape to mainline.

        In this talk, we will dive into the build issues and look for a possible path to achieving a build-issue-free mainline. We will share:
        * the status quo of mainline, including the trend of issues found by different tools
        * the profile of known community testing efforts regarding coverage and focus
        * the practices newly adopted by 0-Day CI in the past year

        Then we want to exchange ideas and have a discussion around:
        * the challenges the community currently faces, like identifying the build coverage of a new patch
        * what kinds of practices can be propagated to the various testing efforts
        * a possible timeline (stages) for such an achievement

        We look forward to having more collaboration with other players in the community to jointly achieve this goal.

        Speaker: Philip Li
      • 17
        Storing and Outputting Test Information: KUnit Attributes and KTAPv2

        Current kernel testing frameworks save basic test information including test names, results, and even some diagnostic data. But to what extent should frameworks store supplemental test information? This could include test speed, module name, file path, and even parameters for parameterized tests.

        Storing this information could greatly improve the kernel developer experience by allowing test frameworks to filter out unwanted tests, include helpful information in KTAP results, and possibly populate auto-generated documentation. I have been working on the new KUnit Test Attributes feature that could be part of this solution to store and access test-associated data.

        But what test attributes should we be saving? How should test-associated data be formatted in KTAP? And what possibilities does this open with parameterized tests (filtering based on parameters or even parameter injection)?
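
        For context, here is a minimal sketch (assuming the KUNIT_CASE_SLOW() macro from this work; not an excerpt from the talk) of how a test can carry a speed attribute that filtering could later act on:

        #include <kunit/test.h>
        #include <linux/module.h>

        static void quick_math_test(struct kunit *test)
        {
                KUNIT_EXPECT_EQ(test, 2 + 2, 4);
        }

        static void slow_exhaustive_test(struct kunit *test)
        {
                /* Imagine a long-running, exhaustive check here. */
                KUNIT_EXPECT_TRUE(test, true);
        }

        static struct kunit_case attribute_example_cases[] = {
                KUNIT_CASE(quick_math_test),
                KUNIT_CASE_SLOW(slow_exhaustive_test),  /* carries the "slow" speed attribute */
                {}
        };

        static struct kunit_suite attribute_example_suite = {
                .name = "attribute-example",
                .test_cases = attribute_example_cases,
        };
        kunit_test_suite(attribute_example_suite);

        MODULE_LICENSE("GPL");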

        Speaker: Rae Moar
      • 18
        Testing Drivers with KUnit (Does hardware have to be hard?)

        Unit testing common library code is (relatively) easy, but drivers often deal with a lot of global state, both in code and in hardware. New features like static stubbing go some way towards making this easier, but a lot of work still goes into making "fake devices".

        There are still many open questions, however:
        - Are the existing tools helping? Is there something obviously missing?
        - Are UML features like LOGIC_IOMEM a good path forward?
        - How should drivers make a fake 'struct device'? Via a platform_device (possibly with devicetree support), root_device, or a new kunit_device?
        - There are lots of ways of managing resources for tests (kunit_resource, KUnit actions, devres/devm_ APIs). What should we use, when?
        - How do we deal with callbacks, threads, etc, with KUnit contexts?
        - How to support other safety/reliability/testing opportunities like hardware fuzzing and Rust?
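
        One minimal way to get a usable 'struct device' in a test today (a sketch under current APIs, not a recommendation from the session) is a root_device, which a dedicated kunit_device helper might eventually replace:

        #include <kunit/test.h>
        #include <linux/device.h>
        #include <linux/module.h>

        static void fake_device_test(struct kunit *test)
        {
                struct device *dev = root_device_register("kunit-fake-dev");

                KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev);

                /* Exercise code that only needs a bare device here, e.g.
                 * devm_-managed allocations hanging off 'dev'. */

                root_device_unregister(dev);
        }

        static struct kunit_case fake_device_cases[] = {
                KUNIT_CASE(fake_device_test),
                {}
        };

        static struct kunit_suite fake_device_suite = {
                .name = "fake-device-example",
                .test_cases = fake_device_cases,
        };
        kunit_test_suite(fake_device_suite);

        MODULE_LICENSE("GPL");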

        Speaker: David Gow (Google)
      • 11:00
        Break
      • 19
        Quality in embargoed patches

        The bar on the quality of code that fixes embargoed issues is pretty low: usually the code is only tested by the author, and possibly a handful of other folks who are part of working on the fix.

        This session is a discussion to help draft a proposal for a testing story that could be presented to HW vendors that wish to publish embargoed code without going through the kernel's usual review process.

        For more context: https://lore.kernel.org/ksummit/ZNuuvS5BtmjcazIv@sashalap/

        Speaker: Sasha Levin
      • 20
        Detecting failed device probes

        Regressions that cause a device to no longer be probed by a driver can have a big impact on the platform's functionality, and despite being relatively common there isn't currently any generic way to detect them.

        By enabling the community to catch device probe regressions in a way that doesn't require additional work for every new platform, and that can catch issues from config changes to driver probe errors, such regressions can be addressed much more quickly, improving the kernel's stability.

        This session will present the approach that is currently being proposed to detect device probe regressions and open the floor for feedback and discussion.

        Speakers: Laura Nao, Nicolas Prado (Collabora)
      • 21
        Unifying and improving test regression reporting and tracking

        The current CI systems for the kernel offer basic and low-level regression detection and handling capabilities based on test results, but they each do so in their own specific way. We wonder if we can find more common ways of tackling the problem by post-processing the data provided by the different CI systems. We could then extract additional "hidden" information, look at failure trends, analyze and compare logs, and understand the impact of the test infrastructure setup on a failure. This would allow us to address the need to track the status of reported regressions, creating a standard workflow across the different tools.

        Regzbot is proving itself as part of this standard workflow, but reporting regressions from CI systems there is hard, as the volume can be high, the relevance of some regressions can be low, and false positives do exist (a lot, especially for hardware testing). Better post-processing of the data and more standard workflows can help more reports reach the community without destroying the quality of the data provided by regzbot.

        We'd like to briefly talk about the current status and limitations of the testing ecosystem regarding regressions, exhibit the current efforts and strategies being developed, and start a discussion about contributions and plans for the future.

        Speakers: Gustavo Padovan (Collabora), Ricardo Cañuelo
    • LPC Refereed Track "James River Salon D" (Omni Richmond Hotel)

      "James River Salon D"

      Omni Richmond Hotel

      183
      • 22
        Resolve and standardize early access to hardware for automotive industry with Linux

        The automotive industry has been an admirer of safety and real-time OSes like QNX, Integrity, and FreeRTOS, as they give faster boot times and real-time, predictable access to hardware, but the trend is changing as more and more OEMs are now migrating to Linux as the high-level operating system.

        Unfortunately, Linux still hasn't resolved a few key use cases required by the automotive industry, which is why SoC manufacturers have built heterogeneous architectures with MCU cores running FreeRTOS for safety- and time-critical applications and microprocessors running Linux for general-purpose use cases.

        The use cases for low-end display clusters, rear-view cameras, low-end driver monitoring, two-wheeler clusters, and time-critical industrial HMIs require:

        • Getting an audio tone on the speakers within 500 msec from boot.
        • Early animated display of needles and graphics on the panel, with or without the GPU involved, within 1000 msec from boot.
        • Sending a CAN response within 100 msec from boot.
        • Early wake-up response on Ethernet within 150 msec from boot.
        • Camera stream to screen (glass to glass) within 500 msec from boot.
        • and more ...

        The solution the industry has found with heterogeneous processors is not scalable, not standardized (no open standards to follow), and not Linux-friendly (for Linux late attach to the MCU).

        The objective of this session is to:
        a) Deep dive into the exact problems and the current solutions, and how we migrate the current RTOS-based solutions to Linux-only solutions.
        b) Discuss how we standardize the "Linux late attach" with heterogeneous SoCs.
        c) Bring together the automotive OEMs, SoC manufacturers, and Linux kernel and user-space maintainers to define "Linux automotive" standards, and harden and improve the Linux kernel and drivers to meet the key performance requirements.

        We would like to conclude the session with an invitation to industry-wide expertise to set up a consortium / an open forum with a focused project for the above-listed automotive use cases in the Linux Foundation or open-source projects like ELISA.

        Speaker: Khasim Syed Mohammed
      • 23
        Rust for Linux

        Rust for Linux is the project adding support for the Rust language to the Linux kernel. Soon after LPC 2022, the initial support for Rust was merged into the kernel for v6.1.

        Since then, there has been progress in several different areas, including the addition of safe abstractions around kernel functionality and subsystems, as well as infrastructure and tooling changes.

        The talk will give an update on the status of the project: the community and team growth, the increased industry support, the Kangrejos workshop, the new website, progress on related tools and use cases, changes to our workflow now that the project is developed in-tree, etc.

        In addition, the talk will also give an overview of the upstreaming progress: the abstractions that have been merged into mainline so far, as well as some of the ongoing efforts.

        Some of the projects/topics mentioned in the talk will be covered at greater length in the Rust MC on Wednesday by their own maintainers. In other cases, the projects kindly provided us with an update which we will give on their behalf in this talk.

        Finally, we will cover some discussion topics where we seek input from the community: the policy around unsoundness issues for stable kernels, the Rust version upgrade policy and the duplicate drivers exception.

        Speakers: Miguel Ojeda, Wedson Almeida Filho
      • 11:00
        Break "James River Salon D" (Omni Richmond Hotel)

        "James River Salon D"

        Omni Richmond Hotel

        183
      • 24
        Beginner Linux kernel maintainer's toolbox

        Kernel maintainers are supposed to send pull requests with code adhering to certain standards, which are not always well documented and might be difficult for new kernel maintainers to figure out. The talk will present current best practices for handling the code and sending it further to the upstream maintainer:
        1. Improvements to the email workflow: b4, useful simple hooks for verifying commits (because checkpatch is not enough).
        2. Get yourself into linux-next and get tested by community continuous integration/testing.
        3. Add yourself to the kernel.org keyring, sign your tags and pushes (for the transparency log).
        4. Dump the mailing lists: use lei and lore.

        The talk is directed towards fresh Kernel Maintainers and to all who want to improve their workflow.

        Speaker: Krzysztof Kozlowski
      • 25
        Speeding up Kernel Testing and Debugging with virtme-ng

        Testing and debugging kernels can be painfully slow: compiling the kernel, setting up a testing system (bare metal or VM), deploying the recompiled kernel, executing tests, collecting results, and repeating the cycle.

        This is intensified by the fact that each kernel developer employs their own distinctive set of custom scripts and workflows to accomplish comparable goals, which can lead to inefficiencies and is often a major deterrent for newcomers aspiring to venture into kernel development.

        Virtme-ng aims to provide a standardized tool for kernel developers that can help to expedite this process. It uses a combination of QEMU/KVM, virtio-fs and overlayfs to boot a recompiled kernel (or any kernel image in the system) inside a virtualized copy-on-write (CoW) live snapshot of the running system.

        This basically allows "forking" the running system with a new kernel, creating a safe sandbox for executing tests (with performance comparable to native host execution), all while eliminating the need for the deployment and maintenance of dedicated testing systems.

        This tool can be extremely useful in a CI/CD scenario or for kernel bisecting, offering substantial time and resource savings. Furthermore, this talk is also an opportunity to raise awareness of common kernel development tools. Sharing our typical development workflow can be a real benefit for the whole kernel community, as even seemingly minor details in our daily routine can help or inspire others to become more involved in kernel development.

        Speaker: Andrea Righi (Canonical)
      • 13:00
        Lunch "James River Salon D" (Omni Richmond Hotel)

        "James River Salon D"

        Omni Richmond Hotel

        183
      • 26
        Emulating NT synchronization primitives in Wine

        In order to emulate Windows NT kernel synchronization primitives, Wine currently uses a single server process, which fields operations on those primitives via RPC from client processes.

        This has historically worked well, but has turned out to be a severe performance bottleneck in heavily multithreaded applications such as modern games.

        In this talk, I propose to emulate the complexity of NT synchronization primitives in a kernel driver, which according to proof-of-concept tests can improve performance up to twice the speed of current Wine.

        Proof-of-concept trees are available here:

        https://repo.or.cz/linux/zf.git/shortlog/refs/heads/ntsync4

        https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync4

        Speaker: Zeb Figura (CodeWeavers, Inc)
      • 27
        Optimizing sysfs and procfs

        Problem - Sysfs and procfs are implemented as on-demand file systems. An on-demand file system maintains metadata about its entries, creates inodes/dentries on access, and keeps them around as long as there is enough memory available in the system.

        During system boot, it was observed that about 40% of sysfs and procfs entries were accessed. This means a significant number of inodes/dentries are allocated along with their metadata. Once these inodes/dentries are allocated, they remain in memory even if there is no immediate need for them.

        This increases the memory requirement during system boot and impacts systems with low memory (embedded/IoT) and virtualized environments (VMs competing for memory).

        Solution - We propose that once access to a sysfs/procfs file/folder is completed (after close() is called), the kernel can delete the corresponding inode/dentry; this can be achieved by using the DCACHE_DONTCACHE flag. This ensures memory is released sooner than in a memory-crunch situation and reduces the overall memory required during boot.

        During the PoC it was observed that there was a slight impact on boot time because the same folder hierarchy was recreated on every access. To mitigate this, we changed the behaviour so that only the inodes/dentries of files are removed upon close(). This ensures that the memory footprint is conserved while not affecting boot time.
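
        A simplified sketch of the idea (not the actual patch): mark the inode when a pseudo-filesystem file is opened so the VFS drops the dentry/inode once the last reference goes away, using the existing d_mark_dontcache() helper that sets DCACHE_DONTCACHE/I_DONTCACHE:

        #include <linux/dcache.h>
        #include <linux/fs.h>

        static int example_pseudo_fs_open(struct inode *inode, struct file *file)
        {
                /* Ask the VFS not to keep this entry cached after use, so the
                 * inode/dentry memory is reclaimed right after the last close()
                 * instead of lingering until memory pressure. */
                d_mark_dontcache(inode);

                return 0;
        }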

        Speakers: Ajay Kaher, Mr Vamsi Krishna Brahmajosyula
      • 16:00
        Break "James River Salon D" (Omni Richmond Hotel)

        "James River Salon D"

        Omni Richmond Hotel

        183
      • 28
        Powering up "discoverable bus-attached" devices on DT-based platforms "James River Salon D"

        "James River Salon D"

        Omni Richmond Hotel

        183

        There is a long-standing issue with devices that need any kind of powering up in order for the bus to be able to enumerate them. A device connected to such a bus will only be discovered/enumerated if resources like regulators, GPIOs and so on are properly configured beforehand. This is one of the main reasons a large number of DT-based platforms mark resources like regulators as "always-on", as a hack rather than a solution. The devicetree cannot describe devices connected to buses like PCI, USB and so on, therefore the resources needed to power up a device connected to such buses cannot be tied to any DT node. Tying them to the bus itself is not accurate from the HW point of view, as they are only needed if one specific device is ever connected on that bus.

        There are quite a few examples of this scenario that involve PCIe, USB and MMC devices. MMC implemented its own solution (pwrseq) a long time ago. For cases like the USB onboard hub, a solution was recently provided upstream, which adds a dedicated devicetree node for the onboard hub; a platform driver probes for it and registers the hub, which is later sysfs-linked by the USB device driver on probe. This approach, however, does not work for every other device that is connected to a discoverable bus, since there are devices that provide multiple functionalities, like Qualcomm QCA Wi-Fi and Bluetooth chips, which usually use two different buses (PCIe and UART/USB), are powered by an onboard PMIC, and use a GPIO for SW reset that is shared between Wi-Fi and Bluetooth. In such a case, in order for the Wi-Fi driver to probe, the device needs to be enumerated first over PCIe, but for that to happen, some configuration needs to be done with respect to regulators (at least).

        Therefore, which driver should handle those resources in such a way that the device itself gets a chance to probe? What subsystem should such a "power sequencer" be part of? How should such a device be represented in devicetree (if possible)? How could a driver for a device on a discoverable bus get its hands on those resources before probing? Can we have a generic way to solve this problem? This presentation tries to bring together all these questions and all scenarios, the historically proposed solutions, and possibly a proof-of-concept.

        https://lore.kernel.org/all/20211006035407.1147909-1-dmitry.baryshkov@linaro.org/
        https://lwn.net/Articles/602855/
        https://www.uwsg.indiana.edu/hypermail/linux/kernel/1406.2/03144.html
        https://lore.kernel.org/all/20230110172954.v2.1.I75494ebee7027a50235ce4b1e930fa73a578fbe2@changeid/

        Speaker: Abel Vesa (Linaro)
      • 29
        Improving kexec boot time

        Maintaining an up-to-date host kernel offers significant advantages such as enhanced security, performance and reliability. One of the primary challenges in live updating the host kernel is the long downtime experienced by guest Virtual Machines (VMs). Several factors contribute to this downtime such as VM snapshot and VM restore time, the biggest one being the time taken to (kexec) reboot the host kernel. In this session we will go through the work that has been done on reducing kernel boot times in a typical server configuration by up to 80%, including parallelizing smpboot, optimizing TSC synchronization, and the work currently being done to streamline the initialization of struct pages when using HugeTLB Vmemmap Optimization (HVO). Additionally, we will go through optimizations that can be done in more specialized scenarios such as skipping PCI probe for certain devices and skipping purgatory. We will also look at the remaining areas which still occupy a significant boot time in the optimized kernel, such as enabling ACPI interpreter and ACPI table loads and discuss if there are ways to optimize/reduce time taken for these.

        Speaker: Usama Arif
    • RISC-V MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82

      We're holding another edition of the RISC-V microconference at Plumbers 2023. Broadly speaking, anything related to both Linux and RISC-V is on topic, but discussions tend to involve the following categories:

      • How to support new RISC-V ISA features in Linux, both for standard and for vendor-specific extensions.
      • Discussions related to RISC-V-based SoCs, which frequently involve interactions with other Linux subsystems as well as core arch/riscv code.
      • Coordination with distributions and toolchains on userspace-visible behavior.
      • 30
        Introduction
        Speaker: Palmer Dabbelt (Google)
      • 31
        Deprecating Stuff

        Let's talk about what we can deprecate and when.

        Speaker: Palmer Dabbelt (Google)
      • 32
        Run ILP32 on RV64 ISA (RV64ILP32)

        The 64ilp32 ABI is not a fresh topic; x86-x32, mips-n32, and arm64-ilp32 have all been around for many years but have yet to succeed in wide usage. But running ILP32 on a 64-bit ISA still has a magic power that attracts people to keep trying; now it is our turn. The rv64ilp32 patch series has iterated to the second version, combining u64ilp32 (user) and s64ilp32 (kernel), supporting the 64ilp32 kernel for the first time. In the presentation, we will show the advantages of 64ILP32 from different views:
        - ILP32 v.s. LP64 (SPECint 2006 & 2017 scores)
        - RV64 v.s. RV32 ISA (memcpy, ebpf, atomic64, DCAS ...)
        - Run 64ilp32 Fedora on all popular platforms.
        The conclusion is that 64ilp32 would help the existing RISC-V hardware platforms by increasing performance and decreasing cost.

        Speaker: Mr Ren Guo
      • 33
        RISC-V patchwork CI

        The RISC-V kernel has a number of different continuous integration (CI) instances in the wild. This session covers the "patchwork CI", which pulls patches from patchwork and reports build/test results back to the submitter. We will be presenting how the CI is set up, what builds are done, and how tests are performed. Further, we will discuss current limitations and outline a "patchwork CI"-next plan.

        We would like to discuss gaps, ideas, and concerns around the patchwork CI. Is anything missing? Can anything be removed? How can we improve the developer experience and improve the quality of the submissions, pre-merge?

        Speaker: Björn Töpel (N/A)
      • 34
        Proposal of porting Trusted Execution Environment Provisioning (TEEP) Protocol with WorldGuard

        The objective of the TEEP Protocol is to install and update the target device or server with the latest critical software and data, which is called a Trusted Component (TC) at the IETF.
        In this procedure, the server remotely checks whether the target device is compromised, and only installs and updates the software components if it is confirmed not to be compromised.

        I would like to propose porting the existing TEEP Protocol implementation on RISC-V, which relies only on PMP, to WorldGuard, which was recently donated to RISC-V International by SiFive.
        The current TEEP implementation only uses the PMP hardware feature and the Keystone software stack.
        It would have much practical value to have TEEP working with the better hardware support that WorldGuard provides for the hardware-isolated region.

        I am standardizing the TEEP Protocol at the Internet Engineering Task Force (IETF) and released its open-source reference implementation at the end of April 2023. I am also an author of the specification, which enables life-cycle management of software packages as Trusted Components (TCs). A TC consists of Trusted Applications and personalization data, covering IoT/edge/network devices, drones, automotive, and heavy-industry equipment.

        Keeping the security bugs in software packages under control is becoming challenging, while the number of software packages required for developing products is dramatically increasing every year.

        Development of the TEEP Protocol implementation started in late 2018.
        It was psychologically exhausting to repeat the approval request for releasing the source code over four years, which took until April 2023.

        It was initially planned to be disclosed in March 2019, so that I could develop collaboratively with the people of the Keystone project at UC Berkeley and the Open Enclave project of the Confidential Computing Consortium at the Linux Foundation. Making it worse, every time the Keystone project released a new version or the TEEP Protocol evolved at an IETF meeting, I had to revise the internal implementation with only the limited engineers at AIST, because the people of Keystone and Open Enclave did not know we were developing the TEEP Protocol on top of their software stack. If I could have collaborated on the implementation with the people of Keystone and Open Enclave, the development effort would have been much less, and the quality of the implementation would have been improved by their expertise. It was a typical bad example of working in an open collaboration community like RISC-V without using the benefit of collaboration.
        I hope Japanese organizations will learn the good practice sometime in the future.

        Speaker: Akira Tsukamoto
      • 11:00
        Break
      • 35
        SBI Supervisor Software Events

        The Supervisor Software Events (SSE) extension provides a mechanism to inject software events from an SBI implementation to supervisor software such that they preempt all other traps and interrupts. This brings interesting challenges for the SBI implementation (OpenSBI, KVM RISC-V, etc.) and supervisor software (Linux).

        Speaker: Mr Clément Léger (Rivos Inc)
      • 36
        Perf feature improvements in RISC-V

        The RISC-V Linux kernel has had some basic perf support with counter overflow and stat until now. This has its own limitations, and multiple perf-related ISA extensions are being drafted to address these concerns. We would like to discuss a few of the existing challenges and new issues related to implementing the new ISA extensions. For example: counter event mapping, event encoding, host + guest usage for perf, kernel blind-spot profiling with SSE, and restricted user-space access to cycle/instret.

        The objective is to get early feedback from the community to figure out the best path forward.

        Speaker: Atish Patra (Rivos)
      • 37
        RISC-V Vector: Current Status and Next?

        In this talk we are going to briefly share the status of Vector extension support and focus our discussion on the use of Vector in kernel mode. We will do this by reviewing other architectures' approaches and seeing whether there is anything we may carry over or improve for RISC-V.

        Most architectures provide a SIMD instruction set to improve the throughput of some operations. However, the use of SIMD instructions in kernel mode is often restricted due to the latency cost of the extra state-keeping. For example, it is not uncommon to see an architecture that disables preemption when using kernel-mode SIMD. Also, most ban the use of SIMD in interrupt context.

        Among these architectures, few provide SIMD-optimized common subroutines (mem/str ops). PowerPC provides vsx/vmx-optimized common routines with a precondition and a side effect: it cannot leverage those routines in interrupt context, and it must disable kernel preemption while using these subroutines. Meanwhile, though the same side effect applies to x86, it provides irq_fpu_usable() to allow some level of SIMD use in interrupt context.

        On RISC-V, should we follow the path of PowerPC? If yes, how do we decide when to use it? Vendors may have varying performance characteristics for V, and using SIMD for small inputs may not be a gain. Should we do runtime detection for this, or just enable V whenever the hardware supports it?

        Further, should RISC-V take a step forward by enabling preemption for its kernel-mode SIMD [1]? Supporting kernel preemption during Vector execution may enable us to use Vector widely in kernel threads or syscalls while remaining at the same level of responsiveness. Historically, the reason for disabling kernel preemption while using kernel-mode SIMD is the per-CPU variable consideration [2]. However, the per-CPU FPU cache is being phased out on x86 and is not present in RISC-V's approach. So it might be a good time to discuss whether such support is a good idea now.

        The talk will cover the following topics:

        • The current status of Vector extension support
        • Basic idea of supporting kernel-mode Vector
        • The use of Vector optimized sub-routines for us and other arch
        • Should we support running kernel-mode Vector with preemption?

        1 https://lore.kernel.org/all/20230721112855.1006-1-andy.chiu@sifive.com/
        2 https://yarchive.net/comp/linux/kernel_fp.html
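
        For reference, the existing x86 pattern referred to above looks roughly like this (x86 code, not RISC-V; a kernel-mode Vector API could take a similar shape):

        #include <linux/string.h>
        #include <asm/fpu/api.h>    /* x86: kernel_fpu_begin/end, irq_fpu_usable */

        static void simd_copy_example(void *dst, const void *src, size_t len)
        {
                if (!irq_fpu_usable()) {
                        memcpy(dst, src, len);  /* scalar fallback */
                        return;
                }

                kernel_fpu_begin();             /* preemption disabled from here on */
                /* ... SIMD-optimized copy would go here ... */
                kernel_fpu_end();
        }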

        Speaker: Tao Chiu
      • 38
        Control Flow Integrity on RISCV

        Memory safety issues impact program safety and integrity. One of the implications of such issues is subversion of the programmer-intended control flow of the program, and thus violation of the program's control-flow integrity. There have been various software (and hardware) mechanisms with which one can enforce the control-flow integrity of a program. One such mechanism uses hardware-assisted shadow stacks (for the backward edge, or return flow) and landing pad instructions (for the forward edge, or calls/jmps). Mainstream instruction set architectures (ISAs) have extensions that can be leveraged by software to assist in enforcing control-flow integrity. RISC-V has an ongoing effort along similar lines to ratify an extension [1]. Additionally, there has been an RFC patch series [2].
        This talk is mostly going to be about the approach and design decisions around the CFI patch series, seeking input from community members. Additionally, this talk will go into the details of how to use the CFI extension to enforce kernel control-flow integrity in the RISC-V kernel.

        1 - https://github.com/riscv/riscv-cfi
        2 - https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/T/

        Speaker: Deepak Gupta
      • 39
        RISC-V irqbypass with KVM

        KVM and VFIO provide an architecture-neutral irqbypass framework, but its enablement requires an implementation of an architecture-specific function, kvm_arch_irq_bypass_add_producer(). The RISC-V AIA and IOMMU specifications provide novel support for guest interrupt delivery (most notably MRIFs), which must be considered for RISC-V KVM's irqbypass implementation. We have an initial proposal which includes the RISC-V IOMMU driver implementing an IRQ domain in order to provide irq_set_vcpu_affinity(). This discussion is seeking feedback on that approach. Additionally, the RISC-V IOMMU will send notice MSIs when guest vIMSICs are backed by MRIFs, requiring a policy to select where the notice MSIs are delivered. This means we need to define new uAPI in order to involve the user. Feedback on the uAPI proposals would also be welcome. Also, we acknowledge that irqbypass performance will differ for guests with assigned interrupt files vs. those with MRIFs, and we would like to discuss how best to modify or extend accounting in order to improve the accuracy of measurements. Finally, the PoC is just getting started; by the time Plumbers meets, enough should be done to have made other design decisions which could be discussed.

        Speaker: Andrew Jones (Ventana Micro Systems)
    • Real-time and Scheduling MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83

      Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include real-time mutexes, high-resolution timers, lockdep, ftrace, RCU_PREEMPT, threaded interrupt handlers, and more. The number of patches that need integration has been significantly reduced, and the rest are mature enough to make their way into mainline Linux.

      The scheduler is the core of Linux performance. With different topologies and workloads, giving the user the best experience possible is challenging, from low latency to high throughput and from small power-constrained devices to HPC, where CPU isolation is critical.

      • 40
        Welcome message and DL Server

        The DL server is a method that allows the use of a SCHED_DEADLINE entity to schedule an entire scheduler. This mechanism can be used for multiple purposes. The base case is, for example, to schedule the CFS scheduler, avoiding starvation caused by SCHED_FIFO. The server's base was presented by peterz some years ago, but it raised some issues: for example, a priority inversion with CFS and FIFO tasks becoming runnable at the same time and the CFS task being scheduled before the FIFO one, breaking the main PREEMPT_RT metric (scheduling latency).

        Progress was made in the last months with the addition of deferrable servers: a deferrable server defers the server activation to the future.

        Towards the implementation of the DL server, it is important to discuss it with the broader scheduler community to define topics such as:

        • The deferrable server
        • The interface
        • How to expand it further
        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 41
        Do nothing fast: How to scale idle CPUs?

        Following surprising benchmark results showing that adding a global raw spinlock in the idle loop significantly improves performance of the scheduler-heavy hackbench benchmark on a 192 core AMD EPYC, a month-long investigation followed to understand the root cause of this behavior.

        This presentation is meant to walk the audience through the findings and the resulting solution, opening discussion on some of the still unexplained behaviors with respect to wakeup-queueing, going-to-idle frequency, task runqueue selection and migration frequency.

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 42
        system pressure on CPUs capacity and feedback to scheduler

        How to better reflect, in the scheduler, the pressure that can be applied to the CPUs' compute capacity, in order to improve task placement decisions and load balancing.
        This is a follow-up of the talk at OSPM; the patchset will be published before LPC.

        Speaker: Vincent Guittot (Linaro)
      • 43
        Optimizing Chromium Low-Power Workloads on Intel Notebooks

        At LPC 2022 we hosted an Energy Quality of Service (EQOS) API discussion. The proposed API enables user-space to inform the kernel about something it is expert in: itself. Callers do not require any knowledge of the hardware, unrelated tasks, or the internal workings of the scheduler. The session sparked a lot of follow-on discussions, with the main take-away being “okay, so prototype it and demonstrate value.”

        We will present a prototype with measurements demonstrating value. A web browser, updated to invoke the EQOS API, can save significant power during video playback and video conferencing workloads, without impacting performance when required.

        The EQOS API achieves this by differentiating tasks that care about maximum performance, from those that are willing to accept performance impact in the name of energy efficiency (EE tasks). The EQOS API simply tells the kernel which is which. As the entire purpose of QOS is to differentiate tasks, the kernel saves the EQOS class in task_struct. This prototype uses that per-task EQOS class in two ways.

        First, it takes advantage of Intel’s Hardware Performance-State “Energy Performance Preference” setting, effectively transforming this hardware knob from per-CPU to per-task. This allows the hardware to aggressively use high frequency for performance tasks, while being mindful of the energy cost of frequency for EE tasks.

        Second, we tell the scheduler’s ASYM-PACKING feature to continue preferring high priority CPUs for performance tasks, but to prefer low priority CPUs for EE tasks. This has the double benefit of getting the EE tasks off the limited high-performance CPUs, at the same time as retiring the EE tasks at more energy-efficient operating points.

        The concept of per-task EQOS class is agnostic to the underlying hardware architecture, and the scheduler is free to use (or ignore) this hint differently on different hardware.

        Unaddressed: Latency preferences are orthogonal to EQOS. Perhaps if we had a Latency-QOS hint, we could tell the scheduler when a task wants to run efficiently, but promptly.

        Speakers: Len Brown (Intel Corporation), Ricardo Neri (Intel Corporation), Mr Vaibhav Shankar (Intel Corporation)
      • 10:55
        Break
      • 44
        How to reduce complexity in Proxy Execution

        The proxy execution patch series continues to be worked on to stabilize and get it ready for validation for use in products.

        But its complexity is high.

        I want to have a discussion for ideas on how we might break things up into more fine grained patches to iteratively get upstream, without making it an epic effort (hello, PREEMPT_RT!), or overwhelming reviewers ("[PATCH 1/628] sched:...")

        What initial half-steps might make sense? Is there value in proxy execution if we only boost locally (boost lockholder only if its on the same cpu as the selected blocked task), skipping migration initially?

        I'll also outline whatever the current state of the patch series is as of Nov.

        Speaker: John Stultz (Google)
      • 45
        Adaptive userspace spinlocks with rseq

        Implementing efficient spinlocks in userspace is not possible yet in Linux, even after years of different approaches and proposed solutions. The main gap to achieving it is the lack of an ABI providing an easy and low-overhead way to check whether the current lock holder is running or not.

        In this session, we are going to present the problem and propose a solution for it using the restartable sequences infrastructure as a means to expose the thread state to userspace cheaply, without requiring system calls.

        RFC: https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
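
        As a rough illustration of the idea (not the RFC's actual ABI: the per-thread "on CPU" state field and flag below are hypothetical), an adaptive lock would spin only while the holder is known to be running, and otherwise yield or block:

        /*
         * Minimal sketch of an adaptive userspace spinlock. The owner's
         * sched_state pointer and the RSEQ_SCHED_STATE_ON_CPU flag stand in
         * for whatever the rseq extension ends up exposing; they are
         * illustrative only.
         */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <sched.h>

        #define RSEQ_SCHED_STATE_ON_CPU 0x1     /* hypothetical flag */

        struct adaptive_lock {
                _Atomic int locked;                       /* 0 = free, 1 = held */
                _Atomic unsigned int *owner_sched_state;  /* published by the holder */
        };

        static bool owner_is_running(struct adaptive_lock *l)
        {
                _Atomic unsigned int *s = l->owner_sched_state;

                return s && (atomic_load_explicit(s, memory_order_relaxed) &
                             RSEQ_SCHED_STATE_ON_CPU);
        }

        void adaptive_lock_acquire(struct adaptive_lock *l)
        {
                for (;;) {
                        int expected = 0;

                        if (atomic_compare_exchange_weak_explicit(&l->locked,
                                        &expected, 1, memory_order_acquire,
                                        memory_order_relaxed))
                                return;
                        if (owner_is_running(l))
                                continue;       /* holder is on a CPU: keep spinning */
                        sched_yield();          /* otherwise don't burn cycles */
                }
        }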

        Speakers: André Almeida (Igalia), Mathieu Desnoyers (EfficiOS Inc.)
      • 46
        CPU Isolation state of the art

        Here's a tour of what has been done on the CPU isolation front
        this year and what still needs to be achieved. Topics will include:

        • Memcg cache drain
        • Vmstat
        • Disable per-CPU buffer_head cache
        • IPI deferrals
        • cpusets v2 improvements
        • Osnoise tracer
        • Need for a nohz_full cpuset interface?
        • Sysidle (energy optimization)
        Speaker: Frederic Weisbecker (Suse)
      • 47
        Improving CPU Isolation with per-cpu spinlocks: performance cost and analysis

        What do we want?
        - Better CPU isolation, in order to run time-sensitive tasks without interruption

        What is (one of the things) preventing this?
        - queue_work_on(isolated_cpu)

        While working on those, an interesting parallel programming strategy was noticed:
        - Use per-cpu structures with local_lock; when a remote CPU needs any action performed, use queue_work_on(target_cpu).
        - Works great for rare remote-CPU interactions, but is terrible for CPU isolation
        Previous work (Mel Gorman, commit 01b44456) proposes the use of per-cpu spinlocks instead. But aren't spinlocks expensive?

        The objective of this presentation is to show a performance analysis done on per-cpu spinlocks, presenting base info such as:
        - Cache coherence & contention: why spinlocks can be expensive
        - How do per-cpu spinlocks prevent most of this cost?
        - How does that impact isolated CPUs?

        And then showing the numbers:
        - How many clock cycles do per-cpu spinlocks actually cost?
        - What else can we do to save more cycles, and how do those impact performance?
        - How much of the impact can be 'hidden' by OOO execution?
        - What about contention?
        - How does that compare with the current solution?
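
        As a rough kernel-style sketch of the pattern discussed above (struct and function names are made up for illustration), per-cpu data can be protected by a per-cpu spinlock so that a remote CPU drains it directly instead of using queue_work_on() and interrupting the isolated CPU:

        #include <linux/percpu.h>
        #include <linux/spinlock.h>

        struct pcp_cache {
                spinlock_t lock;        /* per-cpu, almost always uncontended */
                int count;
        };

        /* spin_lock_init() would be called for each CPU at boot. */
        static DEFINE_PER_CPU(struct pcp_cache, pcp_cache);

        /* Fast path: each CPU only ever takes its own lock. */
        static void pcp_add_local(int nr)
        {
                struct pcp_cache *c = get_cpu_ptr(&pcp_cache);

                spin_lock(&c->lock);
                c->count += nr;
                spin_unlock(&c->lock);
                put_cpu_ptr(&pcp_cache);
        }

        /*
         * Remote path: instead of queue_work_on(cpu, ...), the initiating CPU
         * takes the target CPU's lock and does the work itself, leaving the
         * isolated CPU undisturbed.
         */
        static int pcp_drain_remote(int cpu)
        {
                struct pcp_cache *c = per_cpu_ptr(&pcp_cache, cpu);
                int nr;

                spin_lock(&c->lock);
                nr = c->count;
                c->count = 0;
                spin_unlock(&c->lock);
                return nr;
        }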

        Speaker: Mr Leonardo Bras Soares Passos (Red Hat)
      • 48
        Q&A about PREEMPT_RT

        Thomas will be open to people's questions about PREEMPT RT and other topics.

        Speaker: Thomas Gleixner
    • Toolchains "Magnolia" (Omni Richmond Hotel)

      "Magnolia"

      Omni Richmond Hotel

      187
      • 49
        Security Features status update

        There has been a ton of work across both GCC and Clang to provide the Linux kernel with a variety of security features. Let's review and discuss where we are with parity between toolchains, approaches to solving open problems, and the exploration of new features.

        Parity reached since last year:

        • -fstrict-flex-arrays=3
        • -fsanitize=bounds
        • __builtin_dynamic_object_size()
        • arm64 Shadow Call Stack (backward edge CFI)

        In progress:

        • __counted_by(member) attribute for bounded Flexible Array Members (see the sketch after this list)

        Needs work/discussion:

        • -fbounds-safety language extension proposal
        • handling nested structures ending in a Flexible Array Member (Clang)
        • language extension to support Flexible Array Member in Unions
        • arbitrary stack protector guard location (Clang: risc-v, powerpc)
        • Link Time Optimization (Kernel support for GCC)
        • forward edge CFI (GCC: KCFI)
        • backward edge CFI (Kernel support for CET)
        • arithmetic overflow protection (GCC & Clang)
        • -Warray-bounds false positives (GCC)
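
        For instance, the __counted_by() annotation listed above ties a flexible array member to its length field so the sanitizer can bounds-check accesses; a minimal sketch (the fallback #define is only there so older compilers still build):

        #include <stdlib.h>

        #ifndef __counted_by
        #define __counted_by(member)    /* no-op on compilers without support */
        #endif

        struct item_list {
                size_t count;
                int items[] __counted_by(count);
        };

        struct item_list *item_list_alloc(size_t n)
        {
                struct item_list *l = malloc(sizeof(*l) + n * sizeof(l->items[0]));

                if (l)
                        l->count = n;   /* accesses past items[count - 1] can now trap */
                return l;
        }
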
        Speakers: Kees Cook (Google), Qing Zhao, Bill Wendling (Google)
      • 50
        Synthesized CFI for hand-written assembly in GNU assembler

        We are working on extending the GNU assembler to synthesize CFI (SCFI) for hand-written assembly. Using a new command line option "--scfi[=all,none]", users can invoke GAS' SCFI machinery to synthesize CFI for hand-written assembly; some restrictions on the hand-written assembly do apply. The work is in progress and the first target is x86_64, with an option like "--scfi=inline" on the roadmap to handle inline assembly later. It will be good to discuss whether this is useful for the Linux kernel, and what further extensions may be needed to make it useful for the kernel's use of hand-written and inline assembly.

        Speaker: Indu Bhagat
      • 51
        break
      • 52
        Graph-based ABI analysis for fun and profit

        Analyzing an ELF binary such as a shared library or a Linux kernel image to deduce properties about its exported API and ABI has been done in multiple ways in the past. Most approaches have in common the general mechanism of first extracting information from the binary, then storing it in an intermediate format and lastly comparing it against the result of another extraction. Applications for this ABI monitoring mechanism include preserving the backwards compatibility of an API at ABI level or gaining insights about changes caused by arbitrary influences to how the binary is produced (different compiler, different flags, actual code changes, etc.).

        While DWARF is the usual source of type information accompanying ELF symbol information, its primary purpose is not to describe an ABI surface, but rather to provide supplemental information that is needed to effectively debug such binaries. More recently, other formats have been explored as alternative sources of type information. Among them are CTF and BTF.

        An experimental implementation of a BTF reader and comparison algorithm eventually became the (open source) STG (Symbol Type Graph) project to explore the advantages and disadvantages of an entirely graph-based approach. STG can consume BTF and XML inputs and we've been using STG to enforce stable Android (kernel and library) ABIs. Our ABI graph equivalence algorithm is founded on sound mathematical principles.

        Adding native DWARF support to STG was a major undertaking as turning arbitrary DWARF into a representative graph required a lot of research and experimentation and we took a data-driven approach in how we overcame challenges (particularly around type deduplication).

        We would like to take this opportunity to go into much more detail about the STG internals, design choices we made, interesting things we learned, including how to:

        • design a compact storage format that is suitable for version control systems
        • design an in-memory graph data structure for high performance
        • traverse the graph structure efficiently in the presence of cycles
        • efficiently extract and deduplicate DWARF type information
        • filter type information that is available but not relevant to public ABIs
        Speaker: Matthias Männich (Google)
      • 53
        toolchain-agnostic build time improvements

        In this talk, we'll cover areas of research for how we might be able to improve compile times and overall build times for the Linux kernel in a toolchain agnostic manner.

        We'll look at:
        - Ingo's "Fast Kernel Headers" series
        - automating header refactoring
        - include-what-you-use and the linux kernel
        - precompiled headers
        - recent improvements to modpost
        - link-time de-duplication of BTF

        Speakers: Tanzir Hasan (Google), Nick Desaulniers (Google)
      • 13:00
        Lunch
      • 54
        Compiling for verified targets (BPF)

        During the GNU Tools Cauldron conference we had an activity called "The challenges of compiling for verified targets", with this abstract:

        The Linux kernel BPF is not a sandbox technology: BPF programs do not
        rely on a sandboxed environment for security, and in fact they are
        executed in the most privileged kernel mode. Instead, BPF programs are
        known to be safe before they are executed. It is the job of the kernel
        verifier to determine whether a given BPF program is safe to be
        executed, and to reject it if that cannot be determined. Conceptually
        speaking, an entire BPF program should be as predictable as a single
        machine instruction. Obviously, this cannot be achieved for any
        arbitrary BPF program given that the BPF ISA is Turing-complete, and
        so the verifier imposes quite draconian restrictions on the programs
        to make sure they always terminate, among other things.

        BPF programs are sometimes written by hand, but as more kernel
        subsystems are being expanded to use BPF, the programs are getting
        bigger and more complicated, and hackers prefer to write BPF programs
        in high level languages like C or Rust and compile them to BPF object
        code using an optimizing compiler. Both the GNU Toolchain and
        clang/llvm provide BPF support.

        In ordinary targets the main challenge of the compiler is to generate
        the optimal[1] machine instructions that implement the same semantics
        as the program being compiled. In verified targets (like BPF) there
        is an additional and very important challenge: the generated machine
        instructions shall be verifiable. While this cannot be guaranteed for
        every input program, ideally the optimizing compiler shall inform the
        user if the input program contains source language constructions that
        inexorably would lead to not-verifiable code, and shall also adjust
        the optimization passes in order to avoid transformations that lead to
        non-verifiable code. The better the compiler does this, the more
        practical compiled BPF will become.

        It is not clear how to achieve this. In this talk, we will first state
        the problem and then examine different alternatives and potential
        techniques and strategies, some of them already tried by the
        clang/llvm BPF port with variable success: IR legalization, usage of
        the static analyzer, verification in assembler, usage of
        counter-passes vs. pass tailoring (-Overifiable), usage of annotations
        in source code, tailoring of the front-ends (BPF C), etc. Also we will
        analyze and discuss the impact that each strategy would have to the
        rest of the compiler.

        Note that this problem is not specific to the GNU Toolchain. Whatever
        techniques get developed will also serve the clang/llvm compiler.
        We will be touching base with them during the LPC conference in
        November this very year.

        [1] Given some criteria like execution speed, or compactness.

        The goal of that activity was to gather input and ideas from the GNU toolchain community on the best way to proceed in order to fulfill the very novel needs of verified targets in general, and BPF in particular. We had an interesting and useful discussion in Cauldron.

        As a next step, we intend to continue the discussions at LPC with the BPF kernel people and also the clang/llvm maintainers. We hope to start developing strategies and techniques to make compilation for verified targets useful in practice, and to keep it that way in a future where a proliferation of verifiers is expected.
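
        As a small illustration of why compiled BPF differs from ordinary targets (this fragment is ours, not taken from the talk): the compiler must keep loop bounds and pointer checks visible so the verifier can prove termination and memory safety.

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        #define MAX_ENTRIES 16

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, MAX_ENTRIES);
                __type(key, __u32);
                __type(value, __u64);
        } counters SEC(".maps");

        SEC("tracepoint/syscalls/sys_enter_openat")
        int sum_counters(void *ctx)
        {
                __u64 sum = 0;

                /* A constant trip count lets the verifier bound the loop. */
                for (__u32 i = 0; i < MAX_ENTRIES; i++) {
                        __u64 *val = bpf_map_lookup_elem(&counters, &i);

                        if (val)        /* NULL check required before dereference */
                                sum += *val;
                }
                return sum != 0;
        }

        char LICENSE[] SEC("license") = "GPL";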

        Speakers: Jose E. Marchesi (GNU Project, Oracle Inc.), Yonghong Song
      • 55
        Towards data type profiling

        Memory accesses can suffer from problems like poor spatial and temporal locality, as well as false sharing of cache lines. Existing presentations of profile data, such as data from the perspective of code, can make it difficult to reason about what the problems are and to work out what the fixes should be. A typical fix may be to reorder variables within a data structure.

        In this work, Namhyung Kim will present ongoing work combining perf events and DWARF debug information in order to correlate samples and present the data types of the variables accessed within a program. However, DWARF debug information alone is not enough to reliably identify the variables accessed. The presentation will discuss the state of data type profiling and its addition to the Linux perf tool, how toolchain limitations are worked around by the tool, and how toolchains can be improved for data type profiling in the future.
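
        To make the kind of fix concrete (a generic example of ours, not from the talk): data type profiling might reveal that two counters written by different CPUs share a cache line, and the fix is to reorder or align the members:

        #include <stdint.h>

        /* Before: rx_packets and tx_packets fall into the same 64-byte line. */
        struct stats_hot {
                uint64_t rx_packets;    /* written by the RX CPU */
                uint64_t tx_packets;    /* written by the TX CPU */
                uint64_t config_flags;  /* read-mostly */
        };

        /* After: each writer's field gets its own cache line. */
        struct stats_cold {
                uint64_t rx_packets __attribute__((aligned(64)));
                uint64_t tx_packets __attribute__((aligned(64)));
                uint64_t config_flags;
        };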

        Speaker: Namhyung Kim (Google)
      • 56
        break
      • 57
        VSCode for kernel development

        How far can we take the kernel development experience in a reference IDE setup? This talk will present a setup I've built: http://github.com/FlorentRevest/linux-kernel-vscode
        It integrates features such as:
        - A series manager https://github.com/FlorentRevest/vscode-git-send-email
        - A mailing list explorer https://github.com/FlorentRevest/vscode-patchwork
        - Notebooks for syzkaller bugs reproduction, ftrace records analysis...
        - A clangd based cross-reference setup
        - Yet Another QEMU Wrapper that integrates with the IDE debugger
        We will also discuss how we could improve it further.

        Speaker: Florent Revest (Google LLC)
      • 58
        Callsite Trampolines

        Memory allocation profiling discussion at LSF/MM/BPF conference this year (https://lwn.net/Articles/932402/) revealed a need for compiler support to instrument call sites of specific functions (in this case memory allocations) in a way that stores additional data for each call site. The details of this idea are described in Steven Rostedt's presentation: https://docs.google.com/presentation/d/1zQnuMbEfcq9lHUXgJRUZsd1McRAkr3Xq6Wk693YA0To/
        We would like to discuss this feature with the compiler community.
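
        For context, a rough sketch of the effect being asked for from the compiler, approximated here with a macro (names are illustrative): every call site of an instrumented allocator gets its own static data slot, collected in a dedicated section that a profiler can walk.

        #include <stddef.h>
        #include <stdlib.h>

        struct alloc_tag {
                const char *file;
                int line;
                unsigned long bytes;    /* accounted to this call site */
        };

        void *profiled_alloc(size_t size, struct alloc_tag *tag)
        {
                tag->bytes += size;
                return malloc(size);
        }

        /*
         * The compiler feature under discussion would emit the per-call-site
         * tag automatically, instead of relying on a macro at every caller.
         */
        #define my_alloc(size) ({                                       \
                static struct alloc_tag __tag                           \
                        __attribute__((section("alloc_tags"))) =        \
                        { .file = __FILE__, .line = __LINE__ };         \
                profiled_alloc((size), &__tag);                         \
        })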

        Speakers: Mr Aleksei Vetrov, Steven Rostedt, Suren Baghdasaryan
    • 13:00
      Lunch "Shenandoah H" (Omni Richmond Hotel)

      "Shenandoah H"

      Omni Richmond Hotel

      60
    • 13:05
      Lunch "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82
    • 14:00
      Lunch "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83
    • Android MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82

      The Android Micro Conference brings the upstream community and Android systems developers together to discuss issues and changes to the Android platform and their dependencies and interactions with the Linux kernel, allowing for collaboration on solutions for upstream.

      • 59
      • 60
        Driver Development Kit (DDK) and Vendor Workflow

        With the upcoming migration to the Bazel build system in Android, Qualcomm recently had to migrate all kernel build processes to Google's Kleaf framework.

        One part of Kleaf is the Driver Development Kit (DDK) which is used to build external modules. Before DDK, Qualcomm had a homebrew solution that essentially mapped Android.mk targets to the classic "make -C KERNEL_SRC M=$PWD". Under the hood, the DDK does the same thing, but adds additional bells and whistles which improve reusability and maintainability.

        The DDK obviated much of the homebrew effort, but there were some challenges getting kernel developers to work with the new system efficiently. In this talk, we'll detail some of these challenges and how collaborative problem-solving with Google moved us forward.

        Speaker: John Moon (Qualcomm Innovation Center, Inc.)
      • 61
        Simplified Android Kernel Driver Development with DDK v2

        The Driver Development Kit (DDK) for Android kernels provides a reliable way to develop drivers against the Generic Kernel Image (GKI). While doing so, the DDK takes care of toolchain selection and hermeticity of the build, and hence paves the way for compliant kernel modules. Further, the DDK facilitates module development with clearly expressed dependencies between modules and a fast incremental build. Under the hood it uses Kbuild as the authoritative build system.

        With DDKv2 we go one step further. By reducing the boilerplate needed to create modules, we not only enable GKI-compliant modules with a lower barrier to entry, but also encourage upstream-friendly module development to reduce the effort of upstreaming modules.
        At the same time we provide mechanisms to build modules against multiple versions of GKI in order to keep up-to-date with the latest available LTS and mainline.

        In this session we want to gauge how well these features work for practical kernel module development.

        Speakers: Matthias Männich (Google), Mr Yifan Hong (Google)
      • 62
        BPF Access Control and CO-RE in Android

        Android lacks Compile Once Run Everywhere (CO-RE) support, limiting a BPF program’s access to kernel data structures. BPF use could increase if CO-RE is enabled in Android. However, due to the complex ecosystem, care must be taken in developing this support. SOC vendors and OEM Partners are currently limited to BPF socket filters only, but have requested access to kprobes and tracepoints, which would open up access to internal kernel data structures and therefore extend the kernel ABI. What approach should be taken to limit access to BPF attach points and ensure program compatibility across Android and kernel versions?
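
        For reference, a minimal CO-RE-style program of the kind this would enable (a generic sketch, not Android-specific): field offsets are resolved against the running kernel's BTF at load time rather than at compile time, so one object can run across kernel versions.

        #include "vmlinux.h"            /* kernel types generated from BTF */
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        #include <bpf/bpf_core_read.h>

        SEC("kprobe/do_unlinkat")
        int BPF_KPROBE(trace_unlink, int dfd, struct filename *name)
        {
                const char *fname = BPF_CORE_READ(name, name);
                char buf[64];

                bpf_core_read_str(buf, sizeof(buf), fname);
                bpf_printk("unlink: %s", buf);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";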

        Speaker: Neill Kapron (Google)
      • 63
        Binder: fixing contention in buffer allocations

        The core graphics stack in Android reported the alloc->mutex being a major source of contention. This would create significant delays and ultimately result in janks. This talk will explore the specific details of this contention and how to mitigate these scenarios in binder.

        Speaker: Carlos Llamas
      • 64
        Android Kernel testing with platform integration

        This talk covers how the Android kernel team tests the Android kernel with platform integration, the issues we have seen, and how we can improve.

        Speaker: Betty Zhou
      • 65
        Improving suspend/resume time and runtime PM on Android

        This talk will be a discussion of how the Android ecosystem could use runtime PM and s2idle to improve resume time and the common issues we've identified so far.

        Some key questions we need to answer would be:

        • If all the devlinks created by fw_devlink enforce runtime PM by default will it make things easier for the vendors or make it worse? Why?
        • If it'll make things worse, how can fw_devlink selectively set it for some device links?
        • Common issues we've seen with enabling s2idle and what can we do about them?
        Speaker: Saravana Kannan
      • 16:05
        Break
      • 66
        RISC-V support in Android

        RISC-V is the first major processor support change for Android since support for ARM64 was added in 2014. This session will outline the current progress of RISC-V support in the Android kernel and discuss GKI, extension support, optimization efforts, and more questions and challenges that have come up so far.

        Speaker: Curtis Galloway (Google)
      • 67
        Adding Third-Party Hypervisor to Android Virtualization Framework

        We discuss adding support for Gunyah to the Android Virtualization Framework. The reference implementation for the Android Virtualization Framework uses KVM, but extensions to support other hypervisors are possible. From the perspective of supporting a new hypervisor in AVF, we will review AVF's hypervisor requirements and the work to add a hypervisor to CrosVM. We'll also discuss new features added to Gunyah to support 2nd and 3rd party use cases.

        Speakers: Elliot Berman (Qualcomm), Prakruthi Heragu (Qualcomm)
      • 68
        Porting Android Automotive on Xen

        This session is about running AOSP in a Xen guest virtual machine. We will start by presenting the goals of the project and provide an updated status. We will follow with gaps in the current implementation and discuss potential development for the future.

        Speaker: Leo Yan
      • 69
        Pixel 6 support on android-mainline

        One of the main problems with upstreaming Android-specific kernel features is the push back from upstream maintainers due to the “lack of an in-tree user”. This talk dives into Google’s efforts to create an upstreaming Android development platform using the Pixel 6 device.

        We will dive into how we added support for Pixel 6 on the android-mainline downstream branch including:

        1. How we manage to keep the downstream Pixel 6 drivers compatible with the constantly evolving upstream APIs.

        2. How we are reducing the downstream technical debt, e.g. removing vendor hooks, upstreaming forked drivers, etc.

        3. Future plans.

        Speakers: Peter Griffin (Linaro), William McVicker (Google)
      • 70
        Can mainline Linux run on Android without vendor hooks?

        There has been a lot of effort to bring Android's out-of-tree patches down to a minimum. With the GKI effort this has been reduced significantly, but somewhat artificially, as some of these patches are now implemented as vendor hooks in modules. Why are they needed and can we ever get rid of them?

        Speaker: Mr Qais Yousef (Google)
      • 71
        16KB Page Size Support

        With the increasing demand for high-performance mobile applications, Android has begun exploring the benefits of moving to a larger base page size. This talk will provide an update on the current status, performance metrics, and the challenges involved with this undertaking.

        Speakers: Juan Yescas, Kalesh Singh (Google)
      • 72
        AOSP Devboards

        We will discuss devboards in AOSP, their current state of affairs, and some ideas around improving collaboration to help multiple interested communities.

        Speaker: Sumit Semwal (Linaro)
    • Containers and checkpoint/restore MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83
      • 81
        Opening session
        Speaker: Stéphane Graber (Zabbly)
      • 82
        Introducing PAGEMAP_SCAN IOCTL for Windows syscalls translation and CRIU

        The Windows APIs GetWriteWatch() and ResetWriteWatch() are used to get and clear the write-tracking state of any number of pages in memory atomically. Only the kernel can keep track of this state efficiently, through the memory management component. The Linux kernel lacked this support.

        The soft-dirty PTE flag was used initially, but it had to be set aside because of its shortcomings, with no way to fix them after years. UFFD_FEATURE_WP_ASYNC and UFFD_FEATURE_WP_UNPOPULATED are new features added to userfaultfd to keep track of the write-tracking state asynchronously and correctly. The PAGEMAP_SCAN IOCTL has been added to filter the information about PTE bits and return only the desired data. It is used to get and clear the write-tracking state atomically.

        By using it, CRIU doesn’t need to freeze processes to pre-dump their memory, and it will have accurate information about pages at the moment of dumping them.

        We'll discuss its evolution, current implementation, use cases, and benchmarks. The IOCTL is at v33. This discussion aims to advance patches and attract more users to this interface.

        Speakers: Andrei Vagin, Mr Muhammad Usama Anjum (Collabora)
      • 83
        User namespaces with host-isolated UIDs/GIDs

        This talk aims to move forward the discussion about an extension of user namespaces that allows the usage of host-isolated (non-mapped) UID/GID. This topic was raised by Stéphane Graber and Christian Brauner originally in [1] and [2]. Stéphane and I would like to share some new results and discuss difficulties with the Linux kernel community.

        Some highlights:
        - extension of kuid_t/kgid_t to 64-bit wide
        - VFS permission model for unmapped UID/GIDs

        [1] Isolated dynamic user namespaces https://lpc.events/event/7/contributions/836/
        [2] Simplified user namespace allocation https://lpc.events/event/11/contributions/982/

        Speakers: Aleksandr Mikhalitsyn, Stéphane Graber (Canonical Ltd.)
      • 84
        In Containers We Trust? Building Trust in Containerized Environments

        Building trust in containerized environments requires the measurement and attestation of individual containers. The Linux Integrity Measurement Architecture (“IMA”) collects and stores file integrity measurements in a non-repudiable log. These measurements are used during remote attestation to verify system integrity and extend trust from the kernel to measured files. File measurements cannot, however, be used to attest individual container integrity because they are not differentiated by namespace. We present a mechanism to measure container integrity, without requiring changes to the host operating system. Using loadable kernel extensions and existing IMA infrastructure, we measure images at container creation and namespace container file integrity measurements throughout runtime.

        Speaker: Avery Blanchard (Duke University)
      • 16:00
        Break
      • 85
        Fuse mounts recovery and Checkpoint/Restore

        During this talk we want to discuss the idea of a FUSE API extension that can be useful for fuse mount healing and Checkpoint/Restore.

        Last year I gave a talk [1] about the first steps of adding FUSE support to CRIU. This time we want to continue this discussion and cover another (but closely related) problem: fuse mount “healing”. It is a very real problem for the LXC project, where we have the LXCFS fuse filesystem and we want better reliability by enabling fuse mount recovery in case the LXCFS daemon crashes; the same idea can be useful for FUSE-based storage.

        [1] https://lpc.events/event/16/contributions/1243/
        [2] https://lore.kernel.org/linux-fsdevel/20230403144517.347517-1-aleksandr.mikhalitsyn@canonical.com/

        Speakers: Aleksandr Mikhalitsyn, Stéphane Graber (Canonical Ltd.)
      • 86
        Cgroups and Enterprise Users

        Enterprise distributions are finally transitioning to cgroup v2 as the default [1][2]. But as has been discussed in previous Linux Plumbers Conferences [3][4], the transition from cgroup v1 to cgroup v2 has not been seamless for userspace applications.

        Some (simpler) enterprise applications have been able to utilize Systemd service files to manage their cgroups needs, but larger and more complex programs require more granular control. The Oracle database has started to utilize libcgroup's cgroup v1/v2 abstraction layer, and this has solved some of their easier cgroup challenges.

        Topics we're interested in discussing. (We're definitely open to other areas of discussion - let us know if you have areas that pique your interest, and we'll gladly consider them. But this is what is of interest within Oracle for the next year or so.)

        • Processes in the root cgroup v1 cgroup are not subject to cgroup resource restrictions (like cpusets), but cgroup v2 does enforce this. Some applications have chosen to use cgrulesengd to solve this, but this risks running afoul of the single-writer rule. And honestly it feels antiquated. How can we provide a more effective solution to these customers whose products will run on either a v1 or v2 system? (To further complicate things, these products also run on many different kernel versions and systemd versions and not all of the latest and greatest features are available.)

        • Some products have relied heavily on realtime for their high-priority (and/or low-latency) processes. Older versions of Oracle Linux allowed them to set realtime quotas and periods on a per-cgroup basis (via the CONFIG_RT_GROUP_SCHED kernel config), but this feature isn't available on cgroup v2 systems. (And I think this is the right choice.) With EEVDF coming soon, I would be curious to hear what are now the best practices for prioritizing certain processes - nice, EEVDF slice length, realtime scheduler class, etc.

        • Now that libcgroup plays nicely with systemd (see release v3.1.0 [5]), we're encouraging distros to again provide it as a package. Yes, users can use libcgroup to violate the single-writer rule, but used properly it can greatly simplify cgroup management, including the creation of delegated systemd scopes. And of course all of its standard cgroup reading/writing capabilities remain. I wouldn't mind giving a quick rundown of where libcgroup is and where it's going. It has proven invaluable for those applications that need to straddle both cgroup v1 and v2.

        • We would like to hear the community's thoughts on settings that are unmappable from cgroup v1 to cgroup v2, like cpuset.sched_relax_domain_level. Currently libcgroup will raise an UNMAPPABLE error in cases like this; note the user can silence this error. Is this the best we can do?

        • Is there interest in adding a libcgroup abstraction/mapping of cpu.stat and other multiline cgroup files?

        [1] https://docs.oracle.com/en/operating-systems/oracle-linux/9/relnotes9.0/ol9-features-changes.html
        [2] https://www.redhat.com/en/blog/whats-new-rhel-90-beta
        [3] https://lpc.events/event/4/contributions/524/
        [4] https://lpc.events/event/11/contributions/930/
        [5] https://github.com/libcgroup/libcgroup/releases/tag/v3.1.0

        Speakers: Kamalesh Babulal, Tom Hromatka
      • 87
        Protecting Sensitive Data in Container Checkpoints

        With the recent integration of container checkpointing in Kubernetes, it is crucial to protect the captured container state in order to maintain the confidentiality and integrity of application data. In this talk, we are going to discuss a built-in mechanism for providing data security by default through asymmetric encryption of CRIU images. By extending CRIU with encryption capabilities, we enable seamless end-to-end security across cluster nodes, without the need for modifications of the underlying container infrastructure. The talk will cover the current state of the project, the necessary changes for integration with existing container environments, and discuss how this mechanism can be utilized in combination with role-based access control in multi-tenant clusters.

        Speakers: Adrian Reber (Red Hat), Radostin Stoyanov (University of Oxford), Wesley Armour (University of Oxford)
      • 88
        Closing session
        Speaker: Stéphane Graber (Zabbly)
    • KVM MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82

      The KVM microconference will focus on KVM itself, as well as KVM's touchpoints with other kernel subsystems. Topics that are primarily aimed at something other than KVM are firmly out of scope and will be rejected. Please consider the Confidential Computing MC, the VFIO/IOMMU/PCI MC, or KVM Forum 2024 if you have a virtualization topic that isn't directly related to KVM internals.

      • 89
        Hypervisor-Enforced Kernel Integrity (Heki) for KVM

        Linux kernel vulnerabilities can be mitigated with kernel self-protection mechanisms such as control-register pinning and memory page protection restrictions. Unfortunately, none of them is bulletproof because they are implemented at the same level as the vulnerabilities they try to protect against. To get a more effective defense, we propose to implement some of these protection mechanisms outside of the kernel thanks to KVM. Our implementation is inspired by other private implementations currently in use (e.g. Windows' Virtual Secure Mode), but our approach is tailored to Linux specificities.

        Taking into account feedback from the first RFC patch series, we are working on a new version bringing minimal static configuration, a new GFN attributes management (alternative to multiple page-track modes), dynamic memory protection for all kernel mappings (e.g. kernel modules, kprobes, ftrace, eBPF JIT), VMM notification, and version management. We'd like to present these new features, discuss the best way to land these changes in mainline, and reach out to potential contributors or users.

        Speakers: Mr Madhavan Venkataraman (Microsoft), Mickaël Salaün (Microsoft)
      • 90
        Multi-KVM Abstract

        Problem Statement

        Rolling out KVM bug fixes and feature upgrades requires unloading KVM modules, which disrupts guests.

        Multi-KVM is a proposal to allow multiple, independent KVM modules to be loaded, unloaded, and run concurrently on the same Linux host to:

        • Upgrade and rollback KVM without disrupting running VMs and other
          KVMs on the host.
        • Enable running KVM modules with different parameters on the same host.
        • Facilitate easier A/B testing for KVM.

        Proposal & Objectives

        The proposed solution for multi-KVM requires refactoring and “privatizing” KVM code, mainly for conflict resolution between the various KVM modules running on the host.

        The approach we plan to take is:

        1. Make all KVM-owned data structures and values KVM-only on x86 (and to a lesser extent on other architectures) so that the structures/values don't need to be fixed across KVM modules.
        2. Fold KVM vendor modules (kvm_intel.ko and kvm_amd.ko) into KVM, removing any exports along the way to avoid DLL hell with vendor modules on x86.
        3. Extract shared system resources out of KVM, e.g. ASID management, into a new base module (proposed name VAC - Virtualization Acceleration Code). This module will run as a prerequisite to KVM, and will manage the shared state of all the KVM modules on the system.
        4. Allow the builder to assign a unique name to KVM modules and devices at compile time.
        5. Make multi-KVM opt-in, and have KVM be fully backward compatible when multi-KVM isn't utilized.

        Since these are fundamental changes in the way the KVM code is laid out, we would like to elicit feedback from the broader KVM community on our proposal.

        Speakers: Anish Ghulati, Sean Christopherson (Google)
      • 91
        Unifying KVM API for protected VM and utilities

        In this session, we discuss the options for unifying the KVM API for
        protected guests: what kind of APIs should/can be unified, and where
        vendor-specific APIs should be allowed.

        At the moment, each technology for protected guests adapts its own
        APIs to construct a guest and make it ready to run, as well as APIs
        to debug a protected guest when guest debugging is allowed. There are
        several user space VMMs like qemu, cloud hypervisor, etc. Also, there
        are related components like VFIO etc. Although some divergence may be
        inevitable, it's undesirable to have unnecessary divergence. We will
        discuss how we can make KVM easy for those (user space VMMs and
        related components) to embrace protected guests.

        Speaker: Isaku Yamahata (Intel)
      • 11:00
        Break
      • 92
        pkernfs: Persisting guest memory and kernel/device state safely across kexec

        Hypervisor live update is a mechanism to support updating a hypervisor in a way that has limited impact on running virtual machines. This is done by pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM processes and then deserialising/resuming the VMs so that they continue running from where they were. So far, all public approaches with KVM have neglected device assignment, which introduces a new dimension of problems. This session will highlight the additional problem space that device assignment brings and discuss potential solutions.

        To support hypervisor live update with device assignment Linux needs new memory management capabilities. In addition to the ability to preserve guest memory and state across kexec, it needs to be able to persist and re-hydrate kernel and device state such as IOMMU page tables so that DMA can keep running during kexec. In this session we explore these requirements and a proposed solution: pkernfs. This is a new in-memory persistent file system which can store guest memory, userspace memory and kernel/device memory for IOMMU page tables.

        We also explore other requirements around improving the security posture of guest memory and how pkernfs will address them, such as by integrating with gmem [1] and keeping guest memory out of the kernel’s direct map. By moving guest memory into reserved DRAM, it also avoids the struct page overhead for guest memory and allows huge/gigantic allocations, similar to what DMEMFS [2] was aiming to achieve.

        I will give a short demo of hypervisor live update with PCI device assignment to illustrate what is being solved.

        There will be a request for reviews and feedback on the RFC which has been posted to lkml.

        The QEMU side of the live update is done largely via Steven Sistare’s QEMU live update patch set [3], with additional changes to support live update with PCI device passthrough; the focus of this session is on the kernel memory management side.

        [1] https://lore.kernel.org/lkml/20230718234512.1690985-1-seanjc@google.com/T/
        [2] https://lore.kernel.org/kvm/cover.1607332046.git.yuleixzhang@tencent.com/
        [3] https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/

        Speakers: Alexander Graf, James Gowans (Amazon EC2)
      • 93
        Hyper-V's Virtual Secure Mode in KVM project update

        Windows Credential Guard is a security feature that provides protection to user credentials by utilizing Hyper-V's Virtual Secure Mode (VSM) hypervisor enlightenments. This feature comes enabled by default in Windows 11 and is becoming a prerequisite in the industry. However, KVM has not been able to support it due to its complexity and intrusiveness.

        We published a VSM proof of concept implementation alongside our upstreaming plan in the KVM forum 2023. It generated a healthy amount of interest in the project. We plan on publishing a first patch series before LPC, and believe the KVM MC and its key attendees make it a good venue to provide an update on the project and to discuss any contentious topics in person.

        Additionally, VSM introduces concepts that might overlap with other discussions held at the KVM MC, like multiple execution contexts per-vCPU and dynamic permission updates of IOMMU and MMU page tables.

        Speaker: Nicolas Saenz Julienne (AWS)
      • 94
        Supporting guest private memory in Protected KVM on Android

        Abstract

        Please consider this as a submission for a small topic.

        In this talk, we present the current approach for supporting guest private memory in Protected KVM (pKVM) on Android for Arm64.

        Support for confidential computing is rapidly becoming more popular, with hardware-supported solutions such as Intel's TDX, AMD's SEV, and Arm's CCA, and software-based implementations such as pKVM. One of the aspects in common is the ability to create and run a "protected" guest, whereby no other entity in the system, including the host, has access to the guest's data (unless explicitly shared).

        The current KVM API presents guests' memory to the host via a host userspace address. Although the host is prevented from accessing the guest memory via that address, such an erroneous access could be fatal to the system and result in a full reset. Moreover, memory for protected guests might demand additional restrictions, such as preventing swapping and migration, which may require host access to the underlying physical pages.

        Guest private memory was proposed for Intel TDX as a new API and solution to address these issues. It represents guest memory using a file descriptor, along with a new allocator that imposes restrictions on what can be done with the memory. It removes the need altogether for having any mapping of the guest private memory in the host operating system, whether in userspace or in the kernel.

        This talk describes the work to port the (proposed) guest memory interface to pKVM on Android (Arm64). Being a software-based Arm64 solution, pKVM has some differences from the guest memory's original target of TDX. The most relevant difference is that pKVM allows sharing memory in place, since it does not encrypt guest memory. Moreover, in pKVM, host stage-2 faults are not necessarily fatal to the system, as they can be injected back to the host giving it a chance to handle them in some situations.

        Audience

        Kernel developers with an interest in confidential computing or virtualization.

        Benefits to the ecosystem

        At the time of this submission, the guest memory work is still in flux, the biggest issue stalling its progress being the API. All Linux (or at least KVM-based) confidential computing proposals would benefit in using the same API, which should be flexible enough to allow for differences in implementation to accommodate the differences between them. Moreover, Android is currently shipping with a GUP-based approach to prevent swapping and migration, with the drawbacks mentioned earlier, and would ideally switch to an upstream-supported interface as soon as possible.

        By presenting the pKVM work in this area, we hope it would lead to a better understanding of the issues involved, aid in arriving at a consensus (if it hasn't happened already), and serve as a guide to other confidential computing proposals that haven't started using guest private memory yet.

        Speaker: Fuad Tabba (Google)
    • eBPF & Networking "James River Salon C" (Omni Richmond Hotel)

      "James River Salon C"

      Omni Richmond Hotel

      225

      For the fourth year in a row, the eBPF & Networking Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s networking stack as well as the BPF subsystem and their surrounding user space ecosystems such as libraries, loaders, compiler backends, and other related system tooling.

      The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of both subsystems.

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      eBPF & Networking Track's technical committee: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann (chair), Andrii Nakryiko and Martin Lau.

      • 95
        bpftime: Fast uprobes with user space BPF runtime

        In kernel operations, the uprobe component of eBPF often faces performance inefficiencies, primarily due to the overheads introduced by context switches. Transitioning to userspace, eBPF can bypass these context switch-induced delays, leading to optimized performance. Moreover, this transition facilitates greater configurability without requiring root access or privileges, thus reducing the kernel attack surface.

        In this talk, we will introduce bpftime, a prototype userspace eBPF runtime. It offers rapid uprobe and syscall hook capabilities: userspace uprobes can be 10x faster than kernel uprobes, without the necessity of two context switches. It can also programmatically hook syscalls of a process safely and efficiently.

        Utilizing binary rewriting, we enable uprobe and syscall hooks: These can trace or patch the execution of a function, and hook, filter, or redirect all syscalls of a process with an eBPF userspace runtime. This runtime can be injected into any running process without the need for a restart or manual recompilation.

        Additionally, we have implemented interprocess eBPF Maps in shared userspace memory for summary aggregation or control plane communication. Compatibility is ensured with existing eBPF toolchains like clang and libbpf for developing userspace eBPF without any modifications. We support CO-RE via BTF, and offer userspace host function access, broadening the utility and ease of use of the bpftime runtime.

        Speaker: Yusheng Zheng (PLCT Lab)
      • 96
        Make ftrace_regs a common trace interface for function entry/exit tracing

        We are looking for a new register-set data structure, instead of pt_regs, for function entry/exit trace events. This is because pt_regs is expected to save all registers, including some control registers which are usually only saved when an exception or interrupt happens; with ftrace, that is not available on some architectures. Moreover, on most RISC architectures, saving all registers takes a lot of time and consumes a large amount of stack, and it is unnecessary at function entry and exit since the registers we need are only those used for passing function parameters, the return value, and the stack.
        Previously, we only had kprobes, which use pt_regs because they are based on the software breakpoint, which is usually implemented as an exception and saves pt_regs automatically.
        Now, we have fprobe for function entry/exit tracing, which is based on ftrace and rethook.
        From the tracefs user’s point of view, fprobe is used for fprobe events, and users are only able to access function arguments, the function return value, and the stack from an fprobe event. Thus we don’t need to use pt_regs.
        The problem is eBPF. fprobe is used by eBPF to enable multi-kprobe events, which expect the handler to access registers via the pt_regs data structure (but usually only access the limited set of registers used for function arguments). So it can be updated to use ftrace_regs too, but that needs another interface.
        Once we have moved to ftrace_regs, we can start integrating rethook with the function-graph tracer. Both implement a shadow stack, but in different ways; if they provide the same interface, we can choose one of them.
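
        A sketch of what a callback restricted to ftrace_regs could look like (accessor names follow the current <linux/ftrace.h> helpers and may still change as part of this work):

        #include <linux/ftrace.h>
        #include <linux/printk.h>

        static void arg_tracer(unsigned long ip, unsigned long parent_ip,
                               struct ftrace_ops *op, struct ftrace_regs *fregs)
        {
                /* Only argument registers are available, not a full pt_regs. */
                unsigned long arg0 = ftrace_regs_get_argument(fregs, 0);

                pr_debug("%ps(arg0=%lx) called from %ps\n",
                         (void *)ip, arg0, (void *)parent_ip);
        }

        /* Registered with register_ftrace_function(&arg_tracer_ops). */
        static struct ftrace_ops arg_tracer_ops = {
                .func = arg_tracer,
        };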

        Speaker: Masami Hiramatsu (Google)
      • 97
        Where have all the kprobes gone

        We recently hit a case where we did not get the expected count of
        attached kprobe hits, and I realized that we don't properly keep stats
        of missed probes.

        While this was no surprise for legacy kprobes (perf + SET_BPF ioctl),
        which was brought up a long time ago, it's currently not possible to
        get such stats even for kprobes created on top of BPF perf and
        kprobe_multi links.

        We are also missing recursion accounting in the BPF layer in several
        places that can drop a kprobe and the related BPF program execution.

        In this talk I'll go through all BPF related tracing probes and show
        where the probe can get lost and how the user can keep track of that.

        Speaker: Jiri Olsa (Isovalent)
      • 98
        xprobes: Hybrid User/Kernel eBPF Probes for Cross-Layer Observability

        eBPF is fundamental for diagnosing performance issues in production environments - where flexible and continuous profiling is key. Understanding, for example, why some functions are taking too long can provide a quick path to uncovering the root cause of performance issues. However, high-level indicators such as high latency are not informative enough: To disambiguate the source of high latency, one must consider not only underlying functions, but also system resources and kernel events - e.g., is the latency of a slow function because of CPU contention? blocking on I/O or locks? or just slow processing?

        In this talk we will start by analyzing the shortcomings of widely-used tools (e.g., flamegraphs, distributed tracing, and BCC probes) for this kind of diagnosis. For example, BCC kprobes have visibility over kernel events, but will only attribute their measurements to entire processes (PIDs). Alternatively, BCC uprobes have higher granularity and attribute to specific userspace functions, however, they lack visibility over the system events (e.g., context switches) which influence their measurements. We will propose xprobes, hybrids between uprobes and kprobes with visibility over application functions as well as kernel events. For profiling, an xprobe maintains aggregate distributions for the current set of functions of interest. These aggregates are updated jointly by userspace (e.g., function entry/return) and kernelspace events.

        xprobes rely on probe-specific code for merging data from kernel and userspace events. An example is our cpu-latency xprobe, which measures only the on-cpu time of userspace functions of interest. Achieving this is complicated because context switching can happen in-between the function entry and return, meaning that only getting the diffs from function entry/return events will result in wall-clock latency, not actual on-cpu latency distributions of each function. For this xprobe specifically, our solution still involves uprobes recording start and end times, but also has kprobes write information to correction variables so as to account for context switching (e.g., if a function is switched out of the CPU, subtract all the time it's been away from the CPU from the final latency measurement).

        We will show how the rich data from xprobes can provide powerful performance insights. For example, if, for a given function, we compare the wall clock latency distributions with distributions from our xprobes (e.g., actual on-cpu latency and I/O latency) we can help disambiguate the root cause of a performance degradation. Finally, we will discuss possible avenues for managing the overheads of xprobes.

        Speaker: Lucas Castanheira (CMU)
      • 11:00
        Break
      • 99
        BPF programmable netdevice

        BPF for networking has seen a number of infrastructure improvements since the last year such as the introduction of tcx as the new tc BPF fast path with BPF link support. The next bigger step in this area is the introduction of a BPF programmable netdevice called "netkit" where the BPF logic is part of the driver's xmit routine. This talk elaborates on why it is needed, provides a detailed overview of netkit's current state and ongoing work. We show how it interacts with tcx and how we integrated both tcx and netkit into Cilium as its new datapath foundation.

        Speaker: Daniel Borkmann (Isovalent)
      • 100
        Application network security and observability in an encrypted future

        Application security and observability systems provide useful insight into L7 application networking. These systems promise nice looking service maps showing all your GRPC connections and how all the network services interact. They snoop DNS traffic providing the key insights of IP to DNS name mappings in a world where IPs are increasingly dynamic and meaningless from an identity perspective. From observability security policies can be built and applied pushing least privilege principles into the application networking.

        However, this tooling is on a collision course with TLS1.3 (encrypted SNI), encrypted DNS and HTTPS. Faced with losing the valuable insights folks have proposed moving observability into the encryption library with uprobes or application modifications. Alternatively, to keep the threat models required when applications are not trusted users have proposed proxy logic and complex certificate management systems. We've even seen proposals to steal the SSH keys from the pod directly through the filesystem. Often security architects cringe.

        Linux has all the building blocks to build a better model though. In this talk we discuss the threat models we believe a security or reliable observability platform needs to meet. Then show how we might build this system using kTLS and BPF. We will discuss current limitations and propose improvements to make the system more seamless and easier to use. As well as provide performance benchmarks. The hope here is to show by extending the operating system with BPF we can chart a course to transparent encryption and keep the security tooling working.

        Speaker: John Fastabend (Isovalent)
      • 101
        Safe sharing of the network with eBPF

        In this talk, we share some transport tunings built using eBPF to improve network performance and reliability. We will discuss examples of problems observed along with their solutions at different scopes – intra-datacenter (small RTT) and inter-region (long RTT) networks. Next, we talk about how we used one BPF attach point (struct_ops) to try a TCP congestion control change aimed at improving network performance. We will walk through our journey around prototyping and deploying custom congestion control algorithms. We will discuss how tcp_iter helped us tackle the problem of existing connections and version management. Lastly, we bring some open challenges for the eBPF community from our experience so far.

        Speakers: Balasubramanian Madhavan (Meta Inc.), Prankur Gupta (Meta Platforms Inc)
      • 13:00
        Lunch
      • 102
        BPF struct_ops - current status and the last developments

        The BPF struct_ops is a kernel-side feature in Linux which allows user-defined methods to be called by subsystems. For example, it is now possible to define a congestion control algorithm in BPF and then proceed to register it with the TCP subsystem in order to effectively regulate traffic.

        The presentation will provide audiences with an understanding of the inner workings of struct_ops, along with its current applications. It will also delve into related projects and highlight the latest developments within the realm of struct_ops.
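
        For orientation, a minimal BPF-side sketch of such a registration, loosely modeled on the kernel selftests (bpf_cubic/bpf_dctcp); it assumes a vmlinux.h generated from a BTF-enabled kernel and implements only the mandatory callbacks with trivial logic:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char LICENSE[] SEC("license") = "GPL";

        static struct tcp_sock *tcp_sk(const struct sock *sk)
        {
                return (struct tcp_sock *)sk;
        }

        SEC("struct_ops/sketch_ssthresh")
        __u32 BPF_PROG(sketch_ssthresh, struct sock *sk)
        {
                const struct tcp_sock *tp = tcp_sk(sk);

                /* Halve the window on loss, never below two segments. */
                return tp->snd_cwnd > 2 ? tp->snd_cwnd / 2 : 2;
        }

        SEC("struct_ops/sketch_cong_avoid")
        void BPF_PROG(sketch_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                /* Plain additive increase, capped by the clamp. */
                if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                        tp->snd_cwnd++;
        }

        SEC("struct_ops/sketch_undo_cwnd")
        __u32 BPF_PROG(sketch_undo_cwnd, struct sock *sk)
        {
                return tcp_sk(sk)->snd_cwnd;
        }

        SEC(".struct_ops")
        struct tcp_congestion_ops sketch_cc = {
                .ssthresh   = (void *)sketch_ssthresh,
                .cong_avoid = (void *)sketch_cong_avoid,
                .undo_cwnd  = (void *)sketch_undo_cwnd,
                .name       = "bpf_sketch",
        };

        The object would then be attached with bpf_map__attach_struct_ops() from libbpf, after which "bpf_sketch" can be selected like any other congestion control.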

        Speaker: Kui-Feng Lee (Meta)
      • 103
        BPF Static Keys

        In the Linux kernel the Static Keys feature allows the inclusion of seldom used features in the fast-path code via the 'asm goto' compiler feature and code live-patching techniques. When disabled, a static key incurs zero overhead.
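
        For readers less familiar with the kernel-side mechanism being mirrored, a minimal sketch (the key name and tracing hook are made up): the branch compiles to a NOP until the key is flipped, at which point the code is live-patched into a jump.

        #include <linux/jump_label.h>
        #include <linux/skbuff.h>

        static DEFINE_STATIC_KEY_FALSE(rx_tracing_enabled);

        /* Hypothetical tracing hook, for illustration only. */
        extern void trace_rx_event(struct sk_buff *skb);

        void rx_fast_path(struct sk_buff *skb)
        {
                /* A NOP while the key is disabled: zero fast-path overhead. */
                if (static_branch_unlikely(&rx_tracing_enabled))
                        trace_rx_event(skb);

                /* ... normal processing ... */
        }

        void rx_tracing_on(void)
        {
                static_branch_enable(&rx_tracing_enabled);  /* patches NOP -> jmp */
        }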

        While looking into ways to extend functionality of the pwru [1] utility to trace networking events it became clear that a similar Static Keys feature would be a good addition to the BPF stack. A draft sketch of the BPF Static Keys API was introduced at the LSF/MM/BPF 2023 conference [2].

        In this talk a complete BPF Static Keys API will be demonstrated together with examples of usage from C and Go programs, including an example of full introspection of the Linux + Cilium networking stack [3].

        [1] https://github.com/cilium/pwru
        [2] http://vger.kernel.org/bpfconf2023_material/anton-protopopov-lsf-mm-bpf-2023.pdf
        [3] https://cilium.io/

        Speaker: Anton Protopopov (Isovalent)
      • 104
        Troubles and Tidbits from Datadog’s eBPF journey

        Datadog has been using eBPF in production for observability, security and networking for several years now. While we managed to leverage eBPF to build new features, which would have been impossible otherwise, we also learned a lot the hard way. In this talk, we aim to get into the details of some gotchas, pitfalls and bugs uncovered over the years. You'll learn about eBPF hook points coverage whoopsies, common bypasses for eBPF-based security tools and a couple of unfortunate series of events from Datadog's cloud workload security product. You will also hear about some challenges with using eBPF for networking like using LRU maps at scale, problems with using shared skb mark value and some fun interactions between sk_reuseport and bpf_sk_assign.

        Speakers: Guillaume Fournier (Datadog), Hemanth Malla (Datadog)
      • 16:00
        Break
      • 105
        eBPF Shenanigans with Flux

        Flux is a framework for writing multicore schedulers, written in eBPF, that runs on top of Google's Ghost kernel scheduling class (analogous to sched_ext). Although the Flux framework is interesting in its own right, this talk will cover the data structures and trickery involved with making Flux work in BPF. Particularly, we heavily utilize array maps as a quasi memory allocator, and have data structures such as linked lists and AVL trees within those maps. Additionally, Flux is lightly "object oriented", and we have strategies for function dispatch in a world where there are no function pointers and all objects are the same type. Hopefully our techniques and data structures will be useful to other BPF users as people make increasingly complicated programs.
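
        The sketch below is purely illustrative (it is not Flux code): it shows the general pattern of using a BPF array map as an object pool, with map indices standing in for pointers and a type tag driving dispatch through a switch, since BPF programs have no function pointers:

        /* Illustrative sketch: an array map as object pool, indices as "pointers". */
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        #define POOL_SIZE 1024                   /* POOL_SIZE acts as the "NULL" index */

        enum obj_kind { OBJ_CPU, OBJ_QUEUE };

        struct obj {
            enum obj_kind kind;
            __u32 next;                          /* index of the next object */
            __u64 payload;
        };

        struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, POOL_SIZE);
            __type(key, __u32);
            __type(value, struct obj);
        } pool SEC(".maps");

        static __always_inline __u64 obj_dispatch(__u32 idx)
        {
            struct obj *o = bpf_map_lookup_elem(&pool, &idx);

            if (!o)
                return 0;
            switch (o->kind) {                   /* "virtual" dispatch by type tag */
            case OBJ_CPU:   return o->payload + 1;
            case OBJ_QUEUE: return o->payload * 2;
            }
            return 0;
        }

        char LICENSE[] SEC("license") = "GPL";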

        Speaker: Barret Rhoden (Google)
      • 106
        Developing Continuous eBPF Profiler to look Beneath the Kernel to Beyond the Clouds

        At LPC 2022, we talked about experimenting with eBPF to extend the existing stack unwinding facility in the Linux kernel for interpreted languages, such as Ruby and Python, as well as runtimes emitting JITed code, like NodeJS.

        While we have successfully implemented these features in parca-agent across both Arm64 and x86 architectures, there is scope for improvement in scalability and uniformity across compilers and kernel versions.

        Despite having pre-existing tools to leverage underlying debug information in stripped binaries, integration with fully open-sourced dynamic profilers is sparse in the cloud-native ecosystem.

        In this talk, we would like to discuss the unexpected edge cases we ran into while walking the stack for different runtimes (such as V8), different compilers (Rust, C++), and different platforms (Arm64), and their interactions with the DWARF and ELF formats.

        We want to touch upon the challenges we faced in implementing low-level observability in the cloud and shed light on what goes into extending eBPF to fetch stacktraces for interpreted runtimes (like JIT code) or to understand GitHub CI better, and various use cases in between. We will elaborate on the challenges of leveraging eBPF to bridge the gap between kernel space and user space while dealing with systems at scale.

        One of our goals with this talk is to introduce to the community how we can use parca-dev to understand control flow and optimise systems in the Linux Kernel. We want feedback on how we can make it easier to debug large codebases such as the Linux Kernel and make dynamic continuous profiling more developer-friendly.

        Speaker: Sumera Priyadarsini (Polar Signals)
      • 107
        Towards a standardized eBPF ISA - Conformance testing

        The BPF Conformance Suite, consisting of a test runner and a suite of test cases, is a tool that addresses the challenge of ensuring cross-runtime compatibility for BPF programs.

        This presentation will delve into the core aspects of the BPF Conformance Suite, including its purpose, components, and the crucial role it plays in the BPF ecosystem. We will explore its ability to evaluate BPF runtime conformity to the BPF Instruction Set Architecture (ISA) specification and its significance in promoting compatibility, security, and collaboration within the BPF community.

        In this 30-minute presentation, we will cover the following key points:

        1. Introduction to BPF and Cross-Runtime Compatibility: An overview of BPF technology, its versatility, and the importance of ensuring BPF program compatibility across various runtimes.
        2. The BPF Conformance Suite: An in-depth look at the suite's composition, including the test runner and suite of test cases, and how it evaluates runtime adherence to the BPF ISA specification.
        3. Real-World Applications: An exploration of how the BPF Conformance Suite is actively used to measure conformance in significant BPF runtimes, such as the Linux Kernel's BPF, uBPF, eBPF for Windows, rbpf, and the Prevail Verifier.
        4. Benefits and Impact: A discussion on the suite's value in enhancing cross-runtime compatibility, bolstering security through the identification of non-conformant behaviors, streamlining development processes, and fostering collaborative efforts within the BPF community.
        5. Looking Ahead: An exploration of the suite's potential for future developments and adaptability as BPF technology evolves.

        In conclusion, this presentation will provide an understanding of the BPF Conformance Suite and its role in ensuring the consistent and reliable operation of BPF programs across diverse runtime environments. We will highlight its impact on compatibility, security, and community collaboration within the dynamic BPF ecosystem.

        Speaker: Alan Jowett (Microsoft)
    • Birds of a Feather (BoF) "Magnolia" (Omni Richmond Hotel)

      "Magnolia"

      Omni Richmond Hotel

      187
      • 108
        RCU Office Hour

        This is a gathering to discuss Linux-kernel RCU-related topics, both internals and usage.

        The exact topics depend on all of you, the attendees. In 2018, the focus was entirely on the interaction between RCU and the -rt tree. In 2019, the main gathering had me developing a trivial implementation of RCU on a whiteboard, coding-interview style, complete with immediate feedback on the inevitable bugs. The 2020, 2021, and 2022 editions of this BoF were primarily Q&A sessions.

        Come (on-site if you can, otherwise virtually) and see what is in store in 2023!

        Speaker: Paul McKenney (Facebook)
      • 109
        Improve Linux Perf tool to account for task sleep

        Problem: As per the current architecture of the Linux perf tool, ‘perf record’ does not collect samples if the target process is in the sleep state. Due to this, the perf tool has the following limitations:

        Incorrect ‘CPU usage’ calculation: if the target task was in the sleep state for around 50% of the time, the CPU usage reported by the perf tool does not account for this.

        No ‘task sleep time’: since the perf tool does not provide any sleep samples, it is not possible to determine how long the task was in the sleep state.

        Solutions: perf-record sampling happens when the perf_swevent_hrtimer() handler executes. If the target process is in the sleep state, the handler is not called.

        1) When the perf_swevent_hrtimer() handler executes, it can calculate the samples missed while the target was in the sleep state, using:
        missed_sample_count = ((current_time – hrtimer_start_time) / sampling_freq)
        The missed sample count would then be sent to the user-space perf sample handler, which stores this information in perf.data, and perf-report processes all missed samples and adds them to the total samples (a rough sketch of this computation follows after this list).

        2) The user-space perf tool could calculate CPU usage based upon the expected number of samples instead of the total samples collected, as shown:
        expected_sample = total_time / freq

        3) Change the behaviour of the perf_swevent_hrtimer() handler so that it is always called even if the target task is in the sleep state (either by waking up the target task or by running in another task’s context).
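
        As a rough sketch of the accounting idea in solution (1), assuming the elapsed sleep time is divided by the hrtimer sampling period (the function and variable names below are illustrative, not from the perf sources):

        /* Illustrative only: estimate how many samples were missed while the
         * target slept, given the period the sampling hrtimer was armed with. */
        #include <stdint.h>

        struct missed_samples {
            uint64_t count;            /* samples to synthesize */
            uint64_t first_time_ns;    /* timestamp of the first missed sample */
        };

        static struct missed_samples account_sleep(uint64_t hrtimer_start_ns,
                                                   uint64_t now_ns,
                                                   uint64_t period_ns)
        {
            struct missed_samples m = { 0, 0 };

            if (now_ns > hrtimer_start_ns && period_ns) {
                m.count = (now_ns - hrtimer_start_ns) / period_ns;
                m.first_time_ns = hrtimer_start_ns + period_ns;
            }
            return m;
        }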

        Speakers: Ajay Kaher, Alexey Makhalov
      • 11:00
        Break
      • 110
        Powering up “discoverable bus-attached” devices on DT-based platforms

        On DT-based platforms, the discoverable buses don't have a chance to power up devices before the discovery takes place. This needs to be improved somehow. There are several ways this can be done, but all of them involve either changes in the bus implementation or at the driver core level, or one-time hacks for each type of device. Maybe it's time the discoverable buses started using the DT description of the device before actually finding it on the bus, thereby setting up the device's resources in time for the discovery.

        Speaker: Abel Vesa (Linaro)
      • 111
        Installing and Using the Linux-Kernel Memory Model (LKMM)

        This session will help people install LKMM on their Linux systems. Time permitting, we will also go through some sample LKMM litmus tests to get a feel for what LKMM can do and to learn how to interpret its output.

        Speaker: Paul McKenney (Facebook)
      • 13:00
        Lunch
      • 112
        kdevops sync up

        This is to open up the floor for kdevops development discussion. Topics for discussion so far:

        • Rationale and current k8 development
        • Replacing vagrant
        • Rants / desires

        More will be added with time.

        Speaker: Luis Chamberlain (Samsung)
      • 113
        Empowering Engagement: Introducing a Dynamic Dashboard for Proactive Retention Strategies

        In this session, we are going to present a dynamic dashboard designed to facilitate proactive strategies against contributor disengagement. By harnessing data mining and statistical modeling, the dashboard analyzes project activity metrics and contribution histories to forecast the probability of disengagement. Additionally, it offers predictions and a comprehensive overview of contributors' participation, segmented by diverse groups such as gender, region, and affiliation. Our session's primary aim is to introduce this dashboard, inviting valuable input and feedback from industry experts to gauge its practicality and effectiveness.

        Speakers: Prof. Anita Sarma (Oregon State University), Dr Bianca Trinkenreich (Oregon State University), Zixuan Feng (Oregon State University)
      • 16:00
        Break
      • 114
        KDLP: Kernel Development Learning Pipeline

        A SPECTRE IS HAUNTING THE LINUX KERNEL…

        …The spectre of time!

        Throughout the industry, we have observed a shortage of qualified entry-level software engineers focused on the low-level niche, and especially within the sub-niche of the Linux kernel. As young, novice software engineers, we noticed this problem, and we quickly became aware that we were far from the only ones to do so. Engineers and managers of all levels within our industry niche have observed – both privately and publicly – that we must bring new talent into these spaces, and we must do so urgently. We have seen firsthand the many talented and experienced engineers in this space, and we have benefited greatly from their guidance and mentorship, but we know that this opportunity will not last forever. Time spares no one, not even the engineers from the days of Digital Equipment Corporation.

        We have come to understand the invaluable and non-quantifiable nature of the tribal knowledge that exists distributed among the great engineers of our industry, and we believe that we must act quickly to enable them to pass on the torch, to strengthen the next link of the great chain of our relatively young but quickly maturing industry.

        With this understanding, we have created the Kernel Development Learning Pipeline, a program to bridge the gaps between academia and industry, between the novice and the legend, between the past and the future.

        Speaker: Joel Savitz (Red Hat)
      • 115
        resctrl filesystem

        While the resctrl filesystem is x86-specific in mainline Linux today, a significant refactoring is underway to extend it to support the platform QoS and monitoring solutions of other architectures, such as MPAM on ARM and CBQRI on RISC-V.

        Speaker: Peter Newman (Google)
    • Build Systems MC "Potomac G" (Omni Richmond Hotel)

      "Potomac G"

      Omni Richmond Hotel

      80

      In the Linux ecosystem there are many ways to build all the software used to put together a running system. Whether it’s building all the binary packages for a binary Linux distribution, using a source-based distribution, or building an embedded system from scratch, there are a lot of shared challenges which each system solves in its own way.

      • 116
        Securing build platforms

        The software chain of trust — that source code, actors, and outputs all meet (often implicit) expectations when placed under scrutiny — is an area of growing concern. Accidental or malicious tampering with the chain of trust can result in security issues, failure to comply with software licences, inexplicable errors and more.

        Linux distributions reduce the number of trust decisions consumers have to make, but how is trust in a distribution and its chain of trust evaluated such that consumers feel confident consuming otherwise opaque blobs?

        npm have adopted SLSA and Sigstore to produce build provenance, which provides a non-falsifiable link from source code to built package. However, npm package builds are much simpler than many distribution builds (even before one considers bootstrapping toolchains). Further, the use of GitHub Actions and Sigstore in npm's architecture may not be technically or socially feasible in many established communities.

        Can we map this technique to distribution build platforms? SUSE and Flatcar Linux have started down this path, but both have unsolved issues around verification which prevent consumer adoption for evaluating trust.

        This topic aims to introduce the motivations for securing build platforms, discuss the approaches language ecosystem registries are taking, and explore how we might adapt and adopt these solutions for Linux distribution build platforms. Examples and demos will be focused on OpenEmbedded/Yocto Project through the submitter's proof-of-concept experiments integrating these concepts into the yocto-autobuilder2 system.

        Speaker: Joshua Lock (Verizon)
      • 117
        Improving UAPI Compatibility Review with Automated Tooling

        Maintaining userspace API (UAPI) compatibility has been a cornerstone of Linux’s success over the years as it has allowed users to confidently upgrade their kernels without worrying about their userspace programs breaking.

        Traditionally, kernel developers have used code review and testing to find UAPI-breaking changes. With libabigail, an additional tool could be added to the kernel tree which would allow developers to analyze a patch for UAPI breakages before the code is even executed. It could be integrated into build pipelines/processes to give developers helpful, immediate feedback and codify the kernel’s UAPI stability policy.

        Beyond UAPI headers, this talk will explore additional boundaries between user and kernel space: sysfs and module parameters. In order to guarantee a stable kernel upgrade, changes to these interfaces must be backwards-compatible as well. However, since they do not offer a clean C API, libabigail cannot help analyze them. We hope to generate a robust discussion on how we could analyze the compatibility of these interfaces going forward.

        Speaker: John Moon (Qualcomm Innovation Center, Inc.)
      • 118
        kernel: build system outputs and workflows (and how to balance them)

        While the kernel is a core output of a build system that targets a
        full platform or system, it also needs to balance many competing
        use cases and outputs.

        Is it providing uapi headers? Is it testing tightly coupled user
        space packages? Is it tracking LTS or -stable releases, or is it also
        testing on the bleeding edge? Is it only used in loadbuild-type
        environments, or is interactive development and debugging part of the
        consideration? Is it unit tested and/or full-system tested? What
        kernel versions are supported? Which (upstream) source? Other
        concerns, such as build overhead and the developer/user/kernel build
        and target system workflow(s), are also significant.

        These are only a few of the questions that surround the kernel as part
        of a build system. The answers to the questions impact how the kernel
        is configured, built, deployed and the outputs consumed.

        If the answer is "all of the above" (which it often is) then build
        system flexibility is important. There are many different ways to
        integrate the kernel into a build. The OpenEmbedded and Yocto Project
        ecosystem has a flexible approach to these workflows and
        questions. This discussion will present the OE core kernel build
        system integration and discuss the associated pain points and
        challenges with the approach, while contrasting with other options
        and approaches where possible.

        Speaker: Bruce Ashfield (AMD)
      • 11:30
        Break
      • 119
        How big of a problem are Un-upstreamed patches?

        In the never-ending quest to run all of the latest versions of all of the software, issues ensue. From CVEs to makefile changes needed to get the thing building in an environment, patches are created and applied. Some of these patches are upstreamed, and others live forever in the murky bowels of the distro's package. The latter's technical debt can cause issues on the next release, etc. But how big of a problem is this? How long do patches live? How many are upstreamed? We'll take a look at some pretty graphs and see.

        Speaker: Jon Mason
      • 120
        Building for Heterogeneous Systems

        It's undeniable that heterogeneous systems are more commonly used in embedded products nowadays; for performance reasons it is better to use different operating systems running on different architectures on a single device.

        However, each OS has its own build-time and runtime dependencies, e.g. a different C library (if one at all), as well as its own developer workflow, pre-configured IDE, and so on. This requires a single build system capable of cross-compiling different OSs targeting different architectures at the same time.

        On Bitbake, this is currently possible by taking advantage of its multiconfig feature along with multiconfig dependencies, but it is still problematic to emulate the expected developer experience for non-Linux systems, as well as to configure multiconfig builds.
        Unifying the developer workflow across OSs and improving the ease of use of features such as multiconfig (or other features used by different build systems) would greatly improve teams' efforts in creating new products.

        Speaker: Alejandro Hernandez Samaniego
    • LPC Refereed Track "James River Salon D" (Omni Richmond Hotel)

      "James River Salon D"

      Omni Richmond Hotel

      183
      • 121
        Standardising Linux DRM drivers implementations by interfacing DRM Bridge as a single API

        Display and graphics drivers in Linux are part of the Linux DRM subsystem and use DRM resources like memory management, interrupt handling, and DMA via Kernel Mode Setting (KMS), which acts as an abstraction layer to provide uniform access to applications.

        Encoders are one of the key KMS components: they take pixel data from a CRTC and convert it to a format suitable for an attached connector. In early Linux (pre-4.0), encoders played a crucial role in connecting display hardware attributes to KMS; however, for new-age display solutions like bridges, converters, and switches, it becomes hard for the encoder to handle these topologies in order to support the various functionalities.

        Linux v4.0 introduced the DRM bridge: a linear chain of objects that is attached to a single encoder at a time and connected, either directly or through a chain of bridges, to the connector of a given KMS pipeline. With bridge chains supporting new-age display solutions, the encoder ends up as a dumb encoder with hardly any operations of its own.

        This progressive change of moving DRM drivers from the encoder to the bridge will standardize on a single API, making it simple and clear to implement DRM drivers for new-age display solutions without touching the existing KMS pipeline.

        This talk explains how DRM drivers are converted from encoder to bridge in order to standardize on a single API, using real solutions submitted to mainline Linux for the Samsung DSIM IP, and concludes by asking whether encoders should be replaced or removed from the DRM stack.

        Speaker: Mr Jagan Teki (U-Boot Allwinner, SPI, SPI Flash Maintainer, Linux DRM Bridge Contributor and Maintainer, AI/Ml-enthusiast)
      • 122
        Enabling Large Block Size devices in Linux

        Increasing block sizes in storage devices will be one of the keys to supporting larger capacities, more density, and more cost-effective SSDs in the future. Although R&D on this topic has been discussed in the Linux community for 16 years, recent advances in Linux are making support for larger block sizes more easily attainable, and we may soon be able to start leveraging it.

        512-byte logical/physical sectors were the de-facto standard in the storage industry for a long time. Considering the rate at which storage density was increasing, the industry started to realize that a 512-byte sector is too small and settled on the 4096-byte (4k) sector as the new standard. Historically, the technical case for supporting large block sizes has included reducing fsck times, reducing IO counts for large data sizes (writing 1 TB requires 256 million 4k IOs today on x86), and cross-architecture compatibility (mounting a filesystem with a block size > 4k on x86); more recently, there has also been interest in leveraging larger atomic writes to reduce database latencies caused by large IO journaling requirements.

        This talk explores the ongoing challenges and effort to support LBS greater than the page size in Linux based on recent advancements with folio adoption in the page cache and with iomap. This talk will also discuss existing known potential advantages of large LBS devices to enable a new generation of higher cost-effective SSDs with larger capacities.

        Speakers: Luis Chamberlain (Samsung), Pankaj Raghav (Samsung)
      • 11:00
        Break
      • 123
        Linux Kernel Autotuning

        The Linux kernel nowadays provides thousands of parameters to users. Tuning them for optimal performance becomes more and more difficult. Most often, different workloads require different tunings for different sets of Linux kernel parameters. In large-scale data centers like ByteDance's, it has become nearly impossible to tune Linux kernel parameters manually for hundreds of different workloads.

        We came up with a solution to automate the entire Linux kernel parameter tuning process with minimal engineering effort. We also noticed that memory management is one of the Linux kernel subsystems with the greatest need for auto-tuning. With machine learning algorithms, such as Bayesian optimization, we believe automated tuning could even beat most Linux kernel engineers. In this talk, we will present how our Linux kernel autotuning solution works and give an overview of its design and architecture. We will also examine some specific cases in Linux kernel memory management to show our results as a proof of concept.

        For future work, we would like to use this opportunity to propose and discuss an in-kernel machine learning framework which could take this project even further to optimize the Linux kernel fast-path entirely in the kernel-space.

        Speaker: Cong Wang
      • 124
        Standardizing CPUID data for the open-source x86 ecosystem

        The x86 architecture is extensive, with many features (and misfeatures) added since its first 32-bit i386 CPU, released 38 years ago.

        Runtime identification of x86 CPU features occurs through the CPUID instruction. Through an input "leaf"/"sub leaf" mechanism, CPUID returns various information scattered through a vast list of output bitfields — now up to 750+ bitfields. The returned feature bitfields can differ depending on the x86 CPU vendor and include flags about CPU vulnerabilities and if any known software mitigations are needed.
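
        As a small illustration of that leaf/sub-leaf mechanism (a hedged sketch using the compiler-provided <cpuid.h> helper, not code from the proposal), querying leaf 7, sub-leaf 0 and testing a single bitfield looks like this:

        /* Query CPUID leaf 7, sub-leaf 0 and test one of its bitfields
         * (SMEP is bit 7 of EBX in that sub-leaf). */
        #include <cpuid.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return 1;                        /* leaf 7 not supported */

            printf("SMEP: %s\n", (ebx & (1u << 7)) ? "yes" : "no");
            return 0;
        }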

        The Linux Kernel has grown its CPUID code "organically." It does provide an x86 CPU features abstraction through the X86_FEATURE_* definitions at cpufeatures.h. Such symbols are incomplete, and their usage is problematic in a future where CPU features are not guaranteed to be the same on each x86 core anymore. Hundreds of kernel call sites (including drivers) invoke the CPUID instruction directly and perform ugly bitwise operations to extract the necessary information.

        The CPUID bitfield data is also redundantly described in multiple projects in the open-source x86 ecosystem: Linux kernel "cpufeatures.h", Linux kernel kcpuid tool, the Xen hypervisor, BSD kernels, the GCC and LLVM compilers, OpenSSL, sandpile.org CPUID database, user-space CPUID utilities (most notably, Todd Allen's CPUID tool), and so on. Such redundancy has led to bitfield interpretation mistakes, and only some components adequately represent the complete CPUID information.

        In cooperation with the Linux Kernel x86 tree co-maintainers, the author would like to present a proposal for standardizing CPUID information. We have collected all the publicly-known CPUID bitfields in an extensible data model filtered by the x86 CPU vendor, along with usage hints for the Linux Kernel and the Xen hypervisor.

        This talk aims to present the problem of x86 CPU feature identification in more detail and show the finished progress regarding automatically generating CPUID data structures for Linux, Xen, and kcpuid. Afterward, the author would like to gather feedback from the kernel developers in attendance regarding any required modifications to the data model design or its associated tooling.

        Speaker: Ahmed S. Darwish (Linutronix GmbH)
      • 13:00
        Lunch
      • 125
        Hunting Heisenbugs

        The term "heisenbug" was inspired by the Heisenberg Uncertainty Principle from quantum physics, which states that it is impossible to exactly quantify a given particle’s position and velocity at any given point in time. Any attempt to more accurately measure that particle's position will result in increased uncertainty of its velocity and vice versa. Similarly, attempts to track down a heisenbug cause its symptoms to radically change or even disappear completely [1].

        If the field of physics inspired the name of this problem, it is only fair that the field of physics should inspire the solution. Fortunately, particle physics is up to the task: Why not create an anti-heisenbug to annihilate the heisenbug? Or, perhaps more accurately, to annihilate the heisen-ness of the heisenbug? Although producing an anti-heisenbug for a given heisenbug is more an art than a science, this talk will cover ways of doing just that.

        [1] The term “heisenbug” is a misnomer, as most heisenbugs are fully explained by the observer effect from classical physics. Nevertheless, the name has stuck.

        Speaker: Paul McKenney (Facebook)
      • 126
        nouveau and kernel GPU VMA management

        Since last year the work on nouveau to support Vulkan and use NVIDIA's firmware has moved forward a lot.

        This talk will update the status of the firmware integration for newer platforms, and look at the state of the NVK driver.

        One large subproject that has been developed as part of this work is a GPU-driver-independent virtual address space manager (GPU VA). Tracking GPU VM allocations had been driver-specific prior to this, but there was no good reason to maintain different semantics for what is mostly an operating system decision. This talk will go into depth on the new GPU VA manager and discuss some of the development roads it took and the paths it might take in the future.

        Speaker: David Airlie
      • 16:00
        Break
      • 127
        Putting Linux into Context – Towards a reproducible example system with Linux, Zephyr & Xen

        Demos on embedded systems using Linux are plentiful, but when it comes to reproducing them, things get complicated. Additionally, on decent embedded systems Linux is only one part of the system and interacts with real-time operating systems and virtualization solutions. This makes reproduction even harder.

        Within the Linux Foundation’s ELISA project, we started to create a reproducible example system consisting of Linux, Xen, and Zephyr on real hardware. This is the next step after we achieved a reproducible system with a pure Linux qemu image.

        The idea is to have documentation and continuous integration, including testing, which can be picked up by developers to derive their own systems and add their own software pieces. In this way they should be able to concentrate on their use case rather than spending effort creating such a system (unless they explicitly want to). We also show how to build everything from scratch. The assumption is that only in this way is it possible to gain the system understanding needed to replace elements for specific use cases.

        We had challenges finding good hardware, tools, freely available GPU drivers and more, and we are still not at the end. A good system SBOM also creates additional challenges, although leveraging the Yocto build system has provided some advantages here.

        While we are setting up the first hardware with documentation from source to build to deployment and testing on embedded hardware, we aim to have at least two options for each major system element, such as the Linux flavor, the choice of virtualization technique, the real-time OS, and the hardware. Only when software elements and hardware can be exchanged do we identify clear interfaces and make the system reproducible and adoptable.

        Open Questions are:

        • What will be a good next hardware to extend this PoC scope?
        • Where do open source, security, safety, and compliance come best together?
        • Which alternative real-time operating systems and virtualization should be incorporated?
        Speaker: Philipp Ahmann (Robert Bosch GmbH)
      • 128
        Dynamic vCPU priority boosting in KVM for latency sensitive workloads

        KVM, a virtualization technology in Linux, delegates memory and virtual CPU execution management of virtual machines to the Linux kernel. This has both advantages and disadvantages. One disadvantage is that it can lead to latency issues in time-sensitive workloads in the VM (such as audio and video). This is because KVM creates one task per vCPU for the VM and then delegates the scheduling of these tasks to the kernel. The kernel does not differentiate between vCPU tasks and other tasks, and it does not have insight into the workloads executed by the vCPUs. As a result, scheduling latencies for the vCPUs in a busy system can translate to apparent latency issues within the VM.
        This talk discusses an effort to minimize the latencies in VMs by having a framework of communication between the host and guest, such that the host can make an educated decision on the priority of vCPU threads. The guest communicates its scheduling needs to the host, and the host can boost the priority of vCPU threads as needed. The host does have some insight about the guest (e.g. interrupt injection) and can boost the vCPU explicitly. This is communicated to the guest, and the guest can determine when to boost/unboost.
        The talk details the design, implementation, and performance of an x86_64 prototype that implements this idea using shared memory and hypercalls. We will also discuss the technical issues and challenges encountered during the design and implementation of the prototype.

        Speakers: Joel Fernandes (Google), Mr Vineeth Remanan Pillai (Google)
    • Linux Kernel Debugging MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83

      When things go wrong, we need to debug the kernel. There are about as many ways to do that as you can imagine: printk, kdb/kgdb over serial, tracing, attaching debuggers to /proc/kcore, and post-mortem debugging using core dumps, just to name a few. Frequently, tools and approaches used by userspace debuggers aren't enough for the requirements of the kernel, so special tools are created to handle them: crash, drgn, makedumpfile, libkdumpfile, and many, many others.

      • 129
        The taming of the kernel dump

        In the past few years, much attention has been paid to various tools that enable live debugging and post-mortem analysis. Some of these tools access the underlying data and metadata through the libkdumpfile library. But not (yet) all.

        This talk is a tour of the kernel dump file format zoo, how these formats can be handled in a unified way, and what needs to be done to make libkdumpfile a viable replacement for other current implementations in makedumpfile and crash.

        Speaker: Petr Tesařík
      • 130
        drgn Writing to Memory and Breakpoints, Safely in Production?

        drgn is currently read-only: it can attach to the running kernel and read memory, but it can't modify memory or modify the flow of execution. These read-write features would clearly be useful for development (for example, in a virtual machine or a lab). If done safely, they could also be useful for modifying the kernel in production. There are many potential mechanisms for implementing this, each with risks.

        I'll share some kernel bugs we hit in production at Meta where we wished drgn could make modifications in order to mitigate a bug until a fix is deployed. Then, let's brainstorm how we can allow this without creating huge footguns and backdoors.

        Speaker: Omar Sandoval
      • 131
        Beyond DWARF: Debugging the Kernel with Drgn, BTF/CTF, and kallsyms

        Kernel debugging takes a variety of forms, but when a "real debugger" is required, you usually need to have debuginfo, and the standard kind of debuginfo is usually DWARF.

        While DWARF is very powerful, it's not always the right choice for every situation. Fortunately, the kernel already contains nearly enough introspection information to power basic debugging operations. Kallsyms can provide symbol table information, while BTF or CTF could provide type information. For stack unwinding, frame pointers and ORC can be used.

        This talk is a status report on the "DWARFless Debugging" project in Drgn, which aims to add support for debugging the kernel without DWARF debuginfo. We'll discuss the status, how it works, and go into some of the challenges we face.

        Speaker: Stephen Brennan (Oracle)
      • 11:00
        Break
      • 132
        When kdump is way too much

        For some lightweight systems, triggering a kdump could be a bit painful - it requires a generous amount of RAM to be pre-reserved, which is not available for regular usage at kernel runtime. Also, the panic kernel boot process takes time and is prone to non-deterministic failures due to HW status or the cause of the panic event. So, although kdump is a pretty standard way of collecting debug information when the kernel panics, it is sometimes not the best fit.

        Alternative ways of kernel debugging that do not rely on kdump include hypervisor debug data collection (as present in qemu, which can in fact collect a vmcore, but not through kdump) or pstore. The goal of this presentation is to talk about the pstore technology, give a brief introduction to the backends and the Steam Deck use case, and, more importantly, to bootstrap some discussions: what data could we collect in pstore that is useful but not currently collected? What improvements could be made in the kdump/debug tooling for distros to support pstore for lightweight data collection on panic? Any other correlated topic or feedback from the audience is very welcome, as it will only make the discussion richer and more useful!

        Speaker: Guilherme Piccoli (Igalia)
      • 133
        Minidump to debug end user device crashes

        Qualcomm devices in engineering mode provide a mechanism for generating full system RAM dumps from the field or test farms for postmortem debugging, even in the case of non-kernel system crashes. But on end-user devices, taking a complete RAM dump at the moment of failure has substantial storage requirements, and such dumps are time-consuming to transfer electronically. So, instead of copying and parsing the complete RAM dump, collecting only the minimum required data from RAM is much more efficient and easier to transfer. The minidump mechanism provides the means for selecting which snippets should be included in the RAM dump. It is built on the premise that the System on Chip (SoC) or subsystems on the SoC crash due to a range of hardware and software bugs. Minidump support for Qualcomm remote processor (MODEM/ADSP) regions is already supported upstream. Now, the effort is to upstream the kernel driver which helps to collect kernel regions as well.

        The intention of this talk is to present a minidump overview and discuss how the solution could be made more generic so that it can fit the needs of other SoC vendors. We are also exploring whether there is any way to extend the existing solution to accommodate the above problem.

        https://lore.kernel.org/all/1687955688-20809-1-git-send-email-quic_mojha@quicinc.com/

        Speakers: Elliot Berman (Qualcomm), Mukesh Ojha
      • 134
        Kernel Livedump

        The Linux kernel currently has a mechanism, built around crashkernel, to create
        a dump of the whole memory for further debugging of an observed issue.
        Unfortunately, this cannot be done without restarting the host, which is a
        problem when a high-availability service is running on a system that is
        experiencing some complex issue which cannot be debugged without a complete
        memory dump, and hypervisor-assisted dumps are not an option on bare-metal
        setups. For this purpose, a live dump mechanism is being developed, initially
        introduced by Yoshida Masanori [1] in 2012. That PoC was already able to create
        a consistent image of memory, with support for dumping the data into a reserved
        raw block device.

        The PoC then remained idle, and as the ever-growing Linux community introduces
        dozens or even hundreds of new features every release, that work became
        obsolete, especially due to MM changes. I've spent time adapting the patchset to
        make it work again on Linux v6.4 and I've added a few more features (such as
        vmcore formatting).

        In order to put the patchset forward for upstream inclusion again, there is a
        lot of research and work ahead because of a few tradeoffs that must be better
        described and understood. Similar to the crashkernel method, which necessitated
        preallocated space specific to each running instance and was resolved through
        approximations, there is also reserved preallocated memory. If this memory is
        not sufficiently large, it may, in certain cases, compromise the consistency of
        the dumped state. To maintain consistency, one option is to wait within the
        page fault in kernel memory, but this approach could potentially introduce
        failures in the original kernel due to synchronization or deadlines in
        different parts of the kernel.

        At LPC I would like to gather as much feedback as possible on my current
        approach with a discussion about other possible usecases.

        [1] https://lore.kernel.org/all/20121011055356.6719.46214.stgit@t3500.sdl.hitachi.co.jp/

        Speaker: Lukáš Hruška
    • 13:00
      Lunch "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83
    • 13:00
      Lunch "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82
    • 13:00
      Lunch "Shenandoah H" (Omni Richmond Hotel)

      "Shenandoah H"

      Omni Richmond Hotel

      60
    • Confidential Computing MC "Potomac G" (Omni Richmond Hotel)

      "Potomac G"

      Omni Richmond Hotel

      80

      The Confidential Computing microconferences in the past years brought together developers working on secure execution features in hypervisors, firmware, the Linux kernel, low-level user space, and container runtimes. A broad range of topics was discussed, ranging from enablement of hardware features up to generic attestation workflows.

      • 135
        Confidential Computing Microconference Introduction

        Short welcome and introduction into the microconference and the topics to be discussed.

        Speakers: Dhaval Giani, Joerg Roedel (SUSE)
      • 136
        COCONUT Secure VM Service Module Discussion

        Discuss the next development steps for the COCONUT-SVSM.

        Speaker: Joerg Roedel (SUSE)
      • 137
        Remote Attestation in AMD SEV-SNP Confidential VMs

        The Trusted Platform Module (TPM) is an industry standard that is widely used as hardware root-of-trust for UEFI measured boot, Integrity Measurement Architecture (IMA) and remote attestation. Although virtual TPMs play the same role for VMs, standard vTPMs cannot be safely used for Confidential VMs since their state would be accessible by the hypervisor, which is considered an untrusted entity in the CVM threat model.

        The Secure VM Service Module (SVSM) is a firmware component that runs in AMD SEV-SNP Confidential VMs to provide an isolated environment that can be used to run privileged modules, such as a vTPM, without interference from the hypervisor and the guest OS.

        In this talk, we will discuss some of the design and implementation challenges we encountered while running a vTPM in the SVSM restricted environment. That includes aspects related to using the vTPM for remote attestation, maintaining and injecting the vTPM state, crypto support, and running the vTPM as a CPL3 module inside the SVSM.

        Speaker: Claudio Carvalho (IBM)
      • 138
        Shrinking The Elephant - A Confidential Computing Attestation Sequel

        At the 2022 confidential computing LPC microconference, we talked about the elephant in the confidential computing room: guest attestation and verification. We showed how opaque, fragmented, and closed this essential piece of the confidential computing puzzle is, adding one more hurdle to this technology's adoption.

        During the past year, the Confidential Containers project worked on putting the elephant on a diet, and here we will first talk about how we are building the first-ever fully open source attestation service. We’d also like to use this microconference as an opportunity to introduce the new and unresolved attestation challenges that we uncovered during that journey, to a wider confidential computing audience.

        We will start by describing the Attestation Service and Key Broker Service (KBS) projects that form the backbone of the open source and vendor agnostic attestation service framework that we built during the last months. We’ll talk about the public facing interface of that framework, the KBS API and protocol, and will show how that simple HTTPS based interface supports different attestation models without being bound to vendor specific formats or data. We’ll also mention how the verification backend of the Attestation Service already supports all major CoCo vendor attestation format evidences by abstracting attestation verification operations through a simple plugin interface.

        The next part of the presentation will go through the known and remaining issues we’ll have to address in the short term: Converging the attestation results format and plugging the attestation service with the rest of the software supply chain for consuming trustable attestation reference values and policies, for example. As with any technological exploration, one challenge only leads to the next one and the last section of this presentation will focus on introducing the new, longer term problems that we are facing. Integrating the asynchronous, SoC vendor-independent, trusted device attestation flow with the guest one might be the biggest one.

        Speaker: Samuel Ortiz
      • 139
        How to Build a Confidential Attestation Client

        When designing an attestation framework, implementing a client which runs inside a confidential guest might seem like the simplest part, but this session will introduce several subtle factors that can undermine security and usability if not addressed. We will discuss how these issues might apply to different confidential projects and how they can be resolved. We will include some provocative examples and interesting proposals. For example, the session will introduce evidence factory attacks, which can compromise not just one enclave, but an entire service or deployment. We will show how severe these attacks can be and how they can be prevented. We will look at how to design an attestation client that supports separation of privileges within one guest. We will discuss best practices for populating the guest data in an attestation report and for providing extra information to a relying party. We will also consider challenges in orchestration including how to provide connectivity to attestation clients running in minimal environments. Even with a standardized attestation flow, a thoughtful guest implementation is essential to building a secure, performant, generic, and easy-to-use system. There are many open questions in this space that will be discussed as a group.

        Speaker: Tobin Feldman-Fitzthum (IBM)
      • 140
        Supporting Live Migration of Confidential VMs in KVM

        Confidential VM live migration involves migrating the running secure VM on the same host or to another host. Vendors are designing solutions to achieve this based on underlying Coco technology. AMD SEV-SNP plans to achieve this with the co-operation of an SVSM (Secure VM Service Module), or similar service, running in guest context. Intel plans to achieve this with a migration TD VM.
        The goal of this talk is to discuss the design details of the SEV-SNP live migration solution and how a common API to achieve this can be created for use across all vendors.

        Speakers: Pankaj Gupta, Thomas Lendacky
      • 16:00
        Break
      • 141
        Secure I/O

        This is the umbrella topic for secure I/O discussions. The discussions will overflow into BoF sessions.

        Speakers: Dan Williams (Intel Open Source Technology Center), Jeremy Powell, Samuel Ortiz, Steffen Eiden (IBM Germany), Thomas Lendacky
      • 142
        Taming the Incoherent Cache Issue in Confidential VMs

        It is well known that in AMD CPUs prior to Milan, cache lines within a confidential VM are incoherent with those outside of confidential VMs. SME_COHERENT is a feature introduced by AMD 3rd gen EPYC to improve cache coherency in their confidential computing environment. However, as testing demonstrates, SME_COHERENT does not support cache coherence between CPU and devices. This means that guest pages which were previously used for DMA may still contain dirty caches incoherent with cache lines generated by the CPUs. Since KVM does not track the page provenance, skipping the cache flush may lead to memory corruptions at host level when the guest pages are freed.

        This problem is even worse when a malicious or subverted host userspace leverages the existing KVM API to fill confidential pages with dirty caches and free them without flushing. The security problem was recorded as CVE-2022-0171 [1]. Because of the limitation of SME_COHERENT, this vulnerability was not limited to 2nd gen EPYC but also affects 3rd gen EPYC and later. Upstream Linux solves this issue by flushing the cache unconditionally when a guest page mapping is removed from a VM’s NPT.

        Flushing the cache lines is a principled approach for a confidential VM, since confidential guest pages are pinned and thus cannot be moved by the host OS, i.e., the guest pages do not leave the VM until the VM dies. However, due to limitations of the MMU API, there is no API telling KVM that a guest page has been deallocated. The most relevant APIs are the mmu_notifiers, which are invoked when guest page mappings are removed during the VM's lifetime. In fact, guest mappings can be removed for various legitimate reasons: changing page granularity, e.g., from 2M/1G to 4K and vice versa; or page migration due to NUMA balancing, defragmentation, and others like KSM.

        The host capability to change the mapping of the guest VM creates performance problems with the existing upstream solution for CVE-2022-0171, as KVM may unnecessarily flush the cache lines in some of the scenarios mentioned above. This is reported at [2][3] when running with SEV-SNP VMs.

        In this talk, we will discuss prospective solutions to this problem, such as how to flush cache lines cleanly without introducing performance bottlenecks. One solution would be to use the VMPAGE_FLUSH MSR instead of wbinvd, the latter of which requires a machine-wide cache flush. This should improve performance in general, at the cost of slowing down the shutdown of individual VMs. On the other hand, it might be feasible to leverage the “reason” parameter of the mmu_notifiers to conditionally flush the cache. We will discuss further details during the talk.

        References
        [1] https://bugzilla.redhat.com/show_bug.cgi?id=2038940
        [2] https://lore.kernel.org/all/876a7707-c9b9-0985-af00-c7fc461ada02@windriver.com/
        [3] https://lore.kernel.org/kvm/YzJFvWPb1syXcVQm@google.com/T/#mb79712b3d141cabb166b504984f6058b01e30c63

        Speakers: Mingwei Zhang (Google), Jacky Li (Google), Sean Christopherson (Google)
      • 143
        Towards unified confidential computing ABIs

        The configfs-tsm proposal arose from the observation that several platform vendors are all building similar confidential-computing functionality into their products. It asserts that the kernel has a role to play and a vested interest in aligning stakeholders behind a common ABI. Going forward, attestation reports are just one example of shared interfaces that the community can build to lower, or better distribute, the long-term maintenance burden of confidential computing for the kernel. Another example area of collaboration is userspace ABIs for QEMU to use for managing secure device assignment to confidential VMs. Let's have an open discussion on the assertions made in the configfs-tsm proposal and their future implications.

        Speaker: Dan Williams (Intel Open Source Technology Center)
      • 144
        Update on RISC-V Confidential VM Extension (CoVE)

        This session aims to cover the ongoing development of the RISC-V architecture for Confidential VM Environment (CoVE) using ratified RISC-V privileged ISA extensions and proposed new ISA extensions (SmMTT). This session will describe the ongoing specifications for proposed ABI, ISA and SoC requirements that enable Confidential Computing on RISC-V-based platforms. The common/abstract aspects that are cross-architectural for Linux/KVM will be discussed to enable inter-operability across different RISC-V as well as non-RISC-V platforms. The discussion will cover the RFC patches for the ABI, which are in development and targeted for public review by Q4'23.

        Speakers: ATISH PATRA (Rivos), RAVI SAHITA (Rivos)
      • 145
        Secure TSC for AMD SEV-SNP guests

        TSC value calculations for guests are controlled by the hypervisor. A malicious hypervisor can prevent guests from moving forward. The Secure TSC feature for SEV-SNP allows guests to securely use RDTSC and RDTSCP instructions. This ensures the guest gets a consistent view of time and can prevent a malicious hypervisor from making it appear that time rolls backwards, increments at a ridiculously fast rate, or similar tricks. In this talk we will discuss the Secure TSC changes needed on both the hypervisor and guest sides, as well as the current upstreaming status.

        Speaker: Nikunj Dadhania
      • 146
        Secure AVIC: Securing Interrupt Injection from a 'malicious' Hypervisor

        SEV-SNP is a security feature that protects the confidentiality and integrity of VM memory from 'malicious' hypervisors or other VMs. Secure AVIC is a new HW feature added to SEV-SNP to prevent a 'malicious' hypervisor from generating unexpected interrupts to a vCPU or otherwise violate architectural assumptions around APIC behavior.

        One of the significant differences from AVIC or emulated x2APIC, is that Secure AVIC uses a Guest Owned and Managed APIC Backing Page. It also introduces additional fields in both the VMCB and the Secure AVIC Backing Page to aid the guest in preventing (or moderating) interrupts to be injected by a 'malicious' hypervisor.

        This proposal provides an overview of the hardware changes introduced by Secure AVIC, as well as the software design changes required on both the hypervisor and guest sides.

        Speakers: Kishon Vijay Abraham I, Suravee Suthikulpanit
    • Power Management and Thermal Control MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83
      • 147
        Sensors aggregation

        On one side a thermal sensor is associated with one thermal zone. On the other side a governor manages one thermal zone. The governor acts on a cooling device.

        Some configurations have multiple sensors for the same performance domain. For instance, one sensor per core and a DVFS per group of cores.

        When the energy model is used along with the power allocator, having a shared cooling device for all the thermal zones makes it very difficult for the PID loop to work correctly, as the governors can take opposing actions.

        The proposal is to aggregate the sensors belonging to the same performance domain into a single thermal zone. Consequently, one governor can operate on the temperature returned by an aggregator function (e.g. max or avg).

        As a side note, the device tree bindings already describe such an aggregation, but it seems it was never implemented.
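
        A hedged sketch of the kind of aggregator function the proposal describes is shown below; the names and signature are hypothetical, not an existing kernel API:

        /* Hypothetical aggregation callback for a virtual thermal zone built from
         * several sensors of the same performance domain; count must be > 0. */
        enum agg_mode { AGG_MAX, AGG_AVG };

        static int aggregate_temp(const int *temps, int count, enum agg_mode mode)
        {
            long sum = 0;
            int i, max = temps[0];

            for (i = 0; i < count; i++) {
                if (temps[i] > max)
                    max = temps[i];
                sum += temps[i];
            }
            return mode == AGG_MAX ? max : (int)(sum / count);
        }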

        Speakers: Alexandre Bailon, Daniel Lezcano (Linaro)
      • 148
        New thermal trip types

        Currently we have 'hot', 'passive', 'active', and 'critical' trip points. Userspace also needs to deal with thermal, but in a different manner. Currently, it has to poll the thermal zone to get the temperature, thus introducing more wakeups on the system. However, the sensors now have a programmable register to trigger an interrupt, and userspace could benefit from it. By adding one or several writable 'custom' trip points, userspace can program a point at which to be notified and take an action. This way we separate the actions taken by the kernel to protect the silicon from the ones taken by userspace to manage the overall thermal behavior of the system at lower temperatures.

        Speaker: Daniel Lezcano (Linaro)
      • 15:30
        First Break
      • 150
        CPUfreq/sched and VM guest workload problems

        This is a continuation of the similarly titled talk from last year and will focus on the remaining open items.
        https://lpc.events/event/16/contributions/1195/

        Running a workload on a VM results in very disparate CPUfreq/sched behavior compared to running the same workload on the host. This difference in CPUfreq/sched behavior can cause significant power/performance regression (on top of virtualization overhead) for a workload when it is run on a VM instead of the host.

        We've already sent out v3 of the patch series we want to land, but that design still leaves a lot of performance/power on the table. This talk will discuss the remaining issues and proposals for how we can solve them.

        Speakers: David Dai, Saravana Kannan
      • 151
        VM-CPUFreq for x86: Scaling the guest frequency for performance and power savings

        With the increasing size of virtual machines (VMs) and the growing deployment of workloads within VMs, power management has become a crucial aspect to optimize performance and energy efficiency. However, the absence of frequency scaling within VMs and the lack of workload utilization visibility for the hypervisor can result in suboptimal performance and power management. To address this issue, Google proposed patches to improve dynamic voltage and frequency scaling (DVFS) for VMs on ARM systems [1]. By providing the hypervisor with VMs' vCPU utilization data, it can make informed decisions regarding frequency scaling, leading to improved performance and power efficiency.

        In this talk, we discuss what is needed to make this work on x86 servers with AMD EPYC as an example.

        We discuss

        • The metrics that need to be communicated from the guest to the hypervisor and vice-versa.

        • The potential interfaces through which such guest-host communication can be achieved with low-overhead.

        • Experimental results across VMs of different sizes running different workloads, in order to demonstrate the effectiveness of the vCPU utilization information being communicated to the hypervisor and the impact that it has on the performance and power consumption of the
          workloads.

        [1]: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior: https://lore.kernel.org/lkml/20230330224348.1006691-1-davidai@google.com/

        Speaker: Mr Wyes Karny (AMD)
      • 152
        Virtualized Frequency Control for Telco Workloads

        Packet processing workloads and software applications that are polling in nature bypass the typical kernel power governors. Because these workloads poll, they appear 100% utilized, which results in zero power savings. This solution implements a user space power governor that has visibility into the "real" workload utilization and triggers power savings. The presentation will review key components required, such as the virtio-serial and intel_pstate drivers.
        We know there are other efforts to virtualize frequency control; the goal is to harmonize all of these efforts into one solution.

        Speakers: Mr Chris Macnamara, Srinivas Pandruvada
      • 153
        uclamp in CFS: Fairness, latency, and energy efficiency

        Utilisation clamping enables user space to set a performance/efficiency preference per task and per task group. It is used on Android systems today, but alignment with upstream Linux is lagging behind.
        This is a discussion topic with the aim of aligning on the expectations, use-cases, and possible implementation of uclamp.
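
        As background for the discussion, a minimal sketch of the existing per-task interface: sched_setattr() with the utilization-clamp fields (available since Linux 5.3). The struct is defined locally because glibc does not ship it, and the clamp values below are arbitrary examples:

        #define _GNU_SOURCE
        #include <stdint.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        struct sched_attr {
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime;
                uint64_t sched_deadline;
                uint64_t sched_period;
                uint32_t sched_util_min;
                uint32_t sched_util_max;
        };

        #define SCHED_FLAG_KEEP_ALL       0x18   /* KEEP_POLICY | KEEP_PARAMS */
        #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
        #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

        int main(void)
        {
                struct sched_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size = sizeof(attr);
                attr.sched_flags = SCHED_FLAG_KEEP_ALL |
                                   SCHED_FLAG_UTIL_CLAMP_MIN |
                                   SCHED_FLAG_UTIL_CLAMP_MAX;
                attr.sched_util_min = 128;   /* example: at least ~12% of max capacity */
                attr.sched_util_max = 512;   /* example: cap at ~50% of max capacity */

                /* pid 0 means the calling task */
                return syscall(SYS_sched_setattr, 0, &attr, 0);
        }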

        Speakers: Dietmar Eggemann, Morten Rasmussen (Arm)
      • 16:45
        Second Break
      • 154
        Make sync_state()/handoff work for the common clk framework

        There have been multiple[1] attempts[2] to implement sync_state() or handoff[3] for the common clk framework so that clks which are left enabled while booting Linux aren't disabled during lateinit when unused clks are disabled. The most common reason to do this is to display a boot splash screen until the display driver can probe and enable the clks itself. All attempts to implement a solution to this problem have failed. The most recent attempt[1] didn't properly handle edge cases such as parent-child dependencies between clk controller devices and handing off the enable state from boot. We'll discuss the problem, go over the attempts to solve this, and also investigate what other subsystems (regulator, interconnect) and firmwares (ACPI) have done to solve the problem.

        [1] https://lore.kernel.org/r/20221227204528.1899863-1-abel.vesa@linaro.org
        [2] https://lore.kernel.org/r/20210407034456.516204-1-saravanak@google.com
        [3] https://lore.kernel.org/r/1455225554-13267-1-git-send-email-mturquette@baylibre.com

        Speaker: Stephen Boyd (Google)
      • 155
        Intel Low Power Mode Daemon on Hybrid CPUs

        Hardware vendors increasingly employ hybrid-CPU processors. Hybrid-CPU processors mix multiple CPU types within the same processor, to satisfy previously conflicting tradeoffs in power, performance, and cost.

        Some hybrid-CPU processors benefit from a “low-power mode”, a non-linear power saving that results from restricting all work to the “low-power CPUs” while no work runs on the “high-performance CPUs”. The power benefit of low-power mode depends on how long it can be sustained, while the performance impact depends on how quickly low-power mode can exit.

        Introducing intel_lpmd (Intel Low-Power Mode Daemon), which monitors system activity, hardware hints, and user preferences to decide when to enter and exit low-power mode. intel_lpmd is successful when it recognizes near-idle workloads that can fit on the low-power CPUs. Conversely, intel_lpmd must recognize when to exit low-power mode to avoid impacting performance.

        intel_lpmd’s power goal is to get low-power mode savings to approach that of using CPU hotplug offline. intel_lpmd works best with the cgroup v2 CPU controller and its “isolated” partition feature. However, there are multiple sources of noise on the quiesced CPUs that we must handle, including RCU, idle load balancing, and recurring timers – all of which can defeat the power saving impact of low-power mode.

        The goal of this presentation is to describe challenges, workable solutions and get community feedback on user space intelligence.

        Note that this topic has been submitted to the main track already, but in case it is not accepted there, I'd still like to present it in the Power Management and Thermal Control MC.

        Speakers: Rui Zhang, Srinivas Pandruvada
      • 156
        Enabling DDR segments on demand during memory pressure for DDR power reduction

        In Android systems, there is potential for significant power consumption because entire DDR segments remain enabled even during periods of light system usage, when only a fraction of the memory is actively used, thereby consuming power unnecessarily.

        We can optimize DDR power consumption by selectively disabling certain segments during light system usage and enabling them when additional memory is required. By detecting memory events that exceed specific usage thresholds, one or more DDR segments are enabled successively to meet the demand, ensuring that unused segments remain collapsed or disabled during light usage. This reduces power by turning off the self/auto-refresh of these DDR segments.

        Memory pressure detection is done by a userspace daemon that registers with the Linux kernel’s PSI (Pressure Stall Information) mechanism and communicates with a kernel driver that uses memory hotplug to dynamically add/remove DDR segments. PASR (Partial Array Self Refresh) technology is used to control the refresh of these DDR segments. Importantly, the entire process of detecting memory pressure and enabling/disabling DDR segments incurs negligible latency. This helps increase the overall DoU for a given battery without any performance impact.
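
        For illustration, a minimal sketch of the PSI trigger registration such a daemon could use, following Documentation/accounting/psi.rst. The 150 ms / 1 s threshold and the hand-off to the hotplug driver are placeholders:

        #include <fcntl.h>
        #include <poll.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                /* Wake up when tasks stall on memory > 150 ms within any 1 s window. */
                const char trig[] = "some 150000 1000000";
                struct pollfd pfd;
                int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);

                if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0)
                        return 1;

                pfd.fd = fd;
                pfd.events = POLLPRI;
                while (poll(&pfd, 1, -1) > 0) {
                        printf("memory pressure threshold crossed\n");
                        /* here the daemon would ask the driver to hot-add a DDR segment */
                }
                return 0;
        }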

        Speaker: Sudarshan Rajagopalan (Qualcomm)
      • 157
        Improving monitoring of power saving states

        ChromeOS is the second most-popular Linux distribution, designed to be long-lasting on battery and snappy, even on low-cost devices. This is achieved by leveraging various power-saving policies on multiple platforms, including x86 (Intel, AMD) and ARM. Starting at the point of device development, we monitor and verify CPU idle, module (package/device) idle, and system deep-sleep power-state residencies and behaviors. This data allows us to test and tune our system policies (applied by our userspace tools) for optimizing power and performance. In addition, we monitor this data through lab and user telemetry to ensure that existing policies continue to be effective.

        Retrieving this data from the Linux kernel has its challenges: standardized interfaces such as cpuidle are often inaccurate due to insufficient hardware visibility and may suffer from a lack of atomicity (when reading residencies in bulk). Reading module idle and system sleep-state counters is not standardized and happens through a combination of debugfs and raw registers, which requires complex userspace code paths depending on SoC vendor, architecture, generation, and even kernel version. Some data is duplicated, and some crucial information is missing, such as the deepest sleep state reached and the time spent there.

        In this talk we’d like to discuss the future direction of these kernel interfaces. How can we improve them to meet the needs of modern OS health monitoring?
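
        As a concrete example of the per-file reads mentioned above, a minimal sketch that walks CPU0's cpuidle state residencies through sysfs; each state is a separate read, which is why bulk reads are not atomic:

        #include <stdio.h>

        int main(void)
        {
                char path[128];
                int s;

                for (s = 0; ; s++) {
                        unsigned long long t;
                        FILE *f;

                        snprintf(path, sizeof(path),
                                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", s);
                        f = fopen(path, "r");
                        if (!f)
                                break;
                        if (fscanf(f, "%llu", &t) == 1)
                                printf("state%d residency: %llu us\n", s, t);
                        fclose(f);
                }
                return 0;
        }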

        Speakers: Mr Stanislaw Kardach (Google), Sven van Ashbrook (Google)
    • Tracing MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82

      The Linux kernel has grown in complexity over the years. Complete understanding of how it works via code inspection has become virtually impossible. Today, tracing is used to follow the kernel as it performs its complex tasks. Tracing is used today for much more than simply debugging. Its framework has become the way for other parts of the Linux kernel to enhance and even make possible new features. Live kernel patching is based on the infrastructure of function tracing, as well as BPF. It is now even possible to model the behavior and correctness of the system via runtime verification which attaches to trace points. There is still much more that is happening in this space, and this microconference will be the forum to explore current and new ideas.

      • 158
        libside: Giving the preprocessor a break with a tracer-agnostic instrumentation API

        Discuss the new libside [1] API/ABI, which allows a kernel tracer and many user-space tracers to attach to static and dynamic instrumentation of user-space applications.

        This user-space library exposes a type system and a set of macros to help applications declare their instrumentation and insert instrumentation calls. It exposes APIs to kernel and user-space tracers so they can list and connect to the instrumentation, and conditionally enables instrumentation when at least one tracer is using it.

        This is relevant in the context of User Events which was recently merged into the Linux kernel.

        [1] https://github.com/efficios/libside

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 159
        Graphing tools for scheduler tracing

        Trace-cmd is an invaluable tool for understanding application behavior and debugging performance problems at the kernel level. However, the amount of information generated can be overwhelming, especially for large multicore machines. In this talk, we present some graphing tools that we have developed for summarizing and visualizing the information collected using trace-cmd in various ways. We find, though, that different performance problems require developing different visualizations. We would thus like to discuss 1) what kinds of visualizations can be useful, and 2) how to describe these visualizations, so that new visualizations can be created directly by the end user.

        Speaker: Julia Lawall (Inria)
      • 160
        Function return hook integration with Function graph tracer

        Currently, there are two different function return hook mechanisms: kretprobe (rethook) and function graph tracer. Both have similar but different ways to hook the function return. They both modify the return address on the stack (or link-register) with their trampoline code and save the correct return address to their own list for each task. The difference is how they allocate the per-task list. Kretprobe allocates a linked list of storage objects when it is initialized. Users can specify the maximum number of concurrently used objects at that point. However, the object list is not shared among kretprobes. On the other hand, function graph tracer allocates a shadow stack array for each task when it is enabled. This is simpler but consumes more memory at that point. However, this shadow array will be shared among all functions, so each function takes up very little memory.
        The problem is that if both mechanisms are used at the same time, they both allocate such memory independently, and the function return is hooked twice or more. This is inefficient. To avoid this, we can integrate them; alternatively, it might be better to remove kretprobes and use fprobe's exit handler to simplify the solution.
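
        For reference, a minimal sketch of the kretprobe side of this, where each kretprobe carries its own maxactive instance pool rather than sharing a per-task shadow stack. The probed symbol is just an example:

        #include <linux/kprobes.h>
        #include <linux/module.h>

        static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
        {
                pr_info("returned %lx\n", regs_return_value(regs));
                return 0;
        }

        static struct kretprobe my_kretprobe = {
                .handler        = ret_handler,
                .kp.symbol_name = "kernel_clone",   /* example target */
                .maxactive      = 20,               /* per-kretprobe pool, not shared */
        };

        static int __init kret_init(void)
        {
                return register_kretprobe(&my_kretprobe);
        }

        static void __exit kret_exit(void)
        {
                unregister_kretprobe(&my_kretprobe);
        }

        module_init(kret_init);
        module_exit(kret_exit);
        MODULE_LICENSE("GPL");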

        In this session, we will talk about the background, how to do this, and my expectations. BPF and other tools may need some additional work to accommodate this change.

        Speaker: Mr Masami Hiramatsu (Google)
      • 161
        pt_regs - the good, the bad and the ugly

        We'll first go through the history of "struct pt_regs" uses and abuses. Between ptrace, kprobe, ftrace, fprobe, kretprobe, rethook, perf, ebpf... There will be some ground to cover. This will be an opportunity to give an overview of all the tracing subsystems and how they build on one another by exchanging pt_regs structures.

        We'll then spend time discussing "sparse pt_regs". These are structures created outside an exception entry and containing a light subset of registers. We'll discuss how they can propagate from one subsystem to the other and lead to subtle issues.

        Speaker: Florent Revest (Google LLC)
      • 162
        RTLA: Requests and TODOs

        The RTLA suite of tools is part of the tracing tools and, as such, depends on the kernel tracing infrastructure, from kernel tracers to libraries. This topic will discuss some limitations in these dependencies which, if addressed, would allow rtla to perform better and add new functionality. Among the requests, we can mention:

        • On the kernel side:
        • The ability to do histograms with ftrace: events (tracers output)
        • Be able to enable two tracers in a single instance
        • On the library side:
        • Be able to record trace-cmd.dat file
        • Parse per_cpu/cpu/trace_pipe
        • On the tracer side:
        • Osnoise user-space support
        • Execution time tracer
        • Integration of the tracers with perf counters

        These points are better discussed at an MC because they touch the tracing stack.

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 163
        Function parameters with BTF

        Recently, the function graph tracer gained a new option to show the return value of a function. Of course, this can produce problems with functions that return void, where the reported return value is garbage. This could be fixed by using BTF.

        Better yet, both function graph and the function tracer could possibly use BTF to show all arguments for all functions! But this would require a way to quickly look up the function's BTF values and a way to translate the BTF values into something that can be generically printed to the trace file.
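
        A small sketch of how the existing return-value option might be exercised from userspace via tracefs, assuming the option is exposed as options/funcgraph-retval (as in recent kernels with CONFIG_FUNCTION_GRAPH_RETVAL):

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        static int echo(const char *path, const char *val)
        {
                int fd = open(path, O_WRONLY);
                ssize_t n;

                if (fd < 0)
                        return -1;
                n = write(fd, val, strlen(val));
                close(fd);
                return n < 0 ? -1 : 0;
        }

        int main(void)
        {
                if (echo("/sys/kernel/tracing/current_tracer", "function_graph"))
                        return 1;
                /* assumed option name; requires CONFIG_FUNCTION_GRAPH_RETVAL */
                return echo("/sys/kernel/tracing/options/funcgraph-retval", "1") ? 1 : 0;
        }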

        Speakers: Mr Masami Hiramatsu (Google), Steven Rostedt
      • 16:30
        Break
      • 164
        Performance Monitor Control Unit

        Performance Monitor Control Unit (PMCU) is a device that offloads PMU accesses from CPUs, handling the configuration, event switching, and counter reading of core PMUs on the Kunpeng SoC. It facilitates fine-grained, multi-PMU-event CPU profiling while reducing the software overhead of accessing PMUs and its impact on target workloads. In the current PMU counting scheme, the target CPUs have to handle events locally, affecting their own workload execution; the PMCU, instead, accesses PMUs through external memory-mapped interfaces, providing non-intrusive CPU monitoring. The PMCU's software stack is currently implemented with the 'perf_event' auxtrace framework. Its patchset contains the documentation, driver, and user perf tool support.

        Implementation-wise, we wonder how to keep the PMCU synchronized with the CPUs' internal accesses. PMUs can be accessed from a CPU and from the PMCU simultaneously, and the current ARM PMU standard does not appear to have a mechanism that synchronizes internal and external accesses. Hence, running arm_pmu and PMCU events at the same time may mess up the operation of the PMUs, delivering incorrect data for both events, e.g. unexpected events or sample periods. We probably need a software solution for such a case, where two drivers access the same hardware.

        Besides the above problem, we are looking forward to general feedback on the PMCU from the kernel community, in terms of use cases, interfaces, implementation, etc.

        Reference: https://lwn.net/Articles/922351/

        Speaker: Jie Zhan (Huawei Technologies Co., Ltd.)
      • 165
        DTrace and eBPF: new challenges

        The presentation will give a brief technical status update on the project and then delve into the pitfalls we have encountered in the past months. Creative solutions for some problems will be presented, but the bulk of the presentation will focus on currently unresolved pitfalls. The goal is to spark creative discussion towards resolving some of the issues we have encountered and to work towards proposing improvements to eBPF and other kernel tracing features, to the benefit of tracing in Linux in general.

        Speaker: Kris Van Hees (Oracle USA)
      • 166
        Implementing sframes

        sframes have been added to binutils, which can allow the kernel to do a user-space stack trace without the need for frame pointers. But there are several issues to overcome.

        • Requires reading user space that can fault (cannot be done in interrupt context)
        • Perf and ftrace will need to postpone when the stack trace is done
        • The kernel can load the sframe information from the ELF file, but what should be done about dynamically linked objects? (A new system call?)
        • The same issue exists with JIT code. We need a way to tell the kernel how to do the stack trace dynamically.
        Speakers: Indu Bhagat, Steven Rostedt
    • Birds of a Feather (BoF) "Potomac G" (Omni Richmond Hotel)

      "Potomac G"

      Omni Richmond Hotel

      80
      • 167
        How to make syzbot reports easier to debug?

        Syzbot is a continuous kernel fuzzing system that publishes hundreds of auto-generated bug reports each year.

        If you have ever tried to address a syzbot report, what did you miss most? Which pieces of the information we share only create noise? And, on the other hand, what automatically extractable data could have helped you spend less time debugging?

        Please feel free to share your experience and suggestions during this BoF session. This information will help us adjust syzbot to the needs of the community.

        Speaker: Aleksandr Nogikh (Google)
      • 11:00
        Break
      • 168
        UEFI Setvariable at runtime -- Problems, status and solutions

        The UEFI spec mandates that UEFI variables related to the UEFI keyring must be stored in non-volatile storage that is tamper- and delete-resistant. On embedded platforms with an RPMB available, this has been supported at boot time in U-Boot since ~2020. With SystemReady-IR being adopted by various hardware vendors, SetVariable at runtime is becoming a necessity for distro installers and firmware updates.
        Due to the complexity of the solution, supporting it and adhering to the UEFI spec is very difficult on certain platforms.

        There is a patchset under review https://lore.kernel.org/linux-efi/20231013074540.8980-2-masahisa.kojima@linaro.org/ which enables SetVariable at runtime for such platforms. This introduces a few dependencies on the kernel and violates the EFI spec.
        We will discuss the implementation, implications, current status, and ideas for lifting the kernel dependencies.

        Speaker: Mr Ilias Apalodimas
      • 169
        Reporting and tracking regressions across the ecosystem

        We want to gather some of the key people involved in Linux kernel testing (kernel maintainers, test developers, CI system developers, etc.) to share ideas and opinions on regression analysis, and on reporting and tracking in particular. Regressions can be found through many different means: manual testing, different tools, CI systems, production deployments, and so on. Understanding how to consolidate the data and the reporting and tracking process could be very valuable for the community.

        There are many different efforts in the community that are addressing the problem of regression reporting on their own. We think that a more systematic approach based on experimentation and on getting feedback from the developers and maintainers who frequently work on fixing regressions would help us highlight the common needs and missing functionalities faster, and that collaboration between the authors of different tools will get us closer to a standard and common solution that the community can use to query, triage, report and track regressions more easily.

        Speakers: Gustavo Padovan (Collabora), Mr Ricardo Cañuelo (Collabora), Thorsten Leemhuis
      • 13:00
        Lunch
      • 170
        Embedded Linux BOF

        This session will be a discussion of the current status of technology areas related to embedded Linux. Specific technology areas that are candidates for discussion are system size, boot time, power management, realtime performance, embedded-related filesystems, memory technology devices, embedded busses (i2c, spi, etc.), build systems, and architecture support. Other topics may be discussed, as driven by the interest and concern of the attendees. The goal of the session is to identify current areas of work, and coordinate plans for collaboration, by various stakeholders in the embedded Linux ecosystem.

        Speaker: Tim Bird (Sony)
      • 171
        PCI device authentication & encryption

        At LPC 2022 we had a fruitful BoF session to align on an architecture for PCI device authentication (CMA-SPDM, PCIe r6.1 sec 6.31).

        The BoF allowed community members' concerns to be addressed. Rough consensus on a path forwards was established, with device authentication to be performed by the PCI core before a driver is probed.

        At the time, a proof-of-concept implementation of in-kernel CMA-SPDM had been submitted as an RFC.

        That implementation has since been refined and extended based on what we discussed at LPC 2022, and it was submitted as a non-RFC patch set before LPC 2023.

        We would like to reconvene for a face-to-face discussion on these authentication patches and the next steps to bring up measurement retrieval, certificate provisioning and encryption on top of them (IDE, PCIe r6.1 sec 6.33).

        Another topic worth discussing is how this in-kernel implementation can be made to work with vendor-defined firmware implementations (such as Intel TDX Connect, AMD SEV-TIO, ARM CCA).

        The audience of this BoF includes PCI, CXL and virtualization developers interested in device security.

        Speakers: Jonathan Cameron (Huawei Technologies R&D (UK)), Lukas Wunner
      • 16:00
        Break
      • 172
        Secure I/O BoF

        Continuation of the secure I/O session at CoCo microconference.

        We plan to move the bigger discussions out from the microconference into the BoF.

        Speakers: Dhaval Giani (AMD), Joerg Roedel (SUSE)
    • Birds of a Feather (BoF) "Potomac E" (Omni Richmond Hotel)

      "Potomac E"

      Omni Richmond Hotel

      72
      • 173
        Android BoF

        Discussion space following the Android Microconference

        This BoF will provide additional discussion space, for folks to digest and then talk about topics covered at the Android Microconference, as well as - time allowing - any topics missed in the Microconference.

        Speakers: Amit Pundir, John Stultz (Google), Karim Yaghmour (Opersys inc.), Sumit Semwal (Linaro)
      • 11:00
        Break
      • 174
        Multiple system-wide low power-states

        Driven mostly by unofficial requests/discussions from the ARM community, there is a need to support multiple system-wide low power-states on certain platforms. In this regard, we also need a way to select the proper low power-state for the currently running use-case.

        In principle, it looks like we need a mixture of information to be able to select the low power-state: constraints based upon system-wakeup configurations, wakeup latency constraints specified for certain use-cases, and generic functional requirements for some specific devices.

        The intent is to present a couple of these use-cases to raise a wider discussion for a potential generic solution to these problems.

        Speaker: Ulf Hansson (Linaro)
      • 175
        Secure VM Service Module for Confidential Computing

        Continue the bigger SVSM discussions from the Confidential Computing microconference in a BoF setting.

        Speakers: Dhaval Giani (AMD), Joerg Roedel (SUSE)
      • 13:00
        Lunch
      • 176
        XFS BoF

        Let's get together and review XFS development: future things / issues / rants.

      • 16:00
        Break
      • 177
        DAMON Beer/Coffee/Tea Chat

        DAMON Beer/Coffee/Tea Chat[1] is an open, regular, and informal bi-weekly meeting series for the DAMON community. We had the first physical version of it as a BoF session at LPC last year. Continuing that, this will be the second in-person version of the DAMON community meetup. That is, we will discuss any topics about DAMON, including but not limited to:

        • Introduction of each other (who they are, what kind of interest/expectation they have for DAMON),
        • Sharing each person or company's current progress and bottlenecks on their DAMON development/application works,
        • Discussions on possible collaborations on DAMON-related works (both on kernel space and user space),
        • Discussions on direction and prioritization of DAMON's future,
        • Just showing each other's faces and saying hi, and
        • Anything.

        Also, this session may include Q&A and discussions for the kernel summit DAMON talk[2].

        [1] https://lore.kernel.org/damon/20220810225102.124459-1-sj@kernel.org/
        [2] https://lpc.events/event/17/contributions/1624/

        Speaker: SeongJae Park
    • Kernel Summit "Magnolia" (Omni Richmond Hotel)

      "Magnolia"

      Omni Richmond Hotel

      187
      • 178
        Kernel handling of CPU and memory hot un/plug events for crash

        Once the kdump service is loaded, if changes to CPUs or memory occur,
        either by hot un/plug or off/onlining, the crash elfcorehdr must also
        be updated.

        The elfcorehdr describes to kdump the CPUs and memory in the system,
        and any inaccuracies can result in a vmcore with missing CPU context
        or memory regions.

        The current solution utilizes a udev event per CPU or memblock to
        initiate an unload-then-reload of the kdump image (e.g. kernel, initrd,
        boot_params, purgatory and elfcorehdr) by the userspace kexec utility.
        In a rapidly scaling environment, significant performance problems
        occur related to offloading this activity to userspace.

        This talk introduces a generic crash handler that registers with
        the CPU and memory notifiers. Upon CPU or memory changes, from either
        hot un/plug or off/onlining, this generic handler is invoked and
        performs important housekeeping, for example obtaining the appropriate
        lock, and then invokes an architecture specific handler to do the
        appropriate elfcorehdr update.
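
        An illustrative sketch of such a handler, using the existing memory and CPU hotplug notifier APIs; crash_update_elfcorehdr() stands in for the locking and arch-specific elfcorehdr update and is not the final upstream interface:

        #include <linux/cpuhotplug.h>
        #include <linux/init.h>
        #include <linux/memory.h>
        #include <linux/notifier.h>

        /* Placeholder for "take the lock, rebuild the elfcorehdr" (not an upstream API). */
        static void crash_update_elfcorehdr(void)
        {
        }

        static int crash_memhp_notifier(struct notifier_block *nb,
                                        unsigned long action, void *data)
        {
                if (action == MEM_ONLINE || action == MEM_OFFLINE)
                        crash_update_elfcorehdr();
                return NOTIFY_OK;
        }

        static struct notifier_block crash_memhp_nb = {
                .notifier_call = crash_memhp_notifier,
        };

        static int crash_cpu_online(unsigned int cpu)
        {
                crash_update_elfcorehdr();
                return 0;
        }

        static int crash_cpu_offline(unsigned int cpu)
        {
                crash_update_elfcorehdr();
                return 0;
        }

        static int __init crash_hotplug_init(void)
        {
                int ret;

                register_memory_notifier(&crash_memhp_nb);
                ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "crash/elfcorehdr",
                                                crash_cpu_online, crash_cpu_offline);
                return ret < 0 ? ret : 0;
        }
        subsys_initcall(crash_hotplug_init);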

        Speakers: Eric DeVolder (Oracle), Sourabh Jain
      • 179
        A kernel documentation update

        The documentation discussion at past kernel summits has been lively, so
        I think we should do it again. Some topics I would bring to a session
        this year would include:

        • The ongoing restructuring of the Documentation/ directory. I've been
          slowly moving the architecture docs into Documentation/arch/, but
          would like to do more to reduce the clutter of the top-level directory
          and make our documentation tree more closely resemble the organization
          of the source.

        • Structure. We continue to collect documents, but do little to tie
          them together into a coherent whole. Do we want to change that and,
          if so, how?

        • Support for documentation work. There is nobody in the community who
          is paid to put any significant time into documentation, and it shows.
          How can we fix that?

        • Infrastructure. Sphinx brings a lot but is far from perfect; what can
          we do to improve it?

        Other topics will certainly arise as well.

        jon

        Speaker: Jonathan Corbet (Linux Plumbers Conference)
      • 11:00
        Break
      • 180
        Tips and Strategies for Reducing Stress and Burnout by Creating Psychological Safety

        In today's hybrid computing and work environments, many challenges make it difficult to collaborate, communicate, and feel a sense of belonging as a team member. It can become increasingly difficult to develop trust and teamwork in these environments, which impacts innovation and productivity. Creating psychological safety can be helpful when embracing strategies that reduce stress and burnout. Learn strategies that build trust and support for team members, leverage collaborative language, and demonstrate understanding, empathy, and self-awareness.

        Speakers: Dr Gloria Chance, Shuah Khan (The Linux Foundation)
      • 13:00
        Lunch
      • 181
        DAMON: Current Status and Future Plans

        DAMON[1,2] is a Linux kernel subsystem that provides efficient data access
        monitoring and access-aware system operations (DAMON-based Operation Schemes,
        a.k.a DAMOS). A service provider reported that they are showing about a 16%
        reduction in memory usage with modest overhead on their products by utilizing
        a DAMON-based system operation scheme.

        After its initial introduction[3], it has continued to develop in response to
        feedback from users and kernel hackers. We have also proactively sought more
        feedback by sharing the status and discussing future work in multiple venues,
        including every kernel summit since 2019[3,4,5,6] and the DAMON
        community[7,8]. As a result, DAMON has made substantial improvements while the
        list of future work has never emptied.

        This talk will aim to continue the sharing and discussion at the kernel summit
        of 2023. We will share what feedback we received, what patches have been
        developed or are under development, what requests are still in the planning
        stage, and what the plans are. With that, hopefully we will have discussions
        that will be helpful for improving and prioritizing the plans and specific
        tasks, and finding new requirements.

        Specific sub-topics would include, but are not limited to:

        • Efficient ABI and a convenient user-space tool
        • Fine-grained DAMOS control
        • Partial self-tuning of DAMOS
        • Extension of DAMON monitoring targets
        • Plan for collaborative memory-overcommit VM system management
        • Plan for tiered-memory management
        • Plan for DAMON accuracy improvement

        Based on the progress until the summit, some items can be added or dropped.

        [1] project homepage, https://damonitor.github.io
        [2] official doc, https://docs.kernel.org/mm/damon/index.html
        [3] ksummit 2019, https://linuxplumbersconf.org/event/4/contributions/548/
        [4] ksummit 2020, https://www.linuxplumbersconf.org/event/7/contributions/659/
        [5] ksummit 2021, https://linuxplumbersconf.org/event/11/contributions/984/
        [6] ksummit 2022, https://lpc.events/event/16/contributions/1224/
        [7] DAMON mailing list, https://lore.kernel.org/damon
        [8] DAMON meetup, https://lore.kernel.org/damon/20220810225102.124459-1-sj@kernel.org/

        Speaker: SeongJae Park
      • 182
        Kernel Samepage Merging (KSM) at Meta and Future Improvements to KSM

        Kernel samepage merging (KSM) is a memory de-duplication scheme that finds duplicate pages in memory and shares them in order to alleviate memory bottlenecks. However, KSM can have a negative impact on performance, as it requires scanning all target pages.

        In this presentation, I describe how KSM is used by a real-world application, how the application is configured to make good use of KSM, and what benefits can be achieved in terms of memory savings.
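
        For context, the usual way an application opts memory into KSM is madvise(MADV_MERGEABLE); a minimal sketch follows (the region size is arbitrary, and ksmd must also be enabled via /sys/kernel/mm/ksm/run):

        #include <stddef.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 64 << 20;                          /* 64 MiB example region */
                void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED)
                        return 1;
                /* Mark the region as a KSM merge candidate. */
                return madvise(buf, len, MADV_MERGEABLE);
        }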

        The presentation evaluates the current limitations of KSM and how future KSM features can reduce the CPU consumption. Specifically, the presentation will include details about "adaptive page scanning" and an advisor to select suitable values for how many pages to scan.

        Results show that this can reduce the CPU consumption of the ksmd background thread significantly.

        Speaker: Stefan Roesch (Meta)
      • 16:00
        Break
      • 183
        VSOCK: From Convenience to Performant VirtIO Communication

        The VSOCK family of sockets has traditionally been embraced for its convenience in enabling communication between virtual machines and the host in virtualized environments. However, recent practical advancements have developed VSOCK into more than just a convenience; it has become a viable networking protocol even for some extremely demanding networking workloads across the host/VM boundary.
        This talk will delve into virtio/vsock and its new support for datagrams, unlocking new potential for efficient packet exchange between VMs and the host. By comparing VSOCK datagrams with UDP over virtio, we showcase its practical performance advantages, making it a compelling choice for high-throughput point-to-point socket-based communication scenarios.
        Additionally, we'll explore the integration of sockmap for VSOCK, empowering eBPF programs to interact with VSOCK sockets within the kernel. This capability allows for dynamic socket management, providing the ability to leverage the performance advantages of both sockmap and VSOCK in the same practical applications.
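
        A minimal sketch of the guest-side datagram usage being discussed, assuming SOCK_DGRAM support in the virtio transport as proposed; the port number is arbitrary:

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/vm_sockets.h>

        int main(void)
        {
                int fd = socket(AF_VSOCK, SOCK_DGRAM, 0);   /* datagram vsock socket */
                struct sockaddr_vm addr = {
                        .svm_family = AF_VSOCK,
                        .svm_cid    = VMADDR_CID_HOST,      /* talk to the host */
                        .svm_port   = 5000,                 /* illustrative port */
                };
                const char msg[] = "hello";

                if (fd < 0)
                        return 1;
                return sendto(fd, msg, sizeof(msg), 0,
                              (struct sockaddr *)&addr, sizeof(addr)) < 0;
        }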

        Speakers: Amery Hung (ByteDance), Bobby Eshleman (ByteDance)
      • 184
        Improving resource ownership and life-time in linux device drivers

        Recently there have been several talks about issues with object ownership in device drivers, use-after-free bugs and problems with handling hot unplug events in certain subsystems.

        First Laurent Pinchart revisited an older discussion about the harmful side-effects of devres helpers during LPC 2022[1]. I then went down that rabbit hole only to discover a whole suite of issues, not really linked to devres in any way but rather mostly caused by the way subsystems and drivers mix reference counted resources with regular ones[2]. This year Wolfram Sang continued the research and presented even more vulnerable subsystems as well as some potential remedies during his talk at the EOSS 2023 in Prague[3].

        I have since experimented with several approaches and would like to present some updates on this subject. During this talk I plan to jump straight into presenting concrete ideas and timelines for improving the driver model and introducing some unification in the way subsystems handle driver data. While this is a significant effort spanning multiple device subsystems that will need to be carried out in many phases over what will most likely be years, without addressing the problems, we'll be left with many parts of the kernel not being able to correctly handle simple driver unbinds.

        [1] https://lpc.events/event/16/contributions/1227/
        [2] https://fosdem.org/2023/schedule/event/devm_kzalloc/
        [3] https://eoss2023.sched.com/event/1LcPP/subsystems-with-object-lifetime-issues-in-the-embedded-case-wolfram-sang-sang-engineering-renesas

        Speaker: Bartosz Golaszewski (BayLibre)
    • LPC Refereed Track "James River Salon D" (Omni Richmond Hotel)

      "James River Salon D"

      Omni Richmond Hotel

      183
      • 185
        Syzbot: 7 years of continuous kernel fuzzing

        Syzbot is an automated system that continuously fuzzes OS kernels and routes the reports to kernel developers. Since 2017, syzbot has already found and published more than 10000 findings in the Linux kernel.

        Levels of adoption and reception of the tool differ throughout the kernel, but it has definitely had a positive impact on the Linux kernel's health. To date, more than 5500 findings have been addressed and 3300+ Linux commits explicitly mention syzbot. Occasionally (for better or worse), our reports spark passionate discussions about kernel testing and development paradigms.

        The talk begins with a brief overview of the tool and the functionality it offers to kernel developers to facilitate bug report triaging, debugging, and fixing. We'll also cover the major challenges of building and operating an automated kernel testing system that have been (and are still being) faced since the launch of syzbot.

        The second half of the talk will be dedicated to the discussion of why some fuzzer-reported findings are addressed by the kernel developers community, while some are not, and what can be done to improve the situation. We'll share our understanding of the important factors as well as the relevant syzbot-accumulated statistics and we also hope to hear your opinion on the subject.

        Additionally, we will elaborate on the syzbot areas that have been significantly reworked to address user feedback and to get more kernel bugs fixed. For instance, we'll talk about the classification of findings by the affected kernel subsystems and per-subsystem lists of not yet addressed bugs. Another highlight is the new functionality to detect missing fix backports for syzbot findings in stable trees.

        We welcome and encourage discussion, feedback, and constructive criticism on how syzbot can be made more helpful to the kernel development community.

        Speaker: Aleksandr Nogikh (Google)
      • 186
        Linux Virtualization Based Security (LVBS)

        The Linux kernel offers built-in self-protection mechanisms like control-register pinning, module/file authentication, and protection restrictions, but a sophisticated kernel-level attacker can still bypass these. To get a much more effective defense, we propose enforcing such protection mechanisms via the hypervisor itself or a hypervisor-backed trusted entity. This also allows us to consider safeguarding and monitoring other critical system assets (passwords, critical kernel data structures) through the same trusted entity. In this talk we introduce Linux Virtualization Based Security (LVBS), an umbrella term under which we can offer various hypervisor-backed kernel protection solutions.
        We want to present a common, hypervisor-agnostic, extendable architecture in the Linux kernel that can be used by any hypervisor to implement and extend Linux kernel protections, and show how different hypervisor frameworks (Hyper-V as an example of a type-1 hypervisor and KVM as an example of a type-2 hypervisor) can plug into the common layer to harden the Linux kernel. We will then briefly discuss the ongoing work at Microsoft in the Linux kernel to implement the proposed architecture, touching upon:
        a. Parts of hypervisor-agnostic common layer under development [1]
        b. Using Hyper-V’s Virtual Secure Mode (VSM) along with the common layer to harden Linux kernel
        c. Using KVM along with the common layer to harden Linux kernel [1]
        We intend to publish and upstream all the Linux kernel code for this project.
        1. https://lore.kernel.org/all/20230505152046.6575-1-mic@digikod.net/
        2. https://github.com/heki-linux

        Speakers: James Morris, Mickaël Salaün (Microsoft), Thara Gopinath (Microsoft)
      • 11:00
        Break
      • 187
        Is Linux Suspend ready for the next decade?

        Users demand speed, reliability, and low-power from system-suspend. To assure Linux can meet these goals, Intel's upstream kernel team built "sleepgraph" a decade ago, and we have been running and improving it ever since.

        Today, Linux OEMs demand over 10,000 consecutive successful suspend iterations to demonstrate suspend reliability. And so our efforts have evolved from function and performance to stress testing. Our modest development lab is on track to complete over 5 million suspend-resume cycles this year -- primarily by stealing time on lab development systems during off-peak hours.

        Continuous stress testing has cast light on new pain points. Most failures stem from intermittent device driver quality issues -- typically breaking in Linux rc1, but sometimes regressing as late as rc8! Long-term testing has also shown failures stemming from uncontrolled changes in hardware configuration, the network environment, and even temperature!

        We have found diminishing returns in longer endurance tests. More value comes from spreading testing across different hardware. We need help from our co-travelers in the community to broaden the hardware population being tested. We need help from you!

        And so we will demonstrate how easy it is for community members to contribute by running sleepgraph -- encouraging you to do so in your lab, and on the laptop that you have in front of you, whether you interpret the results (and file bugs) yourself, or forward them to a repository where they can be analyzed by experts.

        Speakers: Len Brown (Intel Open Source Technology Center), Rafael Wysocki (Intel Open Source Technology Center), Mr Todd Brandt (Intel Open Source Technology Center)
      • 188
        Encryption for filesystems with advanced features: new fscrypt functionality

        fscrypt has long been the standard subsystem for filesystems to adopt filesystem-level encryption. Traditionally fscrypt has encrypted data on a per-inode level; however, this made snapshotting or reflinking encrypted data difficult. Over the past two years, btrfs has worked to add per-extent encryption to fscrypt: encrypting on a per-extent level allows reflinking and snapshotting of encrypted data, and potentially other features in the future like changing encryption keys for new data and the use of authenticated encryption for greater security.

        This talk will go over what your filesystem can do with the new per-extent fscrypt, the tradeoffs of inode- vs. extent-based fscrypt, and the challenges encountered in btrfs. Afterward, we'll discuss what's coming next and address questions about whether per-extent fscrypt is suitable for the unique feature set of your filesystem.

        Speaker: Sweet Tea Dorminy (Meta)
      • 13:00
        Lunch
      • 189
        Trust, confidentiality, and hardening: the virtio lessons

        Can you trust your hardware? How do you know? And if not can you still use it?

        These questions are not new; however, Linux currently lacks a comprehensive answer. The confidential computing platforms that are becoming more popular offer both a new perspective and new uses - and with them, a new sense of urgency - for efforts to answer these questions.

        Being a common part of the hypervisor/guest interface, virtio found itself at the forefront of some of these efforts.

        This talk will review several approaches to the question of trust and highlight some (sometimes subtle) differences between these.

        Further, the experience in virtio driver hardening will be reviewed, including difficulties posed by existing infrastructure and issues that remain unaddressed to this day.

        Finally, some ideas for unifying our approach to trust in hardware will be presented.

        Speakers: Michael S. Tsirkin (Red Hat), Stefan Hajnoczi (Red Hat)
      • 190
        Using hardware hints for optimal page placement

        Hardware platforms have started exposing useful and actionable memory access information to the OS in various ways [1] [2]. There are sub-systems in the kernel like NUMA balancing which need such information but currently employ software methods to generate and act upon such access information. They could benefit if hardware can directly provide access information to the kernel in an easy to consume manner.

        This talk intends to look at the ways, opportunities (NUMA balancing, reclaim, hot page promotion in tiered memory systems) and challenges in using hardware provided memory access information within the kernel. The talk will also share the experience of using the Instruction Based Sampling (IBS) mechanism present in AMD EPYC processors for NUMA balancing and hot page promotion.

        [1] https://lore.kernel.org/linux-mm/20230208073533.715-1-bharata@amd.com/
        [2] https://lore.kernel.org/linux-mm/20230402104240.1734931-1-aneesh.kumar@linux.ibm.com/

        Speaker: Mr Bharata Bhasker Rao (AMD)
      • 16:00
        Break
      • 191
        Linux perf tool metrics

        Linux kernel perf events are great, however, individual events are often too low-level to understand a performance issue. For example, a metric like memory bandwidth may consist of read and write counters on multiple memory controllers, with different counters for different types of memory and with counts having additional data combined within them like acknowledgements. The Linux perf tool provides metrics to combine events and give human/tool readable performance data. This presentation will describe the use of performance metrics, recent improvements, background on how metrics are implemented and what the upcoming challenges for metric implementation are.

        A particular area where metrics have been improving has been in support of Intel CPUs. The Valkyrie project, a joint effort between Intel and Google, has contributed hundreds of metrics across client and server CPUs. The presentation will show features of this work, discuss new performance data from Intel processors, as well as problems that are general to metrics and their implementation in the Linux perf tool.

        Speakers: Ian Rogers (Google), Ms Weilin Wang (Intel)
      • 192
        Ship your Critical Section, Not Your Data: Enabling Transparent Delegation with TCLocks

        Today’s high-performance applications heavily rely on various synchronization mechanisms, such as locks. While locks ensure mutual exclusion of shared data, their design impacts application scalability. Locks, as used in practice, move the lock-guarded shared data to the core holding it, which leads to shared data transfer among cores. This design adds unavoidable critical path latency leading to performance scalability issues. Meanwhile, some locks avoid this shared data movement by localizing the access to shared data on one core, and shipping the critical section to that specific core. However, such locks require modifying applications to explicitly package the critical section, which makes it virtually infeasible for complicated applications with large code bases, such as the Linux kernel, which comprises 28M LoC with more than 180k static lock call sites.

        We propose transparent delegation, in which a waiter automatically encodes its critical section information on its stack and notifies the combiner (lock holder). The combiner executes the shipped critical section on the waiter’s behalf using a lightweight context switch. Using transparent delegation, we design a family of locking protocols, called TCLocks, that requires zero modification to applications’ logic. We have implemented a spinlock, a delegation-based blocking lock, and a phase-based reader-writer lock. TCLocks also handles nested locking and out-of-order (OOO) unlocking, which is heavily used in the Linux kernel. While booting the kernel, we measure around ~80k nested locking calls and ~20k OOO calls.

        Naively using transparent delegation breaks the assumptions of Linux kernel code about the stability of access to per-CPU variables under special execution contexts like interrupt handlers, non-preemptible or non-migratable contexts. One solution is to switch the gs register for x86 architecture to provide per-CPU variables of the waiter’s CPU. However, it will lead to subtle bugs in the combiner phase unless we annotate parts of the kernel code which require access to the combiner’s per-CPU variables, such as the scheduler, RCU, etc. We adopt a conservative approach of disabling combining for such execution contexts and fallback to qspinlock (Linux kernel spinlock).

        The evaluation after replacing spinlock, mutex, rwlock and rwsem in the Linux kernel shows that TCLocks provide up to 5.2× performance improvement compared with recent locking algorithms. Detailed information about achieving transparent delegation for different types of locks can be found in our OSDI'23 paper and our implementation.

        Enabling transparent delegation for the kernel still requires answering several questions like how does resource accounting works or how does delegation work with the current macro? Since the combiner thread executes the waiter's critical section, correct CPU/memory accounting must be done for the combiner and waiter thread. Similar to per-CPU variables, the Linux kernel uses the current macro for multiple purposes, including resource accounting for cgroups, permission checks using credentials, thread scheduling etc. The current macro has to be handled correctly within the combining phase otherwise it can lead to subtle bugs. We plan to discuss possible solutions to these problems to enable transparent delegation for the kernel.

        Speaker: Vishal Gupta
    • Live Patching MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83

      The Live Patching microconference at Linux Plumbers 2023 aims to gather stakeholders and interested parties to discuss proposed features and outstanding issues in live patching.

      • 193
        Livepatch Visibility at Scale

        To support various Linux kernels in hyperscale data centers, it is important to aggregate signals (console output, crash dumps, etc.) from millions of servers. One of the key pieces of information in this massive dataset is the kernel version running on each host.

        At Meta, we use netconsole to analyze console outputs from millions of servers. Recent work [1] added the kernel version to each netconsole message. This allows us to answer questions like “which kernels get error messages like XXX?”.

        However, with the presence of livepatches, the kernel version alone is not enough to differentiate systems running kernel X from systems running kernel X plus livepatch Y.

        To better understand the impact of a livepatch on thousands of hosts, e.g. during a livepatch rollout, we use a hack [2] to append a suffix to the kernel version in the netconsole record. This approach, however, is not sufficient for upstream use. Specifically, it cannot handle systems with multiple livepatches attached at the same time.

        In this livepatch MC, we would like to discuss different options to include livepatch information in netconsole (and maybe also other data sources). The following are a few premature proposals:

        • Expand struct klp_patch to contain a tag, and send it with Netconsole
        • Send struct klp_patch->mod->name in Netconsole
        • Append the information to struct new_utsname

        [1] https://lore.kernel.org/lkml/20230707132911.2033870-1-leitao@debian.org/
        [2] https://pastebin.com/tcJTkNP2

        Speakers: Breno Leitao (Meta), Song Liu (Meta)
      • 194
        KLP for Clang LTO Kernel

        We want to discuss live patching with a clang-built LTO kernel. We have managed to make it work on top of kpatch now, and we would like to discuss:

        • How the current comparison against the vmlinux binary itself is not scalable.
        • How we used a clang LTO internal flag to compare pre-linker object files.
        • How new special symbols introduced by LTO may complicate live patching and how the build system needs to adapt to them.
        • Whether and how we can improve kpatch to support comparison with vmlinux itself.
        Speakers: Yonghong Song, Song Liu (Meta)
      • 195
        Kbuild support for klp-relocation generation

        Livepatches may use symbols which are not contained in their own scope and, because of that, may end up compiled with relocations that will only be resolved during module load. Yet, when the referenced symbols are not exported, resolving such a relocation requires information about the object that holds the symbol (either vmlinux or a module) and the symbol's position inside that object, as an object may contain multiple symbols with the same name. Providing such information must be done according to what is specified in Documentation/livepatch/module-elf-format.txt.

        Currently, there is no trivial way to embed the required information in the final livepatch ELF object. A proposed version [1] of klp-convert submitted by Joe Lawrence solves this problem in two different ways: (i) by relying on a symbol map, built during kernel compilation, to automatically infer the relocation's target symbol, and, when such inference is not possible, (ii) by using annotations in the ELF object to convert the relocation according to the specification, enabling it to be handled by the livepatch loader.

        After further discussion, we got to the point that there are currently no means whatsoever to create the symbol map, and Nicolai Stange proposed resorting to a more minimal utility doing only part (ii). This minimalistic klp-convert would have no knowledge of the symbols available in the livepatched target objects at all and would simply rely on symbols requiring relocation having a specific prefix, applied via a predefined macro.

        A first version of this minimalistic klp-convert already exists; it is being tested, and the patch will be submitted to the live-patching mailing list soon.

        Now there is still a big question about the upstreamability of this tool, and also about which approach sounds better to the community.

        [1] https://lore.kernel.org/live-patching/20230314202356.kal22jracaw5442y@daedalus/T/#mc1028d2e6eb1c6b7ee1891025cc4913075d4332c

        Speaker: Lukáš Hruška
      • 11:00
        Break
      • 196
        Simplify Livepatch Callbacks, Shadow Variables, and States handling

        Livepatches allow fixing critical security or functional bugs without reboot. They are useful when an downtime is expensive.

        The basic livepatch functionality is to redirect problematic functions to fixed or improved variants. In addition, there are two features helping with more problematic situations:

        • pre_patch(), post_patch(), pre_unpatch(), and post_unpatch() callbacks might be called. For example, they allow enabling some functionality at the end of the transition, when the entire system is using the new implementation. [1]

        • Shadow variables allow adding new items to structures. [2]

        Many fixes might accumulate before the system gets rebooted. They might be added by separate livepatches, or the existing fixes might be replaced by a cumulative livepatch including both the old and new fixes.

        Cumulative livepatches help with keeping the kernel in a well-known state. All the changes are done by a single livepatch. This is easy especially when the livepatch modifies only the implementation of the patched functions.

        The situation gets more complicated when callbacks and/or shadow variables are used. The new cumulative livepatch should not do some actions when they were already done by the previous livepatch. On the other side, it should revert some actions or free shadow variables when they are no longer supported by the new livepatch.

        The problems with callbacks and shadow variables were supposed to be solved by livepatch states [3]. The API allows checking whether the previous patch already had a given state and behaving accordingly. The states are versioned, which makes it possible to check whether the new livepatch is compatible with the current one: the new livepatch can be installed only when it supports all the existing states.

        Practice shows that all these pieces do not play well together:

        • Shadow variables and states are associated with the patch.
        • Callbacks are associated with livepatched objects. They are called when the livepatch or the livepatched module gets loaded or removed.
        • Only callbacks from the new livepatch are called when it replaces an older one, which might prevent downgrades.
        • The state API is hard to use.

        Yet these pieces should be connected. The callbacks are often used together with the shadow variables, for example:

        • A post_patch() callback might be needed to enable use of the shadow variables once the entire system is ready to handle them.
        • A pre_unpatch() callback might be needed to disable use of the shadow variables.
        • A post_unpatch() callback might be needed to free the no longer used shadow variables.

        The shadow variables have a lifetime [4]. They are introduced by one livepatch. They might still be used by newer livepatches. They need to be removed when the livepatch gets disabled or when they are no longer needed by a newer livepatch.
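
        For readers unfamiliar with the current API, here is a minimal sketch of how callbacks and shadow variables are typically combined today using the existing klp_shadow_*() helpers; the tracked object, the id and the payload are purely illustrative:

          #include <linux/livepatch.h>

          #define SV_ID 1                         /* illustrative shadow variable id */

          static int tracked_object;              /* placeholder for the real patched object */

          /* post_patch(): attach new per-object data once the transition has finished. */
          static void fix_post_patch(struct klp_object *obj)
          {
                  klp_shadow_get_or_alloc(&tracked_object, SV_ID, sizeof(int),
                                          GFP_KERNEL, NULL, NULL);
          }

          /* post_unpatch(): drop all shadow variables with this id again. */
          static void fix_post_unpatch(struct klp_object *obj)
          {
                  klp_shadow_free_all(SV_ID, NULL);
          }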

        In fact, any changes done by the callbacks have a lifetime. It means that any state has a lifetime.

        It is time to connect all the pieces in a better way:

        1. Connect callbacks and shadow variables with states.
        2. Define all three pieces either per-patch or per-object.
        3. Call the callbacks when the state is introduced and removed.

        Proposal (a rough structural sketch follows this list):

        • Move callbacks to struct klp_state
        • Rename callbacks to setup(), enable(), disable(), remove()
          and call them when the state is introduced and removed.
        • Add @is_shadow flag to connect the state with a shadow
          variable with the same @id
        • Add a @shadow_dtor() callback to struct klp_state and use it for
          garbage collection of obsolete shadow variables.
        • Add @block_disable flag to prevent disabling or downgrading the
          livepatch when the state can't be disabled.
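
        A hedged sketch of what struct klp_state could look like after such a change; the added field and callback names follow the proposal above, the callback argument types are guesses, and today's upstream structure only carries @id, @version and @data:

          struct klp_state {
                  unsigned long id;
                  unsigned int version;
                  void *data;

                  /* proposed additions (sketch only) */
                  bool is_shadow;         /* state is backed by a shadow variable with the same id */
                  bool block_disable;     /* refuse disabling/downgrading while the state is active */
                  int  (*setup)(struct klp_state *state);
                  void (*enable)(struct klp_state *state);
                  void (*disable)(struct klp_state *state);
                  void (*remove)(struct klp_state *state);
                  void (*shadow_dtor)(void *obj, void *shadow_data);
          };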

        Pros:

        • All pieces play well together.
        • Naturally supports the lifetime of changes done by callbacks and shadow variables.
        • The disable() and remove() callbacks might be called from the patch which supported the state. This allows a downgrade to a livepatch which was not aware of the state.
        • Obsoleting and disabling states is handled the same way.

        Cons:

        • API changes are not backward compatible.
        • Callbacks might be called only when the livepatch gets loaded
          if the state stays associated with the livepatch.
        • More complexity when states get associated with livepatched objects (modules).
        • Callbacks are not called when the new livepatch supports the same state. More callbacks might be needed when a transition is needed.
          Solvable by passing data via shadow variables?

        Reference:

        [1] https://lore.kernel.org/lkml/1507921723-1996-1-git-send-email-joe.lawrence@redhat.com/
        [2] https://lore.kernel.org/lkml/1504211861-19899-1-git-send-email-joe.lawrence@redhat.com/
        [3] https://lore.kernel.org/all/20191030154313.13263-3-pmladek@suse.com/T/#mf86ded54e03bf2a80a48d66040c381c9af219d89
        [4] https://lore.kernel.org/all/20221026194122.11761-1-mpdesouza@suse.com/

        Speaker: Petr Mladek (SUSE)
      • 197
        Moving livepatching module building to kselftests

        The kernel livepatching subsystem has a number of tests that reside in the kernel tree. The kernel modules and scripts are placed in different locations: the livepatch test modules are stored in lib/livepatch, while the scripts that load the modules and run the tests are stored in tools/testing/selftests. The test modules are currently only compiled when CONFIG_TEST_LIVEPATCH is enabled, making it difficult to use the same tests on systems which do not have the option enabled.

        The current approach of compiling the test modules when the kernel modules are built prevents us from creating more dynamic tests based on code templates. Such code would allow us to create and compile modules "on the fly", allowing some out-of-tree [1] livepatch tests to be adapted and submitted upstream. This could work by moving the modules from lib/livepatch to tools/testing/selftests/livepatch, allowing use of the "gen_tar" target to create a tarball containing both modules and test scripts. The obstacle to this idea is a kselftests policy of only containing userspace code.

        The proposed idea poses some challenges to how we see kselftests and which code should be placed within it; on the other hand, it gives us the tools to merge more tests and make the subsystem more robust and reliable. Such a move could even motivate other teams to submit their livepatch tests if support for a more dynamic test infrastructure is added.

        [1] https://github.com/SUSE/qa_test_klp/

        Speaker: Marcos de Souza (SUSE)
      • 198
        Arm64 live patching

        This talk will present the current status of the kernel side of Arm64 live patching, as well as the missing pieces and obstacles which prevent enabling it on this architecture.

        Speaker: Mark Rutland (Arm Ltd)
    • VFIO/IOMMU/PCI MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82

      The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components.

      • 199
        Improve Xeon IRQ throughput with posted interrupt

        Server SoCs today offer more PCIe lanes as well as the ability to stack more IO devices on a single port. Out of the box, devices such as high-speed NVMe drives can generate a significant number of interrupts at high frequencies. Due to microarchitecture choice and PCIe strong ordering requirements, limited IRQ throughput on Intel Xeon has also become a limiting factor for DMA throughput. IOPS can drop more than 50% as evidenced by the FIO test on multiple NVMe disks. This talk describes a scheme and RFC patch that optimize IRQ throughput by enabling posted interrupts on bare metal (beyond the VM usage today). The result is that much of the performance loss can be recovered without any new hardware or driver changes.

        Speaker: Jacob Pan
      • 200
        PCI Endpoint Subsystem Open Items Discussion

        The PCI Endpoint subsystem allows the Linux kernel to run on PCI endpoint devices, thereby establishing communication with the PCI host for data transfer. There are three open items to discuss for the PCI Endpoint subsystem:

        1. The heart of the PCI Endpoint subsystem is the Endpoint Function (EPF) driver that describes the physical and virtual functions inside the endpoint device. So far, three EPF drivers are supported in the upstream Linux kernel. Recently, there have been attempts to add VIRTIO-based EPF drivers for interoperability. This discussion aims at presenting current and past proposals and getting feedback on the desired approach.

        2. Most of the EPF drivers that exist today are virtual function drivers (i.e. not backed by a hardware entity), except for the Modem Host Interface (MHI) driver for Qcom platforms. So there is a requirement to describe those functions in devicetree (together with a binding) and also to allow the EPF drivers to bind with the EPC during boot without ConfigFS intervention (using a devicetree link between EPC and EPF).

        3. The PCI Endpoint subsystem uses a custom memory allocator for managing the PCI outbound window memory, but it could make use of the Linux kernel's generic "genalloc/genpool" subsystem (a rough sketch of what that could look like follows).
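
        As a hedged illustration of item 3 (function names, the outbound window variables and the use of ioremap() are assumptions, not the subsystem's actual code), the outbound window could be carved up with the genalloc API roughly like this:

          #include <linux/genalloc.h>
          #include <linux/io.h>

          static struct gen_pool *ob_pool;        /* outbound window pool (illustrative) */

          static int ob_window_init(unsigned long win_base, size_t win_size)
          {
                  ob_pool = gen_pool_create(PAGE_SHIFT, -1);  /* page granularity, any node */
                  if (!ob_pool)
                          return -ENOMEM;
                  return gen_pool_add(ob_pool, win_base, win_size, -1);
          }

          static void __iomem *ob_window_map(size_t size, phys_addr_t *phys)
          {
                  unsigned long addr = gen_pool_alloc(ob_pool, size);

                  if (!addr)
                          return NULL;
                  *phys = addr;
                  return ioremap(addr, size);
          }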

        Speaker: Manivannan Sadhasivam
      • 201
        Non-discoverable devices in PCI devices

        Modern PCI devices can expose a whole slew of hardware behind a single PCI "device". While the PCI device itself is discoverable, everything behind it (via BARs) is not. These devices are not fixed in which downstream devices they expose nor in their configuration. There is already a solution for discovering devices and their configuration: Devicetree. There is also already a mechanism to dynamically add to a DT: DT overlays.

        Specific use cases:

        • AMD/Xilinx Alveo Accelerator cards - FPGA based PCIe card. Multiple downstream devices (hardware peripherals) exposed
          through PCI BARs. The downstream devices may be unrelated and already have a driver (platform bus) in the kernel.
        • Microchip LAN9662 Ethernet controller - An SoC already supported upstream using Devicetree. This device also has a PCIe endpoint which can expose the entire SoC to Linux. Using DT to describe the SoC allows reusing all the drivers as-is.
        • roadtest - A testing framework which exposes platform devices behind a virtual PCI device in UML. See https://lore.kernel.org/all/20230120-simple-mfd-pci-v1-1-c46b3d6601ef@axis.com/

        What's needed in the kernel:

        • Generating PCI device DT nodes - PCI devices can already be described in DT, but usually are not. In order to have a base to apply DT overlays to, the PCI device needs a DT node. That can be solved by generating the DT node (and parent nodes) for the device. With this, the PCI driver for the device can load and apply DT overlays to its node. See https://lore.kernel.org/all/1692120000-46900-1-git-send-email-lizhi.hou@amd.com/ (Should be upstream in 6.6)
        • Enable Devicetree on non-DT systems (x86_64) - A solution for ACPI-based systems is desired as well. There are SSDT overlays for ACPI, which might work, but many of the devices to support don't use ACPI. With the generated PCI nodes, we just need the PCI host bridge and hooking up a few host resources (interrupts, MSI) to reuse the same DT overlay mechanism. A skeleton base DT is needed as well; that already exists, as the DT unittest creates one for non-DT systems. There may be other issues lurking where we make DT vs. ACPI decisions in the kernel.

        Questions:
        - What about swnode and/or auxiliary bus?

        Speakers: Lizhi Hou (AMD), Rob Herring (Arm)
      • 11:00
        Break
      • 202
        IOMMU overhead optimizations and observability

        IOMMU overhead memory, which is primarily page table memory, is allocated directly from the buddy allocator, and is not charged or accounted for. Also, there is no easy way to debug IOMMU translations as there are no user interfaces that allow walking through IOMMU page tables. Below are the proposals to solve the problems.

        Add observability for IOMMU page table memory to /proc/meminfo:

        PageTables:        XXX kB
        SecPageTables:     XXX kB
        IOMMUPageTables:   XXX kB
        

        This would allow users to see how much IOMMU page table memory is
        being used, which could help them identify and troubleshoot performance problems.

        Charge the IOMMU page table memory to the proper owner when DMA
        mappings are established:

        This would allow users to control and limit the amount of IOMMU page table memory that is used by each process.
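
        One possible shape of such charging (a hedged illustration, not the proposal's actual code): page-table pages allocated on behalf of a DMA mapping request could simply use the accounted GFP flag, so the memory controller attributes them to the calling task's cgroup.

          #include <linux/gfp.h>

          /* Illustrative only: __GFP_ACCOUNT makes the allocation memcg-charged,
           * so the IOMMU page-table page is attributed to the caller's cgroup. */
          static struct page *alloc_iommu_pt_page(void)
          {
                  return alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
          }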

        Allow walking through IOMMU page tables on live systems and in kdumps:

        This would allow users to debug IOMMU translations and identify problems.

        For live systems the interface should be similar to /proc/PID/pagemap, so users could walk through IOMMU page tables, and study which physical pages are currently mapped into page tables.

        For kdumps, it should be a crash-utility extension to dump IOMMU page tables.

        Limit the growth of page tables:

        Currently, when pages are unmapped from the page table, the
        now-empty page table levels are not returned to the system; see [1]
        for an example. This can cause substantial overhead in cases where VA
        addresses are not recycled. On the other hand, recycling VA addresses in order to save memory can be a security risk, and in general a bad practice.

        We propose to limit the maximum number of empty page table levels to a certain amount.

        Add iova_stress [1] to the kernel selftests:

        This would allow us to verify that page table overhead does not
        regress in the future.

        [1] https://github.com/soleen/iova_stress

        Speakers: Pasha Tatashin, Yu Zhao (Google)
      • 203
        iommufd discussion

        Open discussion on iommufd topics that have not been settled on the mailing list prior to the conference:

        • IOMMU based dirty tracking
        • IOMMU nested translation
        • IOMMU userspace command queue
        • Unique driver features
        • iommufd support of SVA/PRI/PASID
        • ARM interrupt handling in VMs
        • Driver enablement for iommufd features
        Speakers: Mr Jason Gunthorpe (NVIDIA Networking), Kevin Tian (Intel)
    • eBPF & Networking "James River Salon C" (Omni Richmond Hotel)

      "James River Salon C"

      Omni Richmond Hotel

      225

      For the fourth year in a row, the eBPF & Networking Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s networking stack as well as BPF subsystem and their surrounding user space ecosystems such as libraries, loaders, compiler backends, and other related system tooling.

      The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of both subsystems.

      The track will be composed of talks, 30 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      eBPF & Networking Track's technical committee: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann (chair), Andrii Nakryiko and Martin Lau.

      • 204
        Zero Copy Receive using io_uring

        Memory bandwidth is a bottleneck in many distributed services running at scale as I/O bandwidth has not kept up with CPU or NIC speeds over time. One limitation of kernel socket-based networking is that data is first copied into kernel memory via DMA, and then again into user memory, which adds pressure to overall memory bandwidth and comes with a CPU cost. The classic way of addressing this limitation is to bypass the kernel entirely, moving packets either to host memory (e.g. DPDK, PF_RING and AF_XDP sockets) or device memory (e.g. RDMA, RoCE and InfiniBand) without involving the kernel at all. These require re-implementing the networking stack either in userspace or the NIC hardware, and re-working applications which can no longer assume generic kernel-based socket I/O.

        With hardware support for features such as flow steering and header splitting, and new kernel features such as page pool memory providers, it is now possible to have a hybrid solution that sits in between full kernel bypass and full kernel copying. This paper proposes a way of doing zero copy network RX into host userspace memory, using io_uring for the user-facing API. We use header splitting to direct headers into the kernel for handling the stateful aspects of the networking layer, and payloads directly to their intended destination in userspace memory, without copying, using DMA. Flows are forwarded to specific RX queues configured for this using flow steering.

        In this paper, we first discuss our design using io_uring from end to end in detail. We then compare our design with existing zero copy RX APIs in kernel and kernel bypass methods, and share some preliminary performance results after that. As part of this, we go into kernel features (both WIP and still in proposal) that are needed to support our design.

        In the second part of the paper, we discuss the limitations of zero copy receive, and the challenges of fully making use of it in userspace applications. Unlike TX where data size is known ahead of time, RX data size is unpredictable and potentially bursty. To get the most out of zero copy receive, the write end has to coordinate closely with the receive end to agree on the exact shape of the data in order to avoid more copies down the line. For example, if the final destination is to write to a block device, then there are requirements for O_DIRECT to work such as alignment. Finally, we talk about our plans for further work e.g. extending support to GPU device memory.

        Speakers: David Wei (Meta), Pavel Begunkov (Meta)
      • 205
        Enhancing Homa Linux for Efficient RPC Transportation

        Homa, a unique transport protocol created specifically for hyperscale datacenters, provides optimized round-trip performance for request/reply messages. An in-depth evaluation of the Homa Linux module in contrast to TCP showed a considerable decrease in latency with RPC application benchmarks. Furthermore, our analysis of gRPC operating over Homa versus gRPC over TCP revealed significant benefits, specifically for smaller RPC messages (less than 20k), in both latency and throughput.
        Despite these advantages, Homa's broader use as a standard RPC transport protocol is constrained by two main challenges:
        1. Constraints of Message-based Interface: Homa's message-based interface poses hurdles for efficient pipelining, as it demands waiting for the complete message to ensure its full delivery to the application. This results in relatively low throughput for larger RPC messages (average size over 20k) compared with TCP.
        2. Exclusive Unary RPC Support: Currently, Homa only accommodates unary RPC, where a client sends a single request message and waits for one response. The absence of support for Bidirectional Streaming RPC, which allows full-duplex message exchange between client and server, limits Homa's adaptability in certain situations.
        This presentation aims to introduce our solutions to these identified obstacles, and offer practical guidance on improving the performance of Homa as a conventional RPC transport protocol. We will discuss optimal strategies to harness Homa's strengths to attain minimal RPC latency and maximum throughput. The session will incorporate real-world examples and demonstrations to clarify these concepts and highlight the benefits of using Homa as an efficient RPC transport protocol.

        Speakers: Dr Xiaochun Lu (Bytedance), Zijian Zhang (Bytedance)
      • 206
        An introduction to the DPLL subsystem

        More and more systems require precisely synchronized time to operate effectively. NTP and PTP are well-known protocols that distribute time information across networks. However, systems also need the underlying hardware to be properly configured and monitored as part of the solution. With SyncE adoption there is a requirement to support more rigorous approaches to time synchronization and distribution on hosts. One method is to use Digital Phase Lock Loop (DPLL) circuits to syntonize frequencies across the board. This new subsystem provides an API to configure and monitor such hardware in Linux, using netlink as the transport protocol.

        Speaker: Vadim Fedorenko (Meta)
      • 11:00
        Break
      • 207
        Unblocking the softirq lock for PREEMPT_RT

        Disabling bottom halves is essentially a per-CPU Big Kernel Lock. While some data structures have explicit locking, others rely on disabling BH. Depending on the load, networking has to wait until timer callbacks have finished. Even if networking is preempted by a task with higher priority, it cannot send a packet until all receive processing is done.
        This talk intends to discuss with the networking community, which is the major softirq user, what can be done to avoid this big lock.
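
        For context, a tiny illustration of the implicit-locking pattern being discussed (the per-CPU variable and function are hypothetical): code like this has no lock of its own and instead counts on local_bh_disable() keeping every other BH user on this CPU away.

          #include <linux/bottom_half.h>
          #include <linux/percpu.h>

          static DEFINE_PER_CPU(unsigned long, hypothetical_counter);

          static void touch_softirq_shared_data(void)
          {
                  local_bh_disable();
                  /* Data shared with softirq context is serialized only by the fact
                   * that bottom halves cannot run here -- effectively one big per-CPU
                   * lock shared by timers, networking and every other BH user. */
                  __this_cpu_inc(hypothetical_counter);
                  local_bh_enable();
          }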

        Speaker: Sebastian Siewior (Linutronix)
      • 208
        Offloading QUIC Encryption to Enabled NICs

        In large deployments, significant CPU cycles are used on encryption for transport security (QUIC, TLS, etc). CPU crypto instructions and ‘look-a-side’ accelerators can have significant performance penalties (memory copies, cache pollution, etc).

        NIC or inline offload solves many of these problems, and it leverages the natural memory copy into the NIC to implement crypto offload. Other protocols (kTLS, IPsec) have been successfully offloaded using this technique; it is time for QUIC to do the same.

        This presentation will cover the software design (from userspace to hardware driver) for utilizing crypto offload for QUIC packets using offload-capable NIC implementations (including future Broadcom NICs). In addition to covering the design and implementation of this infrastructure, there will be discussion around the performance benefits of this solution for those who want to utilize QUIC offload in their infrastructure.

        Speaker: Andy Gospodarek (Broadcom)
      • 209
        Extending AF_XDP with hardware metadata

        AF_XDP is a relatively novel address family which builds on top of XDP and supports directly accessing low-level networking queues from userspace. It exposes raw packet headers and payload and bypasses most of the kernel stack. Recently, it gained support for NIC receive-side offloads, and I am actively working on exposing the transmit-side offloads as well. In this talk, I'll expand more on what has been done so far and what the future looks like.
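
        As one concrete building block of the receive-side work (a hedged sketch; driver support and the exact program wiring vary), an XDP program can query a hardware-provided RX hint through a metadata kfunc and hand it on to the AF_XDP consumer:

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          /* RX metadata kfunc; only works on drivers that implement the hint. */
          extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                                   __u64 *timestamp) __ksym;

          SEC("xdp")
          int rx_hints(struct xdp_md *ctx)
          {
                  __u64 ts;

                  if (bpf_xdp_metadata_rx_timestamp(ctx, &ts) == 0) {
                          /* 'ts' now carries the NIC RX timestamp; a real program would
                           * stash it in the metadata area for the AF_XDP consumer. */
                  }
                  return XDP_PASS;
          }

          char LICENSE[] SEC("license") = "GPL";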

        Speaker: Stanislav Fomichev (Google)
      • 13:00
        Lunch
      • 210
        connect() - why you so slow?!

        What happens when your application opens upwards of 50k connections to a single
        destination? Short answer - connect() syscall becomes slow. Cloudflare found out the
        hard way.

        Through this talk we would like to share our story of what we have learned about
        the connect() implementation for TCP in Linux, both its strong and weak sides: how
        connect() latency changes under pressure, and how to open connections so that the
        syscall latency is deterministic and time-bound.

        In this talk we would like to cover:

        • Why Cloudflare services sometimes experience pressure, where we need to open
          lots of connections to just one destination.
        • How we have been avoiding the connect() latency pitfall so far, and why it is
          no longer a viable option.
        • Our efforts to benchmark the connect() syscall and characterize its latency as
          the number of open connections increases.
        • Existing difficulties in tracing and monitoring connect() performance at scale
          in a production environment.
        • A look at how connect() is implemented in Linux for TCP; its evolution and
          previous attempts dealing with high-latency under pressure.
        • How to control how long connect() takes with existing Linux APIs - recipes for
          how to open TCP connections with predictable syscall latency (one commonly cited
          technique is sketched after this list).
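
        As a concrete illustration of the kind of recipe involved (a hedged sketch, not necessarily the approach the talk presents): IP_BIND_ADDRESS_NO_PORT lets an application pin the source address at bind() time while deferring local port selection to connect(), one widely documented way to keep connection setup predictable when many sockets share a destination.

          #include <netinet/in.h>
          #include <sys/socket.h>
          #include <unistd.h>

          int connect_from(const struct sockaddr_in *src, const struct sockaddr_in *dst)
          {
                  int one = 1;
                  int fd = socket(AF_INET, SOCK_STREAM, 0);

                  if (fd < 0)
                          return -1;
                  /* Pin the source IP, but let connect() pick the local port (Linux >= 4.2). */
                  setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));
                  bind(fd, (const struct sockaddr *)src, sizeof(*src)); /* sin_port == 0 */
                  if (connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
                          close(fd);
                          return -1;
                  }
                  return fd;
          }
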
        Speaker: Frederick Lawler (Cloudflare)
      • 211
        Container Networking: The Play of BPF and Network NS with different Virtual Devices

        In the container-centric ecosystem, achieving efficient network isolation without compromising on performance has become paramount. Not all containers require the stringent network isolation akin to VMs. Many can benefit from a more flexible approach, like using eBPF hooks, to mark and manage network traffic with QOS. This presentation delves into the application of cgroup-bpf based hooks (bind/connect/sendmsg) in crafting lightweight network isolation solutions.

        Yet, there exist cases where the above cgroup-bpf solution falls short. Here, we'll explore the indispensable role of network namespaces in ensuring robust network isolation, separating container traffic from the host network. But with stringent isolation, come challenges. Latency-sensitive applications can experience performance bottlenecks in this setup.

        To address these challenges, we have investigated various techniques such as veth, netkit, IPVLAN, and SR-IOV. Each has its merits and drawbacks in terms of performance, usability, configuration complexities, and kernel support requirements. Join us as we decode the intricacies of network isolation in a containerized world, offering insights for both novice and seasoned developers.

        Speakers: Martin Lau (Meta), Takshak Chahande (Meta)
      • 212
        Evolution of Direct Server Return (DSR) implementation for containerized applications

        The industry extensively relies on Direct Server Return (DSR), and Meta has a long history of employing this technology for L4 load balancing. At the same time, our fleet went through an evolution from an isolated subset of machines per team to a more efficient model with a single shared pool that provides multi-tenant capacity. Moving services to network namespaces became necessary to implement stackable workload solutions, allowing multiple services running on the same ports to be scheduled on the same host. Our approach to DSR has transformed together with those changes.

        Initially, network services running in datacenters used a "rootlet", a system-wide XDP program array used to jump sequentially from one XDP program to another. It is an in-house solution for attaching XDP programs, but it has limitations that make it work poorly with shared hosts. To migrate traffic services to shared hosts, we built XDP Chainer, an in-house solution for attaching multiple XDP programs to a single interface, which also addresses some of the shortcomings of the "rootlet"-based solution.

        However, the introduction of multi-tenancy and network namespaces has brought new challenges, requiring a reevaluation of how we facilitate decapsulation support for backends. This evolution involves migrating the decapsulation data path to a TC-BPF solution. This approach is tightly integrated with Meta's internal container orchestration and requires no configuration and fewer privileges from users.

        During our presentation, we will address the challenges encountered, discuss the alternatives considered, describe the wins that we achieved with the new approach, and reflect on the lessons learned throughout the duration of this project.

        Alternative Titles:
        * Direct Server Return (DSR) and Multi-tenancy: obstacles and solutions
        * Adaptation of BPF implementation for Direct Server Return in response to containers evolution

        Speakers: Lalit Gupta (Meta), Pavel Dubovitsky, Raman Shukhau
      • 16:00
        Break
      • 213
        SYN Proxy at Scale with BPF

        SYN Cookie is a technique used to protect servers from malicious connection requests. Under SYN flood, the Linux TCP stack encodes the client information into the initial sequence number (ISN) of the SYN+ACK, which is called a SYN Cookie, and decodes it from the ACK of the 3WHS, so that the kernel can release resources for the connection and stay stateless during the 3WHS.

        For security reasons, the SYN Cookie is calculated with some host-specific secrets, so only the generator can validate the cookie. Even with SYN Cookies, intermediate nodes between the client and the server must keep tracking the connection, so SYN Cookies still consume resources and are NOT stateless from the network's point of view.
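
        Conceptually (a simplified sketch, not the kernel's exact layout or hash), the cookie folds a coarse timestamp counter, an encoded MSS index and a keyed hash of the connection 4-tuple into the 32-bit ISN, so the ACK's acknowledgment number alone is enough to reconstruct and verify the connection parameters:

          #include <linux/types.h>

          /* keyed_hash() and 'secret' stand in for the kernel's host-specific MAC. */
          extern u32 keyed_hash(u32 saddr, u32 daddr, u16 sport, u16 dport,
                                u32 count, u32 secret);
          extern u32 secret;

          static u32 make_syncookie(u32 saddr, u32 daddr, u16 sport, u16 dport,
                                    u32 client_isn, u32 count, u32 mss_index)
          {
                  u32 mac = keyed_hash(saddr, daddr, sport, dport, count, secret);

                  /* high bits: coarse timestamp counter; low 24 bits: MAC folded
                   * with the encoded MSS index */
                  return client_isn + (count << 24) + ((mac + mss_index) & 0x00ffffff);
          }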

        SYN Proxy reduces such unwanted resource allocation by handling the 3WHS at the edge network. After the 3WHS with a client, a SYN Proxy generally restores the initial SYN packet from the SYN Cookie and forwards it to the backend server to initiate the 3WHS there. However, the ISN in the server's SYN+ACK is selected randomly and does not match the SYN Cookie. To be transparent to the client and the server, the SYN Proxy must keep the ISN mapping and fix up the SEQ/ACK numbers in all packets. This solution is also not stateless and does not scale well for a service with high bandwidth.

        This talk will cover
        - what the kernel encodes into SYN Cookie
        - our stateless SYN Proxy and kernel module
        - ongoing effort to add a new kfunc to replace the module

        Speaker: Kuniyuki Iwashima (Amazon Web Services)
      • 214
        bpfilter: a BPF-based packet filtering framework

        For a significant period, bpfilter wasn't more than an empty usermode helper and an abandoned patch series. However, it has recently undergone active development as a userspace daemon, which can be found on GitHub at https://github.com/facebook/bpfilter. This daemon now offers userspace services a swift and user-friendly interface to generate packet-filtering BPF programs dynamically. This discussion aims to provide further insights into bpfilter, including its current capabilities, performance, and ongoing development efforts.

        Speaker: Quentin Deslandes (Meta)
      • 215
        Blinking Lights, getting it wrong again, again and again

        RJ45 Ethernet sockets often come with a couple of LEDs. Front panels
        of STBs, cable modems, WiFi access points and switches have LEDs. They
        give some representation of what is happening in the network: link,
        link speed, RX or TX of frames, etc.

        How these LEDs are configured was until recently a big problem. Many
        patches have been NACKed, all repeating the same problem, again and
        again.

        This talk will first look at the existing hacks used in device tree to
        describe how PHY LEDs should be configured, and at the NACKed patches
        trying to add more hacks. Why these are hacks will be explained.

        A step back will then be taken to look at the bigger problem. Why is
        everybody repeating the same mistake? And how the same or similar
        problem happens in other parts of netdev. How can these problems be
        avoided when adding a new feature to the network drivers?

        The last part of the talk will quickly look at the new APIs added
        recently for configuring PHY- and MAC-driven LEDs. This should give an
        introduction as to how driver writers can add support for their devices'
        LEDs and how users can configure these LEDs.

        Speaker: Andrew Lunn
    • 13:00
      Lunch "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83
    • 13:00
      Lunch "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82
    • Internet of Things MC "James River Salon A" (Omni Richmond Hotel)

      "James River Salon A"

      Omni Richmond Hotel

      82
      • 216
        Linux-wpan updates
        • team maintainership
        • recent and upcoming features: mlme, beacons, scanning, associations, peers
        • basic parts for a pan coordinator userspace service
        • wpanusb generic specification challenges (phy layers, channels, multi-band, hw
          offload, ...)
        • link-layer security: status and problems
        Speaker: Mr Stefan Schmidt
      • 217
        TSCH@Zephyr: IEEE 802.15.4 SubG IIoT in the Making

        Zephyr's native IEEE 802.15.4 L2 is a hidden treasure: It supports a much larger variety of SoCs, vendors and PHYs than its more popular OpenThread counterpart. Native L2 not only runs the common 2.4G O-QPSK modulation but also has rich SubG support on all regional bands, from legacy BPSK all the way to SUN O-QPSK, FSK and OFDM and even initial support for HRP UWB. The latter is increasingly hot as mobile manufacturers converge around 802.15.4z/FiRa for precision UWB indoor localization. When I realized this huge potential I immediately wanted to leverage it for industrial use cases. That's when the TSCH@Zephyr project was born in late 2022.

        TSCH is IETF/IEEE's open contender to the proprietary WirelessHART standard (and to some extent to ISA 100.11a): a reliable and available wireless (RAW), low-power, deterministic real time protocol, relevant to wireless industrial automation, TSN and distributed battery-driven IIoT sensor networks.

        This BoF presents the current state of affairs wrt TSCH, SubG and distributed clocks @ Zephyr. We'll run through solved and unsolved challenges on the way to support a precision TDMA protocol on Zephyr's TI CC13/26xx driver, look at related driver API changes and at some of the underpinning conceptual work re precision distributed clocks. The latter are a cornerstone of an embedded RTOS that wants to provide re-usable primitives for all kinds of precision timing applications like ranging, PTP, 15.4 superframes/DSME/LE, TSN/DetNet, industrial ethernet/SERCOS/Profibus/... or the upcoming 5G/6G RAW extensions.

        Speaker: Chris Friedt (Meta)
      • 218
        Zephyr Retro-and-Prospective: Project Growth, Long Term Support, and Linux Interoperability

        Zephyr has been a part of the Linux Plumbers IoT Microconference since the first year in 2019. Needless to say, much has happened in that short period of time.

        Increasingly more devices are shipping with Zephyr. More companies are becoming members. More devices are compatible out-of-the-box with Linux (and macOS, and Windows). The Internet of Things is made of devices both big and small - from Edge devices to The Cloud. Zephyr usage has skyrocketed from personal BLE monitors, to Industrial IoT, all the way to some of the highest-throughput datacenter accelerators that power The Internet.

        While we love to see Linux and Zephyr working in concert, industry collaboration and standards have enabled interoperability with all major operating systems and several Real-Time Operating Systems.

        This will be a Lightning-Talk style recap outlining the rapid growth that we have seen, major features added, standards supported, and problems solved, in large part due to you!

        We'll touch on what went great (and not-so-great), provide pointers for developers looking to transition from Zephyr LTSv2 to the upcoming LTSv3, and offer a glimpse into what is on the horizon for the Linux and Zephyr IoT ecosystem.

        Speaker: Chris Friedt (Meta)
      • 16:00
        Break
      • 219
        Shared FPU Support in Zephyr for ARM64 and RISC-V

        Computers are good at doing computations, obviously. But this is not that simple when floating point numbers are involved. Many processors implement a dedicated floating point unit (FPU) to perform computations on such numbers much faster compared to using the regular arithmetic logic unit (ALU).

        FPU usage is not free though, especially when an operating system is involved whose purpose is to arbitrate resource usage amongst competing computing tasks. The FPU context may be quite large and simply preserving and restoring it across task switches, just like with the ALU context, may represent a significant overhead we want to avoid when possible, especially on an RTOS such as Zephyr. But adding smartness to the FPU context switching does come with its share of challenges and surprises.

        In this presentation we'll quickly review the IEEE 754 floating point standard in the context of ARM64's and RISC-V's FPUs. Then we'll look in greater detail at Zephyr's FPU sharing support for those architectures, the design rationale, as well as some interesting snags the implementation had to deal with.

        Speaker: Nicolas Pitre (BayLibre)
      • 220
        Challenges in Device Tree Sync - kernel, Zephyr, U-boot, System DT

        The description of hardware through Device Tree, which includes Firmware in some instances, has become an increasingly common practice in many software ecosystems. However, despite various efforts, the device tree description of hardware has yet to be standardized across different software ecosystems, creating challenges for users, automated tools, and ecosystems.

        The objective of this session is to:
        a) Share experiences of device tree challenges seen with U-Boot, Zephyr, and the kernel in the recent attempts to support Texas Instruments' AM625 and TDA4VM platforms
        b) Explain the rationale for, and challenges created by, the diverse approaches
        c) Propose a hybrid approach toward Zephyr device tree support for TI platforms

        Speaker: Nishanth Menon (Texas Instruments, Inc)
      • 221
        Breaking Barriers: Arduino Core API advancements in Zephyr, Linux and IoT Systems

        This presentation will provide an overview of the Arduino Core API and Zephyr RTOS, and explain how their integration can simplify and streamline IoT development. We will cover the advantages of using the Arduino programming model with Zephyr, and how it can benefit developers by providing access to a wide range of pre-built functions and modules. The presentation will also cover the key features of the Arduino Core API for Zephyr RTOS, including digital and analog input/output, serial communication, and peripheral interfaces. We will discuss how these features can be used to create real-time applications with reduced development time and complexity.
        There is still scope to achieve an even more seamless experience for beginners by integrating it with PlatformIO or the Arduino IDE. However, the approach of how to tie Zephyr, the Arduino core module and PlatformIO together needs to be discussed further: what the ideal way to do this would be, and whether there are other, better platforms to target instead.
        We will also explore how one can leverage a Linux host machine as a CI tool to enable development and testing of Arduino application code with the help of the native_posix target. This can help Arduino code projects test and validate their code faster and in a simpler fashion. No clear way to do this exists today, and this too is a topic that could garner some attention.
        There is also room for improving the Arduino Core API support to include SPI, CAN and USB implementations. There is also an opportunity to leverage the excellent BLE stack in Zephyr in an Arduino-friendly way, using something like ArduinoBLE-compatible calls. The talk will cover a few approaches to tackling these challenges and hopes to get suggestions and reviews from the community.

        Speaker: Dhruva Gole
    • Rust MC "James River Salon B" (Omni Richmond Hotel)

      "James River Salon B"

      Omni Richmond Hotel

      83

      LPC 2023 will host the second edition of the Rust MC. This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics. Proposals can be submitted via LPC submission systems, selecting the Rust MC track.

      Rust is a systems programming language that is making great strides in becoming the next big one in the domain. Rust for Linux is the project adding support for the Rust language to the Linux kernel.

      Rust has a key property that makes it very interesting as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc. It also provides other important benefits, such as improved error handling, stricter typing, sum types, pattern matching, privacy, closures, generics, etc.

      Possible Rust for Linux topics:

      • Rust in the kernel (e.g. status update, next steps...).
      • Use cases for Rust around the kernel (e.g. subsystems, drivers, other modules...).
      • Discussions on how to abstract existing subsystems safely, on API design, on coding guidelines...
      • Integration with kernel systems and other infrastructure (e.g. build system, documentation, testing and CIs, maintenance, unstable features, architecture support, stable/LTS releases, Rust versioning, third-party crates...).
      • Updates on its subprojects (e.g. klint, pinned-init...).

      Possible Rust topics:

      • Language and standard library (e.g. upcoming features, stabilization of the remaining features the kernel needs, memory model...).
      • Compilers and codegen (e.g. rustc improvements, LLVM and Rust, rustc_codegen_gcc, Rust GCC...).
      • Other tooling and new ideas (bindgen, Cargo, Miri, Clippy, Compiler Explorer, Coccinelle for Rust...).
      • Educational material.
      • Any other Rust topic within the Linux ecosystem.

      Last year was the first edition of the Rust MC and the focus was on showing the ongoing efforts by different parties (compilers, Rust for Linux, CI, eBPF...). Shortly after the Rust MC, Rust got merged into the Linux kernel. Abstractions are getting upstreamed, with the first major drivers looking to be merged soon: Android Binder, the Asahi GPU driver and the NVMe driver (presented in that MC).

      • 222
        Klint: Compile-time Detection of Atomic Context Violations for Kernel Rust Code

        The unique demands of the Linux kernel often blur the lines between safety and correctness: a prime example is the potentially hazardous act of sleeping inside an atomic context. While at first glance it may seem to be merely a correctness concern, in scenarios involving an RCU read lock it could escalate into a safety violation by leading to use-after-free issues. Addressing these concerns through safe APIs often involves runtime costs or suffers from ergonomic issues, making them less favourable for kernel work. Klint is a specialized tool designed to catch such violations at compile time. It aims to use simple and easy-to-understand rules to generate useful and developer-friendly diagnostics.

        Speaker: Dr Gary Guo
      • 223
        pin-init: Solving Address Stability in Rust

        Address stability is required for a lot of kernel structures. For example, linked lists require the elements to have a stable address for as long as the elements are part of the list. Not complying with this requirement can result in memory safety issues.
        Rust aims to prevent all such issues, and therefore it prevents programmers from moving certain memory. When combining stable-address requirements with initialization, Rust currently does not natively provide a way to initialize values with stable addresses using no unsafe code. Since one of the goals of Rust-for-Linux is to prevent memory issues in the kernel, the amount of unsafe code should be minimized.
        As types with stable addresses are plentiful in the kernel, I have created an API that allows users to use only safe code to initialize values in-place and have the guarantee that the address will not change later.
        This talk covers the underlying issue, the current solution used in the kernel, and the problems that still have to be solved in the future.
        If you are interested in the specifics of the API, please see the Rust-for-Linux website.

        Speaker: Benno Lossin
      • 224
        Coccinelle for Rust

        Coccinelle is a tool for making widespread searches and changes in source code. Coccinelle was originally developed for C code and has been extensively used in the Linux kernel. We are working on porting Coccinelle to Rust, based on the infrastructure provided by Rust Analyzer. This talk will present the current state of the tool, with the goal of getting feedback from developers interested in Rust for Linux.

        Speaker: Julia Lawall (Inria)
      • 225
        Using Rust in the binder driver

        In this joint talk by the maintainers of C binder and Rust binder, we will give motivation for why the binder driver is a good place to use Rust, and we will discuss the experience of using Rust in the Linux kernel.

        Please see the RFC for more information: https://lore.kernel.org/rust-for-linux/20231101-rust-binder-v1-0-08ba9197f637@google.com/

        Speakers: Alice Ryhl (Google), Carlos Llamas
      • 16:10
        Break
      • 226
        Block Layer Rust API

        Rust for Linux has brought in Rust as a second programming language in the Linux
        Kernel. The Rust for Linux project is making good progress towards building a
        general framework for writing Linux kernel device drivers in safe Rust. The
        block layer is still missing necessary plumbing to achieve this goal.

        In this talk we report on the work that we have done in the block layer to enable
        Rust drivers. We showcase two drivers using this work: the Rust null_blk driver
        and the Rust NVMe driver. We present results with performance analysis, and we
        discuss challenges we encountered during performance optimizations.

        We also talk about the challenges we encountered while attempting to upstream the
        Rust block layer API, and potential paths forward for the project.

        Speaker: Mr Andreas Hindborg (Samsung)
      • 227
        Rust in V4L2: a status report

        We will discuss the status of Rust support in the Media subsystem in light of the patchset with initial support sent by the presenter, together with the current and planned features, as well as the points raised by the V4L2 maintainers during face-to-face discussions at the Media Summit 2023.

        The objective is to discuss and identify ways to introduce Rust support to a large subsystem given some of the roadblocks presented at the Media Summit while discussing a sustainable maintainership model for the Rust abstractions. Another key objective is to discuss the topic of C APIs for Rust in the kernel as a way to offer access to Rust code to existing drivers in V4L2.

        Speaker: Daniel Almeida (Collabora)
      • 228
        Converting a DRM driver to Rust

        Programming in C and shifting to Rust can be a hard challenge for an old-school C programmer. Rust for Linux introduced a new programming paradigm to the Linux Kernel and this means that C programmers like me need to shift our mindset. I'll share my view on the matter after rewriting the VGEM DRM driver in Rust during my Igalia Coding Experience in the summer: the view of a C programmer and a beginner in Rust.

        In this talk, I will discuss the challenges for a C programmer to write a Rust kernel driver, addressing the use of the DRM bindings, developed by Asahi Lina, the performance of the Rust driver, the benefits of the Rust features for driver development, and the roadmap to convert a C DRM driver to a Rust DRM driver.

        Speaker: Mrs Maíra Canal (Igalia S.L.)