Linux Plumbers Conference 2021

US/Pacific
Description

20-24 September, held virtually

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2021 will be held virtually (as it was in 2020). We are looking forward to seeing you online!

    • BPF & Networking Summit Networking and BPF Summit/Virtual-Room (LPC Virtual)

      Networking and BPF Summit/Virtual-Room

      LPC Virtual

      150

      The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 1
        BPF & Networking Opening Session

        Short intro/welcome session to the Networking and BPF track.

        Speakers: Daniel Borkmann (Isovalent), David Miller (Red Hat), Alexei Starovoitov (Facebook), Andrii Nakryiko (Facebook), Jakub Kicinski (Facebook), Eric Dumazet (Google)
      • 2
        BPF tracing: exploring additional debugging capabilities

        Sometimes it is necessary to use tracing - rather than traditional kernel debuggers - to investigate kernel issues. Some problems, such as races, are inherently time-sensitive, so minimally invasive tracing is ideal in those cases. However, debuggers have capabilities that BPF-based tracing does not (or has only recently acquired): printing data structures, tracking local variable values, visibility into inlined functions, etc. In fact, looking at gdb's capabilities is a great way to explore possibilities for future tracing functionality. Here we explore some of those possibilities and their potential costs and benefits, and seek feedback and discussion with the community on the value of the approaches explored.

        Speaker: Alan Maguire (Oracle)
      • 3
        Use of eBPF in cpu scheduler

        eBPF has been used extensively in performance profiling and monitoring. In this talk, I will describe a set of eBPF applications that help monitor and enhance CPU scheduling performance (a minimal latency-tracing sketch follows the list below). These applications include:

        • Profiling scheduling latencies. I will talk about an application of eBPF to collect scheduling latency stats.

        • Profiling resource efficiency. For background, I will first introduce core scheduling, a scheduler feature developed to mitigate the L1TF CPU vulnerability. Then I will introduce the eBPF ksym feature that enables this application, and describe how eBPF can help report forced idle time, a type of CPU usage inefficiency caused by core scheduling.

        • The third application of eBPF is to assist userspace scheduling. ghOSt is a framework open sourced by Google to enable general-purpose delegation of scheduling policy to userspace processes in a Linux environment. ghOSt uses BPF acceleration for policy actions that need to happen closer to scheduling edges. We use this to maximize CPU utilization (pick_next_task), minimize jitter (task_tick elision) and control tail latency (select_task_rq on wakeup). We are also experimenting with BPF to implement a scaled-down variant of the scheduling policy while upgrading the main userspace ghOSt agent.
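
        As an illustration of the latency-profiling pattern above, here is a minimal,
        hypothetical sketch (not the speakers' code) of a libbpf-style BPF program that
        records the wakeup-to-run delay per task, in the spirit of tools such as runqlat.
        It assumes a BTF-enabled kernel and a vmlinux.h generated from it.

            /* Sketch: runqueue latency via the sched_wakeup/sched_switch BTF
             * tracepoints; map size and message format are arbitrary. */
            #include "vmlinux.h"
            #include <bpf/bpf_helpers.h>
            #include <bpf/bpf_tracing.h>

            struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 10240);
                __type(key, u32);    /* pid */
                __type(value, u64);  /* wakeup timestamp, ns */
            } wakeup_ts SEC(".maps");

            SEC("tp_btf/sched_wakeup")
            int BPF_PROG(on_wakeup, struct task_struct *p)
            {
                u32 pid = p->pid;
                u64 ts = bpf_ktime_get_ns();

                bpf_map_update_elem(&wakeup_ts, &pid, &ts, BPF_ANY);
                return 0;
            }

            SEC("tp_btf/sched_switch")
            int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
                         struct task_struct *next)
            {
                u32 pid = next->pid;
                u64 *tsp = bpf_map_lookup_elem(&wakeup_ts, &pid);

                if (tsp) {
                    /* Real tools aggregate a histogram; just print here. */
                    bpf_printk("pid %d ran after %llu ns on the runqueue",
                               pid, bpf_ktime_get_ns() - *tsp);
                    bpf_map_delete_elem(&wakeup_ts, &pid);
                }
                return 0;
            }

            char LICENSE[] SEC("license") = "GPL";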

        Speakers: Hao Luo (Google), Barret Rhoden (Google)
      • 4
        Socket migration for SO_REUSEPORT

        This talk presents our recent work available in the v5.14 kernel, which improves the SO_REUSEPORT functionality.

        The SO_REUSEPORT option was introduced in v3.9. Before then, only one socket was allowed to listen() on any given TCP port. The traditional technique for a high-performance server is to have a single process that accept()s and distributes connections to other processes, or to have multiple processes that accept() connections from the same single socket. However, the accept() syscalls on a single listen()ing socket can become a bottleneck. The SO_REUSEPORT option allows multiple sockets to listen() on the same port and addresses that bottleneck.

        If the option is enabled, the kernel distributes connections evenly across the listen()ing sockets as SYN packets arrive. Once the kernel has committed a connection to a listen()ing socket, that assignment does not change later. Thus, when a listen()ing socket is close()d, connections that have not yet been accept()ed are aborted even if other sockets still listen() on the same port.
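
        For reference, a minimal sketch (not from the talk; the port number is
        arbitrary) of how multiple listeners share one port with SO_REUSEPORT:

            #define _GNU_SOURCE
            #include <arpa/inet.h>
            #include <netinet/in.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/socket.h>
            #include <unistd.h>

            /* Create one SO_REUSEPORT listener; each worker typically calls this. */
            static int make_listener(unsigned short port)
            {
                int one = 1;
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                struct sockaddr_in addr;

                setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_addr.s_addr = htonl(INADDR_ANY);
                addr.sin_port = htons(port);

                if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) || listen(fd, 128)) {
                    perror("bind/listen");
                    close(fd);
                    return -1;
                }
                return fd;
            }

            int main(void)
            {
                /* The kernel spreads incoming connections across both listeners.
                 * Closing one of them today aborts its not-yet-accept()ed
                 * connections, which is the problem the talk addresses. */
                int a = make_listener(8080);
                int b = make_listener(8080);

                printf("listeners: fd %d and fd %d\n", a, b);
                close(a);
                close(b);
                return 0;
            }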

        This talk shows how the SO_REUSEPORT mechanism works with SYN processing, when it causes connection failures, how we can work around it with BPF, and how we address it with the new socket migration feature and the extension of BPF.

        Speaker: Kuniyuki Iwashima (Amazon)
      • 5
        bpf: mass attachment of tracing probes

        There's currently a kernel-side bottleneck when attaching a probe
        to multiple functions, which makes tools like bpftrace or retsnoop
        suffer in such use cases - it takes forever to attach a single
        probe to many functions ;-)
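
        To make the problem concrete, here is a hypothetical sketch of the usage
        pattern that hits the bottleneck with today's libbpf API (program and function
        names are made up): one perf event and one link per kernel function, i.e.
        thousands of round trips for thousands of functions.

            #include <bpf/libbpf.h>

            /* Attach the same kprobe program to every function in the list,
             * one bpf_program__attach_kprobe() call at a time. */
            static int attach_all(struct bpf_program *prog,
                                  const char **funcs, int nr_funcs)
            {
                for (int i = 0; i < nr_funcs; i++) {
                    struct bpf_link *link;

                    link = bpf_program__attach_kprobe(prog, false /* entry */,
                                                      funcs[i]);
                    if (libbpf_get_error(link))
                        return -1;  /* error handling simplified */
                }
                return 0;
            }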

        After multiple discussions and many failed attempts, it looks
        like we are on track to have a working solution, which consists
        of multiple gradual changes in the ftrace and bpf code.

        In this presentation I'll show and explain why the current code is
        slow, introduce the proposed solution to the problem, and cover the
        status of the implementation.

        Speaker: Jiri Olsa (Red Hat)
      • 6
        BPF Map Tracing: Hot Updates of Stateful Programs

        In this talk I will outline a facility for tracing BPF map updates which might be used to perform zero downtime upgrades of stateful programs.

        Map updates cannot currently be natively traced within BPF. I propose a set of kernel changes where tracing programs can be attached to individual maps. These programs run in response to particular operations: one might run on update, and another on deletion.
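
        As a point of reference, a sketch (values are hypothetical) of the kind of plain
        userspace update that nothing BPF-side can observe today; under the proposal, a
        tracing program attached to the map would run for each such update or deletion.

            #include <bpf/bpf.h>
            #include <stdio.h>

            int main(void)
            {
                __u32 key = 1;
                __u64 value = 42;
                int fd = bpf_map_create(BPF_MAP_TYPE_HASH, "state_map",
                                        sizeof(key), sizeof(value), 1024, NULL);

                if (fd < 0 || bpf_map_update_elem(fd, &key, &value, BPF_ANY)) {
                    perror("map create/update");
                    return 1;
                }
                /* The write above succeeded, but no tracing hook fired. */
                printf("updated key %u\n", key);
                return 0;
            }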

        This facility seems like it should be broadly useful, but it was designed with a specific use case in mind. We would like to be able to migrate state between two versions of a set of programs, and swap between the two versions with zero downtime. By tracing updates on the original set of maps, I believe that we can achieve this goal.

        Please see the attached paper for a deeper discussion of the problem and proposed solution.

        Speaker: Joe Burton (Google)
    • Containers and Checkpoint/Restore MC Microconference2/Virtual-Room (LPC Virtual)

      Microconference2/Virtual-Room

      LPC Virtual

      150

      The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.

      Contributions to the microconference are expected to be problem statements, new use cases, and feature proposals in both kernel and userspace.

      • 7
        Opening Session
        Speaker: Stéphane Graber (Canonical Ltd.)
      • 8
        Simplified user namespace allocation

        The user namespace currently relies on mapping UIDs and GIDs from the initial namespace (full uint32 range) into the newly created user namespace. This is done through the uid_map/gid_map files, with the kernel allowing a process to map its own uid/gid and otherwise requiring a privileged process to write a more complete map.
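
        A minimal sketch of that current interface (not from the talk): an unprivileged
        process creates a user namespace and maps only its own uid to root inside it;
        anything beyond that needs a privileged helper such as newuidmap.

            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <sched.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            int main(void)
            {
                uid_t outside_uid = getuid();   /* uid in the parent namespace */
                char buf[64];
                int fd;

                if (unshare(CLONE_NEWUSER)) {
                    perror("unshare");
                    return 1;
                }

                /* "inside-uid outside-uid count": map our own uid to 0. */
                snprintf(buf, sizeof(buf), "0 %d 1\n", outside_uid);
                fd = open("/proc/self/uid_map", O_WRONLY);
                if (fd < 0 || write(fd, buf, strlen(buf)) < 0) {
                    perror("uid_map");
                    return 1;
                }
                close(fd);

                printf("uid inside the namespace: %d\n", getuid());  /* 0 */
                return 0;
            }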

        As more and more software (not just container managers) makes use of user namespaces, a few patterns and problems have started to emerge:

        • For maximum security, containers should get their own, non-overlapping uid/gid range. Non-overlapping here means no shared uid/gid with any other container on the system, and no shared uid/gid with any user/processes running at the host level. Doing this prevents issues with configuration tied to a particular uid/gid which may cross container boundaries (user limits are/were known to do that), and also prevents issues should the container somehow get access to another's filesystem.

        • Network authentication as well as dynamic user creation is causing more "high" uids/gids to be used than before. So while it was perfectly fine to give a contiguous range of 65536 uids/gids to a container before, nowadays this often needs to be extended to 1-2M to support network authentication, and may need a bunch of mapped uids/gids right at the end of the 32bit range to handle some temporary dynamic users too.

        • Coordination of uid/gid range ownership was in theory done through shadow's /etc/subuid, /etc/subgid and the newuidmap/newgidmap helpers. In practice those have only ever really been used when containers are created/started by non-root users, and are generally ignored by any tool which operates as root. This effectively means there's no coordination going on today and it's very easy to accidentally assign a uid/gid to a user namespace which may in fact be used by the host system or by another user namespace.

        The bulk of those constraints come from the fact that the user namespace has to deal with filesystem access and permissions. A uid in the container is translated to a real uid, and that is what's used to access the VFS. If a uid cannot be translated, it's invalid and can't be used.

        But now that we have a VFS layer feature to support ID shifting, we may be able to decouple the two and allow for user namespaces that have access to the full uid/gid range (uint32) yet are still technically using completely distinct kuid/kgid from everything else. This would then rely on the use of VFS based shifting for any case where data must be accessed from outside of that namespace.

        This kind of approach would make it trivial to allocate new user namespaces, would drop the need for coordination and would avoid conflicts with host uids/gids. Anyone could safely get a user namespace with access to all uids/gids but restricted to virtual filesystems. Then, with help from a privileged helper, they could get specific mounts mapped into their namespace, allowing for VFS operations with the outside world.

        There are some interesting corner cases though, like what to do when transferring user credentials across namespace boundaries: how do we render things in the parent user namespace, whether it's the ownership of a process or the data in a ucred?

        Speakers: Stéphane Graber (Canonical Ltd.), Christian Brauner
      • 9
        Cgroup v1/v2 Abstraction Layer

        The Oracle database offers a long-term-stable version that is supported and
        maintained for many years. But as Linux distributions slowly transition
        from cgroup v1 to cgroup v2, this creates a challenge for the DB. cgroup v1
        and cgroup v2 have different interfaces and best practices.

        This talk will discuss the current status of the cgroup abstraction layer, how
        applications like the Oracle database plan to use it, and gather/discuss other
        users' requirements for this layer.

        • cgroup v2 support added to libcgroup - DONE [1]
        • cgroup v1/v2 abstraction layer - OUT FOR REVIEW [2]
        • This layer hides the underlying details of cgroup v1 vs v2 (e.g. cpu.weight
          vs cpu.shares). The user can make a request in v1 or v2 format, and the layer
          will do the proper translation (a sketch of one such translation follows the
          links below). The user still needs to have intimate cgroup and system knowledge.
        • Higher-level abstraction layer - GATHERING REQUIREMENTS
        • The goal of this layer is to abstract away the gory cgroup details and
          let the user specify their needs (e.g. 2 CPUs that are side channel attack
          safe). This layer should also handle interactions with systemd (dbus,
          cgroup delegation, etc.)

        [1] https://github.com/libcgroup/libcgroup/releases/tag/v2.0
        [2] https://sourceforge.net/p/libcg/mailman/message/37317688/
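
        For illustration, a sketch of the sort of translation such a layer performs: the
        cpu.shares-to-cpu.weight mapping below is the one used by OCI runtimes; the
        libcgroup abstraction layer may differ in detail.

            #include <stdint.h>
            #include <stdio.h>

            /* Map cgroup v1 cpu.shares (2..262144) onto v2 cpu.weight (1..10000). */
            static uint64_t shares_to_weight(uint64_t shares)
            {
                if (shares == 0)
                    return 0;  /* unset stays unset */
                return 1 + ((shares - 2) * 9999) / 262142;
            }

            int main(void)
            {
                uint64_t examples[] = { 2, 1024, 262144 };

                for (int i = 0; i < 3; i++)
                    printf("cpu.shares %6llu -> cpu.weight %5llu\n",
                           (unsigned long long)examples[i],
                           (unsigned long long)shares_to_weight(examples[i]));
                return 0;
            }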

        Speaker: Tom Hromatka
      • 07:55
        Break
      • 10
        Mount-v2 CRIU migration engine: status update

        Mount checkpoint/restore is an important part of CRIU: it is responsible for
        the consistency of the file system view of dumped processes. In the current
        state we can only restore simple mount configurations; anything more complex
        would either make us fail or, which is even worse, make us create a wrong file
        system view for the restored processes.

        In CRIU we only see the final state, the result of possibly multiple kernel API
        calls inside a container, and on restore we need to recreate a sequence of
        calls which would lead to the exact same state. In general this task can be
        very complex, so sometimes the only way is to simplify the API so that it
        becomes easier to restore all possible configurations.

        Last year [1] we discussed a variety of problems CRIU faces with mounts, the
        most important ones related to mount propagation, and how to simplify mount
        propagation configuration so that even complex setups can be re-created simply
        and correctly.
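
        As a tiny illustration (paths are invented, not from the talk) of the
        propagation state involved: a bind mount is created and marked shared, so later
        mounts under it propagate to its peers. CRIU only sees the resulting tree in
        mountinfo and has to infer a sequence of calls like this that reproduces the
        same peer groups, masters and slaves on restore.

            #define _GNU_SOURCE
            #include <stdio.h>
            #include <sys/mount.h>

            int main(void)
            {
                if (mount("/srv/data", "/mnt/a", NULL, MS_BIND, NULL) ||
                    mount(NULL, "/mnt/a", NULL, MS_SHARED, NULL)) {
                    perror("mount");
                    return 1;
                }
                printf("/mnt/a is now a shared bind mount of /srv/data\n");
                return 0;
            }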

        This talk will start by showing more complex mount configurations to
        demonstrate that we still need an API change. Then there will be a status
        update on kernel patch progress and changes that were made during the last
        year, followed by a discussion on how to make the patch [2] mergeable into the
        upstream kernel.

        Here are links:

        • [1] https://www.linuxplumbersconf.org/event/7/contributions/640/
        • [2] https://lore.kernel.org/linux-api/20210715100714.120228-1-ptikhomirov@virtuozzo.com/
        • mount-v2 draft for criu - https://github.com/Snorch/criu/commits/mount-v2-poc
        • mount-v2 for VZ criu - https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c

        Thanks to Andrei Vagin and Christian Brauner for their great help with this!

        Speaker: Pavel Tikhomirov (Virtuozzo)
      • 11
        Secrets in cloned snapshots

        Starting things is slow. Even if only 1 second slow, saving 1s on a million container restores means we can save 11 days of useless work that every container will perform identically.

        That's where snapshots come in. Snapshots in theory allow us to save an initialized container once, and then restore it a million times with less overhead than cold starting it.

        Unfortunately, Linux applications (and the kernel in VM based container setups) expect that during their lifetime they don't get cloned from the outside. Applications create user space PRNGs (Pseudo Random Number Generators) which would generate identical random numbers after a clone. They create UUIDs that would no longer be unique. They generate unique temporary key material that is no longer unique.
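
        A toy illustration of the PRNG problem (mine, not from the session): the seed is
        gathered once at startup, so every clone of a snapshot taken after that point
        produces exactly the same "random" sequence.

            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>
            #include <unistd.h>

            int main(void)
            {
                /* Seeded once at startup... */
                srand((unsigned)time(NULL) ^ (unsigned)getpid());

                /* ...snapshot is taken here and later restored N times... */

                for (int i = 0; i < 3; i++)
                    /* ...so every clone prints the same three numbers. */
                    printf("token piece: %d\n", rand());
                return 0;
            }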

        Eventually, if we want to enable cloning properly, user space applications will need to learn that they have to adapt to clone events. For that they need notifications.

        This session will discuss the requirements such a notification mechanism has as well as possible paths forward to implement it and drive adoption.

        References:

        • https://github.com/systemd/systemd/issues/19269
        • https://lkml.org/lkml/2021/3/8/677
        Speakers: Alexander Graf, Mr Adrian Catangiu
      • 09:00
        Break
      • 12
        Fast Checkpoint Restore for GPUs

        We recently announced our work to support Checkpoint and Restore with AMD GPUs. This was the first time a device plugin was introduced, and it deals with one of the most complex devices on the system, i.e. the GPU. We made several changes to CRIU, introduced new plugin hooks and reported some issues with CRIU.

        https://github.com/RadeonOpenCompute/criu/tree/amd-criu-dev-staging/plugins/amdgpu#readme

        While we faced several new challenges to enable this work, we were finally able to support real tensorflow/pytorch workloads across multi-GPU nodes using CRIU, and were also able to migrate containers running GPU-bound workloads. We have another proposed talk where we'll talk about the bigger picture, but in this talk we'd like to focus on our journey, which started with a small 64KB buffer object in GPU VRAM and ended with gigabytes of single VRAM buffer objects across GPUs. We started with the /proc/pid/mem interface initially and then switched to a faster direct approach that only worked with large-PCIe-BAR GPUs, but that was still slow. For instance, to copy 16GB of VRAM, it used to take ~15 mins with the direct approach on large BARs and more than 45 mins with small BARs. We then switched to using the system DMA engines built into most AMD GPUs and this resulted in very significant improvements: we can now checkpoint the same amount of data within 10 seconds. For this we initially modified libdrm, but the maintainers didn't agree to change a private API to expose GEM handles to userspace, so we finally ended up making a kernel change, exporting the buffer objects in VRAM as DMABUF objects and then importing them in our plugin using libdrm.

        We would also like to talk about how we further optimized it using multithreading, and about our experience with compression using criu-image-streamer to save further time and space. We also encountered limitations of Google protobuf in handling large VRAM buffer objects.

        Overall this was a very significant feature addition that took our work from a POC to handling huge real-world machine learning and training workloads.

        Thank you
        Rajneesh

        Speakers: Mr Rajneesh Bhardwaj (AMD), Mr Felix Kuehling (AMD), Mr David Yat Sin
      • 09:50
        Break
      • 13
        Alternative ways to extract information about processes

        CRIU uses many different interfaces to get information about kernel resources:
        the sock_diag subsystem is used to extract socket data, per-pid procfs mountinfo
        files are used for mounts/mount namespaces, and the procfs fdinfo interface is
        used to get some file-type-specific info (such as the mnt_id the file was opened
        from, file flags and so on).

        One of the most important and time-consuming stages in a CRIU dump is getting
        process memory mapping information. Let's discuss that problem and
        approaches to optimizing the performance of this stage. There was a prototype
        implementation of a netlink-based interface to get information about a task
        [1]. We suggest using the eBPF iterators framework [2] to create a
        CRIU-optimized interface for getting task VMA data.
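
        In the spirit of [2], a hypothetical sketch of what such an iterator could look
        like (not the actual proposal): it walks every VMA of every task and totals the
        mapped bytes per pid into a map that a dumper could read out.

            #include "vmlinux.h"
            #include <bpf/bpf_helpers.h>

            struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 16384);
                __type(key, u32);   /* tgid */
                __type(value, u64); /* total mapped bytes */
            } vma_bytes SEC(".maps");

            SEC("iter/task_vma")
            int dump_task_vma(struct bpf_iter__task_vma *ctx)
            {
                struct task_struct *task = ctx->task;
                struct vm_area_struct *vma = ctx->vma;
                u64 zero = 0, *total;
                u32 tgid;

                if (!task || !vma)
                    return 0;

                tgid = task->tgid;
                total = bpf_map_lookup_elem(&vma_bytes, &tgid);
                if (!total) {
                    bpf_map_update_elem(&vma_bytes, &tgid, &zero, BPF_NOEXIST);
                    total = bpf_map_lookup_elem(&vma_bytes, &tgid);
                    if (!total)
                        return 0;
                }
                __sync_fetch_and_add(total, vma->vm_end - vma->vm_start);
                return 0;
            }

            char LICENSE[] SEC("license") = "GPL";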

        Another interesting topic is mount information acquisition. For simple cases
        the mountinfo file seems sufficient. Last year we introduced support for
        checkpoint/restoring nested containers. The main goal was the ability to
        C/R OpenVZ containers with Docker containers inside, and here we hit a
        problem with overlayfs mounts. CRIU needs to get real overlayfs paths from
        the kernel (mnt_id + full path for each source directory) and these paths
        may be very long (on the order of PAGE_SIZE). This is a problem because of
        serious limitations implied by the mountinfo interface (limited line size,
        poor extendability). Some overlayfs-specific patches were proposed [3] earlier,
        but it is worth having a universal approach to query mount information for
        all file systems. There was a great subsystem called fsinfo [4] proposed by
        David Howells, but for some reason it wasn't merged. One idea is to
        make progress by creating eBPF helpers which allow getting mount
        information.

        Thanks a lot to Andrei Vagin for his advice and help.

        Links:
        [1] https://github.com/avagin/linux-task-diag/commits/v5.8-task-diag
        [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/task_iter.c?h=v5.13#n472
        [3] overlayfs: C/R enhancements https://lkml.org/lkml/2020/10/4/208
        [4] fsinfo https://lwn.net/Articles/827934/

        Speakers: Alexander Mikhalitsyn (Virtuozzo), Andrei Vagin
      • 14
        Closing Session
        Speaker: Christian Brauner
    • GNU Tools Track GNU Tools track/Virtual-Room (LPC Virtual)

      GNU Tools track/Virtual-Room

      LPC Virtual

      150
      • 15
        GCC's -fanalyzer option: what's new in GCC 12

        I'll be talking about the -fanalyzer static analysis option I added to
        GCC:

        • overview of the analyzer and its internal implementation
        • what I've changed so far for GCC 12
        • my plans for further development of the analyzer

        ("Prepared project report": 25 minutes, including questions)

        Speaker: David Malcolm (Red Hat)
      • 17
        Using the GCC regression test suite for LLVM

        LLVM has two main test suites:

        • the regression test suite tests the compilation from source to IR; and
        • the nightly test suite is a body of often large applications which are compiled and executed.

        However, there is no large body of tests of detailed functionality which is compiled right down to the target object code and executed. At previous conferences, we have described the changes we have made to allow the GCC test suite to be used for nightly public regression testing of LLVM for RISC-V. Here we will discuss the necessary transformations to the testsuite to support LLVM.

        Speaker: Mary Bennett (Embecosm)
      • 18
        Analyzing historical DejaGNU test result data with the Bunsen toolkit

        Bunsen is a toolkit for compact storage and analysis of DejaGNU test results. The toolkit includes a storage engine that compresses and indexes a large collection of test result logs in a Git repository, a Python library for querying and analyzing the test result collection, and a simple CGI service for accessing query results through a web browser.

        In this talk I will give an in-depth look at how Bunsen can be used to understand the current state of a project's testsuite. Based on my experience using Bunsen to collect and monitor test results from the SystemTap project, I will show how keeping a long-term repository of test results enables more sophisticated and useful analysis. In particular, I will show how Bunsen analysis scripts can help to locate significant regressions and filter out insignificant ones, identify nondeterministic ('flaky') testcases, and narrow down the commits that introduced a particular regression.

        Type: prepared presentation (~25min)

        Speaker: Serguei Makarov (Red Hat Inc.)
      • 09:00
        Coffee Break
      • 19
        DWARF extensions for optimized SIMT/SIMD (GPU) debugging

        Abstract:
        AMD has been working on adding support for GPU compute debugging to GDB. Early on, it became apparent that current DWARF would not be sufficient to support optimized SIMT/SIMD code, so we came up with extensions and generalizations that we intend to propose for DWARF 6. Although designed with GPUs in mind, the extensions are generic and can just as well be used to improve the quality of debug information for CPUs and for any architecture. We've implemented the extensions in GDB, and are in the process of implementing them in LLVM. One interesting area that required extensions is DWARF expressions, support for which we're currently upstreaming to GDB. In this presentation we will give an overview of the problems we saw and what we've done to address them. More information on the subject and on our proposed DWARF extensions can be found here: https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html

        Speakers: Mr Tony Tye (Advanced Micro Devices), Zoran Zaric (AMD)
      • 20
        Support for the CTF and BTF debug formats in the GNU Toolchain

        CTF (Compact C Type Format) is a debugging format whose main (but not only) purpose is to convey type information of C program constructs. BTF is a similar format used in the Linux kernel to support the portable execution of BPF programs. Both formats share a common ancestor and show some remarkable similarities. However they are not the same format, their application goals are different, are developed by different groups, and they use a different binary representation.

        We have added support for both formats to the GNU Toolchain. CTF is now fully supported in GCC, in the linker (with type deduplication) and in the binary utilities (dumping the contents of .ctf sections in human-readable format); there is also a GNU poke description for editing encoded CTF, and GDB support. BTF is supported in GCC, mainly to be used by the BPF backend. There is, however, no support for BTF in Binutils at this point.

        In this talk we will show how these new debug formats have been implemented in GCC, highlighting how the implementation relies exclusively on the internal DWARF representation built by the compiler. This effectively makes DWARF the canonical internal representation for debugging info in GCC. This approach has worked well so far and looks very promising.

        We will also discuss the support for the CTF debug format in GDB, which includes support for both CTF sections and CTF archives (the latter is under review).

        Speakers: Indu Bhagat, David Faust, Wei-min Pan
      • 21
        Update on BPF support in the GNU Toolchain

        BPF is a virtual machine that resides in the Linux kernel. Initially intended for user-level packet capture and filtering, BPF has nowadays been generalized to serve as a general-purpose infrastructure for non-networking purposes as well. BPF programs are often written manually, directly in assembly instructions. However, people often want to write their BPF programs in C. We recently added support for this virtual architecture to the GNU Toolchain, to complement the already existing support in LLVM.

        In this talk we will first show and discuss the latest developments related to this peculiar target. This includes support for new instructions, atomics, architecture versioning, the BTF debugging format, and the CO-RE (Compile Once, Run Everywhere) mechanism in GCC, which makes it possible to compile BPF binaries that are portable across several kernel versions. Then we will show several ongoing efforts, such as the addition of a verifier-aware testsuite to GCC (using command-line access to the kernel verifier).

        Finally, we will address a very interesting problem that is arising in this field of compiled BPF. In a model where C code is processed by an optimizing compiler to generate assembly, and then checked by an ever-changing run-time verifier driven by delicate heuristics, it can be very difficult to predict the behavior of the latter just by looking at the C code in non-trivial cases. This can be very frustrating, as our own in-house experience with DTrace2 demonstrates. We have identified some particular language constructs (like loops) and compiler optimization passes that are most prone to lead to this situation.

        Speakers: Jose E. Marchesi (GNU Project, Oracle Inc.), David Faust, Mr Guillermo Martinez (Oracle)
    • Kernel Summit Kernel Summit/Virtual-Room (LPC Virtual)

      Kernel Summit/Virtual-Room

      LPC Virtual

      400
      • 22
        Doing more with lore and b4

        While lore.kernel.org is a fairly new service, it has quickly become an indispensable part of many maintainers' workflows. Tools like b4 make it possible to automate many aspects of maintainer duties:

        • retrieving entire patch series
        • tallying up and applying review trailers from thread follow-ups
        • diffing updated series against previous versions
        • sending templated thank-you notes for applied series and merged pull requests
        • cryptographically verifying patches for attestation purposes (using DKIM and PGP)

        This talk will review some of the above functionality that may be already familiar to maintainers, but will also go through other features of public-inbox, and peek into what is coming in newer releases, such as:

        • how to mirror lore.kernel.org locally (in part or in full)
        • how to integrate public-inbox sources into your automated tools
        • how to use the anonymous IMAP service to read mail without subscribing
        • how to use saved searches to find and track interesting threads across multiple lists
        • how to use local-email-interface (lei) command to get pre-filtered threads from multiple local and remote sources into your inbox

        Working with patches sent via email does not need to be frustrating or insecure, and we have tools to prove it.

        Speaker: Konstantin Ryabitsev (The Linux Foundation)
      • 23
        Linux kernel in Chrome OS - scaling to millions of users

        Jesse Barnes, Chrome OS baseOS (Firmware+Kernel) lead, and I would like to present the current state of affairs of the Linux kernel on ChromeOS, discuss the challenges we face and how we solve them, and get your feedback.

        We can also talk about how our efforts can help upstream development, for example by running experiments in the field to compare approaches to a specific problem or area.

        Shipping ChromeOS to millions of users, spanning hundreds of different platforms, multiple active kernel versions and many different SoC architectures, introduces interesting challenges:

        • Testing the upstream RC on as many platforms as we can as early as we can.
        • Updating the Linux kernel on existing platforms (millions of users at a time).
        • Managing technical “debt” and keeping the ChromeOS kernel as close as possible to the upstream
          kernel.
        • Current pain points in dealing with upstream?

        We feel 45-60 minutes would be enough and will allow a discussion.

        Thanks a lot in advance,

        Alex Levin,
        ChromeOS platform tech lead.

        Speakers: Alex Levin, Mr Jesse Barnes
      • 24
        Consolidating representations of the physical memory

        The Linux kernel uses several coarse representations of the physical memory
        consisting of [start, end, flags] structures per memory region. There is
        memblock, which some architectures keep after boot, there is the iomem_resource
        tree with its "System RAM" nodes, there are memory blocks exposed
        in sysfs, and then there are per-architecture structures, sometimes even
        several per architecture.

        These abstractions are used by the memory hotplug infrastructure and
        kexec/kdump tools. On some architectures the memblock representation even
        complements the memory map and it is used in arch-specific implementation
        of pfn_valid().

        The multiplication of such structures and lack of consistency between some
        of them does not help the maintainability and can be a reason for subtle
        bugs here and there. Moreover, the gaps between architecture specific
        representations of the physical memory and the assumptions made by the
        generic memory management about the memory layout lead to unnecessary
        complexity in the initialization of the core memory management structures.

        The layout of the physical memory is defined by hardware and firmware and
        there is not much room for its interpretation. Regardless of the particular
        interface between the firmware and the kernel a single generic abstraction
        of the physical memory should suffice and a single [start, end, flags] type
        should be enough. There is no fundamental reason it is not possible to
        converge per-architecture representations of the physical memory, like
        e820, drmem_lmb, memblock or numa_meminfo into a generic abstraction.

        Memblock seems the best candidate for being the basis for such generic
        abstraction. It is already supported on all architectures and it is used as
        the generic representation of the physical memory at boot time. Closing the
        gaps between per architecture structures and memblock is anyway required
        for more robust initialization of the memory management. Addition of simple
        locking of memblock data for memory hotplug, making the memblock
        "allocator" part discardable and a mechanism to synchronize "System RAM"
        resources with memblock would complete the picture.
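
        For reference, a sketch (addresses invented) of the memblock calls an early
        boot/arch path makes today; the idea is that this, rather than e820, drmem_lmb
        or numa_meminfo, becomes the single generic representation that the rest of the
        kernel walks.

            #include <linux/init.h>
            #include <linux/kernel.h>
            #include <linux/memblock.h>
            #include <linux/sizes.h>

            void __init example_register_ram(void)
            {
                phys_addr_t start, end;
                u64 i;

                /* RAM ranges reported by firmware, with a hole in between. */
                memblock_add(0x00000000, SZ_512M);
                memblock_add(0x40000000, SZ_1G);

                /* Carve out the kernel image / firmware tables. */
                memblock_reserve(0x01000000, SZ_16M);

                /* Generic MM code can then walk this single representation. */
                for_each_mem_range(i, &start, &end)
                    pr_info("memory range: %pa - %pa\n", &start, &end);
            }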

        Extending memblock with the necessary functionality and gradually bridging the
        gap between the current per-architecture physical memory representations and
        the generic one will improve the robustness and maintainability of the early
        memory management.

        Speaker: Mike Rapoport (IBM)
      • 25
        Rust for Linux

        The Rust for Linux project is adding support for the Rust language to the Linux kernel. This talk describes the work done so far and also serves as an introduction for other kernel developers interested in using Rust in the kernel.

        It covers:

        • A quick introduction of the Rust language within the context of the kernel.
        • How Rust support works in the kernel: overall infrastructure, compilation model, the standard library (core and alloc), etc.
        • What documentation for Rust code looks like.
        • Testing Rust code (unit tests and self tests).
        • Overview of tooling (e.g. compiler as a library, custom linters, Miri, etc.).
        • Explanation of coding guidelines (e.g. automatic formatting) and policies we follow (e.g. the SAFETY comments).
        • What kernel driver code looks like in Rust.
        Speaker: Miguel Ojeda
    • LPC Refereed Track Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 26
        Rust in the Linux ecosystem

        Rust is a systems programming language that is getting stronger support from many companies and projects over time, thanks to its memory-safety innovations (e.g. the safe/unsafe split, the borrow checker, etc.).

        This talk covers:

        • A quick introduction to the Rust language.
        • What exactly "safety" means in the context of Rust, which kinds of issues Rust prevents (e.g. data races, use-after-free, etc.) and which it does not (e.g. race conditions, memory leaks, etc.).
        • What is undefined behavior, how modern optimizers exploit it and how Rust reduces the risk via the safe/unsafe split.
        • How many CVEs are due to memory safety and what companies say about their cost.
        • Where Rust is being used today, e.g. Rust for Linux, Rustls, inside companies, etc.
        • Who supports Rust, e.g. the Rust Foundation, the Prossimo project, major companies, etc.
        Speaker: Miguel Ojeda
      • 27
        rustc_codegen_gcc: A gcc codegen for the Rust compiler

        The Rust programming language is becoming more and more popular: it is even being considered as another language allowed in the Linux kernel.
        That brought up the question of architecture support, as the official Rust compiler is based on LLVM.
        This project, rustc_codegen_gcc, is meant to plug the GCC backend into the Rust compiler frontend with relatively low effort: it is a shared library reusing the same API that the Rust compiler provides to the cranelift backend.
        As such, it could be used by some Linux projects as a way to provide their Rust software on more architectures.
        This talk will present this project and its progress.

        Speaker: Antoni Boucher
      • 28
        GCC Front-End for Rust

        GCC Rust is a front-end project for the GNU toolchain, a work-in-progress alternative to the official Rustc compiler. Being part of GCC, the compiler benefits from the common compiler flags, available backend targets and provides insight into its distinct optimiser's impact on a modern language.

        This project dates back to 2014, when Rust was still ~0.8, but the language was subject to frequent change, making the effort too challenging to keep up with. More recently, the core language has become stable, and in early 2019 development restarted. Since then, the project has laid out the core milestone targets to create the initial MVP, with freely available status reports, and is part of Google Summer of Code 2021 under the GCC organisation.

        The GNU toolchain has been the foundation of the Linux ecosystem for years, but the official Rustc compiler takes advantage of LLVM for code generation; this leaves a gap in language availability between the toolchains. GCC Rust will eliminate this disparity and provide a compilation experience familiar to those who already use GCC.

        As of 2021, GCCRS gained sponsorship from Open Source Security, Inc. and Embecosm to drive the effort forward. With this support, the project has gained mentorship from the GCC and Rust communities.

        In this talk, I will introduce the compiler, demonstrate its current state and discuss the goals and motivations for the project.

        Speaker: Philip Herron (Embecosm)
      • 29
        Formalizing Kernel Synchronization Primitives with PREEMPT_RT

        In certain corners of the Linux Kernel, manual locking and lockless-synchronization primitives are developed instead of using the existing (and default) kernel locking APIs. This is obviously frowned upon, but still exists for historical reasons or because developers think that the subsystem in question is special enough to warrant manual synchronization primitives.

        Adopting the PREEMPT_RT patch for mainline quickly exposes such cases, as these manual synchronization mechanisms are usually written with invalid assumptions regarding PREEMPT_RT that either negatively affect preemptibility or directly result in livelocks.

        In the non-mainline parts of the PREEMPT_RT patch, such offending call sites were directly dealt with using "#ifdef" shortcuts. This can be an acceptable solution for an external patch, but for mainline the bar is much higher: the kernel is first surveyed for similar patterns in other subsystems and then, in cooperation with the locking subsystem maintainers, new official kernel locking mechanisms are created for such call sites.

        This is better for the kernel ecosystem in general, and better for the call sites themselves. Using official kernel locking mechanisms comes with the benefits of reliability, thorough reviews, and lockdep validation.

        In this session, four cases will be presented — from actual experiences while pushing some of the remaining parts of the PREEMPT_RT patch mainline: new sequence counter APIs, modified APIs for disabling tasklets, local locks, and generic patterns to avoid using the low-level in_irq/softirq/interrupt() macros in non-core kernel code.
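
        As one concrete example of the pattern (simplified from the kernel's locking
        documentation, not from the session itself): a per-CPU statistic that used to be
        protected by preempt_disable() can be converted to a local_lock, which stays
        preemptible and lockdep-covered under PREEMPT_RT.

            #include <linux/local_lock.h>
            #include <linux/percpu.h>

            struct stats_pcpu {
                local_lock_t lock;
                unsigned long events;
            };

            static DEFINE_PER_CPU(struct stats_pcpu, stats_pcpu) = {
                .lock = INIT_LOCAL_LOCK(lock),
            };

            static void stats_account_event(void)
            {
                /* Previously: preempt_disable()/__this_cpu_inc()/preempt_enable(),
                 * an implicit per-CPU "lock" that lockdep and PREEMPT_RT cannot
                 * reason about. */
                local_lock(&stats_pcpu.lock);
                this_cpu_inc(stats_pcpu.events);
                local_unlock(&stats_pcpu.lock);
            }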

        The session will end with a discussion of some of the remaining manual locking mechanisms yet to be converted for proper mainline PREEMPT_RT inclusion.

        Speaker: Ahmed S. Darwish
    • Open Printing MC Microconference3/Virtual-Room (LPC Virtual)

      Microconference3/Virtual-Room

      LPC Virtual

      150

      The Open Printing microconference focuses on improving and modernizing the way we print in Linux.

      • 30
        Introduction

        Over the years OpenPrinting has been actively working on improving and modernizing the way we print in Linux. We have been working on multiple areas of printing and scanning. Especially driverless print and scan technologies have helped the world do away with a lot of hassles involved in deciding on the correct driver to use and to install the same. Users can now just plug in their printer and do what they need.

        So what next in Open Source Printing?

        Speaker: Aveek Basu
      • 31
        CUPS 2.4/2.5

        Go through the changes in CUPS 2.4.x, including printer sharing changes for mobile, OAuth support as a replacement for Kerberos, Printer Applications as a replacement for printer drivers, TLS/X.509 changes, and CUPS in containers (snap, Docker, others?). Discuss specific needs and timeframes WRT Kerberos->OAuth, drivers->Printer Applications, X.509 management, and deploying CUPS in containers going forward.
        (Continuation of discussions at the OpenPrinting Summit, active development in the OpenPrinting CUPS Github repository)

        Speaker: Michael Sweet (Lakeside Robotics Corporation)
      • 32
        CUPS 3.0

        Discuss proposed CUPS 3.0 design work from prior presentations. Discuss future CUPS development: identify supported platforms, key printing system components, areas of responsibility, schedules, goals, organizational issues, and milestones. Discuss integration with Printer Applications and application stores like the Snap Store.
        (Continuation of discussions at the OpenPrinting Summit)

        Speaker: Michael Sweet (Lakeside Robotics Corporation)
      • 08:30
        Break
      • 33
        Print Management GUI

        In the new printing (and scanning) architecture, available printers are no longer defined by CUPS queues but by IPP services: network printers, Printer Applications, and IPP-over-USB for USB printers. CUPS queues are simply auto-created for these IPP services. So it does not make sense to have a printer setup tool which lists the available CUPS queues and allows creating them. Instead, we need a tool which lists IPP services and, for each service, gives access to configure it, via buttons opening the web interface and also a GUI for IPP System Service.

        For legacy devices which do not provide IPP services by themselves, we need a tool to discover them and to find Printer Applications for them, both locally installed or installable, like in the Snap Store.

        We will discuss the details and the integration of these tools in the desktop environments.

        Speaker: Till Kamppeter (OpenPrinting / Canonical)
      • 34
        Common Print Dialog Backends

        Already some years ago we introduced the concept of the Common Print Dialog Backends (OpenPrinting GitHub: CPDB, CUPS backend), where we separate the print dialog GUI from the support code for the actual print technology (like CUPS, IPP, …) via a D-Bus interface, so that GUI toolkits and the print technology support can be developed and released independently.

        This especially allows for better support of the fast-paced changes in the printing technology vs. the long development cycles of the GUI toolkits. Also new print technologies, like cloud print services can be added easily, with the appropriate backend provided in a Snap in the Snap Store.

        Now this concept gets important again as the printing architecture is under heavy development with all things IPP, CUPS 2.4.x, 3.x, …

        Here we will especially discuss adoption into common GUI toolkits like GTK and Qt.

        Speaker: Till Kamppeter (OpenPrinting / Canonical)
      • 09:50
        Break
      • 35
        Printer/Scanner Driver Design and Development

        Classic printer and scanner drivers are replaced by Printer/Scanner Applications which emulate IPP-based network devices. We also have implemented most of the supporting code to easily create such Printer/Scanner Applications (PAPPL), a library for retro-fitting classic PPD-based CUPS drivers (pappl-retrofit), and Printer Applications retro-fitting PostScript PPDs (Snap Store), Ghostscript drivers (Snap Store), and HPLIP's printing (Snap Store).

        In this session we want to help developers get started with the design, creation, and Snap-packaging of Printer/Scanner Applications. In particular, we want printer/scanner developers to create native Printer/Scanner Applications rather than retro-fits of their classic CUPS/SANE drivers (Tutorial from Google Season of Docs 2021). Updates on the development progress appear in the monthly news posts.

        Speaker: Till Kamppeter (OpenPrinting / Canonical)
      • 36
        Scanning in PAPPL

        Devices are discovered by DNS-SD. We are adding support for pairing scanners with printers, since in the typical use case (a multi-function printer) the scan-specific TXT keys are added to the printer record, with the printer-dns-sd-name value coming from the printer. IPP scanners generally will not have their own DNS-SD records since they are paired with IPP printers, and IPP scanner registrations do not use the same TXT keys as printers. Scan-specific keys are added to the IPP scanner registration so that the pairing API can associate the scanner with the printer. The client polls the scanner's properties with a get-printer-attributes IPP request on the scanner URI. For this a pappl_scanner_t object is implemented, and scan-specific header files are added with the updated attributes and capabilities of a scanner (changing print to input); equivalent driver functions are added for scanning, as for printing. The user sets options like scan area, resolution, quality, color, ADF mode, ... and requests the scan.

        Speaker: Bhavna Kosta
      • 37
        Closing Session
        Speaker: Aveek Basu
    • Scheduler MC Microconference1/Virtual-Room (LPC Virtual)

      Microconference1/Virtual-Room

      LPC Virtual

      150

      The Scheduler microconference focuses on deciding what process gets to run when and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics at the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference is responsible for many topics to make progress.

      • 38
        Scheduler Microconference

        The scheduler is an important functionality of the Linux kernel, deciding what process gets to run when and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics at the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference is responsible for many topics to make progress.

        For example, at last year's Scheduler MC, we discussed core scheduling, which is now on its way to being merged [1]. The scheduling fairness patches were merged [2], and NUMA topology limitation fixes were added to the kernel [3]. Not only was progress made toward accepting patches, but some topics were also shown to be infeasible, like "Flattening the CFS runqueue", and this was facilitated by the conference format.

        This year, we think the following topics will lead to a productive microconference:

        • Cgroup interface and other updates for core-scheduling [1]
        • Cgroup and SCHED_DEADLINE [4]
        • Capacity Awareness – For busy systems
        • Interrupt Awareness
        • Load Balancing
          • Wakeup [5] [6] [7]
          • Periodic [5] [6]
          • NUMA load balancing

        Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

        Links:
        [1] https://lore.kernel.org/lkml/20210422120459.447350175@infradead.org/T/
        [2] scheduling fairness commits:

        • 6e7499135db7 ("sched/fair: Reduce busy load balance interval")
        • e4d32e4d5444 ("sched/fair: Minimize concurrent LBs between domain level")
        • 2208cdaa56c9 ("sched/fair: Reduce minimal imbalance threshold")
        • 5a7f55590467 ("sched/fair: Relax constraint on task's load during load balance")

        [3] numa topology commits:

        • 620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort")
        • 585b6d2723dc ("sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2")

        [4] https://lore.kernel.org/lkml/cover.1610463999.git.bristot@redhat.com/
        [5] https://lore.kernel.org/linux-arm-kernel/20210420001844.9116-5-song.bao.hua@hisilicon.com/T/
        [6] https://www.spinics.net/lists/kernel/msg3894298.html
        [7] https://www.spinics.net/lists/kernel/msg3914884.html

        Speakers: Dhaval Giani (Oracle), Daniel Bristot de Oliveira (Red Hat, Inc.), chris hyser, Juri Lelli (Red Hat), Vincent Guittot (Linaro)
      • 39
        Much Ado About… Migrations!

        The Linux scheduler shuffles tasks around on the various CPUs all the time, as mandated by the implementation of a combination of policies and heuristics. In general, this works well for many different workloads and lets us achieve a more than acceptable compromise among often conflicting goals, such as maximum throughput, minimum latency, reasonable energy consumption, etc. Furthermore, for the cases that really have special requirements, there are knobs to poke --such as different scheduling policies, priorities, affinity, up to CPU isolation-- for steering the scheduler toward any desired behavior.

        Nevertheless, we believe that there are cases where a less "migration prone" attitude from the scheduler could be beneficial, e.g., CPU-bound tasks (possibly HPC workloads) or the virtual CPUs of a virtual machine (at least in some circumstances). These cases would benefit from having tasks a bit more "sticky" to the CPUs where they are running, but pinning or a custom policy would be infeasible for the user to configure. Or maybe it's fine to pin or change the priority, but then we need to know what is causing the unwanted migrations in order to roll out the best counter-measures. For instance, if I can figure out that my task is often migrated because it is preempted by others, I can think about raising its priority and/or rebalancing (or reconsidering) the load on my system.

        We therefore started our investigation around migrations. Basically, with a task migration event as our starting point, we wanted to see if it was possible to figure out what other events caused it to actually occur, and how far back in the chain of such events we could get. We are using a combination of existing (the various tracing facilities) and new (e.g., Sched struct retriever) tools and techniques. We wanted to start really simple and looked at what happens to a very basic main(){while(1);} task, and discovered that on a CPU with multiple cores and multiple threads, it migrates among different cores a bit more than we expected. If we switch off hyperthreading, though, cross-core migrations disappear too...
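
        For reference, the toy workload amounts to the busy loop below, extended just
        enough (sched_getcpu() is a glibc call) to make the migrations visible from
        inside the task itself:

            #define _GNU_SOURCE
            #include <sched.h>
            #include <stdio.h>

            int main(void)
            {
                int last = -1;

                for (;;) {  /* the basic while(1) busy loop */
                    int cpu = sched_getcpu();

                    if (cpu != last) {
                        printf("now running on CPU %d\n", cpu);
                        last = cpu;
                    }
                }
                return 0;
            }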

        So, even though we are still at the beginning, the tools we are using are still work in progress and the one above is just one example, we want to present the current status of this activity to the community, in case anyone else is also interested or has any feedback.

        Speakers: Francesco Ciraolo (University of Turin), Dario Faggioli (SUSE), Enrico Bini (University of Turin)
      • 07:35
        Break
      • 40
        Overeager pulling from wake_wide() in interrupt heavy workloads

        On the CFS wakeup path, wake_wide() doesn't always behave itself very well in interrupt-heavy workloads. We have systems configured with static IRQ bindings because IRQs are served faster on certain CPUs due to hardware topology. We then noticed on these systems that wakeups kept pulling tasks to the socket serving network IRQs while leaving the other socket nearly idle on a read-only workload from YCSB, an open source database benchmark. On a lightly loaded system with two 32-core sockets, wake_wide() led the scheduler to wake affine most of the time. Wake affine is a two-pass process involving both wake_wide() and wake_affine(), but wake_wide() is the more dominant factor than wake_affine() in our workloads. Periodic and idle load balancing must work to undo wake affine's overeager pulling, but ultimately network interrupts are so frequent in YCSB that wake affine wins out.

        So far, we've gotten mixed results when trying to address the performance hit these issues cause. Disabling wake_affine() causes the benchmark to improve by 10-15% on a lightly loaded system (fewer DB connections) but slow down by up to 17% on more heavily loaded systems (more DB connections). wake_wide() works well when the waker and wakee are related, but we need a better heuristic for wakeups in interrupt heavy workloads, where the interrupt may or may not be related to the wakee.

        A better heuristic ideally should be able to determine which CPU's cache is warmer for the wakee and doesn't cause excessive pulling.

        Speaker: Libo Chen (Oracle)
      • 41
        New challenges in using LLC sched domain on wakeup path

        AMD and ARM server architectures further complicate the issue with wake_wide() overeager pulling (see other abstract).

        An LLC domain can span a whole socket on an Intel server but is significantly smaller on AMD ZEN due to its CCXs. For example, on ZEN 2, each CCX has only 4 cores. When binding network IRQs to such a CCX, we can consistently reproduce a scenario in which over 50 iperf tasks pile up there.

        Some ARM servers may suffer from the opposite problem of not having LLC domains at all, because they don't expose an L3 cache (also called SLC) to the kernel or support hyperthreading. wake_wide() and select_idle_sibling() rely on the existence of LLC domains to make smart decisions about wake affine and balancing load within an LLC domain. If there is no LLC domain, wake affine never happens and select_idle_sibling() won't try to look for an idle CPU within an LLC domain. In other words, a task will be woken up on its previous CPU even if it shares cache with the waker or the previous CPU is busy. Is this what we want? I don't think so; it's not optimal in some cases even if it helps in others. For instance, in our read-only YCSB workloads with static IRQ binding, always waking up on the previous CPU performs better on lightly loaded systems but worse on heavily loaded systems. So I think we should consider how to improve the use of LLC sched domains in the wakeup path on these architectures.

        Speaker: Libo Chen (Oracle)
      • 08:30
        Break
      • 42
        Challenge of selecting an idle CPU

        Several proposals have been tried to change the policy of the wakeup path regarding the selection of an idle CPU in the scheduler:

        • Consider new topology levels
        • Speed up and optimize the selection of idle cores and/or CPUs
        • Better estimate how much effort is worth spending on looking for an idle CPU
        • and more

        This talk will summarize the current ongoing proposals and discuss the best way to move forward.

        Speakers: Barry Song Bao Hua, Aubrey Li, Srikar Dronamraju, Vincent Guittot (Linaro)
      • 09:05
        Break
      • 43
        Remote charging in the CPU controller

        CPU-intensive kthreads aren't generally accounted in the CPU controller, so they escape weight and bandwidth settings when they do work on behalf of a task group.

        This is a problem in at least three places in the kernel. Padata multithreaded jobs (link1, link2, link3) may be started by a user task, so helper threads should be bound by the task's task group controls. Async memory reclaim (kswapd, cswapd) should be accounted to the cgroup that the memory belongs to, and similarly net rx should be accounted to the task groups of the corresponding packets being received. There are also general complaints from Android.

        Each use case has its own requirements. In padata and reclaim, the task group to account to is known ahead of time, but net rx has to spend cycles processing a packet before its destination task group is known. Furthermore, the CPU controller shouldn't throttle reclaim or net rx in real time since both are doing high priority work. These make approaches that run kthreads directly in a task group, like cgroup-aware workqueues or a kernel path for CLONE_INTO_CGROUP, infeasible. The proposed solution of remote charging can accrue debt to a task group to be paid off (or forgiven) later, addressing both of these issues.

        Prototype code has shown some of the ways this might be implemented and the tradeoffs between them. Here's hoping that an early discussion will help make progress on this longstanding problem (link1, link2, link3).

        Speaker: Daniel Jordan (Oracle)
      • 09:40
        Break
      • 44
        Per-task I/O boost tracking

        Sugov implements a rather simplistic concept of boosting I/O-bound
        tasks, through tracking I/O wakeups reported on each CPU and adjusting a
        synthetic boost value to potentially influence upcoming frequency changes.
        The actual boost value depends on a number of different conditions, like
        timings of the task wake-ups and CPU frequency update requests or the
        CPUfreq policy.
        This makes things rather fuzzy and exposes the following drawbacks, which
        might result in an unexpected loss of I/O boost build-up or undesired CPU
        frequency spikes:

        1) Sugov does not differentiate between I/O boost request sources, so it
        can't detect multiple unrelated tasks that have sporadic I/O wake-ups.

        2) As the boost value is being maintained per CPU, boost accumulated on
        one CPU might be lost upon task migration.

        3) There is no guarantee that the task(s) that did trigger the I/O boost
        is/are still runnable on that CPU.

        4) Relevant task uclamp restrictions are not being taken into account.

        5) No notion of dependency on the actual device's performance and
        throughput, i.e. boosting the CPU frequency might turn out to be
        pointless in case the device cannot cope with the increased I/O
        request rate.

        This presentation shows how these shortcomings could be solved, or at
        least mitigated, by moving from per-core to per-task I/O boost tracking.
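
        As a point of reference for drawbacks 1) to 3), the sketch below models a per-CPU boost in the style the abstract criticizes: the boost ramps up on consecutive I/O wakeups and decays otherwise, and it lives with the CPU rather than the task. The constants and names are invented for illustration; this is not schedutil's actual code.

          #include <stdbool.h>

          #define BOOST_MIN  128      /* hypothetical minimum boost value */
          #define BOOST_MAX 1024      /* hypothetical maximum (full capacity) */

          struct cpu_boost {
              unsigned int iowait_boost;  /* synthetic boost, tracked per CPU */
          };

          static void iowait_boost_update(struct cpu_boost *cb, bool io_wakeup)
          {
              if (io_wakeup) {
                  /* Consecutive I/O wakeups ramp the boost up exponentially. */
                  cb->iowait_boost = cb->iowait_boost ? cb->iowait_boost * 2
                                                      : BOOST_MIN;
                  if (cb->iowait_boost > BOOST_MAX)
                      cb->iowait_boost = BOOST_MAX;
              } else {
                  /* No I/O wakeup since the last frequency update: decay. */
                  cb->iowait_boost /= 2;
                  if (cb->iowait_boost < BOOST_MIN)
                      cb->iowait_boost = 0;
              }
          }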

        Speakers: Ms Beata Michalska (Arm), Dietmar Eggemann
      • 10:15
        Break
      • 45
        Improving responsiveness of interactive CFS tasks using util_est

        One of the most significant metrics for good user experience on a mobile
        device is how quickly the system can react to load changes.

        Util_est is used in mainline to create a more stable signal of per-task
        demand: the maximum of the task's util_est and its PELT utilization
        (known as the task utilization).
        In case PELT utilization becomes higher than util_est, the behaviour of
        that task is changing and it needs more resources than previously allocated.
        The responsiveness of the task can be improved by boosting the task
        utilization during this time beyond its PELT utilization.

        This presentation describes an implementation of this idea and shows how
        it improves behaviour on an Android device.
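
        A minimal sketch of the idea follows; the margin used is an arbitrary placeholder, not the proposed boost value.

          /* Task demand is normally max(util_est, PELT utilization); when the
           * PELT utilization rises above util_est, the task's behaviour is
           * changing and its utilization is boosted beyond the raw PELT value
           * to speed up the frequency ramp-up.  Illustrative only. */
          static unsigned long task_util(unsigned long util_avg,
                                         unsigned long util_est)
          {
              unsigned long util = util_avg > util_est ? util_avg : util_est;

              /* Behaviour change detected: boost beyond PELT utilization. */
              if (util_avg > util_est)
                  util += util / 4;           /* placeholder 25% margin */

              return util;
          }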

        Speaker: Mr Vincent Donnefort (Arm)
    • Keynote: Keynote by Jon Maddog Hall Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 46
        KEYNOTE: What a long, strange trip it's been....

        This year is the 30th anniversary of the Linux Kernel project, and for most of you the history of the Linux Kernel is well known. While this talk will honor much of that, it also hopes to bring in the histories of other projects that affected Linux and Computer Science, with recognition of the past, humor of the present and a look toward the future.

        Speaker: Jon "maddog" Hall (Linux Professional Institute)
    • BPF & Networking Summit Networking and BPF Summit/Virtual-Room (LPC Virtual)

      Networking and BPF Summit/Virtual-Room

      LPC Virtual

      150

      The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 47
        Bringing TSO/GRO and Jumbo frames to XDP

        XDP is designed for maximum performance, which is why certain driver use-cases are not supported (e.g. Jumbo frames or TSO/LRO). The single buffer per-packet design defines a simple and fast memory model and allows eBPF Direct Access (DA) to packet data. Both are essential for performance. However, it is high time to fill the gap with the networking stack and enable non-linear frame support for XDP. There are multiple use-cases for XDP multi-buff, like TSO/GRO, Jumbo frames or packet headers split across multiple buffers. In this talk, we will present our design for non-linear frames in XDP, with the objective of supporting TSO/GRO or Jumbo frames while, at the same time, not slowing down the single buffer per-packet use case.
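
        For reference, here is a minimal XDP program illustrating the single-buffer Direct Access model discussed above (compilable with clang -target bpf); it is only an illustration, not the multi-buffer design itself.

          #include <linux/bpf.h>
          #include <linux/if_ether.h>
          #include <bpf/bpf_helpers.h>

          SEC("xdp")
          int xdp_check_eth(struct xdp_md *ctx)
          {
              void *data     = (void *)(long)ctx->data;
              void *data_end = (void *)(long)ctx->data_end;
              struct ethhdr *eth = data;

              /* Verifier-enforced bounds check before touching packet bytes. */
              if ((void *)(eth + 1) > data_end)
                  return XDP_DROP;

              /* With jumbo frames or TSO/GRO, payload beyond the first buffer
               * would live in fragments not reachable through data/data_end. */
              return XDP_PASS;
          }

          char _license[] SEC("license") = "GPL";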

        Speakers: Eelco Chaudron (Red Hat), Toke Høiland-Jørgensen (Red Hat), Jesper Dangaard Brouer (Red Hat), Lorenzo Bianconi (Red Hat)
      • 48
        Watching the super powers

        BPF programs are critical system components performing core networking functionality, system audit logs, tracing, and runtime security enforcement to list a few. Charged with such crucial tasks, how do we audit the BPF subsystem itself to ensure system bugs are noticed and malicious attackers can not subtly manipulate these components, inject new programs, or quietly run their own BPF programs?

        One proposal is to sign BPF programs following the model used for years to sign kernel modules. In this model the BPF programs loaded into the kernel are signed and verified to ensure only authorized programs can be loaded. Although we support such efforts, we believe they are insufficient to actually provide meaningful security guarantees. Unlike kernel modules of old, BPF programs tend to have a control plane that is tightly coupled with the application,
        so signing the BPF program only covers the most obvious attacks.

        In this talk, using real-world examples, we show that signing BPF programs provides only a minimal improvement over the current model. Instead, we propose a robust audit system and enforcement of BPF system calls to gate access to critical control paths and to ensure that the loaders of said programs are known. Further, by scanning programs at load time we can apply fine-grained permissions, e.g. only allowing specific BPF helpers and maps to be exposed in the targeted file systems. Finally, by doing runtime auditing and enforcement we can provide fine-grained per-user policies based on the trustworthiness of the user. How do we propose to build such a platform? Using BPF of course! By loading a core set of BPF programs in early boot we will show how to implement the proposed model.
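
        As a hedged sketch of what observing the BPF subsystem with BPF itself could look like, the program below attaches a BPF LSM hook to the bpf() syscall and records who exercises it. It assumes a kernel built with CONFIG_BPF_LSM and a BTF-generated vmlinux.h; the logic is only a placeholder, not the platform proposed in the talk.

          #include "vmlinux.h"
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_tracing.h>

          SEC("lsm/bpf")
          int BPF_PROG(audit_bpf_syscall, int cmd, union bpf_attr *attr,
                       unsigned int size)
          {
              /* Record who is exercising the BPF syscall; a real policy would
               * also check the loader, program type, allowed helpers/maps, etc. */
              bpf_printk("bpf() cmd=%d by pid=%d", cmd,
                         (int)(bpf_get_current_pid_tgid() >> 32));

              return 0;   /* 0 allows; a negative errno would deny the call */
          }

          char LICENSE[] SEC("license") = "GPL";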

        Speaker: John Fastabend (Isovalent)
      • 49
        Improving the eBPF Developer Experience with Rust

        Rust is becoming an increasingly popular choice as a systems programming language. In fact, it's been the #1 most loved language on Stack Overflow for the last 6 years. Aside from being fast, type safe and memory safe, its tooling is excellent, which yields high developer productivity. It has been used to write embedded systems software, it is central to the WebAssembly ecosystem, and it is very close to being used inside the Linux Kernel.

        eBPF offers many exciting possibilities, but getting started developing eBPF programs is hard. While there are many eBPF libraries that target writing userspace applications in most popular programming languages, very few of them also seek to improve the experience of writing, building and debugging the eBPF program itself.

        Aya is an eBPF library built for exactly this purpose. Using Aya, we seek to improve the eBPF developer experience with Rust!

        In this talk, we will demonstrate how Aya can be used to quickly develop an eBPF application as well as covering plans for new features to further improve the experience.

        Learn how to:
        - Quickly start a new eBPF Project
        - Compile Rust programs to eBPF bytecode
        - Generate bindings to kernel types using BTF
        - Allow seamless sharing of code between eBPF and user space
        - Load eBPF programs from user-space and interact with maps

        Speakers: Dave Tucker (Red Hat), Alessandro Decina (Deepfence)
      • 50
        Ahead-of-time compiled bpftrace programs

        bpftrace was originally designed with a dynamic compilation model. While that model has worked fairly well, new developments and concerns in the eBPF ecosystem have prompted re-evaluation of the original design.

        First, BPF is becoming more widely used so performance is more of a concern. Running LLVM to generate bytecode on the fly is somewhat costly, especially for bpftrace-enabled data collection in production. Binary weight is also an issue because bpftrace currently ships with LLVM and clang libraries.

        Second, signed BPF program support is making its way into the kernel in response to security concerns. bpftrace is not immune to those security concerns, so bpftrace must have an answer as well if it is to remain relevant in more secure environments.

        Finally, CO-RE, the building blocks for portable BPF programs, has become production ready and is shipping in many distros. bpftrace can build on top of these pieces to deliver ahead-of-time compiled bpftrace programs. These AOT programs will be faster to run and smaller in binary size.

        This talk will go into the ongoing work to enable AOT bpftrace programs. There are a lot of moving pieces because the existing code has made broad assumptions about the compilation model. Hopefully by the end of this talk, participants will have a better idea about what work has been accomplished, what remains to be done, and what open issues still need to be resolved.

        Speaker: Daniel Xu (Facebook)
      • 51
        DSA switches: domesticating a savage beast

        The DSA subsystem was originally built around Marvell devices, but has since
        been extended to cover a wide variety of hardware with even wider views of
        their management model. This presentation discusses the changes in DSA that
        took place in recent years to allow this wide variety of switches to offer
        more services, and in a more uniform way, to the larger network stack.

        Summarized, these changes are:

        • Acknowledging switches which only have DSA tags for control plane packets,
          and modifying the bridge driver to accept termination of data plane packets
          from these switches.

        • Support for unoffloaded upper interfaces.

        • Support for more cross-chip topologies than the basic daisy chain, while
          maintaining the basic principle that network interfaces under the same bridge
          can forward from one to another, and interfaces under different bridges
          don't.

        The data plane and the control plane

        The original DSA architecture of exposing one virtual network interface for
        each front-facing switch port, and not exposing virtual network interfaces for
        the ports facing inwards (CPU ports, DSA/cascade ports) has remained unchanged
        to this day. DSA network interfaces should not only be conduit interfaces for
        retrieving ethtool statistics and registering with the PHY library, but they
        should be fully capable of sending and receiving packets. This is accomplished
        via the DSA tagging protocol drivers, which are hooks in the RX and TX path of
        the host Ethernet controller (the DSA master) which multiplex and demultiplex
        packets to/from the correct virtual switch interface based on switch-specific
        metadata that is placed in the packets.
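
        To make the multiplex/demultiplex idea concrete, here is a conceptual C sketch of what a tagging protocol driver does on TX and RX. The tag layout and names are invented for illustration only; real taggers live in net/dsa/tag_*.c and their formats vary widely between vendors.

          #include <stdint.h>
          #include <string.h>

          struct fake_dsa_tag {
              uint8_t  port;      /* destination port on TX, source port on RX */
              uint8_t  flags;     /* e.g. "override STP state" for control packets */
              uint16_t vlan_qos;  /* VLAN/QoS metadata carried in the tag */
          };

          /* TX: insert the tag between the MAC addresses and the EtherType.
           * Assumes the frame buffer has headroom for the extra bytes. */
          static size_t fake_tag_xmit(uint8_t *frame, size_t len, uint8_t port)
          {
              struct fake_dsa_tag tag = { .port = port };

              memmove(frame + 12 + sizeof(tag), frame + 12, len - 12);
              memcpy(frame + 12, &tag, sizeof(tag));
              return len + sizeof(tag);
          }

          /* RX: recover the source port, strip the tag, and let the caller hand
           * the frame to the matching virtual switch interface. */
          static size_t fake_tag_rcv(uint8_t *frame, size_t len, uint8_t *src_port)
          {
              struct fake_dsa_tag tag;

              memcpy(&tag, frame + 12, sizeof(tag));
              *src_port = tag.port;
              memmove(frame + 12, frame + 12 + sizeof(tag),
                      len - 12 - sizeof(tag));
              return len - sizeof(tag);
          }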

        In this model, the basic function of a network switch from a hardware
        designer's perspective, which is to switch packets, is an optional feature from
        the Linux network stack's perspective, and was added years after the original
        design had been established.

        Behind the seemingly uniform implementation of DSA tagging protocols and switch
        drivers, which are tightly managed by the DSA framework, lie many differences
        and subtleties that make the feature set exposed by two different DSA switches
        to the network stack very different.

        The majority of network switches capable of management have some sort of
        distinction between the data plane packets and the control packets.

        At the most basic level, control packets, which must be used for link-layer
        control protocols like STP and PTP, have the ability to target a specific
        egress port and to override its STP state (inject into a BLOCKING port). These
        packets typically bypass the forwarding layer of the switch and the frame
        analysis stage of the ingress (CPU) port and are injected directly into the
        egress port. The implications are that metadata such as QoS class and VLAN ID
        must be specified by the operating system driver directly as part of the DSA
        tag, and that hardware address learning is not performed on the CPU port.

        On the opposite side of the spectrum, data plane packets do not perform STP
        state override, are subject to hardware address learning on the CPU port, but
        also cannot be steered towards a precise destination port, since they are also
        subject to the forwarding rules of the switch.

        At the extreme, there exists a DSA_TAG_PROTO_NONE tagging protocol, which
        admits defeat and does not attempt to multiplex/demultiplex virtual switch
        interfaces from the DSA tag, and all network I/O through such a switch takes
        place through the DSA master which is seen as a switched endpoint. The network
        interfaces registered for the switches are only used for control operations
        (ethtool, PHY library) and are "dead" to the network stack both for control
        plane and for data plane packets. These are the "unmanaged" switches.

        Finally, in some switch designs, injecting a control packet is an expensive
        operation which cannot be sustained at line rate, and the bulk of the traffic
        (the data plane packets) should be injected, from the hardware designer's
        perspective, directly through the DSA master interface, with no DSA tag.
        These are the "lightly managed" switches, and their virtual DSA interfaces are
        similarly "dead" to the network stack except for link-local packets.

        The most basic and common approach with this type of hardware is to simply set
        up a user space configuration to perform the traffic termination from the
        switching domain on the DSA master itself. For some packets to target a single
        switch port, the user is required to install a bridge VLAN on the switch port
        which is egress-tagged towards the CPU port, then create an 8021q upper with
        the same VLAN ID on top of the DSA master, and send/receive traffic through the
        8021q upper of the DSA master. This approach is, however, undesirable because
        bridging DSA interfaces with non-DSA (foreign) interfaces is impossible, which
        is an important use case for boards with a switch and a Wi-Fi AP (home routers).
        Interfaces that are DSA masters cannot be added to a bridge either.

        A slightly better integrated way of achieving the same result is the relatively
        new software-defined DSA tagging protocol named tag_8021q, which can bring both
        the lightly managed and unmanaged switches closer to the user model exposed by
        DSA switches with hardware support for a DSA tagging protocol.

        The tag_8021q protocol is fundamentally still sending data plane packets from
        the perspective of the hardware, so there are things it cannot accomplish, like
        STP state override. Additionally, the DSA framework has traditionally not
        enforced any meaningful distinction between data plane and control plane
        packets, since originally, the assumption was that all packets injected by the
        software network stack should be control packets.

        To unify the hardware and the software notions, and to use these chips in the
        way they were meant to, the network stack must be taught about data plane
        packets. The tag_8021q model breaks down when DSA switch interfaces offload a
        VLAN-aware bridge, which is in fact their primary use case. This is because
        the source port of the switch cannot be retrieved based on the VLAN ID by the
        tagging protocol driver on RX, because the VLANs are under the control of the
        bridge driver, not DSA, and there is no guarantee that a VLAN is uniquely
        installed on a single switch port. So bridging with foreign interfaces becomes
        equally impossible.

        The decisive changes which made these switches correctly offload a VLAN-aware
        bridge come in the form of not attempting to report a precise source port on RX
        for data plane packets, just a plausible/imprecise one. As long as some
        requirements inside the software bridge's ingress path are satisfied (valid STP
        state, VLAN ID from the packet is in the port's membership list), the bridge is
        happy to accept the packet as valid, and process it on behalf of the imprecise
        DSA interface that was reported.

        Complications arise due to the fact that the software bridge might learn the
        MAC SA of these packets on a potentially wrong port, and deliver those packets
        on the return path towards the wrong port. Additionally, due to bandwidth
        constraints, DSA interfaces do not synchronize their hardware FDB with the FDB
        of the software bridge, so the software bridge does not have an opportunity to
        figure out the real source port of imprecise packets.

        To give DSA the chance to right a wrong, the bridge driver was modified to
        support TX forwarding offload. With this feature, the software bridge avoids
        cloning an skb which needs to be flooded to multiple ports, and sends only one
        copy of the packet towards a single network interface from each "hardware
        domain" that the flooded packet must reach. The port driver is responsible with
        looking up its hardware FDB for that packet and replicate the packet as needed.
        This is a useful feature in itself, because with switches with a large port
        count, multicast traffic on the bottlenecked link between the DSA master and
        the CPU port is reduced, and packets are replicated inside the hardware.
        But with the lightly-managed and unmanaged switches, it makes the imprecise RX
        work correctly, since the TX is also imprecise. So even though the software
        bridge did learn the MAC SA of the packets on the wrong source port, that
        source port is in the same hardware domain with the right port, and even though
        the software FDB is incorrect, the hardware FDB isn't. So DSA drivers for
        lightly-managed and unmanaged switches have a chance to properly terminate
        traffic on behalf of a VLAN-aware bridge, in a way that is compatible with
        bridging with foreign interfaces, and with a user space interaction procedure
        that is much more uniform with DSA drivers that always send and receive packets
        with a DSA tag.

        Unoffloaded software upper interfaces

        Recently, DSA has also gained support for offloading other virtual network
        interfaces than the Linux bridge. These are the hsr driver (which supports the
        HSR and PRP industrial redundancy protocols) and the bonding/team drivers
        (which support the link aggregation protocol).

        Not all switches are capable of offloading hsr and team/bonding, and DSA's
        policy is to fall back to a software implementation when hardware offload
        cannot be achieved: the bandwidth to/from the CPU is often good enough
        that this is practical.

        However, DSA's policy could not be enforced right away with the expected
        results, due to two roadblocks that led to further changes in the kernel code
        base.

        For DSA, not offloading an upper interface means that the physical port
        should behave exactly as it would as a standalone interface, with no
        switching to any port other than the CPU port, and capable of IP termination.

        But when the unoffloaded upper interface (the software LAG) is part of a
        bridge, the bridge driver makes the incorrect assumption that it is capable of
        hardware forwarding towards all other ports which report the same physical
        switch ID. Instead, forwarding to/from a software LAG should take place in
        software. This has led to a redesign of the switchdev API, in that drivers must
        now explicitly mark to the bridge the network interfaces that are capable of
        autonomous forwarding; the new default being that they aren't. In the new
        model, even if two interfaces report the same physical switch ID, they might
        yet not be part of the same hardware domain for autonomous forwarding as far as
        the bridge is concerned.

        The second roadblock, even after the bridge was taught to allow software
        forwarding between some interfaces which have the same physical switch ID, was
        FDB isolation in DSA switches. Up until this point, the vast majority of DSA
        drivers, as well as the DSA core, have considered that it is enough to offload
        multiple bridges by enforcing a separation between the ports of one bridge and
        the ports of another at the forwarding level. This works as long as the same
        MAC address (or MAC+VLAN pair, in VLAN-aware bridges) is not present in more
        than one bridging domain at the same time. This is an apparently reasonable
        restriction that should never be seen in real life, so no precautions have been
        taken against it in drivers or the core.

        The issue is that a DSA switch is still a switch, and for every packet it
        forwards, regardless of whether it is received on a standalone port, a port
        under a VLAN-unaware bridge or under a VLAN-aware one, it will attempt to look
        up the FDB to find the destination. With unoffloaded LAGs on top of a
        standalone DSA port, where forwarding between the switched domain and the
        standalone port takes place in software, the expectation that a MAC address is
        only present in one bridging domain is no longer true. From the perspective of
        the ports under the hardware bridge, a MAC address might come from the outside
        world, whereas from the perspective of the standalone ports, the same MAC
        address might come from the CPU port. So without FDB isolation, the standalone
        port might look up the FDB for a MAC address and see that it could forward the
        packet directly to the port in the hardware bridge domain, where that packet
        was learned by the bridge port, short-circuiting the CPU. But the forwarding
        isolation rules put in place will prevent this from happening, so packets will
        be dropped instead of being forwarded in software.

        Individual drivers have started receiving patches for FDB isolation between
        standalone ports and bridged ports, but it is possible to conceive real life
        situations where even FDB isolation between one bridge and another must be
        maintained. Since the DSA core does not enforce FDB isolation through its API
        and many drivers already have been written without it in mind, it is to be
        expected that many years pass until DSA offers a uniform set of services to
        upper layers in this regard.

        Switch topology changes

        Traditionally, the cross-chip setups supported by DSA have been daisy chains,
        where all switches except the top-most one lack a dedicated CPU port, and are
        simply cascaded towards an upstream switch. There are two new switch topologies
        supported by DSA now.

        The first is the disjoint tree topology. A DSA tree is comprised of all
        switches directly connected to each other which use a compatible tagging
        protocol (one switch understands the packets from the other one, and can
        push/pop them as needed). Disjoint trees are used when DSA switches are
        connected to each other, but their tagging protocols are not compatible.
        Instead of one switch understanding another's tags, tag stacking takes place, so
        in software, more than one DSA tagging protocol driver needs to be invoked for
        the same packet. In such a system, each switch forms its own tree. Disjoint
        trees were already supported, but the new changes also permit some hardware
        forwarding to take place between switches belonging to different trees. For
        example, consider an embedded 5-port DSA switch that has 3 external DSA
        switches connected to 3 of its ports. Each embedded DSA switch interface is a
        DSA master for the external DSA switch beneath it, and there are 4 DSA disjoint
        trees in this system. For a packet to be sent from external switch 1 to
        external switch 2, it must be forwarded towards the CPU port. In the most basic
        configuration, forwarding between the two external switches can take place in
        software. However, it is desirable that the embedded DSA switch that is a
        master of external switches 1 and 2 can accelerate the forwarding between the
        two (because the external switches are tagger-compatible, they are just
        separated by a switch which isn't tagger-compatible with them). Under some
        conditions, this is possible as long as the embedded DSA switch still has some
        elementary understanding of the packets, and can still forward them by MAC DA
        and optionally VLAN ID, even though they are DSA-tagged. With the vast majority
        of DSA tagging protocols, the MAC DA of the packets is not altered even when a
        DSA tag is inserted, so the embedded DSA master can sanely forward packets
        between one external switch and another. This is one of the only special cases
        where DSA master interfaces can be bridged (they are part of a separate bridge
        compared to the external switch ports), because in this case, the DSA masters
        are part of a bridge with no software data plane, just a hardware data plane.
        The second requirement is for both the embedded and the external switches to
        have the same understanding of what constitutes a data plane packet, and what
        constitutes a control plane packet: STP packets received by the external switch
        should not be flooded by the embedded switch. Due to the same reason that the
        embedded switch must still preserve an elementary understanding of the MAC DA
        of packets tagged with the external switch's tagging protocol, this will also
        be the case, since typical link-layer protocols have unique link-local
        multicast MAC addresses.

        The second is the H tree topology. In such a system, there are multiple
        switches laterally interconnected through cascade ports, but to reach the CPU,
        each switch has its own dedicated CPU port. It turns out that to support such a
        system, there are two distinct issues.

        First, with regard to RX filtering, an H tree topology is very similar in
        challenges to a single switch with multiple CPU ports. Hardware address
        learning on the CPU port, if at all available, is of no use and leads to
        addresses bouncing and packet drops. All MAC addresses which need to be
        filtered to the host need to be installed on all CPU ports as static FDB
        entries. This has led to the extension of the bridge switchdev FDB notifiers to
        cover FDB entries that are local to the bridge, and which should not be
        forwarded.

        Secondly, in an H topology it is actually possible to have packet loops with
        the TX forwarding offload feature enabled, because TX data plane packets sent
        by the stack to one switch might also be flooded through the cascade port to
        the other switch, where they might be again flooded to the second switch's CPU
        port, where they will be processed as RX packets. Currently, drivers which
        support this topology need to be individually patched to cut RX from cascade
        ports that go towards switches that have their own CPU port, because the DSA
        driver API does not have the necessary insight into driver internals as to be
        able to cut forwarding between two ports only in a specific direction.

        Future changes

        One of the most important features still absent from DSA is the support for
        multiple CPU ports, the ability to dynamically change DSA masters and the
        option to configure the CPU ports in a link aggregation group. However, with
        many roadblocks such as basic RX filtering support now out of the way, this
        functionality will arrive sooner rather than later.

        There is also the emerging topic of Ethernet controllers as DSA masters that
        are aware of the DSA switches beneath them, which is typical when both the
        switch and the Ethernet controller are made by the same silicon vendor.
        Right now DSA can freely inherit all master->vlan_features, such as TX
        checksumming offloads, but this does not work for all switch and DSA master
        combinations, so it must be refined so that only the known-working master and
        switch combinations inherit the extra features.

        On the same topic of DSA-aware masters, SR-IOV capable masters are expected to
        still work when attached to a DSA switch, but the network stack's model of this
        use case is unclear. VFs on top of a DSA master should be treated as switched
        endpoints, but the VF driver's transmit and receive procedures do not go
        through the DSA tagging protocol hooks, and these packets are therefore
        DSA-untagged. So hardware manufacturers have the option of inserting DSA tags
        in hardware for packets sent through a VF that goes through a DSA switch. It is
        unclear, however, in which bridging domain these VFs are being
        forwarded. An effort should be made to standardize the way in which the network
        stack treats these interfaces. It appears reasonable that DSA switches might
        have to register virtual network interfaces that are facing each VF of the
        master, in order to enforce their bridging domain, but this makes the DSA
        master and switch drivers closely coupled.

        On the other hand, letting other code paths than the DSA tagging protocol
        driver inject packets into the switch risks compromising the integrity of the
        hardware, which is an issue that currently exists and needs to be addressed.

        In conclusion, taming DSA switches and making them behave completely in
        accordance with the network stack's expectations proves to be a much more
        ambitious challenge than initially foreseen, and thus the fight continues.

        Speaker: Vladimir Oltean (NXP Semiconductors)
    • Confidential Computing MC Microconference2/Virtual-Room (LPC Virtual)

      Microconference2/Virtual-Room

      LPC Virtual

      150

      The Confidential Computing microconference focuses on solutions for using state-of-the-art encryption technologies for live encryption of data, and on how to utilize the technologies from AMD (SEV), Intel (TDX), s390 and ARM Secure Virtualization for secure computation of VMs, containers and more.

      • 52
        Confidential Computing MC Welcome
        Speaker: Joerg Roedel (SUSE)
      • 53
        Live Migration of TD Guest

        The Intel Trust Domain Extensions (TDX) technology extends VMX and MKTME to enhance guest data security by isolating guests from host software, including the VMM/hypervisor. Live migration support for such isolated guests (i.e. TDs) facilitates the deployment of TD guests in the cloud.
        This talk presents the QEMU/KVM design of TDX live migration and initial PoC results for the migration performance evaluation. A common framework is added to the QEMU migration code to support TD guests and other similar technologies (e.g. SEV guests). For TD guest live migration, the guest's shared memory pages are migrated in plaintext. The guest's private memory pages, vCPU states and TD-scope states are encrypted with a migration key when they are exported by KVM from the TDX module. A migration stream in the workflow has a KVM device created, and the device creates shared memory between KVM and the QEMU migration thread to transport the encrypted guest states.

        Speaker: Wei Wang (Intel Corp.)
      • 54
        Live Migration of Confidential VMs

        Discussion on Live Migration of AMD SEV encrypted VMs.

        Link to the latest posted (KVM) patch for SEV live migration :
        https://lore.kernel.org/lkml/cover.1623174621.git.ashish.kalra@amd.com/

        Discussions on Guest APIs, specifically if the APIs can cover both
        AMD SEV and Intel TDX platforms and exploring common interfaces
        which can be re-used for both the above platforms, for example,
        exploring a common hypercall API interface, with reference
        to the posted KVM patch-set.

        Link to related discussion on the same topic:
        https://lore.kernel.org/lkml/YJv5bjd0xThIahaa@google.com/

        SEV Live Migration Acceleration uses an alternative migration
        approach relying on a Migration Helper (MH) running in guest
        context. The fast migration of encrypted virtual
        machines typically uses a Migration Handler that lives in OVMF.

        As part of this microconference, we can have additional
        discussions on the design and development of the MH, especially,
        the suggested approach to use KVM/Qemu Mirror VM concept to
        run the MH in a Mirror VM/vCPU which runs in parallel to the
        primary encrypted VM in the same Qemu process context.

        Links to posting for the above on KVM and Qemu development
        lists : https://lore.kernel.org/lkml/SN6PR12MB276727DE9AFD387B11C404418E3E9@SN6PR12MB2767.namprd12.prod.outlook.com/

        Speaker: Ashish Kalra
      • 55
        TDX Linux guest

        Intel TDX is an upcoming confidential computing platform for running encrypted guests on untrusted hosts on Intel servers. It requires paravirtualization to do any required emulation inside the guest. There are some unique challenges, in particular in hardening the Linux guest code against untrusted host input through MMIO, port and other I/O, which is a new security challenge for Linux. The guest has to "accept" all memory, and to get acceptable boot performance this acceptance has to be done lazily. We'll give an overview of the current TDX status, talk about the challenges and hope for a good discussion.

        Speakers: Andi Kleen, Sathyanarayanan Kuppuswamy, Elena Reshetova
      • 08:25
        Break
      • 56
        Debug Support for Confidential VMs

        Debug Support for AMD SEV Encrypted VMs.

        Discussion on QEMU debug support for memory encrypted guests like AMD SEV/Intel TDX.
        Debug requires access to the guest pages, which are encrypted when SEV/TDX is enabled.

        Discussion on exploring common interfaces which can be re-used for both
        AMD SEV and Intel TDX platforms with regard to encrypted guest memory access for
        debug in Qemu.

        Latest posted patches on qemu-devel list from the Intel TDX team:
        [RFC][PATCH v1 00/10] Enable encrypted guest memory access in QEMU
        https://lore.kernel.org/qemu-devel/20210506014037.11982-1-yuan.yao@linux.intel.com/

        Link to the last posted patch-set from AMD:
        https://lore.kernel.org/qemu-devel/cover.1605316268.git.ashish.kalra@amd.com/

        Original discussion thread on qemu-devel list :
        https://lore.kernel.org/qemu-devel/20200922201124.GA6606@ashkalra_ubuntu_server/

        Speaker: Ashish Kalra
      • 57
        Running Confidential Containers

        Nowadays, containers are a private and public cloud commodity. Isolating and protecting containerized workloads not only from each other but also from the infrastructure owner is becoming a necessity.

        In this presentation we will describe how we’re planning to use confidential computing hardware implementations to build a confidential containers software stack. By combining the hardware encryption and attestation capabilities that these new ISAs provide, the proposed software architecture aims at protecting both container data (downloaded from container image registries and generated at runtime) and code from being seen or modified by cloud providers and owners.

        As the Kata Containers project already uses hardware virtualization to provide a stronger container isolation layer, we will first explain why and how we want to use the Kata runtime and agent as the foundation for running confidential containers.

        Then we will look at the container specific requirements that we need to take into account for building that software stack. Short boot times, low memory footprint or the inherently dynamic, ephemeral and asynchronous nature of container workloads are some of the technical challenges that we’re facing when it comes to running confidential containers. The final part of the talk will go through some of the technical solutions we’re building to address those challenges. In particular we will speak about:

        • Transparent memory and CPU state encryption: As the Kata runtime can run on top of heterogeneous nodes running different confidential computing implementations (TDX, SEV, etc.), we have to build a small framework for transparently enabling the underlying encryption technology whenever a confidential container is scheduled on a given node.

        • Container image service offload: The entire container software ecosystem assumes container image layers can be downloaded, unpacked and mounted from the host itself. This obviously breaks the confidential computing security model and that brings the need for offloading at least part of the container image management from the host to the guest.

        • User space attestation: Adding container images to the initial guest image and measurements can have a large impact on boot time and also break the dynamic and ephemeral nature of container workloads in e.g. a Kubernetes container orchestration context. As a consequence, container images must be downloaded, and either verified or decrypted, by the guest user space Kata agent, which then becomes responsible for triggering the container image credential provisioning through e.g. remote attestation.
        Speaker: Samuel Ortiz
      • 58
        Confidential Computing with Secure Execution (IBM Z)

        As confidential computing gains traction, several technologies that are based on a secure hypervisor are emerging.
        Besides SEV (AMD), PEF (Power), and TDX (Intel), IBM Z's Secure Execution enables running a guest that even an administrator cannot look into or tamper with.
        At the same time, it becomes desirable to run an OCI container workload in a secure context.

        The Kata Containers runtime is based on VMs and thus, Secure Execution can be leveraged.
        Initially, Kata had the goal of protecting the host from malicious guests, but the reverse approach is now being discussed and worked on; some patches have landed, while others are required in adjacent projects like containerd.

        I work in IBM's Linux on Z department, enabling Kata on the IBM Z and LinuxONE platform, including Secure Execution.
        I propose a talk where first, a general overview of Secure Execution is given: what the threat and security models are and how a user would go about running a protected workload.
        This helps the audience learn about a confidential computing solution that is distinct from discussed x86 approaches, in that images to be launched in Secure Execution are encrypted and can only be decrypted in a secure context, as opposed to x86 firmware attestation approaches.
        It is then described how Secure Execution maps to the challenges in confidential computing including Kata and Kubernetes, concerning the need to control and provide certain resources from the host.

        Note: Samuel Ortiz of Apple has also proposed to speak about general confidential computing challenges in Kata in this microconference.
        Even though I will introduce Kata and confidential computing so that the talk makes sense on its own, it is probably better if I speak after him.

        Speaker: Jakob Naucke (IBM Corp.)
      • 09:40
        Break
      • 59
        Deploying CVMs at scale via Linux

        We’ll enumerate pain points that we’ve encountered in deploying (or trying to deploy) Linux CVMs on Google’s public cloud, called Google Compute Engine (GCE), which is built on top of Linux. Example pain points include RMP violations crashing host machines, kexec and kdump not working on SNP-enabled hosts, guest kernel SWIOTLB bugs, incomplete/lacking test infrastructure, and more! Then, as a group, we can see what problems are interesting to the wider community, and discuss how to prioritize them.

        Speaker: Marc Orr (Google)
      • 60
        Attestation and Secret Injection for Confidential VMs & Containers/Pods

        Attestation is an important step in the setup of a confidential enclave in a public cloud environment. Through this process a guest owner can externally validate the software being run in their enclave before any confidential information is exposed. In this talk, we discuss the design and challenges of measuring and validating a guest enclave, and safely injecting guest owner secrets into the enclave. Our discussion will focus on the AMD SEV architectures (SEV, SEV-ES, and SEV-SNP) and how their hardware-enforced attestation and pre-attestation procedures map onto the deployment of guest VMs and confidential containers (i.e., Kata Containers).

        By attending this talk, you will gain an understanding of the attestation and measurement features of the AMD SEV architectures, as well as the challenges of doing attestation for confidential VMs and containers/Pods in a public cloud. In addition, we will overview other attestation approaches such as those of Intel TDX, SecureBoot, and other software-based techniques.

        Speakers: Jim Cadden (IBM Research), James Bottomley (IBM Research)
      • 61
        Securing trusted boot of confidential VMs

        Confidential Computing can enable several use-cases which rely on the ability to run computations remotely on sensitive data securely without having to trust the infrastructure provider. One required building block for this is verifiable control flow integrity on the remote machine: ensuring that the running compute is doing what it's supposed to.

        With hand-written Intel SGX the code surface is usually limited, but with a fully fledged VM it becomes more difficult. One example for this is securing the control flow of the VM's boot process:

        In the last years we have seen multiple projects securing the boot process of confidential VMs (cVMs) by allowing to boot from encrypted disk images. These approaches usually rely on the injection of a secret during the boot process. While this enables use-cases like hiding the raw disk image from the platform provider (i.e. the entity running the hardware and hypervisor stack), we are not capable of creating hardware-backed proofs of the measurement (i.e. hash) of the code (and data) which is being executed on that cVM, also called remote attestation.

        Modern x86 extensions (like AMD SEV-SNP, Intel TDX) allow measurement of the initial boot image before VM startup and cryptographically bind this measurement in a remotely verifiable attestation document.
        The question that now arises is how much code surface the initial measurement should contain and if existing firmwares/bootloaders can be used securely. Taking AMD SEV-SNP as reference hardware we implemented two working proof-of-concepts:
        a) Minimal firmware based on existing software: Here we leveraged the work which has been done on OVMF and grub for enabling booting of encrypted images. In that case the attested firmware only consists of the OVMF firmware and the grub bootloader. We extended grub to perform a measurement of the operating system (OS) image during loading and assert the measurement against the known good value baked into the attested firmware. In our case the verified Linux image is an EFI unified kernel image, which allows us to cryptographically bind the kernel image as well as the initramfs image and the kernel command line. This approach adds OVMF and grub to the audit surface and makes it hard to provide control flow guarantees. For example, without extra hardening the OVMF EFI shell or the grub shell can easily be used to load a malicious OS image.

        b) Custom firmware with OS embedding: In that case the attested firmware also contains the entire operating system. Using the Rust Hypervisor Firmware as a basis, we added support for linking a disk blob into the firmware. This is done by a virtual "block device" reading from a known in-memory location. With that we get a single measurement over the entire software stack, including the operating system (and potential applications/data). A caveat here is that so far we rely on an ELF binary to be loaded by the VMM (QEMU) and not a flat ROM image. Hence, the attested measurement of the hardware will deviate from the direct hash of the firmware file being loaded, requiring some extra steps to verify the attestation. Also, measuring a large firmware might be time consuming on the slow Secure Processor (found in AMD SEV-SNP).

        The Microconference topic we are proposing would consist of:

        • Explaining the problem statement: Control flow integrity of the VM boot process in a post-attestation world
        • Quickly going over the two approaches we have taken
        • Discussion of:
          • problems of our work and possible improvements
          • other on-going community efforts and collaboration opportunities
          • remaining attacks to break integrity (e.g. configuration provided by a malicious hypervisor,..)
        Speakers: Stefan Deml, Andras Slemmer (decentriq)
    • File Systems MC Microconference4/Virtual-Room (LPC Virtual)

      Microconference4/Virtual-Room

      LPC Virtual

      150

      The File system microconference focuses on a variety of file system related topics in the Linux ecosystem. Interesting topics about enabling new features in the file system ecosystem as a whole, interface improvements, interesting work being done, really anything related to file systems and their use in general. Often times file system people create interfaces that are slow to be used, or get used in new and interesting ways that we did not think about initially. Having these discussions with the larger community will help us work towards more complete solutions and happier developers and users overall.

      • 62
        Efficient buffered I/O

        Files are currently managed in PAGE_SIZE units. As DRAM and storage capacities increase, the overhead of managing all these pages becomes more significant. The memory folio patchset lets us cache files in larger units.

        In this session, we shall discuss:

        • Filesystem changes needed to work with folios instead of pages
        • Converting from buffer_heads to iomap
        • Using the netfs API
        • Future changes to the filesystem/page cache API
        Speaker: Matthew Wilcox (Oracle)
      • 63
        Idmapped Mounts

        File ownership is a global property on most systems that have a uid and gid concept. On POSIXy systems the chown*() syscall family allows changing the owner of a file or directory. If the ownership of a file is changed, it is changed globally, affecting each user on the system equally. But various use-cases exist where this can be problematic:
        - Portable home directories that are used on different computers where the user is assigned a different uid and gid.
        - Filesystems that allow merging or unionizing multiple filesystems are often shared between different users.
        - Containers making use of user namespaces also affect file ownership.
        - Avoiding the cost of recursive ownership changes.
        Idmapped mounts solve these problems and others by allowing mounts to change file ownership. In this talk we will take a look at how idmapped mounts work, outline the work we've done, what is still left to do, and potential new ideas to make this an even more powerful concept.
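
        For reference, here is a minimal sketch of creating an idmapped mount with mount_setattr(2) (Linux 5.12 or later, recent kernel headers and glibc assumed; error handling omitted). The userns_fd is assumed to refer to an already configured user namespace whose ID mappings define the translation applied to the mount.

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <linux/mount.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          int make_idmapped_mount(const char *source, const char *target,
                                  int userns_fd)
          {
              struct mount_attr attr = {
                  .attr_set  = MOUNT_ATTR_IDMAP,
                  .userns_fd = userns_fd,
              };

              /* Detach a new mount for @source, apply the ID mapping, attach it. */
              int fd_tree = syscall(SYS_open_tree, AT_FDCWD, source,
                                    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);

              syscall(SYS_mount_setattr, fd_tree, "", AT_EMPTY_PATH, &attr,
                      sizeof(attr));
              syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, target,
                      MOVE_MOUNT_F_EMPTY_PATH);
              close(fd_tree);
              return 0;
          }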

        Speaker: Mr Christian Brauner
      • 08:15
        Break
      • 64
        Atomic file writes: Who really wants this?

        I would like to chair a discussion at LPC about atomic file writes for userspace applications. Do we want to expose such a capability to programs, and if so, how?

        I propose that filesystem implementations provide a general-purpose interface in software. As proposed, the FIEXCHANGE_RANGE system call requires the ability to exchange the contents of two files, with a promise that once we commit to the exchange, it must either succeed completely or not take effect at all.

        Atomic file writes can be performed by creating a temporary file, cloning the contents, making arbitrary updates to the temporary file, and calling FIEXCHANGE_RANGE to commit the changes. There are no restrictions on length, number of updates, etc.
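
        A sketch of that workflow is shown below. FICLONE is the existing reflink ioctl; the commit step is left as a commented-out placeholder because the exact FIEXCHANGE_RANGE ABI is what is being proposed and may change. The temp file is assumed to live on the same filesystem as the original.

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <linux/fs.h>       /* FICLONE */
          #include <sys/ioctl.h>
          #include <unistd.h>

          int atomic_update(const char *path)
          {
              int fd  = open(path, O_RDWR);
              int tmp = open(".", O_TMPFILE | O_RDWR, 0600);

              ioctl(tmp, FICLONE, fd);    /* 1. clone contents into a temp file */

              /* 2. make arbitrary updates to the temp file here ... */

              /* 3. commit: exchange the two files' contents atomically.
               *    Hypothetical placeholder for the proposed FIEXCHANGE_RANGE:
               *    ioctl(fd, FIEXCHANGE_RANGE, &args_describing(tmp)); */

              close(tmp);
              close(fd);
              return 0;
          }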

        The ability to exchange the contents of files atomically is a requirement for online repair of XFS metadata; upon finishing the functionality I realized that we could expose it to userspace to provide atomic file updates.

        NOTE: This is a separate topic from enabling userspace to access hardware atomic writes. That is a simple matter of making the advertised device capabilities (and alignment/size restrictions) discoverable and adding a flag to io_uring/pwritev2 for directio writes.

        Speaker: Darrick Wong (Oracle)
      • 65
        File System Shrink

        File system shrink allows a file system to be reduced in size by some specified number of blocks, as long as the file system has enough unallocated space to do so. This operation is currently unsupported in XFS. Though a file system can be backed up and recreated at a smaller size, this is not functionally the same as an in-place resize. Implementing this feature is costly in terms of developer time and resources, so it is important to consider the motivations to implement this feature. This talk aims to discuss user stories for this feature. What are the possible cases for a user needing to shrink the filesystem after creation, and by how much? Can these requirements be satisfied with a simpler mkfs option to back up an existing file system into a new but smaller filesystem? In the case of creating a rootfs, will a protofile suffice? If the shrink feature is needed, we should further discuss the APIs that users would need.

        Beyond the user stories, it is also worth discussing implementation challenges. Reflink and parent pointers can assist in facilitating shrink operations, but is it reasonable to make them requirements for shrink? Gathering feedback and addressing these challenges will help guide future development efforts for this feature.

        Speaker: Allison Henderson
      • 09:30
        Break
      • 66
        Bad Storage vs. Filesystems

        The focus of this session is on mitigating the effects of unreliable storage devices. This author works at a cloud vendor (as is fashionable now), and one of the large story arcs of the past few years has been that storage devices do not seem as reliable as we thought even a few years ago.

        Specifically, I've observed that as the world moves from direct-attached spinning rust to software-defined storage on cheap devices, we increasingly must deal with large devices that corrupt data, temporarily stop responding (due to problems on the network/control plane/hypervisor/whatever), or have some odd means to request re-reads.

        XFS sort of mitigates some of these problems by enabling sysadmins to configure its response to certain kinds of hardware errors (mostly EIO and ENOSPC). Other filesystems lack these control knobs; how might we standardize them? The block layer has some retry capabilities, but no filesystems touch them. We don't have a general corrupted-read retry mechanism, and have not succeeded in adding one.

        So what I want to know is: Who cares? Are sysadmins and users happy with the current patchwork? Do they accept the defaults? Would they like more control or better communication between layers?

        Speaker: Darrick Wong (Oracle)
      • 10:15
        Break
      • 67
        XFS Roadmap Planning

        This is a BOF for people to get together to discuss unresolved issues in the community and to talk about the roadmap for new feature development and ongoing technical debt payoff. We have not had such a forum since LSFMM in 2018.

        Roadmap topics include:

        • Shrink
        • Online Repair
        • Reflink and Reverse Mapping on the Realtime Device
        • The future of RT and Quota
        • Parent Pointers
        • Centralizing Administration Tools

        This forum is open to all filesystem developers, though the focus is very obviously on XFS.

        Speaker: Darrick Wong (Oracle)
    • GNU Tools Track GNU Tools track/Virtual-Room (LPC Virtual)

      GNU Tools track/Virtual-Room

      LPC Virtual

      150
      • 68
        gprofng - The next generation GNU profiler

        Prepared presentation

        In this talk we present an overview of gprofng, a next generation profiling tool for Linux.

        This profiler has its roots in the Performance Analyzer from the Oracle Developer Studio product. Gprofng is a standalone tool however and specifically targets Linux. It includes several tools to collect and view the performance data. Various processors from Intel, AMD, and Arm are supported.

        The focus is on applications written in C, C++, Java, and Scala. For C/C++ we assume gcc has been used to build the code. In the case of Java and Scala, OpenJDK and compatible implementations are supported.

        Among other things, a difference with the widely known gprof tool is that gprofng offers full support for shared libraries and for multithreading using POSIX Threads, OpenMP, or Java Threads.

        Unlike gprof, gprofng can also be used when the source code of the target executable is not available. Another difference is that gprofng works with unmodified executables: there is no need to recompile or instrument the code. This ensures that the profile reflects the actual run-time behaviour and conditions of a production run.

        After the data has been collected, the performance information can be viewed at the function, source, and disassembly level. Individual thread views are supported as well. Through command line options, the user specifies the information to be displayed. In addition, a simple yet powerful scripting feature can be used to produce a variety of performance reports in an automated way. This may also be combined with filters to zoom in on specific aspects of the profile.

        One of the very powerful features of gprofng is the ability to compare two or more profiles. This allows for an easy way to spot regressions for example.

        In the talk, we start with a description of the architecture of the gprofng tools suite. This is followed by an overview of the various tools that are available, plus the main features. A comparison with gprof will be made and several examples are presented and discussed. We conclude with the plans for future developments. This includes a GUI to graphically navigate through the data.

        Speakers: Mr Ruud van der Pas (Oracle), Mr Vladimir Mezentsev (Oracle)
      • 69
        Complex Divide Improvements in libgcc

        This talk will discuss the methods used in constructing the recent improvement to complex divide in libgcc, where the gross error rate dropped from more than 1 per 100 tests to less than 1 per 10 million tests. The change in accuracy is platform independent, while the modest performance loss varies with platform. We also discuss flaws and likely areas for further reducing the remaining small errors.
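
        The abstract does not spell out the algorithm, but the kind of gross error involved is easy to demonstrate: the textbook formula computes c*c + d*d, which can overflow even when the true quotient is perfectly representable. The snippet below only illustrates the failure mode; it is not libgcc's implementation.

          #include <stdio.h>

          int main(void)
          {
              double a = 1e308, b = 1e308;    /* numerator   a + b*i */
              double c = 1e308, d = 1e308;    /* denominator c + d*i */

              double denom = c * c + d * d;           /* overflows to +inf */
              double re = (a * c + b * d) / denom;    /* inf / inf -> nan */
              double im = (b * c - a * d) / denom;    /* (inf - inf) / inf -> nan */

              /* Prints nan + nan*i although the exact quotient is 1 + 0*i. */
              printf("%g + %g*i\n", re, im);
              return 0;
          }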

        Speaker: Patrick McGehearty (Oracle)
      • 70
        Limitations of tuning glibc malloc on larger systems.

        The malloc library provided by glibc offers considerable flexibility in deciding when to use mmap for larger allocations and when to use sbrk/trim. The default settings for the decision thresholds are reasonable for many applications. Three tunables are available to adjust these settings, but the limits on these settings have not been changed since 2006. Server-class systems now have much more memory available, and other performance tradeoffs have changed dramatically in the last 15 years. We propose significant increases to the limits on the MALLOC_MMAP_THRESHOLD_ tunable (current default 128K; current maximum 32M). This change will not affect existing usage while allowing select applications to improve their malloc performance, sometimes dramatically.
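
        For illustration, the per-process knob behind this tunable can also be set programmatically; the sketch below (an assumption about typical usage, not material from the talk) raises the mmap threshold so that mid-sized allocations stay on the sbrk-grown heap instead of going to individual mmap() calls. The talk's point is that the corresponding tunable currently caps such requests at 32M.

        #include <malloc.h>
        #include <stdlib.h>

        int main(void)
        {
            /* Serve allocations below 32 MiB from the main heap rather than
             * per-allocation mmap(); the proposal is about letting tuning go
             * well beyond today's 32M ceiling on larger systems. */
            mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);

            void *p = malloc(16 * 1024 * 1024);   /* now heap-backed */
            free(p);
            return 0;
        }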

        Speaker: Patrick McGehearty (Oracle)
      • 71
        Indirect External Access

        On systems with copy relocation:
        - A copy in the executable is created for a definition in a shared library at run-time by ld.so.
        - The copy is referenced by the executable and shared libraries.
        - The executable can access the copy directly.

        Issues are:
        - The time and space overhead of the copy may be visible at run-time.
        - Read-only data in the shared library becomes a read-write copy in the executable at run-time.
        - Local access to data with STV_PROTECTED visibility in the shared library must use the GOT.

        On systems without function descriptors, function pointers vary depending on where and how the functions are defined.

        • If the function is defined in the executable, it can be the address of the function body.
        • If the function, including a function with STV_PROTECTED visibility, is defined in a shared library, it can be the address of the PLT entry in the executable or shared library.

        Issues are:
        - The address of the function body may not be usable as its function pointer.
        - ld.so needs to search the loaded shared libraries for the function pointer of a function with STV_PROTECTED visibility.

        Here is a proposal to remove copy relocation and use canonical function pointers:
        1. Accesses, including in PIE and non-PIE, to undefined symbols must use the GOT.
        2. Read-only data in the shared library remains read-only at run-time.
        3. The address of global data with STV_PROTECTED visibility in the shared library is the address of the data body.
        4. For systems without function descriptors:
        - All global function pointers of undefined functions in PIE and non-PIE must use the GOT.
        - The function pointer of a function with STV_PROTECTED visibility in the executable and shared library is the address of the function body.
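
        A minimal illustration of the copy-relocation case the proposal removes (a sketch under assumed build flags, not part of the proposal text):

        /* libfoo.c -- built with: gcc -O2 -fPIC -shared -o libfoo.so libfoo.c */
        int foo_table[4] = { 1, 2, 3, 4 };

        /* main.c -- built with: gcc -O2 -fno-pie -no-pie main.c ./libfoo.so
         * Today this direct reference typically gets a copy relocation: the
         * linker reserves a writable copy of foo_table in the executable and
         * ld.so fills it in at startup, so the library's data cannot stay
         * read-only. Under the proposal the access is compiled to go through
         * the GOT instead and no copy is created. */
        extern int foo_table[4];

        int main(void)
        {
            return foo_table[2];
        }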

        Speaker: H.J. Lu (Intel)
      • 09:00
        Coffee Break
      • 72
        Enable Intel LAM in Linux

        Intel LAM (Linear Address Masking) Extension allows software to locate metadata in data pointers and dereference them without needing to mask the metadata bits. It supports:

        • LAM_U48: Activate LAM for user data pointers and use of bits 62:48 as masked metadata.
        • LAM_U57: Activate LAM for user data pointers and use of bits 62:57 as masked metadata.

        I am presenting a proposal to enable Intel LAM in Linux:

        1. Only a LAM-enabled Linux on LAM processors can provide LAM features.
        2. Every piece of the OS must be LAM enabled, starting with the kernel, toolchain, libraries, …
        3. A binary is LAM enabled only if all its components are LAM enabled.
        4. A LAM-enabled OS is backward compatible. The same LAM-enabled OS binary can run on LAM and legacy processors.
        5. Provide LAM features only on LAM processors.
        6. Minimal performance loss on legacy processors.
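
        To make the LAM_U57 bit layout above concrete, here is a minimal user-space sketch (illustrative only; the macros and helper names are not from the proposal). On a LAM-enabled kernel and CPU the tagged pointer could be dereferenced directly; the sketch masks it explicitly so it also runs on legacy processors.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* LAM_U57 keeps 6 bits of metadata in bits 62:57 of a user pointer. */
        #define LAM_U57_SHIFT 57
        #define LAM_U57_MASK  (0x3FULL << LAM_U57_SHIFT)

        static inline void *tag_ptr(void *p, uint64_t tag)   /* tag: 6 bits */
        {
            return (void *)(((uint64_t)p & ~LAM_U57_MASK) | (tag << LAM_U57_SHIFT));
        }

        static inline void *untag_ptr(void *p)
        {
            return (void *)((uint64_t)p & ~LAM_U57_MASK);
        }

        int main(void)
        {
            int *obj = malloc(sizeof(*obj));
            int *tagged = tag_ptr(obj, 0x2A);

            /* With LAM_U57 active the masking below would be unnecessary. */
            *(int *)untag_ptr(tagged) = 1;
            printf("%d\n", *obj);
            free(obj);
            return 0;
        }
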
        Speaker: H.J. Lu (Intel)
      • 73
        OpenACC "kernels" improvements

        The existing implementation of the OpenACC "kernels" construct in GCC
        is unable to cope with many language constructs found in real HPC
        codes, which generally leads to very poor performance. This talk
        presents upcoming changes to the "kernels" implementation that improve
        the performance significantly (a minimal example of such a region is
        shown after the list below):

        • A more unified internal representation of "kernels" and "parallel"
          regions as a foundation for the other improvements.
        • Data-dependence analysis based on Graphite.
        • Improvements to Graphite (e.g. runtime alias checking) to enable its
          use on more code.
        • Language Frontend (e.g. delinearization of array accesses for
          Fortran) and Middle-end changes (e.g. an "omp_data_optimize" pass to
          derive synthetic OpenACC "private" clauses on "kernels") that enable
          Graphite to analyze more code.
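
        As referenced above, a minimal example of the kind of "kernels" region involved (illustrative only; whether it can be parallelized depends on the data-dependence analysis described in this talk):

        /* Build with: gcc -fopenacc -O2 saxpy.c */
        void saxpy(int n, float a, const float *restrict x, float *restrict y)
        {
            #pragma acc kernels
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
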
        Speaker: Frederik Harwath (Siemens EDA)
      • 74
        BoF: Offloading with OpenMP & OpenACC

        BoF to discuss topics related to concurrency and offloading work onto AMD and NVIDIA accelerators using OpenMP and OpenACC.

        In particular, the implementation of missing OpenMP 5.0 & 5.1 features, including memory allocators, unified shared memory, C++ attributes, etc.

        Related topics and trends can also be discussed, be they base-language concurrency features, offloading without using OpenMP/OpenACC, or other accelerators.

        Speakers: Andrew Stubbs (Mentor Graphics / CodeSourcery), Tobias Burnus (Mentor, A Siemens Business), Jakub Jelínek (Red Hat Czech s.r.o.), Thomas Schwinge (Siemens Digital Industries Software)
    • LPC Refereed Track Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 75
        Understanding motivations, goals and challenges faced by the Linux Kernel contributors

        Motivation to contribute and barriers faced by newcomers and contributors to join and stay in Open Source Software projects have been intriguing researchers since the early 2000s. The literature on motivation was updated by recent work, which showed that for more than 55% of the contributors who answered the questionnaire, motivation shifted after joining. Those contributors joined OSS for one reason and continued for another. The world is dynamic, and so are we. If the reasons to participate can change from past to present, what about the future? The Linux kernel community wants to keep contributors around by understanding what they seek for their future, which ultimately influences the community's sustainability.
        We will present results from a survey of Linux kernel contributors that was created to understand why people participate in Linux kernel projects, their goals for the future, and what would make them leave or continue contributing. The survey is part of a Diversity & Inclusion initiative to attract and retain a diverse set of contributors in the Linux kernel.

        Speaker: Bianca Trinkenreich (Northern Arizona University)
      • 76
        Adding features to perf using BPF

        The availability of BPF allows the improvement of preexisting perf features or the
        addition of new ones without requiring kernel changes.

        The first use of BPF to augment perf is to use BPF programs to profile other BPF
        programs with 'perf stat'; this is already upstream and set the stage for
        further uses. It provides functionality similar to 'bpftool prog profile' while reusing
        lots of 'perf stat' features that were developed and improved by the perf tooling
        community.

        Then came bperf, which shares hardware performance counters and aggregates data in BPF
        maps that are then read by 'perf stat' as if they were normal perf events, reusing
        all the perf tooling features.

        Some improvements, such as scaling cgroup perf monitoring, were first attempted by
        modifying the kernel. After several attempts, one is now being made using BPF, with
        encouraging results. It works by hooking into cgroup scheduling and doing aggregation
        that is made available to 'perf stat' via bperf.

        Such use of BPF for aggregating information in the kernel instead of changing the perf
        subsystem was well received by a perf kernel maintainer, which is encouraging.

        Future work will use BPF to enable perf_events when some specific trigger condition
        takes place, so that only a window determined by two probes gets sampled.

        Also being considered is the conversion of some perf subcommands that analyze tracepoints
        like perf sched/lock/etc to use BPF to aggregate things in the kernel instead of passing
        vast amounts of data for aggregation in userspace while keeping the existing, familiar
        tooling interface.

        This shows how the perf and BPF communities are working together to improve Linux tooling,
        provide ways to scale profiling, and improve the observability of BPF programs. We
        expect that presenting this talk will bring suggestions for further improvements.

        Speakers: Arnaldo Carvalho de Melo (Red Hat Inc.), Song Liu, Namhyung Kim
      • 77
        Overview of memory reclaim in the current upstream kernel

        It is generally known that Linux memory reclaim uses LRU-ordered lists to decide which pages to evict to free memory for new pages. It might be less known that there are separate lists for file (page cache) and anonymous pages, and that both are further split into active and inactive parts. There are, however, lots of subtle details of how the relative sizes of these four lists are balanced, and things have also changed recently with, e.g., the addition of workingset refault detection.
        This talk will summarize the current reclaim implementation in detail, and also major proposed changes such as multigenerational LRU.

        Speaker: Vlastimil Babka (SUSE)
      • 78
        Strange kernel performance changes - analysis and mitigation

        The 0day bot has reported many strange kernel performance changes in which the bisected culprit commit has nothing to do with the benchmark, which makes patch authors confused or even annoyed. Debugging shows these are mostly related to random code/text alignment changes, false sharing, or adjacent-cacheline prefetch caused by the commit, as all components of the kernel are flatly linked together.

        Around 20 reported cases have been checked (all discussed on LKML, like [1][2][3][4]), and this talk will try to:
        * analyze and categorize these cases
        * discuss the debug methods to identify and root cause
        * discuss ideas about how to mitigate them and make kernel performance more stable.

        Some patches have been merged, some are to be posted, and some are under development and test. We will discuss them and seek advice/feedback.

        [1].https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
        [2].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
        [3].https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
        [4].https://lore.kernel.org/lkml/20210420030837.GB31773@xsang-OptiPlex-9020/
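
        As a concrete illustration of the false-sharing category above (a generic example, not taken from the reported cases):

        /* Two hot counters updated from different CPUs: in the first layout
         * they share one 64-byte cache line and the line ping-pongs between
         * cores; in the second layout each field gets its own line. */
        struct counters_bad {
            unsigned long a;    /* written by CPU 0 */
            unsigned long b;    /* written by CPU 1, same cache line as 'a' */
        };

        struct counters_good {
            unsigned long a __attribute__((aligned(64)));
            unsigned long b __attribute__((aligned(64)));
            /* with adjacent-cacheline prefetch, 128-byte separation may be
             * needed to fully decouple the two fields */
        };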

        Speaker: Mr Feng Tang
    • RISC-V MC Microconference3/Virtual-Room (LPC Virtual)

      Microconference3/Virtual-Room

      LPC Virtual

      150

      The RISC-V microconference focuses on the development of RISC-V.

      • 79
      • 80
        The RISC-V platform specification

        The RISC-V platform specification[1] describes a minimum set of hardware/software requirements to ensure the interoperability of software across compatible platforms. Currently, it defines two platforms: the OS-A platform, capable of booting rich operating systems such as Linux, FreeBSD, and Windows, and the M platform, aimed at RTOS and bare-metal use. The platform specification is currently under public review and we are collecting feedback from the RISC-V community. We would like to discuss the specification in this forum as well, to get broader feedback, which is imperative for the success of the platform specification.

        [1] https://github.com/riscv/riscv-platform-specs/blob/main/riscv-platform-spec.adoc

        Speakers: Atish Patra (Western Digital), Mr Kumar Sankaran (Ventana Micro), Rahul Pathak, Mayuresh Chitale
      • 81
        Next Generation RISC-V Interrupt Support

        The RISC-V Advanced Interrupt Architecture (RISC-V AIA) and RISC-V Advanced CLINT (RISC-V ACLINT) are non-ISA specifications which define next generation interrupt controller, timer, and inter-processor interrupt (IPI) devices for RISC-V platforms. The RISC-V AIA and ACLINT devices will support wired interrupts, message signaled interrupts (MSIs), virtualized message signaled interrupts (virtual MSIs), flexible machine-level timer, machine-level IPIs, and supervisor-level IPIs.

        Both the RISC-V AIA and ACLINT specifications are in the final stages of ratification and have been validated using QEMU, OpenSBI, Linux RISC-V, and Linux RISC-V KVM. This talk will give an overview of the RISC-V AIA and ACLINT specifications, detailed software status, and open items.

        Speaker: Anup Patel (Western Digital)
      • 08:15
        break
      • 82
        ACPI for RISC-V

        The RISC-V platform specification mandates the Advanced Configuration and Power Interface (ACPI) as the hardware discovery mechanism for server-class platforms. Some new ACPI tables need to be defined for RISC-V, and code changes are required in QEMU, TianoCore (EDK2), and the OS. This is still a work in progress, but the talk will provide details about the planned specification updates and a demo of a basic ACPI-enabled Linux kernel booting on the QEMU virt platform.

        Speaker: Sunil V L
      • 83
        What's the problem with D1 Linux upstream?

        D1 is Allwinner's first SoC based on the RISC-V ISA. It integrates Alibaba T-Head's 64-bit C906 core, supports RVV, and runs at 1 GHz. Because some of its features are not covered by the RISC-V specifications, upstreaming Linux support has run into some problems. Let's review and discuss the issues:

        1. Birdview of D1 & current status of the drivers (By Shaohua)
        2. About the custom PBMT (Page Based Memory Type) in D1 for the non-coherent SoC
        3. About DMA sync operations in D1
        4. About I-cache synchronization's acceleration in D1
        5. About vector 0.7.1 supported in D1
        6. About TLB synchronization's acceleration for T-HEAD c9xx-SMP
        7. Discuss the ALTERNATIVE framework
        8. Q & A (by Liu Shaohua, Guo Ren, Fu Wei)

        Items 2 & 3 are the minimum requirements for D1 bring-up, so let's focus on them first. Items 4-6 could help D1 work better, and we will just have a quick review of them. Item 7 is about the ALTERNATIVE discussion, e.g. how we use errata_list.h for dma_sync ops.

        Speakers: Mr Ren Guo, Mr Wei Fu, Mr Shaohua Liu
      • 09:45
        break
      • 84
        Puzzle for RISC-V ifunc

        ifunc is a widely used mechanism for specializing performance-critical
        functions in glibc, like memcpy, strcmp, and strlen.
        It's not used in upstream glibc for RISC-V yet, but with several new
        extensions becoming ratified soon, users will want
        vector-optimized routines to boost their programs.
        It's a generic piece of GNU toolchain infrastructure, so in theory we
        don't need much work to enable it, but the real world isn't so
        wonderful…

        Here is the list of puzzle pieces for RISC-V ifunc; some are already in
        place and some are missing:
        - Relocation for ifunc.
        - https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/131
        - Mapping symbol.
        - https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/196
        - New asm directive to enable/disable any extension in a specific code
        region (like .option rvc/.option norvc, but a more generic one).
        - https://github.com/riscv-non-isa/riscv-asm-manual/pull/67
        - New function target attribute for C/C++
        - e.g. int sse3_func (void) __attribute__ ((target ("sse3")));
        - hwcap and hwcap2

        Most items are toolchain work, but the last item, hwcap, requires
        coordination between glibc and the Linux kernel to implement a new
        mechanism for discovering the machine's capabilities.
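
        For reference, a minimal GNU ifunc sketch in C (function and capability names are hypothetical; on RISC-V the capability check is exactly the missing hwcap piece discussed above):

        #include <stddef.h>

        static void *memcpy_scalar(void *dst, const void *src, size_t n)
        {
            char *d = dst; const char *s = src;
            while (n--) *d++ = *s++;
            return dst;
        }

        static void *memcpy_vector(void *dst, const void *src, size_t n)
        {
            /* stand-in for a vector-optimized implementation */
            return memcpy_scalar(dst, src, n);
        }

        /* Hypothetical capability query; a real one would consult hwcap. */
        static int have_vector(void) { return 0; }

        /* The resolver runs at load time and returns the chosen function. */
        static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
        {
            return have_vector() ? memcpy_vector : memcpy_scalar;
        }

        void *my_memcpy(void *dst, const void *src, size_t n)
            __attribute__((ifunc("resolve_my_memcpy")));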

        Speakers: Kito Cheng (SiFive), Palmer Dabbelt (Google)
      • 85
        Towards continuous improvement of code-generation for RISC-V

        Architectures competing with RISC-V have expended considerable time and resources on optimizing their development tools for improved performance on industry-standard benchmarks. For the future growth of the RISC-V ecosystem, a concerted effort to optimize the generated code for performance will be required. This effort will in a large part be independent of the underlying microarchitecture and can be distributed across our entire ecosystem, if we develop the necessary tools and infrastructure to assess for gaps, distribute the work and cooperate.

        We propose a data-driven methodology, based on the gathering and comparison of hot-block information, instruction-type histograms and dynamic instructions counts, to evaluate the performance of compilers for RISC-V using qemu. Based on example findings and data, we will illustrate the proposed workflow and how it can allow the prioritisation of potential optimizations based on an expected gain.

        We aim to motivate increased cooperation and the creation of a data-driven workflow built around standard tools (primarily plugins for qemu and analysis tools) to continuously monitor and improve the quality of the RISC-V compilers.

        Speakers: Philipp Tomsich (VRULL GmbH), Christoph Müllner (SBA Research)
    • Real-time MC Microconference1/Virtual-Room (LPC Virtual)

      Microconference1/Virtual-Room

      LPC Virtual

      150

      The Real-time microconference focuses on finishing the last lap of getting the PREEMPT_RT patch set into mainline. Many of the missing pieces, however, are not in the core real-time features (like locking and scheduling) but in other subsystems that compose the kernel, like file systems and memory management. Making these Linux subsystems compatible with PREEMPT_RT requires finding solutions that are acceptable to the subsystem maintainers, without having these subsystems suffer from performance or complexity issues.

      • 07:00
        Welcome

        Welcome Message

      • 86
        Maintaining PREEMPT_RT: now and then

        This topic will present the current workflow for maintaining PREEMPT_RT, and
        discuss the challenges of maintaining the PREEMPT_RT mode once the merge is done.

        Speaker: Sebastian Andrzej Siewior (Linutronix)
      • 87
        rtla: an interface for osnoise/timerlat tracers

        The osnoise and timerlat tracers landed in 5.14.

        In addition to the tracing aspects, these two tracers also report performance metrics relevant to real-time. However, it is not easy to manually parse these metrics.

        rtla (real-time Linux analysis) is a user-space interface for these tracers. It works by using libtracefs to set up a tracing session and to collect data and trace information. It has an intuitive interface and will also serve as the basis for other real-time related tracers.

        The idea is to present and discuss this new tool in this MC topic.

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 08:15
        Break

        This slot is reserved for a break, or a chat if you so wish.

      • 88
        Linux kernel support for kernel thread starvation avoidance

        ABSTRACT

        Running CPU-intensive high-priority real-time applications on a
        real-time Linux kernel (based on the PREEMPT_RT patchset) can lead to
        situations where the kernel's own housekeeping tasks such as per-cpu
        kernel threads get starved out, resulting in system instability
        (hangs/unresponsive system). The Real-Time Throttling feature in the
        Linux kernel is ineffective in addressing this problem as it does not
        protect low-priority real-time kernel threads (such as ktimersoftd).
        The stalld userspace daemon was introduced to solve this problem, and
        is quite effective in principle; but it has a number of limitations
        that makes it hard to use in practice, especially in production
        deployments. We propose implementing stalld-like starvation avoidance
        for kernel threads directly in the Linux kernel, to address all the
        practical limitations of stalld. This design scales well with the
        number of CPUs, has minimal monitoring overhead (CPU usage), and
        compartmentalizes the fault-domain such that a misbehaving or
        misconfigured real-time application does not bring down the entire
        system.

        INTRODUCTION

        The Telco industry is undergoing a major revamp of its infrastructure
        at the edge (cell towers) as well as the core datacenter, in order to
        meet the demands of 5G networking. As part of this effort, the
        underlying infrastructure called the Radio Access Network (RAN), which
        was traditionally implemented in hardware (FPGAs) for low-latency
        predictable real-time response, is being replaced with
        software-defined RAN applications running on real-time Linux kernel
        (PREEMPT_RT). These soft real-time applications involve running
        CPU-intensive high-priority real-time tasks, to meet the stringent
        latency requirements as defined by the 5G/3GPP specification.

        There are a number of challenges that the Linux real-time stack needs
        to address to support this new class of workloads. This proposal
        focuses on system stability issues when running CPU-intensive
        high-priority real-time applications on the PREEMPT_RT Linux kernel
        and highlights the open issues and proposes a novel design to address
        the limitations of existing solutions by implementing kernel-thread
        starvation avoidance in the Linux kernel.

        PROBLEM STATEMENT

        In the Telco/5G Radio Access Network (RAN) usecase, deploying the
        application involves running high-priority CPU hogs such as "L1 app"
        (based on Intel's FlexRAN and DPDK Poll-Mode-Driver). These
        latency-sensitive tasks are bound to isolated CPUs and they run
        infinite polling loops (in userspace) with high real-time priority
        (typically SCHED_FIFO/90+). In this scenario, even if the L1 app RT
        tasks don't invoke kernel services by themselves, generic (non-RT)
        workloads running on non-isolated CPUs (such as Kubernetes control
        plane tasks) can cause per-CPU kernel threads to wake up on every CPU.
        However, such kernel threads on isolated CPUs running the L1 app RT
        tasks will get starved out, since the L1 app never yields the CPU.

        One of the consequences of starving out essential kernel threads is
        system-wide hangs. As an example, if a container gets destroyed (from
        non-isolated CPUs), the corresponding network namespace teardown code
        in the Linux kernel queues callbacks on per-CPU kworkers, and invokes
        flush_work(), thus expecting the per-CPU kworker on every CPU to
        participate in the teardown mechanism. As a result, the container
        destroy will get hung indefinitely due to kthread starvation on CPUs
        running the L1 app RT tasks. Furthermore, since this code path holds
        the rtnl_lock, any other task that touches kernel networking will end
        up getting stuck in uninterruptible sleep ('D' state) too (eg: sshd,
        ifconfig, systemd-networkd etc.), thus cascading to a system-wide
        hang.

        This pattern of kernel subsystems invoking per-CPU kernel threads for
        synchronization is quite pervasive throughout the Linux kernel, and
        the resulting kthread starvation issues go well beyond the specific
        networking scenario highlighted above. Furthermore, even essential
        real-time configuration tools and debugging utilities such as tuned
        and ftrace/trace-cmd themselves rely on kernel interfaces that can
        induce such starvation issues.

        EXISTING SOLUTIONS AND LIMITATIONS

        At the LPC 2020 Real-Time microconference, the community tried to address
        the problem of system instability caused by running CPU-intensive
        high-priority real-time applications by introducing stalld. The
        stalld userspace daemon monitors the system for starving tasks (both
        userspace and kernel threads), and revives them by temporarily
        boosting them using the SCHED_DEADLINE policy. It achieves this
        revival and system stability by operating within tolerable bounds of
        OS-jitter as configured by the user.

        We have been using stalld along with RAN applications and it has been
        quite effective in maintaining system stability. However, we have also
        come across a number of limitations in stalld, owing to its design as
        well as the choice to implement starvation monitoring and boosting in
        userspace. We would like to bring out stalld's pain-points and then
        discuss a prototype that we have developed to address these concerns,
        by implementing stalld-like kernel-thread starvation avoidance
        directly in the Linux kernel.

        Limitations of stalld in resolving kthread starvation:

        1. Stalld does not scale with the number of CPUs

        Stalld spawns a pthread for every CPU to monitor and boost starved
        tasks on the respective CPU. However, in RAN usecases, due to the
        use of CPU isolation, all of stalld's threads are forced (bound) to
        run only on the housekeeping CPUs, which are typically a small
        subset of the available CPUs in the system. For example, on a 20 CPU
        server with CPUs 2-19 isolated to run RT tasks, potentially 20
        stalld threads compete for CPU time on housekeeping CPUs 0-1, trying
        to monitor and boost starved tasks on all the 20 CPUs.

        2. Stalld can get starved itself

        Since stalld runs as a normal priority task, higher priority tasks
        (or even a high volume of similar priority tasks) running on the
        housekeeping CPUs can starve out stalld itself. Attempting to solve
        this problem by turning stalld into an RT application is risky, as
        it can make the situation worse -- since all of stalld's per-CPU
        monitoring threads put together can potentially consume all the
        available CPU time on the housekeeping CPUs (depending on how
        aggressive the stalld configuration is), real-time stalld can end up
        causing starvation itself!

        3. Stalld's logging is unreliable

        On systemd-based Linux installations, stalld logs its output related
        to starvation conditions and boosting events to journalctl logs via
        systemd-journald. However, in most situations involving system-wide
        hangs, systemd-journald gets stuck in uninterruptible state too,
        leaving no trace of stalld's execution flow and boosting decisions.

        4. Trade-off between time-to-respond vs CPU consumption

        One of the other concerns with stalld's design is the use of per-CPU
        threads for starvation monitoring and boosting, which can be CPU
        intensive. To address this problem, stalld supports a
        single-threaded mode of operation to monitor the entire system, but
        trades off the time-to-respond to starvation conditions in exchange
        for lower CPU consumption. However, this is a tricky trade-off for
        the system administrator in practice, since typical starvation
        issues arise from per-CPU kthreads woken on every CPU and demand
        quick boosting/revival on every CPU for system stability.

        Considering these limitations of stalld for practical deployments, we
        have developed a prototype design to address these concerns by
        implementing stalld-like kernel-thread starvation avoidance directly
        in the Linux kernel.

        DESIGN OF PROPOSED SOLUTION (IN-KERNEL KTHREAD STARVATION AVOIDANCE)

        Our design to address the limitations of stalld builds on the
        following key insights:

        1. Compartmentalize the fault-domains of the RT application & the OS

        System-wide hangs (as described above) are almost always caused by
        starving kernel threads, which may be the result of a misconfigured
        real-time application. However, ensuring that kernel threads never
        starve (using an in-built starvation-avoidance algorithm in the
        kernel) will keep the OS stable, while limiting the hangs or
        starvation issues to the misbehaving application itself. A
        misconfigured RT application can no longer bring down the entire OS.

        2. Starvation avoidance via (per-CPU) scheduler-hooks scales well

        In a typical real-time RAN application deployment, CPU isolation is
        used to move all movable tasks to housekeeping CPUs, so as to run
        the real-time application on the isolated CPUs. In such a
        configuration, the only remaining kernel threads on the isolated
        CPUs are non-migratable per-CPU kthreads such as ktimersoftd,
        per-CPU kworkers etc., and those are the ones that are likely to get
        starved out. Therefore, the problem of identifying starved
        kernel-threads and reviving them via priority boosting is naturally
        CPU-local, and it can be implemented without the need for
        system-wide monitoring or cross-CPU coordination.

        The Linux kernel scheduler uses a per-CPU design for scalability.
        Hence, implementing per-CPU kernel thread starvation avoidance by
        directly hooking onto the scheduler should automatically scale well.

        3. Kernel-based design lends itself to an elegant implementation

        Implementing starvation monitoring and revival for kernel-threads in
        the Linux kernel itself offers a number of surprising benefits,
        including the ability to elegantly side-step entire problem classes
        altogether, as compared to a userspace solution, as noted below.

        3A. Efficiency

        The in-kernel implementation allows hooking the starvation
        avoidance algorithm to specific events of interest within the
        scheduler (such as task wakeups), which helps minimize unnecessary
        periodic monitoring activity, thus saving CPU time.

        3B. No risk of starving the starvation avoidance mechanism

        In NOHZ_FULL mode, a single task can effectively monopolize the
        CPU without ever entering the kernel; but luckily this also means that
        there is no chance of starvation, since there is only one task
        eligible to run on that CPU. Waking up any other task targeted for
        that CPU will invariably invoke the scheduler, which gives the
        opportunity to run starvation avoidance as needed.

        This design also side-steps problems that arise with userspace
        solutions such as deciding the scheduling policy and priority at
        which stalld runs so as to not get starved itself.

        IMPLEMENTATION OUTLINE

        We have developed a prototype that implements the design envisioned
        above by using scheduler hooks in the Linux kernel as well as hrtimer
        callbacks. A brief outline is presented below.

        When a task gets enqueued into a CPU's runqueue, the "stall monitor"
        code arms a starvation-detection hrtimer (if not already armed) to
        fire after a (user-configurable) starvation-threshold, iff the task
        that was enqueued was a kernel thread.

        Once the starvation-detection timer fires, the stall monitor code
        checks if the set of runnable kernel threads on that CPU have been
        starving for the threshold duration. If it detects starvation, it
        arranges to boost the kernel threads (one-by-one) using the
        SCHED_DEADLINE policy in the irq_exit() path of the hrtimer interrupt,
        and arms a deboost hrtimer to fire after the (user-configurable) boost
        duration.

        The deboost timer's callback restores the scheduling policy and
        priority of the boosted kernel thread to its original settings.
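
        A rough sketch of the described flow (structure and helper names are hypothetical, not the actual prototype; the prototype defers the boost to the irq_exit() path, while this sketch performs it inline in the timer callback for brevity):

        #include <linux/hrtimer.h>
        #include <linux/sched.h>
        #include <uapi/linux/sched/types.h>

        struct stall_monitor {                     /* hypothetical per-CPU state */
            struct hrtimer timer;                  /* starvation-detection timer */
            struct hrtimer deboost_timer;          /* restores original policy   */
            u64 boost_runtime_ns;
            u64 boost_period_ns;
            u64 boost_duration_ns;
        };

        /* hypothetical: find a runnable kthread that exceeded the threshold */
        struct task_struct *stall_monitor_pick_starved(struct stall_monitor *sm);

        static enum hrtimer_restart stall_monitor_timer_fn(struct hrtimer *timer)
        {
            struct stall_monitor *sm = container_of(timer, struct stall_monitor, timer);
            struct task_struct *kt = stall_monitor_pick_starved(sm);

            if (kt) {
                struct sched_attr boost = {
                    .sched_policy   = SCHED_DEADLINE,
                    .sched_runtime  = sm->boost_runtime_ns,
                    .sched_deadline = sm->boost_period_ns,
                    .sched_period   = sm->boost_period_ns,
                };

                sched_setattr_nocheck(kt, &boost);        /* temporary boost */
                hrtimer_start(&sm->deboost_timer,         /* restore later   */
                              ns_to_ktime(sm->boost_duration_ns),
                              HRTIMER_MODE_REL_HARD);
            }
            return HRTIMER_NORESTART;
        }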

        We are still working on revising this basic design and implementation,
        and we are looking forward to sharing more details at the conference and
        seeking the Linux real-time community's invaluable feedback on further
        improvements or better alternatives.

        CONCLUSION

        The Telco Radio Access Network (RAN) for 5G is an exciting avenue that
        brings a new class of real-time workloads to Linux. While the Linux
        real-time stack based on the PREEMPT_RT patchset has been used with
        great success for decades with tightly controlled real-time
        applications and system configuration, the Telco/RAN usecase
        challenges the status quo by demanding lower real-time latency than
        ever before, while co-existing with non-real-time workloads as generic
        (i.e., not tightly controlled) as Kubernetes.

        One of the major pain-points faced by the industry in running these
        workloads on Linux is the instability of the underlying OS itself, often
        triggered by the very tools that are used for Linux real-time
        system configuration, tracing and debugging! In this proposal, we
        discussed the most promising current solution in this problem space,
        namely stalld, and highlighted its limitations as observed in
        practical deployment scenarios. We proposed an alternative design that
        addresses these limitations by implementing stalld-like kernel thread
        starvation avoidance in the Linux kernel itself.

        We are looking forward to the Linux community's insightful feedback on
        our design, as well as invaluable suggestions more broadly on solving
        OS stability issues for RAN-like usecases that involve running
        CPU-intensive high priority real-time tasks.

        Speakers: Sharan Turlapati, Srivatsa Bhat (VMware)
      • 89
        Next steps for futex2

        The community agrees that a new futex syscall is needed to add new features to help with performance and scalability issues. However, after several patches proposing different implementation approaches, the path to getting it merged is not clear. The goal of this session is to get maintainers and developers together to figure out the best approach to make progress on the new interface.

        Speaker: André Almeida (Collabora)
      • 09:35
        Break
      • 90
        printk: kthreads and atomic consoles for mainline

        Since 2018 there has been a dedicated effort to rework printk. Originally fueled by the need to make printk real-time friendly, the task quickly evolved to address many other existing problems within the printk subsystem. Since 5.8 there has been a steady flow of these improvements getting merged into mainline, but several RT-critical pieces are still remaining: sync mode, kthread printers, atomic consoles, pr_flush().

        In this session we will take a look at these needed features, talk about why their current PREEMPT_RT implementation is not acceptable for mainline "as is", and discuss the plan for moving forward.

        Speaker: John Ogness (Linutronix GmbH)
      • 91
        PREEMPT_RT: status and Q&A

        In this talk, Thomas Gleixner will present the status of PREEMPT_RT, along
        with a question-and-answer session regarding the upstream work and the
        future of the project.

        Speaker: Thomas Gleixner
    • Android MC Microconference3/Virtual-Room (LPC Virtual)

      Microconference3/Virtual-Room

      LPC Virtual

      150

      The Android microconference focuses on cooperation between the Android and Linux communities.

      • 92
        Android MC Introduction
      • 93
        Generic Kernel Image (GKI) update

        Shortly after last year's Plumbers Conference the initial version of Generic Kernel Images (GKI) shipped in products based on the 5.4 kernel and Android 11. Devices that shipped with a 5.4 kernel are compatible with GKI. Kernel developers can replace the system image with the publicly available GSI image and replace the boot image with GKI and the device will boot and run Android 11.

        In Android 12 devices running the 5.10 kernel, the product kernel is GKI, which means kernel fragmentation is nearly eliminated. Kernel development on Android devices will be much easier and much of the difficulty delivering security patches to devices in the field is removed. With a single core-kernel, and an upstream-first process, the gap between the Android kernel and mainline Linux is drastically reduced.

        This session will be a brief discussion on the status of GKI in Android 12 followed by Q&A.

        Speaker: Todd Kjos (Google)
      • 94
        Uclamp cgroup usage challenges in Android

        Android has been benefiting from extensive use of the cgroup v1 interface to boost important tasks (the top-app and foreground groups) and limit unimportant ones (background). Our recent investigations have shown that combining CPU shares with the newly introduced util-clamp feature can improve user-visible jank, specifically in cases where background load is high. Unfortunately, util-clamp and CPU shares are both attached to the CPU controller, which constrains userspace's ability to classify tasks and drive these features independently. The issue becomes even bigger when we plan to migrate to cgroup v2. In addition to this, the util-clamp max aggregation can be ineffective because of co-scheduling, leading to sub-optimal energy consumption. This talk will describe those problems in more detail and discuss potential solutions.

        Speakers: Wei Wang (Google LLC), Quentin Perret (Google)
      • 95
        FS stacking with FUSE: performance issues and mitigations

        Stacking file systems based on FUSE are intended to go through complicated code paths implemented by the FUSE service, to enforce special access policies or manipulate data at runtime, based on the request received by the FUSE file system and the data in the lower file system.
        Android relies on FUSE to enforce fine-grained access policies depending on file contents and requesting users, and may modify file contents at run-time.

        These benefits come at the cost of increased overhead to traverse the whole FUSE pipeline, worsened by the multiple switches between kernel-space and user-space. FUSE performance may be less than 30% of that of direct access to the lower file system files.

        FUSE passthrough is a first solution that has been proposed upstream to reduce this performance gap, allowing the FUSE service to grant some files direct access to the lower file system, internally rerouted by the FUSE driver. This solution is already available in a number of Android devices, but is still under discussion on the mailing list.

        Another work in progress extends the FUSE driver with additional logic based on BPF, still managed by the FUSE service. This solution aims at extending the FUSE passthrough performance benefits to more file system operations and at improving the FUSE driver's flexibility and updatability with BPF programs, without the need to modify the kernel.

        Speakers: Alessio Balsini (Google), Paul Lawrence (Google)
      • 96
        dm-snapshot in userspace

        dm-snapshot was a huge step forward for Android updates, but it can have greatly outsized disk space requirements for relatively small binary patches. Since dm-snapshot is closely tuned to the underlying exception store, it is not easily amenable to custom storage formats.

        We have addressed this by implementing dm-snapshot in userspace via a new "dm-user" device-mapper module (like FUSE, but for block devices). Since the entire OS runs off this block device, performance is a primary motivation over NBD/iSCSI, and we are interested in how to achieve high-performance userspace block devices in the upstream kernel.

        Speakers: Kailash Akilesh (Google), David Anderson (Google)
      • 08:20
        15min Break
      • 97
        Thermal core usage challenges in Android

        The Android community has been using the thermal core infrastructure for both Tj and Tskin solutions for years, and many thermal DVFS features from various vendors/OEMs are built upon it, usually requiring changes in the thermal governor and framework. However, with the introduction of GKI, OEMs are no longer allowed to ship a custom thermal governor as a module, which limits the possible solutions and sometimes leads to sub-optimal code combining an in-driver governor. In addition, there are many lessons learned from using the thermal core infrastructure for both Tskin and Tj solutions together, especially in the way they interact with each other. This talk will describe those problems in more detail and discuss potential solutions and improvements to the current situation.

        Speaker: Wei Wang (Google LLC)
      • 98
        Allocator attribution/metadata tagging for shared buffers

        A discussion on how we can find a cost-effective solution to attribute shared buffers to their allocating processes. Other than being useful for memory accounting/debugging, this could also lead the way to a solution to set limits on how much memory a process can allocate.

        Speaker: Hridya Valsaraju
      • 100
        Android drivers in Rust

        The Rust for Linux project is adding support for the Rust language to the Linux kernel. We have a partial implementation of the Android Binder driver, as well as PL061 GPIO and NVMe drivers in Rust. Our goal is to make Rust available to kernel developers so that drivers can be written more expeditiously, with most potential memory bugs caught at compile-time, while at the same time preserving performance characteristics.

        We show brief examples of how this is achieved and what real drivers look like in Rust, contrasting them with their C counterparts. We'd then like to discuss concerns, objections, potential unforeseen difficulties, general feedback, etc. that members of the community may have. We're also interested in hearing about existing pain points when writing drivers in C so that we can try to improve the experience in Rust.

        Speakers: Wedson Almeida Filho, Miguel Ojeda
      • 09:55
        15min Break
      • 101
        Speculative page faults

        Most Android vendors currently ship kernels that include Laurent Dufour's speculative page fault patchset from about 2.5 years ago. The patch set was rejected upstream at that time, due to code complexity, but provides a significant benefit to application startup times. I have been working on a new spin on the same basic idea, and came up with a patchset version which is (IMO) simpler and more bisectable. I would like to discuss performance results and gauge what our options are for upstreaming this.

        Speakers: Michel Lespinasse (Facebook), Mr Laurent Dufour
      • 102
        Improving Community AOSP Devboard/Device Collaboration

        While there are only a small number of devboards in AOSP, a number of vendors and community members have created external projects to enable their devices against AOSP.

        Some examples:

        • The GloDroid project: https://github.com/glodroid/glodroid_manifest
        • android-rpi: https://github.com/android-rpi
        • PocoF1 AOSP: https://github.com/pundiramit/device-xiaomi-beryllium/blob/master/README.md
        • OnePlus AOSP: https://github.com/calebccff/android_device_generic_sdm845

        After seeing some of the excellent work being done in the GloDroid project and realizing there is a fair amount of duplicated effort in keeping a device current with AOSP, I thought there might be a better opportunity for devboard vendors and community members who are focusing on AOSP to collaborate.

        I'll cover my thoughts on what sort of collaboration might be useful, along with some of the potential pitfalls, and see what interest or ideas folks have on how we might work together and share more experience and knowledge as a community.

        I hope to have discussion on the topic with GloDroid maintainers, LineageOS developers, as well as other Linaro and Google developers, and hopefully more.

        Speaker: John Stultz (Linaro)
      • 10:50
        Followup Discussion
    • BPF & Networking Summit Networking and BPF Summit/Virtual-Room (LPC Virtual)

      Networking and BPF Summit/Virtual-Room

      LPC Virtual

      150

      The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 103
        Towards a BPF Memory Model

        This talk will review the goals and requirements for a BPF memory model and look at more recent work on deriving memory-model litmus tests from example BPF programs. These examples will cover ordering within and among BPF programs, but also ordering with the kernel code that BPF programs can interact with.

        Speaker: Paul McKenney (Facebook)
      • 104
        Self-healing Networking with Flow Label

        The report covers the use of the flow label in modern network environments and the operational effect of TCP 'rethinking' its hash upon a negative routing event.

        Speaker: Alexander Azimov (Yandex)
      • 105
        Defragmenting the Loader Landscape

        Since the cilium/ebpf pure Golang library was last presented at LPC 2019, a lot has changed. eBPF is now seemingly on everyone's radar, the eBPF Foundation is a thing, and more people are using and writing Go-based tools and services than ever. What does this mean for the library and the ecosystem around it? Who uses it, who's been contributing, and which use cases does the library enable today?

        In this talk, we'll mainly discuss the following topics:

        • Short overview of known users and open-source projects that depend on the library
        • A proposal to facilitate generating UAPI bindings in other programming languages
        • A proposal for a common test suite all compliant BPF loader(s) must implement

        Q&A will open after each proposal.

        Speakers: Timo Beckers (Isovalent), Lorenz Bauer (Cloudflare)
      • 106
        BPF-datapath extensions for Kubernetes workloads

        With the rapid adoption of Cilium as the BPF-based datapath for Kubernetes as
        well as integration into popular devops tooling such as kind [0], which allows
        for running local Kubernetes clusters using Docker container 'nodes', we see
        more advanced use (and corner) cases which have not yet been tackled from a
        BPF and networking angle. Therefore, in this slot, we discuss various loosely
        coupled issues in the networking stack which we are working on in the context
        of Cilium's BPF datapath:

        • Mixed cgroup v1/v2 interference related to BPF cgroup programs
        • TCP socket pacing for Pods out of the init network namespace
        • Managed neighbor entries for load-balancer backends
        • Wildcarded map lookups for Cilium's n-Tuple PCAP Recorder [1]

        We will provide a brief overview of the use cases related to the above, and give
        an outline for kernel extensions we are looking into.

        [0] https://kind.sigs.k8s.io/
        [1] https://cilium.io/blog/2021/05/20/cilium-110#standalonelb

        Speakers: Daniel Borkmann (Isovalent), Martynas Pumputis (Isovalent)
      • 107
        bpfilter - BPF based firewall

        Motivation

        Iptables has become a synonym for a firewall in the Linux world. Although there is
        nftables, which is supposed to replace iptables, iptables will exist for
        decades more because of its popularity and ubiquity.

        With the growing use of BPF technology and its benefits, there is a
        temptation to apply the technology for firewalling purposes.

        Problem Statement

        Despite its advantages, iptables is also known for its dark side: performance
        and security related issues. What if it were possible to keep the iptables ABI
        and replace its implementation with something more performant and secure by
        nature?

        Such an approach would keep existing solutions working and remove the
        overhead of switching to a new technology.

        Approach

        An RFC patchset back in 2018 proposed a BPF based firewall -
        bpfilter. From a bird's eye view, bpfilter is a compiler implemented as a
        usermode helper kernel module. bpfilter analyses an iptables ruleset and
        synthesizes an equivalent BPF program. When the bpfilter kernel module is
        loaded, it starts a userspace process that communicates over IPC with its
        kernelspace part. Most of the bpfilter functionality is implemented in the
        userspace process, which significantly simplifies its development and improves
        security. The kernel part hooks into the kernel iptables ABI and, transparently
        for the userspace consumer, passes control to the userspace process. The
        bpfilter userspace process "compiles" the iptables ruleset into a BPF program
        and passes control back to the kernel. This approach allows transparently
        replacing the iptables implementation without breaking its consumers, while
        gaining all the benefits of the BPF ecosystem.
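
        As a flavor of the target output (an illustrative hand-written equivalent, not actual bpfilter-generated code), an XDP program matching a rule like "iptables -A INPUT -p tcp --dport 22 -j DROP" could look roughly like this:

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/in.h>
        #include <linux/ip.h>
        #include <linux/tcp.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        SEC("xdp")
        int drop_tcp_22(struct xdp_md *ctx)
        {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;
            struct iphdr  *iph = (void *)(eth + 1);
            struct tcphdr *tcp = (void *)(iph + 1);   /* assumes no IP options */

            if ((void *)(tcp + 1) > data_end)         /* bounds check for all headers */
                return XDP_PASS;
            if (eth->h_proto != bpf_htons(ETH_P_IP) || iph->protocol != IPPROTO_TCP)
                return XDP_PASS;
            return tcp->dest == bpf_htons(22) ? XDP_DROP : XDP_PASS;
        }

        char LICENSE[] SEC("license") = "GPL";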

        Results

        While the initial patchset was abandoned, in 2021 there was an attempt to
        resurrect it. Two versions of the updated patchset were submitted to
        the bpf@ mailing list and a third iteration is in preparation.
        Currently bpfilter is able to process basic rules in the INPUT and OUTPUT chains
        and translate them into equivalent XDP and TC programs. bpfilter has an easy
        way to add new matches and extensions in terms of iptables.

        Conclusions

        The idea of treating a firewall as a compiler is seductive, as such an approach
        provides more opportunities for performance optimisations due to a more precise
        context. Combined with the existing BPF performance and security features,
        and with its userspace nature on top, this might sound like the next
        firewall for Linux.

        Speaker: Dmitrii Banshchikov (Facebook)
    • GNU Tools Track GNU Tools track/Virtual-Room (LPC Virtual)

      GNU Tools track/Virtual-Room

      LPC Virtual

      150
      • 108
        GCC Steering Committee, GLIBC, GDB, Binutils Stewards Q&A

        The annual GNU Toolchain mindfulness and meditation session. A cordial Question and Answers session with the GCC Steering Committee, GLIBC, GDB and Binutils Stewards also will be entertained.

        Speaker: David Edelsohn (IBM Research)
      • 109
        GCC support for the Darwin AArch64 ABI

        This is a lightning talk.

        One of the hurdles necessary to overcome for the M1 Darwin GCC port is
        supporting the Darwin ABI specification. GCC is designed to process
        argument passing the same way, regardless of whether the argument is
        named or variadic. This however does not leave scope to accommodate the
        Darwin modifications to the AArch64 ABI, which specifies that named
        stack-allocated arguments are passed naturally aligned, but variadic
        arguments are passed word-aligned.

        To overcome this, we propose extending the GCC target hook API to carry
        the additional information necessary to let the backend make its own
        decisions about stack layout. The extension will not affect existing
        targets, and is opt-in by nature.

        The second issue we tackled was support for the GCC nested function
        extensions to the C language. This is traditionally implemented using
        trampolines injected onto the stack at runtime, which requires an
        executable stack. Since Darwin's stack is non-executable, and the
        target doesn't make use of function descriptors, we required a
        solution to support nested function calls that didn't require changing
        the ABI.

        Our preliminary plan is to generate the trampolines into an mmapped
        executable page: the trampolines will be generated when required
        within a function, and deallocated when control leaves the
        enclosing scope.
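
        For context, a minimal example of the GCC C extension that triggers trampoline generation (illustrative only):

        /* Taking the address of a nested function that captures a local
         * variable forces GCC to materialize a trampoline at run time,
         * traditionally on the (executable) stack. */
        static int apply(int (*fn)(int), int v) { return fn(v); }

        int outer(int bias)
        {
            int inner(int x) { return x + bias; }   /* captures 'bias' */
            return apply(inner, 42);
        }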

        Speakers: Maxim Blinov (Embecosm), Andrew Burgess (Embecosm), Iain Sandoe
      • 110
        Sharing Cache - optimizing for a single core vs a multi-core system

        Recent x86 processors support "non-temporal" stores which bypass the cache when storing data. It is widely understood that normal stores to cache are appropriate when it is likely that the data may be needed before the cache is full. It is also understood that stores of large blocks of data which exceed the available cache allow the overall application to run faster when the block of stores bypasses the cache, leaving other locally used data in the cache. A recent change (since reverted) tuned the library routine for memcpy to optimize for best results assuming a single core was the sole user of the cache, instead of allowing for multi-core server chips which have multiple cores sharing a cache. The specifics of the two cases will be presented, followed by discussion of how similar single-core vs multi-core optimizations might be handled in standard software libraries.
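
        For illustration, a cache-bypassing copy using the non-temporal store intrinsics in question (a generic sketch, not the glibc memcpy implementation; it assumes dst is 16-byte aligned):

        #include <immintrin.h>
        #include <stddef.h>

        void copy_nt(float *dst, const float *src, size_t n_floats)
        {
            size_t i;

            /* stream stores bypass the cache, leaving it to other data */
            for (i = 0; i + 4 <= n_floats; i += 4)
                _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
            _mm_sfence();                        /* order the NT stores  */
            for (; i < n_floats; i++)            /* scalar tail          */
                dst[i] = src[i];
        }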

        Speaker: Patrick McGehearty (Oracle)
      • 111
        Security improvements in GCC

        There are multiple security features that have been requested for the Linux Kernel for a long time (https://outflux.net/slides/2020/lpc/gcc-and-clang-security-feature-parity.pdf). This wishlist includes wipe call-used registers on return, auto-initialization of stack variables, unsigned overflow detection, etc …

        Some of these security features have been available in CLANG, or other compilers for some time. The lack of these features in GCC makes it less competitive than other compilers regarding security.

        For over a year, we have been working hard in order to make GCC comparable with, or even better than other compilers in this area.

        The focus of this talk is on two security features that we have recently implemented in GCC11, or that we are currently working on for GCC12.

        The first feature is called "wipe call-used registers on return". This is a technique to mitigate ROP (Return-Oriented Programming) and addresses the register erasure problem as mentioned in the "SECURE project and GCC" talk at Cauldron 2018 (https://gcc.gnu.org/wiki/cauldron2018#secure).

        This project has been completed and the corresponding patch has been committed to GCC11. In this patch, we have added the new "-fzero-call-used-regs" option, plus the new function attribute "zero_call_used_regs", to GCC.

        To improve kernel security, this new feature is now used in the Linux Kernel. See https://patchwork.kernel.org/project/linux-kbuild/patch/20210505191804.4015873-1-keescook@chromium.org for more details.
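
        As a usage sketch (not from the talk itself), the feature can be enabled per translation unit with -fzero-call-used-regs=used-gpr or per function with the attribute:

        /* Build with: gcc-11 -O2 -c example.c
         * Call-used general-purpose registers that this function actually
         * used are zeroed right before it returns, shrinking the data a
         * ROP gadget could harvest from leftover register contents. */
        int __attribute__((zero_call_used_regs("used-gpr")))
        read_sensor(volatile int *reg)
        {
            return *reg;
        }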

        The second feature is called "stack variables auto-initialization". It is a technique to provide automatic initialization of automatic variables. LLVM has supported the -ftrivial-auto-var-init=pattern/zero option to provide this functionality. This is currently implemented as a plugin in the Linux kernel source tree, but ideally the compiler supports it natively, without the need for an external plugin.

        This project is ongoing. The 7th version of the patch has been submitted to GCC upstream for review and discussion (https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576341.html).

        In the talk we provide a high-level overview of these two features. This includes a description of the issues, the motivation, major considerations, and some interesting implementation details.

        Speaker: Qing Zhao
      • 09:00
        Coffee Break
      • 112
        PowerPC BOF

        Discuss topics related to the rs6000 / Power / PowerPC toolchain, including support for Power10.

        Speakers: Bill Schmidt (IBM Corporation), Segher Boessenkool (IBM Corporation)
      • 113
        The GNU C Library BoF

        The GNU C Library is used as the C library in the GNU system and in most systems with the Linux kernel. The library is primarily designed to be a portable and high performance C library. It follows all relevant standards including ISO C11 and POSIX.1-2008. It is also internationalized and has one of the most complete internationalization interfaces known.

        This BoF aims to bring together developers of other components that have dependencies on glibc and glibc developers to talk about the following topics:

        • Observability: LD_AUDIT, PLT optimizations, and interposition.
        • Planning for glibc 2.35 and what work needs to be done between August 2021 and January 2022.
        • Planning for glibc 2.36 and what work needs to be done between January 2022 and July 2022.

        ... and more.

        Speaker: Mr Carlos O'Donell (Red Hat)
    • LPC Refereed Track Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 114
        io_uring: BPF controlled I/O

        io_uring is an asynchronous I/O API crafted for efficiency, where one of the reasons for using shared rings is to reduce context switching. It has gained lots of features since its introduction, and to push it further we want to hand some of the control over submitting and managing I/O to BPF, minimising the number of context switches even more.

        • It should keep the number of system calls to a minimum.
        • It should help lower the overhead of scheduling user processes onto CPUs when they have little to do and will shortly go back to sleep.
        • It should be an alternative to submission queue polling for reducing latency, without consuming as much CPU time.

        We'll go over the current design [1] and decisions, issues and plans, and hopefully it will spark a discussion and give impetus to curious minds to try it out and share ideas on how to tailor the API to fit their use cases.

        [1] https://lore.kernel.org/io-uring/a83f147b-ea9d-e693-a2e9-c6ce16659749@gmail.com/T/#m31d0a2ac6e2213f912a200f5e8d88bd74f81406b

        Speaker: Pavel Begunkov
      • 115
        The Case for Memory Segregation

        In this talk, we would like to propose adding roles to memory pages. We contend that the current monochromatic memory model cannot address modern systems' security and performance needs.

        We want to discuss two recent projects that perform memory segregation. DMA Aware Malloc for Networking (DAMN) protects against DMA attacks (e.g., project Thunderclap) while providing the same performance at 100+ Gb/s as with iommu=off.
        And also, a Memory Allocator for I/O (MAIO), which facilitates overhead-free zero-copy networking for user-space applications.

        We implement memory segregation by adding extra metadata into the tail pages of huge (i.e., compound) pages. This additional metadata allows for
        fast address translation and any additional operations per segment role.
        The shared common DNA of both projects is a memory allocator that allots memory for specific operations. This memory can later be reclaimed by a simple put_page. While the memory pools are based on compound pages, different memory allocation techniques can be implemented, e.g., page_frag or single 4K page allocations. We take special care to ensure that get/put_page use the individual page's ref_count rather than the head page's ref_count.

        This form of memory segmentation isolates general kernel memory from segmented memory that a device or a user also uses. This isolation of memory is vital to facilitate fast and secure I/O operations.

        Both DAMN and MAIO are prime examples that demonstrate how to solve complex I/O problems by adding segmentation to the existing memory model.

        Speaker: Alex Markuze (VMware)
      • 116
        systemd-oomd: PSI-based OOM kills in systemd

        Following a previous talk, oomd: a userspace OOM killer, Facebook has since come up with a simplified interface for oomd that removes some of the barriers of configuring oomd. Integration with systemd allows more users to reuse their knowledge of configuring systemd daemons. And by removing some of the complexity of coming up with an OOM kill policy, we enable more users to do cgroup and PSI-based OOM kills.

        This talk will cover the key features of oomd that were preserved in systemd-oomd, what changes were made to ease the kill policy decisions, and how this translates to the settings we've adopted in Fedora. We will close with a discussion of future work for systemd-oomd.

        Speaker: Anita Zhang
      • 117
        A maintainable, scalable, and verifiable SW architectural design model for the Linux Kernel

        Over the last years, many discussions took place in Linux Foundation's ELISA workgroup (elisa.tech) about possible approaches to qualify Linux for safety-critical systems. It is a consensus that one of the main challenges for the qualification of Linux is the lack of SW Architectural Design documentation, especially concerning the kernel internal components/drivers/subsystems. Such documentation is fundamental in functional safety as it provides the baseline required to assess the OS design against the allocated safety requirements (safety analysis). This Architectural Design is also necessary to evaluate the completeness and correctness of the associated test campaign.

        However, given the complexity of Linux, the challenge is finding a documentation format that is complete enough to justify the assessment while still keeping a maintainable granularity.

        This talk will present an SW architectural design model that, working at the granularity level of the single drivers/subsystems, uses a formal method (automata) to describe the interaction of a target subsystem/driver with the rest of the kernel, whereas a natural language description (kernel-doc headers) is used to describe the behavior of the target subsystem/driver itself.

        During the talk, the authors will present how to use computer-aided design tools to help to derive the automata models of target subsystems. They will also show how to take advantage of the proposed Runtime Verification Interface [1] to transform these models into runtime verification monitors that are usable either during the verification phase (to cross-verify the kernel and the documentation) or to monitor safety-related aspects of the system at runtime, avoiding unsafe states.

        Discussing this topic at a more development-centric conference (rather than with a more safety-focused audience) is necessary to get direct feedback from kernel developers/maintainers about the approach and the maintainability of the SW Architectural Design documentation.

        [1] https://lore.kernel.org/lkml/cover.1621414942.git.bristot@redhat.com/

        Speakers: Mr Gabriele Paoloni, Mr Daniel Bristot De Oliveira (Red Hat)
    • System Boot and Security MC Microconference4/Virtual-Room (LPC Virtual)

      Microconference4/Virtual-Room

      LPC Virtual

      150

      The System Boot and Security microconference focuses on the firmware, bootloaders, system boot and security around the Linux system. It also welcomes discussions around legal and organizational issues that hinder cooperation between companies and organizations to bring together a secure system.

      • 118
        System Boot and Security MC Introduction
        Speaker: Daniel Kiper
      • 119
        Writing Grub2 modules in Rust

        The grub2 bootloader is a trusted component of the secure boot process, including "traditional" GPG-based secure boot, UEFI-based secure boot, and the logical partition secure boot process being developed by IBM. Grub2 is mostly written in C and has suffered from a number of memory-unsafety issues in the past.

        Rust is a systems programming language suitable for low-level code. Rust can provide valuable tools for safer code: code in 'safe' Rust has stronger guarantees about memory safety, while 'unsafe' code has to be contained in specially marked sections. It is reasonably easy for Rust code to interoperate with C.

        Grub2 is based on a modular design. Potentially vulnerable components such as image and file-system parsers are written as individual modules. Can we progressively rewrite these modules in a safer language?

        I will discuss my progress enabling Rust to be used as a language for grub development, issues I have encountered, decisions we will have to make as the grub community, and next steps from here.

        Speaker: Daniel Axtens (IBM)
      • 120
        Firmware and Bootloader Logging

        In the bootloader as well as in the firmware, there is a lot of useful information about how the system is set up. However, there has been no good transport for passing this information to the operating system. Initially, we designed a log to record messages from the GRUB2 bootloader so the TrenchBoot project could view how the platform was being set up during boot. After some discussion, we realized this could be useful for other projects and that we could extend our design to work for other boot components. In this presentation, we will look at ways to collect information from the firmware and bootloader for the operating system.

        Speakers: Alec Brown, Daniel Kiper
      • 121
        Linux and DRTM on Arm

        A specification for Dynamic Root of Trust for Measurement (DRTM) on the Arm architecture will be available Fall 2021. DRTM allows a system in a potentially unknown or untrusted state to boot an OS or hypervisor into a known and trusted state.

        This topic will present an overview of DRTM on Arm to provide context, followed by discussion around several topics that have implications for the Linux kernel:

        • questions around the handoff from the dynamic launch to the Linux kernel
        • the problem of UEFI RT services in the context of DRTM and Linux
        • questions around supporting dynamic TPM localities on Arm systems
        Speaker: Stuart Yoder (Arm)
      • 122
        TrenchBoot Secure Launch upstreaming

        The ability to do a Trusted Computing Group (TCG) Dynamic Launch of a system has been commercially available in x86 processors since 2006, with the introduction of Intel TXT for Intel processors and AMD-V for AMD processors. Over the years the technology has mainly been used by a limited number of security-sensitive projects. The TrenchBoot Project has been working to make the underlying hardware technology more integrated and an out-of-the-box solution usable by the general open-source operating system user. Towards that goal, the project has been working to upstream into the Linux kernel the ability to be directly launched by a TCG Dynamic Launch in a unified manner. The first patchset submitted is focused on enabling this approach for Intel TXT, with support for AMD and Arm to come soon after. The purpose of this topic is to engage the Linux developer community for feedback on the current patches and to discuss ways in which progress towards merging could be made.

        Speakers: Ross Philipson (Oracle), Daniel Smith (Apertus Solutions, LLC)
    • Testing and Fuzzing MC Microconference2/Virtual-Room (LPC Virtual)

      Microconference2/Virtual-Room

      LPC Virtual

      150

      The Testing and Fuzzing microconference focuses on advancing the current state of testing of the Linux kernel. We aim to create connections between folks working on similar projects, and help individual projects make progress.

      • 123
        Testing and Fuzzing MC Welcome

        The Linux Plumbers 2021 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel. We aim to create connections between folks working on similar projects, and help individual projects make progress.

        We ask that topic discussions focus on the issues/problems being faced and possible ways to resolve them. The microconference is open to all topics related to testing and fuzzing on Linux, not necessarily just in kernel space.

        Potential topics:

        • KernelCI: Improving user experience. (https://groups.io/g/kernelci/message/948)
        • Growing KCIDB, integrating more sources. (https://lists.yoctoproject.org/g/automated-testing/message/855)
        • Better sanitizers: KFENCE, improving KCSAN. (https://lwn.net/Articles/835367/)
        • Using Clang for better testing coverage: Now that the kernel fully supports building with clang, how can all that work be leveraged into using clang's features?
        • How to spread KUnit throughout the kernel?
        • Testing in-kernel RUST code.

        Things accomplished from last year:

        • KCIDB achieved multiple integrations, acting as a central collecting point for KernelCI, Red Hat's CKI, syzbot, ARM, Gentoo, Linaro's TuxSuite etc...
        • KFENCE was successfully merged. (https://lwn.net/Articles/835367/)
        • Clang: CFI, weeding out issues upstream, etc.
        • KUnit started acting as the standard for some drivers. (https://www.youtube.com/watch?v=78gioY7VYxc)

        Confirmed to-be attendees:

        • Sasha Levin
        • Kevin Hilman
        • Guillaume Tucker
        • Alice Ferrazzi
        • Veronika Kabatova
        • Nikolai Kondrashov
        • Antonio Terceiro
        • Mark Brown
        • Don Zickus
        • Enric Balletbo
        • Tim Orling
        • Gustavo Padovan
        • Bjorn Andersson
        • Milosz Wasilewski
        • Shuah Khan
        • Martin Peres
        • Arnd Bergmann
        • Remi Duraffort
        • Peter Zijlstra
        • Daniel Stone
        • Jan Lübbe
        • Dmitry Vyukov
        • Brendan Higgins
        • Greg KH
        • Anders Roxell
        • Guenter Roeck
        • Jesse Barnes
        • Kees Cook
        Speakers: Sasha Levin, Mr Guillaume Tucker
      • 124
        Detecting semantic bugs in the Linux kernel using differential fuzzing

        Many bugs are easy to detect: they might cause assertions failures, crash our system, or cause other forms of undefined behaviour detectable by various dynamic analysis tools. However, certain classes of bugs, referred to as semantic bugs, cause none of these while still resulting in a misbehaving faulty system.

        To find semantic bugs, one needs to establish a specification of the system’s intended behaviour. Depending on the complexity of the system, creating and centralising such specifications can be difficult. For example, the “specification” of the Linux kernel is not found in one place, but is rather a collection of documentation, man pages, and the implied expectations of a vast collection of user space programs. As such, detecting semantic bugs in the Linux kernel is significantly harder than detecting other classes of bugs. Indeed, many test suites are meant to detect regressions, but creating and maintaining test cases, as well as covering new features, requires significant amounts of engineering effort.

        Differential fuzzing is a way to automate detection of semantic bugs by providing the same input to different implementations of the same systems and then cross-comparing the resulting behaviour to determine whether it is identical. In case the systems disagree, at least one of them is assumed to be wrong.

        syz-verifier is a differential fuzzing tool that cross-compares the execution of programs on different versions of the Linux kernel to detect semantic bugs. It was developed as part of the syzkaller project which also provides unsupervised coverage-guided kernel fuzzing.

        To generate programs, syz-verifier uses a declarative system call description language called syzlang. This allows generating valid random programs (sequences of system calls) the same way as syzkaller does. The programs are then dispatched for execution on different versions of the Linux kernel. After the programs finish executing, the produced results (currently only the errnos returned by each system call) are collected and checked for mismatches. In case a mismatch is identified, syz-verifier reruns the program on all kernels to ensure the mismatch is not flaky (i.e. that it is consistently reproducible rather than triggered by some background activity or external state). If the mismatch occurs in all reruns, syz-verifier creates a report for the program.
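
        As a loose illustration of the cross-comparison step only (syz-verifier itself generates its programs from syzlang descriptions), the sketch below runs a fixed syscall sequence and prints the errno observed for each call; running the same binary on two kernel versions and diffing the output mimics the mismatch detection described above:

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int res[3];
                char buf[16];

                /* A fixed "program": a short, deterministic syscall sequence. */
                int fd = open("/proc/self/status", O_RDONLY);
                res[0] = fd < 0 ? errno : 0;
                res[1] = read(fd, buf, sizeof(buf)) < 0 ? errno : 0;
                res[2] = close(fd) < 0 ? errno : 0;

                /* The per-call errnos are the observed behaviour to cross-compare. */
                for (int i = 0; i < 3; i++)
                        printf("call %d: errno=%d\n", i, res[i]);
                return 0;
        }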

        Speakers: Mara Mihali, Marco Elver (Google), Dmitry Vyukov (Google)
      • 125
        Bare-metal testing using containerised test suites

        The traditional approach of testing software on real hardware usually involves creating a rootfs which contains the test suites that need to be run, along with its run-time dependencies (network, mounting drives, time synchronization, ...).

        Maintaining one rootfs per test suite is a significant packaging burden, but it also prevents running multiple test suites back to back, which slows down testing. The alternative is not a clear win either: creating a single shared rootfs is harder when test suites have conflicting requirements, and one test suite may silently modify some configuration that impacts other test suites, potentially leading to test failures being mis-attributed.

        Fortunately, Linux namespaces and OCI containers are now becoming commonplace and can be used to package our test suites along with their dependencies, without having to integrate them all into one image. Provided that you have a well configured host kernel and OS, this enables running test suites in relative isolation, thus reducing the chances of interference between them. Finally, the packaging problem can be alleviated by having the test suites provide releases as containers, thus allowing re-use in many CI systems without modifications.

        In this presentation, we will further present the benefits of containers, and introduce boot2container: A podman-powered initramfs that gets configured declaratively using the kernel command line, is deployable via PXE thanks to its small size (<20 MB), and that makes it easy to share files with/from the test machine via an S3-compatible object storage.

        Speaker: Martin Peres
      • 126
        Common Test Report Database (KCIDB)

        Join to hear about the next KCIDB release, new features and plans, including the new report format and subscription/notification system. Provide feedback and discuss ideas for further development. Get help submitting your data or joining the development.

        KernelCI's KCIDB is an effort to unify kernel test reporting schema and protocol, and provide a service for aggregating, analyzing, reporting, and accessing test results received from various Kernel testing systems. We are already receiving data from ARM, Gentoo's GKernelCI, Red Hat's CKI, Google's Syzbot, Linaro's Tuxsuite, and of course the native KernelCI tests, and we're working on receiving data from more systems.

        See our dashboard at https://kcidb.kernelci.org/

        Speaker: Nikolai Kondrashov (Red Hat)
      • 127
        Testing the Red-Black tree implementation of the Linux kernel against a formally verified variant

        In this talk, we will show how to construct evidence of correctness through
        testing and formal verification. In our case study, we test the long-standing
        Red-Black tree implementation in the kernel against a variant in a functional
        programming language. This variant has been formally verified in the interactive
        theorem prover Isabelle [1]. To our surprise, the kernel Red-Black tree
        implementation is a variant that is not known in the literature of functional
        data structures so far. We are glad that we still found it to be correct with
        newly identified invariants for the correctness proof.
        [1] https://isabelle.in.tum.de/

        Speakers: Mr Mete Polat (Technische Universität München), Lukas Bulwahn (Elektrobit Automotive GmbH)
      • 08:55
        Break
      • 128
        New Smatch Developments

        Smatch is one of the main static analysis tools used in the kernel. These days, simple static analysis checks are increasingly implemented in the compilers. For Smatch, the new work is in more complicated cross-function analysis that compilers cannot handle.

        This talk will give a brief introduction to the new Smatch Param/Key API which makes it easier to write advanced cross function checks and removes a lot of boilerplate code.

        Then it will cover Smatch's "Sleeping in atomic" check. Checking for sleeping in atomic bugs requires complicated cross function analysis. This is an example of an advanced check with a lot of moving parts.

        Finally, the talk will cover an in development check for race conditions. In some ways this is the most complicated Smatch check ever. Hopefully we can have a discussion about how to make this check better.

        Speaker: Dan Carpenter (Oracle)
      • 129
        Fuzzing Device Interfaces of Protected Virtual Machines

        Both AMD and Intel have presented technologies for confidential computing in cloud environments. The proposed solutions — AMD SEV (-ES, -SNP) and Intel TDX — protect Virtual Machines (VMs) against attacks from higher privileged layers through memory encryption and integrity protection. This model of computation draws a new trust boundary between virtual devices and the VM, which so far lacks thorough examination.
        To enable the scalable analysis of the hardware-OS interface, we present a dynamic analysis tool to detect cases of improper sanitization of input received via the virtual device interface. We detail several optimizations to improve upon existing approaches for the automated analysis of device interfaces. Our approach builds upon the Linux Kernel Library and clang’s libfuzzer to fuzz the communication between the driver and the device via MMIO, PIO, and DMA. An evaluation of our approach shows that it performs 570 executions per second on average and improves performance compared to existing approaches by an average factor of 2706.
        Using our tool, we analyzed 22 drivers in Linux 5.10.0-rc6, thereby uncovering 50 bugs and initiating multiple patches to the virtual device driver interface of Linux.

        Speakers: Felicitas Hetzelt (TU Berlin), Martin Radev, Robert Buhren, Mathias Morbitzer
      • 130
        Testing in-kernel Rust code

        The Rust for Linux project is adding support for the Rust language to the Linux kernel. A key part of such an effort is how to approach testing for code written in the new language.

        It covers:

        • A quick overview of testing in Rust: what testing usually looks like in Rust (unit tests, integration tests & documentation tests), what is provided by the language, standard library and tooling, etc.
        • The current in-kernel testing support.
        • Testing in the host vs. in the kernel vs. from userspace.
        • What we are planning for the future and related work.
        Speaker: Miguel Ojeda
      • 131
        KUnit: New Features and New Growth

        The past year has been an exciting one for KUnit, but there's still a long way to go to test a project as large and complicated as the Linux kernel. In this talk, we'll go over what KUnit has been doing since last year, and discuss how we can increase KUnit’s adoption throughout the Linux kernel.

        We'll begin with an overview of new and improved features that have been added to KUnit, such as QEMU support in kunit_tool, SKIP test support, as well as improvements to documentation. We'll also touch on features and ideas that we have been experimenting with, and the challenges and opportunities they have presented.

        We will then discuss KUnit's growing use, before transitioning into how we can increase adoption of KUnit across different parts of the kernel: for example, by migrating suitable ad-hoc tests into KUnit. We'll also talk about the challenges of testing drivers and subsystems, and how we are trying to build up a comprehensive set of tests in a major Linux kernel subsystem as a model to show how it can be done in other subsystems.

        At this point, we will transition to having a group discussion about how we can grow KUnit usage across the kernel, what the complexities of testing different subsystems are, and which of these features and plans seem most useful to the community.
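
        For readers unfamiliar with the framework, a minimal KUnit suite looks roughly like the sketch below (a generic example, not one of the tests discussed in this talk):

        #include <kunit/test.h>

        /* A trivial test case: KUNIT_EXPECT_* records a failure without
         * aborting the rest of the test. */
        static void example_add_test(struct kunit *test)
        {
                KUNIT_EXPECT_EQ(test, 3, 1 + 2);
                KUNIT_EXPECT_NE(test, 0, 1 + 2);
        }

        static struct kunit_case example_test_cases[] = {
                KUNIT_CASE(example_add_test),
                {}
        };

        static struct kunit_suite example_test_suite = {
                .name = "example-suite",
                .test_cases = example_test_cases,
        };
        kunit_test_suite(example_test_suite);

        Such a suite can then be built into a test kernel and run through kunit_tool, e.g. ./tools/testing/kunit/kunit.py run.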

        Speakers: Brendan Higgins (Google LLC), David Gow (Fellow Contributor)
    • Tracing MC Microconference1/Virtual-Room (LPC Virtual)

      Microconference1/Virtual-Room

      LPC Virtual

      150

      The Tracing microconference focuses on improvements to the Linux kernel tracing infrastructure: ways to make it more robust, faster, and easier to use. The tooling around the tracing infrastructure will also be discussed: which interfaces can be used, how to share features between the different toolkits, and how to make it easier for users to see what they need from their systems.

      • 132
        Introduction

        Short introduction to the Tracing Microconference

        Speaker: Steven Rostedt
      • 133
        DTrace based on BPF and tracing facilities: challenges

        The topic aims to present various challenges we have run into during the implementation of DTrace on top of Linux tracing facilities such as BPF. We hope to have an open discussion on how we can get around some of these challenges because they are likely to be things other projects will run into as well. In addition, we want to share some of the workarounds we came up with, and hopefully spark discussion on how to propose fixes rather than depending on creative workarounds.

        Speaker: Kris Van Hees (Oracle USA)
      • 134
        Enabling user mode programs to emit into trace_event / dyn_event

        Summary
        We have many user processes today that run in various locations and control groups. Knowing
        every binary's location for each version becomes a challenge. We also have common events that
        are traced across many processes. This makes using uprobes a challenge, but not impossible.
        However, having a way for user processes to publish data directly to trace_events enables a
        much easier path toward collecting and analyzing all of this data. We do not need to track
        per-binary locations, nor do we need to enter the control groups to find the real binary paths.

        Today the main way to create and get data into a trace_event from a user mode program is by
        using uprobes. Uprobes require the locations of each binary that wants to be traced in addition
        to all of the argument locations. We propose an alternative mechanism which allows for faster
        operation and doesn't require knowing code locations. While we could use inject and dynamic_events
        to do this as well, user processes don't have a way to know when inject should be written to.

        Knowing when to trace
        In order to have good performance, user mode programs must know when an event should be traced.
        Uprobes do this via a nop vs int3 and handle the break point in the die chain handler. To account
        for this, a tracefs file called user_events_mmap will be created which will be mmap'd in each
        user process that wants to emit events. Each byte in the mmap data will be 0 if nothing
        is attached to the trace_event, and non-zero if there is. It would be nice to use each bit of
        the byte to represent which system is attached (i.e. bit 0 for ftrace, bit 1 for perf, etc.). This
        has the limitation, however, of only being able to support up to 8 systems, unless bit 7 is reserved
        for "other". User programs simply branch on non-zero to determine if anything wants tracing. To
        protect the system from running out of trace_events, the number of user-defined events is limited
        to a single page. The kernel side keeps the page updated via the underlying trace_events register
        callbacks. The page is shared across all processes; it is mapped in as read-only by the mmap syscall.

        Opening / Registering Events
        Before a program can write events, it needs to register/open them. To do this, an IOCTL is issued
        to a tracefs file called user_events_data with a payload describing the event. The return value of
        the IOCTL is the byte within the mmap data to check to see whether tracing is enabled or not. The
        open file can now be used to write data for the event that was just registered. A matching IOCTL is
        available to delete events; delete is only valid when all references have been closed.

        Writing Event Data
        Writing event data using the above file is done via the write syscall. The data described in each
        write call will represent the data within the trace_event. The kernel side will map this data into
        each system that is registered, such as ftrace, perf and eBPF automatically for the user.

        Event status pseudo code:

        page_fd = open("/sys/kernel/tracing/user_events_mmap");
        status_page = mmap(page_fd, PAGE_SIZE);
        close(page_fd);
        

        Register event pseudo code:

        event_fd = open("/sys/kernel/tracing/user_events_data");
        event_id = IOCTL(event_fd, REG, "MyUserEvent");
        

        Write event pseudo code:

        if (status_page[event_id]) write(event_fd, "My user payload");
        

        Delete event pseudo code:

        IOCTL(event_fd, DEL, "MyUserEvent");
        
        Speaker: Beau Belgrave
      • 135
        Container tracing

        Providing adequate observability in containerized workloads is getting more important. Sophisticated instruments are being developed to understand what is really going on, but most of the effort approaches the problem from top to bottom, operating at the abstraction layers of container orchestration.

        What if we take the opposite approach and use Linux system tracing to unfold the container and look inside? How can we check what is being executed, what files are being opened, etc.?

        We currently have a POC based on kprobes that traces the system calls of a Docker container. It is written in Python and can be easily adapted to any changing environment, but it would be nice to have a standard API that would just work for all containers (not only Docker). Can we standardize the tracing of containers? How can we expand this to tracing containers on more than one machine?

        Speaker: Yordan Karadzhov (VMware)
      • 08:20
        Break
      • 136
        Tracepoints that allow faults.

        When invoked from system call enter/exit instrumentation, accessing user-space data is a common use-case for tracers. However, tracepoints currently disable preemption around iteration on the registered tracepoint probes and invocation of the probe callbacks, which prevents tracers from handling page faults.

        Discuss the use-cases enabled by allowing system call entry/exit tracepoints to take page faults, and what is missing to upstream this feature.

        https://lwn.net/Articles/835426/
        https://lwn.net/Articles/846795/

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 137
        LTTng as a fast system call tracer

        Upstreaming the LTTng kernel tracer [1] (originally created in 2005) into the Linux kernel has been a long-term goal of the LTTng project.

        Today, various tracing technologies are available in the Linux kernel: instrumentation with tracepoints, kprobes, kretprobes, function tracing, performance counters through perf, as well as user-visible ABIs, namely Ftrace, Perf, and eBPF. There are however areas in which the LTTng kernel tracer has unique capabilities which other tracers lack.

        Efficiently tracing system call entry/exit while fetching system call input/output parameters from user-space is a use-case the LTTng kernel tracer can cover, thanks to its ring buffer design which allows preemption.

        Discuss the challenges and establish a roadmap towards upstreaming the pieces of the LTTng kernel tracer required to trace system calls into the Linux kernel.

        [1] https://lttng.org

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 138
        Eventfs based upon VFS to reduce memory footprint.

        Problem Statement:
        Linux tracing provides a mechanism for having multiple instances of 'tracing', and each tracing instance has its own individual events directory, known as the 'Eventfs Tracing Infrastructure', i.e. '/sys/kernel/debug/tracing/events'.

        The 'Eventfs Tracing Infrastructure' contains a lot of files/directories; depending upon the kernel config, the number of files/directories can exceed 10k, which consumes memory in the MBs. Further, creating a new instance of 'Linux tracing' creates its own copy of 'events'.

        Solution:
        Create only the relevant directories/files at runtime, as they are used, and delete them when they are no longer required. This is based upon the virtual file system (VFS).

        POC/Code:
        Please refer to the prototype code here:
        https://gitlab.com/akaher/linux-trace/-/commits/ftrace/eventfs/

        Speaker: Ajay Kaher
      • 09:50
        Break
      • 139
        Function tracing with arguments

        With the new DYNAMIC_FTRACE_WITH_ARGS feature that x86 has (and hopefully other archs will have soon), the function tracer callback by default gets all the registers needed to see the arguments (but not all registers). In theory, we can use something like BTF, which can describe the arguments of every function, and use it to trace them.

        Currently, BPF can do this on a function by function basis, where it retrieves the arguments via generated code (with help from BTF). But for function tracing, generated code is not needed. Just a quick lookup of how the arguments are defined, and how to use the pt_regs to retrieve them.

        Secondly, once the arguments are retrieved, a generic way to write them to the ring buffer would also be needed.

        All the functionality to do this is now available in the kernel (DYNAMIC_FTRACE_WITH_ARGS and BTF). How to implement it is another question that needs to be solved, and this session will focus on that.

        Speakers: Steven Rostedt, Jiri Olsa
      • 140
        Merging the return caller infrastructures

        Currently there are three infrastructures that can trace the exit of a function:

        • kretprobes
        • function_graph
        • BPF direct trampolines

        Each one does it differently, and they can stumble over each other when they trace the same function call return. There should be a way that all three can somehow use the same infrastructure. At least maybe two of them?

        There's been prototypes to do this, but nothing satisfactory as of yet. Perhaps a meeting of the minds can help make this work?

        Speaker: Steven Rostedt
    • BPF & Networking Summit Networking and BPF Summit/Virtual-Room (LPC Virtual)

      Networking and BPF Summit/Virtual-Room

      LPC Virtual

      150

      The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 141
        Dynamic Encapsulation Using eBPF

        Prior to LWT (Lightweight Tunnels) and modern eBPF, the only way to send encapsulated packets to multiple destinations was to create multiple tunnel devices, which didn’t scale well when thousands of different destinations were needed.

        In the past Google solved this problem by introducing custom patches on top of the ip gre device to allow sockets to provide the destination address and encapsulation protocol, changing the encapsulation headers in flight. Thanks to the advancement of eBPF, this logic can now be implemented completely outside of the kernel in a less intrusive way, and with all of the benefits that come with eBPF.

        In this presentation I’m going to talk about how eBPF was used to encapsulate packets using the eBPF TC filter and the cgroup hooks, discuss the differences between this approach and LWT, and explain how this feature was easily extended to support a more interesting feature, “encapsulation header reflection”, which stores the encapsulation headers of incoming traffic and reflects them on the responses, making it transparent to the application. During the talk I'm also going to discuss the pain points found during the implementation, which led us to non-obvious solutions.

        The goal is to have an open discussion about how the problem was solved and the obstacles faced, and to highlight possible eBPF/kernel features that would have been nice to have, e.g. BPF_MAP_TYPE_NS_STORAGE (namespace storage).
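
        As a very rough sketch of what a TC-attached encapsulation program can look like (an illustrative fragment only, not the code described in this talk; the outer header is left incomplete and would normally be filled from a map lookup, with lengths and checksums fixed up):

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/in.h>
        #include <linux/ip.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        SEC("tc")
        int encap_demo(struct __sk_buff *skb)
        {
                struct iphdr outer = {
                        .version  = 4,
                        .ihl      = 5,
                        .ttl      = 64,
                        .protocol = IPPROTO_IPIP,
                        /* saddr/daddr/tot_len/check omitted in this sketch */
                };

                /* Reserve room for the outer IPv4 header right behind the MAC header. */
                if (bpf_skb_adjust_room(skb, sizeof(outer), BPF_ADJ_ROOM_MAC, 0))
                        return TC_ACT_SHOT;

                /* Write the (incomplete) outer header into the newly created gap. */
                if (bpf_skb_store_bytes(skb, ETH_HLEN, &outer, sizeof(outer), 0))
                        return TC_ACT_SHOT;

                return TC_ACT_OK;
        }

        char LICENSE[] SEC("license") = "GPL";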

        Speakers: Brian Vazquez (Google), Coco Li (Google), Stanislav Fomichev (Google), Willem de Bruijn (Google)
      • 142
        BPF user experience rough edges

        This talk highlights a few rough edges in the overall BPF user experience that we have observed while building services with BPF at Cloudflare. We will showcase a set of problems, analyze their cause, and present possible workarounds. The goal of the talk is to share collected know-how with other users, and trigger discussions on potential improvements.

        Collected cases fall into two distinct categories:

        1. issues when running BPF with as few capabilities as possible,
        2. issues when loading generated BPF programs.

        Within the first group we are going to cover such topics as:

        • locked memory limit (still relevant because present in LTS kernels),
        • credentials control on BPF links,
        • access control on BPF maps,
        • accessing pinned objects under /sys/fs/bpf,
        • incompatibility between existing socket maps.

        In the second category, we’ll cover various clang / LLVM optimizations that cause programs generated from C to fail the verifier after only small changes to the input:

        • optimized out packet bounds checks,
        • stack spilling,
        • register “mirroring”, where clang thinks they have the same value but not the verifier,
        • inter generated code optimizations.

        We’ll also discuss how we’re switching to a hybrid static C & generated eBPF model, and fuzzing the eBPF generator.

        Speakers: Jakub Sitnicki (Cloudflare), Arthur Fabre (Cloudflare)
      • 143
        From XDP to Socket

        In this talk, we describe important challenges in L4 and L7 load balancing for the consistent routing of packets across hosts as well as across sockets within a host, once a packet is received in the XDP based L4LB. We then describe how we leverage recent additions on the BPF programs to address those challenges.

        Typically some form of Consistent Hashing is used to pick an end host for incoming packets within an L4 LB [2]. Such mechanisms, however, pose challenges in maintaining routing consistency over a long window of time without sharing routing states among the L4LBs. At Facebook, we devised a novel server-id based routing scheme for completely stateless routing of both TCP and QUIC connections. For the routing of TCP packets, we leverage tcp_hdr_opt [1] to encode the server_id between the endpoints.

        ‘Zero downtime restart’ [3], supported by many L7 proxies such as Proxygen at Facebook, requires a lot of custom userspace machinery for routing consistency, especially for UDP payloads. Further, maintaining uniform load across individual sockets and CPU cores in a host is not straightforward without custom solutions. We describe how we leverage SO_REUSEPORT with a REUSEPORT_SOCKARRAY BPF map to create a framework that allows us to efficiently and effectively address both problems by:
        a) being able to make the routing decision when picking a socket on a per-packet (UDP) or per-connection (TCP) basis
        b) being able to granularly target an individual CPU core to handle incoming packets
        This has allowed us to run at scale with minimal operational load and further simplify our implementation to execute disruption-free restarts of the L7 proxy [3].
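
        As a hedged sketch of the kernel mechanism referred to above (illustrative only, not Facebook's production program), an sk_reuseport BPF program can steer each incoming packet or connection to a specific socket in a REUSEPORT_SOCKARRAY map:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
                __uint(max_entries, 64);
                __type(key, __u32);
                __type(value, __u64);
        } socket_map SEC(".maps");

        SEC("sk_reuseport")
        int select_socket(struct sk_reuseport_md *md)
        {
                /* Hypothetical policy: steer by flow hash, so the same flow always
                 * lands on the same socket (per-connection for TCP, per-packet for UDP). */
                __u32 key = md->hash % 64;

                if (bpf_sk_select_reuseport(md, &socket_map, &key, 0) == 0)
                        return SK_PASS;
                return SK_DROP;
        }

        char LICENSE[] SEC("license") = "GPL";

        User space would populate socket_map with the listening sockets and attach the program via the SO_ATTACH_REUSEPORT_EBPF socket option.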


        References

        1. M. Lau. BPF TCP header option. https://lwn.net/Articles/827672/
        2. D. E. Eisenbud, C. Yi, C. Contavalli, C. Smith, R. Kononov, E. Mann-Hielscher, A. Cilingiroglu, B. Cheyney, W. Shang, and J. D. Hosein.
          Maglev: A fast and reliable software network load balancer. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), Mar. 2016
        3. U Naseer, L Niccolini, U Pant, A Frindell, R Dasineni, TA Benson.
          Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website
          SIGCOMM ’20: Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication
        4. Katran - A high performance layer 4 load balancer. https://bit.ly/38ktXD7.
        Speakers: Udip Pant (Facebook), Martin Lau (Facebook)
      • 144
        A proof-carrying approach to building correct and flexible BPF verifiers

        The BPF verifier is an integral part of the BPF ecosystem, aiming to
        prevent unsafe BPF programs from executing in the kernel. Due to its
        complexity, the verifier is susceptible to bugs that can allow malicious
        BPF programs through. A number of bugs have been found in the BPF
        verifier, some of which have led to CVEs (1, 2, 3). These bugs are severe,
        since the verifier is on the critical path for ensuring kernel security.

        Due to its design, the verifier is also overly strict: it may reject many
        safe BPF programs because it lacks sophisticated analyses to recognize
        their safety. When a BPF program is rejected by the verifier, it can be a
        frustrating experience (4). To get their program accepted by the verifier,
        developers often have to resort to ad-hoc fixes, tweaking C source code
        or disabling optimizations in LLVM. This solution becomes brittle as
        developers write more complex BPF programs and new optimizations are
        introduced in LLVM.

        In this talk, we argue that a more systematic approach is to freeze
        the kernel side of the BPF verifier and move most of its complexity
        to user space. To do so, we introduce formal, machine-checkable proofs
        of the safety of BPF programs. Applications provide proofs that their
        BPF programs are safe, and a proof checker in the kernel validates the
        proofs. By decoupling proof validation from generation, this achieves
        two goals. First, the kernel side of the interface is fixed to be a
        specification of BPF program safety and the proof checker, avoiding
        the ever-growing complexity of the BPF verifier in the kernel. Second,
        applications can choose an appropriate strategy to generate proofs for
        their BPF programs. Since the proofs are untrusted, there is no risk of
        applications introducing bugs from complex proof strategies.

        We have been building a prototype BPF verifier using this approach.
        Our prototype uses the logic of the Lean theorem prover (5), which has
        been thoroughly analyzed (6) and has multiple independent implementations
        of proof checkers (7). We are developing two automated strategies for
        generating proofs. The first strategy mimics the current BPF verifier.
        It implements an abstract interpreter for BPF programs that uses ranges
        and tristate numbers to approximate sets of values of BPF registers. The
        second strategy uses symbolic execution to encode the semantics of a
        BPF program as boolean constraints, which are discharged using a SAT
        solver. Both strategies produce proofs that are validated by the proof
        checker, avoiding the possibility of introducing bugs like those that
        have been found in the current verifier.

        Our goal is to present an alternative approach to building the BPF
        verifier, and explore the advantages and limitations of this approach.
        We would like to start a discussion on ways to combine both approaches
        in a pragmatic way.

        Speakers: Luke Nelson (University of Washington), Xi Wang (University of Washington), Emina Torlak (University of Washington)
      • 145
        Pixie's eBPF Protocol Tracer

        We present Pixie’s protocol tracer, which uses eBPF to provide instant observability into application messaging without requiring code instrumentation. Pixie’s protocol tracer uses eBPF kprobes on networking-related system calls to capture communication data, which it then parses into protocol messages. The messages are inserted into structured data tables that are easily queried by application developers to help them gain insight into their application behavior.
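
        As a minimal sketch of the kprobe-on-a-syscall-handler idea (the attach point and output below are illustrative only; they are not Pixie's actual probes or data path):

        #include <linux/bpf.h>
        #include <linux/ptrace.h>
        #include <bpf/bpf_helpers.h>

        /* A real tracer would also read the buffer pointer and length from the
         * registers (bpf_probe_read_user) and hand the payload to user space via
         * a perf or ring buffer for protocol parsing. */
        SEC("kprobe/__sys_sendto")
        int probe_sendto(struct pt_regs *ctx)
        {
                __u32 pid = bpf_get_current_pid_tgid() >> 32;

                bpf_printk("sendto() entered by pid %u", pid);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";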

        We contrast our syscall tracing approach against other approaches (e.g. libpcap and uprobes), and discuss pros and cons. We share what worked well with our approach, and also the challenges we faced, including eBPF-related challenges of tracing syscalls that have a multitude of usage patterns.

        Finally, we discuss the limitations of kprobe based tracing, in particular with respect to stateful protocols like HTTP/2 and encrypted connections like those that use TLS. We describe our complementary approach that uses eBPF uprobes on user-space libraries to capture the data in these scenarios.

        We hope the technical details presented here will be of value to the eBPF community, and we are eager to hear from the eBPF community about potential improvements and suggestions for future directions.

        Speakers: Omid Azizi (Pixie Labs), Yaxiong Zhao (Pixie Labs), Ryan Cheng (Pixie Labs), John P Stevenson (Pixie Labs), Zain Asgar (Pixie Labs)
    • GNU Tools Track GNU Tools track/Virtual-Room (LPC Virtual)

      GNU Tools track/Virtual-Room

      LPC Virtual

      150
      • 146
        Eliminating implicit function declarations

        The 1999 revision of ISO C removed implicit function declarations from the language. Instead, all functions must be declared (with or without a prototype) before they can be called. In previous language versions, a function f was implicitly declared as extern int f (); if the identifier f was used in a call expression (such as f (1, 2, 3.0)).
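
        A minimal, self-contained illustration (the function f and the build command below are one possible example, not taken from the session):

        /* Build with: gcc -std=c99 -Werror=implicit-function-declaration demo.c
         * Removing the prototype below reproduces the implicit-declaration error
         * this session is about; with the prototype, the program is valid C99. */
        #include <stdio.h>

        int f(int a, int b, double c);          /* the fix: declare before use */

        int main(void)
        {
                /* Without the declaration above, pre-C99 compilers silently
                 * assumed: extern int f(); */
                printf("%d\n", f(1, 2, 3.0));
                return 0;
        }

        int f(int a, int b, double c)
        {
                return a + b + (int)c;
        }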

        When GCC switched the default to C99 mode, it was impossible to disable implicit function declarations by default because too many autoconf checks (and similar compile-time inspection) failed, which often resulted in successful compilation (and testing) of programs without the intended feature set.

        Over the years, not much progress has been made on this issue. For example, a GCC configure test was only fixed in 2019. Recently, Apple Xcode enabled -Werror=implicit-function-declaration by default, but apparently without fixing resulting problems across free software upstreams.

        This session intends to explore whether it is time to make a concerted effort at this problem, and how to approach it.

        Intended format: short prepared presentation (10 minutes), followed by discussion.

        Speaker: Florian Weimer (Red Hat)
      • 147
        Debugging offloaded kernels on AMD GPUs

        A demonstration of debugging OpenMP/OpenACC kernels using GDB, and a quick overview of how it was achieved and what still needs to be done.

        Speaker: Andrew Stubbs (Mentor Graphics / CodeSourcery)
      • 148
        New mod/ref pass in GCC

        We discuss the implementation of the new inter-procedural mod/ref pass. The pass collects information about memory locations modified or read by a given function, as well as information useful for points-to analysis (such as whether a given parameter can escape to global memory or to the return value of the function).

        The first version of the mod/ref pass was contributed to GCC 11 and is enabled by default. We also discuss improvements done for GCC 12 and some basic benchmarks.

        This is a joint work with David Čepelík.

        Speaker: Jan Hubicka (SUSE ČR)
      • 149
        GNU tool chain for CORE-V

        CORE-V is a family of RISC-V processor cores developed to commercially robust standards by the Open Hardware Group, a consortium of industrial and academic organizations.

        In the first part of this talk we give an update on the work on the GNU tool chain for the CV32E40P, the first of the CORE-V family with custom extensions for branching, autoincrement load/store, hardware loops, multiply accumulate and general CPU use. This is a joint effort by Embecosm and the University of Bologna, and has relied on the GVSoC simulator developed as part of the PULP project.

        The second part of the talk looks at the use of GVSoC as a GCC tool chain test target. GVSoC is a RISCV virtual platform, which is a fully open-sourced tool designed to drive future architectural research in the area of highly parallel and heterogeneous RISC-V based IoT platforms. Consisting of a highly configurable event-driven full-platform simulator, GVSoC is capable of performing extremely accurate timing simulations. By reaching 25 MIPS and 100% functional accuracy, the virtual platform supports simulating a broad range of hardware IP blocks, including standalone RISC-V cores, multi-core accelerator Clusters, memories, DMAs, and many other components. While efficient C++ models describe hardware IP blocks and flexible Python scripts instantiate components, a powerful built-in Instruction Set Simulator (ISS) enables simulating complete Parallel Ultra-Low-Power (PULP) systems.

        To support the GNU tools test suite targeting CV32E40P core execution, we expanded the GVSoC ISS, integrating the CORE-V Instruction Set Architecture (ISA) extensions. Along with it, we extended the DejaGnu testing framework, adding a custom baseboard that describes linker and compiler options. Lastly, we relied on a pre-compiled platform-dependent runtime linked by the DejaGnu tool at testing time to enable faster execution.

        A central part of this work is that the tool chain should be upstreamed as a vendor variant, thus riscv32-corev-elf-gcc rather than riscv32-unknown-elf-gcc. In this talk we will look at the work remaining before this can be submitted.

        Speakers: Jeremy Bennett (Embecosm), Ms Jessica Mills (Embecosm), Prof. Giuseppe Tagliavini (University of Bologna), Mr Nazareno Bruschi (University of Bologna), Enrico Tabanelli (University of Bologna)
      • 09:15
        Coffee Break
      • 150
        RISC-V BoF

        This is more of a placeholder than anything else: there's an email thread going around that was a bit inconclusive as to whether or not we should have one of these, so I figured it'd be easier to just make one.

        Speakers: Palmer Dabbelt (Google), Jim Wilson (SiFive), Kito Cheng (SiFive)
      • 151
        BoF: Register pressure sensitivity in the gcc middle end

        There are a number of optimizations done in the middle end that would benefit from understanding the amount of register pressure. Unrolling, inlining, and parallel reassociation are some that come to mind immediately. I think it would be good to have a discussion about how these optimizations might get pressure information to know how aggressive they should be.

        Speaker: Aaron Sawdey (IBM)
    • Kernel Dependability and Assurance MC Microconference1/Virtual-Room (LPC Virtual)

      Microconference1/Virtual-Room

      LPC Virtual

      150

      The Kernel Dependability and Assurance Microconference focuses on infrastructure to be able to assure software quality and that the Linux kernel is dependable in applications that require predictability and trust.

      Conveners: Gabriele Paoloni, Shuah Khan (The Linux Foundation)
      • 152
        Kernel Dependability & Assurance Welcome

        Introduction to the track and welcome speakers and audience.

        Speakers: Gabriele Paoloni (Red Hat), Shuah Khan (The Linux Foundation)
      • 153
        Runtime redundancy and monitoring for critical subsystem/components

        Redundancy and diversity are well-recognized ways to detect and control systematic SW failures. Runtime Verification Monitors provide a diverse redundancy mechanism for critical components in the kernel.

        Speakers: Daniel Bristot de Oliveira (REd Hat), Mr Gabriele Paoloni (Red Hat)
      • 154
        Traceability and code coverage: what we have in Linux and how it contributes to safety

        This session will give an overview of the KernelCI and CKI projects, how to obtain code coverage figures, and what the current gaps and possible improvements are in view of the coverage and traceability requirements to be met in functional safety systems.

        Speaker: Rachel Sibley (Red Hat)
      • 155
        Adding kernel-specific test coverage to GCC's -fanalyzer option

        I'm the author of GCC's static analysis pass, -fanalyzer. I've been experimenting with extending it to add kernel-specific diagnostics: detecting infoleaks and unsanitized syscalls at compile-time.  I'd like to discuss these and other ideas for improving the test coverage of our kernel builds.
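
        As an illustration of the kind of pattern such kernel-specific diagnostics target (a hypothetical example, not taken from the -fanalyzer test suite), consider an infoleak through uninitialized structure padding:

        #include <linux/errno.h>
        #include <linux/types.h>
        #include <linux/uaccess.h>

        /* Implicit padding follows 'flag'; the padding bytes of the on-stack
         * instance below are never initialized. */
        struct reply {
                u32 id;
                u8  flag;
                u64 value;
        };

        /* Copying the whole struct to user space leaks the uninitialized padding:
         * exactly the sort of defect a compile-time infoleak checker aims to flag. */
        long fill_reply(void __user *dst)
        {
                struct reply r;

                r.id = 1;
                r.flag = 0;
                r.value = 42;
                return copy_to_user(dst, &r, sizeof(r)) ? -EFAULT : 0;
        }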

        Speakers: Carlos O'Donell (Red Hat), David Malcolm (Red Hat)
      • 08:55
        Break
      • 156
        A bug is NOT a bug is NOT a bug: Differences in bug classes, bug tracking and bug impact

        Security and safety engineering, as well as quality management, share a common goal: avoiding or eliminating bugs and complete bug classes in software. Hence, these fields of engineering may share methods, tools, well-known best practices, and development efforts during software development. However, these fields of engineering also have different (partly competing) goals and priorities. Understanding these different goals and priorities of different stakeholders can be summarized as “A bug is NOT a bug is NOT a bug”.
        Let us go through what exists in the kernel community and discuss the alignment of ongoing and future work in a structured, moderated way.

        In this discussion, I would like to touch on:
        - Various attempts of defining “a bug” and its implications
        - Classifying “bugs” into bug classes
        - Assessing suitable bug tracking methods, tools and best practices for different bug classes.
        - Assessing impact for different bug classes and decisions in follow-up work to the bug fixing that may be taken depending on the bugs’ impact and the stakeholders’ priorities.

        Speakers: Lukas Bulwahn, Sudip Mukherjee
      • 157
        Kernel cgroups and namespaces: can they contribute to freedom from interference claims?

        Freedom From Interference (FFI) is a key claim that must be satisfied in functional safety systems supporting applications with mixed criticality: this session introduces cgroups and namespaces to have an open discussion on how they can contribute to FFI.

        Speakers: Bruce Benson (Red Hat), Priyanka Verma (Red Hat)
      • 158
        Kernel testing frameworks

        This session gives you an overview of the Kselftest and KUnit frameworks and how to use them for unit and regression testing.

        Speakers: Brendan Higgins (Google), Shuah Khan (The Linux Foundation)
      • 159
        Kernel Dependability & Assurance Wrapup

        Kernel Dependability & Assurance Wrapup

        Speakers: Gabriele Paoloni (Red Hat), Shuah Khan (The Linux Foundation)
    • Kernel Summit Kernel Summit/Virtual-Room (LPC Virtual)

      Kernel Summit/Virtual-Room

      LPC Virtual

      400
      • 160
        Integrating GitLab into the Red Hat kernel workflow

        The Red Hat kernel team recently converted their RHEL workflow from PatchWork to GitLab. This talk will discuss what the new workflow looks like with integrated CI and reduced emails. New tooling had to be created to assist the developer and reviewer. Webhooks were utilized to automate as much of the process as possible making it easy for a maintainer to track progress of each submitted change. Finally using CKI, every submitted change has to pass CI checks before it can be merged.

        We faced many challenges, especially around reviewing changes. Resolving those led to a reduction of email usage and an increase in CLI tools. Demos of those tools will be included.

        Attendees will leave with an understanding of how to convert or supplement their workflow with GitLab.

        Speaker: Don Zickus (Red Hat)
      • 161
        Writing a fine-grained access pattern oriented lightweight kernel module using DAMON/DAMOS in 10 minutes

        DAMON and DAMOS

        DAMON[1] is a framework for general data access monitoring of kernel
        subsystems. It provides best-effort high quality monitoring results while
        incurring only minimal and upper-bounded overhead, due to its practical
        overhead-accuracy tradeoff mechanism. On a production machine utilizing 70 GB
        memory, it can repeatedly scan accesses to the whole memory every 5ms,
        while consuming only 1% of a single CPU's time.

        On top of it, a data access pattern-oriented memory management engine called
        DAMON-based Operation Schemes (DAMOS) is implemented. It allows clients to
        implement their access pattern oriented memory management logic with very
        simple scheme descriptions. We implemented fine-grained access-aware THP and
        proactive reclamation using this engine in three lines of scheme and achieved
        remarkable improvements[2].

        As of this writing (2021-05-28), the code is not in the mainline but available
        at its development tree[3], and regularly posted to LKML as patchsets[4,5,6].
        Nevertheless, the code has already been merged in the public Amazon Linux kernel
        trees[7,8], and all Amazon Linux users can use DAMON/DAMOS out of the box. We are
        also supporting the two latest upstream LTS stable kernels[9,10].

        Agenda

        In this talk, I will briefly introduce DAMON/DAMOS and present how you can
        write a fine-grained data access pattern oriented lightweight kernel module on
        top of DAMON/DAMOS. During the talk, I will write an example module and evaluate
        its performance live. A data access-aware proactive reclamation kernel
        module for production use will also be introduced as a use case. After that, I
        will discuss my future plans for improving DAMON and improving other kernel
        subsystems using DAMON/DAMOS.

        [1] https://damonitor.github.io/
        [2] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
        [3] https://github.com/sjp38/linux/tree/damon/master
        [4] https://lore.kernel.org/linux-mm/20210520075629.4332-1-sj38.park@gmail.com/
        [5] https://lore.kernel.org/linux-mm/20201216084404.23183-1-sjpark@amazon.com/
        [6] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
        [7] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
        [8] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
        [9] https://github.com/sjp38/linux/tree/damon/for-v5.4.y
        [10] https://github.com/sjp38/linux/tree/damon/for-v5.10.y

        Speaker: SeongJae Park
      • 162
        User Interrupts - A faster way to Signal

        User Interrupts is a hardware technology that enables delivering interrupts directly to user space.

        Today, virtually all communication across privilege boundaries happens by going through the kernel. This includes signals, pipes, remote procedure calls and hardware interrupt based notifications.

        User interrupts provide the foundation for more efficient (low latency and low CPU utilization) versions of these common operations by avoiding transitions through the kernel. User interrupts can be sent by another user space task, the kernel, or an external source (like a device).

        The intention is to describe the general infrastructure being developed to receive user interrupts and to deep-dive into a single source: interrupts from another user task.

        The goal of this session is to:
        - Get feedback on the overall software architecture.
        - Discuss the main open issues.

        Speaker: Sohil Mehta
      • 163
        Building a fast nvme passthrough

        New storage features, especially in NVMe, are emerging fast. It
        takes time and a good deal of consensus-building for a device feature
        to move up the ladder of the kernel I/O stack and show up to user space.
        This presents challenges for early technology adopters.

        The passthrough interface allows such features to be usable (at least
        in a native way) without having to build block-generic commands,
        in-kernel users, emulations and file-generic user interfaces. That said,
        even though the passthrough interface cuts through layers of
        abstraction and reaches NVMe fast, it has remained tied to the
        synchronous ioctl interface, making it virtually useless for the fast I/O path.
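
        To make the starting point concrete, below is a minimal user-space sketch of
        that existing synchronous passthrough path, issuing an Identify Controller
        admin command through NVME_IOCTL_ADMIN_CMD. The device node path is an
        assumption and error handling is trimmed; this is an illustration, not part
        of the proposed work.

        /*
         * Synchronous NVMe passthrough sketch: one blocking ioctl per command.
         * Builds against the kernel UAPI header <linux/nvme_ioctl.h>.
         */
        #include <fcntl.h>
        #include <linux/nvme_ioctl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        int main(void)
        {
            struct nvme_admin_cmd cmd = { 0 };
            void *buf = calloc(1, 4096);
            int fd, ret;

            if (!buf)
                return 1;
            fd = open("/dev/nvme0", O_RDWR);   /* controller char device (assumed path) */
            if (fd < 0) {
                perror("open");
                return 1;
            }

            cmd.opcode   = 0x06;               /* Identify */
            cmd.nsid     = 0;
            cmd.addr     = (uintptr_t)buf;     /* 4 KiB buffer for the Identify data */
            cmd.data_len = 4096;
            cmd.cdw10    = 1;                  /* CNS=1: Identify Controller */

            ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);   /* blocks until completion */
            if (ret)
                fprintf(stderr, "identify failed: %d\n", ret);
            else
                printf("model: %.40s\n", (char *)buf + 24); /* MN field of the data */

            close(fd);
            free(buf);
            return ret != 0;
        }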

        In this talk I will present the elements towards building a scalable
        passthrough that can be readily used to play with new NVMe features.
        More specifically, recent upstream efforts involving:

        • Emergence of a per-namespace char interface that remains
          available/usable even for unsupported features and new command sets [1]
        • The async-ioctl facility 'uring_cmd' that Jens proposed in io_uring [2]
        • The async nvme-passthrough that I posted on top of 'uring_cmd' [3]

        A performance evaluation comparing this new interface with existing ones
        will be provided.
        I would like to gather feedback on the design decisions and discuss
        how best to go about infusing more performance-centric advancements (e.g.
        async polling, registered buffers, etc.) into this path.

        [1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
        [2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
        [3] https://lore.kernel.org/linux-nvme/20210325170540.59619-1-joshi.k@samsung.com/

        Speaker: kanchan joshi
    • LPC Refereed Track Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 164
        Linux and Zephyr interoperability - the start of a beautiful relationship

        You're a company that is working on a suite of products that spans every conceivable gadget built for the smarthome - from a simple thermostat to security alarms, from set top boxes to internet gateways, mobile phones and tablets and even servers running in the cloud.

        Linux is a fairly obvious choice to build these products on top of - it scales well for devices with more than 128MB of RAM and storage. On devices at the resource-constrained end of the spectrum, Zephyr is quickly maturing into a competitive option, able to run even on devices with as little as a few hundred KB of RAM and storage. Both ecosystems have diverse and active communities and an open governance model, so it is a no-brainer to use them as the basis for the company's entire suite of products.

        Considering Linux and Zephyr as two parts of a single product platform allows for a coherent view of both ecosystems by developers. You want to make sure that you can apply the same set of software configurations and policies across both ecosystems e.g. library versions, compatible protocol suites, security configurations, OTA mechanisms and even a single set of IP compliance tools.

        As an example, when you decide you want to secure all your network communications out-of-the-box in your product platform, you need to:

        1. Find an SSL library that'll fit both footprints
        2. Configure it to have a coherent set of modern ciphers compatible across the two
          ecosystems
        3. Get the various protocol libraries to build against the chosen SSL library
        4. Create key provisioning tools that can work across the two ecosystems
        5. Perform interoperability tests

        Now repeat this exercise across every key component of the OS - security policy, networking features, OTA, toolchain hardening, IP compliance tools - and you end up with a meta-project that spans and contributes to both ecosystems.

        We've started to build such an open product platform with opinionated defaults that follow community best practices at https://ostc-eu.org. This is our story about the challenges we've seen in getting to a coherent configuration that'll work across the entire suite of products, across Linux and Zephyr, and how we want to improve this interoperability in the future.

        Speaker: Amit Kucheria (Linaro)
      • 165
        The forefront of the development for NVDIMM on Linux Kernel

        NVDIMM (Non-Volatile DIMM) is a particularly interesting device because it has characteristics of not only memory but also storage.
        To support NVDIMM, Linux kernel provides three access methods for users.
        - Storage (Sector) mode
        - Filesystem DAX(=Direct Access) mode
        - Device DAX mode.

        Of the above three methods, Filesystem DAX is the most anticipated access method,
        because applications can write data to the NVDIMM area directly,
        and it is easier to use than Device DAX mode.
        So, some software already uses it with official support.
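
        As a rough illustration of that programming model, the sketch below maps a
        file on a DAX-mounted filesystem with MAP_SYNC so that stores go to the
        persistent media directly and only CPU cache flushes are needed. The file
        path is hypothetical and the flush intrinsics are x86-only; real applications
        would typically use a library such as PMDK.

        #include <emmintrin.h>   /* _mm_clflush / _mm_sfence, x86 illustration only */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MAP_SHARED_VALIDATE
        #define MAP_SHARED_VALIDATE 0x03
        #endif
        #ifndef MAP_SYNC
        #define MAP_SYNC 0x080000
        #endif

        int main(void)
        {
            int fd = open("/mnt/pmem/data", O_RDWR | O_CREAT, 0600); /* file on a dax fs */
            if (fd < 0 || ftruncate(fd, 4096)) {
                perror("open/ftruncate");
                return 1;
            }

            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap(MAP_SYNC)");      /* fails if the fs/device lacks DAX */
                return 1;
            }

            strcpy(p, "hello, pmem");          /* store lands in the NVDIMM page */
            _mm_clflush(p);                    /* flush the cache line to media */
            _mm_sfence();                      /* order the flush */

            munmap(p, 4096);
            close(fd);
            return 0;
        }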

        However, Filesystem DAX still has "experimental" status in the upstream community due to some difficult issues.

        In this session, Yasunori Goto will talk about the forefront of NVDIMM development, and Ruan Shiyang will talk about his challenge, with the latest status since Open Source Summit Japan 2020.

        Speakers: Mr Yasunori Goto (Fujitsu Ltd), Mr Shiyang Ruan (Nanjing Fujitsu Nanda Software Technology Co., Ltd)
      • 166
        So you want to torture RCU?

        Let's face it, using synchronization primitives such as RCU can be frustrating. And it is only natural to wish to get back, somehow, at the source of such frustration. In short, it is quite understandable to want to torture RCU. (And other synchronization primitives as well, but you have to start somewhere!) Another benefit of torturing RCU is that doing so sometimes uncovers bugs in other parts of the kernel. You see, RCU is not always willing to suffer alone.

        This talk will give an overview of how to torture RCU using the rcutorture test suite. It will also present a few of rcutorture's tricks that permit short tests on a smallish number of modest systems to nevertheless provide some assurance that RCU will run robustly on billions of systems across the inner solar system.

        Speaker: Paul McKenney (Facebook)
      • 167
        Protection Key Supervisor (PKS)

        Protection Key Supervisor provides fast, thread-specific manipulation of permission restrictions on kernel pages.

        Multiple patch sets have been reviewed recently targeting an initial use case to provide stray write protection to persistent memory.

        Persistent memory is mapped into the direct map and, unlike regular DRAM, it is particularly vulnerable to programming errors which would result in the corruption of data.

        Additional use cases have been explored and will be included in the presentation, specifically the hardening of page tables and other sensitive kernel data.

        Speaker: Rick Edgecombe (Co-worker)
    • Performance and Scalability MC Microconference3/Virtual-Room (LPC Virtual)

      Microconference3/Virtual-Room

      LPC Virtual

      150

      The Performance and Scalability microconference focuses on enhancing performance and scalability in both the Linux kernel and userspace projects. In fact, one of the purposes of this microconference is for developers from different projects to meet and collaborate – not only kernel developers but also researchers doing more experimental work. After all, for the user to see good performance and scalability, all relevant projects must perform and scale well.

      Because performance and scalability are very generic topics, this track is aimed at issues that may not be addressed in other, more specific sessions. The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.

      • 168
        Optimize Page Placement in Tiered Memory System

        Traditionally, all RAM is DRAM. Some DRAM might be closer/faster than
        others, but a byte of media has about the same cost whether it is close
        or far. But, with new memory tiers such as High-Bandwidth Memory or
        Persistent Memory, there is a choice between fast/expensive and
        slow/cheap.

        We use the existing reclaim mechanisms for moving cold data out of
        fast/expensive tiers. It works well for that. However, reclaim does
        not work well for moving hot data which might be stuck in a slow tier
        since the pages near the top of the LRU are the most recently accessed
        only if there’s regular memory pressure on the slow/cheap tiers.

        Fortunately, NUMA Balancing can find recently-accessed pages
        regardless of memory pressure. We have repurposed it from being used
        for location-based optimization to being used for tier-based
        optimization. We have also optimized it for better hot data
        identification, such as to find frequently-accessed pages instead of
        recently-accessed pages, etc.

        We will show our findings so far, and discuss the remaining problems,
        potential solutions, and alternatives.

        The patchset email threads are as follows,

        https://lore.kernel.org/linux-mm/20210625073204.1005986-1-ying.huang@intel.com/
        https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/

        Speaker: Ying Huang
      • 169
        "cat /proc/PID/maps": What Could Possibly Go Wrong?

        Large installations require considerable monitoring and control, and the occasional scan of procfs files is often the best tool for the monitoring job at hand. In cases where memory consumption is a concern, /proc/PID/{maps,numa_maps,smaps,smaps_rollup} can be quite helpful.

        To your monitoring, anyway.

        Unfortunately, some mm-related procfs files need to acquire the dreaded mmap_sem. This can be a problem if the Very Important Process being monitored needs to modify its address space. Especially if your monitoring software has been fenced into a highly CPU-constrained cgroups-based container, in order to avoid interfering with Very Important Processes. Except that all of these procfs files acquire sleeplocks that might also be acquired by your Very Important Process. Plus your monitoring software might be preempted while holding one of these sleeplocks, that after all being the whole point of the aforementioned container. This can (and does) result in severe performance degradation.

        Infrequently and intermittently.

        We therefore have an abusive stress test that forces this condition to occur on small systems in less than one minute's time [1].

        This proposal, if accepted, will demonstrate this test program and a few schemes intended to make procfs-based monitoring safe for Very Important Processes [2].

        [1] https://github.com/paulmckrcu/proc-mmap_sem-test
        [2] https://git.infradead.org/users/willy/linux-maple.git/shortlog/refs/heads/proc-vma-rcu
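
        For reference, the monitoring side of the story can be as simple as the sketch
        below, which scans /proc/PID/maps (PID 1 is just an example target); each such
        scan may take the target's mmap_sem, which is exactly the interaction described
        above.

        #include <stdio.h>

        int main(void)
        {
            char line[512];
            FILE *f = fopen("/proc/1/maps", "r");   /* any monitored PID would do */

            if (!f) {
                perror("fopen");
                return 1;
            }
            while (fgets(line, sizeof(line), f))    /* one VMA per line */
                fputs(line, stdout);
            fclose(f);
            return 0;
        }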

        Speakers: Paul McKenney (Facebook), Matthew Wilcox (Oracle)
      • 08:15
        Break
      • 170
        Design discussion and performance characteristics of Maple Tree

        The maple tree is an RCU-safe range-based B-Tree that was designed to fit a
        number of Linux kernel use cases. Most recently the maple tree has been sent
        upstream as a patch set that replaces the vma rbtree, the vma linked list, and
        the vmacache while maintaining the current performance level. This performance
        should improve as the RCU aspect of the tree is leveraged to remove mmap_sem
        contention.

        This talk will cover the performance aspects of the tree, some future ideas,
        and other areas beyond the VMA that would benefit from the tree.

        Speaker: Liam Howlett (Oracle)
      • 171
        Preserving state for fast hypervisor update

        It is currently possible to do a fast hypervisor update by preserving virtual machine state in memory during reboot. This approach relies on using emulated PMEM, DAX, and local live migration technologies.

        As of today, there are a number of limitations with this approach:

        1. The interface to preserve VM memory is not very flexible. The size and location of PMEM must be determined prior to hypervisor boot and cannot be changed later. Setting PMEM size and location requires intimate knowledge of the memory layout of the physical machine and thus, the settings are not portable.

        2. The upstream kernel cannot preserve the state of devices. While there was work done by Intel in this direction, the work has not been upstreamed or discussed on public mailing lists. It also has some major limitations: 1) it is Intel IOMMU specific; 2) reboot through firmware is not supported, it can only work with kexec reboot; 3) device state is preserved in different memory from the VM.

        3. There is no way to preserve the state of virtual functions.

        In this presentation, we will show a demo of fast hypervisor update. We will have a discussion about how the three stated problems can be resolved.

        The goal is to be able to preserve virtual machine state, and that of any devices attached to it, through a kexec reboot and, if the firmware supports it, through a firmware reboot. Also, the approach should be extensible to work on any platform with KVM and IOMMU support.

        Speaker: Pasha Tatashin
      • 09:45
        Break
      • 172
        PKRAM feature development

        Preserved-over-kexec memory storage, or PKRAM, provides an API for saving memory pages of the currently executing kernel so that they may be restored after kexec into a new kernel. PKRAM provides a flexible way of doing this without requiring that the amount of memory used be a fixed size created a priori.

        One use case for PKRAM is preserving guest memory and/or auxiliary supporting
        data (e.g. iommu data) across a kexec reboot of the host, and there is interest in extending it to work with emulated or real persistent memory.

        Let's discuss the current state of PKRAM, its limitations, and future direction.

        Speaker: Anthony Yznaga
      • 173
        Compact NUMA-aware Locks

        Lock throughput can be increased by handing a lock to a waiter on the
        same NUMA node as the lock holder, provided care is taken to avoid
        starvation of waiters on other NUMA nodes. This talk will discuss CNA
        (compact NUMA-aware lock) as the slow path alternative for the current
        implementation of qspinlocks in the kernel.

        CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
        organized in two queues, a main queue for threads running on the same
        node as the current lock holder, and a secondary queue for threads
        running on other nodes. Experimental results with micro and macrobenchmarks
        confirm that the throughput of a system with contended qspinlocks can increase
        up to ~3x with CNA, depending on the actual workload.
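
        As background for the discussion, below is a minimal user-space sketch of the
        classic MCS queue lock on which CNA builds (CNA adds the NUMA-local secondary
        queue on top of this structure). It is illustrative only and not the kernel's
        qspinlock or CNA code.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        struct mcs_node {
            _Atomic(struct mcs_node *) next;
            atomic_bool locked;                /* true while this waiter must spin */
        };

        struct mcs_lock {
            _Atomic(struct mcs_node *) tail;   /* last waiter in the queue, or NULL */
        };

        static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *self)
        {
            struct mcs_node *prev;

            atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
            atomic_store_explicit(&self->locked, true, memory_order_relaxed);

            /* Join the queue; the previous tail (if any) hands the lock over. */
            prev = atomic_exchange_explicit(&lock->tail, self, memory_order_acq_rel);
            if (prev) {
                atomic_store_explicit(&prev->next, self, memory_order_release);
                while (atomic_load_explicit(&self->locked, memory_order_acquire))
                    ;                          /* spin on our own cache line */
            }
        }

        static void mcs_release(struct mcs_lock *lock, struct mcs_node *self)
        {
            struct mcs_node *next =
                atomic_load_explicit(&self->next, memory_order_acquire);

            if (!next) {
                /* No visible successor: try to reset the queue to empty. */
                struct mcs_node *expected = self;
                if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected,
                                                            NULL, memory_order_acq_rel,
                                                            memory_order_acquire))
                    return;
                /* A successor is enqueueing; wait for it to link itself. */
                while (!(next = atomic_load_explicit(&self->next, memory_order_acquire)))
                    ;
            }
            atomic_store_explicit(&next->locked, false, memory_order_release);
        }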

        Speakers: Alex Kogan (Oracle Labs), Dave Dice (Oracle Labs)
    • VFIO/IOMMU/PCI MC Microconference2/Virtual-Room (LPC Virtual)

      Microconference2/Virtual-Room

      LPC Virtual

      150

      The VFIO/IOMMU/PCI micro-conference focuses on coordination between PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), so that related kernel interfaces and userspace APIs are designed in sync and in a clean way for all three sub-systems. It also covers the kernel code that enables new system features that often require coordination between the VFIO, IOMMU and PCI sub-systems.

      • 174
        Page-Based Hardware Attributes (PBHA) on arm64

        Version 8.2 of the Armv8 architecture introduced some mysterious bits to the PTE entries used by the CPU and SMMU which result in IMPLEMENTATION DEFINED behaviours. These bits are known as Page-Based Hardware Attributes (PBHA) and their opaque nature has resulted in them being disabled upstream.

        This session will include a quick reminder of the arm64 MMU, before introducing the concept of PBHA and outlining some possible use-cases in hardware along with the challenges in supporting them in Linux. The hope is both to attract additional use-cases from the audience, but also to discuss the scope of support that may be possible upstream.

        Speaker: Will Deacon
      • 07:30
        Break

        5 minute break

      • 175
        PCI Data Object Exchange (DOE), Component Measurement and Authentication (CMA) / SPDM 1.1 - Mediating access and related issues

        DOE (PCI ECN) provides a standard mailbox definition, so far used for query / response type protocols. There can be multiple instances of a DOE on each PCI function, and each instance can support multiple protocols. Currently we have published definitions of the Discovery, CMA, IDE (available from the PCI SIG) and CDAT protocols (available from UEFI forum). Some of these protocols are intended for Linux kernel access (e.g. CDAT), others are less clear but there are possible use cases (CMA, IDE).

        Patches to support DOE mailboxes in PCI extended config space have raised questions about how to ensure that these mailboxes, which may be of interest to various software entities (userspace / kernel / firmware / TEE etc), can be safely used.

        The DOE design does not easily allow for concurrent use by different software entities (even if possible, we cannot rely on other software elements doing this safely), so it seems some level of mediation is required. The topics for discussion include:

        1. Do we want to enable any direct userspace access to these mailboxes or should we address on a per protocol basis (if at all)?

        2. Do we need to 'prevent' userspace from being able to access these registers whilst the DOE is in use?

        3. How do we know the kernel should not touch a given mailbox (in use by other system software)? Perhaps a code first submission to ACPI to define a mediation mechanism? Is this sufficient for expected use cases? (What other suggestions do people have?)

        A very brief overview of DOE and proposed kernel support will be presented to make sure everyone is aware of the background - then straight into the discussion of the above questions.

        The PCI ECN defining CMA adds the ability (using a DOE mailbox) to establish the identity and verify the component configuration and firmware / executables.

        This is done using the protocols defined in the DMTF SPDM 1.1 specification: https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.1.1.pdf which is also used for the same purpose on other buses such as USB, but we are not aware of any work to support those buses yet. The design is extensible to other buses with an abstracted transport layer (via a single function pointer).

        The CMA use of the SPDM 1.1 protocol defines a certificate based public private key authentication mechanism including signed measurements of PCIe component state (firmware and other implementation defined elements) and setup of secure channels for continuing runtime measurement gathering and for other related PCI features such as Integrity and Data Encryption IDE.

        An initial implementation will be posted shortly for review, and there are a number of open questions that may benefit from a discussion in this forum:

        1. Is there a sufficiently strong case to support CMA natively in the kernel at all?
          Some approaches might push this facility into a trusted execution environment. However, VFs can implement CMA to provide this level of authentication and measurement when in use by a VM. It would be useful to understand other use cases, as they motivate the software design and testing.

        2. Approach to providing authentication of device certificates? SPDM uses x509 certificates and so relies on a chain of trust. What trust model should we apply? Current code assumes a separate keychain dedicated to CMA and root key insertion from userspace (probably initrd).

        3. Method of managing / verifying measurements. The nature of the measurements is implementation defined. In some cases they are not expected to change unless the firmware is flashed, but in others they may change with device configuration. Whilst closely related to the challenges of IMA for files, is it appropriate to reuse that subsystem and tooling?

        4. As it's related, is there interest in supporting kernel managed IDE (link encryption)?

        5. When do we actually want to make these measurements? (On boot, on driver probe, on reset, on first use of a particular feature, on demand from userspace etc?) Currently they are done on driver probe only.

        Other, more detailed questions can be addressed as part of normal discussion on list.

        References:
        https://lore.kernel.org/linux-pci/CAPcyv4i2ukD4ZQ_KfTaKXLyMakpSk=Y3_QJGV2P_PLHHVkPwFw@mail.gmail.com/
        https://lore.kernel.org/linux-pci/20210520092205.000044ee@Huawei.com/

        Speakers: Jonathan Cameron (Huawei Technologies R&D (UK)), Dan Williams (Intel Open Source Technology Center)
      • 08:20
        Break

        10 minutes break

      • 176
        Shared Virtual Addressing (SVA) for in-kernel users

        Sharing virtual addresses between DMA and the user process is undoubtedly beneficial. It improves security by limiting DMA to the process virtual address space; the programming model is simplified by eliminating the need for explicit map/unmap operations, with behind-the-scenes I/O page fault handling. Potential performance gains come after that.

        However, applying the same logic to kernel SVA is not without controversy. The DMA API is the de facto way of doing kernel DMA. It already provides portability and security by means of IOVA. The DMA API is IOMMU agnostic and does not support the IOMMU-specific key concept of a Process Address Space ID (PASID), which SVA relies on. IOVA is supported by the IOMMU with page tables separate from the CPU's.

        In order to support SVA, the IOMMU has to walk CPU page tables, which undermines security if we allow sharing the entire kernel virtual address (KVA) space. IOTLB flushing is also a gap, since mmu_notifier is not available for kernel memory.

        This proposed session explores the multiple candidates that can keep DMA API compatibility, make KVA usage safe, and address the gap in IOTLB synchronization.
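
        For contrast, here is a minimal sketch of the streaming DMA API flow referred
        to above, where the kernel hands the device an IOVA rather than sharing CPU
        page tables; the hardware-programming step is a placeholder.

        #include <linux/dma-mapping.h>
        #include <linux/slab.h>

        static int my_do_dma(struct device *dev, size_t len)
        {
            void *buf = kmalloc(len, GFP_KERNEL);
            dma_addr_t iova;

            if (!buf)
                return -ENOMEM;

            /* The device sees only this IOVA, mapped in the IOMMU's own tables. */
            iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(dev, iova)) {
                kfree(buf);
                return -ENOMEM;
            }

            /* my_hw_submit(dev, iova, len);  -- driver-specific device programming */

            dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
            kfree(buf);
            return 0;
        }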

        Speaker: Jacob Pan
      • 09:00
        Break

        5 minutes break

      • 177
        Status of Dynamic MSIx and IMS opens

        Current MSI-X allows one chance to allocate all required interrupt resources. The rework in progress introduces a new API to allow adding new interrupt resources on demand. We will run through some of the options. VFIO's MSI-X usage isn't correct today; we will give a quick review of the proposed VFIO changes for MSI and MSI-X to make sure there is proper feedback to VMs.
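
        As background, a sketch of today's one-shot allocation model is shown below;
        names other than the PCI/IRQ APIs are placeholders, and the on-demand API
        under rework is intentionally not shown.

        #include <linux/interrupt.h>
        #include <linux/pci.h>

        static irqreturn_t my_irq_handler(int irq, void *data)
        {
            return IRQ_HANDLED;
        }

        static int my_probe_irqs(struct pci_dev *pdev, void *drvdata)
        {
            int nvec, ret;

            /* One chance, at probe time: ask for up to 8 MSI-X vectors, accept 1. */
            nvec = pci_alloc_irq_vectors(pdev, 1, 8, PCI_IRQ_MSIX);
            if (nvec < 0)
                return nvec;

            /* Wire up vector 0; more vectors cannot be added later on demand. */
            ret = request_irq(pci_irq_vector(pdev, 0), my_irq_handler, 0,
                              "my_device", drvdata);
            if (ret)
                pci_free_irq_vectors(pdev);
            return ret;
        }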

        IMS has some unresolved open issues:
        - DSA format vs. device-specific format.
        - Support for IMS layout in system memory

        No slides planned since we need Thomas for this discussion. :-)

        Speakers: Ashok Raj, Megha Dey
      • 09:35
        Break

        10 minutes break

      • 178
        Unified I/O page table management for passthrough devices, in-kernel API discussion between IOMMU core and /dev/iommu

        When a device is passed through to user space, DMAs from this device are untrusted by the kernel. I/O page tables must be enabled in the IOMMU so each assigned device can only access the I/O virtual address space that is created by respective device passthrough frameworks (VFIO, vDPA, etc.).

        Until now, I/O page tables have been considered a device attribute, and thus managed through VFIO/vDPA-specific uAPIs. However, this model doesn't scale toward advanced I/O virtualization usages, e.g. subdevice passthrough which requires more than one I/O page table per device, SVA virtualization which needs to support user-provisioned I/O page tables (nested on a kernel page table), and I/O page faults which are necessary for improved memory utilization, etc. It is better to avoid reinventing the wheel in every framework.

        Having a unified uAPI is the answer here. The proposal generalizes I/O page table management via a new interface (/dev/iommu), while allowing passthrough frameworks to connect their devices with selected I/O page tables via a simple protocol. This approach allows VFIO/vDPA to focus on aspects of device management, leaving DMA isolation enforced through the generic interface. This talk aims to get consensus on the overall design choices and execution plan across multiple subsystems.

        As we have reached a consensus on the /dev/iommu proposal (https://lore.kernel.org/linux-iommu/MWHPR11MB1886422D4839B372C6AB245F8C239@MWHPR11MB1886.namprd11.prod.outlook.com/), it's time to have some discussions on the in-kernel APIs between the IOMMU core and the /dev/iommu implementation. This discussion can provide some guidance for the developers who are going to implement /dev/iommu.

        Speakers: Kevin Tian (Intel), Baolu Lu
      • 10:30
        Break

        5 minutes break

      • 179
        Brainstorm some of the feature support in Linux for PCIe

        Certain PCIe features aren't handled well in Linux; for instance, hotplug doesn't seem to care about MRL status. There are implications for other features such as the following:

        • MPS/MRRS
        • 10b, 14b tag support

        Both need to be enabled along the entire path from the root port to the device. If a new device is hotplugged, how are MPS and 10b tags enabled throughout the path?

        Linux lacks support for Flattening Portal Bridge (FPB), which would improve the ability to manage resources in a more structured way.

        Speaker: Ashok Raj
    • Android MC: Android reprise GNU Tools track/Virtual-Room (LPC Virtual)

      GNU Tools track/Virtual-Room

      LPC Virtual

      150

      The Android microconference focuses on cooperation between the Android and Linux communities.

      Convener: Karim Yaghmour (Opersys inc.)
      • 180
        Android BoF Intro GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150

        Come join and further discuss what was talked about at the earlier Android Microconference session. This provides space for folks who couldn't attend due to conflicts, as well as for longer discussions that wouldn't fit into the earlier microconference session. We'll have a few topics scheduled, but also leave open some space for folks to propose their own items.

        Speakers: John Stultz (Linaro), Karim Yaghmour (Opersys inc.)
      • 181
        uclamp cgroup usage challenges GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150

        Continued discussion from the Android Microconference.

      • 182
        thermal core usage challenges GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150

        Continued discussion from the Android Microconference.

      • 07:45
        15 min Break GNU Tools track/Virtual-Room (LPC Virtual)

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
      • 183
        Open topic #1 GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
      • 184
        Open topic #2 GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
      • 09:25
        15 min Break GNU Tools track/Virtual-Room (LPC Virtual)

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
      • 186
        Open topic #3 GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
      • 187
        AOSP devboard collaboration GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150

        Continued discussion from microconference session

      • 188
        Speculative page faults GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150

        Continued discussion from the Android Microconference.

      • 189
        Open Discussion GNU Tools track/Virtual-Room

        GNU Tools track/Virtual-Room

        LPC Virtual

        150
    • BOFs Session BOF1/Virtual-Room (LPC Virtual)

      BOF1/Virtual-Room

      LPC Virtual

      150

      Birds of a Feather

      • 190
        RCU Implementation BOF BOF1/Virtual-Room

        BOF1/Virtual-Room

        LPC Virtual

        150
        Speaker: Paul McKenney (Facebook)
      • 191
        VMA life cycle and MM locking BOF1/Virtual-Room

        BOF1/Virtual-Room

        LPC Virtual

        150

        This is to discuss the idea of limiting VMAs to growing, reference counting, and how locking could be handled for RCU-safe VMA lookups.

        Speaker: Liam Howlett (Oracle)
      • 192
        Direct map management BOF1/Virtual-Room

        BOF1/Virtual-Room

        LPC Virtual

        150

        This BoF is to discuss pros and cons of various approaches to avoid
        performance issues resulting from excess modifications of the direct
        map, what APIs should these approaches provide and what is the best way
        to integrate them with the existing allocators.

        Speakers: Mike Rapoport (IBM), Vlastimil Babka (SUSE), Rick Edgecombe (Intel)
      • 193
        RISC-V platform specification BOF1/Virtual-Room

        BOF1/Virtual-Room

        LPC Virtual

        150

        Let's continue the platform specification discussion in the BoF.
        Some of the things that need further discussion:

        • PCT
        • Mandating Compatibility and branding of the RISC-V platforms
        • Do we mark various combinations as deprecated or not?

        Speaker: ATISH PATRA (Western Digital)
      • 194
        User interrupts BOF BOF1/Virtual-Room

        BOF1/Virtual-Room

        LPC Virtual

        150

        Further discussion on the proposed user interrupts feature

        Speaker: Sohil Mehta
    • BPF & Networking Summit Networking and BPF Summit/Virtual-Room (LPC Virtual)

      Networking and BPF Summit/Virtual-Room

      LPC Virtual

      150

      The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.

      This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.

      • 195
        Towards truly portable eBPF

        As eBPF is getting more popular and mainstream, one of the challenges of making it accessible to more users is how to distribute eBPF-powered applications. Unlike simpler applications, which involve shipping a binary or a container image, with eBPF we usually need to compile the program for the target kernel. This is a hurdle in adoption by both users and vendors. The CO-RE (Compile Once - Run Everywhere) initiative improved this by introducing a way to ship a compiled artifact which will work on any supporting distribution. But what is a supporting distribution, and what about unsupported distributions? How can we make eBPF CO-RE widely usable in the real world of enterprise users? In this talk we will answer these questions by introducing CO-RE and BTF mechanics, and how to leverage them in a concrete scenario in our project Tracee.
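
        As a flavour of what CO-RE looks like in practice, below is a minimal sketch
        of a BPF program using libbpf's BPF_CORE_READ(), which records field accesses
        as relocations that are resolved against the running kernel's BTF at load
        time. It assumes the usual libbpf workflow with a bpftool-generated vmlinux.h.

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        #include <bpf/bpf_core_read.h>

        char LICENSE[] SEC("license") = "GPL";

        SEC("tp_btf/sched_process_exec")
        int BPF_PROG(handle_exec, struct task_struct *p, pid_t old_pid,
                     struct linux_binprm *bprm)
        {
            /* Field offsets below are relocated per kernel, not fixed at build time. */
            pid_t pid  = BPF_CORE_READ(p, tgid);
            pid_t ppid = BPF_CORE_READ(p, real_parent, tgid);

            bpf_printk("exec: pid=%d ppid=%d", pid, ppid);
            return 0;
        }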

        Speakers: Itay Shakury (Aqua Security), Rafael David Tinoco (Aqua Security)
      • 196
        Automatically optimizing BPF programs using program synthesis

        This talk will present K2, an optimizing compiler that uses program synthesis to automatically produce safe, compact, and more performant BPF bytecode. K2 compresses BPF bytecode by 6-26%, improves throughput by 0–4.75%, and reduces average latency by 1.36–55.03% across benchmarks from Cilium, Facebook Katran, hXDP, and the Linux kernel. We designed several domain-specific techniques to make synthesis practical by accelerating equivalence-checking of BPF programs by 6 orders of magnitude.

        The talk will consist of the following parts:

        1. A discussion of the challenges in designing optimizing compilers for BPF
        2. A description of why and how to utilize program synthesis to find performant BPF bytecode which can pass the kernel checker
        3. Techniques for fast equivalence and safety checking
        4. Optimizations discovered by K2 for realistic benchmarks
        5. Limitation of K2 and future work
        6. A discussion of how we think K2 might benefit the community, seeking feedback to improve, more benchmarks, and opportunities to work together

        You may find more information including K2’s source code, the full technical paper on K2, and responses to some FAQs at https://k2.cs.rutgers.edu

        Speakers: Qiongwen Xu (Rutgers University ), Michael Wong (Princeton University), Tanvi Wagle (Rutgers University ), Srinivas Narayana (Rutgers University ), Anirudh Sivaraman (New York University)
      • 197
        BPF security auditing at Google

        We’ll discuss some recent and ongoing work we’ve been doing to audit Google’s Linux systems with eBPF. We’ll look at a case study of the problems we’ve solved for logging process lifecycles, and then look at the challenges we’re facing to make these systems as reliable and maintainable as possible. The topics we’ll cover include:

        • A brief overview of the BPF LSM
        • Why and how we ended up adding atomics to eBPF
        • Why we implemented task-local BPF storage
        • How we push large data blobs through the BPF ringbuffer (and how we’d like to improve it)
        • Why we wish we didn’t have to attach to so many fexit hooks (and what we’d like to do about it)
        Speakers: Brendan Jackman (Google), KP Singh (Google)
      • 198
        Translating IPv4 to IPv6 Without NAT

        Although an IPv6 only environment is ideal, the path to migration from an IPv4 environment is gradual and will present situations where an IPv6 client will need ongoing connectivity to an IPv4-only server. Such a communication path will need to use one of the existing IPv6 to IPv4 transition mechanisms (such as NAT or a dual IPv4 + IPv6 stack).

        We will demonstrate a novel approach to this migration that uses a unique transition mechanism utilizing the new SECCOMP_IOCTL_NOTIF_ADDFD ioctl added to seccomp's user-space notification interface, to intercept egress connect calls and opportunistically use a transition IPv4 address when possible, saving applications the pain of dealing with the end host not being reachable while still living in an IPv6-only environment. Once applied at the beginning of connection establishment, the data path proceeds uninterrupted between the client and the server, distinguishing this approach from many other transition/translation mechanisms.

        We will also share a performance analysis of this approach, limitations of what we can do with seccomp(), and future work using this mechanism.

        Speakers: Kyle Anderson (Netflix), Keerti Lakshminarayan (Netflix), Alok Tiagi (Netflix)
      • 199
        Untangling DSCP, TOS and ECN bits in the kernel

        In Linux, the IPv4 code generally uses IPTOS_TOS_MASK (0x1e) when
        handling the TOS (Type of Service) of IPv4. This mask follows the
        definition of RFC 1349:

           0     1     2     3     4     5     6     7
        +-----+-----+-----+-----+-----+-----+-----+-----+
        |                 |                       |     |
        |   PRECEDENCE    |          TOS          | MBZ |
        |                 |                       |     |
        +-----+-----+-----+-----+-----+-----+-----+-----+
        

        However RFC 1349 is only one of several contradicting RFCs that
        try to define how to interpret the IPv4 TOS. In the end, the IETF
        settled on the DSCP+ECN interpretation (RFC 2474 and RFC 3168):

           0     1     2     3     4     5     6     7
        +-----+-----+-----+-----+-----+-----+-----+-----+
        |                                   |           |
        |                DSCP               |    ECN    |
        |                                   |           |
        +-----+-----+-----+-----+-----+-----+-----+-----+
        

        That was 20 years ago, so the layout is finally stable. But as the
        diagrams show, RFC 1349 is incompatible with ECN as it already uses
        bit 6 in its TOS field.

        Therefore, the IPv4 code also uses another mask, IPTOS_RT_MASK (0x1c),
        to clear bit 6. This mask is used almost every time the kernel does an
        IPv4 route lookup.

        Finally, RFC 2474 and RFC 3168 (DSCP+ECN) also cover IPv6. However, the
        IPv6 code generally doesn't mask the ECN bits and considers them as
        part of the TOS for policy routing.
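
        To make the two layouts concrete, the small user-space illustration below
        decomposes one TOS byte both ways; the mask values mirror the kernel's
        IPTOS_TOS_MASK (0x1e) and IPTOS_RT_MASK (0x1c) and are defined locally here
        purely for illustration.

        #include <stdio.h>

        #define TOS_MASK_1349 0x1e   /* RFC 1349 "TOS" subfield (IPTOS_TOS_MASK) */
        #define TOS_MASK_RT   0x1c   /* same, minus the bit that ECN reuses      */
        #define ECN_MASK      0x03   /* RFC 3168 ECN field                       */

        int main(void)
        {
            unsigned char tos = (46 << 2) | 3;   /* DSCP EF (46) with ECN CE (3) */

            printf("raw tos        : 0x%02x\n", tos);
            printf("rfc1349 tos    : 0x%02x\n", tos & TOS_MASK_1349);
            printf("route lookup   : 0x%02x\n", tos & TOS_MASK_RT);
            printf("dscp (rfc2474) : %u\n", tos >> 2);
            printf("ecn  (rfc3168) : %u\n", tos & ECN_MASK);
            return 0;
        }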

        This situation creates several problems:

        • Regressions brought by patches "fixing" places where IPTOS_TOS_MASK
          wasn't applied (thus breaking users that used bits 0-2).

        • IPTOS_TOS_MASK is spreading to IPv6 (through RT_TOS()), where it
          doesn't make sense at all (IPv6 has never used the RFC 1349
          layout).

        • In some edge cases, IPv4 route lookups are done without masking the
          ECN bits (thus giving different results depending on the ECN mark).
          New cases are introduced every now and then.

        • IPv4 and IPv6 inconsistency.

        • Impossibility to use the full DSCP range in IPv4.

        • Policy-routing can break ECN with IPv6 and in some IPv4 edge cases.

        • Parts of the stack define their own mask to respect the DSCP+ECN
          layout, but without making it reusable.

        The objective of this talk is to bring practical examples of
        user-visible inconsistencies and to discuss different ways forward for
        minimising them and avoiding more ECN regressions in the future.

        It will be oriented towards the following goals (by decreasing order of
        perceived feasibility):

        • Remove all uses of IPTOS_TOS_MASK for IPv6.

        • Prevent IPv4 policy routing from breaking ECN.

        • Remove IPTOS_TOS_MASK entirely from the kernel, so people don't
          mistakenly copy/paste such code (but keep the definition in
          include/uapi of course).

        • Allow full DSCP range in IPv4.

        • Prevent IPv6 policy routing from breaking ECN.

        • Prevent breaking ECN again in the future (for example by defining a
          new type for storing TOS values, so that Sparse could warn about
          invalid use cases).

        • Make TOS and ECN handling consistent between IPv4 and IPv6
          (somewhat implied by the previous bullet points).

        The main road blocks are code churn and drawing the line between bugs
        and established behaviours.

        Speaker: Guillaume Nault (Red Hat)
    • Diversity, Equity & Inclusion MC Microconference4/Virtual-Room (LPC Virtual)

      Microconference4/Virtual-Room

      LPC Virtual

      150

      Creating diverse communities requires effort and commitment to creating inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in Linux 5.8 release. Understanding if this sort of change has been effective is a topic of active research. This MC will take a pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with the perspectives from the Linux community members.

      • 200
        Diversity, Equity, Inclusion MC: Welcome and Introduction
        Speakers: Kate Stewart (Linux Foundation), Shuah Khan
      • 201
        Diversity, Equity, & Inclusion in Open Source Communities: Key Themes & Preliminary Results from LF's 2021 Research

        Equity and inclusion in Tech is not just about diversity in hiring, but has profound implications for downstream accessibility, user experience, and the next generation of products. Open source has unique challenges and opportunities to advance DEI and drive more inclusive innovation. To better understand these dynamics and the key resources and solutions needed, The Linux Foundation is conducting research across the entire open source ecosystem. This talk will share some of the key themes and preliminary results from this ongoing effort, followed by an interactive discussion with the kernel community.

        Speaker: Jessica Groopman (Kaleido Insights)
      • 202
        Women of Open Source Software: Motivations and Experiences

        In this interactive session we will discuss why women join OSS, why they stay in OSS and what are their experiences of contributing to OSS. Based on empirical evidence from surveys and interviews, we will brainstorm strategies for welcoming more women to OSS and for improving retention of women in OSS.

        Speaker: Dr Vandana Singh (iSchool at University of Tennessee – Knoxville)
      • 203
        Linux Developers: Motivations & Challenges (preliminary insights)

        Participation of women in Open Source Software (OSS) is very unbalanced, despite various efforts to improve diversity. This is concerning not only because women do not get the chance of career and skill development afforded by OSS, but also because OSS projects suffer from a lack of diversity of thought due to the lack of diversity among their contributors. Researchers have been trying to understand the low representation rate of women in OSS, as well as to learn more about their motivations, challenges, biases and the strategies that can be adopted to attract and retain this underrepresented population. The Linux kernel community is also investigating those factors to create strategies to increase women’s participation.

        Speaker: Bianca Trinkenreich (Northern Arizona University)
      • 08:40
        Break
      • 204
        Mentoring at scale: Acknowledging Implicit Mentoring

        Mentoring is crucial for knowledge transfer in open source. But traditional dyadic mentoring formats between an expert and novice are hard to scale. In this talk, I will present the different types of mentoring in open source and focus on implicit mentoring---mentoring taking place in everyday development activities like code-reviews. I will show how implicit mentoring can be automatically identified, how widespread it is, who participates, and how to achieve mentoring at scale by building an appreciative project culture.

        Speaker: Dr Anita Sarma (Oregon State University)
      • 206
        Community Diversity + Events: Impacts and Trends

        Diversity and events have a unique relationship. Events can serve to highlight the lack of diversity in the community, but at the same time create a lot of opportunities to increase diversity as well. In this session, we'll review diversity trends across a decade of events, types of diversity we measure and how those have evolved, efforts to increase diversity and what, if any, impact those have had. We'll also discuss how COVID has shifted this relationship as events pivoted to virtual. Did diversity increase or decrease? As we move back to a version of normal and a return to in-person events, how will this impact diversity + events?

        Speaker: Angela Brown (Linux Foundation)
      • 207
        Wrap up & Next Steps
        Speaker: Shuah Khan (The Linux Foundation)
    • GPU/media/AI buffer management and interop MC Microconference2/Virtual-Room (LPC Virtual)

      Microconference2/Virtual-Room

      LPC Virtual

      150

      The GPU/media/AI buffer management and interop microconference focuses on Linux kernel support for new graphics hardware that is coming out in the near future. Most vendors are also moving to firmware control of job scheduling, additionally complicating the DRM subsystem's model of open user space for all drivers and API. This has been a lively topic with neural-network accelerators in particular, which were accepted into an alternate subsystem to avoid the open-user space requirement, something which was later regretted.

      As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community. Much of the graphics-specific integration will be discussed at XDC the prior week, but particularly with cgroup integration of memory and job scheduling being a topic, plus the already-complicated integration into the memory-management subsystem, input from core kernel developers would be much appreciated.

      • 208
        GPU/media/AI buffer management and interop Housekeeping

        Quick 5-minute introduction:

        Rules of engagement
        General logistics
        Note-taking strategy
        Where to chat/interact
        Other items

        Speaker: Daniel Stone (Collabora)
      • 209
        dma-fence deadline and priority boosting

        In order to meet our fixed frame deadlines (e.g. vertical refresh) whilst still having low power usage, we need to keep our power management policies balanced between performance bursts and deeper sleeps. Between dma-fence being used to declare synchronisation dependencies between multiple requests, and additional hints (e.g. input events suggesting that GPU activity will happen 'soon') we can insert clock boosts to try to head off issues before they happen. Full-system tracing with e.g. Perfetto will also be discussed to get a better picture of the system's behaviour as a whole.

        Speaker: Rob Clark (Google)
      • 210
        Presentation timing deep dive

        Supporting predictable presentation timing for graphics and media usecases requires a great deal of plumbing through the stack, right up to userspace. Whilst some higher-level APIs have been discussed, there are a number of open questions including how to handle VRR, and how to support this with mailbox-type systems like KMS and Wayland. Outline the current state and wants from all the different angles, and discuss how we could come up with lower-level primitives which allow these systems to be built.

        Speaker: Daniel Stone (Collabora)
      • 08:25
        Break
      • 211
        Documenting the Heterogeneous Memory Model Architecture

        HMM (heterogeneous memory management) was first merged in the Linux kernel in 2017 and has since been adopted by several device drivers. As it integrates the device drivers more closely with the core kernel's virtual memory management, more kernel subsystems are starting to get involved in related code reviews and take notice, e.g. file systems and page cache. As a consequence, we need to consider and document the interactions of ZONE_DEVICE pages and HMM migration semantics with those subsystems. This meeting is to establish the basis for architectural documentation of use to related kernel subsystems such as filesystem and networking.

        Speakers: Daniel Phillips (AMD), Daniel Vetter (Intel)
      • 10:00
        Break
      • 212
        Userspace synchronisation for asynchronous hardware engines

        Both future hardware and user-visible APIs are demanding that we discard our previous fence-based synchronisation model and allow arbitrary synchronisation primitives similar to Windows/DirectX 'timeline semaphores'. Outline the problems in trying to integrate this with our previous predictable fence-based model with dma_fence and dma_resv, and discuss some potential paths and solutions.

        Speaker: Jason Ekstrand (Intel)
    • IoThree's Company MC Microconference3/Virtual-Room (LPC Virtual)

      Microconference3/Virtual-Room

      LPC Virtual

      150

      The IoThree's Company microconference is moving into its third year at Plumbers. Talks cover everything from the real-time operating systems in wireless microcontrollers, to products and protocols, to Linux kernel and tooling integration, userspace, and then all the way up to backing cloud services. The common ground we all share is an interest in improving the developer experience within the Linux ecosystem.

      • 213
        IoThree's Company

        Come and knock on our door!

        The Internet of Things Microconference is in its third year at Plumbers. Talks cover everything from the real-time operating systems in wireless microcontrollers, to products and protocols, to Linux kernel and tooling integration, userspace, and then all the way up to backing cloud services. The common ground we all share is an interest in improving the developer experience within the Linux ecosystem.

        In this introduction, we give a brief overview of the presenters and set the stage for the remainder of the MC.

        Speakers: Christopher Friedt (Friedt Professional Engineering Services), Jason Kridner (Texas Instruments and BeagleBoard.org Foundation), Drew Fustini (BeagleBoard.org Foundation)
      • 214
        Overview of LoRa & LoRaWAN support in Zephyr

        Zephyr RTOS, the fast-growing, scalable, open source RTOS for resource-constrained devices, recently gained support for LoRa and LoRaWAN technologies. The addition of LoRa technologies enabled Zephyr to be used in applications where long-range coverage is needed. With LoRa/LoRaWAN support in place, Zephyr is emerging as the preferred software stack for LoRa end nodes, while Linux continues to be the de-facto software stack for LoRa gateways.

        In this discussion, the current status of LoRa and LoRaWAN support in Zephyr will be explored, and we will discuss how to add persistent storage support for storing parameters such as keys and devnonce to non-volatile memory. We will also touch on the community's ongoing work towards adding LoRaWAN support to the Linux kernel.
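
        As a flavour of the end-node side, below is a hedged sketch loosely modeled
        on Zephyr's class_a LoRaWAN sample: an OTAA join followed by a single uplink.
        The key material is a placeholder, and the exact header path, API and field
        names can vary between Zephyr versions.

        #include <zephyr.h>
        #include <lorawan/lorawan.h>

        static uint8_t dev_eui[8];     /* placeholder keys: fill in real values,  */
        static uint8_t join_eui[8];    /* ideally loaded from persistent storage, */
        static uint8_t app_key[16];    /* which is what this discussion is about  */

        void main(void)
        {
            struct lorawan_join_config join_cfg = {
                .mode = LORAWAN_ACT_OTAA,
                .dev_eui = dev_eui,
                .otaa = {
                    .join_eui = join_eui,
                    .app_key = app_key,
                },
            };
            uint8_t payload[] = "ping";

            if (lorawan_start() < 0)
                return;
            if (lorawan_join(&join_cfg) < 0)
                return;

            lorawan_send(2, payload, sizeof(payload), LORAWAN_MSG_CONFIRMED);
        }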

        Speaker: Manivannan Sadhasivam
      • 215
        mikroBUS Driver for Add-on Boards

        mikroBUS is an add-on board socket standard by MikroElektronika that can be freely used by anyone following the guidelines. The mikroBUS standard includes SPI, I2C, UART, PWM, ADC, GPIO and power (3.3V and 5V) connections to interface common embedded peripherals. There are more than 800 add-on boards conforming to the mikroBUS standard, ranging from wireless connectivity boards to human-machine interface sensors, of which more than 140 already have device driver support in the Linux kernel. Today, the most straightforward method for loading these device drivers is to provide device-tree overlay fragments at boot time, which requires maintaining a huge out-of-tree repository of device tree fragments for each add-on board, for each supported socket, for each target; moreover, device tree currently does not support instantiating devices on dynamically created greybus peripherals.

        The mikroBUS driver is introduced in the kernel to solve this problem by enabling mikroBUS as a probeable bus, such that the kernel can discover the device(s) on the bus at probe time. This is done by storing the add-on board's device driver-specific information in non-volatile storage accessible over 1-Wire on the mikroBUS port. The format for describing the device driver-specific information is an extension to the Greybus manifest. In addition to physical mikroBUS ports on a target, the driver also supports instantiation of devices on remote mikroBUS port(s) on a micro-controller which is visible to the host as a set of greybus peripherals. The choice of the greybus manifest for device description makes sure that only one kind of device description is required, independent of the way in which the device is connected to the host. The mikroBUS driver does not have any strict associations to the pin mapping of the port, and the same framework can be reused for other similar add-on board standards such as FeatherWing, PMOD, Grove or Qwiic. With more than 140 add-on boards having tested support today, the mikroBUS driver helps to reduce the time to develop and debug various add-on boards, and support for greybus enables rapid prototyping and deployment of remote systems.

        Speaker: Mr Vaishnav M A (Beagleboard.org)
      • 08:25
        Break
      • 216
        IoT Gateway Blueprint with Thread and Matter

        This talk will cover the ideas and implementations for an IoT gateway blueprint
        based on Linux and built with Yocto.

        Thread-related technical topics will include OpenThread for connectivity between
        the Linux-based gateway and Zephyr-based nodes, Matter (formerly CHIP) for
        application-layer profiles and device types, and an OTA service to assist
        low-resource IoT devices with firmware upgrades. Furthermore, we will discuss
        additional network services for native IPv6 as well as NAT64 connectivity.

        Speaker: Mr Stefan Schmidt
      • 217
        Apps not boilerplate, leveraging Android's CHRE and Zephyr

        Bringing up an IoT device or embedded controller (EC) is a very involved and complex process. Many different specialties are involved in creating the hardware as well as the software that powers it. While many of these costs are unavoidable, they have been mitigated in other disciplines of software development: mobile, web, and server. In all three, several frameworks exist to abstract away the underlying hardware and even the runtime constraints. This presentation will (a minimal nanoapp skeleton is sketched after the list):

        • Explore some of the pain points of bringing up new devices, with a strong focus on my personal experience with sensors in Chromium's EC.
        • Provide some insight into Android's Context Hub Runtime Environment (CHRE), which:
        • Separates feature development into highly testable nanoapps which handle events from the main CHRE event loop in a pub-sub fashion.
        • Provides a common event-bus-like system which routes events generated by Platform Abstraction Layer (PAL) frameworks to the various nanoapps.
        • Provides a modular system for peripheral frameworks such as sensors, WiFi, GNSS, etc. This system is not without its faults: it has to be implemented for each platform as a PAL.
        • Discuss Zephyr's use of devicetree and hardware abstraction APIs (specifically in the context of sensors, along with an example).
        • Provide an example of using Zephyr along with CHRE, the synergistic effects of the two, and how the two can be used to mitigate many of the above costs/pitfalls, improve time to market, and ease the overhead of testing.
        • Issue: Adding CHRE as a Zephyr module
        • Issue: Creating the CHRE-compatible sensor framework
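
        The minimal nanoapp skeleton mentioned above. The three entry points follow the public CHRE nanoapp contract, and the logging calls come from the CHRE API; everything the app actually does with events is illustrative only.

          /* Hedged sketch of a CHRE nanoapp: all feature logic lives in these
           * entry points, driven by events published on the CHRE event loop. */
          #include <stdbool.h>
          #include <stdint.h>
          #include <chre.h>

          bool nanoappStart(void)
          {
                  /* One-time setup: subscribe to sensors, timers, host messages. */
                  chreLog(CHRE_LOG_INFO, "nanoapp started");
                  return true;
          }

          void nanoappHandleEvent(uint32_t senderInstanceId,
                                  uint16_t eventType, const void *eventData)
          {
                  /* Events routed from the PALs arrive here in a pub-sub fashion;
                   * a real nanoapp would dispatch on eventType. */
                  (void)senderInstanceId;
                  (void)eventData;
                  chreLog(CHRE_LOG_DEBUG, "event 0x%x", (unsigned int)eventType);
          }

          void nanoappEnd(void)
          {
                  /* Teardown: release any subscriptions and resources. */
          }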
        Speaker: Yuval Peress (Google)
      • 10:00
        Break
      • 218
        Embedded Linux & RTOSes: why not both?

        One of the first questions you need to answer when embarking on an IoT project is whether to use a Linux-based platform or an RTOS like Zephyr. Is one better than the other? In this talk we'll explore the strengths of each approach, what Linux can learn from RTOSes (and vice versa), and even examples where an IoT device uses both Linux and an RTOS.

        Speaker: Jonathan Beri
    • LPC Refereed Track Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      • 220
        Bootconfig and kernel cmdline

        In recent kernels, Extra Boot Configuration (bootconfig) is available to pass kernel boot parameters in a structured key-value form instead of a single-line command line. The parameters passed via bootconfig are simply merged into the kernel command line string (cmdline). Thus kernel modules/subsystems can continue using the kernel cmdline APIs, but cannot use the bootconfig APIs for parameters given on the cmdline.
        The bootconfig API gives kernel modules a clearly different programming model for parameter parsing. The module_param API is passive: its main use case is a callback that handles a fixed parameter. The bootconfig API, on the other hand, is active: modules query the parameters they care about from the bootconfig tree in their preferred order, and parameter names can be expanded dynamically. If both APIs are available to kernel modules/subsystems, users can specify more complex configurations, not just set parameter values.
        This session will explain what bootconfig is and how it relates to the cmdline, and will discuss the issues involved in unifying cmdline and bootconfig at the API level.
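
        To make the two models concrete, a hedged sketch of what they look like from a module's point of view. module_param() is the long-standing passive API; the xbc_find_value() query is based on the kernel's bootconfig helpers, but its exact availability to modules (and the lifetime of the bootconfig data) is precisely the kind of API question this session is about, so treat that part as an assumption. The module name "example_mod" and its keys are made up.

          #include <linux/module.h>
          #include <linux/init.h>
          #include <linux/bootconfig.h>

          /* Passive model: declare a fixed parameter and let the core parse
           * "example_mod.threshold=..." from the command line for us. */
          static int threshold = 10;
          module_param(threshold, int, 0644);
          MODULE_PARM_DESC(threshold, "example threshold");

          static int __init example_mod_init(void)
          {
                  /* Active model (hedged): query the bootconfig tree for a
                   * hierarchical key we are interested in, in whatever order
                   * we prefer. */
                  const char *mode = xbc_find_value("example_mod.feature.mode", NULL);

                  if (mode)
                          pr_info("feature mode from bootconfig: %s\n", mode);
                  return 0;
          }
          module_init(example_mod_init);

          MODULE_LICENSE("GPL");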

        Speaker: Masami Hiramatsu (Linaro Ltd.)
      • 221
        Guider: Linux Tracing using Python

        There are various types of user-space tracing programs these days. But they are highly varied: each one has to be installed separately and can be difficult for a beginner to use.

        This talk introduces Guider, which provides a range of powerful Linux tracing features using ftrace, ptrace, and procfs.
        Its repository is https://github.com/iipeace/guider.

        Speaker: Peace Lee
      • 222
        Measuring Code Review in the Linux Kernel Development Process

        In this presentation, we show some work on Measuring Code Review in the Linux Kernel Development Process.

        We investigated the following research questions:

        • Does the number of responses increase as the patch developer is more experienced?
        • Do maintainers get fewer or more responses than others, when they author a patch?
        • Do patch developers who have previously been active in some areas of the kernel get more responses than developers who have been active in other areas?

        We also investigated various characteristics of the patches themselves, such as files, sections
        and mailing lists, through the following questions:

        • Does the number of responses increase or decrease with the number of files a patch proposes to change?
        • Does the number of responses increase or decrease with the number of maintainer sections to which changed files belong?
        • Does a patch get more responses if it is submitted to more mailing lists?
        • Do some mailing lists or maintainer sections lead to larger numbers of responses than others?

        As 7.94% of the response traffic is classified as being authored by bots, we also considered where bots are active.

        We will present some interesting insights we gained in this research and the diverse set of variables which define the review process. This presentation summarizes the results of a master's thesis completed in spring 2021.

        Speakers: Ms Başak Erdamar, Lukas Bulwahn (BMW AG)
    • Toolchains and Kernel MC Microconference1/Virtual-Room (LPC Virtual)

      Microconference1/Virtual-Room

      LPC Virtual

      150

      The Toolchains and Kernel microconference focuses on topics of interest related to building the Linux kernel. The goal is to get kernel developers and toolchain developers together to discuss outstanding or upcoming issues, feature requests, and further collaboration.

      • 223
        Toolchains and Kernel MC Welcome

        This is a quick intro to the MC.

        The Toolchains and Kernel micro conference focuses on topics of interest related to building the Linux kernel. The goal is to get kernel developers and toolchain developers together to discuss outstanding or upcoming issues, feature requests, and further collaboration.

        Suggested Topics:

        • Continuous Integration
        • Toolchain Feature Requests
        • Rust support
        • Outstanding/painful toolchain bugs
        • Control Flow Integrity
        • Syscall wrapping in glibc.
        • Security features in the toolchains

        Achievements since last year’s LPC:

        • linux-toolchains mailing list and archive created.
        • Rust-for-Linux GitHub org established. Patches moved from out-of-tree module building to in-tree module building.
        • CI for kernel builds with LLVM moved to tuxbuild after an unexpected “no more free lunch” from TravisCI.
        • LTO support landed in mainline.
        • PGO patches sent upstream.
        • At least one bugfix sent for an issue found via clang-tidy/clang-analyzer; discussions around driving tree-wide cleanups via clang-tidy.
        • GCC implemented support for asm goto with outputs.
        • Support for auto-initialized automatics is being worked on in GCC upstream. This is one of the security features that were deemed desirable by the kernel last year. Work on the other missing desired security features is in progress.

        Possible Topics/Attendees:

        • Upstreaming Rust Support - (Miguel Ojeda, Wedson Almeida Filho, Greg Kroah-Hartman, Michael Ellerman, Josh Triplett, Alex Gaynor, Geoffrey Thomas, Sami Tolvanen)
        • Using Clang's locking annotations - (Jann Horn, Kees Cook)
        • Memory ordering progress in the C/C++ standards committees - (Paul McKenney, Will Deacon, Peter Zijlstra)
        • Toolchain security feature requests - (Kees Cook)
        • Post Link Optimization of the kernel with Binary Optimization and Layout Tool (BOLT) - (Maksim Panchenko)
        • Objtool on arm64 - (Josh Poimboeuf, Peter Zijlstra, Will Deacon, Bill Wendling)
        • DWARF, CTF and BTF (Indu Bhagat, Mark Wielaard, Dodji Seketeli)
        • BPF/BTF/CORE support in the GNU Toolchain (Jose E. Marchesi, David Faust, Weimin Pan)
        • Using BTF for ABI analysis (Matthias Maennich, Giuliano Procida)
        Speakers: Jose E. Marchesi (GNU Project, Oracle Inc.), Nick Desaulniers (Google)
      • 224
        The Rust toolchain in the kernel

        The Rust for Linux project is adding support for the Rust language to the Linux kernel. If the project is successful, and many drivers start to be written in Rust, then the Rust compiler and associated tools will become a key part of the kernel toolchain.

        This raises many questions which we will try to answer and/or discuss with others:

        • Which particular Rust toolchain (channels, versions, etc.) is needed for the kernel? What is RUSTC_BOOTSTRAP and why do we need it?
        • Which components are required to build the kernel?
        • Which parts of the standard library are required? Do they need to be compiled in a particular way?
        • Which version of LLVM does rustc require?
        • What other tooling is required to compile the kernel, e.g. bindgen?
        • What tooling is required to build the documentation?
        • How should Linux distributions distribute this Rust toolchain? For example, should it be separate from the main Rust packages they may otherwise have?
        • Should we provide pre-compiled Rust toolchains from kernel.org?
        • Which architectures are supported so far by LLVM? Which ones may be supported soon?
        • Is it possible to have GCC-built kernels with Rust support? To what degree is it supported?
        • What are the alternative Rust compilers and how advanced are they? E.g. gcc-rs (the new GCC frontend for Rust), rustc_codegen_gcc (the new rustc backend for GCC) and mrustc (the bootstrapping compiler).
        Speaker: Miguel Ojeda
      • 225
        objtool on arm64

        objtool is heavily used on x86, but isn't currently supported upstream on arm64.

        In order to avoid depending on objtool to enable kernel features on arm64, and also to avoid disabling compiler optimisations along the lines of https://git.kernel.org/linus/3193c0836f20 when objtool cannot reconstruct the control flow, how much of objtool's functionality is actually required on arm64, and how much of that could be implemented directly by the toolchain instead?

        From:

        https://lore.kernel.org/r/YKO/di4h3XGjqu68@hirez.programming.kicks-ass.net

        some objtool features on x86 are:

        • validate stack frames

        • generate ORC unwind data (optional)

        • validates unreachable instructions; specifically the lack thereof
          (optional)

        • validates retpoline; or specifically the lack of indirect jump/call
          sites (with annotations for those few that are okay). (optional)

        • validates uaccess rules; specifically no call/ret in between
          __user_access_begin() and __user_access_end(). (optional)

        • validates noinstr annotation; HOWEVER we rely on objtool to NOP
          all __sanitizer_cov_* calls in .noinstr/.entry text sections because
          __no_sanitize_cov is 'broken' in all known compilers.

        • generates __mcount_loc section and NOPs the __fentry call sites
          (optional)

        • generates .static_call_sites section for STATIC_CALL_INLINE support

        • rewrites compiler-generated call/jump to the retpoline thunk to an
          alternative such that we can patch out the thunk with an indirect
          call/jmp when retpolines are disabled. (arch dependent)

        • rewrites specific jmp.d8 sites (as found through the __jump_table
          section) to nop2, because GAS is unable to determine if a jmp becomes
          a jmp.d8 or jmp.d32 and emit the right sized nop. (optional)

        Speakers: Josh Poimboeuf (Red Hat), Mark Rutland (Arm Ltd), Peter Zijlstra (Intel OTC), Will Deacon
      • 08:05
        break 1
      • 226
        Report From The Standards Committees

        Both C and C++ started as strictly single-threaded languages, despite significant multi-threaded use more than 30 years ago. Explicit support for multithreaded execution appeared in 2011, but this was by no means the final word. This presentation will give a quick overview of low-level standards-committee concurrency progress since then, including a snapshot of work on hazard pointers, RCU, relaxed accesses, dependency ordering, and the interplay between the C/C++ and Linux-kernel memory models.

        Speaker: Paul McKenney (Facebook)
      • 227
        The never-ending saga of control dependencies

        The Linux kernel continues to rely on control dependencies as a cheap mechanism
        to enforce ordering between a prior load and a later store on some of its
        hottest code paths. However, optimisations by both the compiler and the CPU
        hardware can potentially defeat this ordering and introduce subtle,
        undebuggable failures which may only manifest on some systems.
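
        For orientation, a minimal sketch of the idiom in question, loosely following the examples in the kernel's memory-barriers.txt (the variables and functions are made up for illustration):

          #include <linux/compiler.h>

          static int a, b;

          /* The store to 'b' is ordered after the load of 'a' purely because it
           * sits on the taken branch of the conditional: CPUs do not speculate
           * stores, so this gives cheap load-to-store ordering. */
          void relies_on_ctrl_dep(void)
          {
                  if (READ_ONCE(a))
                          WRITE_ONCE(b, 1);
          }

          /* Here both branches store the same value, so the compiler is entitled
           * to hoist the store above the conditional, and the ordering silently
           * vanishes; patterns like this are what the session will dissect. */
          void easily_broken(void)
          {
                  if (READ_ONCE(a))
                          WRITE_ONCE(b, 1);
                  else
                          WRITE_ONCE(b, 1);
          }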

        Improving the robustness of control dependencies is therefore a hotly debated
        topic, with proposals ranging from limiting their usage, inserting conditional
        branches, introducing compiler support and using memory barriers instead. The
        scope of possible solutions has resulted in somewhat of a deadlock, so this
        session aims to cover the following in the interest of progressing the debate
        and soliciting opinions from others:

        • What are control dependencies?
        • How can they be broken by the compiler?
        • How can they be broken by the CPU? (specifically, arm64)
        • volatile_if() and a potential compiler __builtin
        • A better barrier() macro
        • Upgrading READ_ONCE() and relaxed atomics to have acquire semantics

        LKML mega-thread: https://lore.kernel.org/r/YLn8dzbNwvqrqqp5@hirez.programming.kicks-ass.net

        Speakers: Will Deacon, Peter Zijlstra (Intel OTC), Paul McKenney (Facebook), Jade Alglave (Arm)
      • 09:20
        break 2
      • 228
        Optimizing Linux Kernel with BOLT

        Previous research has demonstrated that the Linux kernel can benefit greatly from the latest compiler optimization techniques. Binary Optimization and Layout Tool (BOLT) is successfully used to accelerate large applications compiled with PGO and LTO by further improving the code layout to favor underlying hardware page and instruction caching. However, applying BOLT to the kernel faces multiple hurdles as the tool splits and reorders code sequences across function boundaries. The corresponding metadata used for code patching at boot and runtime needs to be updated accordingly. At the same time, BOLT optimizations have to be tailored to meet certain expectations about the properties of the code. Updating exception-handling and stack-unwinding data presents another set of challenges. Even allocating memory for the modified code is not as straightforward as is the case with a typical ELF binary. We'll discuss the possible approaches to optimizing the kernel with BOLT and the project's current status.

        Speaker: Maksim Panchenko (Facebook)
      • 229
        Compiler Features for Kernel Security

        GCC and Clang both have a variety of security features available, but they are not always at parity with each other. This discussion will review the security features important to the Linux kernel with regard to what's working, what's missing, and what needs adjustment.

        Specifically, these areas will be discussed, along with anything else that seems relevant (a short illustrative sketch for stack variable auto-initialization follows the list):

        • stack protector guard location (i.e. enabling per-task canaries)

          -mstack-protector-guard=sysreg
          -mstack-protector-guard-reg=sp_el0
          -mstack-protector-guard-offset=0
          
        • call-used register zeroing (now in GCC 11)

          -fzero-call-used-regs
          
        • stack variable auto-initialization (already in Clang, soon to be in GCC 12)

          -ftrivial-auto-var-init={zero,pattern}
          
        • array bounds checking

          -Warray-bounds
          -Wzero-length-bounds
          -Wzero-length-array
          -fsanitize=bounds
          -fsanitize=bounds-strict
          
        • integer overflow protection

          -fsanitize=signed-integer-overflow
          -fsanitize=unsigned-integer-overflow
          
        • Link Time Optimization

          -flto
          -flto=thin
          
        • backward edge Control Flow Integrity

          -mbranch-protection=pac-ret[+leaf]
          -fsanitize=shadow-call-stack
          CET
          
        • forward edge Control Flow Integrity

          -fcf-protection=branch
          -mbranch-protection=bti
          -fsanitize=cfi
          
        • Spectre v1 mitigation

          -mspeculative-load-hardening
          
        • structure layout randomization

          __attribute__((randomize_layout))
          
        • constant expression for "is an lvalue?"

        • constant expression for lvalue type extraction
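
        As noted above, a short illustrative sketch for stack variable auto-initialization. This is hypothetical userspace code (the struct and function names are made up) showing the class of uninitialized-memory leak that -ftrivial-auto-var-init={zero,pattern} is meant to close; the same early-return pattern is common in kernel code.

          #include <stdio.h>
          #include <string.h>

          struct req {
                  int flags;
                  char buf[16];
          };

          static int fill_req(struct req *r, const char *src)
          {
                  if (!src)
                          return -1;      /* early exit: *r is never written */
                  r->flags = 1;
                  strncpy(r->buf, src, sizeof(r->buf) - 1);
                  r->buf[sizeof(r->buf) - 1] = '\0';
                  return 0;
          }

          int main(void)
          {
                  struct req r;           /* uninitialized stack memory */

                  if (fill_req(&r, NULL) < 0)
                          /* Without auto-init this prints whatever happened to be
                           * on the stack; built with -ftrivial-auto-var-init=zero
                           * it reliably prints 0. */
                          printf("flags on error path: %d\n", r.flags);

                  return 0;
          }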
        Speakers: Kees Cook (Google), Qing Zhao
    • Keynote: Closing Keynote Refereed Track/Virtual-Room (LPC Virtual)

      Refereed Track/Virtual-Room

      LPC Virtual

      150
      Conveners: Christian Brauner, David Woodhouse
      • 230
        Closing Keynote & Beverage Hall

        This event marks the end of Linux Plumbers 2021. We will take a look back at Linux Plumbers 2021, our challenges in organizing it and what our hopes for Linux Plumbers 2022 are.

        During Monday's opening keynote we had the chance to look back on the last 30 years of Linux. In the closing keynote we will concentrate on the future of Linux instead. And this requires your input too! We would like to hear what you think will happen with Linux in the future. So please fill out our "Linux Prediction Survey" at
        https://docs.google.com/forms/d/e/1FAIpQLSc6mfLXCXQLY6NfT5apbtu2dZVSQHBESVpDxbxqrBP5HnpfTA/viewform?usp=pp_url
        We will discuss the results in this session.

        Speakers: Mr Christian Brauner, David Woodhouse