The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond. LPC 2021 will be held virtually (like in 2020). We are looking forward to seeing you online!
The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.
This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.
Short intro/welcome session to the Networking and BPF track.
Sometimes it is necessary to use tracing - instead of traditional kernel debuggers - to investigate kernel issues. Some problems, such as races, are inherently time-sensitive, so minimally invasive tracing is ideal in such cases. However, it is also true that debuggers have capabilities that BPF-based tracing does not have (or has only recently acquired) - printing data structures, tracking local variable values, inline function visibility, etc. In fact, looking at gdb's capabilities is a great way to explore some possibilities for future tracing functionality. Here we explore some of the possibilities, the potential cost-benefit, and seek feedback and discussion with the community on the potential value of the approaches explored.
eBPF has been used extensively in performance profiling and monitoring. In this talk, I will describe a set of eBPF applications that help monitor and enhance CPU scheduling performance. These applications include:
Profiling scheduling latencies. I will talk about an application of eBPF to collect scheduling latency stats (a minimal sketch appears below).
Profiling resource efficiency. For background, I will first introduce the core scheduling feature, which was developed to mitigate the L1TF CPU vulnerability. Then I will introduce the eBPF ksym feature which enables this application, and describe how eBPF can help report forced idle time, a type of CPU usage inefficiency caused by core scheduling.
The third application of eBPF is to assist userspace scheduling. ghOSt is a framework open sourced by Google to enable general-purpose delegation of scheduling policy to userspace processes in a Linux environment. ghOSt uses BPF acceleration for policy actions that need to happen closer to scheduling edges. We use this to maximize CPU utilization (pick_next_task), minimize jitter (task_tick elision) and control tail latency (select_task_rq on wakeup). We are also experimenting with BPF to implement a scaled-down variant of the scheduling policy while upgrading the main userspace ghOSt agent.
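To make the first application above concrete, here is a minimal libbpf-style sketch of wakeup-to-run latency measurement, in the spirit of the upstream runqslower tool; this is an illustrative assumption about the general approach, not the speaker's actual code:

    // Record wakeup time per pid, report the wakeup-to-run delta at context switch.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, u32);      /* pid */
        __type(value, u64);    /* wakeup timestamp in ns */
    } start SEC(".maps");

    SEC("tp_btf/sched_wakeup")
    int BPF_PROG(on_wakeup, struct task_struct *p)
    {
        u32 pid = p->pid;
        u64 ts = bpf_ktime_get_ns();

        bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
        return 0;
    }

    SEC("tp_btf/sched_switch")
    int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
                 struct task_struct *next)
    {
        u32 pid = next->pid;
        u64 *tsp = bpf_map_lookup_elem(&start, &pid);

        if (tsp) {
            /* Scheduling latency: time between wakeup and getting on the CPU. */
            bpf_printk("pid %d latency %llu ns", pid, bpf_ktime_get_ns() - *tsp);
            bpf_map_delete_elem(&start, &pid);
        }
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";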
This talk presents our recent work, available in the v5.14 kernel, which improves the SO_REUSEPORT functionality.

The SO_REUSEPORT option was introduced in v3.9. Before it, only one socket was allowed to listen() on any given TCP port. The traditional technique for a high-performance server is to have a single process that accept()s and distributes connections to other processes, or to have multiple processes that accept() connections from the same single socket. However, the accept() syscalls on a single listen()ing socket can be a bottleneck. The SO_REUSEPORT option allows multiple sockets to listen() on the same port and addresses the bottleneck.

If the option is enabled, the kernel distributes connections evenly to each listen()ing socket when SYN packets arrive. Once the kernel has committed a connection to a listen()ing socket, it does not change later. Thus, when a listen()ing socket is close()d, the not-yet-accept()ed connections are aborted even if other sockets still listen() on the same port.
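For reference, enabling the option is a single setsockopt() call made before bind(); a minimal sketch (error handling omitted, port 8080 chosen arbitrarily):

    /* Each process creates its own socket, sets SO_REUSEPORT before bind(),
     * and listens on the same port; the kernel load-balances incoming SYNs. */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);

        for (;;) {
            int conn = accept(fd, NULL, NULL);
            if (conn >= 0)
                close(conn);    /* a real server would handle the connection */
        }
    }

Several processes can each run this code and the kernel will spread incoming connections across their listening sockets.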
This talk shows how the SO_REUSEPORT mechanism works with SYN processing, when it causes connection failures, how we can work around it with BPF, and how we address it with the new socket migration feature and the extension of BPF.
There's currently a kernel-side bottleneck when attaching a probe to multiple functions, which makes several tools like bpftrace or retsnoop suffer in such use cases - it takes forever to attach a single probe to multiple functions ;-)

After multiple discussions and many failed attempts, it looks like we are on track to have a working solution, which consists of multiple gradual changes in the ftrace and BPF code.

In this presentation I'll show and explain why the current code is slow, introduce the proposed solution to the problem, and give the status of the implementation.
In this talk I will outline a facility for tracing BPF map updates which might be used to perform zero downtime upgrades of stateful programs.
Map updates cannot currently be natively traced within BPF. I propose a set of kernel changes where tracing programs can be attached to individual maps. These programs run in response to particular operations: one might run on update, and another on deletion.
This facility seems like it should be broadly useful, but it was designed with a specific use case in mind. We would like to be able to migrate state between two versions of a set of programs, and swap between the two versions with zero downtime. By tracing updates on the original set of maps, I believe that we can achieve this goal.
Please see the attached paper for a deeper discussion of the problem and proposed solution.
The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel-related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.
Contributions to the microconference are expected to be problem statements, new use cases, and feature proposals, both in kernel and userspace.
The user namespace currently relies on mapping UIDs and GIDs from the initial namespace (full uint32 range) into the newly created user namespace. This is done through the use of uid_map/gid_map, with the kernel allowing you to map your own uid/gid and otherwise requiring a privileged process to write a more complete map.
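For illustration, a minimal sketch of the mapping mechanism described above (the values are examples only; anything beyond mapping your own uid must be written by a privileged helper such as newuidmap):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        unshare(CLONE_NEWUSER);                   /* create a new user namespace */

        const char *map = "0 100000 65536\n";     /* ns uid 0..65535 -> host 100000.. */
        int fd = open("/proc/self/uid_map", O_WRONLY);

        if (fd < 0 || write(fd, map, strlen(map)) < 0)
            perror("uid_map");                    /* fails without privilege */
        if (fd >= 0)
            close(fd);
        return 0;
    }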
As more and more software (not just container managers) is making use of user namespaces, a few patterns and problems have started to emerge:
For maximum security, containers should get their own, non-overlapping uid/gid range. Non-overlapping here means no shared uid/gid with any other container on the system as well as no shared uid/gid with any user/processes running at the host level. Doing this prevents issues with configuration tied to a particular uid/gid which may cross container boundaries (user limits are/were known to do that) as well as prevent issues should the container somehow get access to another's filesystem.
Network authentication as well as dynamic user creation is causing more "high" uids/gids to be used than before. So while it was perfectly fine to give a contiguous range of 65536 uids/gids to a container before, nowadays this often needs to be extended to 1-2M to support network authentication, and may need a bunch of mapped uids/gids right at the end of the 32-bit range to handle some temporary dynamic users too.
Coordination of uid/gid range ownership was in theory done through the use of shadow's /etc/subuid, /etc/subgid and the newuidmap/newgidmap helpers. In practice those have only ever really been used when containers are created/started by non-root users, and are generally ignored by any tool which operates as root. This effectively means there's no coordination going on today and it's very easy to accidentally assign a uid/gid to a user namespace which may in fact be used by the host system or by another user namespace.
The bulk of those constraints come from the fact that the user namespace has to deal with filesystem access and permissions. A uid in the container is translated to a real uid, and that is what is used to access the VFS. If a uid cannot be translated, it's invalid and can't be used.
But now that we have a VFS layer feature to support ID shifting, we may be able to decouple the two and allow for user namespaces that have access to the full uid/gid range (uint32) yet are still technically using completely distinct kuid/kgid from everything else. This would then rely on the use of VFS based shifting for any case where data must be accessed from outside of that namespace.
This kind of approach would make it trivial to allocate new user namespaces, would drop the need for coordination and would avoid conflicts with host uids/gids. Anyone could safely get a user namespace with access to all uids/gids but restricted to virtual filesystems. Then, with help from a privileged helper, they could get specific mounts mapped into their namespace, allowing for VFS operations with the outside world.
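As a rough sketch of the VFS primitive such a design could build on, here is the idmapped-mounts interface added in Linux 5.12 (open_tree(2) plus mount_setattr(2)); the helper below is an assumption about usage, not part of the proposal, and it expects recent kernel/libc headers that define the syscall numbers:

    /* Clone a mount, apply the ID mapping of the user namespace referred to
     * by userns_fd, and return the detached mount fd (to be attached later
     * with move_mount(2)). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int make_idmapped_mount(const char *path, int userns_fd)
    {
        struct mount_attr attr = {
            .attr_set  = MOUNT_ATTR_IDMAP,
            .userns_fd = userns_fd,
        };
        int mnt_fd = syscall(SYS_open_tree, AT_FDCWD, path,
                             OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);

        if (mnt_fd < 0)
            return -1;

        if (syscall(SYS_mount_setattr, mnt_fd, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0) {
            close(mnt_fd);
            return -1;
        }
        return mnt_fd;   /* detached, idmapped mount */
    }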
There are some interesting corner cases though, like what to do when transferring user credentials across namespace boundaries: how do we render things in the parent user namespace, whether it's the ownership of a process or the data in a ucred?
The Oracle database offers a long-term-stable version that is supported and
maintained for many years. But as Linux distributions slowly transition
from cgroup v1 to cgroup v2, this creates a challenge for the DB. cgroup v1
and cgroup v2 have different interfaces and best practices.
This talk discusses the current status of the cgroup abstraction layer, how applications like the Oracle database plan to use it, and aims to gather and discuss other users' requirements for this layer.
[1] https://github.com/libcgroup/libcgroup/releases/tag/v2.0
[2] https://sourceforge.net/p/libcg/mailman/message/37317688/
Mount checkpoint/restore is an important part of CRIU; it is responsible for the consistency of the file system view of dumped processes. In its current state we can only restore simple mount configurations; something more complex would either make us fail or, which is even worse, make us create a wrong file system view for the restored processes.
In CRIU we only see the final state, the result of probably multiple kernel API calls inside a container, and on restore we need to recreate a sequence of calls which would lead to the exact same state; in general this task can be very complex. So sometimes the only way is to simplify the API so that it becomes easier to restore all possible configurations.
Last year [1] we discussed a variety of problems CRIU faces with mounts, most
important ones related to mount propagation and how to simplify mount
propagation configuration so that even complex setups can be re-created simply
and correctly.
This talk will start by showing more complex mount configurations to demonstrate that we still need an API change. Then there will be a status update on kernel patch progress and changes that were made during the last year, followed by a discussion on how to make the patch [2] mergeable into the upstream kernel.
Here are links:
Thanks to Andrei Vagin and Christian Brauner for their great help with it!
Starting things is slow. Even if it is only 1 second slow, saving 1s on a million container restores means we save 11 days of useless work that every container would otherwise perform identically.
That's where snapshots come in. Snapshots in theory allow us to save an initialized container once, and then restore it a million times with less overhead than a cold start.
Unfortunately, Linux applications (and the kernel in VM based container setups) expect that during their lifetime they don't get cloned from the outside. Applications create user space PRNGs (Pseudo Random Number Generators) which would generate identical random numbers after a clone. They create UUIDs that would no longer be unique. They generate unique temporary key material that is no longer unique.
Eventually, if we want to enable cloning properly, user space applications will need to learn that they have to adapt to clone events. For that they need notifications.
This session will discuss the requirements such a notification mechanism has as well as possible paths forward to implement it and drive adoption.
References:
We recently announced our work to support Checkpoint and Restore with AMD GPUs. This was the first time a device plugin was introduced, and it deals with one of the most complex devices on the system, i.e. the GPU. We made several changes to CRIU, introduced new plugin hooks and reported some issues with CRIU.
https://github.com/RadeonOpenCompute/criu/tree/amd-criu-dev-staging/plugins/amdgpu#readme
While we faced several new challenges to enable this work, we were finally able to support real tensorflow/pytorch workloads across multi-GPU nodes using CRIU, and were also able to migrate containers running GPU-bound workloads. We have another proposed talk where we'll talk about the bigger picture, but in this talk we'd like to specifically talk about our journey, which started with a small 64KB buffer object in GPU VRAM and went to gigabytes of single VRAM buffer objects across GPUs. We started with the /proc/pid/mem interface initially and then switched to a faster direct approach that only worked with large PCIe BAR GPUs, but that was still slow. For instance, to copy 16GB of VRAM, it used to take ~15 mins with the direct approach on large BARs and more than 45 mins with small BARs. We then switched to using the system DMA engines built into most AMD GPUs and this resulted in very significant improvements: we can checkpoint the same amount of data within 10 seconds now. For this we initially modified libdrm, but the maintainers didn't agree to change a private API to expose GEM handles to userspace, so we finally ended up making a kernel change, exporting the buffer objects in VRAM as DMABUF objects and then importing them in our plugin using libdrm.
We would also like to talk about how we further optimized it using multithreading, and about our experience with compression using criu-image-streamer to save further time and space. We also encountered limitations of Google protobuf in handling large VRAM buffer objects.
Overall this was a very significant feature addition that took our work from a POC to being usable for real-world, large machine learning and training workloads.
Thank you
Rajneesh
CRIU uses many different interfaces to get information about kernel resources: to extract socket data, the sock_diag subsystem is used; for mounts/mount namespaces, procfs per-pid mountinfo files are used; and to get some file type-specific info, we use the procfs fdinfo interface (which allows getting the mnt_id from which a file was opened, file flags, and so on).
One of the most important and time-consuming stages of a CRIU dump is getting process memory mapping information. Let's discuss that problem and approaches to optimizing the performance of this stage. There was a prototype implementation of a netlink-based interface to get information about a task [1]. We suggest using the eBPF iterator framework [2] to create a CRIU-optimized interface to get task VMA data.
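A minimal sketch of what such an eBPF-iterator-based interface could look like, assuming the existing task_vma iterator and a recent libbpf (this is an illustration, not the authors' implementation):

    // Emit "<pid> <start>-<end> <flags>" for every VMA visited by the iterator.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("iter/task_vma")
    int dump_task_vmas(struct bpf_iter__task_vma *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct vm_area_struct *vma = ctx->vma;

        if (!task || !vma)
            return 0;

        BPF_SEQ_PRINTF(seq, "%d %lx-%lx %lx\n",
                       task->pid, vma->vm_start, vma->vm_end, vma->vm_flags);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";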
Another interesting thing is mount information acquisition. For simple cases the mountinfo file seems sufficient. Last year we introduced support for checkpoint-restoring nested containers; the main goal was to be able to C/R OpenVZ containers with Docker containers inside. And here we met a problem with overlayfs mounts. CRIU needs to get the real overlayfs paths from the kernel (mnt_id + full path for each source directory) and these paths may be very long (on the order of PAGE_SIZE). This is a problem because of serious limitations implied by the mountinfo interface (limited line size, bad extendability). Some overlayfs-specific patches were proposed [3] earlier, but it's worth having a universal approach to query mount information for all file systems. There was a great subsystem called fsinfo [4] proposed by David Howells, but for some reason it wasn't merged. There is an idea to make some progress by creating eBPF helpers which allow getting mount information.
Thanks a lot to Andrei Vagin for his advice and help.
Links:
[1] https://github.com/avagin/linux-task-diag/commits/v5.8-task-diag
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/task_iter.c?h=v5.13#n472
[3] overlayfs: C/R enhancments https://lkml.org/lkml/2020/10/4/208
[4] fsinfo https://lwn.net/Articles/827934/
I'll be talking about the -fanalyzer static analysis option I added to GCC:
("Prepared project report": 25 minutes, including questions)
Points-to analysis is a static code analysis that calculates the pointer-pointee relationships between expressions and static memory locations. The results of the points-to analysis may be used by multiple optimizations and analyses. Of particular interest, a precise points-to analysis is necessary to perform data-layout optimizations at the level of alias sets. We use the high-level, declarative logic language Souffle to encode the semantics of a points-to analysis in a few lines of code. The Souffle compiler allows us to synthesize a parallel C++ representation of the points-to analysis from the Souffle representation. In this talk we will go over the implementation of an intra-procedural, inclusion-based, field-insensitive, flow-insensitive, context-insensitive points-to analysis which works with the existing link-time optimization (LTO) framework in GCC. While the current prototype is less precise than the existing points-to analysis, our plan is to increase the level of precision of this implementation and use its results to enable future link-time data-layout optimizations.
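As a toy illustration of what such an analysis computes (not GCC output), consider:

    int x, y;
    int *p, *q;

    void f(int cond)
    {
        q = &x;               /* the analysis records q -> { x } */
        p = cond ? &x : &y;   /* inclusion-based and flow-insensitive: p -> { x, y } */
    }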
LLVM has two main test suites:
However, there is no large body of tests of detailed functionality which is compiled right down to the target object code and executed. At previous conferences, we have described the changes we have made to allow the GCC test suite to be used for nightly public regression testing of LLVM for RISC-V. Here we will discuss the necessary transformations to the testsuite to support LLVM.
Bunsen is a toolkit for compact storage and analysis of DejaGNU test results. The toolkit includes a storage engine that compresses and indexes a large collection of test result logs in a Git repository, a Python library for querying and analyzing the test result collection, and a simple CGI service for accessing query results through a web browser.
In this talk I will give an in-depth look at how Bunsen can be used to understand the current state of a project's testsuite. Based on my experience using Bunsen to collect and monitor test results from the SystemTap project, I will show how keeping a long-term repository of test results enables more sophisticated and useful analysis. In particular, I will show how Bunsen analysis scripts can help to locate significant regressions and filter out insignificant ones, identify nondeterministic ('flaky') testcases, and narrow down the commits that introduced a particular regression.
Type: prepared presentation (~25min)
Abstract:
AMD has been working on adding support for GPU compute debugging to GDB. Early on, it became apparent that current DWARF would not be sufficient to support optimized SIMT/SIMD code, so we came up with extensions and generalizations that we intend to propose for DWARF 6. Although designed with GPUs in mind, the extensions are generic and can just as well be used to improve the quality of debug information for CPUs and for any architecture. We've implemented the extensions in GDB, and are in the process of implementing them in LLVM. One interesting area that required extensions is DWARF expressions, support for which we're currently upstreaming to GDB. In this presentation we will give an overview of the problems we saw and what we've done to address them. More information on the subject and on our proposed DWARF extensions can be found here: https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
CTF (Compact C Type Format) is a debugging format whose main (but not only) purpose is to convey type information of C program constructs. BTF is a similar format used in the Linux kernel to support the portable execution of BPF programs. Both formats share a common ancestor and show some remarkable similarities. However they are not the same format, their application goals are different, are developed by different groups, and they use a different binary representation.
We have added support for both formats to the GNU Toolchain. CTF is now fully supported in GCC, linker (with type deduplication), binary utilities (dumping the contents of .ctf sections in human readable format), a GNU poke description for editing encoded CTF, and GDB support. BTF is supported in GCC, mainly to be used by the BPF backend. There is however no support for BTF in Binutils at this point.
In this talk we will show how these new debug formats have been implemented in GCC, highlighting how the implementation relies exclusively on the internal DWARF representation built by the compiler. This effectively makes DWARF the canonical internal representation for debugging info in GCC. This approach has worked well so far and looks very promising.
We will also discuss the support for the CTF debug format in GDB, which includes support for both CTF sections and CTF archives (the latter is under review).
BPF is a virtual machine that resides in the Linux kernel. Initially intended for user-level packet capture and filtering, BPF is nowadays generalized to serve as a general-purpose infrastructure also for non-networking purposes. BPF programs are often written manually, directly in assembly instructions. However, people often want to write their BPF programs in C. We recently added support for this virtual architecture to the GNU Toolchain, to complement the already existing support in LLVM.
In this talk we will first show and discuss the latest developments related to this peculiar target. This includes the support for new instructions, atomics, architecture versioning, the BTF debugging format, and the support for the CO-RE (Compile-once, Run-Everywhere) mechanism in GCC, which allows compiling BPF programs that are portable across several kernel versions. Then we will show several ongoing efforts, such as the addition of a verifier-aware testsuite to GCC (using command-line access to the kernel verifier).
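For readers unfamiliar with CO-RE, here is a minimal example of the idiom, written against libbpf conventions (the field chosen is only illustrative): the compiler records BTF relocations for the field accesses so that the loader can fix up offsets for the running kernel.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_tracing.h>

    SEC("kprobe/do_exit")
    int BPF_KPROBE(on_do_exit)
    {
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
        pid_t parent_tgid;

        /* Relocatable read: offsets are adjusted at load time from the running
         * kernel's BTF, so the same object works across kernel versions. */
        parent_tgid = BPF_CORE_READ(task, real_parent, tgid);
        bpf_printk("exit: parent tgid=%d", parent_tgid);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";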
Finally, we will address a very interesting problem that is arising in this field of compiled BPF. In this model, where C code is processed by an optimizing compiler to generate assembly and then checked by an always-changing run-time verifier driven by delicate heuristics, it can be very difficult to predict the behavior of the latter just by looking at the C code in non-trivial cases. This can be very frustrating, as our own in-house experience with DTrace2 demonstrates. We have identified some particular language constructs (like loops) and compiler optimization passes that are most prone to lead to this situation.
While lore.kernel.org is a fairly new service, it has quickly become an indispensable part of many maintainers' workflows. Tools like b4 allow automating many aspects of maintainer duties:
This talk will review some of the above functionality that may be already familiar to maintainers, but will also go through other features of public-inbox, and peek into what is coming in newer releases, such as:
Working with patches sent via email does not need to be frustrating or insecure, and we have tools to prove it.
Jesse Barnes, Chrome OS baseOS (Firmware+Kernel) lead, and I would like to present the current state of affairs of the Linux kernel on ChromeOS and the challenges we face, how we solve them, and to get your feedback.
We can also talk about how our efforts can help upstream development, for example by running experiments in the field to compare approaches to a specific problem or area.
Shipping ChromeOS to millions of users, spanning hundreds of different platforms, multiple active kernel versions and many different SoC architectures, introduces interesting challenges:
We feel 45-60 minutes would be enough and will allow a discussion.
Thanks a lot in advance,
Alex Levin,
ChromeOS platform tech lead.
The Linux kernel uses several coarse representations of the physical memory, consisting of [start, end, flags] structures per memory region. There is memblock, which some architectures keep after boot; there is the iomem_resource tree and its "System RAM" nodes; there are memory blocks exposed in sysfs; and then there are per-architecture structures, sometimes even several per architecture.
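As a sketch of that shape, each memblock region is essentially a { base, size, flags } tuple, and boot-time code walks the registered ranges roughly like this (simplified; see include/linux/memblock.h):

    #include <linux/init.h>
    #include <linux/memblock.h>
    #include <linux/printk.h>

    /* Each registered region is a { base, size, flags } tuple
     * (struct memblock_region), i.e. the [start, end, flags] shape above. */
    static int __init walk_system_ram(void)
    {
        phys_addr_t start, end;
        u64 i;

        /* Iterate over every range of memory known to memblock. */
        for_each_mem_range(i, &start, &end)
            pr_info("memblock range: %pa - %pa\n", &start, &end);

        return 0;
    }
    early_initcall(walk_system_ram);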
These abstractions are used by the memory hotplug infrastructure and
kexec/kdump tools. On some architectures the memblock representation even
complements the memory map and it is used in arch-specific implementation
of pfn_valid().
The multiplication of such structures and lack of consistency between some
of them does not help the maintainability and can be a reason for subtle
bugs here and there. Moreover, the gaps between architecture specific
representations of the physical memory and the assumptions made by the
generic memory management about the memory layout lead to unnecessary
complexity in the initialization of the core memory management structures.
The layout of the physical memory is defined by hardware and firmware and
there is not much room for its interpretation. Regardless of the particular
interface between the firmware and the kernel a single generic abstraction
of the physical memory should suffice and a single [start, end, flags] type
should be enough. There is no fundamental reason it is not possible to
converge per-architecture representations of the physical memory, like
e820, drmem_lmb, memblock or numa_meminfo into a generic abstraction.
Memblock seems the best candidate for being the basis for such generic
abstraction. It is already supported on all architectures and it is used as
the generic representation of the physical memory at boot time. Closing the
gaps between per architecture structures and memblock is anyway required
for more robust initialization of the memory management. Addition of simple
locking of memblock data for memory hotplug, making the memblock
"allocator" part discardable and a mechanism to synchronize "System RAM"
resources with memblock would complete the picture.
Extending memblock with the necessary functionality and gradually bridging the gap between the current per-architecture physical memory representations and the generic one will improve the robustness and maintainability of the early memory management.
The Rust for Linux project is adding support for the Rust language to the Linux kernel. This talk describes the work done so far and also serves as an introduction for other kernel developers interested in using Rust in the kernel.
It covers:
- the kernel-side support work, including the subset of the Rust standard library that is used (the core and alloc crates), etc.
- the coding conventions followed (e.g. SAFETY comments).
Rust is a systems programming language that is getting stronger support from many companies and projects over time, thanks to its memory-safety innovations (e.g. the safe/unsafe split, the borrow checker, etc.).
This talk covers:
The Rust programming language is becoming more and more popular: it is even being considered as another language allowed in the Linux kernel.
That brought up the question of architecture support, as the official Rust compiler is based on LLVM.
This project, rustc_codegen_gcc, is meant to plug the GCC backend into the Rust compiler frontend with relatively low effort: it is a shared library reusing the same API that the Rust compiler provides to the Cranelift backend.
As such, it could be used by some Linux projects as a way to provide their Rust software on more architectures.
This talk will present this project and its progress.
GCC Rust is a front-end project for the GNU toolchain, a work-in-progress alternative to the official Rustc compiler. Being part of GCC, the compiler benefits from the common compiler flags, available backend targets and provides insight into its distinct optimiser's impact on a modern language.
This project dates back to 2014, when Rust was still at ~0.8, but the language was subject to frequent change, making the effort too challenging to keep up with. More recently, the core language has become stable, and in early 2019 development restarted. Since then, the project has laid out the core milestone targets to create the initial MVP, with freely available status reports, and is part of Google Summer of Code 2021 under the GCC organisation.
The GNU toolchain has been the foundation of the Linux ecosystem for years, but the official Rustc compiler takes advantage of LLVM for code generation; this leaves a gap in language availability between the toolchains. GCC Rust will eliminate this disparity and provide a compilation experience familiar to those who already use GCC.
As of 2021, GCCRS has gained sponsorship from Open Source Security, Inc. and Embecosm to drive the effort forward. With this support, the project has gained mentorship from the GCC and Rust communities.
In this talk, I will introduce the compiler, demonstrate its current state and discuss the goals and motivations for the project.
In certain corners of the Linux Kernel, manual locking and lockless-synchronization primitives are developed instead of using the existing (and default) kernel locking APIs. This is obviously frowned upon, but still exists for historical reasons or because developers think that the subsystem in question is special enough to warrant manual synchronization primitives.
Adopting the PREEMPT_RT patch for mainline quickly exposes such cases, as these manual synchronization mechanisms are usually written with invalid assumptions regarding PREEMPT_RT that either negatively affect preemptibility or directly result in livelocks.
In the non-mainline parts of the PREEMPT_RT patch, such offending call sites were directly dealt with using "#ifdef" shortcuts. This can be an acceptable solution for an external patch, but for mainline the bar is much higher: the kernel is first surveyed for similar patterns in other subsystems and then, in cooperation with the locking subsystem maintainers, new official kernel locking mechanisms are created for such call sites.
This is better for the kernel ecosystem in general, and better for the call sites themselves. Using official kernel locking mechanisms comes with the benefits of reliability, thorough reviews, and lockdep validation.
In this session, four cases will be presented — from actual experiences while pushing some of the remaining parts of the PREEMPT_RT patch mainline: new sequence counter APIs, modified APIs for disabling tasklets, local locks, and generic patterns to avoid using the low-level in_irq/softirq/interrupt() macros in non-core kernel code.
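As one example of the pattern, here is a minimal sketch of a sequence counter with an associated lock (the seqcount_spinlock_t API mentioned above), so that lockdep and PREEMPT_RT know which lock serializes the writers; names are illustrative:

    #include <linux/seqlock.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_SPINLOCK(stats_lock);
    static seqcount_spinlock_t stats_seq =
            SEQCNT_SPINLOCK_ZERO(stats_seq, &stats_lock);
    static u64 stats_value;

    static void stats_update(u64 v)
    {
        spin_lock(&stats_lock);           /* writer serialization */
        write_seqcount_begin(&stats_seq);
        stats_value = v;
        write_seqcount_end(&stats_seq);
        spin_unlock(&stats_lock);
    }

    static u64 stats_read(void)
    {
        unsigned int seq;
        u64 v;

        do {                              /* lockless, retrying reader */
            seq = read_seqcount_begin(&stats_seq);
            v = stats_value;
        } while (read_seqcount_retry(&stats_seq, seq));

        return v;
    }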
The session will end with a discussion of some of the remaining manual locking mechanisms yet to be converted for proper mainline PREEMPT_RT inclusion.
The Open Printing microconference focuses on improving and modernizing the way we print in Linux.
Over the years OpenPrinting has been actively working on improving and modernizing the way we print in Linux. We have been working on multiple areas of printing and scanning. In particular, driverless print and scan technologies have helped the world do away with a lot of the hassle involved in deciding on the correct driver to use and installing it. Users can now just plug in their printer and do what they need.
So what next in Open Source Printing?
Go through the changes in CUPS 2.4.x, including printer sharing changes for mobile, OAuth support as a replacement for Kerberos, Printer Applications as a replacement for printer drivers, TLS/X.509 changes, and CUPS in containers (snap, Docker, others?). Discuss specific needs and timeframes with respect to Kerberos->OAuth, drivers->Printer Applications, X.509 management, and deploying CUPS in containers going forward.
(Continuation of discussions at the OpenPrinting Summit, active development in the OpenPrinting CUPS Github repository)
Discuss proposed CUPS 3.0 design work from prior presentations. Discuss future CUPS development: identify supported platforms, key printing system components, areas of responsibility, schedules, goals, organizational issues, and milestones. Discuss integration with Printer Applications and application stores like the Snap Store.
(Continuation of discussions at the OpenPrinting Summit)
In the new printing (and scanning) architecture, available printers are no longer defined by CUPS queues but by IPP services, be they network printers, Printer Applications, or IPP-over-USB for USB printers. CUPS queues are simply auto-created for these IPP services. So it does not make sense to have a printer setup tool which lists the available CUPS queues and allows creating them. Instead, we need a tool which lists IPP services and, for each service, gives access to configure it, via buttons opening the web interface and also a GUI for IPP System Service.
For legacy devices which do not provide IPP services by themselves, we need a tool to discover them and to find Printer Applications for them, both locally installed or installable, like in the Snap Store.
We will discuss the details and the integration of these tools in the desktop environments.
Already some years ago we introduced the concept of the Common Print Dialog Backends (OpenPrinting GitHub: CPDB, CUPS backend), where we separate the print dialog GUI from the support code for the actual print technology (like CUPS, IPP, …) via a D-Bus interface, so that GUI toolkits and the print technology support can be developed and released independently.
This especially allows for better support of the fast-paced changes in the printing technology vs. the long development cycles of the GUI toolkits. Also new print technologies, like cloud print services can be added easily, with the appropriate backend provided in a Snap in the Snap Store.
Now this concept gets important again as the printing architecture is under heavy development with all things IPP, CUPS 2.4.x, 3.x, …
Here we especially discuss the adoption into common GUI toolkits like GTK and Qt.
Classic printer and scanner drivers are replaced by Printer/Scanner Applications which emulate IPP-based network devices. We also have implemented most of the supporting code to easily create such Printer/Scanner Applications (PAPPL), a library for retro-fitting classic PPD-based CUPS drivers (pappl-retrofit), and Printer Applications retro-fitting PostScript PPDs (Snap Store), Ghostscript drivers (Snap Store), and HPLIP's printing (Snap Store).
In this session we want to help developers get started with the design, creation, and Snap-packaging of Printer/Scanner Applications. In particular, we want printer/scanner developers to create native Printer/Scanner Applications rather than retro-fits of their classic CUPS/SANE drivers (Tutorial from Google Season of Docs 2021). Updates on the development progress are in the monthly news posts.
Devices are discovered via DNS-SD. We are adding support for pairing scanners with printers: since the typical use case (a multi-function printer) will have the scan-specific TXT keys added to the printer record, with the printer-dns-sd-name value coming from the printer, IPP scanners generally will not have their own DNS-SD records. IPP scanner registrations do not use the same TXT keys as printers; scan-specific keys are added to the IPP scanner registration so that the pairing API can associate the scanner with the printer. The client polls the scanner's properties with a get-printer-attributes IPP request on the scanner URI. For this, a pappl_scanner_t object is implemented and scan-specific header files are added with the updated attributes and capabilities of a scanner (changing print to input), and equivalent driver functions are added for scan, as in printing. The user sets options like scan area, resolution, quality, color, ADF mode, ... and requests the scan.
The Scheduler microconference focuses on deciding what process gets to run when and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference has helped many of these topics make progress.
Title: Scheduler Microconference
The scheduler is an important functionality of the Linux kernel, deciding what process gets to run when and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference has helped many of these topics make progress.
For example, at last year's Scheduler MC, we discussed core scheduling, which is now on its way to being merged [1]. The scheduling fairness patches were merged [2], and NUMA topology limitation fixes were added to the kernel [3]. Not only was progress made toward accepting patches, but some topics were also shown to be infeasible, like “Flattening the CFS runqueue,” and this was facilitated by the conference format.
This year, we think the following topics will lead to a productive microconference:
Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!
Attendees list:
Links:
[1] https://lore.kernel.org/lkml/20210422120459.447350175@infradead.org/T/
[2] scheduling fairness commits:
[3] numa topology commits:
[4] https://lore.kernel.org/lkml/cover.1610463999.git.bristot@redhat.com/
[5] https://lore.kernel.org/linux-arm-kernel/20210420001844.9116-5-song.bao.hua@hisilicon.com/T/
[6] https://www.spinics.net/lists/kernel/msg3894298.html
[7] https://www.spinics.net/lists/kernel/msg3914884.html
The Linux scheduler shuffles tasks around on the various CPUs all the time, as mandated by the implementation of a combination of policies and heuristics. In general, this works well for many different workloads and lets us achieve a more than acceptable compromise among often conflicting goals, such as maximum throughput, minimum latency, reasonable energy consumption, etc. Furthermore, for the cases that really have special requirements, there are knobs to poke --such as different scheduling policies, priorities, affinity, up to CPU isolation-- for steering the scheduler toward any desired behavior.
Nevertheless, we believe that there are cases where a less "migration prone" attitude from the scheduler could be beneficial, e.g., CPU-bound tasks (possibly HPC workloads) or the virtual CPUs of a virtual machine (at least in some circumstances). These cases would benefit from having tasks a bit more "sticky" to the CPUs where they are running, but pinning or a custom policy would be infeasible for the user to configure. Or maybe it's fine to pin or change the priority, but then we need to know what is causing the unwanted migrations, in order to roll out the best counter-measures. For instance, if I can figure out that my task is often migrated because it is preempted by others, I can think about raising its priority and/or rebalancing (or reconsidering) the load on my system.
We therefore started our investigation around migrations. Basically, with a task migration event as our starting point, we wanted to see if it was possible to figure out what other events caused it to actually occur, and how far back in the chain of such events we could get. We are using a combination of existing (the various tracing facilities) and new (e.g., the Sched struct retriever) tools and techniques. We wanted to start really simple and looked at what happens to a very basic main(){while(1);} task, and discovered that on a CPU with multiple cores and multiple threads, it migrates among different cores a bit more than we expected. If we switch off hyperthreading, though, cross-core migrations disappear too...
So, even if we are still at the beginning, the tools we are using are still work in progress, and the above is just one example, we want to present the current status of this activity to the community, in case anyone else is also interested or has any feedback.
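One way to capture the starting point (the migration event itself) is to hook the raw sched_migrate_task tracepoint with BPF; this is only an assumed sketch of such tooling, not the "Sched struct retriever" mentioned above:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("tp_btf/sched_migrate_task")
    int BPF_PROG(on_migrate, struct task_struct *p, int dest_cpu)
    {
        /* Log every migration; user space can correlate these events with
         * wakeups, preemptions and load-balancing activity. */
        bpf_printk("migrate: pid=%d -> cpu=%d", p->pid, dest_cpu);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";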
On the CFS wakeup path, wake_wide() doesn't always behave itself very well in interrupt-heavy workloads. We have systems configured with static IRQ bindings because IRQs are served faster on certain CPUs due to hardware topology. We then noticed on these systems that wakeups kept pulling tasks to the socket serving network IRQs while leaving the other socket nearly idle on a read-only workload from YCSB, an open source database benchmark. On a lightly loaded system with two 32-core sockets, wake_wide() led the scheduler to wake affine most of the time. Wake affine is a two-pass process involving both wake_wide() and wake_affine(), but wake_wide() is the more dominant factor than wake_affine() in our workloads. Periodic and idle load balancing must work to undo wake affine's overeager pulling, but ultimately network interrupts are so frequent in YCSB that wake affine wins out.
So far, we've gotten mixed results when trying to address the performance hit these issues cause. Disabling wake_affine() causes the benchmark to improve by 10-15% on a lightly loaded system (fewer DB connections) but slow down by up to 17% on more heavily loaded systems (more DB connections). wake_wide() works well when the waker and wakee are related, but we need a better heuristic for wakeups in interrupt heavy workloads, where the interrupt may or may not be related to the wakee.
A better heuristic ideally should be able to determine which CPU's cache is warmer for the wakee and doesn't cause excessive pulling.
AMD and ARM server architectures further complicate the issue with wake_wide() overeager pulling (see other abstract).
An LLC domain can span a whole socket on an Intel server but are significantly smaller on AMD ZEN due to its CCXs. For example, on ZEN 2, each CCX has only 4 cores. When binding network IRQs to such a CCX, we can consistently reproduce a scenario in which over 50 iperf tasks pile up there.
Some ARM servers may suffer from the opposite problem of not having LLC domains at all because they don't expose L3 cache, also called SLC, to the kernel or support hyperthreading. wake_wide() and select_idle_sibling() rely on the existence of LLC domains to make smart decisions about wake affine and balancing load within an LLC domain. If there is no LLC domain, wake affine never happens and select_idle_sibling() won't try to look for an idle CPU within an LLC domain. In other words, a task will be woken up on its previous CPU even if it shares cache with the waker or the previous CPU is busy. Is this what we want? I don't think so, it's not optimal in some cases even if it helps in others. For instance, in our read-only YCSB workloads with static IRQ binding, always waking up on the previous CPU performs better on lightly loaded systems but worse on heavily loaded systems. So I think we should consider how to improve the use of LLC sched domains in the wakeup path on these architectures.
Several proposals have been tried to change the policy of the wakeup path regarding the selection of an idle CPU in the scheduler:
- Consider new topology levels
- Speed up and optimize the selection of idle cores and/or CPUs
- Better estimate how much effort is worth spending to look for an idle CPU
- and more
This talk will summarize the current ongoing proposals and discuss the best way to move forward.
CPU-intensive kthreads aren't generally accounted in the CPU controller, so they escape weight and bandwidth settings when they do work on behalf of a task group.
This is a problem in at least three places in the kernel. Padata multithreaded jobs (link1, link2, link3) may be started by a user task, so helper threads should be bound by the task's task group controls. Async memory reclaim (kswapd, cswapd) should be accounted to the cgroup that the memory belongs to, and similarly net rx should be accounted to the task groups of the corresponding packets being received. There are also general complaints from Android.
Each use case has its own requirements. In padata and reclaim, the task group to account to is known ahead of time, but net rx has to spend cycles processing a packet before its destination task group is known. Furthermore, the CPU controller shouldn't throttle reclaim or net rx in real time since both are doing high-priority work. These requirements make approaches that run kthreads directly in a task group, like cgroup-aware workqueues or a kernel path for CLONE_INTO_CGROUP, infeasible. The proposed solution of remote charging can accrue debt to a task group to be paid off (or forgiven) later, addressing both of these issues.
Prototype code has shown some of the ways this might be implemented and the tradeoffs between them. Here's hoping that an early discussion will help make progress on this longstanding problem (link1, link2, link3).
Sugov implements a rather simplistic concept of boosting I/O-bound
tasks, through tracking I/O wakeups reported on each CPU and adjusting a
synthetic boost value to potentially influence upcoming frequency changes.
The actual boost value depends on a number of different conditions, like
timings of the task wake-ups and CPU frequency update requests or the
CPUfreq policy.
This makes things rather fuzzy and exposes the following drawbacks, which might result in unexpected loss of I/O boost build-ups or undesired CPU frequency spikes:
1) Sugov does not differentiate between I/O boost request sources so it
can't detect multiple unrelated tasks that do have sporadic I/O wake-
ups.
2) As the boost value is being maintained per CPU, boost accumulated on
one CPU might be lost upon task migration.
3) There is no guarantee that the task(s) that did trigger the I/O boost
is/are still runnable on that CPU.
4) Relevant task uclamp restrictions are not being taken into account.
5) No notion of dependency on the actual device's performance and
throughput. I.e. boosting CPU frequency might turn out to be
pointless in case the device cannot cope with the increasing IO
request rate.
This presentation shows how these shortcomings could be solved, or at least mitigated, by moving from a per-core to a per-task I/O boost tracking implementation.
One of the most significant metrics for good user experience on a mobile
device is how quickly the system can react to load changes.
Util_est is used in mainline to create a more stable signal for per-task
demand, which is the maximum of the task util_est and PELT utilization
(known as the task utilization).
In case PELT utilization becomes higher than util_est, the behaviour of
that task is changing and it needs more resources than previously allocated.
The responsiveness of the task can be improved by boosting the task
utilization during this time beyond its PELT utilization.
This presentation describes an implementation of this idea and shows how
it improves behaviour on an Android device.
This year is the 30th anniversary of the Linux Kernel project, and for most of you the history of the Linux Kernel is well known. While this talk will honor much of that, it also hopes to bring in the histories of other projects that affected Linux and Computer Science, with recognition of the past, humor about the present, and a look toward the future.
The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.
This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.
XDP is designed for maximum performance, which is why certain driver use-cases are not supported (e.g. Jumbo frames or TSO/LRO). The single-buffer-per-packet design defines a simple and fast memory model and allows eBPF Direct Access (DA) to packet data; both are essential for performance. However, it is high time we fill the gap with the networking stack and enable non-linear frame support for XDP. There are multiple use-cases for XDP multi-buff, like TSO/GRO, Jumbo frames, or packet header split across multiple buffers. In this talk, we will present our design for non-linear frames in XDP, with the objective of supporting TSO/GRO and Jumbo frames while, at the same time, not slowing down the single-buffer-per-packet use case.
BPF programs are critical system components performing core networking functionality, system audit logs, tracing, and runtime security enforcement to list a few. Charged with such crucial tasks, how do we audit the BPF subsystem itself to ensure system bugs are noticed and malicious attackers can not subtly manipulate these components, inject new programs, or quietly run their own BPF programs?
One proposal is to sign BPF programs following the model used for years to sign kernel modules. In this model the BPF programs loaded into the kernel are signed and verified to ensure only authorized programs can be loaded. Although we support such efforts, we believe they are insufficient to actually provide meaningful security guarantees. Unlike kernel modules of old, BPF programs tend to have a control plane that is tightly coupled with the application, so signing the BPF program only covers the most obvious attacks.
In this talk, using real-world examples, we show that signing BPF programs provides only minimal improvement over the current model. Instead, we propose a robust audit system and enforcement of BPF system calls to ensure access to critical control paths is controlled and that loaders of said programs are known. Further, by scanning programs at load time we can apply fine-grained permissions, e.g. only allowing specific BPF helpers and maps to be exposed in the targeted file systems. Finally, by doing runtime auditing and enforcement we can provide fine-grained per-user policies based on the trustworthiness of the user. How do we propose to build such a platform? Using BPF of course! By loading a core set of BPF programs in early boot, we will show how to implement the proposed model.
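As a small, hedged illustration of the "audit and enforce with BPF itself" idea (a sketch, not the proposed platform), an LSM program attached to the bpf() security hook can observe every BPF command and could reject loads from untrusted callers:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("lsm/bpf")
    int BPF_PROG(audit_bpf, int cmd, union bpf_attr *attr, unsigned int size)
    {
        u32 uid = bpf_get_current_uid_gid() & 0xffffffff;

        bpf_printk("bpf() cmd=%d uid=%u", cmd, uid);

        /* Policy could go here: e.g. return -EPERM for BPF_PROG_LOAD from
         * callers that are not on an allow-list. */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";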
Rust is becoming an increasingly popular choice as a systems programming language. In fact, it's been the #1 most loved language on Stack Overflow for the last 6 years. Aside from being fast, type safe and memory safe, its tooling is excellent which yields high developer productivity. It has been used to write embedded systems software, it is central to the WebAssembly ecosystem, and it is very close to being used inside the Linux Kernel.
eBPF offers many exciting possibilities, but getting started developing eBPF programs is hard. While there are many eBPF libraries that target writing userspace applications in most popular programming languages, very few of them also seek to improve the experience of writing, building and debugging the eBPF program itself.
Aya is an eBPF library built for exactly this purpose. Using Aya, we seek to improve the eBPF developer experience with Rust!
In this talk, we will demonstrate how Aya can be used to quickly develop an eBPF application as well as covering plans for new features to further improve the experience.
Learn how to:
- Quickly start a new eBPF Project
- Compile Rust programs to eBPF bytecode
- Generate bindings to kernel types using BTF
- Allow seamless sharing of code between eBPF and user space
- Load eBPF programs from user-space and interact with maps
bpftrace was originally designed with a dynamic compilation model. While that model has worked fairly well, new developments and concerns in the eBPF ecosystem have prompted re-evaluation of the original design.
First, BPF is becoming more widely used so performance is more of a concern. Running LLVM to generate bytecode on the fly is somewhat costly, especially for bpftrace-enabled data collection in production. Binary weight is also an issue because bpftrace currently ships with LLVM and clang libraries.
Second, support for signed BPF programs is making its way into the kernel in response to security concerns. bpftrace is not immune to those security concerns, so bpftrace must have an answer as well if it is to remain relevant in more secure environments.
Finally, CO-RE, the building blocks for portable BPF programs, has become production ready and is shipping in many distros. bpftrace can build on top of these pieces to deliver ahead-of-time compiled bpftrace programs. These AOT programs will be faster to run and smaller in binary size.
This talk will go into the ongoing work to enable AOT bpftrace programs. There are a lot of moving pieces because the existing code has made broad assumptions about the compilation model. Hopefully by the end of this talk, participants will have a better idea about what work has been accomplished, what remains to be done, and what unsolved issues still need to be resolved.
The DSA subsystem was originally built around Marvell devices, but has since
been extended to cover a wide variety of hardware with even wider views of
their management model. This presentation discusses the changes in DSA over
the last years that let this wide variety of switches offer more services,
in a more uniform way, to the larger network stack.
Summarized, these changes are:
Acknowledging switches which only have DSA tags for control plane packets,
and modifying the bridge driver to accept termination of data plane packets
from these switches.
Support for unoffloaded upper interfaces.
Support for more cross-chip topologies than the basic daisy chain, while
maintaining the basic principle that network interfaces under the same bridge
can forward from one to another, and interfaces under different bridges
don't.
The original DSA architecture of exposing one virtual network interface for
each front-facing switch port, and not exposing virtual network interfaces for
the ports facing inwards (CPU ports, DSA/cascade ports) has remained unchanged
to this day. DSA network interfaces should not only be conduit interfaces for
retrieving ethtool statistics and registering with the PHY library, but they
should be fully capable of sending and receiving packets. This is accomplished
via the DSA tagging protocol drivers, which are hooks in the RX and TX path of
the host Ethernet controller (the DSA master) which multiplex and demultiplex
packets to/from the correct virtual switch interface based on switch-specific
metadata that is placed in the packets.
In this model, the basic function of a network switch from a hardware
designer's perspective, which is to switch packets, is an optional feature from
the Linux network stack's perspective, and was added years after the original
design had been established.
Behind the seemingly uniform implementation of DSA tagging protocols and switch
drivers, which are tightly managed by the DSA framework, lie many differences
and subtleties that make the feature set exposed by two different DSA switches
to the network stack very different.
The majority of network switches capable of management have some sort of
distinction between the data plane packets and the control packets.
At the most basic level, control packets, which must be used for link-layer
control protocols like STP and PTP, have the ability to target a specific
egress port and to override its STP state (inject into a BLOCKING port). These
packets typically bypass the forwarding layer of the switch and the frame
analysis stage of the ingress (CPU) port and are injected directly into the
egress port. The implications are that metadata such as QoS class and VLAN ID
must be specified by the operating system driver directly as part of the DSA
tag, and that hardware address learning is not performed on the CPU port.
On the opposite side of the spectrum, data plane packets do not perform STP
state override, are subject to hardware address learning on the CPU port, but
also cannot be steered towards a precise destination port, since they are also
subject to the forwarding rules of the switch.
At the extreme, there exists a DSA_TAG_PROTO_NONE tagging protocol, which
admits defeat and does not attempt to multiplex/demultiplex virtual switch
interfaces from the DSA tag, and all network I/O through such a switch takes
place through the DSA master which is seen as a switched endpoint. The network
interfaces registered for the switches are only used for control operations
(ethtool, PHY library) and are "dead" to the network stack both for control
plane and for data plane packets. These are the "unmanaged" switches.
Finally, in some switch designs, injecting a control packet is an expensive
operation which cannot be sustained at line rate, and the bulk of the traffic
(the data plane packets) should be injected, from the hardware designer's
perspective, directly through the DSA master interface, with no DSA tag.
These are the "lightly managed" switches, and their virtual DSA interfaces are
similarly "dead" to the network stack except for link-local packets.
The most basic and common approach with this type of hardware is to simply set
up a user space configuration to perform the traffic termination from the
switching domain on the DSA master itself. For some packets to target a single
switch port, the user is required to install a bridge VLAN on the switch port
which is egress-tagged towards the CPU port, then create an 8021q upper with
the same VLAN ID on top of the DSA master, and send/receive traffic through the
8021q upper of the DSA master. This approach is, however, undesirable because
bridging DSA interfaces with non-DSA (foreign) interfaces is impossible, which
is an important use case for boards with a switch and a Wi-Fi AP (home routers).
Interfaces that are DSA masters cannot be added to a bridge either.
A slightly better integrated way of achieving the same result is the relatively
new software-defined DSA tagging protocol named tag_8021q, which can bring both
the lightly managed and unmanaged switches closer to the user model exposed by
DSA switches with hardware support for a DSA tagging protocol.
The tag_8021q protocol is fundamentally still sending data plane packets from
the perspective of the hardware, so there are things it cannot accomplish, like
STP state override. Additionally, the DSA framework has traditionally not
enforced any meaningful distinction between data plane and control plane
packets, since originally, the assumption was that all packets injected by the
software network stack should be control packets.
To unify the hardware and the software notions, and to use these chips in the
way they were meant to, the network stack must be taught about data plane
packets. The tag_8021q model breaks down when DSA switch interfaces offload a
VLAN-aware bridge, which is in fact their primary use case. This is because
the source port of the switch cannot be retrieved based on the VLAN ID by the
tagging protocol driver on RX, because the VLANs are under the control of the
bridge driver, not DSA, and there is no guarantee that a VLAN is uniquely
installed on a single switch port. So bridging with foreign interfaces becomes
equally impossible.
The decisive changes which made these switches correctly offload a VLAN-aware
bridge come in the form of not attempting to report a precise source port on RX
for data plane packets, just a plausible/imprecise one. As long as some
requirements inside the software bridge's ingress path are satisfied (valid STP
state, VLAN ID from the packet is in the port's membership list), the bridge is
happy to accept the packet as valid, and process it on behalf of the imprecise
DSA interface that was reported.
Complications arise due to the fact that the software bridge might learn the
MAC SA of these packets on a potentially wrong port, and deliver those packets
on the return path towards the wrong port. Additionally, due to bandwidth
constraints, DSA interfaces do not synchronize their hardware FDB with the FDB
of the software bridge, so the software bridge does not have an opportunity to
figure out the real source port of imprecise packets.
To give DSA the chance to right a wrong, the bridge driver was modified to
support TX forwarding offload. With this feature, the software bridge avoids
cloning an skb which needs to be flooded to multiple ports, and sends only one
copy of the packet towards a single network interface from each "hardware
domain" that the flooded packet must reach. The port driver is responsible with
looking up its hardware FDB for that packet and replicate the packet as needed.
This is a useful feature in itself, because with switches with a large port
count, multicast traffic on the bottlenecked link between the DSA master and
the CPU port is reduced, and packets are replicated inside the hardware.
But with the lightly-managed and unmanaged switches, it makes the imprecise RX
work correctly, since the TX is also imprecise. So even though the software
bridge did learn the MAC SA of the packets on the wrong source port, that
source port is in the same hardware domain as the right port, and even though
the software FDB is incorrect, the hardware FDB isn't. So DSA drivers for
lightly-managed and unmanaged switches have a chance to properly terminate
traffic on behalf of a VLAN-aware bridge, in a way that is compatible with
bridging with foreign interfaces, and with a user space interaction procedure
that is much more consistent with that of DSA drivers that always send and receive packets
with a DSA tag.
Recently, DSA has also gained support for offloading other virtual network
interfaces than the Linux bridge. These are the hsr driver (which supports the
HSR and PRP industrial redundancy protocols) and the bonding/team drivers
(which support the link aggregation protocol).
Not all switches are capable of offloading hsr and team/bonding, and DSA's
policy is to fall back to a software implementation when hardware offload
cannot be achieved: the bandwidth to/from the CPU is oftentimes good enough
for this to be practical.
However, DSA's policy could not be enforced right away with the expected
results, due to two roadblocks that led to further changes in the kernel code
base.
For DSA, not offloading an upper interface means that the physical port should
behave exactly as it would as a standalone interface: no switching to the other
ports except the CPU port, and capable of IP termination.
But when the unoffloaded upper interface (the software LAG) is part of a
bridge, the bridge driver makes the incorrect assumption that it is capable of
hardware forwarding towards all other ports which report the same physical
switch ID. Instead, forwarding to/from a software LAG should take place in
software. This has led to a redesign of the switchdev API, in that drivers must
now explicitly mark to the bridge the network interfaces that are capable of
autonomous forwarding; the new default being that they aren't. In the new
model, even if two interfaces report the same physical switch ID, they might
still not be part of the same hardware domain for autonomous forwarding as far as
the bridge is concerned.
The second roadblock, even after the bridge was taught to allow software
forwarding between some interfaces which have the same physical switch ID, was
FDB isolation in DSA switches. Up until this point, the vast majority of DSA
drivers, as well as the DSA core, have considered that it is enough to offload
multiple bridges by enforcing a separation between the ports of one bridge and
the ports of another at the forwarding level. This works as long as the same
MAC address (or MAC+VLAN pair, in VLAN-aware bridges) is not present in more
than one bridging domain at the same time. This is an apparently reasonable
restriction that should never be seen in real life, so no precautions have been
taken against it in drivers or the core.
The issue is that a DSA switch is still a switch, and for every packet it
forwards, regardless of whether it is received on a standalone port, a port
under a VLAN-unaware bridge or under a VLAN-aware one, it will attempt to look
up the FDB to find the destination. With unoffloaded LAGs on top of a
standalone DSA port, where forwarding between the switched domain and the
standalone port takes place in software, the expectation that a MAC address is
only present in one bridging domain is no longer true. From the perspective of
the ports under the hardware bridge, a MAC address might come from the outside
world, whereas from the perspective of the standalone ports, the same MAC
address might come from the CPU port. So without FDB isolation, the standalone
port might look up the FDB for a MAC address and see that it could forward the
packet directly to the port in the hardware bridge domain, where that packet
was learned by the bridge port, short-circuiting the CPU. But the forwarding
isolation rules put in place will prevent this from happening, so packets will
be dropped instead of being forwarded in software.
Individual drivers have started receiving patches for FDB isolation between
standalone ports and bridged ports, but it is possible to conceive of real-life
situations where even FDB isolation between one bridge and another must be
maintained. Since the DSA core does not enforce FDB isolation through its API,
and many drivers have already been written without it in mind, it is to be
expected that many years will pass before DSA offers a uniform set of services to
upper layers in this regard.
Traditionally, the cross-chip setups supported by DSA have been daisy chains,
where all switches except the top-most one lack a dedicated CPU port, and are
simply cascaded towards an upstream switch. There are two new switch topologies
supported by DSA now.
The first is the disjoint tree topology. A DSA tree is comprised of all
switches directly connected to each other which use a compatible tagging
protocol (one switch understands the packets from the other one, and can
push/pop them as needed). Disjoint trees are used when DSA switches are
connected to each other, but their tagging protocols are not compatible.
Instead of one switch understanding the other's tags, tag stacking takes place, so
in software, more than one DSA tagging protocol driver needs to be invoked for
the same packet. In such a system, each switch forms its own tree. Disjoint
trees were already supported, but the new changes also permit some hardware
forwarding to take place between switches belonging to different trees. For
example, consider an embedded 5-port DSA switch that has 3 external DSA
switches connected to 3 of its ports. Each embedded DSA switch interface is a
DSA master for the external DSA switch beneath it, and there are 4 DSA disjoint
trees in this system. For a packet to be sent from external switch 1 to
external switch 2, it must be forwarded towards the CPU port. In the most basic
configuration, forwarding between the two external switches can take place in
software. However, it is desirable that the embedded DSA switch that is a
master of external switches 1 and 2 can accelerate the forwarding between the
two (because the external switches are tagger-compatible, they are just
separated by a switch which isn't tagger-compatible with them). Under some
conditions, this is possible as long as the embedded DSA switch still has some
elementary understanding of the packets, and can still forward them by MAC DA
and optionally VLAN ID, even though they are DSA-tagged. With the vast majority
of DSA tagging protocols, the MAC DA of the packets is not altered even when a
DSA tag is inserted, so the embedded DSA master can sanely forward packets
between one external switch and another. This is one of the only special cases
where DSA master interfaces can be bridged (they are part of a separate bridge
compared to the external switch ports), because in this case, the DSA masters
are part of a bridge with no software data plane, just a hardware data plane.
The second requirement is for both the embedded and the external switches to
have the same understanding of what constitutes a data plane packet, and what
constitutes a control plane packet: STP packets received by the external switch
should not be flooded by the embedded switch. Due to the same reason that the
embedded switch must still preserve an elementary understanding of the MAC DA
of packets tagged with the external switch's tagging protocol, this will also
be the case, since typical link-layer protocols have unique link-local
multicast MAC addresses.
The second is the H tree topology. In such a system, there are multiple
switches laterally interconnected through cascade ports, but to reach the CPU,
each switch has its own dedicated CPU port. It turns out that to support such a
system, there are two distinct issues.
First, with regard to RX filtering, an H tree topology is very similar in
challenges to a single switch with multiple CPU ports. Hardware address
learning on the CPU port, if at all available, is of no use and leads to
addresses bouncing and packet drops. All MAC addresses which need to be
filtered to the host need to be installed on all CPU ports as static FDB
entries. This has led to the extension of the bridge switchdev FDB notifiers to
cover FDB entries that are local to the bridge, and which should not be
forwarded.
Secondly, in an H topology it is actually possible to have packet loops with
the TX forwarding offload feature enabled, because TX data plane packets sent
by the stack to one switch might also be flooded through the cascade port to
the other switch, where they might be again flooded to the second switch's CPU
port, where they will be processed as RX packets. Currently, drivers which
support this topology need to be individually patched to cut RX from cascade
ports that go towards switches that have their own CPU port, because the DSA
driver API does not have the necessary insight into driver internals as to be
able to cut forwarding between two ports only in a specific direction.
Among the most important features still absent from DSA are support for
multiple CPU ports, the ability to dynamically change DSA masters and the
option to configure the CPU ports in a link aggregation group. However, with
many roadblocks such as basic RX filtering support now out of the way, this
functionality will arrive sooner rather than later.
There is also the emerging topic of Ethernet controllers as DSA masters that
are aware of the DSA switches beneath them, which is typical when both the
switch and the Ethernet controller are made by the same silicon vendor.
Right now DSA can freely inherit all master->vlan_features, such as TX
checksumming offloads, but this does not work for all switch and DSA master
combinations, so this must be refined so that only known-working master and
switch combinations inherit the extra features.
On the same topic of DSA-aware masters, SR-IOV capable masters are expected to
still work when attached to a DSA switch, but the network stack's model of this
use case is unclear. VFs on top of a DSA master should be treated as switched
endpoints, but the VF driver's transmit and receive procedures do not go
through the DSA tagging protocol hooks, and these packets are therefore
DSA-untagged. So hardware manufacturers have the option of inserting DSA tags
in hardware for packets sent through a VF that goes through a DSA switch. It is
unclear, however, according to which bridging domain these VFs are being
forwarded. An effort should be made to standardize the way in which the network
stack treats these interfaces. It appears reasonable that DSA switches might
have to register virtual network interfaces that are facing each VF of the
master, in order to enforce their bridging domain, but this makes the DSA
master and switch drivers closely coupled.
On the other hand, letting other code paths than the DSA tagging protocol
driver inject packets into the switch risks compromising the integrity of the
hardware, which is an issue that currently exists and needs to be addressed.
In conclusion, taming DSA switches and making them behave completely in
accordance with the network stack's expectations proves to be a much more
ambitious challenge than initially foreseen, so the fight continues.
The Confidential Computing microconference focuses on solutions for using state-of-the-art encryption technologies for live encryption of data, and on how to utilize the technologies from AMD (SEV), Intel (TDX), s390 and ARM Secure Virtualization for secure computation of VMs, containers and more.
The Intel Trust Domain Extension (TDX) technology extends VMX and MKTME to enhance guest data security by isolating guests from host software, including VMM/hypervisor. Live migration support for such isolated guests (i.e. TDs) facilitates the deployment of TD guests in the cloud.
This talk presents the QEMU/KVM design of TDX live migration and initial PoC results for the migration performance evaluation. A common framework is added to the QEMU migration code to support TD guests and other similar technologies (e.g. SEV guests). For TD guest live migration, the guest shared memory pages are migrated in plaintext. The guest private memory pages, vCPU states and TD scope states are encrypted via a migration key when they are exported by KVM from the TDX module. A migration stream in the workflow has a KVM device created for it, and the device creates shared memory between KVM and the QEMU migration thread to transport the encrypted guest state.
Discussion on Live Migration of AMD SEV encrypted VMs.
Link to the latest posted (KVM) patch for SEV live migration:
https://lore.kernel.org/lkml/cover.1623174621.git.ashish.kalra@amd.com/
Discussions on Guest APIs, specifically if the APIs can cover both
AMD SEV and Intel TDX platforms and exploring common interfaces
which can be re-used for both the above platforms, for example,
exploring a common hypercall API interface, with reference
to the posted KVM patch-set.
Link to related discussion on the same topic:
https://lore.kernel.org/lkml/YJv5bjd0xThIahaa@google.com/
SEV Live Migration Acceleration uses an alternative migration
approach relying on a Migration Helper (MH) running in guest
context. Fast migration for encrypted virtual
machines typically uses a Migration Handler that lives in OVMF.
As part of this microconference, we can have additional
discussions on the design and development of the MH, especially
the suggested approach of using the KVM/Qemu Mirror VM concept to
run the MH in a Mirror VM/vCPU which runs in parallel to the
primary encrypted VM in the same Qemu process context.
Link to the posting for the above on the KVM and Qemu development
lists: https://lore.kernel.org/lkml/SN6PR12MB276727DE9AFD387B11C404418E3E9@SN6PR12MB2767.namprd12.prod.outlook.com/
Intel TDX is an upcoming confidential computing platform for running encrypted guests on untrusted hosts on Intel servers. It requires paravirtualization to do any required emulation inside the guest. There are some unique challenges, in particular in hardening the Linux guest code against untrusted host input through MMIO, port and other I/O, which is a new security challenge for Linux. The guest has to "accept" all memory, and to get acceptable boot performance this acceptance has to be done lazily. We'll give an overview of the current TDX status, talk about the challenges, and hope for a good discussion.
Debug Support for AMD SEV Encrypted VMs.
Discussion on QEMU debug support for memory encrypted guests like AMD SEV/Intel TDX.
Debug requires access to the guest pages, which are encrypted when SEV/TDX is enabled.
Discussion on exploring common interfaces which can be re-used for both
AMD SEV and Intel TDX platforms with regard to encrypted guest memory access for
debug in Qemu.
Latest posted patches on qemu-devel list from the Intel TDX team:
[RFC][PATCH v1 00/10] Enable encrypted guest memory access in QEMU
https://lore.kernel.org/qemu-devel/20210506014037.11982-1-yuan.yao@linux.intel.com/
Link to the last posted patch-set from AMD:
https://lore.kernel.org/qemu-devel/cover.1605316268.git.ashish.kalra@amd.com/
Original discussion thread on the qemu-devel list:
https://lore.kernel.org/qemu-devel/20200922201124.GA6606@ashkalra_ubuntu_server/
Nowadays, containers are a private and public cloud commodity. Isolating and protecting containerized workloads not only from each other but also from the infrastructure owner is becoming a necessity.
In this presentation we will describe how we’re planning to use confidential computing hardware implementations to build a confidential containers software stack. By combining the hardware encryption and attestation capabilities that these new ISAs provide, the proposed software architecture aims at protecting both container data (downloaded from container image registries and generated at runtime) and code from being seen or modified by cloud providers and owners.
As the Kata Containers project already uses hardware virtualization to provide a stronger container isolation layer, we will first explain why and how we want to use the Kata runtime and agent as the foundation for running confidential containers.
Then we will look at the container specific requirements that we need to take into account for building that software stack. Short boot times, low memory footprint or the inherently dynamic, ephemeral and asynchronous nature of container workloads are some of the technical challenges that we’re facing when it comes to running confidential containers. The final part of the talk will go through some of the technical solutions we’re building to address those challenges. In particular we will speak about:
Transparent memory and CPU state encryption: As the Kata runtime can run on top of heterogeneous nodes running different confidential computing implementations (TDX, SEV, etc), we have to build a small framework for transparently enabling the underlying encryption technology whenever a confidential container is scheduled on a given node.
Container image service offload: The entire container software ecosystem assumes container image layers can be downloaded, unpacked and mounted from the host itself. This obviously breaks the confidential computing security model and that brings the need for offloading at least part of the container image management from the host to the guest.
As confidential computing gains traction, several technologies that are based on a secure hypervisor are emerging.
Besides SEV (AMD), PEF (Power), and TDX (Intel), IBM Z's Secure Execution enables running a guest that even an administrator cannot look into or tamper with.
At the same time, it becomes desirable to run an OCI container workload in a secure context.
The Kata Containers runtime is based on VMs and thus, Secure Execution can be leveraged.
Initially, Kata had the goal of protecting the host from malicious guests, but the reverse approach, protecting the guest from the host, is now being discussed and worked on; some patches have landed, but other patches are required in adjacent projects like containerd.
I work in IBM's Linux on Z department, enabling Kata on the IBM Z and LinuxONE platform, including Secure Execution.
I propose a talk where first, a general overview of Secure Execution is given: what the threat and security models are and how a user would go about running a protected workload.
This helps the audience learn about a confidential computing solution that is distinct from discussed x86 approaches, in that images to be launched in Secure Execution are encrypted and can only be decrypted in a secure context, as opposed to x86 firmware attestation approaches.
It is then described how Secure Execution maps to the challenges in confidential computing including Kata and Kubernetes, concerning the need to control and provide certain resources from the host.
Note: Samuel Ortiz of Apple has also proposed to speak about general confidential computing challenges in Kata in this microconference.
Even though I will introduce Kata and confidential computing so that the talk makes sense on its own, it is probably better if I speak after him.
We’ll enumerate pain points that we’ve encountered in deploying (or trying to deploy) Linux CVMs on Google’s public cloud, called Google Compute Engine (GCE), which is built on top of Linux. Example pain points include RMP violations crashing host machines, kexec and kdump not working on SNP-enabled hosts, guest kernel SWIOTLB bugs, incomplete/lacking test infrastructure, and more! Then, as a group, we can see what problems are interesting to the wider community, and discuss how to prioritize them.
Attestation is an important step in the setup of a confidential enclave in a public cloud environment. Through this process a guest owner can externally validate the software being run in their enclave before any confidential information is exposed. In this talk, we discuss the design and challenges of measuring and validating a guest enclave, and safely injecting guest owner secrets into the enclave. Our discussion will focus on the AMD SEV architectures (SEV, SEV-ES, and SEV-SNP) and how their hardware-enforced attestation and pre-attestation procedures map onto the deployment of guest VMs and confidential containers (i.e., Kata Containers).
By attending this talk, you will gain an understanding of the attestation and measurement features of the AMD SEV architectures, as well as the challenges of doing attestation for confidential VMs and containers/Pods in a public cloud. In addition, we will overview other attestation approaches such as those of Intel TDX, SecureBoot, and other software-based techniques.
Confidential Computing can enable several use-cases which rely on the ability to run computations remotely on sensitive data securely without having to trust the infrastructure provider. One required building block for this is verifiable control flow integrity on the remote machine: ensuring that the running compute is doing what it's supposed to.
With hand-written Intel SGX the code surface is usually limited, but with a fully fledged VM it becomes more difficult. One example for this is securing the control flow of the VM's boot process:
In recent years we have seen multiple projects securing the boot process of confidential VMs (cVMs) by allowing them to boot from encrypted disk images. These approaches usually rely on the injection of a secret during the boot process. While this enables use-cases like hiding the raw disk image from the platform provider (i.e. the entity running the hardware and hypervisor stack), we are not capable of creating hardware-backed proofs of the measurement (i.e. hash) of the code (and data) which is being executed on that cVM, also called remote attestation.
Modern x86 extensions (like AMD SEV-SNP, Intel TDX) allow measurement of the initial boot image before VM startup and cryptographically bind this measurement in a remotely verifiable attestation document.
The question that now arises is how much code surface the initial measurement should contain and if existing firmwares/bootloaders can be used securely. Taking AMD SEV-SNP as reference hardware we implemented two working proof-of-concepts:
a) Minimal firmware based on existing software: Here we leveraged the work which has been done on OVMF and grub for enabling booting of encrypted images. In that case the attested firmware only consists of the OVMF firmware and the grub bootloader. We extended grub to perform a measurement of the operating system (OS) image during loading and assert the measurement against the known good value baked into the attested firmware. In our case the verified Linux image is an EFI unified kernel image, which allows us to cryptographically bind the kernel image as well as the initramfs image and the kernel command line. This approach adds OVMF and grub to the audit surface and makes it hard to provide control flow guarantees. For example, without extra hardening the OVMF EFI shell or the grub shell can easily be used to load a malicious OS image.
b) Custom firmware with OS embedding: In that case the attested firmware also contains the entire operating system. Using the Rust Hypervisor Firmware as a basis we added support for linking a disk blob into the firmware. This is done by a virtual "block device" reading from a known in-memory location. With that we get a single measurement over the entire software stack, including the operating system (and potential applications/data). A caveat here is that so far we rely on an ELF binary being loaded by the VMM (QEMU) and not a flat ROM image. Hence, the attested measurement of the hardware will deviate from the direct hash of the firmware file being loaded, requiring some extra steps to verify the attestation. Also, measuring a large firmware might be time consuming on the slow Secure Processor (found in AMD SEV-SNP).
The Microconference topic we are proposing would consist of:
The File system microconference focuses on a variety of file system related topics in the Linux ecosystem. Interesting topics include enabling new features in the file system ecosystem as a whole, interface improvements, interesting work being done, and really anything related to file systems and their use in general. Oftentimes file system people create interfaces that are slow to be used, or get used in new and interesting ways that we did not think about initially. Having these discussions with the larger community will help us work towards more complete solutions and happier developers and users overall.
Files are currently managed in PAGE_SIZE units. As DRAM and storage capacities increase, the overhead of managing all these pages becomes more significant. The memory folio patchset lets us cache files in larger units.
In this session, we shall discuss:
File ownership is a global property on most systems that have a uid and gid concept. On POSIXy systems the chown*() syscall family allows changing the owner of a file or directory. If the ownership of a file is changed, it will be changed globally, affecting each user on the system equally. But various use-cases exist where this can be problematic:
- Portable home directories that are used on different computers where the user is assigned a different uid and gid.
- Filesystems that allow merging or unionizing multiple filesystems are often shared between different users.
- Containers making use of user namespaces also affect file ownership.
- Avoiding the cost of recursive ownership changes.
Idmapped mounts solve these problems and others by allowing mounts to change file ownership mappings. In this talk we will take a look at how idmapped mounts work, outline the work we've done and what is still left to do, and discuss potential new ideas to make this an even more powerful concept.
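For readers who have not seen the API, the sketch below shows roughly how a mount can be idmapped from userspace with open_tree(), mount_setattr() and move_mount(). It is an illustration written for this summary, not material from the talk: it assumes Linux 5.12 or newer, a user namespace that already carries the desired ID mapping (its fd taken from /proc/<pid>/ns/user), trims error handling, and repeats a few uapi constants in case the installed headers are older.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/types.h>

    /* Constants from <linux/mount.h> and numbers from the shared syscall
     * table, repeated here in case the installed headers are old. */
    #ifndef OPEN_TREE_CLONE
    #define OPEN_TREE_CLONE         1
    #endif
    #ifndef MOUNT_ATTR_IDMAP
    #define MOUNT_ATTR_IDMAP        0x00100000
    #endif
    #ifndef MOVE_MOUNT_F_EMPTY_PATH
    #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
    #endif
    #ifndef AT_EMPTY_PATH
    #define AT_EMPTY_PATH           0x1000
    #endif
    #ifndef AT_RECURSIVE
    #define AT_RECURSIVE            0x8000
    #endif
    #ifndef SYS_open_tree
    #define SYS_open_tree           428
    #define SYS_move_mount          429
    #define SYS_mount_setattr       442
    #endif

    struct mount_attr_compat {      /* layout of the uapi struct mount_attr */
        __u64 attr_set;
        __u64 attr_clr;
        __u64 propagation;
        __u64 userns_fd;
    };

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <source> <target> </proc/PID/ns/user>\n",
                    argv[0]);
            return 1;
        }

        /* A user namespace that already carries the desired uid/gid mapping. */
        int userns_fd = open(argv[3], O_RDONLY | O_CLOEXEC);
        if (userns_fd < 0) { perror("open userns"); return 1; }

        /* Detach a private copy of the mount tree rooted at the source path. */
        int tree_fd = syscall(SYS_open_tree, AT_FDCWD, argv[1],
                              OPEN_TREE_CLONE | AT_RECURSIVE | O_CLOEXEC);
        if (tree_fd < 0) { perror("open_tree"); return 1; }

        /* Apply the user namespace's ID mapping to this detached mount. */
        struct mount_attr_compat attr = {
            .attr_set  = MOUNT_ATTR_IDMAP,
            .userns_fd = userns_fd,
        };
        if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH | AT_RECURSIVE,
                    &attr, sizeof(attr)) < 0) {
            perror("mount_setattr");
            return 1;
        }

        /* Attach the now-idmapped mount at the target location. */
        if (syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, argv[2],
                    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
            perror("move_mount");
            return 1;
        }
        return 0;
    }

The same files then appear under the target mount with ownership translated through the user namespace's mapping, without any recursive chown.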
I would like to chair a session at LPC to discuss atomic file writes for userspace applications. Do we want to expose such a capability to programs, and if so, how?
I propose that filesystem implementations provide a general-purpose interface in software. As proposed, the FIEXCHANGE_RANGE system call requires the ability to exchange the contents of two files, with a promise that once we commit to the exchange, it must either succeed completely or fail completely.
Atomic file writes can be performed by creating a temporary file, cloning the contents, making arbitrary updates to the temporary file, and calling FIEXCHANGE_RANGE to commit the changes. There are no restrictions on length, number of updates, etc.
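A rough sketch of that staged-update pattern is shown below. The FICLONE ioctl used for the cloning step is an existing interface; the final commit step is represented by a hypothetical exchange_ranges() placeholder, since the exact FIEXCHANGE_RANGE interface is what this discussion is about.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>               /* FICLONE */

    /* Placeholder for the proposed atomic exchange-and-commit operation. */
    static int exchange_ranges(int fd1, int fd2)
    {
        (void)fd1; (void)fd2;
        /* Would invoke the proposed FIEXCHANGE_RANGE interface here. */
        return 0;
    }

    int main(void)
    {
        int orig = open("data.db", O_RDWR);
        if (orig < 0) { perror("open"); return 1; }

        /* 1. Create an anonymous temporary file on the same filesystem. */
        int tmp = open(".", O_TMPFILE | O_RDWR, 0600);
        if (tmp < 0) { perror("O_TMPFILE"); return 1; }

        /* 2. Share the original file's blocks with the temp file (reflink). */
        if (ioctl(tmp, FICLONE, orig) < 0) { perror("FICLONE"); return 1; }

        /* 3. Make arbitrary updates to the temporary file. */
        const char update[] = "new record";
        if (pwrite(tmp, update, sizeof(update), 4096) < 0) {
            perror("pwrite");
            return 1;
        }

        /* 4. Atomically exchange contents: readers of data.db see either the
         *    old contents or the fully updated contents, never a mix. */
        if (exchange_ranges(orig, tmp) < 0) { perror("exchange"); return 1; }

        close(tmp);
        close(orig);
        return 0;
    }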
The ability to exchange the contents of files atomically is a requirement for online repair of XFS metadata; upon finishing the functionality I realized that we could expose it to userspace to provide atomic file updates.
NOTE: This is a separate topic from enabling userspace to access hardware atomic writes. That is a simple matter of making the advertised device capabilities (and alignment/size restrictions) discoverable and adding a flag to io_uring/pwritev2 for directio writes.
File system shrink allows a file system to be reduced in size by some specified number of blocks, as long as the file system has enough unallocated space to do so. This operation is currently unsupported in XFS. Though a file system can be backed up and recreated in smaller sizes, this is not functionally the same as an in place resize. Implementing this feature is costly in terms of developer time and resources, so it is important to consider the motivations to implement this feature. This talk would aim to discuss any user stories for this feature. What are the possible cases for a user needing to shrink the filesystem after creation, and by how much? Can these requirements be satisfied with a simpler mkfs option to back up an existing file system into a new but smaller filesystem? In the case of creating a rootfs, will a protofile suffice? If the shrink feature is needed, we should further discuss the APIs that users would need.
Beyond the user stories, it is also worth discussing implementation challenges. Reflink and parent pointers can assist in facilitating shrink operations, but is it reasonable to make them requirements for shrink? Gathering feedback and addressing these challenges will help guide future development efforts for this feature.
The focus of this session is on mitigating the effects of unreliable storage devices. This author works at a cloud vendor (as is fashionable now), and one of the large story arcs of the past few years has been that storage devices do not seem as reliable as we thought even a few years ago.
Specifically, I've observed that as the world moves from direct-attached spinning rust to software-defined storage on cheap devices, we increasingly must deal with large devices that corrupt data, temporarily stop responding (due to problems on the network/control plane/hypervisor/whatever), or have some odd means to request re-reads.
XFS sort of mitigates some of these problems by enabling sysadmins to configure its response to certain kinds of hardware errors (mostly EIO and ENOSPC). Other filesystems lack these control knobs; how might we standardize them? The block layer has some retry capabilities, but no filesystems touch them. We don't have a general corrupted-read retry mechanism, and have not succeeded in adding one.
So what I want to know is: Who cares? Are sysadmins and users happy with the current patchwork? Do they accept the defaults? Would they like more control or better communication between layers?
This is a BOF for people to get together to discuss unresolved issues in the community and to talk about the roadmap for new feature development and ongoing technical debt payoff. We have not had such a forum since LSFMM in 2018.
Roadmap topics include:
This forum is open to all filesystem developers, though the focus is very obviously on XFS.
In this talk we present an overview of gprofng, a next generation profiling tool for Linux.
This profiler has its roots in the Performance Analyzer from the Oracle Developer Studio product. Gprofng, however, is a standalone tool that specifically targets Linux. It includes several tools to collect and view the performance data. Various processors from Intel, AMD, and Arm are supported.
The focus is on applications written in C, C++, Java, and Scala. For C/C++ we assume gcc has been used to build the code. In the case of Java and Scala, OpenJDK and compatible implementations are supported.
Among other differences from the widely known gprof tool, gprofng offers full support for shared libraries and multithreading using POSIX Threads, OpenMP, or Java Threads.
Unlike gprof, gprofng can also be used when the source code of the target executable is not available. Another difference is that gprofng works with unmodified executables. There is no need to recompile or instrument the code. This ensures that the profile reflects the actual run-time behaviour and conditions of a production run.
After the data has been collected, the performance information can be viewed at the function, source, and disassembly level. Individual thread views are supported as well. Through command line options, the user specifies the information to be displayed. In addition to this, a simple yet powerful scripting feature can be used to produce a variety of performance reports in an automated way. This may also be combined with filters to zoom in on specific aspects of the profile.
One of the very powerful features of gprofng is the ability to compare two or more profiles. This allows for an easy way to spot regressions for example.
In the talk, we start with a description of the architecture of the gprofng tools suite. This is followed by an overview of the various tools that are available, plus the main features. A comparison with gprof will be made and several examples are presented and discussed. We conclude with the plans for future developments. This includes a GUI to graphically navigate through the data.
This talk will discuss the methods used in constructing the recent improvement in complex divide in libgcc, where the gross error rate dropped from more than 1 per 100 tests to less than 1 per 10 million tests. The change in accuracy is platform independent, while the modest performance loss varies with platform. We also discuss remaining flaws and likely areas for further reducing the remaining small errors.
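For background (this is the classic textbook illustration, not necessarily the exact method adopted in libgcc), the accuracy problem comes from the naive formula for (a+bi)/(c+di), whose intermediate products overflow or underflow long before the true quotient does; scaling approaches such as Smith's method avoid the worst cases.

    #include <math.h>
    #include <stdio.h>

    /* Textbook complex division: the intermediate c*c + d*d can overflow or
     * underflow even when the true quotient is perfectly representable. */
    static void divide_naive(double a, double b, double c, double d,
                             double *re, double *im)
    {
        double denom = c * c + d * d;
        *re = (a * c + b * d) / denom;
        *im = (b * c - a * d) / denom;
    }

    /* Smith's (1962) scaling: divide through by the larger of |c| and |d|
     * first, keeping intermediates close to the magnitude of the result. */
    static void divide_smith(double a, double b, double c, double d,
                             double *re, double *im)
    {
        if (fabs(c) >= fabs(d)) {
            double r = d / c, t = 1.0 / (c + d * r);
            *re = (a + b * r) * t;
            *im = (b - a * r) * t;
        } else {
            double r = c / d, t = 1.0 / (c * r + d);
            *re = (a * r + b) * t;
            *im = (b * r - a) * t;
        }
    }

    int main(void)
    {
        /* True quotient is exactly 1 + 0i, but the naive denominator
         * overflows to infinity. */
        double re, im;
        divide_naive(1e200, 1e-200, 1e200, 1e-200, &re, &im);
        printf("naive: %g %+gi\n", re, im);   /* nan +0i */
        divide_smith(1e200, 1e-200, 1e200, 1e-200, &re, &im);
        printf("smith: %g %+gi\n", re, im);   /* 1 +0i  */
        return 0;
    }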
The malloc library provided by glibc offers considerable flexibility in deciding when to use mmap for larger allocations and when to use sbrk/trim. The default settings for the decision thresholds are reasonable for many applications. Three tunables are available to adjust these settings. The limits on these settings have not been changed since 2006. Server class systems now have much more memory available, and other performance tradeoffs have changed dramatically in the last 15 years. We propose significant increases to the limits on the MALLOC_MMAP_THRESHOLD_ tunable (current default: 128K; current maximum: 32M). This change will not affect existing usage while allowing select applications to improve their malloc performance, sometimes dramatically.
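For context, applications can already adjust the threshold at run time with the plain glibc mallopt() interface (or via the MALLOC_MMAP_THRESHOLD_ environment variable). The small sketch below asks for a 64 MiB threshold, which under the current 32M cap described above is expected to be rejected or ignored; lifting that cap is precisely what the proposal is about.

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Ask glibc to keep allocations below 64 MiB on the heap instead of
         * using per-allocation mmap().  With the current 32 MiB cap on the
         * tunable, a request this large is expected not to be honored. */
        if (mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024) == 0)
            fprintf(stderr, "mallopt(M_MMAP_THRESHOLD): request not accepted\n");

        /* Under the default 128 KiB threshold this allocation would be
         * mmap()ed and unmapped again on free(); with a raised threshold it
         * can be reused from the heap, which is where the speedups come from. */
        void *big = malloc(48 * 1024 * 1024);
        if (!big)
            return 1;
        free(big);
        return 0;
    }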
On systems with copy relocation:
- A copy in the executable is created at run-time by ld.so for a definition in a shared library.
- The copy is referenced by the executable and shared libraries.
- The executable can access the copy directly.
Issues are:
- The overhead of a copy, in time and space, may be visible at run-time.
- Read-only data in the shared library becomes a read-write copy in the executable at run-time.
- Local access to data with the STV_PROTECTED visibility in the shared library must use the GOT.
On systems without function descriptors, function pointers vary depending on where and how the functions are defined.
Issues are:
- The address of the function body may not be used as its function pointer.
- ld.so needs to search loaded shared libraries for the function pointer of a function with STV_PROTECTED visibility.
Here is a proposal to remove copy relocations and use canonical function pointers:
1. Accesses, including in PIE and non-PIE, to undefined symbols must use the GOT.
2. Read-only data in the shared library remains read-only at run-time.
3. The address of global data with the STV_PROTECTED visibility in the shared library is the address of the data body.
4. For systems without function descriptors,
- All global function pointers of undefined functions in PIE and non-PIE must use the GOT.
- The function pointer of functions with the STV_PROTECTED visibility in the executable and shared library is the address of the function body.
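For illustration (a generic example, not part of the proposal text), this is where copy relocations come from today: a non-PIE executable directly referencing data defined in a shared library.

    /* libfoo.c -- build with: gcc -shared -fPIC libfoo.c -o libfoo.so */
    const int table[4] = { 1, 2, 3, 4 };    /* read-only data inside the DSO */

    /* main.c -- build with: gcc -no-pie main.c ./libfoo.so -o main */
    #include <stdio.h>

    extern const int table[4];

    int main(void)
    {
        /* In a non-PIE executable this direct reference makes the static
         * linker emit a copy relocation: space for "table" is reserved in the
         * executable's writable segment, ld.so copies the DSO's data there at
         * startup, and every module is redirected to that copy.  The proposal
         * above would use a GOT access instead, so the DSO's data could stay
         * read-only and no run-time copy would be needed. */
        printf("%d\n", table[2]);
        return 0;
    }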
Intel LAM (Linear Address Masking) Extension allows software to locate metadata in data pointers and dereference them without needing to mask the metadata bits. It supports:
I am presenting a proposal to enable Intel LAM in Linux:
The existing implementation of the OpenACC "kernels" construct in GCC
is unable to cope with many language constructs found in real HPC
codes, which generally leads to very poor performance. This talk
presents upcoming changes to the "kernels" implementation that improve
the performance significantly:
BoF to discuss topics related to concurrency and offloading work onto AMD and NVIDIA accelerators using OpenMP and OpenACC.
In particular the implementation of the missing OpenMP 5.0 & 5.1 features, including memory allocators, unified shared memory, C++ attributes, etc.
Related topics and trends can also be discussed, be it base language concurrency features, offloading without using OpenMP/OpenACC, other accelerators.
Motivations to contribute, and the barriers faced by newcomers and contributors in joining and staying in Open Source Software projects, have intrigued researchers since the early 2000s. The literature on motivation was updated in recent work, which showed that for more than 55% of contributors who answered the questionnaire, the motivation shifted after joining. Those contributors joined OSS for one reason and continued for another reason. The world is dynamic, and so are we. If the reasons to participate can change from past to present, what about the future? The Linux kernel community wants to keep contributors around by understanding what they seek for their future, ultimately influencing the communities' sustainability.
We will present results from a survey with Linux Kernel contributors that was created to understand why people participate in Linux Kernel projects, their goals for the future, and what would make them leave or continue contributing. The survey is part of a Diversity & Inclusion initiative to attract and retain a diverse set of contributors in Linux Kernel.
The availability of BPF allows the improvement of preexisting perf features or the
addition of new ones without requiring kernel changes.
The first use of BPF to augment perf is to use BPF programs to profile other BPF
programs with 'perf stat'; this is already upstream and set the stage for
further uses. This provides functionality similar to 'bpftool prog profile' while reusing
lots of 'perf stat' features that were developed and improved by the perf tooling
community.
Then we had bperf, which shares hardware performance counters and aggregates data in BPF
maps that then get read by 'perf stat' as if they were normal perf events, again reusing
all the perf tooling features.
Some improvements, such as scaling cgroup perf monitoring, were first attempted by
modifying the kernel. But after several attempts, one is now being made using BPF, with
encouraging results. It works by hooking into cgroup scheduling and doing aggregation
that is made available to 'perf stat' via bperf.
Such use of BPF for aggregating information in the kernel instead of changing the perf
subsystem was well received by a perf kernel maintainer, which is encouraging.
Future work will use BPF to enable perf_events when some specific trigger condition
takes place, so that only a window determined by two probes gets sampled.
Also being considered is the conversion of some perf subcommands that analyze tracepoints
like perf sched/lock/etc to use BPF to aggregate things in the kernel instead of passing
vast amounts of data for aggregation in userspace while keeping the existing, familiar
tooling interface.
This shows how the perf and BPF communities are working together to improve Linux tooling,
provide ways to scale profiling, and improve observability of BPF programs. It is
expected that by presenting this talk we will get suggestions for further improvements.
It is generally known that Linux memory reclaim uses LRU ordered lists to decide which pages to evict to free memory for new pages. It might be less known that there are separate lists for file (page cache) and anonymous pages, and that both are further split in active and inactive parts. There are however lots of subtle details of how the relative sizes of these four lists are balanced, and things also changed recently with e.g. addition of workingset refault detection.
This talk will summarize the current reclaim implementation in detail, and also major proposed changes such as multigenerational LRU.
The 0day bot has reported many strange kernel performance changes in which the bisected culprit commit has nothing to do with the benchmark, which makes patch authors confused or even annoyed. Debugging shows these are mostly related to random code/text alignment changes, false sharing, or adjacent cacheline prefetch caused by the commit, as all components of the kernel are flatly linked together.
There have been around 20 reported cases checked (all discussed on LKML, like[1][2][3][4]), and this talk will try to:
* analyze and categorize these cases
* discuss the debug methods to identify and root cause
* discuss ideas about how to mitigate them and make kernel performance more stable.
Some patches have been merged, some are to be posted, and some are under development and test. We will discuss them and get advice/feedback.
[1].https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/
[2].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
[3].https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
[4].https://lore.kernel.org/lkml/20210420030837.GB31773@xsang-OptiPlex-9020/
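To make the false-sharing category mentioned above concrete, here is a small generic illustration (unrelated to any specific 0day report): two otherwise unrelated counters that happen to share a cacheline slow each other down when updated from different CPUs, and an innocent layout or alignment change elsewhere can create or remove exactly this interaction.

    /* build with: gcc -O2 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdio.h>

    /* Whether the two hot counters share a 64-byte cacheline depends purely
     * on layout/alignment, which is why unrelated commits can shift
     * benchmark results. */
    struct counters {
        volatile long a;                  /* updated by thread 1 */
        /* char pad[64]; */               /* uncommenting typically removes
                                             the false sharing */
        volatile long b;                  /* updated by thread 2 */
    } shared;

    #define ITERS 100000000L

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            shared.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            shared.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared.a, shared.b);
        return 0;
    }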
The RISC-V microconference focuses on the development of RISC-V.
The RISC-V platform specification[1] describes a minimum set of hardware/software requirements to ensure the interoperability of software across compatible platforms. Currently, it defines two platforms: the OS-A platform, capable of booting rich operating systems such as Linux, FreeBSD and Windows, and the M platform, aimed at RTOS and bare-metal use. The platform specification is currently under public review and we are collecting feedback from the RISC-V community. We would like to discuss the specification in this forum as well, to get broader feedback, which is imperative for the success of the platform specification.
[1] https://github.com/riscv/riscv-platform-specs/blob/main/riscv-platform-spec.adoc
The RISC-V Advanced Interrupt Architecture (RISC-V AIA) and RISC-V Advanced CLINT (RISC-V ACLINT) are non-ISA specifications which define next generation interrupt controller, timer, and inter-processor interrupt (IPI) devices for RISC-V platforms. The RISC-V AIA and ACLINT devices will support wired interrupts, message signaled interrupts (MSIs), virtualized message signaled interrupts (virtual MSIs), flexible machine-level timer, machine-level IPIs, and supervisor-level IPIs.
Both RISC-V AIA and ACLINT specifications are in final stages for being ratified and have been validated using QEMU, OpenSBI, Linux RISC-V, and Linux RISC-V KVM. This talk will involve an overview of RISC-V AIA and ACLINT specifications, detailed software status, and open items.
The RISC-V platform specification mandates the Advanced Configuration and Power Interface (ACPI) as the hardware discovery mechanism for server class platforms. There are some new ACPI tables that need to be defined for RISC-V. Code changes are required in QEMU, TianoCore (EDK2), and the OS. This is still a work in progress, but the talk will provide details about the planned specification updates and a demo of a basic ACPI-enabled Linux kernel booting on the QEMU virt platform.
D1 is Allwinner's first SoC based on the RISC-V ISA. It integrates Alibaba T-Head's 64-bit C906 core, supports RVV, and runs at 1GHz. Because some of its features are not included in the RISC-V spec, upstreaming to Linux has met some problems. Let's review and discuss the issues:
Items 2 & 3 are the minimum requirements for D1 bring-up, so let's focus on them first. Items 4 - 6 could help D1 work better, and we will just have a quick review of them. Item 7 is about alternatives, e.g. how we use errata_list.h for dma_sync ops.
ifunc is a widely used mechanism for specializing performance
critical functions in glibc, like memcpy, strcmp and strlen.
It’s not used in upstream glibc for RISC-V yet, but with several new
extensions becoming ratified soon, users will want to have
vector-optimized routines to boost their programs.
It’s a generic infrastructure in the GNU toolchain, so in theory not
much work is needed to enable it, but the real world isn’t so
wonderful…
Here are the pieces of the puzzle for RISC-V ifunc; some are there and
some are missing:
- Relocation for ifunc.
- https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/131
- Mapping symbol.
- https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/196
- New asm directive to enable/disable any extension in specific code
region (like .option rvc/.option norvc, but more generic one.)
- https://github.com/riscv-non-isa/riscv-asm-manual/pull/67
- New function target attribute for C/C++
- e.g. int sse3_func (void) __attribute__ ((target ("sse3")));
- hwcap and hwcap2
Most items are toolchain work, but the last item, hwcap, requires
coordination between glibc and the Linux kernel to implement a new mechanism
to discover the machine's capabilities.
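For reference, this is what the ifunc mechanism looks like in C on targets where it is already wired up; the RISC-V-specific parts (which hwcap bit would advertise the vector extension, and how the kernel exposes it) are precisely the open items above, so the bit used below is a placeholder assumption.

    #include <stdio.h>
    #include <string.h>
    #include <sys/auxv.h>

    /* Placeholder: the real hwcap encoding for the vector extension is one of
     * the open kernel/glibc coordination items listed above. */
    #define HWCAP_HYPOTHETICAL_V  (1UL << 21)

    static size_t my_strlen_scalar(const char *s) { return strlen(s); }

    /* Stand-in for a future vector-optimized implementation. */
    static size_t my_strlen_vector(const char *s) { return strlen(s); }

    /* The resolver runs once, during relocation, and returns the chosen
     * implementation.  In real glibc code the resolver is usually handed
     * hwcap directly instead of calling back into the library, since it may
     * run before everything is relocated. */
    static size_t (*resolve_my_strlen(void))(const char *)
    {
        return (getauxval(AT_HWCAP) & HWCAP_HYPOTHETICAL_V)
               ? my_strlen_vector : my_strlen_scalar;
    }

    /* All calls to my_strlen() bind to whatever the resolver returned. */
    size_t my_strlen(const char *s) __attribute__((ifunc("resolve_my_strlen")));

    int main(void)
    {
        printf("%zu\n", my_strlen("plumbers"));
        return 0;
    }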
Architectures competing with RISC-V have expended considerable time and resources on optimizing their development tools for improved performance on industry-standard benchmarks. For the future growth of the RISC-V ecosystem, a concerted effort to optimize the generated code for performance will be required. This effort will in a large part be independent of the underlying microarchitecture and can be distributed across our entire ecosystem, if we develop the necessary tools and infrastructure to assess for gaps, distribute the work and cooperate.
We propose a data-driven methodology, based on the gathering and comparison of hot-block information, instruction-type histograms and dynamic instructions counts, to evaluate the performance of compilers for RISC-V using qemu. Based on example findings and data, we will illustrate the proposed workflow and how it can allow the prioritisation of potential optimizations based on an expected gain.
We aim to motivate increased cooperation and the creation of a data-driven workflow built around standard tools (primarily plugins for qemu and analysis tools) to continuously monitor and improve the quality of the RISC-V compilers.
The Real-time microconference focuses on finishing the last lap of getting the PREEMPT_RT patch set into mainline. Many of the missing pieces, however, are not in the core real-time features (like locking and scheduling), but in other subsystems that compose the kernel, like file systems and memory management. Making these Linux subsystems compatible with PREEMPT_RT requires finding solutions that are acceptable to the subsystem maintainers, without having these subsystems suffer from performance or complexity issues.
Welcome Message
This topic will present the current workflow for maintaining PREEMPT_RT, and
discuss the challenges of maintaining PREEMPT_RT once the merge is done.
The osnoise and timerlat tracers landed in 5.14.
In addition to the tracing aspects, these two tracers also report performance metrics relevant to real-time. However, it is not easy to manually parse these metrics.
rtla (real-time Linux analysis) is a user-space interface for these tracers. It uses libtracefs to set up a tracing session and to collect data and trace information. It has an intuitive interface, and will also serve as the basis for other real-time related tracers.
The idea is to present and discuss this new tool in this MC topic.
This slot is reserved for a break, or a chat if you so wish.
Running CPU-intensive high-priority real-time applications on a
real-time Linux kernel (based on the PREEMPT_RT patchset) can lead to
situations where the kernel's own housekeeping tasks such as per-cpu
kernel threads get starved out, resulting in system instability
(hangs/unresponsive system). The Real-Time Throttling feature in the
Linux kernel is ineffective in addressing this problem as it does not
protect low-priority real-time kernel threads (such as ktimersoftd).
The stalld userspace daemon was introduced to solve this problem, and
is quite effective in principle; but it has a number of limitations
that makes it hard to use in practice, especially in production
deployments. We propose implementing stalld-like starvation avoidance
for kernel threads directly in the Linux kernel, to address all the
practical limitations of stalld. This design scales well with the
number of CPUs, has minimal monitoring overhead (CPU usage), and
compartmentalizes the fault-domain such that a misbehaved or
misconfigured real-time application does not bring down the entire
system.
The Telco industry is undergoing a major revamp of its infrastructure
at the edge (cell towers) as well as the core datacenter, in order to
meet the demands of 5G networking. As part of this effort, the
underlying infrastructure called the Radio Access Network (RAN), which
was traditionally implemented in hardware (FPGAs) for low-latency
predictable real-time response, is being replaced with
software-defined RAN applications running on real-time Linux kernel
(PREEMPT_RT). These soft real-time applications involve running
CPU-intensive high-priority real-time tasks, to meet the stringent
latency requirements as defined by the 5G/3GPP specification.
There are a number of challenges that the Linux real-time stack needs
to address to support this new class of workloads. This proposal
focuses on system stability issues when running CPU-intensive
high-priority real-time applications on the PREEMPT_RT Linux kernel
and highlights the open issues and proposes a novel design to address
the limitations of existing solutions by implementing kernel-thread
starvation avoidance in the Linux kernel.
In the Telco/5G Radio Access Network (RAN) usecase, deploying the
application involves running high-priority CPU hogs such as "L1 app"
(based on Intel's FlexRAN and DPDK Poll-Mode-Driver). These
latency-sensitive tasks are bound to isolated CPUs and they run
infinite polling loops (in userspace) with high real-time priority
(typically SCHED_FIFO/90+). In this scenario, even if the L1 app RT
tasks don't invoke kernel services by themselves, generic (non-RT)
workloads running on non-isolated CPUs (such as Kubernetes control
plane tasks) can cause per-CPU kernel threads to wake up on every CPU.
However, such kernel threads on isolated CPUs running the L1 app RT
tasks will get starved out, since the L1 app never yields the CPU.
One of the consequences of starving out essential kernel threads is
system-wide hangs. As an example, if a container gets destroyed (from
non-isolated CPUs), the corresponding network namespace teardown code
in the Linux kernel queues callbacks on per-CPU kworkers, and invokes
flush_work(), thus expecting the per-CPU kworker on every CPU to
participate in the teardown mechanism. As a result, the container
destroy will get hung indefinitely due to kthread starvation on CPUs
running the L1 app RT tasks. Furthermore, since this code path holds
the rtnl_lock, any other task that touches kernel networking will end
up getting stuck in uninterruptible sleep ('D' state) too (eg: sshd,
ifconfig, systemd-networkd etc.), thus cascading to a system-wide
hang.
This pattern of kernel subsystems invoking per-CPU kernel threads for
synchronization is quite pervasive throughout the Linux kernel, and
the resulting kthread starvation issues go well beyond the specific
networking scenario highlighted above. Furthermore, even essential
real-time configuration tools and debugging utilities such as tuned
and ftrace/trace-cmd themselves rely on kernel interfaces that can
induce such starvation issues.
The community tried to address the problem of system instability
caused by running CPU-intensive high priority real-time applications
in LPC Real-Time microconference 2020 by introducing stalld. The
stalld userspace daemon monitors the system for starving tasks (both
userspace and kernel threads), and revives them by temporarily
boosting them using the SCHED_DEADLINE policy. It achieves this
revival and system stability by operating within tolerable bounds of
OS-jitter as configured by the user.
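For readers unfamiliar with the mechanism, "boosting with SCHED_DEADLINE" amounts to something like the simplified sketch below. This is illustrative only, not stalld's actual code: the runtime/period numbers are arbitrary, CAP_SYS_NICE is required, and deadline admission control may reject tasks with restricted CPU affinity depending on the system configuration. The starving thread is granted a small periodic runtime budget and later returned to its normal policy.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    /* Subset of the uapi struct sched_attr, see sched_setattr(2). */
    struct dl_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;    /* ns */
        uint64_t sched_deadline;   /* ns */
        uint64_t sched_period;     /* ns */
    };

    static long set_attr(pid_t tid, const struct dl_attr *attr)
    {
        return syscall(SYS_sched_setattr, tid, attr, 0);
    }

    /* Grant the starving thread 20us of guaranteed runtime every 1ms, give it
     * a moment to make progress, then drop it back to SCHED_OTHER. */
    static int boost_briefly(pid_t tid)
    {
        struct dl_attr dl = {
            .size           = sizeof(dl),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =   20 * 1000,
            .sched_deadline = 1000 * 1000,
            .sched_period   = 1000 * 1000,
        };
        struct dl_attr other = {
            .size         = sizeof(other),
            .sched_policy = SCHED_OTHER,
        };

        if (set_attr(tid, &dl) < 0) {
            perror("sched_setattr(SCHED_DEADLINE)");
            return -1;
        }
        usleep(10 * 1000);                 /* the boost window */
        return set_attr(tid, &other) < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <tid>\n", argv[0]);
            return 1;
        }
        return boost_briefly((pid_t)atoi(argv[1])) ? 1 : 0;
    }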
We have been using stalld along with RAN applications and it has been
quite effective in maintaining system stability. However, we have also
come across a number of limitations in stalld, owing to its design as
well as the choice to implement starvation monitoring and boosting in
userspace. We would like to bring out stalld's pain-points and then
discuss a prototype that we have developed to address these concerns,
by implementing stalld-like kernel-thread starvation avoidance
directly in the Linux kernel.
Stalld spawns a pthread for every CPU to monitor and boost starved
tasks on the respective CPU. However, in RAN usecases, due to the
use of CPU isolation, all of stalld's threads are forced (bound) to
run only on the housekeeping CPUs, which are typically a small
subset of the available CPUs in the system. For example, on a 20 CPU
server with CPUs 2-19 isolated to run RT tasks, potentially 20
stalld threads compete for CPU time on housekeeping CPUs 0-1, trying
to monitor and boost starved tasks on all the 20 CPUs.
Since stalld runs as a normal priority task, higher priority tasks
(or even a high volume of similar priority tasks) running on the
housekeeping CPUs can starve out stalld itself. Attempting to solve
this problem by turning stalld into an RT application is risky, as
it can make the situation worse -- since all of stalld's per-CPU
monitoring threads put together can potentially consume all the
available CPU time on the housekeeping CPUs (depending on how
aggressive the stalld configuration is), real-time stalld can end up
causing starvation itself!
On systemd-based Linux installations, stalld logs its output related
to starvation conditions and boosting events to journalctl logs via
systemd-journald. However, in most situations involving system-wide
hangs, systemd-journald gets stuck in uninterruptible state too,
leaving no trace of stalld's execution flow and boosting decisions.
One of the other concerns with stalld's design is the use of per-CPU
threads for starvation monitoring and boosting, which can be CPU
intensive. To address this problem, stalld supports a
single-threaded mode of operation to monitor the entire system, but it
trades off the time-to-respond to starvation conditions in exchange
for lower CPU consumption. However, this is a tricky trade-off for
the system administrator in practice, since typical starvation
issues arise from per-CPU kthreads woken on every CPU and demand
quick boosting/revival on every CPU for system stability.
Considering these limitations of stalld for practical deployments, we
have developed a prototype design to address these concerns by
implementing stalld-like kernel-thread starvation avoidance directly
in the Linux kernel.
Our design to address the limitations of stalld builds on the
following key insights:
System-wide hangs (as described above) are almost always caused by
starving kernel threads, which may be the result of a misconfigured
real-time application. However, ensuring that kernel threads never
starve (using an in-built starvation-avoidance algorithm in the
kernel) will keep the OS stable, while limiting the hangs or
starvation issues to the misbehaving application itself. A
misconfigured RT application can no longer bring down the entire OS.
In a typical real-time RAN application deployment, CPU isolation is
used to move all movable tasks to housekeeping CPUs, so as to run
the real-time application on the isolated CPUs. In such a
configuration, the only remaining kernel threads on the isolated
CPUs are non-migratable per-CPU kthreads such as ktimersoftd,
per-CPU kworkers etc., and those are the ones that are likely to get
starved out. Therefore, the problem of identifying starved
kernel-threads and reviving them via priority boosting is naturally
CPU-local, and it can be implemented without the need for
system-wide monitoring or cross-CPU coordination.
The Linux kernel scheduler uses a per-CPU design for scalability.
Hence, implementing per-CPU kernel thread starvation avoidance by
directly hooking onto the scheduler should automatically scale well.
Implementing starvation monitoring and revival for kernel-threads in
the Linux kernel itself offers a number of surprising benefits,
including the ability to elegantly side-step entire problem classes
altogether, as compared to a userspace solution, as noted below.
3A. Efficiency
- The in-kernel implementation allows hooking the starvation avoidance
algorithm to specific events of interest within the scheduler (such as
task wakeups), which helps minimize unnecessary periodic monitoring
activity, thus saving CPU time.
3B. No risk of starving the starvation avoidance mechanism
- In NOHZ_FULL mode, a single task can effectively monopolize the CPU
without ever entering the kernel; but luckily this also means that
there is no chance of starvation, since there is only one task
eligible to run on that CPU. Waking up any other task targeted for
that CPU will invariably invoke the scheduler, which gives the
opportunity to run starvation avoidance as needed.
This design also side-steps problems that arise with userspace
solutions such as deciding the scheduling policy and priority at
which stalld runs so as to not get starved itself.
We have developed a prototype that implements the design envisioned
above by using scheduler hooks in the Linux kernel as well as hrtimer
callbacks. A brief outline is presented below.
When a task gets enqueued into a CPU's runqueue, the "stall monitor"
code arms a starvation-detection hrtimer (if not already armed) to
fire after a (user-configurable) starvation-threshold, iff the task
that was enqueued was a kernel thread.
Once the starvation-detection timer fires, the stall monitor code
checks if the set of runnable kernel threads on that CPU have been
starving for the threshold duration. If it detects starvation, it
arranges to boost the kernel threads (one-by-one) using the
SCHED_DEADLINE policy in the irq_exit() path of the hrtimer interrupt,
and arms a deboost hrtimer to fire after the (user-configurable) boost
duration.
The deboost timer's callback restores the scheduling policy and
priority of the boosted kernel thread to its original settings.
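To make the outline above concrete, here is a hedged, simplified C sketch of what such a per-CPU stall monitor could look like. This is not the authors' prototype; the enqueue hook point, the pick_starved_kthread()/boost_to_deadline()/restore_sched_attr() helpers, and the default threshold values are assumptions made purely for illustration, and timer initialization is omitted:

/* One monitor instance per CPU. */
struct stall_monitor {
	struct hrtimer starvation_timer;   /* fires after the starvation threshold */
	struct hrtimer deboost_timer;      /* fires after the boost duration */
	struct task_struct *boosted;       /* kthread currently boosted, if any */
	struct sched_attr saved_attr;      /* original policy/priority to restore */
};
static DEFINE_PER_CPU(struct stall_monitor, stall_mon);

static unsigned int starvation_threshold_ms = 20;  /* user-configurable (assumed default) */
static unsigned int boost_duration_ms = 3;         /* user-configurable (assumed default) */

/* Assumed hook in the scheduler's enqueue path. */
static void stall_monitor_enqueue(struct task_struct *p)
{
	struct stall_monitor *sm = this_cpu_ptr(&stall_mon);

	/* Only kernel threads are monitored; the timer is armed at most once. */
	if (!(p->flags & PF_KTHREAD) || hrtimer_active(&sm->starvation_timer))
		return;

	hrtimer_start(&sm->starvation_timer,
		      ms_to_ktime(starvation_threshold_ms),
		      HRTIMER_MODE_REL_PINNED);
}

/* Starvation-detection timer: if a runnable kthread on this CPU has not run
 * for the whole threshold, boost it to SCHED_DEADLINE (done from irq_exit()
 * in the proposed design) and arm the deboost timer. */
static enum hrtimer_restart starvation_timer_fn(struct hrtimer *timer)
{
	struct stall_monitor *sm = container_of(timer, struct stall_monitor,
						starvation_timer);
	struct task_struct *victim = pick_starved_kthread();      /* assumed helper */

	if (victim) {
		sm->boosted = victim;
		boost_to_deadline(victim, &sm->saved_attr);        /* assumed helper */
		hrtimer_start(&sm->deboost_timer,
			      ms_to_ktime(boost_duration_ms),
			      HRTIMER_MODE_REL_PINNED);
	}
	return HRTIMER_NORESTART;
}

/* Deboost timer: restore the boosted kthread's original scheduling policy. */
static enum hrtimer_restart deboost_timer_fn(struct hrtimer *timer)
{
	struct stall_monitor *sm = container_of(timer, struct stall_monitor,
						deboost_timer);

	if (sm->boosted) {
		restore_sched_attr(sm->boosted, &sm->saved_attr);  /* assumed helper */
		sm->boosted = NULL;
	}
	return HRTIMER_NORESTART;
}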
We are still working on revising this basic design and implementation,
and we are looking forward to sharing more details at the conference
and seeking the Linux real-time community's invaluable feedback on
further improvements or better alternatives.
The Telco Radio Access Network (RAN) for 5G is an exciting avenue that
brings a new class of real-time workloads to Linux. While the Linux
real-time stack based on the PREEMPT_RT patchset has been used with
great success for decades with tightly controlled real-time
applications and system configuration, the Telco/RAN usecase
challenges the status quo by demanding lower real-time latency than
ever before, while co-existing with non-real-time workloads as generic
(i.e., not tightly controlled) as Kubernetes.
One of the major pain-points faced by the industry in running these
workloads on Linux is instability of the underlying OS itself,
oftentimes triggered by the very tools that are used for Linux real-time
system configuration, tracing and debugging! In this proposal, we
discussed the most promising current solution in this problem space,
namely stalld, and highlighted its limitations as observed in
practical deployment scenarios. We proposed an alternative design that
addresses these limitations by implementing stalld-like kernel thread
starvation avoidance in the Linux kernel itself.
We are looking forward to the Linux community's insightful feedback on
our design, as well as invaluable suggestions more broadly on solving
OS stability issues for RAN-like usecases that involve running
CPU-intensive high priority real-time tasks.
The community agrees that a new futex syscall is needed to add new features that help with performance and scalability issues. However, after several patches proposing different implementation approaches, the path to getting it merged is not clear. The goal of this session is to get maintainers and developers together to figure out the best approach to make progress on the new interface.
Since 2018 there has been a dedicated effort to rework printk. Originally fueled by the need to make printk real-time friendly, the task quickly evolved to address many other existing problems within the printk subsystem. Since 5.8 there has been a steady flow of these improvements getting merged into mainline, but several RT-critical pieces are still remaining: sync mode, kthread printers, atomic consoles, pr_flush().
In this session we will take a look at these needed features, talk about why their current PREEMPT_RT implementation is not acceptable for mainline "as is", and discuss the plan for moving forward.
In this talk, Thomas Gleixner will present the status of PREEMPT_RT, along
with a question-and-answer section regarding the upstream work and the
future of the project.
The Android microconference focuses on cooperation between the Android and Linux communities.
Shortly after last year's Plumbers Conference the initial version of Generic Kernel Images (GKI) shipped in products based on the 5.4 kernel and Android 11. Devices that shipped with a 5.4 kernel are compatible with GKI. Kernel developers can replace the system image with the publicly available GSI image and replace the boot image with GKI and the device will boot and run Android 11.
In Android 12 devices running the 5.10 kernel, the product kernel is GKI, which means kernel fragmentation is nearly eliminated. Kernel development on Android devices will be much easier and much of the difficulty delivering security patches to devices in the field is removed. With a single core-kernel, and an upstream-first process, the gap between the Android kernel and mainline Linux is drastically reduced.
This session will be a brief discussion on the status of GKI in Android 12 followed by Q&A.
Android has been benefiting from extensive use of the cgroup v1 interface to boost important tasks (the top-app and foreground groups) and limit unimportant ones (background). Our recent investigations have shown that combining CPU shares with the newly introduced util-clamp feature can improve user-visible jank, specifically in cases where background load is high. Unfortunately, util-clamp and CPU shares are both attached to the CPU controller, which constrains userspace's ability to classify tasks and drive these features independently. The issue becomes even bigger when we plan to migrate to cgroup v2. In addition to this, the util-clamp max aggregation can be ineffective because of co-scheduling, leading to sub-optimal energy consumption. This talk will describe those problems in more detail and discuss potential solutions.
Stacking file systems based on FUSE are intended to route requests through complicated code paths implemented by the FUSE service, in order to enforce special access policies or manipulate data at runtime, based on the request received by the FUSE file system and the data in the lower file system.
Android relies on FUSE to enforce fine-grained access policies depending on file contents and requesting users, and may modify file contents at run-time.
These benefits come at the cost of increased overhead to traverse the whole FUSE pipeline, worsened by the multiple switches between kernel-space and user-space. FUSE performance may end up below 30% of that of direct access to the lower file system files.
FUSE passthrough is a first solution that has been proposed upstream to reduce this performance gap: it allows the FUSE service to grant direct access to selected lower file system files, with requests internally rerouted by the FUSE driver. This solution is already available in a number of Android devices, but is still under discussion on the mailing list.
Another work in progress builds on FUSE passthrough by extending the FUSE driver with additional logic based on BPF, still managed by the FUSE service. This solution aims at bringing the FUSE passthrough performance benefits to more file system operations and at improving the FUSE driver's flexibility and updatability with BPF programs, without the need to modify the kernel.
dm-snapshot was a huge step forward for Android updates, but it can have greatly outsized disk space requirements for relatively small binary patches. Since dm-snapshot is closely tuned to the underlying exception store, it is not easily amenable to custom storage formats.
We have addressed this by implementing dm-snapshot in userspace via a new "dm-user" device-mapper module (like FUSE, but for block devices). Since the entire OS runs off this block device, performance is a primary motivation over NBD/iSCSI, and we are interested in how to achieve high-performance userspace block devices in the upstream kernel.
The Android community has been using the thermal core infrastructure for both Tj and Tskin solutions for years, and many thermal DVFS features from various vendors/OEMs are built upon it, which usually requires changes in the thermal governor and framework. However, with GKI introduced, OEMs are no longer allowed to ship a custom thermal governor as a module, which limits the possible solutions and sometimes leads to sub-optimal code that embeds an in-driver governor. In addition, there are many learnings from using the thermal core infrastructure for both Tskin and Tj solutions together, especially in the way they interact with each other. This talk will describe those problems in more detail and discuss potential solutions and improvements to the current situation.
A discussion on how we can find a cost-effective solution to attribute shared buffers to their allocating processes. Other than being useful for memory accounting/debugging, this could also lead the way to a solution to set limits on how much memory a process can allocate.
fw_devlink has been enabled as "on" by default since version 5.13 and ensures a device's probe() is never called before its supplier devices have successfully bound to a driver.
This talk will be focusing on any remaining issues and future improvements. Some of the topics that'll be discussed include:
The Rust for Linux project is adding support for the Rust language to the Linux kernel. We have a partial implementation of the Android Binder driver, as well as PL061 GPIO and NVMe drivers in Rust. Our goal is to make Rust available to kernel developers so that drivers can be written more expeditiously, with most potential memory bugs caught at compile-time, while at the same time preserving performance characteristics.
We show brief examples of how this is achieved and what real drivers look like in Rust, contrasting them with their C counterparts. We'd then like to discuss concerns, objections, potential unforeseen difficulties, general feedback, etc. that members of the community may have. We're also interested in hearing about existing pain points when writing drivers in C so that we can try to improve the experience in Rust.
Most Android vendors currently ship kernels that include Laurent Dufour's speculative page fault patchset from about 2.5 years ago. The patch set was rejected upstream at that time, due to code complexity, but provides a significant benefit to application startup times. I have been working on a new spin on the same basic idea, and came up with a patchset version which is (IMO) simpler and more bisectable. I would like to discuss performance results and gauge what our options are for upstreaming this.
While there are only a small number of devboards in AOSP, a number of vendors and community members have created external projects to enable their devices against AOSP.
Some examples:
After seeing some of the excellent work being done in the GloDroid project and realizing there is a fair amount of duplicated effort in keeping a device current with AOSP, I thought there might be a better opportunity for devboard vendors and community members who are focusing on AOSP to collaborate.
I'll cover my thoughts on what sort of collaboration might be useful, along with some of the potential pitfalls, and see what interest or ideas folks have on how we might work together and share more experience and knowledge as a community.
I hope to have a discussion on the topic with GloDroid maintainers, LineageOS developers, as well as other Linaro and Google developers, and hopefully more.
The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.
This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.
This talk will review the goals and requirements for a BPF memory model and look at more recent work on deriving memory-model litmus tests from example BPF programs. These examples will cover ordering within and among BPF programs, but also ordering with the kernel code that BPF programs can interact with.
The report covers the use of the flow label in modern network environments and the operational effect of TCP hash 'rethinking' upon negative routing events.
Since the cilium/ebpf pure Golang library was last presented at LPC 2019, a lot has changed. eBPF is now seemingly on everyone's radar, the eBPF Foundation is a thing, and more people are using and writing Go-based tools and services than ever. What does this mean for the library and the ecosystem around it? Who uses it, who's been contributing, and which use cases does the library enable today?
In this talk, we'll mainly discuss the following topics:
Q&A will open after each proposal.
With the rapid adoption of Cilium as the BPF-based datapath for Kubernetes as
well as integration into popular devops tooling such as kind [0] which allows
for running local Kubernetes clusters using Docker container 'nodes', we see
more advanced use (and corner) cases which have not yet been tackled from a
BPF and networking angle. Therefore, in this slot, we discuss various loosely
coupled issues in the networking stack which we are working on in the context
of Cilium's BPF datapath:
We will provide a brief overview of the use cases related to the above, and give
an outline for kernel extensions we are looking into.
[0] https://kind.sigs.k8s.io/
[1] https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
Iptables has become a synonym for a firewall in the Linux world. Although there is
nftables, which is supposed to replace iptables, iptables will exist for
decades more because of its popularity and ubiquity.
With the growing widespread use of BPF technology and its benefits, there is a
temptation to apply the technology to firewalling.
Despite its advantages, iptables is also known for its dark side - performance
and security related issues. What if it were possible to keep the iptables ABI
and replace its implementation with something more performant and secure by
nature?
Such an approach would keep existing solutions working and remove the
overhead of switching to a new technology.
There was an RFC patchset back in 2018 which proposed a BPF based firewall -
bpfilter. From a bird's eye view, bpfilter is a compiler implemented as a user
mode helper kernel module. bpfilter analyses an iptables ruleset and
synthesizes an equivalent BPF program. When the bpfilter kernel module is
loaded, it starts a userspace process that communicates via IPC with its
kernelspace part. Most of the bpfilter functionality is implemented in the
userspace process, which significantly simplifies its development and improves
security. The kernel part hooks into the kernel iptables ABI and, transparently
to the userspace consumer, passes control to the userspace process. The
bpfilter userspace process "compiles" the iptables ruleset into a BPF program
and passes control back to the kernel.
This approach makes it possible to transparently replace the iptables
implementation without breaking its consumers, while gaining all the benefits
of the BPF ecosystem.
While the initial patchset was abandoned, in 2021 there was an attempt to
resurrect it. Two versions of the updated patchset were submitted to
the bpf@ mailing list and the third iteration is in preparation.
Currently bpfilter is able to process basic rules in the INPUT and OUTPUT
chains and translate them into equivalent XDP and TC programs. bpfilter
provides an easy way to add new matches and extensions in iptables terms.
The idea of treating a firewall as a compiler is seductive, as such an approach
provides more opportunities for performance optimisations due to a more precise
context. Combining it with the existing BPF performance and security features,
and adding its userspace nature on top - this might sound like the next
firewall for Linux.
The annual GNU Toolchain mindfulness and meditation session. A cordial Questions and Answers session with the GCC Steering Committee and the GLIBC, GDB and Binutils stewards will also be entertained.
This is a lightning talk.
One of the hurdles necessary to overcome for the M1 Darwin GCC port is
supporting the Darwin ABI specification. GCC is designed to process
argument passing the same way, regardless of whether the argument is
named or variadic. This however does not leave scope to accommodate the
Darwin modifications to the AArch64 ABI, which specifies that named
stack-allocated arguments are passed naturally aligned, but variadic
arguments are passed word-aligned.
To overcome this, we propose extending the GCC target hook API to carry
the additional information necessary to let the backend make its own
decisions about stack layout. The extension will not affect existing
targets, and is opt-in by nature.
The second issue we tackled was support for the GCC nested function
extensions to the C language. This is traditionally implemented using
trampolines injected onto the stack at runtime, which requires an
executable stack. Since Darwin's stack is non-executable, and the
target doesn't make use of function descriptors, we required a
solution to support nested function calls that didn't require changing
the ABI.
Our preliminary plan is to generate the trampolines into an mmaped
executable page: The trampolines will be generated when required
within a function, and deallocated when the control leaves the
enclosing scope.
Recent x86 processors support "non-temporal" stores, which bypass the cache when storing data. It is widely understood that normal stores to the cache are appropriate when the data is likely to be needed again before it is evicted. It is also understood that stores of large blocks of data which exceed the available cache allow the overall application to run faster when the block of stores bypasses the cache, leaving other locally used data in the cache. A recent change (since reverted) tuned the library routine for memcpy to optimize based on best results assuming a single core was the sole user of the cache, instead of allowing for multi-core server chips which have multiple cores sharing the cache. The specifics of the two cases will be presented, followed by discussion of how similar single-core vs. multi-core optimizations might be handled in standard software libraries.
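To illustrate the mechanism under discussion, here is a minimal C sketch of a copy loop using SSE2 non-temporal stores; this is not the glibc memcpy implementation, and the function name and alignment assumptions are purely illustrative:

#include <emmintrin.h>   /* SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>

/* Copy a large buffer using non-temporal (streaming) stores so that the
 * destination data bypasses the cache and does not evict locally useful data.
 * dst must be 16-byte aligned for _mm_stream_si128; bytes is assumed to be a
 * multiple of 16 to keep the example short. */
static void copy_nontemporal(void *dst, const void *src, size_t bytes)
{
        __m128i *d = (__m128i *) dst;
        const __m128i *s = (const __m128i *) src;

        for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
                _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));  /* store bypasses the cache */

        _mm_sfence();   /* order the streaming stores before later accesses */
}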
There are multiple security features that have been requested for the Linux Kernel for a long time (https://outflux.net/slides/2020/lpc/gcc-and-clang-security-feature-parity.pdf). This wishlist includes wipe call-used registers on return, auto-initialization of stack variables, unsigned overflow detection, etc …
Some of these security features have been available in CLANG, or other compilers for some time. The lack of these features in GCC makes it less competitive than other compilers regarding security.
For over a year, we have been working hard in order to make GCC comparable with, or even better than other compilers in this area.
The focus of this talk is on two security features that we have recently implemented in GCC11, or that we are currently working on for GCC12.
The first feature is called "wipe call-used registers on return". This is a technique to mitigate ROP (Return-Oriented Programming) and addresses the register erasure problem as mentioned in the "SECURE project and GCC" talk at Cauldron 2018 (https://gcc.gnu.org/wiki/cauldron2018#secure).
This project has been completed and the corresponding patch has been committed to GCC11. In this patch, we have added the new "-fzero-call-used-regs" option, plus the new function attribute "zero_call_used_regs", to GCC.
To improve kernel security, this new feature is now used in the Linux kernel. See https://patchwork.kernel.org/project/linux-kbuild/patch/20210505191804.4015873-1-keescook@chromium.org for more details.
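As a minimal illustration of how the new option and attribute are used (the function below is a stand-in of mine, not from the talk):

/* Whole translation units can be built with, e.g.:
 *     gcc -fzero-call-used-regs=used-gpr ...
 * or individual functions can be annotated: */
__attribute__((zero_call_used_regs("used-gpr")))
static int handle_secret(int secret)
{
        /* On return, GCC emits extra code that zeroes the call-clobbered
         * general-purpose registers this function actually used, so stale
         * values cannot be reused by a ROP gadget in the caller's context. */
        return secret * 31;
}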
The second feature is called "stack variables auto-initialization". It is a technique to provide automatic initialization of automatic variables. LLVM has supported the -ftrivial-auto-var-init=pattern/zero option to provide this functionality. The Linux kernel currently implements this as a plugin in its source tree, but ideally the compiler would support it natively, without the need for an external plugin.
This project is ongoing. The 7th version of the patch has been submitted to GCC upstream for review and discussion (https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576341.html).
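A small stand-in example of the behaviour being added (the code below is mine, purely for illustration):

/* Without the option, 'key' may be read uninitialized on the !have_key path
 * and leak stale stack contents. With -ftrivial-auto-var-init=zero the
 * compiler zero-fills it on entry; with =pattern it is filled with a
 * recognizable pattern that makes bugs easier to spot. */
struct key { unsigned char bytes[32]; };

int use_key(int have_key)
{
        struct key key;                 /* automatic (stack) variable */

        if (have_key)
                key.bytes[0] = 1;       /* only partially initialized */

        return key.bytes[31];           /* may read stale stack data otherwise */
}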
In the talk we provide a high-level overview of these two features. This includes a description of the issues, the motivation, major considerations, and some interesting implementation details.
Discuss topics related to the rs6000 / Power / PowerPC toolchain, including support for Power10.
The GNU C Library is used as the C library in the GNU systems and most systems with the Linux kernel. The library is primarily designed to be a portable and high performance C library. It follows all relevant standards including ISO C11 and POSIX.1-2008. It is also internationalized and has one of the most complete internationalization interfaces known.
This BoF aims to bring together developers of other components that have dependencies on glibc and glibc developers to talk about the following topics:
... and more.
io_uring is an asynchronous I/O API crafted for efficiency, where one of the reasons for using shared rings is to reduce context switching. It has gained lots of features since its introduction, and pushing it further we want to give away some of the control over submitting and controlling I/O to BPF, minimising the number of context switches even more.
We'll go over the current design [1] and decisions, issues and plans, and hopefully it will engage a discussion and give impetus to curious minds to try it out and share ideas on how to tailor the API to fit their use cases.
[1] https://lore.kernel.org/io-uring/a83f147b-ea9d-e693-a2e9-c6ce16659749@gmail.com/T/#m31d0a2ac6e2213f912a200f5e8d88bd74f81406b
In this talk, we would like to propose adding roles to memory pages. We contend that the current monochromatic memory model cannot address modern systems' security and performance needs.
We want to discuss two recent projects that perform memory segregation: DMA Aware Malloc for Networking (DAMN), which protects against DMA attacks (e.g., project Thunderclap) while providing the same performance at 100+ Gb/s as with iommu=off, and a Memory Allocator for I/O (MAIO), which facilitates overhead-free zero-copy networking for user-space applications.
We implement memory segregation by adding extra metadata into the tail pages of huge (i.e., compound) pages. This additional metadata allows for fast address translation and any additional operations per segment role.
The shared common DNA of both projects is a memory allocator that allots memory for specific operations. This memory can later be reclaimed by a simple put_page. While the memory pools are based on compound pages, different memory allocation techniques can be implemented, e.g., page_frag or single 4K page allocations. We take special care to ensure that get/put_page use the individual page ref_count rather than the head page's ref_count.
This form of memory segmentation isolates general kernel memory from segmented memory that a device or a user also uses. This isolation of memory is vital to facilitate fast and secure I/O operations.
Both DAMN and MAIO are prime examples that demonstrate how to solve complex I/O problems by adding segmentation to the existing memory model.
Following a previous talk, oomd: a userspace OOM killer, Facebook has since come up with a simplified interface for oomd that removes some of the barriers of configuring oomd. Integration with systemd allows more users to reuse their knowledge of configuring systemd daemons. And by removing some of the complexity of coming up with an OOM kill policy, we enable more users to do cgroup and PSI-based OOM kills.
This talk will cover the key features of oomd that were preserved in systemd-oomd, what changes were made to ease the kill policy decisions, and how this translates to the settings we've adopted in Fedora. We will close with a discussion of future work for systemd-oomd.
Over the last years, many discussions took place in Linux Foundation's ELISA workgroup (elisa.tech) about possible approaches to qualify Linux for safety-critical systems. It is a consensus that one of the main challenges for the qualification of Linux is the lack of SW Architectural Design documentation, especially concerning the kernel internal components/drivers/subsystems. Such documentation is fundamental in functional safety as it provides the baseline required to assess the OS design against the allocated safety requirements (safety analysis). This Architectural Design is also necessary to evaluate the completeness and correctness of the associated test campaign.
However, given the complexity of Linux, the challenge is finding a documentation format that is complete enough to justify the assessment while still keeping a maintainable granularity.
This talk will present an SW architectural design model that, working at the granularity level of the single drivers/subsystems, uses a formal method (automata) to describe the interaction of a target subsystem/driver with the rest of the kernel, whereas a natural language description (kernel-doc headers) is used to describe the behavior of the target subsystem/driver itself.
During the talk, the authors will present how to use computer-aided design tools to help to derive the automata models of target subsystems. They will also show how to take advantage of the proposed Runtime Verification Interface [1] to transform these models into runtime verification monitors that are usable either during the verification phase (to cross-verify the kernel and the documentation) or to monitor safety-related aspects of the system at runtime, avoiding unsafe states.
The discussion of this topic in a more development centric conference (instead of a more safety related audience) is necessary to get the direct feedback of kernel developers/maintainers about the approach and the maintainability of the SW Architectural Design documentation.
[1] https://lore.kernel.org/lkml/cover.1621414942.git.bristot@redhat.com/
The System Boot and Security microconference focuses on the firmware, bootloaders, system boot and security around the Linux system. It also welcomes discussions around legal and organizational issues that hinder cooperation between companies and organizations to bring together a secure system.
The grub2 bootloader is a trusted component of the secure boot process, including "traditional" GPG-based secure boot, UEFI-based secure boot, and the logical partition secure boot process being developed by IBM. Grub2 is mostly written in C and has suffered from a number of memory-unsafety issues in the past.
Rust is a systems programming language suitable for low-level code. Rust can provide valuable tools for safer code: code in 'safe' Rust has stronger guarantees about memory safety, while 'unsafe' code has to be contained in specially marked sections. It is reasonably easy for Rust code to interoperate with C.
Grub2 is based on a modular design. Potentially vulnerable components such as image and file-system parsers are written as individual modules. Can we progressively rewrite these modules in a safer language?
I will discuss my progress enabling Rust to be used as a language for grub development, issues I have encountered, decisions we will have to make as the grub community, and next steps from here.
In the bootloader as well as firmware, there is a lot of useful information on how the system is set up. However, there has been a lack of a transport mechanism for sending this information to the operating system. Initially, we designed a log to record messages from the GRUB2 bootloader so the TrenchBoot project could view how the platform was being set up during boot. After some discussion, we realized this could be useful for other projects and we could extend our design to work for other boot components. In this presentation, we will look at ways to collect information from the firmware and bootloader for the operating system.
A specification for Dynamic Root of Trust for Measurement (DRTM) on the Arm architecture will be available Fall 2021. DRTM allows a system in a potentially unknown or untrusted state to boot an OS or hypervisor into a known and trusted state.
This topic will present an overview of DRTM on Arm to provide context, followed by discussion around several topics that have implications for the Linux kernel:
The ability to do a Trusted Computing Group (TCG) Dynamic Launch of a system has been commercially available in x86 processors since 2006, with the introduction of Intel TXT for Intel processors and AMD-V for AMD processors. Over the years the technology has mainly been used by a limited number of security-sensitive projects. The TrenchBoot Project has been working to make the underlying hardware technology more integrated and an out-of-the-box solution usable by the general open-source operating system user. Towards that goal, the project has been working to upstream into the Linux kernel the ability to be directly launched by a TCG Dynamic Launch in a unified manner. The first patchset submitted is focused on enabling this approach for Intel TXT, with support for AMD and Arm to come soon after. The purpose of this topic is to engage the Linux developer community for feedback on the current patches and discuss ways in which progress towards merging could be made.
The Testing and Fuzzing microconference focuses on advancing the current state of testing of the Linux kernel. We aim to create connections between folks working on similar projects, and help individual projects make progress.
The Linux Plumbers 2021 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel. We aim to create connections between folks working on similar projects, and help individual projects make progress.
We ask that any topic discussions will focus on issues/problems they are facing and possible alternatives to resolving them. The Microconference is open to all topics related to testing & fuzzing on Linux, not necessarily in the kernel space.
Potential topics:
Things accomplished from last year:
Confirmed to-be attendees:
Many bugs are easy to detect: they might cause assertions failures, crash our system, or cause other forms of undefined behaviour detectable by various dynamic analysis tools. However, certain classes of bugs, referred to as semantic bugs, cause none of these while still resulting in a misbehaving faulty system.
To find semantic bugs, one needs to establish a specification of the system’s intended behaviour. Depending on the complexity of the system, creating and centralising such specifications can be difficult. For example, the “specification” of the Linux kernel is not found in one place, but is rather a collection of documentation, man pages, and the implied expectations of a vast collection of user space programs. As such, detecting semantic bugs in the Linux kernel is significantly harder than other classes of bugs. Indeed, many test suites are meant to detect regressions, but creating and maintaining test cases, as well as covering new features requires significant amounts of engineering effort.
Differential fuzzing is a way to automate detection of semantic bugs by providing the same input to different implementations of the same systems and then cross-comparing the resulting behaviour to determine whether it is identical. In case the systems disagree, at least one of them is assumed to be wrong.
syz-verifier is a differential fuzzing tool that cross-compares the execution of programs on different versions of the Linux kernel to detect semantic bugs. It was developed as part of the syzkaller project, which also provides unsupervised coverage-guided kernel fuzzing.
To generate programs, syz-verifier uses a declarative system call description language called syzlang. This allows generating valid random programs (sequences of system calls) the same way as syzkaller does. The programs are then dispatched for execution on different versions of the Linux kernel. After programs finish executing, the produced results (currently only the errnos returned by each system call) are collected and verified for mismatches. In case a mismatch is identified, syz-verifier reruns the program on all kernels to ensure it is not flaky (i.e. consistently reproducible rather than triggered due to some background activity or external state). If the mismatch occurs in all reruns, syz-verifier creates a report for the program.
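As a toy illustration of the differential idea (this is not syz-verifier code; the two "kernels" below are stand-in functions returning a fake errno, and one of them contains a deliberate semantic bug):

#include <stdio.h>
#include <stdlib.h>

#define RERUNS 3

static int kernel_a(int x) { return (x % 8) ? 0 : 22; }             /* EINVAL on multiples of 8 */
static int kernel_b(int x) { return (x % 8 || x == 64) ? 0 : 22; }  /* deliberate semantic bug  */

/* A mismatch is reported only if it reproduces on every rerun, filtering
 * out flaky results caused by background activity or external state. */
static int consistent_mismatch(int input)
{
        for (int run = 0; run < RERUNS; run++)
                if (kernel_a(input) == kernel_b(input))
                        return 0;
        return 1;
}

int main(void)
{
        for (int i = 0; i < 1000; i++) {
                int input = rand() % 128;
                if (consistent_mismatch(input))
                        printf("semantic mismatch for input %d\n", input);
        }
        return 0;
}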
The traditional approach of testing software on real hardware usually involves creating a rootfs which contains the test suites that need to be run, along with its run-time dependencies (network, mounting drives, time synchronization, ...).
Maintaining one rootfs per test suite is a significant packaging burden, but also prevents running multiple test suites back to back which slows down testing. The alternative is also not a clear win as this makes the creation of the rootfs harder when having conflicting requirements between test suites or if a test suite silently modifies some configuration which would impact other test suites, potentially leading to test failures being mis-attributed.
Fortunately, Linux namespaces and OCI containers are now becoming commonplace and can be used to package our test suites along with their dependencies, without having to integrate them all in one image. Provided that you have a well configured host kernel and OS, this enables running test suites in relative isolation, thus reducing the chances of interference between test suites. Finally, the packaging problem can be alleviated by having the test suites provide releases as containers, thus allowing re-use in many CI systems without modifications.
In this presentation, we will further present the benefits of containers, and introduce boot2container: A podman-powered initramfs that gets configured declaratively using the kernel command line, is deployable via PXE thanks to its small size (<20 MB), and that makes it easy to share files with/from the test machine via an S3-compatible object storage.
Join to hear about the next KCIDB release, new features and plans, including the new report format and subscription/notification system. Provide feedback and discuss ideas for further development. Get help submitting your data or joining the development.
KernelCI's KCIDB is an effort to unify kernel test reporting schema and protocol, and provide a service for aggregating, analyzing, reporting, and accessing test results received from various Kernel testing systems. We are already receiving data from ARM, Gentoo's GKernelCI, Red Hat's CKI, Google's Syzbot, Linaro's Tuxsuite, and of course the native KernelCI tests, and we're working on receiving data from more systems.
See our dashboard at https://kcidb.kernelci.org/
In this talk, we will show how to construct evidence of correctness through
testing and formal verification. In our case study, we test the long-standing
Red-Black tree implementation in the kernel against a variant in a functional
programming language. This variant has been formally verified in the interactive
theorem prover Isabelle [1]. To our surprise, the kernel Red-Black tree
implementation is a variant that is not known in the literature of functional
data structures so far. We are glad that we still found it to be correct with
newly identified invariants for the correctness proof.
[1] https://isabelle.in.tum.de/
Smatch is one of the main static analysis tools used in the kernel. These days, simple static analysis checks are increasingly implemented in the compilers. For Smatch, the new work is in more complicated cross-function analysis that compilers cannot handle.
This talk will give a brief introduction to the new Smatch Param/Key API which makes it easier to write advanced cross function checks and removes a lot of boilerplate code.
Then it will cover Smatch's "Sleeping in atomic" check. Checking for sleeping in atomic bugs requires complicated cross function analysis. This is an example of an advanced check with a lot of moving parts.
Finally, the talk will cover an in development check for race conditions. In some ways this is the most complicated Smatch check ever. Hopefully we can have a discussion about how to make this check better.
Both AMD and Intel have presented technologies for confidential computing in cloud environments. The proposed solutions — AMD SEV (-ES, -SNP) and Intel TDX — protect Virtual Machines (VMs) against attacks from higher privileged layers through memory encryption and integrity protection. This model of computation draws a new trust boundary between virtual devices and the VM, which so far lacks thorough examination.
To enable the scalable analysis of the hardware-OS interface, we present a dynamic analysis tool to detect cases of improper sanitization of input received via the virtual device interface. We detail several optimizations to improve upon existing approaches for the automated analysis of device interfaces. Our approach builds upon the Linux Kernel Library and clang’s libfuzzer to fuzz the communication between the driver and the device via MMIO, PIO, and DMA. An evaluation of our approach shows that it performs 570 executions per second on average and improves performance compared to existing approaches by an average factor of 2706.
Using our tool, we analyzed 22 drivers in Linux 5.10.0-rc6, thereby uncovering 50 bugs and initiating multiple patches to the virtual device driver interface of Linux.
The Rust for Linux project is adding support for the Rust language to the Linux kernel. A key part of such an effort is how to approach testing for code written in the new language.
It covers:
The past year has been an exciting one for KUnit, but there's still a long way to go to test a project as large and complicated as the Linux kernel. In this talk, we'll go over what KUnit has been doing since last year, and discuss how we can increase KUnit’s adoption throughout the Linux kernel.
We'll begin with an overview of new and improved features that have been added to KUnit, such as QEMU support in kunit_tool, SKIP test support, as well as improvements to documentation. We'll also touch on features and ideas that we have been experimenting with, and the challenges and opportunities they have presented.
We will then discuss KUnit's growing use, before transitioning into how we can increase adoption of KUnit across different parts of the kernel: for example, by migrating suitable ad-hoc tests into KUnit. We'll also talk about the challenges of testing drivers and subsystems, and how we are trying to build up a comprehensive set of tests in a major Linux kernel subsystem as a model to show how it can be done in other subsystems.
At this point, we will transition to having a group discussion about how we can grow KUnit usage across the kernel, what the complexities of testing different subsystems are, and which of these features and plans seem most useful to the community.
The Tracing microconference focuses on improvements of the Linux kernel tracing infrastructure. Ways to make it more robust, faster and easier to use. Also focus on the tooling around the tracing infrastructure will be discussed. What interfaces can be used, how to share features between the different toolkits and how to make it easier for users to see what they need from their systems.
Short introduction to the Tracing Microconference
The topic aims to present various challenges we have run into during the implementation of DTrace on top of Linux tracing facilities such as BPF. We hope to have an open discussion on how we can get around some of these challenges, because they are likely to be things other projects will run into as well. In addition, we want to share some of the workarounds we came up with, and hopefully spark discussion on how to propose fixes rather than depending on creative workarounds.
Summary
We have many user processes today that run in various locations and control groups. Knowing
every binary location for each version becomes a challenge. We also have common events that
are traced across many processes. This makes using uprobes a challenge, but not impossible.
However, having a way for user processes to publish data directly to trace_events enables a
much easier path toward collecting and analyzing all of this data. We do not need to track
per-binary locations, nor do we need to enter the control groups to find the real binary paths.
Today the main way to create and get data into a trace_event from a user mode program is by
using uprobes. Uprobes require the locations of each binary that wants to be traced in addition
to all of the argument locations. We propose an alternative mechanism which allows for faster
operation and doesn't require knowing code locations. While we could use inject and dynamic_events
to do this as well, user processes don't have a way to know when inject should be written to.
Knowing when to trace
In order to have good performance, user mode programs must know when an event should be traced.
Uprobes do this via a nop vs. int3 and handle the breakpoint in the die chain handler. To account
for this, a tracefs file called user_events_mmap will be created which will be mmap'd in each
user process that wants to emit events. Each byte in the mmap data will be 0 if nothing
is attached to the trace_event, and non-zero if there is. It would be nice to use each bit of
the byte to represent which system is attached (i.e. bit 0 for ftrace, bit 1 for perf, etc.). This
has the limitation, however, of only being able to support up to 8 systems, unless bit 7 is reserved
for "other". User programs simply branch on non-zero to determine if anything wants tracing. To
protect the system from running out of trace_events, the number of user-defined events is limited
to a single page. The kernel side keeps the page updated via the underlying trace_events register
callbacks. The page is shared across all processes; it is mapped in as read-only by the mmap syscall.
Opening / Registering Events
Before a program can write events, it needs to register/open them. To do this, an IOCTL is issued
to a tracefs file called user_events_data with a payload describing the event. The return value of
the IOCTL represents the byte within the mmap data to use to check if tracing is enabled or not. The
open file can now be used to write data for the event that was just registered. A matching IOCTL is
available to delete events, delete is only valid when all references have been closed.
Writing Event Data
Writing event data using the above file is done via the write syscall. The data described in each
write call will represent the data within the trace_event. The kernel side will map this data into
each system that is registered, such as ftrace, perf and eBPF automatically for the user.
Event status pseudo code:
page_fd = open("/sys/kernel/tracing/user_events_mmap");
status_page = mmap(page_fd, PAGE_SIZE);
close(page_fd);
Register event pseudo code:
event_fd = open("/sys/kernel/tracing/user_events_data");
event_id = IOCTL(event_fd, REG, "MyUserEvent");
Write event pseudo code:
if (status_page[event_id]) write(event_fd, "My user payload");
Delete event pseudo code:
IOCTL(event_fd, DEL, "MyUserEvent");
Providing adequate observability in containerized workloads is getting more important. Sophisticated instruments are being developed to understand what is really going on, but most of the effort approaches the problem from top to bottom, operating at the abstraction layers of container orchestration.
What if we take the opposite approach and use Linux system tracing to unfold the container and look inside? How can we check what is being executed, what files are being opened, etc.?
We currently have a POC based on kprobes that traces the system calls of a Docker container. It is written in Python and can be easily adapted to any changing environment, but it would be nice to have a standard API that would just work for all containers (not only Docker). Can we standardize the tracing of containers? How can we expand this to tracing containers on more than one machine?
When invoked from system call enter/exit instrumentation, accessing user-space data is a common use-case for tracers. However, tracepoints currently disable preemption around iteration on the registered tracepoint probes and invocation of the probe callbacks, which prevents tracers from handling page faults.
Discuss the use-cases enabled by allowing system call entry/exit tracepoints to take page faults, and what is missing to upstream this feature.
https://lwn.net/Articles/835426/
https://lwn.net/Articles/846795/
Upstreaming the LTTng kernel tracer [1] (originally created in 2005) into the Linux kernel has been a long-term goal of the LTTng project.
Today, various tracing technologies are available in the Linux kernel: instrumentation with tracepoints, kprobes, kretprobes, function tracing, performance counters through perf, as well as user-visible ABIs, namely Ftrace, Perf, and eBPF. There are however areas in which the LTTng kernel tracer has unique capabilities which other tracers lack.
Efficiently tracing system call entry/exit while fetching system call input/output parameters from user-space is a use-case the LTTng kernel tracer can cover, thanks to its ring buffer design which allows preemption.
Discuss the challenges and establish a roadmap towards upstreaming the pieces of the LTTng kernel tracer required to trace system calls into the Linux kernel.
[1] https://lttng.org
Problem Statement:
Linux tracing provides a mechanism to have multiple instances of 'tracing', and each instance has an individual events directory, known as the 'Eventfs Tracing Infrastructure', i.e. '/sys/kernel/debug/tracing/events'.
The 'Eventfs Tracing Infrastructure' contains a lot of files/directories; although the exact number depends upon the kernel config, it typically exceeds 10k, which consumes memory in the MBs. Further, creating a new instance of Linux tracing creates its own copy of 'events'.
Solution:
Based on usage, create only the relevant directories/files at runtime and delete them when they are no longer required. This is based upon the virtual file system.
POC/Code:
Please refer to the prototype code here:
https://gitlab.com/akaher/linux-trace/-/commits/ftrace/eventfs/
With the new DYNAMIC_FTRACE_WITH_ARGS feature that x86 has (and hopefully other archs will have soon), the function tracer callback gets all the registers needed to see the arguments by default (but not all registers). In theory, we can use something like BTF, which can describe the arguments of every function, and use it to trace them.
Currently, BPF can do this on a function by function basis, where it retrieves the arguments via generated code (with the help of BTF). But for function tracing, generated code is not needed. Just a quick lookup of how the arguments are defined, and how to use the pt_regs to retrieve them.
Secondly, once the arguments are retrieved, a generic way to write this to the ring buffer would also be needed.
All the functionality to do this is now available in the kernel (DYNAMIC_FTRACE_WITH_ARGS and BTF). How to implement it, is another question that needs to be solved, and this session will focus on that.
Currently there are three infrastructures that can trace the exit of a function:
kretprobes
function_graph
BPF direct trampolines
Each one does it differently, and they can stumble over each other when they trace the same function call return. There should be a way that all three can somehow use the same infrastructure. At least maybe two of them?
There's been prototypes to do this, but nothing satisfactory as of yet. Perhaps a meeting of the minds can help make this work?
The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.
This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.
Prior to LWT (Lightweight Tunnels) and modern eBPF, the only way to send encapsulated packets to multiple destinations was achieved by creating multiple tunnel devices which didn’t scale well when thousands of different destinations were needed.
In the past Google solved this problem by introducing custom patches on top of the ip gre device to allow sockets to provide the destination address and encapsulation protocol to change the encapsulation headers in flight, but thanks to advancement of eBPF this logic can be completely implemented outside of the kernel in a less intrusive way and with all of the benefits that come with eBPF.
In this presentation I’m going to talk about how eBPF was used to encapsulate packets using the eBPF TC filter and the cgroup hooks, discuss the differences between this approach and LWT, and explain how this feature was easily extended to support a more interesting feature: “encapsulation header reflection”, which stores the encapsulation headers of incoming traffic and reflects them on the responses, making it transparent for the application. During the talk I'm also going to discuss the pain points found during the implementation, which led us to non-obvious solutions.
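For flavour, here is a hedged sketch of the kind of TC egress program involved; this is not the actual implementation, and the map layout, key choice (socket cookie) and fixed header size are assumptions made for illustration only:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct encap_state {
        __u8 hdr[20];                    /* outer header bytes, assumed fixed size */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, __u64);              /* socket cookie (assumption) */
        __type(value, struct encap_state);
        __uint(max_entries, 65536);
} encap_map SEC(".maps");

SEC("tc")
int encap_egress(struct __sk_buff *skb)
{
        __u64 cookie = bpf_get_socket_cookie(skb);
        struct encap_state *st = bpf_map_lookup_elem(&encap_map, &cookie);

        if (!st)
                return TC_ACT_OK;        /* no encapsulation requested for this socket */

        /* Grow the packet between the L2 and L3 headers to make room. */
        if (bpf_skb_adjust_room(skb, sizeof(st->hdr), BPF_ADJ_ROOM_MAC, 0))
                return TC_ACT_SHOT;

        /* Write the outer header right after the Ethernet header. */
        if (bpf_skb_store_bytes(skb, ETH_HLEN, st->hdr, sizeof(st->hdr), 0))
                return TC_ACT_SHOT;

        return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";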
The goal is to have an open discussion about how the problem was solved and the obstacles faced and highlight possible eBPF/kernel features that would have been nice to have i.e BPF_MAP_TYPE_NS_STORAGE (namespace storage).
This talk highlights a few rough edges in the overall BPF user experience that we have observed while building services with BPF at Cloudflare. We will showcase a set of problems, analyze their cause, and present possible workarounds. The goal of the talk is to share collected know-how with other users, and trigger discussions on potential improvements.
Collected cases fall into two distinct categories:
Within the first group we are going to cover such topics as:
In the second category, we’ll cover various clang / LLVM optimizations that cause generated C to fail with only small input changes:
We’ll also discuss how we’re switching to a hybrid static C & generated eBPF model, and fuzzing the eBPF generator.
In this talk, we describe important challenges in L4 and L7 load balancing for the consistent routing of packets across hosts as well as across sockets within a host, once a packet is received in the XDP based L4LB. We then describe how we leverage recent additions on the BPF programs to address those challenges.
Typically some form of Consistent Hashing is used to pick an end host for incoming packets within an L4 LB [2]. Such mechanisms, however, pose challenges in maintaining routing consistency over a long window of time without sharing routing states among the L4LBs. In Facebook, we devised a novel server-id based routing scheme for completely stateless routing of both TCP and QUIC connections. For routing of TCP packets, we leverage tcp_hdr_opt [1] to encode the server_id between the endpoints.
‘Zero downtime restart’ [3], supported by many L7 proxies such as Proxygen in Facebook, requires a lot of custom userspace solutions for routing consistency, especially for UDP payloads. Further, maintaining uniform load across individual sockets and CPU cores in a host is not straightforward without custom solutions. We describe how we leverage SOREUSEPORT_SOCKARRAY to create a framework that allows us to efficiently and effectively address both problems by:
a) Being able to make routing decision in picking up a socket on per packet (UDP) and per connection (TCP) basis
b) Being able to granularly target individual CPU core to handle incoming packets
This has allowed us to run at scale with minimal operation load and further simplify our implementation to execute disruption free restart of L7 proxy [3].
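As a rough sketch of how the socket-steering piece fits together (this is not Facebook's code; the per-CPU key choice and map size are assumptions for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
        __type(key, __u32);
        __type(value, __u64);            /* socket FDs are installed from userspace */
        __uint(max_entries, 128);
} socks SEC(".maps");

SEC("sk_reuseport")
int select_sock(struct sk_reuseport_md *md)
{
        /* Steer the packet/connection to the socket slot owned by the current
         * CPU (assumption: userspace pinned one socket per CPU into the map). */
        __u32 key = bpf_get_smp_processor_id();

        bpf_sk_select_reuseport(md, &socks, &key, 0);

        /* If the lookup failed, SK_PASS lets the kernel fall back to its
         * default hash-based socket selection. */
        return SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";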
References
The BPF verifier is an integral part of the BPF ecosystem, aiming to
prevent unsafe BPF programs from executing in the kernel. Due to its
complexity, the verifier is susceptible to bugs that can allow malicious
BPF programs through. A number of bugs have been found in the BPF
verifier, some of which have led to CVEs (1, 2, 3). These bugs are severe,
since the verifier is on the critical path for ensuring kernel security.
Due to its design, the verifier is also overly strict: it may reject many
safe BPF programs because it lacks sophisticated analyses to recognize
their safety. When a BPF program is rejected by the verifier, it can be a
frustrating experience (4). To get their program accepted by the verifier,
developers often have to resort to ad-hoc fixes, tweaking C source code
or disabling optimizations in LLVM. This solution becomes brittle as
developers write more complex BPF programs and new optimizations are
introduced in LLVM.
In this talk, we argue that a more systematic approach is to freeze
the kernel side of the BPF verifier and move most of its complexity
to user space. To do so, we introduce formal, machine-checkable proofs
of the safety of BPF programs. Applications provide proofs that their
BPF programs are safe, and a proof checker in the kernel validates the
proofs. By decoupling proof validation from generation, this achieves
two goals. First, the kernel side of the interface is fixed to be a
specification of BPF program safety and the proof checker, avoiding
the ever-growing complexity of the BPF verifier in the kernel. Second,
applications can choose an appropriate strategy to generate proofs for
their BPF programs. Since the proofs are untrusted, there is no risk of
applications introducing bugs from complex proof strategies.
We have been building a prototype BPF verifier using this approach.
Our prototype uses the logic of the Lean theorem prover (5), which has
been thoroughly analyzed (6) and has multiple independent implementations
of proof checkers (7). We are developing two automated strategies for
generating proofs. The first strategy mimics the current BPF verifier.
It implements an abstract interpreter for BPF programs that uses ranges
and tristate numbers to approximate sets of values of BPF registers. The
second strategy uses symbolic execution to encode the semantics of a
BPF program as boolean constraints, which are discharged using a SAT
solver. Both strategies produce proofs that are validated by the proof
checker, avoiding the possibility of introducing bugs like those that
have been found in the current verifier.
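For readers unfamiliar with tristate numbers: the verifier tracks each register as a (value, mask) pair in which mask bits mark unknown bit positions. The sketch below, closely modeled on the kernel's kernel/bpf/tnum.c, shows how addition propagates that uncertainty; it is exactly this kind of abstract operation that the first proof strategy re-implements and justifies with a machine-checkable proof:

/* A concrete number n is represented by the tnum iff (n & ~mask) == value. */
#include <stdint.h>

struct tnum {
	uint64_t value;
	uint64_t mask;
};

static struct tnum tnum_add(struct tnum a, struct tnum b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;	/* positions where a carry may differ */
	uint64_t mu = chi | a.mask | b.mask;

	return (struct tnum){ .value = sv & ~mu, .mask = mu };
}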
Our goal is to present an alternative approach to building the BPF
verifier, and explore the advantages and limitations of this approach.
We would like to start a discussion on ways to combine both approaches
in a pragmatic way.
We present Pixie’s protocol tracer, which uses eBPF to provide instant observability into application messaging without requiring code instrumentation. Pixie’s protocol tracer uses eBPF kprobes on networking-related system calls to capture communication data, which it then parses into protocol messages. The messages are inserted into structured data tables that are easily queried by application developers to help them gain insight into their application behavior.
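Pixie attaches kprobes to the syscalls themselves; purely as an illustration of the general idea (and not Pixie's code), the sketch below uses the equivalent syscall tracepoint to capture the fd and byte count of every write() and forward them to user space. The context layout is an assumption based on the tracepoint's format file on x86-64:

/* Illustrative sketch only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
} events SEC(".maps");

/* Assumed x86-64 layout of the syscalls:sys_enter_write context; a real
 * program would take this from the tracepoint's format file or vmlinux.h. */
struct sys_enter_write_args {
	unsigned long long common;	/* common trace fields */
	long syscall_nr;
	unsigned long fd;
	const char *buf;
	unsigned long count;
};

struct event {
	__u32 pid;
	__u32 fd;
	__u64 count;
};

SEC("tracepoint/syscalls/sys_enter_write")
int trace_write(struct sys_enter_write_args *ctx)
{
	struct event e = {
		.pid   = bpf_get_current_pid_tgid() >> 32,
		.fd    = ctx->fd,
		.count = ctx->count,
	};

	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &e, sizeof(e));
	return 0;
}

char _license[] SEC("license") = "GPL";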
We contrast our syscall tracing approach against other approaches (e.g. libpcap and uprobes), and discuss pros and cons. We share what worked well with our approach, and also the challenges we faced, including eBPF-related challenges of tracing syscalls that have a multitude of usage patterns.
Finally, we discuss the limitations of kprobe based tracing, in particular with respect to stateful protocols like HTTP/2 and encrypted connections like those that use TLS. We describe our complementary approach that uses eBPF uprobes on user-space libraries to capture the data in these scenarios.
We hope the technical details presented here will be of value to the eBPF community, and we are eager to hear from the eBPF community about potential improvements and suggestions for future directions.
The 1999 revision of ISO C removed implicit function declarations from the language. Instead, all functions must be declared (with or without a prototype) before they can be called. In previous language versions, a function f was implicitly declared as extern int f (); if the identifier f was used in a call expression (such as f (1, 2, 3.0)).
When GCC switched the default to C99 mode, it was impossible to disable implicit function declarations by default because too many autoconf
checks (and similar compile-time inspection) failed, which often resulted in successful compilation (and testing) of programs without the intended feature set.
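A minimal illustration of the failure mode (foo_init is a made-up name): under the historic default, a configure-style probe like the one below compiles with only a warning even though no declaration of the function is in scope, so the feature test can "pass" for the wrong reason; with -Werror=implicit-function-declaration it fails as intended.

/* conftest.c - a typical compile-only feature probe.
 * foo_init() is a hypothetical library function; no header is included,
 * so no declaration is visible here. */
int main(void)
{
	/* Pre-C99 rules implicitly declare this as: extern int foo_init();
	 * gcc -c conftest.c                                  -> warning only
	 * gcc -Werror=implicit-function-declaration -c conftest.c -> error */
	return foo_init(0);
}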
Over the years, not much progress has been made on this issue. For example, a GCC configure
test was only fixed in 2019. Recently, Apple Xcode enabled -Werror=implicit-function-declaration
by default, but apparently without fixing resulting problems across free software upstreams.
This session intends to explore whether it is time to make a concerted effort at this problem, and how to approach it.
Intended format: short prepared presentation (10 minutes), followed by discussion.
A demonstration of debugging OpenMP/OpenACC kernels using GDB, and a quick overview of how it was achieved and what still needs to be done.
We discuss the implementation of a new inter-procedural mod/ref pass. The pass collects information about memory locations modified or read by a given function, as well as information useful for points-to analysis (such as whether a given parameter can escape to global memory or to the return value of the function).
The first version of the mod/ref pass was contributed to GCC 11 and is enabled by default. We also discuss improvements done for GCC 12 and some basic benchmarks.
This is a joint work with David Čepelík.
CORE-V is a family of RISC-V processor cores developed to commercially robust standards by the Open Hardware Group, a consortium of industrial and academic organizations.
In the first part of this talk we give an update on the work on the GNU tool chain for the CV32E40P, the first of the CORE-V family with custom extensions for branching, autoincrement load/store, hardware loops, multiply accumulate and general CPU use. This is a joint effort by Embecosm and the University of Bologna, and has relied on the GVSoC simulator developed as part of the PULP project.
The second part of the talk looks at the use of GVSoC as a GCC tool chain test target. GVSoC is a RISC-V virtual platform, a fully open-sourced tool designed to drive future architectural research in the area of highly parallel and heterogeneous RISC-V based IoT platforms. Consisting of a highly configurable event-driven full-platform simulator, GVSoC is capable of performing extremely accurate timing simulations. Reaching 25 MIPS and 100% functional accuracy, the virtual platform supports simulating a broad range of hardware IP blocks, including standalone RISC-V cores, multi-core accelerator clusters, memories, DMAs, and many other components. While efficient C++ models describe hardware IP blocks and flexible Python scripts instantiate components, a powerful built-in Instruction Set Simulator (ISS) enables simulating complete Parallel Ultra-Low-Power (PULP) systems.
To support GNU tools test suite execution targeting the CV32E40P core, we expanded the GVSoC ISS, integrating the CORE-V Instruction Set Architecture (ISA) extensions. Along with it, we extended the DejaGnu testing framework, adding a custom baseboard that describes linker and compiler options. Lastly, we relied on a pre-compiled platform-dependent runtime, linked by the DejaGnu tool at testing time, to enable faster execution.
A central part of this work is that the tool chain should be upstreamed as a vendor variant, thus riscv32-corev-elf-gcc rather than riscv32-unknown-elf-gcc. In this talk we will look at the work remaining before this can be submitted.
This is more of a placeholder than anything else: there's an email thread going around that was a bit inconclusive as to whether or not we should have one of these, so I figured it'd be easier to just make one.
There are a number of optimizations done in the middle end that would benefit from understanding the amount of register pressure. Unrolling, inlining, and parallel reassociation are some that come to mind immediately. I think it would be good to have a discussion about how these optimizations might get pressure information to know how aggressive they should be.
The Kernel Dependability and Assurance Microconference focuses on infrastructure for assuring software quality and for ensuring that the Linux kernel is dependable in applications that require predictability and trust.
Introduction to the track and welcome speakers and audience.
Redundancy and diversity are a well-recognized way to detect and control systematic software failures. Runtime Verification Monitors provide a diverse redundancy mechanism for critical components in the kernel.
This session will give an overview of the KernelCI and CKI projects, how to obtain code coverage figures, and what the current gaps and possible improvements are in view of the coverage and traceability requirements to be met in functional safety systems.
I'm the author of GCC's static analysis pass, -fanalyzer. I've been experimenting with extending it to add kernel-specific diagnostics: detecting infoleaks and unsanitized syscalls at compile-time. I'd like to discuss these and other ideas for improving the test coverage of our kernel builds.
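For context, -fanalyzer already performs interprocedural, path-sensitive checks on plain C; the toy program below (nothing kernel-specific) shows the kind of double-free it reports today, and the proposed kernel diagnostics would extend the same machinery to patterns such as copying structures with uninitialized padding to user space.

/* gcc -fanalyzer double-free.c
 * The analyzer reports the second free() together with the path that
 * reaches it (allocation, first free, second free). */
#include <stdlib.h>

int main(void)
{
	char *p = malloc(16);

	if (!p)
		return 1;
	free(p);
	free(p);	/* -Wanalyzer-double-free */
	return 0;
}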
Security and safety engineering, as well as quality management, share a common goal: Avoiding or eliminating bugs and complete bug classes in software. Hence, these fields of engineering may share methods, tools, well-known best practices, and development efforts during the software development. However, these fields of engineering also have
different (partly competing) goals and priorities. Understanding these different goals and priorities of different stakeholders can be summarized as “A bug is NOT a bug is NOT a bug”.
Let us go through what is there in the kernel community and discuss alignment of on-going and future work, in a structured moderated way.
In this discussion, I would like to touch on:
- Various attempts of defining “a bug” and its implications
- Classifying “bugs” into bug classes
- Assessing suitable bug tracking methods, tools and best practices for different bug classes.
- Assessing impact for different bug classes and decisions in follow-up work to the bug fixing that may be taken depending on the bugs’ impact and the stakeholders’ priorities.
Freedom From Interference (FFI) is a key claim that must be satisfied in functional safety systems supporting applications with mixed criticality: this session introduces cgroups and namespaces to have an open discussion on how they can contribute to FFI.
This session gives you an overview of the Kselftest and KUnit frameworks and how to use them for unit and regression testing.
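For those new to KUnit, a test is an ordinary kernel translation unit: test cases are plain functions grouped into a suite, as in the minimal illustrative example below, which can be run on the host with the kunit.py wrapper from tools/testing/kunit.

/* example_test.c - minimal KUnit suite (illustrative). */
#include <kunit/test.h>

static void example_add_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 4, 2 + 2);
}

static struct kunit_case example_test_cases[] = {
	KUNIT_CASE(example_add_test),
	{}
};

static struct kunit_suite example_test_suite = {
	.name = "example",
	.test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);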
Kernel Dependability & Assurance Wrapup
The Red Hat kernel team recently converted their RHEL workflow from PatchWork to GitLab. This talk will discuss what the new workflow looks like with integrated CI and reduced emails. New tooling had to be created to assist the developer and reviewer. Webhooks were utilized to automate as much of the process as possible making it easy for a maintainer to track progress of each submitted change. Finally using CKI, every submitted change has to pass CI checks before it can be merged.
We faced many challenges, especially around reviewing changes. Resolving those led to a reduction of email usage and an increase in cli tools. Demos of those tools will be included.
Attendees will leave with an understanding of how to convert or supplement their workflow with GitLab.
DAMON[1] is a framework for general data access monitoring of kernel
subsystems. It provides best-effort high quality monitoring results while
incurring only minimal and upper-bounded overhead, due to its practical
overhead-accuracy tradeoff mechanism. On a production machine utilizing 70 GB of memory, it can repeatedly scan accesses to the whole memory every 5ms, while consuming only 1% of a single CPU's time.
On top of it, a data access pattern-oriented memory management engine called
DAMON-based Operation Schemes (DAMOS) is implemented. It allows clients to
implement their access pattern oriented memory management logic with very
simple scheme descriptions. We implemented fine-grained access-aware THP and
proactive reclamation using this engine in three lines of scheme and achieved
remarkable improvements[2].
As of this writing (2021-05-28), the code is not in the mainline but available
at its development tree[3], and regularly posted to LKML as patchsets[4,5,6].
Nevertheless, the code has already been merged into the public Amazon Linux kernel trees[7,8], and all Amazon Linux users can use DAMON/DAMOS out of the box. We are also supporting the two latest upstream LTS stable kernels[9,10].
In this talk, I will briefly introduce DAMON/DAMOS and present how you can
write a fine-grained data access pattern oriented lightweight kernel module on
top of DAMON/DAMOS. During the talk, I will write an example module and evaluate its performance live. A data access-aware proactive reclamation kernel module for production use will also be introduced as a use case. After that, I
will discuss my future plans for improving DAMON and improving other kernel
subsystems using DAMON/DAMOS.
[1] https://damonitor.github.io/
[2] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
[3] https://github.com/sjp38/linux/tree/damon/master
[4] https://lore.kernel.org/linux-mm/20210520075629.4332-1-sj38.park@gmail.com/
[5] https://lore.kernel.org/linux-mm/20201216084404.23183-1-sjpark@amazon.com/
[6] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
[7] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[8] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
[9] https://github.com/sjp38/linux/tree/damon/for-v5.4.y
[10] https://github.com/sjp38/linux/tree/damon/for-v5.10.y
User Interrupts is a hardware technology that enables delivering interrupts directly to user space.
Today, virtually all communication across privilege boundaries happens by going through the kernel. This includes signals, pipes, remote procedure calls and hardware interrupt based notifications.
User interrupts provide the foundation for more efficient (low latency and low CPU utilization) versions of these common operations by avoiding transitions through the kernel. User interrupts can be sent by another user space task, kernel or an external source (like a device).
The intention is to describe the general infrastructure being developed to receive user interrupts and to deep-dive into a single source: interrupts from another user task.
The goal of this session is to:
- Get feedback on the overall software architecture.
- Discuss the main open issues.
New storage features, especially in NVMe, are emerging fast. It takes time and a good deal of consensus-building for a device feature to move up the ladder of the kernel I/O stack and show up to user space. This presents challenges for early technology adopters.
The passthrough interface allows such features to be usable (at least in a native way) without having to build block-generic commands, in-kernel users, emulations and file-generic user interfaces. That said, even though the passthrough interface cuts through layers of abstraction and reaches NVMe fast, it has remained tied to the synchronous ioctl interface, making it virtually useless for the fast I/O path.
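For reference, today's synchronous path looks roughly like the sketch below: the command is built in user space and submitted through the NVME_IOCTL_ADMIN_CMD ioctl (here an Identify Controller command against an example device node /dev/nvme0), and the calling thread blocks until completion, which is what makes it unattractive for the fast I/O path:

/* Synchronous NVMe passthrough example: Identify Controller (CNS=1).
 * Error handling trimmed; /dev/nvme0 is just an example device node. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	unsigned char id[4096];
	struct nvme_passthru_cmd cmd = {
		.opcode   = 0x06,		/* Identify */
		.nsid     = 0,
		.addr     = (unsigned long)id,
		.data_len = sizeof(id),
		.cdw10    = 1,			/* CNS = 1: Identify Controller */
	};
	int fd = open("/dev/nvme0", O_RDWR);

	if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("nvme passthru");
		return 1;
	}
	printf("model: %.40s\n", id + 24);	/* MN field at byte offset 24 */
	close(fd);
	return 0;
}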
In this talk I will present the elements towards building a scalable
passthrough that can be readily used to play with new NVMe features.
More specifically, recent upstream efforts involving:
Performance evaluation comparing this new interface with existing ones
will be provided.
I would like to gather feedback on the design decisions, and discuss
how best to go about infusing more perf-centric advancements (e.g.
async polling, register-buffer etc.) into this path.
[1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
[2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
[3] https://lore.kernel.org/linux-nvme/20210325170540.59619-1-joshi.k@samsung.com/
You're a company that is working on a suite of products that spans every conceivable gadget built for the smarthome - from a simple thermostat to security alarms, from set top boxes to internet gateways, mobile phones and tablets and even servers running in the cloud.
Linux is a fairly obvious choice to build these products on top of - it scales well for devices with more than 128MB of RAM and storage. On devices at the resource-constrained end of the spectrum, Zephyr is quickly maturing into a competitive option, able to run even on devices with as little as a few hundred KB of RAM and storage. Both ecosystems have diverse and active communities and an open governance model, so it is a no-brainer to use them as the basis for the entire suite of products by the company.
Considering Linux and Zephyr as two parts of a single product platform allows for a coherent view of both ecosystems by developers. You want to make sure that you can apply the same set of software configurations and policies across both ecosystems e.g. library versions, compatible protocol suites, security configurations, OTA mechanisms and even a single set of IP compliance tools.
As an example, when you decide you want to secure all your network communications out-of-the-box in your product platform, you need to:
Now repeat this exercise across every key component of the OS - security policy, networking features, OTA, toolchain hardening, IP compliance tools and you end up with a meta-project that spans and contributes to both ecosystems.
We've started to build such an open product platform with opinionated defaults that follow community best practices at https://ostc-eu.org. This is our story about the challenges we've seen in getting to a coherent configuration that will work across the entire suite of products, across Linux and Zephyr, and how we want to improve this interoperability in the future.
NVDIMM (Non-Volatile DIMM) is an especially interesting device because it has the characteristics of not only memory but also storage.
To support NVDIMM, the Linux kernel provides three access methods for users.
- Storage (Sector) mode
- Filesystem DAX(=Direct Access) mode
- Device DAX mode.
Of the above three methods, Filesystem DAX is the most anticipated access method, because applications can write data to the NVDIMM area directly, and it is easier to use than Device DAX mode. So, some software already uses it with official support. However, Filesystem DAX still has "experimental" status in the upstream community due to some difficult issues.
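A small illustration of why Filesystem DAX is attractive (the path and sizes are made up): with a file on a DAX-mounted filesystem, an application can map persistent memory directly and store to it with ordinary CPU instructions, with no page cache or read/write syscalls involved; durability still requires flushing CPU caches, e.g. via libpmem.

#define _GNU_SOURCE
/* Assumes /mnt/pmem is an ext4/xfs filesystem mounted with -o dax. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	/* MAP_SYNC only succeeds on a true DAX mapping, so a successful
	 * mmap() guarantees stores go straight to persistent memory. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	strcpy(p, "hello, pmem");	/* plain store, no write() syscall */
	munmap(p, 4096);
	close(fd);
	return 0;
}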
In this session, Yasunori Goto will talk about the forefront of NVDIMM development, and Ruan Shiyang will talk about his challenge and the latest status since Open Source Summit Japan 2020.
Let's face it, using synchronization primitives such as RCU can be frustrating. And it is only natural to wish to get back, somehow, at the source of such frustration. In short, it is quite understandable to want to torture RCU. (And other synchronization primitives as well, but you have to start somewhere!) Another benefit of torturing RCU is that doing so sometimes uncovers bugs in other parts of the kernel. You see, RCU is not always willing to suffer alone.
This talk will give an overview of how to torture RCU using the rcutorture test suite. It will also present a few of rcutorture's tricks that permit short tests on a smallish number of modest systems to nevertheless provide some assurance that RCU will run robustly on billions of systems across the inner solar system.
Protection Key Supervisor provides fast, thread-specific manipulation of permission restrictions on kernel pages.
Multiple patch sets have been reviewed recently targeting an initial use case to provide stray write protection to persistent memory.
Persistent memory is mapped into the direct map and, unlike regular DRAM, it is particularly vulnerable to programming errors which would result in the corruption of data.
Additional use cases have been explored and will be included in the presentation, specifically the hardening of page tables and other sensitive kernel data.
The Performance and Scalability microconference focuses on enhancing performance and scalability in both the Linux kernel and userspace projects. In fact, one of the purposes of this microconference is for developers from different projects to meet and collaborate – not only kernel developers but also researchers doing more experimental work. After all, for the user to see good performance and scalability, all relevant projects must perform and scale well.
Because performance and scalability are very generic topics, this track is aimed at issues that may not be addressed in other, more specific sessions. The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.
Traditionally, all RAM is DRAM. Some DRAM might be closer/faster than
others, but a byte of media has about the same cost whether it is close
or far. But, with new memory tiers such as High-Bandwidth Memory or
Persistent Memory, there is a choice between fast/expensive and
slow/cheap.
We use the existing reclaim mechanisms for moving cold data out of
fast/expensive tiers. It works well for that. However, reclaim does
not work well for moving hot data which might be stuck in a slow tier
since the pages near the top of the LRU are the most recently accessed
only if there’s regular memory pressure on the slow/cheap tiers.
Fortunately, NUMA Balancing can find recently-accessed pages
regardless of memory pressure. We have repurposed it from being used
for location-based optimization to being used for tier-based
optimization. We have also optimized it for better hot data
identification, such as to find frequently-accessed pages instead of
recently-accessed pages, etc.
We will show our findings so far, and discuss the remaining problems,
potential solutions, and alternatives.
The patchset email threads are as follows,
https://lore.kernel.org/linux-mm/20210625073204.1005986-1-ying.huang@intel.com/
https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
Large installations require considerable monitoring and control, and the occasional scan of procfs files is often the best tool for the monitoring job at hand. In cases where memory consumption is a concern, /proc/PID/{maps,numa_maps,smaps,smaps_rollup} can be quite helpful.
To your monitoring, anyway.
Unfortunately, some mm-related procfs files need to acquire the dreaded mmap_sem. This can be a problem if the Very Important Process being monitored needs to modify its address space. Especially if your monitoring software has been fenced into a highly CPU-constrained cgroups-based container, in order to avoid interfering with Very Important Processes. Except that all of these procfs files acquire sleeplocks that might also be acquired by your Very Important Process. Plus your monitoring software might be preempted while holding one of these sleeplocks, that after all being the whole point of the aforementioned container. This can (and does) result in severe performance degradation.
Infrequently and intermittently.
We therefore have an abusive stress test that forces this condition to occur on small systems in less than one minute's time [1].
This proposal, if accepted, will demonstrate this test program and a few schemes intended to make procfs-based monitoring safe for Very Important Processes [2].
[1] https://github.com/paulmckrcu/proc-mmap_sem-test
[2] https://git.infradead.org/users/willy/linux-maple.git/shortlog/refs/heads/proc-vma-rcu
The maple tree is an RCU-safe range-based B-Tree that was designed to fit a
number of Linux kernel use cases. Most recently the maple tree has been sent
upstream as a patch set that replaces the vma rbtree, the vma linked list, and
the vmacache while maintaining the current performance level. This performance
should improve as the RCU aspect of the tree is leveraged to remove mmap_sem
contention.
This talk will cover the performance aspects of the tree, some future ideas,
and other areas beyond the VMA that would benefit from the tree.
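To give a feel for the interface (the names below follow the API as posted in the patch set, so details may shift before merging): a maple tree maps ranges of unsigned long indices to pointers, which is exactly the shape of the VMA problem of mapping start/end ranges to vm_area_struct pointers.

/* Kernel-context sketch, not userspace code; error handling abbreviated. */
#include <linux/maple_tree.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static DEFINE_MTREE(mt);

static int example(void *obj)
{
	void *found;

	/* Store obj for the whole range 0x1000..0x1fff (inclusive). */
	if (mtree_store_range(&mt, 0x1000, 0x1fff, obj, GFP_KERNEL))
		return -ENOMEM;

	/* Any index inside the range finds the same entry. */
	found = mtree_load(&mt, 0x1800);

	mtree_erase(&mt, 0x1000);
	return found == obj ? 0 : -EINVAL;
}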
It is currently possible to do fast hypervisor update by preserving virtual machine state in memory during reboot. This approach relies on using emulated PMEM, DAX, and local live migration technologies.
As of today, there are a number of limitations with this approach:
The interface to preserve VM memory is not very flexible. The size and location of PMEM must be determined prior to hypervisor boot and cannot be changed later. Setting PMEM size and location requires intimate knowledge of the memory layout of the physical machine and thus, the settings are not portable.
The upstream kernel cannot preserve the state of devices. While there was work done by Intel in this direction, the work has not been upstreamed or discussed on public mailing lists. It also has some major limitations: 1) it is Intel IOMMU specific, 2) reboot through firmware is not supported, only kexec reboot works, and 3) device state is preserved in a different memory area from the VM.
There is no way to preserve states of virtual functions.
In this presentation, we will show a demo of fast hypervisor update. We will have a discussion about how the three stated problems can be resolved.
The goal is to be able to preserve virtual machine state, and that of any devices attached to it, across a kexec reboot and, if the firmware supports it, across a firmware reboot. Also, the approach should be expandable to work on any platform with KVM and IOMMU support.
Preserved-over-kexec memory storage or PKRAM provides an API for saving memory pages of the currently executing kernel so that they may be restored after kexec into a new kernel. PKRAM provides a flexible way for doing this without requiring that the amount of memory used be a fixed size created a priori.
One use case for PKRAM is preserving guest memory and/or auxiliary supporting data (e.g. iommu data) across a kexec reboot of the host, and there is interest in extending it to work with emulated or real persistent memory.
Let's discuss the current state of PKRAM, its limitations, and future direction.
Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This talk will discuss CNA
(compact NUMA-aware lock) as the slow path alternative for the current
implementation of qspinlocks in the kernel.
CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Experimental results with micro and macrobenchmarks
confirm that the throughput of a system with contended qspinlocks can increase
up to ~3x with CNA, depending on the actual workload.
The VFIO/IOMMU/PCI micro-conference focuses on coordination between PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), so that the related kernel interfaces and userspace APIs are designed in sync and in a clean way for all three sub-systems. It also covers the kernel code that enables new system features which often require coordination between the VFIO, IOMMU and PCI sub-systems.
Version 8.2 of the Armv8 architecture introduced some mysterious bits to the PTE entries used by the CPU and SMMU which result in IMPLEMENTATION DEFINED behaviours. These bits are known as Page-Based Hardware Attributes (PBHA) and their opaque nature has resulted in them being disabled upstream.
This session will include a quick reminder of the arm64 MMU, before introducing the concept of PBHA and outlining some possible use-cases in hardware along with the challenges in supporting them in Linux. The hope is both to attract additional use-cases from the audience, but also to discuss the scope of support that may be possible upstream.
5 minute break
DOE (PCI ECN) provides a standard mailbox definition, so far used for query / response type protocols. There can be multiple instances of a DOE on each PCI function, and each instance can support multiple protocols. Currently we have published definitions of the Discovery, CMA, IDE (available from the PCI SIG) and CDAT protocols (available from UEFI forum). Some of these protocols are intended for Linux kernel access (e.g. CDAT), others are less clear but there are possible use cases (CMA, IDE).
Patches to support DOE mailboxes in PCI extended config space have raised questions about how to ensure that these mailboxes, which may be of interest to various software entities (userspace / kernel / firmware / TEE etc), can be safely used.
The DOE design does not easily allow for concurrent use by different software entities (even if possible, we cannot rely on other software elements doing this safely), so it seems some level of mediation is required. The topics for discussion include:
Do we want to enable any direct userspace access to these mailboxes or should we address on a per protocol basis (if at all)?
Do we need to 'prevent' userspace being able to access these registers whilst the DOE is in use?
How do we know the kernel should not touch a given mailbox (in use by other system software)? Perhaps a code first submission to ACPI to define a mediation mechanism? Is this sufficient for expected use cases? (What other suggestions do people have?)
A very brief overview of DOE and proposed kernel support will be presented to make sure everyone is aware of the background - then straight into the discussion of the above questions.
The PCI ECN defining CMA adds the ability (using a DOE mailbox) to establish the identity and verify the component configuration and firmware / executables.
This is done using the protocols defined in the DMTF SPDM 1.1 specification: https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.1.1.pdf which is also used for the same purpose on other buses such as USB, but we are not aware of any work to support those buses yet. The design is extensible to other buses with an abstracted transport layer (via a single function pointer).
The CMA use of the SPDM 1.1 protocol defines a certificate based public private key authentication mechanism including signed measurements of PCIe component state (firmware and other implementation defined elements) and setup of secure channels for continuing runtime measurement gathering and for other related PCI features such as Integrity and Data Encryption IDE.
An initial implementation will be posted shortly for review, and there are a number of open questions that may benefit from a discussion in this forum:
Is there a sufficiently strong case to support CMA natively in the kernel at all?
Some approaches might push this facility into a trusted execution environment. However, VFs can implement CMA to provide this level of authentication and measurement when in use by a VM. It would be useful to understand other use cases, as they motivate the software design and testing.
Approach to providing authentication of device certificates? SPDM uses x509 certificates and so relies on a chain of trust. What trust model should we apply? Current code assumes a separate keychain dedicated to CMA and root key insertion from userspace (probably initrd).
Method of managing / verifying measurements. The nature of the measurements is implementation defined. In some cases they are not expected to change unless the firmware is flashed, but in others they may change with device configuration. Whilst closely related to the challenges of IMA for files, is it appropriate to reuse that subsystem and tooling?
As it's related, is there interest in supporting kernel managed IDE (link encryption)?
When do we actually want to make these measurements? (On boot, on driver probe, on reset, on first use of a particular feature, on demand from userspace etc?) Currently they are done on driver probe only.
Other, more detailed questions can be addressed as part of normal discussion on list.
References:
https://lore.kernel.org/linux-pci/CAPcyv4i2ukD4ZQ_KfTaKXLyMakpSk=Y3_QJGV2P_PLHHVkPwFw@mail.gmail.com/
https://lore.kernel.org/linux-pci/20210520092205.000044ee@Huawei.com/
10 minutes break
Sharing virtual addresses between DMA and the user process is undoubtedly beneficial. It improves security by limiting DMA to the process virtual address space; the programming model is simplified by eliminating the need for explicit map/unmap operations, with behind-the-scenes I/O page fault handling. Potential performance gains come after that.
However, applying the same logic to kernel SVA is not without controversy. The DMA API is the de facto way of doing kernel DMA. It already provides portability and security by means of IOVA. The DMA API is IOMMU-agnostic and does not support the IOMMU-specific key concept of Process Address Space ID (PASID), which SVA relies on. IOVA is supported by the IOMMU with page tables separate from the CPU's.
In order to support SVA, IOMMU has to walk CPU page tables which undermines security if we allow sharing the entire kernel virtual address (KVA) space. IOTLB flush is also a gap since mmu_notifier is not available for kernel memory.
This proposed session explores the multiple candidates that can make kernel SVA compatible with the DMA API, keep KVA usage safe, and address the gap in IOTLB synchronization.
5 minutes break
Current MSI-X support allows only one chance to allocate all required interrupt resources. The rework in progress introduces a new API to allow adding new interrupt resources on demand. We will run through some of the options. VFIO's usage isn't correct today; we will give a quick review of proposed VFIO changes for MSI and MSI-X to make sure there is proper feedback to VMs.
IMS has some unresolved open issues:
- DSA format vs. device-specific format.
- Support for IMS layout in system memory
No slides planned since we need Thomas for this discussion. :-)
10 minutes break
When a device is passed through to user space, DMAs from this device are untrusted by the kernel. I/O page tables must be enabled in the IOMMU so each assigned device can only access the I/O virtual address space that is created by respective device passthrough frameworks (VFIO, vDPA, etc.).
Until now I/O page tables have been considered a device attribute, and thus managed through VFIO/vDPA specific uAPIs. However, this model doesn't scale toward advanced I/O virtualization usages, e.g. subdevice passthrough which requires more than one I/O page table per device, SVA virtualization which needs to support user-provisioned I/O page tables (nested on a kernel page table), and I/O page faults which are necessary for improved memory utilization, etc. Better to avoid reinventing the wheel in every framework.
Having a unified uAPI is the answer here. The proposal generalizes I/O page table management via a new interface (/dev/iommu), while allowing passthrough frameworks to connect their devices with selected I/O page tables via a simple protocol. This approach allows VFIO/vDPA to focus on aspects of device management, leaving DMA isolation enforced through the generic interface. This talk aims to get consensus on the overall design choices and execution plan across multiple subsystems.
As we have reached a consensus on the /dev/iommu proposal (https://lore.kernel.org/linux-iommu/MWHPR11MB1886422D4839B372C6AB245F8C239@MWHPR11MB1886.namprd11.prod.outlook.com/), it's time to have some discussions on the in-kernel APIs between the IOMMU core and the /dev/iommu implementation. This discussion can provide some guidance for the developers who are going to implement /dev/iommu.
5 minutes break
Certain PCIe features aren't handled well in Linux; for instance, hotplug doesn't seem to care about MRL status. There are implications for other features, such as the following:
Both need to be enabled along the entire path from the root port to the device. If a new device is hotplugged, how are MPS and 10b Tags enabled throughout the path?
Linux also lacks support for Flattening Portal Bridge (FPB), which would improve the ability to manage resources in a more structured way.
The Android microconference focuses on cooperation between the Android and Linux communities.
Come join and further discuss what was talked about at the earlier Android Microconference session. This provides space to allow folks who couldn't attend due to conflicts as well as for longer discussions that wouldn't fit into the earlier microconference session. We'll have a few topics scheduled, but also leave open some space for folks to propose their own items.
Continued discussion from the Android Microconference.
Continued discussion from the Android Microconference.
Continued discussion from the microconf
Continued discussion from microconference session
Continued discussion from the Android Microconference.
Birds of a Feather
This is to discuss the idea of limiting VMAs to growing, reference counting, and how locking could be handled for RCU-safe VMA lookups.
This BoF is to discuss pros and cons of various approaches to avoid
performance issues resulting from excess modifications of the direct
map, what APIs should these approaches provide and what is the best way
to integrate them with the existing allocators.
Let's continue the platform specification discussion in the BoF.
Some of the things that need further discussion:
• PCT
• Mandating Compatibility and branding of the RISC-V platforms
• Do we mark various combinations deprecated or not?
Further discussion on the proposed user interrupts feature
The track will be composed of talks, 40 minutes in length (including Q&A discussion). Topics will be advanced Linux networking and/or BPF related.
This year's Networking and BPF track technical committee is comprised of: David S. Miller, Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, and Andrii Nakryiko.
As eBPF is getting more popular and mainstream, one of the challenges of making it accessible to more users is how to distribute eBPF-powered applications. Unlike simpler applications, which involve shipping a binary or a container image, with eBPF we usually need to compile the program for the target kernel. This is a hurdle in adoption by both users and vendors. The CO-RE (Compile Once - Run Everywhere) initiative improved this by introducing a way to ship a compiled artifact which will work on any supporting distribution. But what is a supporting distribution, and what about unsupported distributions? How can we make eBPF CO-RE widely usable in the real world of enterprise users? In this talk we will answer these questions by introducing CO-RE and BTF mechanics, and how to leverage them in a concrete scenario in our project Tracee.
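To make the CO-RE mechanics concrete, here is a minimal illustrative program in the libbpf style: the BPF_CORE_READ() accessors emit BTF relocations instead of hard-coded offsets, and libbpf rewrites them at load time against the running kernel's BTF, which is what lets one compiled object run across kernel versions (assuming the target kernel exposes BTF, e.g. /sys/kernel/btf/vmlinux).

/* Illustrative CO-RE probe; vmlinux.h is generated with
 * "bpftool btf dump file /sys/kernel/btf/vmlinux format c". */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx)
{
	struct task_struct *task = (struct task_struct *)bpf_get_current_task();
	pid_t parent_tgid;

	/* Field offsets are recorded as CO-RE relocations, not baked-in
	 * constants, so this object adapts to the running kernel. */
	parent_tgid = BPF_CORE_READ(task, real_parent, tgid);
	bpf_printk("exec: parent tgid=%d", parent_tgid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";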
This talk will present K2, an optimizing compiler that uses program synthesis to automatically produce safe, compact, and more performant BPF bytecode. K2 compresses BPF bytecode by 6-26%, improves throughput by 0–4.75%, and reduces average latency by 1.36–55.03%, across benchmarks from Cilium, Facebook Katran, hXDP, and the Linux kernel. We designed several domain-specific techniques to make synthesis practical by accelerating equivalence-checking of BPF programs by 6 orders of magnitude.
The talk will consist of the following parts:
You may find more information including K2’s source code, the full technical paper on K2, and responses to some FAQs at https://k2.cs.rutgers.edu
We’ll discuss some recent and ongoing work we’ve been doing to audit Google’s Linux systems with eBPF. We’ll look at a case study of the problems we’ve solved for logging process lifecycles, and then look at the challenges we’re facing to make these systems as reliable and maintainable as possible. The topics we’ll cover include:
Although an IPv6 only environment is ideal, the path to migration from an IPv4 environment is gradual and will present situations where an IPv6 client will need ongoing connectivity to an IPv4-only server. Such a communication path will need to use one of the existing IPv6 to IPv4 transition mechanisms (such as NAT or a dual IPv4 + IPv6 stack).
We will demonstrate a novel approach to this migration that uses a unique transition mechanism utilizing the new SECCOMP_IOCTL_NOTIF_ADDFD flag introduced to the seccomp() system call, to intercept egress connect calls to opportunistically use a transition IPv4 address when possible, saving applications the pain of dealing with the end host not being reachable, while still living in an IPv6-only environment. Once applied at the beginning of connection establishment, the data path proceeds uninterrupted between the client and the server, distinguishing this approach from many other transition/translation mechanisms.
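To give a flavour of the mechanism (this is an illustrative fragment, not the implementation being presented): the supervisor receives the user notification for the intercepted connect(), establishes its own IPv4 connection to the translated address, and then uses SECCOMP_IOCTL_NOTIF_ADDFD to install that socket into the target process in place of the fd the application was connecting with, before answering the notification as a successful connect().

/* Supervisor-side sketch; error handling and address translation omitted.
 * v4_sock is a socket we already connect()ed to the IPv4 server. */
#include <sys/ioctl.h>
#include <linux/seccomp.h>

static int finish_as_ipv4(int notify_fd, const struct seccomp_notif *req,
			  int v4_sock)
{
	struct seccomp_notif_addfd addfd = {
		.id    = req->id,
		.srcfd = v4_sock,
		.newfd = req->data.args[0],	   /* fd passed to connect() */
		.flags = SECCOMP_ADDFD_FLAG_SETFD, /* dup2()-like: replace it */
	};
	struct seccomp_notif_resp resp = {
		.id    = req->id,
		.error = 0,			   /* report connect() success */
	};

	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd) < 0)
		return -1;
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}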
We will also share a performance analysis of this approach, limitations of what we can do with seccomp(), and future work using this mechanism.
In Linux, the IPv4 code generally uses IPTOS_TOS_MASK (0x1e) when
handling the TOS (Type of Service) of IPv4. This mask follows the
definition of RFC 1349:
0 1 2 3 4 5 6 7
+-----+-----+-----+-----+-----+-----+-----+-----+
| | | |
| PRECEDENCE | TOS | MBZ |
| | | |
+-----+-----+-----+-----+-----+-----+-----+-----+
However RFC 1349 is only one of several contradicting RFCs that
try to define how to interpret the IPv4 TOS. In the end, the IETF
settled on the DSCP+ECN interpretation (RFC 2474 and RFC 3168):
0 1 2 3 4 5 6 7
+-----+-----+-----+-----+-----+-----+-----+-----+
| | |
| DSCP | ECN |
| | |
+-----+-----+-----+-----+-----+-----+-----+-----+
That was 20 years ago, so the layout is finally stable. But as the
diagrams show, RFC 1349 is incompatible with ECN as it already uses
bit 6 in its TOS field.
Therefore, the IPv4 code also uses another mask, IPTOS_RT_MASK (0x1c),
to clear bit 6. This mask is used almost every time the kernel does an
IPv4 route lookup.
Finally, RFC 2474 and RFC 3168 (DSCP+ECN) also cover IPv6. However, the
IPv6 code generally doesn't mask the ECN bits and considers them as
part of the TOS for policy routing.
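A concrete example of the mismatch (standalone program; the mask values are the ones from include/uapi/linux/ip.h): take a packet marked with DSCP EF (46) and ECT(0). The DSCP+ECN interpretation cleanly separates the two fields, whereas the RFC 1349-era masks throw away the upper DSCP bits and, when applied without IPTOS_RT_MASK, let the ECN bits leak into route lookups.

#include <stdio.h>

#define IPTOS_TOS_MASK	0x1E	/* RFC 1349 TOS field */
#define IPTOS_RT_MASK	0x1C	/* TOS field minus the bit reused by ECN */
#define ECN_MASK	0x03	/* RFC 3168: bottom two bits */

int main(void)
{
	unsigned char tos = (46 << 2) | 0x02;	/* DSCP EF, ECT(0) => 0xBA */

	printf("DSCP          : %u\n", tos >> 2);		 /* 46 */
	printf("ECN           : %u\n", tos & ECN_MASK);		 /* 2  */
	printf("IPTOS_TOS_MASK: 0x%02x\n", tos & IPTOS_TOS_MASK);/* 0x1a: DSCP mangled, ECN kept */
	printf("IPTOS_RT_MASK : 0x%02x\n", tos & IPTOS_RT_MASK); /* 0x18: ECN dropped, DSCP still mangled */
	return 0;
}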
This situation creates several problems:
Regressions brought by patches "fixing" places where IPTOS_TOS_MASK
wasn't applied (thus breaking users that used bits 0-2).
IPTOS_TOS_MASK is spreading to IPv6 (through RT_TOS()), where it
doesn't make sense at all (IPv6 has never used the RFC 1349
layout).
In some edge cases, IPv4 route lookups are done without masking the
ECN bits (thus giving different results depending on the ECN mark).
New cases are introduced every now and then.
IPv4 and IPv6 inconsistency.
Impossibility to use the full DSCP range in IPv4.
Policy-routing can break ECN with IPv6 and in some IPv4 edge cases.
Parts of the stack define their own mask to respect the DSCP+ECN
layout, but without making it reusable.
The objective of this talk is to bring practical examples of
user-visible inconsistencies and to discuss different ways forward for
minimising them and avoiding more ECN regressions in the future.
It will be oriented towards the following goals (by decreasing order of
perceived feasibility):
Remove all uses of IPTOS_TOS_MASK for IPv6.
Prevent IPv4 policy routing from breaking ECN.
Remove IPTOS_TOS_MASK entirely from the kernel, so people don't
mistakenly copy/paste such code (but keep the definition in
include/uapi of course).
Allow full DSCP range in IPv4.
Prevent IPv6 policy routing from breaking ECN.
Prevent breaking ECN again in the future (for example by defining a
new type for storing TOS values, so that Sparse could warn about
invalid use cases).
Make TOS and ECN handling consistent between IPv4 and IPv6
(somewhat implied by the previous bullet points).
The main road blocks are code churn and drawing the line between bugs
and established behaviours.
Creating diverse communities requires effort and commitment to creating inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in Linux 5.8 release. Understanding if this sort of change has been effective is a topic of active research. This MC will take a pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with the perspectives from the Linux community members.
Equity and inclusion in Tech is not just about diversity in hiring, but has profound implications for downstream accessibility, user experience, and the next generation of products. Open source has unique challenges and opportunities to advance DEI and drive more inclusive innovation. To better understand these dynamics and the key resources and solutions needed, The Linux Foundation is conducting research across the entire open source ecosystem. This talk will share some of the key themes and preliminary results from this ongoing effort, followed by an interactive discussion with the kernel community.
In this interactive session we will discuss why women join OSS, why they stay in OSS and what are their experiences of contributing to OSS. Based on empirical evidence from surveys and interviews, we will brainstorm strategies for welcoming more women to OSS and for improving retention of women in OSS.
Participation of women in Open Source Software (OSS) is very unbalanced, despite various efforts to improve diversity. This is concerning not only because women do not get the chance of career and skill developments afforded by OSS, but also because OSS projects suffer from a lack of diversity of thoughts because of a lack of diversity in their projects. Researchers have been trying to understand the low representation rate of women in OSS, as well as to learn more about their motivations, challenges, biases and the strategies that can be adopted to attract and retain this underrepresented population. Linux Kernel is also investigating those factors to create strategies to increase women’s participation.
Mentoring is crucial for knowledge transfer in open source. But traditional dyadic mentoring formats between an expert and novice are hard to scale. In this talk, I will present the different types of mentoring in open source and focus on implicit mentoring---mentoring taking place in everyday development activities like code-reviews. I will show how implicit mentoring can be automatically identified, how widespread it is, who participates, and how to achieve mentoring at scale by building an appreciative project culture.
"Past performance is not an indicator of future" is a common disclaimer in predictions. That being said, understanding the trends and what's working is helpful for doing some goal setting. This session will present some of the diversity data that's been mined from the kernel contributions over the last 30 years, and then open up a discussion as to what are some realistic goals to set for the next year, next 10 and next 20 years, and how.
Diversity and events have a unique relationship. Events can serve to highlight the lack of diversity in the community, but at the same time create a lot of opportunities to increase diversity as well. In this session, we'll review diversity trends across a decade of events, types of diversity we measure and how those have evolved, efforts to increase diversity and what, if any, impact those have had. We'll also discuss how COVID has shifted this relationship as events pivoted to virtual. Did diversity increase or decrease? As we move back to a version of normal and a return to in-person events, how will this impact diversity + events?
The GPU/media/AI buffer management and interop microconference focuses on Linux kernel support for new graphics hardware that is coming out in the near future. Most vendors are also moving to firmware control of job scheduling, additionally complicating the DRM subsystem's model of open user space for all drivers and API. This has been a lively topic with neural-network accelerators in particular, which were accepted into an alternate subsystem to avoid the open-user space requirement, something which was later regretted.
As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community. Much of the graphics-specific integration will be discussed at XDC the prior week, but particularly with cgroup integration of memory and job scheduling being a topic, plus the already-complicated integration into the memory-management subsystem, input from core kernel developers would be much appreciated.
Quick 5 minutes introduction:
Rules of engagement
General logistics
Notes taking strategy
Where to chat/interact
Other items
In order to meet our fixed frame deadlines (e.g. vertical refresh) whilst still having low power usage, we need to keep our power management policies balanced between performance bursts and deeper sleeps. Between dma-fence being used to declare synchronisation dependencies between multiple requests, and additional hints (e.g. input events suggesting that GPU activity will happen 'soon') we can insert clock boosts to try to head off issues before they happen. Full-system tracing with e.g. Perfetto will also be discussed to get a better picture of the system's behaviour as a whole.
Supporting predictable presentation timing for graphics and media usecases requires a great deal of plumbing through the stack, right up to userspace. Whilst some higher-level APIs have been discussed, there are a number of open questions including how to handle VRR, and how to support this with mailbox-type systems like KMS and Wayland. Outline the current state and wants from all the different angles, and discuss how we could come up with lower-level primitives which allow these systems to be built.
HMM (heterogeneous memory management) was first merged in the Linux kernel in 2017 and has since been adopted by several device drivers. As it integrates the device drivers more closely with the core kernel's virtual memory management, more kernel subsystems are starting to get involved in related code reviews and take notice, e.g. file systems and page cache. As a consequence, we need to consider and document the interactions of ZONE_DEVICE pages and HMM migration semantics with those subsystems. This meeting is to establish the basis for architectural documentation of use to related kernel subsystems such as filesystem and networking.
Both future hardware and also user-visible APIs, are demanding that we discard our previous fence-based synchronisation model and allow arbitrary synchronisation primitives similar to Windows/DirectX 'timeline semaphores'. Outline the problems in trying to integrate this with our previous predictable fence-based model with dma_fence and dma_resv and discuss some potential paths and solutions.
The IoThree's Company microconference is moving into its third year at Plumbers. Talks cover everything from the real-time operating systems in wireless microcontrollers, to products and protocols, to Linux kernel and tooling integration, userspace, and then all the way up to backing cloud services. The common ground we all share is an interest in improving the developer experience within the Linux ecosystem.
Come and knock on our door!
The Internet of Things Microconference is in its third year at Plumbers. Talks cover everything from the real-time operating systems in wireless microcontrollers, to products and protocols, to Linux kernel and tooling integration, userspace, and then all the way up to backing cloud services. The common ground we all share is an interest in improving the developer experience within the Linux ecosystem.
In this introduction, we give a brief overview of the presenters and set the stage for the remainder of the MC.
Zephyr RTOS, the fast-growing, scalable, open source RTOS for resource constrained devices recently gained support for LoRa and LoRaWAN technologies. The addition of LoRa technologies enabled Zephyr to be used in applications where long range coverage is needed. With LoRa/LoRaWAN support in place, Zephyr is emerging as the preferred software stack for the LoRa End nodes while Linux continues to be the de-facto software stack for the LoRa Gateways.
In this discussion, the current status of LoRa and LoRaWAN support in Zephyr will be explored, and we will discuss how to add persistent storage support for storing parameters such as keys and devnonce to non-volatile memory. We will also touch on the ongoing work towards the addition of LoRaWAN support in the Linux kernel by the community.
mikroBUS is an add-on board socket standard by MikroElektronika that can be freely used by anyone following the guidelines. The mikroBUS standard includes SPI, I2C, UART, PWM, ADC, GPIO and power (3.3V and 5V) connections to interface common embedded peripherals. There are more than 800 add-on boards, ranging from wireless connectivity boards to human-machine interface sensors, which conform to the mikroBUS standard, out of which more than 140 boards already have device driver support in the Linux kernel. Today, the most straightforward method for loading these device drivers is to provide device-tree overlay fragments at boot time, which requires maintaining a huge out-of-tree repository of device tree fragments for each add-on board, for each supported socket, for each target; moreover, device-tree currently does not support instantiating devices on dynamically created greybus peripherals.
The mikroBUS driver is introduced in the kernel to solve this problem by enabling mikroBUS as a probeable bus, such that the kernel can discover the device(s) on the bus at probe time. This is done by storing the add-on board's device-driver-specific information on non-volatile storage accessible over 1-Wire on the mikroBUS port. The format for describing the device-driver-specific information is an extension to the Greybus manifest. In addition to physical mikroBUS ports on a target, the driver also supports instantiation of devices on remote mikroBUS port(s) on a micro-controller which is visible to the host as a set of greybus peripherals. The choice of the greybus manifest for device description makes sure that only one kind of device description is required, independent of the way in which the device is connected to the host. The mikroBUS driver does not have any strict associations to the pin mapping of the port, and the same framework can be reused for other similar add-on board standards such as FeatherWing, PMOD, Grove or Qwiic. With more than 140 add-on boards having tested support today, the mikroBUS driver helps to reduce the time to develop and debug various add-on boards, and support for greybus enables rapid prototyping and deployment of remote systems.
This talk will cover the ideas and implementations for an IoT gateway blueprint
based on Linux and built with Yocto.
Technical topics discussed will include OpenThread for connectivity between the Linux-based gateway and Zephyr-based nodes, Matter (formerly CHIP) for application layer profiles and device types, and an OTA service to assist low-resource IoT devices with firmware upgrades. Furthermore, we will discuss additional network services for native IPv6 as well as NAT64 connectivity.
Building an IoT device or EC is a very involved and complex process. Many different specialities are involved in creating the hardware as well as the software that powers it. While many of these costs are unavoidable, they have been mitigated in other disciplines of software development: mobile, web, and server. In all three, several frameworks exist to abstract away the underlying hardware and even running constraints. In the presentation:
One of the first questions you need to answer when embarking on an IoT project is whether to use a Linux-based platform or an RTOS like Zephyr. Is one better than the other? In this talk we'll explore the strengths of each approach, what Linux can learn from RTOSes (and vice versa), and even examples where an IoT device would use both Linux and an RTOS.
Compute Express Link (CXL) is a cache coherent protocol designed to boost the performance of accelerators and memory expanders. By nature of being an open standard, it also allows vendors to design hardware which will work with a generic CXL driver in the same vein as AHCI, or NVMe (and others). With CXL 2.0 spec release late last year, we're starting to see usage models being actively pushed extend to client, enterprise, and cloud.
Our Linux driver team was tasked with writing the Linux driver without the luxury of having reference hardware to develop on. QEMU is a common solution for this problem. However, a couple of aspects set this journey apart from many others. First, this isn't just a device that requires emulation, but an entire bus; and second, we began development before the spec was even finalized, which allowed us to find spec bugs and gaps.
The talk will cover the plumbing in QEMU used to enable our driver development. It will begin with Compute Express Link 2.0 fundamentals, the challenges that posed, and how those were solved or deferred, in detail, inside QEMU. There will be some time spent on the Linux kernel development done to date, and why QEMU is such an ideal environment for a task like this.
The current work is very bare-bones with respect to the complex topologies and configurations that CXL 2.0 enables. The remainder will be spent on the current gaps in the QEMU emulation and a call for help on how we can fix those and what should be done [if anything] to get this work upstream.
In recent kernels, Extra Boot Configuration (bootconfig) is available to pass kernel boot parameters in a structured key-value form instead of a single-line command line. The parameters passed via bootconfig are simply merged into the kernel command line string (cmdline). Thus kernel modules/subsystems can continue using the kernel cmdline APIs, but they cannot use the bootconfig APIs for parameters given via cmdline.
The bootconfig API obviously gives a different programming model for parameter parsing in kernel modules. The kernel module_param API is passive: its main use case is a callback that handles a fixed parameter. On the other hand, the bootconfig API is active: modules query the parameters from bootconfig in their preferred order, and parameter names can be composed dynamically. If both APIs are available to kernel modules/subsystems, users can specify more complex configurations, not just set parameter values.
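To make the contrast concrete, here is a minimal sketch (not from the session) of a built-in "foo" subsystem that declares a fixed parameter via module_param() and actively queries a structured key via xbc_find_value() at early init. The "foo" name and its "threshold" key are hypothetical, and the bootconfig query helpers are init-time APIs, so this only illustrates built-in code:

    /*
     * Minimal illustrative sketch; the "foo" subsystem and its "threshold"
     * key are hypothetical.  Built-in code only: the bootconfig helpers
     * are __init APIs, and the parsed tree is assumed to still be
     * available when this initcall runs.
     */
    #include <linux/init.h>
    #include <linux/moduleparam.h>
    #include <linux/bootconfig.h>
    #include <linux/printk.h>

    /* Passive model: a fixed parameter, set as "foo.threshold=10" on the
     * command line and parsed for us by the core. */
    static int threshold = 5;
    module_param(threshold, int, 0644);

    static int __init foo_bootconfig_setup(void)
    {
            /* Active model: query the structured bootconfig tree directly,
             * in whatever order and with whatever keys the code prefers. */
            const char *val = xbc_find_value("foo.threshold", NULL);

            if (val)
                    pr_info("foo: bootconfig threshold=%s\n", val);
            return 0;
    }
    early_initcall(foo_bootconfig_setup);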
This session will explain what bootconfig is and how it relates to the cmdline, and discuss the issues around unifying cmdline and bootconfig at the API level.
There are various user-space tracing programs these days, but they are fragmented: each one has to be installed separately, and many are difficult for a beginner to use.
This talk introduces Guider, which provides a wide range of powerful Linux tracing features using ftrace, ptrace, and procfs.
Its repository is https://github.com/iipeace/guider.
In this presentation, we show some work on Measuring Code Review in the Linux Kernel Development Process.
We investigated the following research questions:
We also investigated various characteristics of the patches themselves, such as files, sections and mailing lists, asking the following questions:
As 7.94% of the response traffic is classified as being authored by bots, we also considered where bots are active.
We will present some interesting insights we gained in this research and the diverse set of variables which define the review process. This presentation summarizes the results of a master's thesis finished in spring 2021.
The Toolchains and Kernel microconference focuses on topics of interest related to building the Linux kernel. The goal is to get kernel developers and toolchain developers together to discuss outstanding or upcoming issues, feature requests, and further collaboration.
This is a quick intro to the MC.
The Toolchains and Kernel micro conference focuses on topics of interest related to building the Linux kernel. The goal is to get kernel developers and toolchain developers together to discuss outstanding or upcoming issues, feature requests, and further collaboration.
Suggested Topics:
Achievements since last year’s LPC:
Possible Topics/Attendees:
The Rust for Linux project is adding support for the Rust language to the Linux kernel. If the project is successful, and many drivers start to be written in Rust, then the Rust compiler and associated tools will become a key part of the kernel toolchain.
This raises many questions which we will try to answer and/or discuss with others:
What is RUSTC_BOOTSTRAP and why we need it?
What rustc requires?
The use of bindgen.
The status of gcc-rs (the new GCC frontend for Rust), rustc_codegen_gcc (the new rustc backend for GCC) and mrustc (the bootstrapping compiler).
objtool is heavily used on x86, but isn't currently supported upstream by arm64.
In order to avoid depending on objtool to enable kernel features for arm64, and also to avoid disabling compiler optimisations along the lines of https://git.kernel.org/linus/3193c0836f20 when objtool cannot reconstruct the control flow, how much of objtool's functionality is actually required on arm64, and how much of that could be implemented directly by the toolchain instead?
From:
https://lore.kernel.org/r/YKO/di4h3XGjqu68@hirez.programming.kicks-ass.net
some objtool features on x86 are:
validate stack frames
generate ORC unwind data (optional)
validates unreachable instructions; specifically the lack thereof (optional)
validates retpoline; or specifically the lack of indirect jump/call sites (with annotations for those few that are okay). (optional)
validates uaccess rules; specifically no call/ret in between __user_access_begin() and __user_access_end(). (optional)
validates noinstr annotation; HOWEVER we rely on objtool to NOP all __sanitizer_cov_* calls in .noinstr/.entry text sections because __no_sanitize_cov is 'broken' in all known compilers.
generates __mcount_loc section and NOPs the __fentry call sites (optional)
generates .static_call_sites section for STATIC_CALL_INLINE support
rewrites compiler generated call/jump to the retpoline thunk to an alternative such that we can patch out the thunk with an indirect call/jmp when retpolines are disabled. (arch dependent)
rewrites specific jmp.d8 sites (as found through the __jump_table section) to nop2, because GAS is unable to determine if a jmp becomes a jmp.d8 or jmp.d32 and emit the right sized nop. (optional)
Both C and C++ started as strictly single-threaded languages, despite significant multi-threaded use more than 30 years ago. Explicit support for multithreaded execution appeared in 2011, but this was by no means the final word. This presentation will give a quick overview of low-level standards-committee concurrency progress since then, including a snapshot of work on hazard pointers, RCU, relaxed accesses, dependency ordering, and the interplay between the C/C++ and Linux-kernel memory models.
The Linux kernel continues to rely on control dependencies as a cheap mechanism
to enforce ordering between a prior load and a later store on some of its
hottest code paths. However, optimisations by both the compiler and the CPU
hardware can potentially defeat this ordering and introduce subtle,
undebuggable failures which may only manifest on some systems.
Improving the robustness of control dependencies is therefore a hotly debated
topic, with proposals ranging from limiting their usage, inserting conditional
branches, introducing compiler support and using memory barriers instead. The
scope of possible solutions has resulted in somewhat of a deadlock, so this
session aims to cover the following in the interest of progressing the debate
and soliciting opinions from others:
LKML mega-thread: https://lore.kernel.org/r/YLn8dzbNwvqrqqp5@hirez.programming.kicks-ass.net
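As a reminder of the pattern under discussion, here is a minimal sketch (not taken from the session) of a load-to-store control dependency in kernel-style C, together with one way a compiler can legitimately defeat it; the function and variables are hypothetical:

    #include <linux/compiler.h>

    /* A load-to-store control dependency: the store to *y is ordered after
     * the load of *x only because it executes conditionally on the loaded
     * value.  x and y are hypothetical shared locations. */
    void cd_example(int *x, int *y)
    {
            int q = READ_ONCE(*x);

            if (q)
                    WRITE_ONCE(*y, 1);
            else
                    WRITE_ONCE(*y, 1);
            /*
             * Both arms store the same value, so the compiler is free to
             * hoist the store above the branch (or drop the branch
             * entirely), and the intended load-to-store ordering is lost.
             */
    }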
Previous research has demonstrated that the Linux Kernel can benefit greatly from the latest compiler optimization techniques. Binary Optimization and Layout Tool (BOLT) is successfully used to accelerate large applications compiled with PGO and LTO by further improving the code layout to favor underlying hardware page and instruction caching. However, applying BOLT to the kernel faces multiple hurdles as the tool splits and reorders code sequences across function boundaries. The corresponding metadata used for code patching at boot and runtime needs to be updated accordingly. At the same time, BOLT optimizations have to be tailored to meet certain expectations about the properties of the code. Updating exception-handling and stack-unwinding data present another set of challenges. Even allocating memory for the modified code is not as straightforward as is the case with a typical ELF binary. We'll discuss the possible approaches to optimizing the kernel with BOLT and the project's current status.
GCC and Clang both have a variety of security features available, but they are not always at parity with each other. This discussion will review the security features important to the Linux kernel with regard to what's working, what's missing, and what needs adjustment.
Specifically, these areas will be discussed along with anything else that seems relevant:
stack protector guard location (i.e. enabling per-task canaries): -mstack-protector-guard=sysreg, -mstack-protector-guard-reg=sp_el0, -mstack-protector-guard-offset=0
call-used register zeroing (now in GCC 11): -fzero-call-used-regs
stack variable auto-initialization (already in Clang, soon to be in GCC 12): -ftrivial-auto-var-init={zero,pattern}
array bounds checking: -Warray-bounds, -Wzero-length-bounds, -Wzero-length-array, -fsanitize=bounds, -fsanitize=bounds-strict
integer overflow protection: -fsanitize=signed-integer-overflow, -fsanitize=unsigned-integer-overflow
Link Time Optimization: -flto, -flto=thin
backward edge Control Flow Integrity: -mbranch-protection=pac-ret[+leaf], -fsanitize=shadow-call-stack, CET
forward edge Control Flow Integrity: -fcf-protection=branch, -mbranch-protection=bti, -fsanitize=cfi
Spectre v1 mitigation: -mspeculative-load-hardening
structure layout randomization: __attribute__((randomize_layout))
constant expression for "is an lvalue?"
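As a rough illustration (hypothetical code, not from the session), the following snippet shows the kind of bug several of these options target: -Warray-bounds / -fsanitize=bounds can catch the out-of-bounds index, while -ftrivial-auto-var-init={zero,pattern} keeps the uninitialized stack bytes from leaking.

    /* Hypothetical example of the bug classes these flags address. */
    int read_past_end(void)
    {
            char buf[8];

            /*
             * buf is never written: -ftrivial-auto-var-init={zero,pattern}
             * ensures the read below does not leak whatever was previously
             * on the stack.  The index is also one past the end of the
             * array: -Warray-bounds can flag it at compile time and
             * -fsanitize=bounds traps it at run time.
             */
            return buf[8];
    }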
This event marks the end of Linux Plumbers 2021. We will take a look back at Linux Plumbers 2021, our challenges in organizing it and what our hopes for Linux Plumbers 2022 are.
During Monday's opening keynote we had the chance to look back on the last 30 years of Linux. In the closing keynote we will concentrate on the future of Linux instead. And this requires your input too! We would like to hear what you think will happen with Linux in the future. So please fill out our "Linux Prediction Survey" at
https://docs.google.com/forms/d/e/1FAIpQLSc6mfLXCXQLY6NfT5apbtu2dZVSQHBESVpDxbxqrBP5HnpfTA/viewform?usp=pp_url
We will discuss the results in this session.