LPC 2021 Microconferences
- Containers and Checkpoint/Restore microconference
- Confidential Computing Microconference [Summary]
- Scheduler Microconference
- Performance and Scalability Microconference [Summary]
- IoT's Company Microconference [Summary]
- Tracing Microconference [Summary]
- Toolchain and Kernel Microconference
- Real-time Microconference [Summary]
- Testing and Fuzzing Microconference [Summary]
- File System Microconference [Summary]
- VFIO/IOMMU/PCI Microconference [Summary]
- Open Printing Microconference
- RISC-V Microconference
- Kernel Dependability and Assurance Microconference [Summary]
- System Boot and Security Microconference
- Android Microconference [Summary]
- GPU/media/AI buffer management and interop Microconference
- Diversity, Equity and Inclusion Microconference
Containers and Checkpoint/Restore MC
CFP Ends: Aug 15, 2021
The Containers and Checkpoint/Restore Microconference focuses on both userspace and kernel-related work. The microconference targets the wider container ecosystem, ideally with participants from all major container runtimes as well as init system developers.
Contributions to the microconference are expected to be problem statements, new use cases, and feature proposals in both kernel and userspace.
Suggested Topics:
- How best to use CAP_CHECKPOINT_RESTORE in CRIU so that checkpoint/restore can run as non-root.
- Extending the idmapped mount feature to unprivileged containers, i.e. agreeing on a sane and safe delegation mechanism with clean semantics.
- Porting more filesystems to support idmapped mounts.
- Making it possible for unprivileged containers and unprivileged users in general to install fanotify subtree watches.
- Discussing and agreeing on a concept of delegated mounts, i.e. the ability for a privileged process to create a mount context that can be handed off to a less privileged process which it can interact with safely.
- Fixing outstanding problems in the seccomp notifier to handle syscall preemption cleanly. A patchset for this is already out but we need a more thorough understanding of the problem and its proposed solution.
- With more container engines and orchestrators supporting checkpoint/restore, the idea has come up to provide an optional interface through which applications can be notified that they are about to be checkpointed. A possible example is a JVM that could do cleanups which do not need to be part of a checkpoint.
- Discussing an extension of the seccomp API to make it possible to attach a seccomp filter to a task, i.e. the inverse of the current caller-based seccomp sandboxing model, enabling full supervisor-based sandboxing.
- Integration of the new Landlock LSM into container runtimes.
- Although checkpoint/restore can handle cgroupv1 correctly, the cgroupv2 support is very limited, and there is a need to figure out what is still missing to have v2 supported just as well as v1.
- Isolated user namespaces (each with the full 32-bit uid/gid range) and an easier way for users to create and manage them.
- Figure out what is missing on the checkpoint/restore level and maybe the container runtime level to support optimal checkpoint/restore integration on the orchestration level. Especially the pod concept of Kubernetes introduces new challenges which have not been part of checkpoint/restore before (containers sharing namespaces for example).
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Containers and Checkpoint/Restore MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads
- Stéphane Graber <stgraber@stgraber.org>, Mike Rapoport <mike.rapoport@gmail.com>, Adrian Reber <areber@redhat.com>, and Christian Brauner <christian.brauner@ubuntu.com>
Confidential Computing MC
CFP Ends: Sept 1, 2021
The Confidential Computing microconference focuses on developing solutions that use state-of-the-art encryption technologies for live encryption of data, and on how to utilize the technologies from AMD (SEV), Intel (TDX), s390 and ARM Secure Virtualization for secure computation of VMs, containers and more.
Suggested Topics:
- Live Migration of Confidential VMs
- Lazy Memory Validation
- APIC emulation/interrupt management
- Debug Support for Confidential VMs
- Required Memory Management changes for memory validation
- Safe Kernel entry for TDX and SEV exceptions
- Requirements for Confidential Containers
- Trusted Device Drivers Framework and driver fuzzing
- Remote Attestation
For more references, see:
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Confidential Computing MC" for the "Track". More topics will be added based on CfP for this microconference.
MC lead:
- Joerg Roedel <joro@8bytes.org>
Confidential Computing MC Summary
- TDX Live Migration [video][slides]
  - Uses separate Migration Trusted Domains (TDs): SrcMigTD and DstMigTD
  - MigTDs are part of the TCB; they do pre-migration checks and prepare the encryption key for migrating guest states
  - Guest state encryption/decryption is done by the TDX Module when the VMM exports/imports it via SEAM calls
  - MigTD to host communication can use a VMM-agnostic transport based on vmcall, or a VMM-specific transport
    - virtio-vsock
    - hyperv-vsock
    - vmci-vsock
  - Intel provides a Rust-based reference implementation for MigTD
  - MigTD is provided by the hypervisor; the guest TD can measure MigTDs provided by the cloud vendor
  - Interface to QEMU is a kvm-migration-device which calls the TDX module to re-encrypt pages for migration
  - How to track guest private and shared pages?
    - Bitmap is fast but sparsely populated
    - Region list is slower but likely sufficient
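The trade-off between the two tracking structures raised in the TDX Live Migration discussion can be illustrated with a small model. This is only a sketch under stated assumptions: the class names, the 4 KiB page size, and the merge policy are hypothetical and are not taken from the TDX or QEMU patches.

```python
from bisect import bisect_right

PAGE_SIZE = 4096  # assumed page size for this sketch


class SharedBitmap:
    """One bit per guest page: O(1) lookup, memory grows with guest size."""

    def __init__(self, num_pages):
        self.bits = bytearray((num_pages + 7) // 8)

    def mark_shared(self, first_page, count):
        for page in range(first_page, first_page + count):
            self.bits[page // 8] |= 1 << (page % 8)

    def is_shared(self, page):
        return bool(self.bits[page // 8] & (1 << (page % 8)))


class SharedRegionList:
    """Sorted, non-overlapping [start, end) page ranges: compact, O(log n) lookup."""

    def __init__(self):
        self.regions = []  # list of (start, end) tuples, kept sorted

    def mark_shared(self, first_page, count):
        start, end = first_page, first_page + count
        merged = []
        for s, e in self.regions:
            if e < start or s > end:  # disjoint and not adjacent: keep as is
                merged.append((s, e))
            else:  # overlapping or touching: fold into the new range
                start, end = min(start, s), max(end, e)
        merged.append((start, end))
        merged.sort()
        self.regions = merged

    def is_shared(self, page):
        # Find the last region whose start is <= page, then check its end.
        idx = bisect_right(self.regions, (page, float("inf"))) - 1
        return idx >= 0 and self.regions[idx][0] <= page < self.regions[idx][1]
```

With only a few shared regions in a large guest, the region list stays tiny while the bitmap is mostly zeros, which matches the "sparsely populated" remark from the session.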
- Live Migration of Confidential Guests [video][slides]
  - Problem: How to migrate AMD SEV guests
  - Solution one: Using only the AMD Platform Security Processor (PSP)
    - PSP establishes migration key and re-encrypts guest pages for migration
    - Approach is slow because PSP has low re-encryption bandwidth
  - Solution two: Use a migration helper in a separate VM sharing the same encryption context as the migrated VM
    - Faster because re-encryption happens on the CPU
    - Migration helper needs to be loaded at guest launch and is part of the measurement
  - Implementation uses a region list to track shared guest memory regions
  - Open problem: Find a common solution for QEMU which works for SEV (including SEV-ES and SEV-SNP) and TDX
- TDX Linux Support [video][slides]
  - Patches under development
  - SWIOTLB used for IO; buffers are shared memory
    - Work in progress to split the SWIOTLB spin-lock
    - Lots of time (20%) spent on the spin-locks
    - Hyper-V has a better bounce buffer implementation
  - Lazy memory accept is not yet implemented, work ongoing
    - Needs an approach which can be shared with AMD SEV-SNP
    - Memory must not be accepted twice
    - Current approach uses a bitmap with 2M granularity
    - Acceptance information needs to be preserved across kexec
  - Trusting device drivers
    - Traditionally Linux trusts the hardware, but in a TDX guest the hardware is emulated and becomes untrusted
    - Drivers need to be hardened against malicious Hypervisor device emulations
    - A driver white-list of allowed drivers in a TDX guest is needed
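The lazy-accept constraints noted in the TDX Linux Support session (2M granularity, memory must not be accepted twice, state preserved across kexec) can be modelled roughly as follows. This is a hedged sketch: `LazyAcceptTracker`, `_accept_unit`, and the export/import helpers are hypothetical names, and the real accept operation is platform-specific and not shown.

```python
ACCEPT_UNIT = 2 * 1024 * 1024  # 2M granularity, as mentioned in the session


class LazyAcceptTracker:
    """Tracks which 2M units of guest memory have already been accepted."""

    def __init__(self, mem_bytes):
        units = (mem_bytes + ACCEPT_UNIT - 1) // ACCEPT_UNIT
        self.accepted = [False] * units
        self.accept_calls = 0  # counts simulated accept operations

    def _accept_unit(self, unit):
        # Stand-in for the platform-specific accept operation (hypothetical).
        self.accept_calls += 1
        self.accepted[unit] = True

    def ensure_accepted(self, addr, length):
        """Accept every unit touched by [addr, addr + length) exactly once."""
        first = addr // ACCEPT_UNIT
        last = (addr + length - 1) // ACCEPT_UNIT
        for unit in range(first, last + 1):
            if not self.accepted[unit]:
                self._accept_unit(unit)

    def export_state(self):
        """Serialize the bitmap, e.g. to hand it to the next kernel across kexec."""
        return bytes(self.accepted)

    @classmethod
    def import_state(cls, state):
        """Rebuild a tracker from a previously exported bitmap."""
        tracker = cls(len(state) * ACCEPT_UNIT)
        tracker.accepted = [bool(b) for b in state]
        return tracker
```

Checking the bitmap before every accept is what prevents double acceptance; carrying the exported state forward is one way the second kernel could avoid re-accepting memory.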
- Debug Support for Confidential Guests [video][slides]
  - Debugging AMD SEV guests
  - TDX debug support builds on top of SEV debug support
  - AMD PSP is used to encrypt and decrypt guest memory
  - Add a QEMU.MemTxAttr.debug flag to indicate memory accesses from a debugger
  - Additional debug ops in 'struct MemoryRegion'
  - Open problem: No access to encrypted guest register state
    - Needed for further SEV-ES and SEV-SNP development
    - Possibly implement decryption of guest register state on automatic exits like SHUTDOWN
    - Currently only possible with out-of-tree patches
    - Upstream solution to this would help a lot
- Confidential Computing with Secure Execution (IBM Z) [video][slides]
  - Uses a fully encrypted boot image encrypted with an asymmetric key
    - Key is specific to the host machine and so is the image
    - Decryption happens in Ultra-visor, data not visible to QEMU
    - Ultra-visor is a combination of hardware and firmware and part of the guest TCB
  - RootFS is encrypted with LUKS and dm-crypt
  - Kernel and Initrd are encrypted with hardware public key
  - In confidential containers (e.g. Kata), attestation can be substituted
    - Keys for the decryption and verification of container images can be baked into the initrd
    - Initrd encrypted with host key
- Confidential Containers [video]
  - Protect containers and host from inter-container attacks
  - Remove Container Service Provider (CSP) from the TCB
  - Put containers into confidential VMs based on kata-containers
  - One VM per pod
  - Problem: Container image provisioning
    - Container images usually come from an untrusted registry
    - Protect them from the host
    - Move container image management inside the guest
    - Container images need to be signed and have verifiable integrity
    - kata agent refuses to run unsigned containers
    - Maybe using dm-integrity and dm-verity
- Deploying Confidential VMs via Linux [video][slides]
  - Experience report of deploying SEV in Google Compute Engine (GCE)
  - GCE provides VMs based on SEV (CVMs), SEV-ES not yet supported
  - Problem: Getting fixes into LTS kernels
    - For example: Fixes for SWIOTLB bug were only partially accepted for LTS kernels
    - Need to establish a process to get fixes into distributions
    - SUSE looks at stable patches and at patches with Fixes tags
    - Updating the images with hardened device drivers is important
    - GCE VMs do not use virtio, so no virtio hardening was done
  - Problem: Testing of encrypted virtualization environments
    - Working on a self-test framework for SEV which can be used by distribution maintainers
    - SEV-ES enablement for kvm-unit-test is being worked on
    - A good start for further testing is to run all tests for unencrypted VMs in encrypted VMs as well
    - How to prioritize SEV and TDX testing with respect to upstreaming new features?
- Attestation and Secret Injection of Confidential VMs, Containers and Pods [video][slides]
  - SEV and SEV-ES support pre-attestation only
  - Measurement and secret injection happen before guest starts executing
  - Anything which is not measured needs to be gated by a secret
  - Approach: Bundle OVMF and Grub and measure them together
    - Grub retrieves key from injected secrets and decrypts root partition
  - Also compute hashes of kernel, initrd and command line and put them into guest memory, so that they become part of the initial measurement
    - OVMF compares the hashes with kernel and initrd loaded from disk
  - Kernel exports secret later via SecurityFS
  - Used for attestation of software components in the running system
  - Approach can be used for Confidential VMs too, not limited to containers
  - In general, discussion was around how to consolidate attestation and measured-boot work-flows
  - It was agreed that more discussion is needed
  - Different approaches for containers are discussed in the Confidential Container Initiative
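The hash-comparison step described in the attestation session (hashes of kernel, initrd and command line placed into measured guest memory, then compared against what is loaded from disk) can be sketched as below. The hash algorithm (SHA-256), the table layout, and both function names are assumptions for illustration only; the notes do not specify them.

```python
import hashlib
import hmac


def build_hash_table(kernel, initrd, cmdline):
    """Hashes computed up front so they become part of the initial measurement."""
    return {
        "kernel": hashlib.sha256(kernel).hexdigest(),
        "initrd": hashlib.sha256(initrd).hexdigest(),
        "cmdline": hashlib.sha256(cmdline).hexdigest(),
    }


def verify_loaded_component(table, name, blob):
    """Firmware-side check: does the blob loaded from disk match the measured hash?"""
    actual = hashlib.sha256(blob).hexdigest()
    return hmac.compare_digest(table[name], actual)
```

Because the table itself is covered by the launch measurement, a component swapped on disk after measurement would fail this comparison even though the disk is untrusted.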
- Securing Trusted Boot of Confidential VMs [video][slides]
  - Decentriq running SGX in production for several years
  - Problem: Providing Control Flow Integrity (CFI)
    - Minimizing code base by disabling kernel features
    - Hardening of the remaining kernel features
    - Code size does not matter as much as number of communication points between guest and hypervisor
  - TDX supports attestation of kernel and initrd loaded from unencrypted media
  - Removing grub from the TCB
    - Difficult with standard distributions
    - OVMF too heavy, need a minimal firmware which just boots a Linux kernel
    - libkrun is a possible solution, provides a minimal firmware based on qboot which just jumps to the 64-bit kernel entry point
Scheduler MC
CFP Ends: Aug 31, 2021
The Scheduler microconference focuses on deciding what process gets to run when and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the scheduler microconference has helped many topics make progress.
Suggested Topics:
- Cgroup interface and other updates for core-scheduling
- Cgroup and SCHED_DEADLINE
- Capacity Awareness – For busy systems
- Interrupt Awareness
- Load Balancing:
- Wakeup
- Periodic
- NUMA load balancing
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Scheduler MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Dhaval Giani <dhaval.giani@oracle.com>
- Daniel Bristot de Oliveira <bristot@redhat.com>
- Chris Hyser <chris.hyser@oracle.com>
- Juri Lelli <juri.lelli@redhat.com>
- Vincent Guittot <vincent.guittot@linaro.org>
Performance and Scalability MC
CFP Ends: Aug 31, 2021
The Performance and Scalability microconference focuses on enhancing performance and scalability in both the Linux kernel and userspace projects. In fact, one of the purposes of this microconference is for developers from different projects to meet and collaborate – not only kernel developers but also researchers doing more experimental work. After all, for the user to see good performance and scalability, all relevant projects must perform and scale well.
Because performance and scalability are very generic topics, this track is aimed at issues that may not be addressed in other, more specific sessions. The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.
Suggested topics:
- Seamless hypervisor update with IOMMU-type-agnostic, directly-attached devices and virtual functions. Related projects are VMM Fast Restart, PKRAM, and MMU enabled kexec relocation for arm64.
- Performance characteristics of RT spinlocks.
- Accounting CPU-intensive kernel threads in the CPU controller via remote charging (thread 1, thread 2, thread 3).
- Design discussion and performance characteristics of Maple Tree (lwn article).
- mmap_sem contention in procfs (test code, gitweb).
- NUMA-aware spinlocks: (series, lwn article).
- futex2: attempts to tackle the performance limitations of the single NUMA node hash table.
- Batching optimizations in the internals of get_user_pages() and put_user_pages().
- Fast kdump for embedded devices.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Performance and Scalability MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Davidlohr Bueso <dave.bueso@gmail.com>
- Daniel Jordan <daniel.m.jordan@oracle.com>
- Pavel Tatashin <pasha.tatashin@soleen.com>
Performance and Scalability MC Summary
-
Optimize Page Placement in Tiered Memory System [video][slides]
Ying Huang and Dave Hansen presented this work.
- There are alternative types of memory emerging, such as cheap and slow persistent memory (PMEM), expensive and fast high bandwidth memory (HBM), and CXL-connected memory pools over PCI that are cache coherent.
- The proposal is a middle ground between Memory and App Direct modes of persistent memory where the kernel controls memory placement by default but the app/admin can override this if desired.
- Migrate in lieu of discard: When reclaiming a DRAM page off the LRU, move the page to storage and, in addition, to PMEM.
- In recent kernels, once a page is moved to PMEM, there's no mechanism to move it from PMEM to DRAM. The only way a PMEM page could be promoted back to DRAM in current kernels is if the page came back from swap.
- Pasha Tatashin: It makes sense to proactively move regularly accessed pages to pmem.
- Dave Hansen: That's an orthogonal problem to adding support for moving pages between different tiers of memory.
- Shakeel Butt: Reclaim only looks at evictable pages. Is there a plan to look at unevictable or non-LRU pages like hugetlb pages?
- Dave Hansen: We have no immediate plans to look at that. We've been telling people asking about hugetlb, "switch to THP if you care," and that answer has worked so far. But we could be talked into it.
- Promote PMEM pages to DRAM by expanding NUMA balancing. All PMEM accesses will be considered remote accesses, so they'll always try to be promoted to DRAM. Problems with this:
- What if DRAM is full? Current approach is to promote PMEM pages if the zone is above the high watermark, otherwise wake up kswapd to reclaim until the zone is not only above the high watermark but also further increased to the new promote watermark. This stops promotion in a zone under high memory pressure.
- Davidlohr Bueso was concerned about the performance cost on machines that don't suffer from memory pressure.
- Pasha Tatashin: Is there any consideration for the performance difference for reads and writes on PMEM when considering the hot threshold?
- Ying Huang: We can introduce separate thresholds.
- Promotion and demotion can cause thrashing.
- The unmapped file page promotion series will be posted soon.
- There's an experimental kernel available at with build and config instructions
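The promotion policy described for the "What if DRAM is full?" case can be sketched as a small model. This is an illustration under assumptions, not the posted patches: the `Zone` fields, `wake_kswapd`, and `try_promote` are hypothetical names, and real reclaim runs asynchronously rather than completing inside the call.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    free_pages: int
    high_wmark: int
    promote_wmark: int  # assumed to sit above high_wmark


def wake_kswapd(zone):
    """Stand-in for background reclaim: free pages up to the promote watermark."""
    zone.free_pages = max(zone.free_pages, zone.promote_wmark)


def try_promote(zone):
    """Return True if one PMEM page was promoted into this DRAM zone."""
    if zone.free_pages > zone.high_wmark:
        zone.free_pages -= 1  # the promoted page consumes one free DRAM page
        return True
    wake_kswapd(zone)  # under pressure: skip promotion, reclaim instead
    return False
```

Reclaiming past the high watermark up to a separate promote watermark leaves headroom, so later promotion attempts succeed instead of immediately hitting the high watermark again.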
-
"cat /proc/PID/maps": What Could Possibly Go Wrong? [video][slides]
Paul E. McKenney and Matthew Wilcox presented this work.
- A monitoring application reading /proc/pid/maps (an mmap_sem reader) gets preempted by a writer, and the writer then blocks, subject to unbounded sleeping. This problem occurs in production and can "hang" the application (high latencies for mmap/munmap).
- One way to address this is via the VMA Maple Tree enabling readers to use RCU and protect writers with a spinlock. The downside is that RCU creates visible inconsistencies and the user reading maps will see overlapping VMAs or newly added ones. However, users already have to take the maps output with a grain of salt.
- Matthew Wilcox noted that RCU freeing page tables was needed to avoid races with munmap: ppc has done it for a while, x86 experiments show promise.
- No firm conclusion was reached on whether this would be acceptable in the future.
- Paul wrote a testcase that can trigger the issue at will.
-
Design discussion and performance characteristics of Maple Tree [video][slides]
Liam Howlett presented on his work on Maple trees to replace the current underlying VMA data structures.
- A new, RCU-safe, range-optimized B-tree (based on ideas from the radix tree). Its properties include self-balancing, cache efficiency, and leaves at the same height.
- Designed with the VMA tree in mind; it would replace the existing rbtree+list+cache combination.
- The idea is to move things out from under mmap_sem. Readers can use RCU and refcounts while writers must take a spinlock which serializes changes to the tree.
- There are caveats because the Maple Tree provides its own locking: mas_lock/unlock().
- Currently (without moving things out of mmap_sem), performance is around what we have today, so worst case scenario is not bad.
- Davidlohr Bueso noted that per 'gitcheckout' results, the single-threaded use case was only affected minimally, with around a 2% hit.
- Because certain paths can sleep, RCU must be dropped; an option would be to explore SRCU instead. Performance would suffer somewhat (extra barriers, but still wait-free), but the code could be simplified.
- Liam noted that other potential users of the maple tree beyond the mm could be the IDA/IDR and the page cache, each with their own benefits of better range support.
- Overall, very few conclusions were reached on the subject.
-
Preserving state for fast hypervisor update [video][slides]
Pasha Tatashin presented on his work.
- Fast Hypervisor: being able to provision cloud servers.
- Hot-patching and Live Migration are not good solutions.
- Fast Local Update: Building Blocks
- Local live migration
- Emulated PMEM
- DAXFS
- EXT4 (or any DAX enabled FS)
- Demo Cloud Hypervisor
- Show data with QEMU update.
- Emulated PMEM Interface
- Dan Williams: the current PMEM interface is bad, but there is soft preserving of device state across reboot.
- David Woodhouse: preserving can be done using device drivers, and they could collect all the IOMMU, VMX etc. data in a single package.
- Dan Williams: Can we use PM? Yes, PM could be used, but it requires an agent in the VM.
- Support for Virtual Functions
- Future: Multi-root update
- KY: How are devices shared?
- Once the new hypervisor has finished booting, it takes control of the devices together with the VMs from the first hypervisor.
- Mark Rutland raised the issue that sharing CPUs on arm64 between OSes is dangerous, as there is system-wide state (e.g. the shared interrupt controller state).
- David Woodhouse: I prefer the model of just letting the vCPUs continue running...
- Pasha Tatashin: Keeping VMs running works only on a nicely partitioned system.
- David Woodhouse: That's true, yes. That trick does rely on a 1:1 mapping of vCPU to pCPU.
- Then again, on an overcommitted system where there are more awake vCPUs than pCPUs, they were used to a certain amount of steal time anyway. Perhaps just not that much. You could extend your 'minimal vcpu_run loop' concept to do some primitive scheduling too... but fairly soon you've written another whole hypervisor :)
-
PKRAM feature development [video][slides]
Anthony Yznaga discusses the current state and what's next for Preserved-over-kexec memory storage.
- Preserving stream of bytes and page data across reboots.
- Use cases: database cache, live update, iommu state.
- Defragmentation can be added.
- A space for pkram can be pre-reserved.
- Simple use of PKRAM with tmpfs shows a number of optimizations to improve performance: parallelizing the work to preserve and restore tmpfs files, and deferring initialization of page structures of preserved pages.
- Limitations include no support for huge pages, no support for firmware reboot, and additional overhead, among others.
- Pavel Tatashin: If pkram could be extended to survive across a firmware reboot, it would be more appealing.
- David Woodhouse: device drivers could use pkram to pass state. tmpfs is just one example.
-
Alex Kogan discussed his work on a NUMA-aware variant of qspinlock
- NUMA-aware userspace locks haven't made their way into the kernel because they take up space proportional to the number of nodes, whereas spinlocks should remain small since they're embedded in many kernel data structures.
- Alex discussed the tradeoff between performance and strict fairness (FIFO) when considering the current qspinlock vs CNA.
- CNA maintains two queues, primary and secondary, where the primary queue consists of lock waiters on the same node as the lock holder and the secondary queue has all other waiters.
- To ensure long-term fairness, flush the secondary queue back to the primary one after a configurable amount of time or number of intra-node handovers have happened.
- Davidlohr Bueso: Users will get the tunable wrong, so hide it from the user.
- There are certain cases, for instance when a task has disabled irqs, when that task will stay in the primary queue even if it's running on a different node than the lock holder.
- Tim Chen: This will improve overall throughput, but I'm concerned about the worst case latency. Do you have data on this?
- Alex: We haven't looked into this aspect.
- [unknown] We should keep all the RT tasks in the primary queue.
- Davidlohr Bueso: PREEMPT_RT should disable CNA as well since these systems want determinism.
- Paul McKenney: There are RT apps that use NOHZ_FULL and they're going to be spinning at normal priority, not RT priority. Another RT use case to consider. There are in-kernel primitives to detect this situation that could be used.
- Alex: Do the folks concerned about worst-case latency have suggestions for benchmarks?
- Mathieu Desnoyers: When you walk the primary queue, you're doing remote accesses when moving waiters to the secondary queue. Why not increase the memory footprint of CNA and use per-cpu data to create per-node queues of waiters to avoid remote accesses?
- Davidlohr Bueso: More macrobenchmarks will be more persuasive. Can just be a general purpose workload, doesn't need to target locks specifically.
- Tim Chen: You might try TPC-C. A benchmarking team at Intel has seen quite a bit of spinlock contention there.
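The two-queue handover policy described for CNA can be modelled in a few lines. This is a simplified sketch of the idea from the notes, not the kernel implementation: `CnaLockModel`, the count-based `handover_limit` tunable, and the tuple representation of waiters are all assumptions, and the special cases (irqs disabled, RT tasks) raised in the discussion are not modelled.

```python
from collections import deque


class CnaLockModel:
    """Toy model: prefer same-node waiters, flush remote waiters periodically."""

    def __init__(self, handover_limit=3):
        self.primary = deque()    # waiters in arrival order: (name, node)
        self.secondary = deque()  # remote waiters that were skipped over
        self.handover_limit = handover_limit
        self.local_handovers = 0

    def enqueue(self, name, node):
        self.primary.append((name, node))

    def next_holder(self, holder_node):
        """Pick the next lock holder when a holder on `holder_node` unlocks."""
        flush = self.local_handovers >= self.handover_limit and self.secondary
        if not flush:
            skipped = []
            while self.primary:
                waiter = self.primary.popleft()
                if waiter[1] == holder_node:
                    self.secondary.extend(skipped)  # remote waiters step aside
                    self.local_handovers += 1
                    return waiter
                skipped.append(waiter)
            self.primary.extend(skipped)  # no local waiter found; restore order
        # Flush: secondary waiters (older) go back in front of the primary queue.
        self.primary.extendleft(reversed(self.secondary))
        self.secondary.clear()
        self.local_handovers = 0
        return self.primary.popleft() if self.primary else None
```

For example, with `handover_limit=2` and waiters A (node 0), B (node 1), C, D, E (node 0) behind a node-0 holder, the model hands the lock to A, then C, then flushes and serves B, which is the throughput-versus-fairness trade-off the session debated.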
IoThree's Company MC
CFP Ends: Aug 24, 2021
The IoThree's Company microconference is moving into its third year at Plumbers. Talks cover everything from the real-time operating systems in wireless microcontrollers, to products and protocols, to Linux kernel and tooling integration, userspace, and then all the way up to backing cloud services. The common ground we all share is an interest in improving the developer experience within the Linux ecosystem.
Suggested topics:
- Automotive initiative involving vehicle control, navigation, and autonomous driving
- Real-Time (RT) Linux and Safety Critical (ELISA) projects
- Linux kernel / wpan-tools MAC Layer patch integration ongoing to support scanning, joining, forming networks
- The wpanusb module is close to being integrated in Linux to provide a common transceiver interface for RIOT OS, Zephyr and others
- Support for multiple network interface autoconfiguration in Zephyr
- Greybus pthread and dynamic thread stack support for Zephyr
- Greybus minimize socket / certificate overhead for Zephyr
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "IoT's Company MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Christopher Friedt <chris@friedt.co>
- Jason Kridner <jkridner@beagleboard.org>
- Drew Fustini <drew@beagleboard.org>
IoThree's Company MC Summary
- Overview of LoRa and LoRaWAN support in Zephyr [slides][video]
  - LoRa:
    - LoRa API supported: config, send, recv, test
    - P2P, no gateway needed
  - LoRaWAN:
    - Complete list of APIs in slides
    - Only Class-A tested, but other classes should work
    - Battery level, ADR, OTAA, and ABP are supported
    - Based on LoRaWAN Spec v1.0.4 and Regional Parameters 2-1.0.1
  - Improvements planned in Zephyr for LoRa/LoRaWAN
  - Proposal for LoRa and LoRaWAN in Linux kernel
    - A socket for LoRa and LoRaWAN
    - PF_LORA, PF_LORAWAN
    - LoRa as PHY: device drivers for LoRa modules
    - LoRa as Soft MAC: stack written from scratch
    - Long-standing effort by Andreas Farber, Jian-Hong Pan
      - Very much needs volunteers to upstream
      - Not updated in 3 years
      - Needs to be merged upstream in small, reviewable parts
  - Wireshark has lorawan filters
  - Good devices to use with Zephyr / Linux support
    - STM32WL, RAK5205, LoRa-E5, SX1272MB2xAS Shield, Dragino LSN50, etc - end nodes
    - Packet sniffers needed!
- mikroBUS Driver Add-on Boards [video][slides][demo (no audio)]
  - mikroBUS:
    - an add-on board socket standard by MikroElektronika
    - includes SPI, I2C, UART, PWM, ADC, GPIO and power
    - 800 Click modules available today!
    - Uses 1-wire EEPROM for board identification
  - Status in Linux:
    - Expose mikroBUS as probe-able bus
    - Devices are probed with combination of
      - Devicetree overlay fragment
      - Board-specific EEPROM ID
  - How is mikroBUS over Greybus different?
    - mikroBUS is a variant of the gbphy class
    - gbphy cport devices created
    - gbphy device has associated SPI, I2C, GPIO controllers
    - probe devices according to Greybus Manifest
    - Probe board ID / metadata in EEPROM
    - Instantiate mikroBUS
  - 140 mikroBUS Click add-on boards tested and ready to use today!
  - Greybus:
    - Sort of like a transport + discovery mechanism for non-discoverable buses (SPI, I2C, UART, PWM, GPIO)
    - Originally for Project Ara modular phone
    - RPC protocol to manage and control remote devices
    - Greybus allows nodes to be locally connected or remote via TCP / IP
    - Devices and busses appear in Linux as if they were locally connected
    - Keep intelligence in the host, not the remote sensor
  - What Next?
    - Several mikroBUS patches to upstream
    - Need additional protocol specification in Greybus (e.g. PWM)
    - Need an organization to adopt maintainership of Greybus
    - Needs UART support + PWM DT bindings
    - a few other open issues, PRs welcome!
  - Contact Jason Kridner for BeagleConnect Freedom Beta program
- IoT Gateway Blueprint with Thread and Matter [slides][video]
  - Current State of IoT (in Home Automation context)
    - A box from every different vendor (Apple, Google, Amazon, …) - needs to be reduced!
    - Ignore branding and vendor lock-in - technology only!
    - Brainstorming: ASOS Gateway (based on Yocto & Linux)
      - WiFi AP, BLE, OpenThread Border Router
      - Basic Web UI
      - Matter protocol support
      - sysotad - platform-specific OTA updates
      - containerized services / software-related updates
      - Yocto used to build Linux & Zephyr
      - Gateway will run Linux, Devices will run Zephyr
      - Share libraries between host and device (mbedTLS, openthread, matter)
  - Predictions
    - Vendors will eventually not bundle gateways, but..
    - .. unlikely that there will ever be 1 box for all HA
    - 6lowpan / IPv6 will become dominant
    - Devices may be legally required to be capable of OTA updates
  - Guidelines
    - Use end-to-end solutions (IPv6 device to device or device to cloud)
    - Focus on IPv6-only solutions (OpenThread, Matter)
    - ISPs: provide IPv6 prefix delegation to the home, if not already
    - NAT64 for IPv4-only transit networks
  - OpenThread:
    - Very active Open Source project
    - Product and App compatibility
    - Supported by numerous Zephyr devices
    - IPv6-based (WiFi, Ethernet, 802.15.4)
  - Matter:
    - Active Open Source project
    - Connectivity Standards Alliance (formerly Zigbee)
    - Google, Apple, driving development
    - IPv6-based (WiFi, Ethernet, 802.15.4)
    - Member-only forums
  - IEEE 802.15.4 / linux-wpan updates
    - MLME scanning, joining, coordinator, beacons need to be mainlined!
    - Very similar conceptually to 802.11 (WiFi)
    - Several companies have tried over the years but become too busy maintaining their own gateway (ironically)
    - More functionality is needed in-kernel rather than in userspace if we want a common linux-wpan platform abstraction
- Apps not boilerplate, leveraging Android’s CHRE in Zephyr [slides][video]
  - Main motivation
    - Get rid of code and processes that are done over and over again
    - Reduce time to market
    - Reduce cost of maintainership
    - Reduce implementation complexity
    - Improve testability
    - Improve modularity
    - Focus on WHAT to build instead of HOW
  - Zephyr is currently being integrated into Chromium OS Embedded Controller
    - Add Android’s Context Hub Runtime Environment to Zephyr
  - CHRE in a nutshell
    - nanoapps have 3 entry points:
      - nanoappStart()
      - nanoappStop()
      - nanoappHandleEvent()
    - nanoapps run in a low-power environment
    - Offload processing from applications processor
    - E.g. lid angle calculation, wifi events, sensor calibration, geofencing
    - CHRE manages events in a pub/sub model between nanoapps, devices, and other event sources
    - CHRE itself is implemented in a restricted subset of C++
    - nanoapp entry points have C linkage (to simplify RTOS integration)
    - Application is responsible for transport
    - Data serialized / deserialized using Protocol Buffers (gRPC?)
  - Sensor Framework included with CHRE
    - Timestamp spreading
    - Sample-rate arbitration
    - Sample-batching
    - Supports power management
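The nanoapp lifecycle and pub/sub event model summarized in the CHRE session can be illustrated with a toy model. This is not the CHRE API: the `Runtime` class, the `LidAngleApp` example, and the Python method names `start`, `stop` and `handle_event` are hypothetical stand-ins that merely mirror the three entry points listed in the notes.

```python
from collections import defaultdict


class Runtime:
    """Toy event runtime: routes published events to subscribed nanoapps."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # event type -> nanoapps

    def load(self, app):
        # Start entry point: the app may subscribe to events while starting.
        return app.start(self)

    def subscribe(self, event_type, app):
        self.subscribers[event_type].append(app)

    def publish(self, sender, event_type, data):
        # Event entry point: deliver the event to every subscriber.
        for app in list(self.subscribers[event_type]):
            app.handle_event(sender, event_type, data)

    def unload(self, app):
        for apps in self.subscribers.values():
            if app in apps:
                apps.remove(app)
        app.stop()  # stop entry point


class LidAngleApp:
    """Hypothetical nanoapp that collects accelerometer samples."""

    def __init__(self):
        self.samples = []
        self.stopped = False

    def start(self, runtime):
        runtime.subscribe("accel", self)
        return True

    def handle_event(self, sender, event_type, data):
        self.samples.append(data)

    def stop(self):
        self.stopped = True
```

The nanoapp contains only its own logic; subscription, routing and teardown live in the runtime, which is the "apps, not boilerplate" point made in the session.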
-
Embedded Linux & RTOSes: Why not both? [slides][video]
-
Board Support:
-
Linux and Zephyr both do this well, with a device model, interface-based drivers, Devicetree (or ACPI) interfaces for platform data
-
Kconfig for software configuration, Devicetree for hardware configuration (Zephyr)
-
Other solutions include CMSIS (ARM Cortex-M)
-
Vendor HALs are generally not portable
-
-
Real-time:
-
Linux has come a long way for soft real-time
-
RTOSes are designed for hard real-time (e.g. via interrupts)
-
-
Programming Languages:
-
Linux supports virtually any language on most platforms
-
RTOSes typically support C, but we are seeing increased support for MicroPython, Rust, WebAssembly, etc
-
-
Distros
-
Linux provides shared code / portability, learning resources, and professional support. Multiple distros, even distribution builders like Yocto
-
RTOSes are typically proprietary with paid support, sponsorship for cloud services
-
-
Updates
-
Bootloaders
-
Software Integrity
-
Confidential Computing
-
In summary, Linux does a number of things really well, and there are a number of areas where RTOSes could take advantage. Zephyr does a great job of this by adopting Devicetree and Kconfig from the Linux kernel. RTOSes generally are able to offer more traditional real-time capabilities.
-
Tracing MC
CFP Ends: Sept 10, 2021
The Tracing microconference focuses on improvements of the Linux kernel tracing infrastructure. Ways to make it more robust, faster and easier to use. Also focus on the tooling around the tracing infrastructure will be discussed. What interfaces can be used, how to share features between the different toolkits and how to make it easier for users to see what they need from their systems.
Suggested topics:
-
Tracepoints that allow faults. It may be necessary to read a user space address, but because tracepoints currently disable preemption, the callback can neither sleep nor fault. There is also the possibility of causing locking issues.
-
Function parameter parsing. Now that on x86 function tracing has full access to the arguments of a function, it is possible to record them as they are being traced. But knowing how to read the parameters may be difficult, because it is necessary to know the prototype of the function to do so. Having some kind of mapping between functions and how to read their parameters would be useful. Using BTF is a likely candidate.
-
Consolidating tracing of return of a function. Currently there are three use cases that hook to the return of a function, and they all do it differently: kretprobes, function graph tracer, and eBPF.
-
User space libraries. Now that libtraceevent, libtracefs, and libtracecmd have been released, what tooling can be built around them? Also, improving the libtraceevent API to be more intuitive.
-
Improving the libtracefs API to handle kprobes and uprobes more easily.
-
Python interface. Working on getting the libraries a python interface to allow full tracing from within python scripts.
-
Tracing containers. What would be useful to expose on creating and running containers.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Tracing MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Steven Rostedt <rostedt@goodmis.org>
- Yordan Karadzhov <ykaradzhov@vmware.com>
Tracing MC Summary
-
DTrace based on BPF and tracing facilities: challenges [video][slides]
- DTrace scripts are compiled into BPF functions. There are pre-compiled functions in C to BPF. BPF is still "bleeding edge", but production kernels usually prefer "stability" over "bleeding edge". BPF was missing "bounding values" from loading and reading from the stack (it was stated that this was fixed on 7/13/2021 - v5.14-rc4), but the fix is still not in production systems. Although it was backported, there are other issues (not specified) whose fixes were not.
- Current solution uses BPF maps, but this has limitations, such as the verifier not being able to validate anything stored or loaded on the maps, and the values are pointers.
- Possible suggested solutions:
-
Allow BPF maps to have a larger size
-
Use multiple map values (but is cumbersome)
-
New per-cpu memory resource. Does not need to be visible to user space.
-
Preallocated with bpf_malloc/free helpers.
-
-
Complex scripts and function loops. Perhaps a BPF helper for loops can be added that is safe to use.
-
No way to currently save pt_regs from tracepoints. But as tracepoints are jumps and not break points they are similar to calling a function. How would one save registers from a function call?
-
Issues with the verifier where GCC/LLVM can produce "verified" code that then fails the kernel verifier.
-
Enabling user mode programs to emit into trace_event / dyn_event [video][slides]
-
Want a way to allow user mode applications to send data to a user mode defined trace event. Currently possible with uprobes, but is hard for C#, Java, etc to be attached to uprobes.
-
Problem: Many processes running in cgroups using different languages (Python, Rust, Go, C/C++, C#, Java), but want a single monitoring agent in the root namespace. There could be multiple tools doing the tracing (LTTng, uprobes, perf, etc). Need a way to have consistent data across the tools. Do not want daemons or agents running in the container namespaces.
-
Proposed Solution:
-
Have the user applications create / open a "user_events_data" file in tracefs. Get an event_id from this file by passing in an event name via an ioctl(). (Similar to the "inject" file for tracepoints).
-
Use a mmapped page shared between the kernel and user space, where the bits of a byte let you know what is attached (bits 0-6 for ftrace, perf, etc). Bit 7 is reserved for "others". Zero means nothing is consuming / tracing that event. The user application will check this byte, and if it is set, it will then write the trace event data into a file which will be passed to the connected consumers (the tracers).
-
trace_markers was mentioned, but they do not have all the features that a trace event has (attaching triggers, histograms, synthetic events, etc).
-
Field argument formats were discussed.
-
Issues with it being a binary blob. Should be easily parsable.
-
Needs to be accessible from non root. Can change permissions of files in tracefs.
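The status-byte check in the proposed solution can be sketched as follows. The concrete bit assignments (`ftrace` on bit 0, `perf` on bit 1) and the helper names are hypothetical; the notes only state that bits 0-6 identify attached tracers, bit 7 is reserved for "others", and zero means nobody is listening.

```python
# Hypothetical bit assignments for illustration only.
CONSUMER_BITS = {0: "ftrace", 1: "perf", 7: "others"}


def attached_consumers(status_byte):
    """Decode which consumers are attached to an event."""
    return [name for bit, name in sorted(CONSUMER_BITS.items())
            if status_byte & (1 << bit)]


def emit(status_byte, payload, sink):
    """Write the event only when at least one consumer is attached."""
    if status_byte == 0:
        return False  # cheap fast path: nothing is tracing this event
    sink.append(payload)
    return True


written = []
print(attached_consumers(0b10000001))
print(emit(0, b"ignored", written), emit(0b00000010, b"event-data", written))
```

The point of the shared byte is that an untraced event costs the application one memory read and no system call.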
-
-
-
Container tracing: [video][slides]
-
First, how do we define a container? As containers are just a bunch of user tasks. Yordan Karadzhov said it is trivial to figure out what tasks are part of a container if you start tracing it during its creation. But finding out what is part of a container after the fact is more difficult. Christian Brauner said most of the time the PID name space defines the container, but that is not always the case. He suggested using a "tagging" method, where one could tag the process creating a container, or adding tasks to a container, and all the children will maintain this "tag". Then to define the container the tag will be used.
-
Mathieu said that LTTng traces by inode numbers. But is missing the "user given" name of the container, and tagging would solve this too.
-
Mathieu also said that we need to be concerned about "nested containers".
-
Masami asked about tracing from within the container, but Steven shot it down due to the fact that once you can see the events of the kernel, you are no longer inside the container. Although, it should be fine for privileged containers.
-
Christian said there's some ways to get by that with eBPF from inside a container.
-
Mathieu suggested having "safe" tracepoints that the container is allowed to use.
-
Yordan asked about other use cases for container tracing, like for security, network activity across nodes, etc, but the conversation still went back to what is a container.
-
Christian mentioned that there are mechanisms in core scheduling that could be used as well.
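The "tagging" idea from this discussion can be modelled in a few lines. This is a sketch of the concept only: the `Task` class and its `fork()` method are hypothetical, and no such kernel interface is described in the notes.

```python
class Task:
    """Toy task that inherits its container tag from its parent."""

    def __init__(self, pid, parent=None):
        self.pid = pid
        self.tag = parent.tag if parent else None

    def fork(self, pid):
        return Task(pid, parent=self)


def container_members(tasks, tag):
    """A container is defined as every task carrying the given tag."""
    return sorted(t.pid for t in tasks if t.tag == tag)


init = Task(1)
creator = init.fork(100)
creator.tag = "web"           # tag the process that creates the container
child = creator.fork(101)     # children keep the tag
grandchild = child.fork(102)
unrelated = init.fork(200)    # forked from an untagged task
tasks = [init, creator, child, grandchild, unrelated]
print(container_members(tasks, "web"))
```

Because membership follows the tag rather than the PID namespace, the same lookup works after the container is already running, and the tag can carry the user-given container name.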
-
-
Tracepoints that allow faults [video][slides]
-
Tracing of system calls happens at the start of the system call. If there's a pointer to user space, and that address is not yet mapped in, then it cannot be retrieved, because tracepoints run with preemption disabled and retrieving user space memory requires scheduling.
-
Suggested extending the tracepoint and tracepoint APIs to include callbacks with preemption enabled, but the tracers will need to know this, as they currently expect preemption disabled. Will require updating to use task trace RCU mechanisms.
-
Steven mentioned the use of synthetic events connecting the entry and exit of a system call, to trigger the event on the exit of the system call, where the user space addresses would be resolved. Mathieu asked about exec, as the old addresses would be discarded. However, the exec system call usually doesn't suffer from not having the address loaded, as the path is usually loaded as one of the exec system call parameters.
-
Mathieu suggested that we could just have preemption disabled before calling the perf and ftrace tracers.
-
Peter Zijlstra noticed that the rcu task trace sends IPIs to CPUs that are running NO HZ user space tasks, which must be fixed first.
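The synthetic-event approach mentioned by Steven amounts to pairing each system call entry with its exit and emitting one combined record at exit time. A minimal sketch, assuming a simplified event format (the dictionary keys used here are hypothetical, not the kernel's trace format):

```python
def pair_syscalls(events):
    """Join entry and exit records per thread into one record at exit time."""
    pending = {}   # tid -> entry record still waiting for its exit
    combined = []
    for ev in events:
        if ev["kind"] == "enter":
            pending[ev["tid"]] = ev
        elif ev["kind"] == "exit" and ev["tid"] in pending:
            entry = pending.pop(ev["tid"])
            combined.append({
                "tid": ev["tid"],
                "syscall": entry["syscall"],
                "args": entry["args"],  # user pointers are mapped in by now
                "ret": ev["ret"],
            })
    return combined


trace = [
    {"kind": "enter", "tid": 7, "syscall": "openat", "args": ("0x7ffc1000",)},
    {"kind": "enter", "tid": 9, "syscall": "read", "args": (3,)},
    {"kind": "exit", "tid": 7, "ret": 3},
    {"kind": "exit", "tid": 9, "ret": 128},
]
print(pair_syscalls(trace))
```

The record is produced on the exit side, after the kernel has already faulted in whatever user memory the call needed.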
-
-
LTTng as a fast system call tracer [video][slides]
-
How to get at least part of LTTng into mainline.
-
Need to break it up and find a minimal set for upstreaming. Could look at the system call with page faults first.
-
Steven wants more kernel developers to be interested in having LTTng upstreamed, or at least users that are pushing for it. Having a demand is needed, instead of just Steven and Mathieu deciding it.
-
Need a way to pass argument types "user" and "kernel" to the tracers.
-
Masami suggested using ioctl() as they have lots of data structures.
-
Steven suggested using BTF for defining arguments for system calls.
-
-
Eventfs based upon VFS to reduce memory footprint [video][slides]
-
The tracefs memory footprint is large due to the events directory. Every event system has a directory dentry, as well as an enable and filter file.
-
The events themselves each have their own directory as well as several files to enable the events and to enable triggers, filters, and to read the event formats. The event directory was measured to be around 9 MB. When an instance is created (for more than one trace buffer) it duplicates the events directory, allocating a new 9 MB of memory. There needs to be a way to have just one copy of the metadata, and dynamically create the directories and files as they are referenced. Ideally, this will remove 75% from the 9 MB.
-
Ted Tso mentioned that sysfs has the same complaints about the size of the files for the objects and it could use this too.
-
There was a lot of acceptance that this would be good to achieve, not only for the proposed eventfs, but for the other pseudo file systems as well.
-
-
Function tracing with arguments [video][slides]
-
Currently function tracing only traces the name of the function being traced and its parent. As of 5.11 the function tracer (on x86_64) gets access, by default, to the registers and stack needed to retrieve the parameters of every function. What is still needed is a way to know, on a function-by-function basis, how to read the arguments. BTF is currently in the kernel with that information, but there isn't a fast way to retrieve it (it is needed at the time the functions are being traced).
-
Masami mentioned that BTF describes the arguments for each function but does not describe the registers to retrieve those arguments, which differ on each arch. It was suggested to record the raw data (only knowing which regs and stack need to be saved into the trace buffer), then post-process the names and parsing at the time of reading the trace.
-
BTF information may be tricky, as finding data for modules may be different than for the core kernel. The split BTF base for modules may not be globally unique.
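The "record raw, decode later" idea can be sketched as below. The register order is the x86_64 integer argument convention; the `PROTOTYPES` table is a hand-written, hypothetical stand-in for what BTF would provide, and only register-passed arguments are handled (anything beyond six would come from the stack).

```python
# x86_64 passes the first six integer/pointer arguments in these registers.
X86_64_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

# Hypothetical stand-in for BTF: function name -> parameter names.
PROTOTYPES = {
    "vfs_read": ("file", "buf", "count", "pos"),
}


def decode_args(func, raw_regs):
    """Post-process a raw register snapshot into named arguments."""
    names = PROTOTYPES.get(func)
    if names is None:
        return dict(raw_regs)  # no prototype known: keep the raw registers
    return {name: raw_regs[reg] for name, reg in zip(names, X86_64_ARG_REGS)}


snapshot = {"rdi": 0xFFFF888012340000, "rsi": 0x7FFD1000, "rdx": 4096,
            "rcx": 0xFFFFC90000ABCDE8, "r8": 0, "r9": 0}
print(decode_args("vfs_read", snapshot))
```

The trace-time cost is a fixed copy of registers; the per-architecture and per-function knowledge is only needed when the trace is read.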
-
-
Merging the return caller infrastructures [video][slides]
-
There are currently three subsystems that do tricks to add a callback to the return of a function: kretprobes, function graph tracer, and BPF direct trampolines. Kretprobes and function graph tracer "hijack" the return pointer and insert a return to a trampoline that does the callback and then returns to the original return address that was saved on a shadow stack. BPF has the trampoline call the function being traced and simply has the return side on the trampoline do the callback and return normally itself. But this has issues if there are parameters on the stack, as those need to be copied again when calling the traced function.
-
Peter stated that having one infrastructure would be helpful for the CFI "shadow stack" verification code.
-
Steven stated that function_graph tracer is simplest because it is fixed in what it does (record return and timestamp). The kretprobe calls generic code that may expect more (need to know more about the state of the system at the return, regs, etc).
-
Masami wants to make a single "shadow stack" that both function graph tracer and kretprobes can use. Having multiple users of the shadow stack can probably work the same way tail calls work. The first caller "hijacks" the return address and places the address of its return trampoline on the stack, then when the next user of the shadow stack does its replacement, it will "hijack" the stack in the same location and save the return to the previous return trampoline onto the shadow stack and replace it on the main stack with the address of its return trampoline. When its return trampoline is called, it will put back the return to the previous trampoline and call that.
-
Masami mentioned that the kretprobes can be updated to use a generic shadow stack, as it currently uses a shadow stack per probe. Peter said that he had code to do that somewhere, but doesn't think it went anywhere. This needs further investigation. Steven said that he would rip out the logic of the function graph tracer's shadow stack and work on making it generic. On x86, Peter said that task stacks are now 16K each. Steven thinks that 4K for the shadow stack should work. But the tracer needs to check that it has room on the shadow stack and be able to fail safely if there is none left.
-
BPF can't do it generically, because it only saves the necessary args per function.
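The shared shadow stack scheme Masami described, where each user saves whatever return address is currently in place and substitutes its own trampoline, can be simulated as follows. This is a toy model: the frame is a dictionary, the trampoline names are hypothetical labels, and capacity is counted in entries rather than bytes.

```python
class ShadowStack:
    """Toy per-task shadow stack shared by several return hooks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.saved = []

    def hijack(self, frame, trampoline):
        """Save the current return target and replace it; fail safely if full."""
        if len(self.saved) >= self.capacity:
            return False  # no room: leave the frame untouched
        self.saved.append(frame["ret"])
        frame["ret"] = trampoline
        return True


def do_return(frame, shadow, handlers):
    """Follow the chain of trampolines back to the original return address."""
    order = []
    target = frame["ret"]
    while target in handlers:
        order.append(handlers[target]())  # run this user's return callback
        target = shadow.saved.pop()       # restore the previous return target
    return target, order


frame = {"ret": "caller+0x10"}
shadow = ShadowStack(capacity=2)
handlers = {
    "fgraph_tramp": lambda: "fgraph",
    "kretprobe_tramp": lambda: "kretprobe",
}
shadow.hijack(frame, "fgraph_tramp")
shadow.hijack(frame, "kretprobe_tramp")
third = shadow.hijack(frame, "extra_tramp")  # stack is full
final, order = do_return(frame, shadow, handlers)
print(third, final, order)
```

The last user to hook runs first on return, and a full shadow stack simply means the function is not traced by that user rather than corrupting the return path.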
-
Toolchains and Kernel MC
CFP Ends: Aug 14, 2021
The Toolchains and Kernel microconference focuses on topics of interest related to building the Linux kernel. The goal is to get kernel developers and toolchain developers together to discuss outstanding or upcoming issues, feature requests, and further collaboration.
Suggested topics:
- Upstreaming Rust Support
- Using Clang's locking annotations
- Memory ordering progress in the C/C++ standards committees
- Toolchain security feature requests
- Post Link Optimization of the kernel with Binary Optimization and Layout Tool (BOLT)
- Objtool on arm64[4]
- DWARF, CTF and BTF
- BPF/BTF/CORE support in the GNU Toolchain
- Using BTF for ABI analysis
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Toolchains and Kernel MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Jose E. Marchesi <jose.marchesi@oracle.com>
- Nick Desaulniers <ndesaulniers@google.com>
Real-time MC
CFP Ends: Sept 10, 2021
The Real-time microconference focuses on finishing the last lap of getting the PREEMPT_RT patch set into mainline. Many of these missing pieces, however, are not at the core of real-time features (like locking and scheduling), but instead in other subsystems that compose the kernel, like file systems and memory management. Making these Linux subsystems compatible with PREEMPT_RT requires finding a solution that is acceptable to the subsystem maintainers, without having these subsystems suffer from performance or complexity issues.
Suggested topics:
- New tools for PREEMPT_RT analysis.
- How do we teach the rest of the kernel developers how not to break PREEMPT_RT?
- Stable maintainers tools discussion & improvements.
- The usage of PREEMPT_RT on safety-critical systems: what do we need to do?
- Make NAPI and the kernel-rt work better together.
- Migrate disable and the problems that it causes on rt tasks.
- It is time to discuss the "BKL"-like style of our preempt/bh/irq_disable() synchronization functions.
- How do we close the documentation gap
- The status of the merge, and how can we resolve the last issues that block the merge.
- Invite the developers of the areas where patches are still under discussion to help to find an agreement.
- How can we improve the testing of the -rt, to follow the problems raised as Linus tree advances?
- What’s next?
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Real-time MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Daniel Bristot de Oliveira <bristot@redhat.com>
- Clark Williams <williams@redhat.com>
- Steven Rostedt <rostedt@goodmis.org>
- Dhaval Giani <dhaval.giani@oracle.com>
- Kate Stewart <stewart@linux.com>
Real-time MC Summary
-
Maintaining PREEMPT_RT: Now and Then [video][slides]
- Regarding the current approach to managing patches, the PREEMPT_RT project uses quilt for queueing patches, because git is not viable for initial modifications. The changes that are applicable to upstream are submitted to Linus' tree.
- One of the main questions for the future is how things will work upstream. Sebastian asks that the RT developers be consulted before attempting a fix for a warning. Daniel Bristot then asked: how will the developers know who to contact for -rt bugs when the rt merge happens? Steven Rostedt and Clark Williams asked about a methodology to know who to contact in case of problems. Thomas replied that things will work as they do today with the existing problems: it's obvious who to talk to when testing picks up a splat with PREEMPT_RT on and it goes away when that is off. Steven agreed that this problem is not specific to rt, but also applies to other hard areas such as RCU. Clark worries that people outside of our community may still not know who to go to. The final agreement is that these problems should be discussed on the linux-rt-users list.
- Juri Lelli suggested more automated CI to automate the bug reports, and Luis Claudio suggested a list of tests to use as CI, using the already existing infrastructure from companies (Intel, Red Hat, ...). Mark Brown mentioned that there are other companies, mostly in the embedded world, that could also help with that. Daniel Wagner is working on getting LAVA testing working for PREEMPT_RT on embedded boards. Clark mentioned the list of problems that we can have with different configs, and also mentioned the BIOS being a problem for getting real-time performance, and tuning that can interfere with automated testing of real-time performance. But then Daniel said that the BIOS problem is not necessarily an issue: instead of comparing all machines to one another, the results should be compared only against previous versions on the same machine.
-
RTLA: An Interface for osnoise/timerlat tracers [video][slides]
- Nowadays there are two main types of RT workloads: periodic (timer-based) and polling (busy loop). cyclictest is the most used periodic test, and sysjitter/oslat are poll-mode tests. When a user sees bad results, it is hard for most people to get started troubleshooting. The tracers themselves already give you valuable information, though with the risk of high volume, which is natural to a tracing session.
- Recently, two new tracers were added to the kernel, the timerlat, and osnoise tracers which aim to provide a metric similar to cyclictest and sysjitter/oslat. While they are good at providing snapshots of the results in the trace, they are not good for collecting long-term results. Daniel is proposing a tool named RTLA to overcome this limitation.
- RTLA stands for Real-time Linux Analysis tool. It is an interface for osnoise and timerlat tracer and provides a "benchmark like" interface for these tools.
- Thomas asked why not in rust, and Daniel said that most of the infrastructure around it is in C, like libtracefs. Daniel also mentioned that he is not using eBPF for now, but will likely use it in the future. Daniel and John Kacur discussed how we might integrate rtla as a measurement module in rteval, and it seems to be feasible.
- Finally, Daniel asked questions about tracing features and discovered that he misunderstood the "size" of kernel histograms. He will fix it in the next version. Daniel raised the idea of adding features to libtracefs. Steven said that requests should be filed in Bugzilla. Daniel requested non-consuming text output from the trace, but that would be hard. So Daniel will continue with the multiple instances approach.
-
Linux kernel support for kernel thread starvation avoidance [video][slides]
- Using PREEMPT_RT for 5G networks requires less than 10 us latency. In these cases, the users want to run a busy-loop tool that takes 100% of the CPU time, using FIFO priority. As a side effect, this starves the kernel's housekeeping threads, which need to run on all CPUs. The starvation of threads can cause malfunctions in other subsystems; for example, destroying a container hangs immediately.
- While the kernel has a mechanism to avoid starvation, named real-time throttling, it lacks precision on the us scale. To work around this problem, people at Red Hat have developed a tool named stalld. The tool works in user-space, parsing sched debug options, detecting real-time tasks causing starvation, and boosting starving threads using SCHED_DEADLINE. The tool works but presents some limitations.
- The authors mentioned that stalld does not scale because it needs one monitoring thread per cpu, that the tool can starve on locks, most notably on those locks taken to log. The authors proposed an in-kernel, per-cpu algorithm, to detect starving threads using hooks to the scheduler events and per-cpu high-resolution timers.
- Daniel Bristot (stalld's author) said that these limitations of stalld were gone. Daniel has implemented a single-threaded version that had no drawback when compared to the multi-threaded version. The CPU usage was also dramatically reduced. To solve the starvation limitation, stalld can run with FIFO or even SCHED_DEADLINE priority.
- But the main point raised by Daniel is that a per-cpu approach, using interrupts to track starvation, can cause noise to the busy-loop thread, which is not what users want, and that is why stalld monitors the CPUs remotely. Stalld does not run any code on the monitored/isolated CPU, thus reducing the system noise to the minimum possible.
- Thomas says the real issue is with the other areas of the kernel running threads on the NOHZ_FULL CPU. Daniel said that this was a reason why stalld was implemented in user-space: when the fully isolated CPU becomes a reality, stalld can be thrown away, without adding restrictions to the kernel.
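Remote starvation detection of the kind discussed here can be sketched without running anything on the monitored CPU. The heuristic below is an assumption made for illustration, not stalld's actual code: a task counts as starving when it stays runnable with an unchanged context-switch count for at least a threshold. The `StarvationMonitor` class and its input format are hypothetical.

```python
class StarvationMonitor:
    """Flag runnable tasks whose context-switch count has stopped moving."""

    def __init__(self, threshold_s):
        self.threshold_s = threshold_s
        self.seen = {}  # pid -> (context switches, time first seen at that count)

    def update(self, now_s, runnable):
        """runnable maps pid -> context-switch count from a remote snapshot."""
        starving = []
        fresh = {}
        for pid, switches in runnable.items():
            old = self.seen.get(pid)
            # Keep the original timestamp only if the task made no progress.
            since = old[1] if old is not None and old[0] == switches else now_s
            fresh[pid] = (switches, since)
            if now_s - since >= self.threshold_s:
                starving.append(pid)
        self.seen = fresh
        return starving


monitor = StarvationMonitor(threshold_s=2)
first = monitor.update(0, {10: 5, 11: 7})
second = monitor.update(1, {10: 5, 11: 8})
third = monitor.update(2, {10: 5, 11: 9})
print(first, second, third)
```

A real tool would then boost the flagged task for a short time; here the monitor only reports which tasks would qualify.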
-
futex2: next steps [video][slides]
- futex2 is a new set of syscalls to solve long-standing issues with the current interface. It addresses NUMA awareness, waiting on multiple futexes, and variable-size futexes. futex2 is already a patchset under discussion, and is being refactored to make the patches smaller and easier to comprehend.
- The discussion went toward the problem with the structure used to define the time. timespec is not the best way to go; Peter would like to see __u64 for timeouts. André asked if that is a good way, and Thomas said that he agrees with __u64, even if it will be a problem in the long-term future. Steven mentioned working around the future problem via libc. Thomas said we want both absolute and relative values for the time value. In the end, the recommendation was to stay with __kernel_timespec rather than a __u64 time value. It was also recommended that a clockid be added as an argument to futex_waitv() so that the clock used for the time value can be selected.
- For the futex_waitv structure, dropping the __reserved member to save memory was considered, with debate on whether the memory saved is worth the effort. Arnd asserts that we should optimize for 64-bit performance going forward. The conclusion is to move forward with the __u64 uaddr struct. NUMA-aware futex2 is interesting but needs a much harder look, and needs buy-in from users (glibc, databases, JITs using raw futexes).
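As a rough illustration of the "long-term future" concern raised in the timeout discussion, the range of a 64-bit timeout can be worked out directly. The unit is an assumption: the notes do not say what the __u64 would count, so this sketch assumes nanoseconds.

```python
NSEC_PER_SEC = 1_000_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year

# Assumption: an unsigned 64-bit timeout expressed in nanoseconds.
max_u64_ns = 2**64 - 1
range_years = max_u64_ns / NSEC_PER_SEC / SECONDS_PER_YEAR

print(f"a u64 nanosecond value spans about {range_years:.1f} years")
```

Under that assumption a relative timeout could never realistically overflow, while an absolute value only runs out centuries after its clock's epoch.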
-
printk and real-time [video][slides]
- The topic started explaining why printk is so important to PREEMPT_RT users. One of the main challenges is to ensure that printk can print from any context. To do that the printk call part needs to be decoupled from actual printing output to terminal (any type of terminal), and this is done using a ring-buffer to store the printk messages.
- 5.15 will have a completely safe lockless ringbuffer, but this component is not upstream yet. The idea going forward is to use printk caller when not in the panic(), with atomic consoles that will be used in this situation (panic consoles). Panic consoles do not need to worry about synchronization, with the exception of the serial terminals?
- KGDB is a special case, since there we are trying to debug the kernel itself. There is an ongoing debate about how to transfer ownership of the console cpu-lock to kgdb. Renaming the lock to "cpu_sync" could be an option, since the console cpu-lock isn't really a lock but just a mechanism to synchronize access to consoles (somewhat cooperatively).
- Regarding atomic consoles, they are implemented on the mainline with polling APIs. But they are only implemented on PREEMPT_RT with physical 8250 UART. Developers are currently trying to find the best path for implementing them for PREEMPT_RT.
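The decoupling described in this session, where a printk call only stores a record and a separate step pushes records to a console, can be sketched with a bounded buffer. This toy version is single-threaded and not lockless, unlike the kernel's ring buffer; the `MessageBuffer` class and its method names are hypothetical.

```python
from collections import deque


class MessageBuffer:
    """Bounded message store: callers append, a separate step prints."""

    def __init__(self, capacity):
        self.records = deque(maxlen=capacity)  # oldest records are overwritten
        self.next_seq = 0

    def printk(self, text):
        # Storing is cheap and never touches the console.
        self.records.append((self.next_seq, text))
        self.next_seq += 1

    def flush(self, write):
        # Drain stored records to a console callback, oldest first.
        count = 0
        while self.records:
            seq, text = self.records.popleft()
            write(f"[{seq}] {text}")
            count += 1
        return count


buf = MessageBuffer(capacity=3)
for i in range(5):
    buf.printk(f"msg {i}")
console = []
flushed = buf.flush(console.append)
print(flushed, console)
```

Because the caller never waits for the console, the store step can be used from contexts where slow output would be unacceptable.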
-
PREEMPT_RT: Status and QA [video]
- The first question was regarding the PREEMPT_RT merge. The answer was that the kernel 5.15 has locking and mm patches for PREEMPT_RT. The major parts of PREEMPT_RT are already mainline. But still won't boot due to open issues. For instance, namespace issues still exist, but it is not the major concern. Thomas points to problems in the network side as the most critical part. There are also some mm and fs problems, but they are not a big point because these problems are not strictly related to the PREEMPT_RT.
- Softirq latencies are a problem on the non-rt kernel: is any work of the RT patchset helping with that? The question comes mainly from latency problems faced with Android. Thomas replied that all the work done in this regard is a band-aid. We need to get away from softirq and the softirq disable limitations.
- Another question was if there is any plan to get ktimersoftd (timer softirq specific thread) back?
- Thomas said that it might happen, but it is not a priority; budget limitations limit the bandwidth.
- Do we need to care about arch-specific problems for PREEMPT_RT? Thomas said that there is not too much to worry about them.
- Finally, in a discussion about system safety, Thomas said that there are already cases of Linux being used on critical systems, like on military systems, but things are way more complex in ADAS. Daniel Bristot said that there will be a talk about this topic the next day. This talk was "A maintainable, scalable, and verifiable SW architectural design model for the Linux Kernel".
Testing and Fuzzing MC
CFP Ends: Sept 10, 2021
The Testing and Fuzzing microconference focuses on advancing the current state of testing of the Linux kernel. We aim to create connections between folks working on similar projects, and help individual projects make progress.
We ask that any topic discussions will focus on issues/problems they are facing and possible alternatives to resolving them. The Microconference is open to all topics related to testing & fuzzing on Linux, not necessarily in the kernel space.
Suggested topics:
- KernelCI: Extending coverage and improving user experience.
- Growing KCIDB, integrating more sources.
- Better sanitizers: KFENCE, improving KCSAN.
- Using Clang for better testing coverage: Now that the kernel fully supports building with clang, how can all that work be leveraged into using clang's features?
- How to spread KUnit throughout the kernel?
- Testing in-kernel Rust code.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Testing and Fuzzing MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Sasha Levin <sashal@kernel.org>
- Guillaume Tucker <guillaume.tucker@collabora.com>
Testing and Fuzzing MC Summary
-
Detecting semantic bugs in the Linux kernel using differential fuzzing [video][slides]
- Semantic bugs are hard to find because they typically can't be detected by current static and runtime analysis tools. The proposed solution is differential fuzzing, by providing the same input to different implementations of the same system and comparing the outputs
- For the Linux kernel, the input is system calls. Different kernel revisions can be compared, such as stable branches. syz-verifier automates this by running different kernel versions in multiple VMs and comparing the results.
- Many avenues can be explored to go further by reducing nondeterministic behaviour and false positives.
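Differential fuzzing as described above can be sketched in miniature: feed the same generated input to several implementations and report any disagreement. This is not syz-verifier; the two `clamp` functions are hypothetical stand-ins for two kernel revisions, and the harness name and parameters are assumptions.

```python
import random


def differential_fuzz(impls, gen_input, rounds, seed=0):
    """Run every implementation on the same inputs and collect mismatches."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    mismatches = []
    for _ in range(rounds):
        arg = gen_input(rng)
        results = {name: run(arg) for name, run in impls.items()}
        if len(set(results.values())) > 1:
            mismatches.append((arg, results))
    return mismatches


def clamp_v1(n):
    return max(0, min(n, 255))


def clamp_v2(n):
    return max(0, min(n, 256))  # semantic change: upper bound moved by one


found = differential_fuzz(
    {"v1": clamp_v1, "v2": clamp_v2},
    lambda rng: rng.randint(-1000, 1000),
    rounds=200,
)
print(len(found), found[0])
```

Neither version crashes or trips a sanitizer; the bug is only visible because the two outputs differ, which is the class of issue this technique targets.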
-
Bare-metal testing using containerised test suites [video][slides]
- Each hardware test environment is different, across all the different test farms. Using containers is a way to help harmonising the user-space.
- boot2container (b2c): a minimalist initrd and kernel config fragment with just enough to run commands in containers
- One limitation is the need to have a working network interface in the kernel to download the Docker image. The minimum amount of RAM needed is 64MB which can also be a limiting factor on tiny devices.
- b2c is already being used for integration and unit tests.
-
Common Test Report Database (KCIDB) [video][slides]
- Unified database for collating test results from various test systems using a single schema. This could then be used to send email notifications with all the available data.
- There is a proof-of-concept web dashboard based on Grafana. There are also command-line tools and a Python package available for integration with other test systems.
- Next important feature to enable is regression tracking. Collecting use-cases about how this would fit in various workflows.
-
Testing the Red-Black tree implementation of the Linux kernel against a formally verified variant [video][slides]
- This is done by comparing an in-kernel implementation with a formally verified one.
- Three different test case generators: random, exhaustive and symbolic. The time to run them and their coverage were measured. The total coverage of the core part of the code amounted to 99.45%.
-
New smatch developments [video][slides]
- New Param/Key API
- Extends how the database stores information about parameters
- Reduces amount of code needed
- Makes it easy to interact with the DB
- Sleeping in Atomic
- Previously relied on a list of functions that affect preemption, but we don't know what the preempt count looks like under the hood
- Some functions may or may not modify preempt count (such as the _trylock() family of functions)
- New code in check_preempt.c tracks the preempt count and properly handles in_atomic() checks
- Can be used for other reference counters, modularization helps a lot
- Race Conditions
- Tends to produce more false positives than real bugs because some checks act as a fast path before taking a lock.
- Improved by Lukas Bulwahn and Julia Lawall: infer which locks are needed by looking at the statement after a lock was used, still produces a bunch of warnings.
- Currently analysis of race conditions is useful for attackers who can sift through false positives, but very difficult for maintainers who want warning-free code. Could be improved with more annotations.
- Possibly fuzz locks by adding delays into locking operations.
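The "Sleeping in Atomic" check above boils down to tracking a preempt count along a code path and warning when a sleeping function is reached with a non-zero count. The sketch below is not smatch's implementation: the tables and the path format are simplified assumptions, with a success flag modelling the _trylock() case where the count only changes if the lock was taken.

```python
# Simplified assumption: how each call changes the preempt count on success.
PREEMPT_DELTA = {
    "preempt_disable": 1, "preempt_enable": -1,
    "spin_lock": 1, "spin_unlock": -1,
    "spin_trylock": 1,
}
MAY_SLEEP = {"msleep", "mutex_lock", "schedule"}


def check_path(path):
    """path is a list of (function, succeeded) pairs along one code path."""
    count = 0
    warnings = []
    for func, succeeded in path:
        if func in MAY_SLEEP and count > 0:
            warnings.append((func, count))  # sleeping with preemption disabled
        if succeeded:
            count += PREEMPT_DELTA.get(func, 0)
    return warnings


lock_taken = [("spin_trylock", True), ("msleep", True), ("spin_unlock", True)]
lock_failed = [("spin_trylock", False), ("msleep", True)]
print(check_path(lock_taken), check_path(lock_failed))
```

Tracking the count per path, instead of keeping a flat list of "atomic" functions, is what lets the failed-trylock path pass without a false positive.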
-
Fuzzing Device Interfaces of Protected Virtual Machines [video][slides]
- Many proposed technologies to protect guests from malicious hosts (SGX, SEV, etc). Encrypted guest memory with integrity protections. Guest state is protected and the guest's page table is integrity protected.
- Virtual devices used to be trusted, but under this new model they could be used as an attack vector either during runtime (received data, leaking of data or pointers) or during initialization (forcing failures during initialization for guests to mishandle).
- How do we fuzz the HW/OS interface? Timing attacks, state accumulation, games with IRQs, incorrect error handling (BUG()), etc
- Developed a new tool that targets driver fuzzing, built on top of LKL and libFuzzer. It targets drivers loaded as shared libraries with virtio, PCI, and platform device stubs, intercepts I/O (DMA, MMIO, and PIO), and removes delays in the code (sleep, schedule_timeout, etc.)
- Even with the above, coverage is somewhat low (<25%)
- Found 50 bugs across 22 drivers, different categories, very few of the bugs are actually exploitable (but still could be used in conjunction with different issues).
-
Rust built-in testing features [video][slides]
- In the unit test case, testing is combined with the functional code but lives in its own module. Integration tests can live in their own files.
- Rust has a notion of tests, and can tag functions as tests, allowing them to be automatically treated as such
- Easy to enable or disable specific tests
- Assertions can be part of the documentation and be run during testing, testing is part of the actual documentation
- Support for build time tests (comparable to BUILD_BUG_ON())
- Work in progress on separating tests into the kernel and userspace portions, annotating, being part of a single test
- How do we combine the kernel's existing test infra with Rust? How do we do kselftests? KUnit tests?
-
KUnit: New Features and New Growth [video][slides]
- About 3x increase in the amount of tests in the kernel in the past year
- Many of the new kernel features came with KUnit tests (DAMON, KFENCE, etc)
- QEMU has support to run KUnit tests via kunit_tool
- .kunitconfig allows users to easily configure the tests they want to run, easy to customize
- Improved documentation: kernel testing guide describes the differences between various kernel testing and validation tools. kunit_tool docs were also improved.
- Looking to standardize KTAP between kselftest and kunit
- What are the incentives to add more tests? Making the tests easier to run means that they'll run in more contexts, making them more visible publicly.
- We need to better explain the difference between kunit and kselftests, users are not sure which to use and when
File System MC
CFP Ends: Sept 15, 2021
The File system microconference focuses on a variety of file system related topics in the Linux ecosystem. Interesting topics about enabling new features in the file system ecosystem as a whole, interface improvements, interesting work being done, really anything related to file systems and their use in general. Often times file system people create interfaces that are slow to be used, or get used in new and interesting ways that we did not think about initially. Having these discussions with the larger community will help us work towards more complete solutions and happier developers and users overall.
Suggested topics:
- DAX - are we finally ready for prime time?
- Optimizing for cloud block devices. How do we deal with unstable transport? Do we need to rethink our IO path?
- Atomic writes, and FIEXCHANGE_RANGE
- Writeback throttling - we have a lot of different solutions, are we happy with the current state of affairs?
- Page Folios
- RWF_ENCODED
- Performance testing
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "File System MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Josef Bacik <josef@toxicpanda.com>
- Amir Goldstein <amir73il@gmail.com>
- Ted Ts'o <theodore.tso@gmail.com>
- Jan Kara <jack@suse.cz>
File System MC Summary
-
Efficient buffered I/O (Page folios) [video]
-
Matthew Wilcox talked about the work that filesystem developers need to do in order to convert disk or network filesystems to use Folios
-
Josef Bacik and Ted Ts’o offered to help with performance testing
-
Covered better by https://lwn.net/Articles/869942/
-
-
Idmapped mounts [video][slides]
-
Christian Brauner gave a quick overview about the idmapped mounts feature that was merged for v5.12, what they are used for and how they are implemented in the VFS.
-
More background can be found at https://lwn.net/Articles/837566/
-
At the moment, only privileged users (typically systemd or container runtime) are allowed to create idmapped mounts for the use by processes running inside a user namespace
-
In a followup session, Jan Kara talked about Project ids and what their semantics should be w.r.t idmapped mounts use cases
-
The existing project id semantics are quite old, so the community may propose new semantics to meet modern use cases, but first, a proper specification for the new semantics needs to be drafted
-
-
Atomic file writes [video]
-
Darrick Wong talked about a proposal for implementing filesystem level atomic writes using FIEXCHANGE_RANGE ioctl (https://lwn.net/Articles/851392/)
-
The proposed API is a lot more flexible and predictable than the hardware atomic write capability available on some storage arrays, which is very hard to use in portable applications
-
-
Filesystem shrink [video][slides]
-
Allison Henderson talked about the technical challenges of shrinking the logical size of a filesystem, in cases where thin provisioning is not provided by the underlying block layer.
-
Ted Ts'o explained that the requirement is driven by the fact that Cloud vendors charge customers by the logical size of the provisioned block device and not by the actual Cloud backend storage usage - if we could get Cloud vendors to change their pricing model, there would probably be no need to shrink filesystem logical size
-
-
Bad Storage vs. File Systems [video][slides]
-
Ted Ts’o and Darrick Wong talked about lessons learned from running filesystems on top of unreliable Cloud backend storage.
-
Different use cases vary greatly in what the best response is when an I/O error occurs while writing data or metadata blocks
-
Josef Bacik held a strong opinion that error handling should be delegated to applications and that it should not be a filesystem decision
-
Ted Ts’o argued that some mechanisms, such as forcing a kernel panic, make sense to do in the kernel without involving userspace. Another mechanism to delegate the decision to system administrators might involve using eBPF.
-
-
Development Roadmaps [video]
-
Darrick Wong talked about some of the XFS work done in 2021 and what is planned for 2022
-
Josef Bacik talked about some of the btrfs work done in 2021 and what is planned for 2022
-
Matthew Wilcox talked about some more plans for improvements of page cache after Folios
-
Josef Bacik, Ted Ts’o and Darrick Wong talked about their regression test setups and how to better collaborate on regression tracking efforts
-
VFIO/IOMMU/PCI MC
CFP Ends: Sept 10, 2021
The VFIO/IOMMU/PCI micro-conference focuses on coordination between the PCI devices, the IOMMUs they are connected to and the VFIO layer used to manage them (for userspace access and device passthrough) with related kernel interfaces and userspace APIs to be designed in-sync and in a clean way for all three sub-systems, and on the kernel code that enables these new system features that often require coordination between the VFIO, IOMMU and PCI sub-systems.
Suggested topics:
-
VFIO
-
Write-combine on non-x86 architectures
-
I/O Page Fault (IOPF) for passthrough devices
-
Shared Virtual Addressing (SVA) interface
-
Single-root I/O Virtualization (SRIOV)/Process Address Space ID (PASID) integration
-
PASID in SRIOV virtual functions
-
Device assignment/sub-assignment
-
-
IOMMU
-
IOMMU virtualization
-
IOMMU drivers SVA interface
-
I/O Address Space ID Allocator (IOASID) and /dev/ioasid userspace API (uAPI) proposal
-
Possible IOMMU core changes (e.g., better integration with device-driver core, etc.)
-
-
PCI
-
I/O Address Space ID Allocator (IOASID)
-
INTX/MSI IRQ domain consolidation
-
Gen-Z interconnect fabric
-
ARM64 architecture and hardware
-
PCI native host controllers/endpoints drivers current challenges and improvements (e.g., state of PCI quirks, etc.)
-
PCI error handling and management e.g., Advanced Error Reporting (AER), Downstream Port Containment (DPC), ACPI Platform Error Interface (APEI) and Error Disconnect Recover (EDR)
-
Power management and devices supporting Active-state Power Management (ASPM)
-
Resources claiming/assignment consolidation
-
Prefetchable vs non-prefetchable BAR address mappings
-
Thunderbolt, DMA, RDMA and USB4 security
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "VFIO/IOMMU/PCI MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Alex Williamson <alex.williamson@redhat.com>
- Bjorn Helgaas <bjorn@helgaas.com>
- Joerg Roedel <joro@8bytes.org>
- Krzysztof Wilczyński <kw@linux.com>
- Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
VFIO/IOMMU/PCI MC Summary
-
Page-Based Hardware Attributes (PBHA) on arm64 [video][slides]
- Will Deacon introduced PBHA as a topic that is closely related to the ARM architecture but not necessarily specific to VFIO/IOMMU/PCI. Through his work on the Android project and its ecosystem, Will deals with many different vendors and a wide variety of SoCs, and PBHA is an ARM platform feature that is very likely to become widely supported. Will would like to see Linux support PBHA officially in the future so that the kernel can benefit from what it has to offer.
- Will then explained what PBHA is, how it works (detailing the elements of the ARM architecture involved), what problems it aims to solve, and why he is reaching out to the wider kernel development community about it, especially since ARM leaves a lot of implementation details to the vendor. Will then proceeded to provide a few examples of how PBHA might be used, with a specific example of how PBHA can be used together with the System Level Cache (SLC) on an ARM platform, where possible use-cases would include:
- Performance and prefetch hints
- Caching policy
- QoS (high/low priority operations)
- Define cache data format (potentially error-prone)
- Will, following his introduction to the challenges provided by PBHA, asked the audience the question of how Linux could support it, especially since some of the parts of the PBHA functionality are vaguely defined by ARM architecture, what sort of plumbing would be needed in the kernel, etc.
- Jason Gunthorpe mentioned that the implementation of PBHA reminds him of how Intel supports the TPH bits in PCIe, especially when it comes to the vendor-defined implementation details that are also present in the TPH specification. Jason then pointed out to Will Deacon that there are, of course, crucial differences in the implementation: for example, TPH works on a per-transaction basis whereas PBHA works on a per-page basis. Jason and Will Deacon then discussed some similar use cases between Intel and ARM architectures.
- Hur Hussein asked whether it would be possible to use the PBHA bits for memory/system partitioning - assign parts of the memory to specific processors, for example, to implement a multi-kernel?
- Catalin Marinas joined the conversation between Jason Gunthorpe and Will Deacon providing more specific details surrounding DMA APIs and potential issues related to DMA coherency.
- Jason Gunthorpe asked about NUMA support and PBHA to which Will Deacon responded that he would need to look into it since Android devices do not support NUMA like some of the large systems do, especially since PBHA is not only limited to Android.
- Marc Zyngier then talked with Will Deacon about how to define what the PBHA bits were meant for, given that this information is often baked into the underlying hardware outside of software (and firmware) control.
- Arnd Bergmann suggested that perhaps details about what these bits do could be added to the Device Tree (DT) so that they could be advertised to the kernel, and Will agreed with Arnd. Marc Zyngier added that we would need to find a way to associate a semantic to them (these PBHA bits) which would be cumbersome. Marc and Will Deacon then agreed that naming them sensibly would be problematic. Marc followed up highlighting potential security aspects of PBHA concerning the interaction between host systems and guests.
- Peter Zijlstra asked if Will Deacon has some examples of an actual SoC that implements the PBHA bits, and if we know what they do there. Will then followed up explaining that the SLC example is based on a very similar existing implementation.
- Guo Ren joined the conversation mentioning that the PBHA is like the Bufferable bit in ARM32 architecture which is also pass-through to the interconnect.
- Will Deacon then asked to get in touch with him if someone is interested in the topic of bringing PBHA support to Linux, possible use-cases, etc.
-
PCI Data Object Exchange (DOE), Component Measurement and Authentication (CMA) / SPDM 1.1 - Mediating access and related issues [video][slides]
- Jonathan Cameron opened with an introduction describing DOE, which is a requirement for CXL 2.0 (to access a device-provided table called CDAT, which is similar to an ACPI table). He explained that CMA will most likely start to appear in devices very soon, as will PCIe/CXL IDE (a form of link encryption). Jonathan described the components as a sort of stack with DOE at the top, followed by CDAT (CXL Table Access) and CMA (Component Measurement and Authentication), then SPDM (Security Protocol and Data Model) and IDE (Integrity and Data Encryption), where DOE is a transport for CDAT/CMA, which in turn is a protocol transport for SPDM and IDE, etc.
- Jonathan then explained what DOE is and how it works, mentioning that DOE is defined by PCI-SIG ECN as a mailbox-based interface that falls within the Extended Capabilities of PCI Configuration space.
- Jonathan then added that the discovery protocol is crucial to DOE: a single DOE can carry multiple protocols (e.g., CMA, IDE, CXL Table Access), and there can be multiple mailboxes (multiple DOE instances) per PCI function, so there has to be a mechanism for discovering what a given mailbox is for.
- Jonathan then explained that the discovery protocol currently poses a problem, as it can affect other operations in flight and even read someone else's data.
- Jonathan added that vendors can choose what the protocols are, and each protocol is labelled by a Vendor ID and a vendor-controlled Data Object Type.
- Jonathan then continued, explaining that DOE can be used by multiple different services such as the kernel, userspace, hypervisor, TEE (Trusted Execution Environment), firmware, etc., and that there are no built-in facilities to ensure safe concurrent access to DOE: there is no hardware-level mediation and no exclusive access flag.
- Jonathan then expanded on the discovery protocol problem mentioned before where an attempt to use the discovery protocol from one component to ascertain what is supported can break an ongoing communication between some other components, for instance, it could lead to broken/corrupted IDE communication leading to data corruption.
- Lorenzo Pieralisi joined the conversation to clarify that this problem is because the mailbox-based protocol specification only supports a single query-response mechanism at a time.
- Jonathan Cameron agreed with Lorenzo Pieralisi and moved his focus onto the userspace access describing some of the challenges related to accessing DOE from userspace especially given that the DOE protocols can and most definitely will be defined by vendors.
- Mark Rutland added that we probably want to enforce kernel-side mediation (at least as the default), to give us the freedom to do so in future rather than userspace foreclosing that possibility.
- Jonathan, continuing, talked about how we could reliably expose access to DOE from userspace, adding that this would require kernel-userspace access mediation, and also mentioned that there would not be a sensible obvious way to harden kernel against broken userspace implementations.
- Jonathan, following on the subject of userspace access, proceeded to offer a few suggestions concerning DOE access from userspace:
- Kernel implements every known protocol with an appropriate userspace level access interface.
- Kernel provides generic access for some protocols.
- Kernel could bind to a given DOE and claim it. How to manage user access in this case?
- Ashok Raj asked whether DOE has just one outstanding message at a time, or whether it can take multiple commands and responses.
- Jonathan Cameron responded to Ashok Raj explaining that DOE is only allowed to have a single query-response in flight per protocol at a time, but if you implement multiple protocols, then each can have its dedicated query-response in flight. However, this is currently not supported in the kernel.
- Jason Gunthorpe asked if there is any chance to go back and fix the spec to be more implementable?
- Jonathan Cameron responded to Jason Gunthorpe that a lot of the standard draft is not yet officially published, as many vendors are working on it behind closed doors at the moment, and some vendors are already building hardware according to the already issued specification, so there is no way to influence the design of DOE at the moment. Perhaps this would change for future versions of the specification. Jonathan mentioned that the specification designers would welcome more involvement from the kernel community next time.
- Jonathan, moving on, briefly discussed another challenge from the kernel's point of view: the TEE and firmware can use DOE directly between one another, bypassing the kernel. Hot-plug functionality also remains an unsolved problem at this point. Jonathan then mentioned that a proposal for solving the problems with DOE kernel access might include:
- System-wide _DSM (might not always work, and are also vendor-specific)
- Dedicated ACPI table with a buffer containing current state details (a non-trivial specification change)
- Jason Gunthorpe asked if Linux could publish a generic protocol that it could relay to userspace, and demand that vendors implement it.
- Jonathan addressed a question from Jason Gunthorpe and others from the audience about whether vendor-specific protocols could be put on a different mailbox than, for example, the discovery protocol: in theory, yes, but in practice no. Jonathan added that this could be solved by the kernel mediating access from userspace depending on what the operation would be.
- Ashok Raj added that he likes the idea of vendor-specific protocols living in a separate mailbox, so that the kernel doesn't need to worry about them.
- Dan Williams joined the conversation and added more details about the ACPI _DSM based mechanism mentioned previously by Jonathan Cameron.
- Jonathan Cameron suggested, in response, that perhaps vendors should be asked to publish their protocols in advance so that Linux would implement a given specification to be safely used later?
- Dan Williams added that this might help stop vendors from inventing the same thing separately or running wild with unsupervised implementations.
- Lorenzo Pieralisi joined the conversation mentioning that the SPDM is a state machine and thus mediating access from the userspace would require keeping track of the current state.
- Jonathan Cameron discussed issues of SPDM together with Lorenzo Pieralisi and then moved into SPDM itself, explaining that it defines standard messages and a protocol for things such as device or mutual authentication, establishing secure channels, integrity measurement (for example, for firmware verification), key management, etc.
- Jonathan then moved on to CMA describing it as defined in the PCI-SIG CMA ECN specification and using DOE protocol to carry SPDM messages, explaining that there is a required subset of SPDM that must be supported for CMA.
- Jonathan, continuing, provided example use cases and where DOE will generally be useful for things such as protecting access through external Thunderbolt (preventing attacks such as Thunderclap), runtime hardware state changes verification (hotplug, resets, OOB firmware updates), virtual function (VF) can have its own CMA DOE instance (for example, to verify underlying trusted hardware), verification of fully emulated devices (virtio) to ensure emulated devices identities, etc.
- Jonathan also mentioned that security policies for DOE-capable systems are not yet refined.
- Jonathan then added that, sadly, there is no way to influence the current version of the specification, so all of its shortcomings will have to be dealt with in the software implementation. Perhaps future drafts will be more influenced by feedback from potential users and kernel developers. Currently, the specification is very generic and leaves a lot of ambiguity.
- Participants agreed to take the conversation on the mailing list and/or off-line due to the complex nature of DOE and to discuss possible implementation details.
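The need for mediation follows from the single query-response rule discussed above. The toy model below shows a mailbox where a second request clobbers an unread response, and a mediator that serializes whole exchanges with a lock. All class and method names are hypothetical; this is not kernel code and it does not model the actual DOE register layout or the SPDM state machine.

```python
import threading


class Mailbox:
    """Toy mailbox: a new request overwrites any response not yet read."""

    def __init__(self):
        self._response = None

    def write_request(self, protocol, payload):
        # The "device" answers immediately by echoing the payload in upper case.
        self._response = (protocol, payload.upper())

    def read_response(self):
        response, self._response = self._response, None
        return response


class MediatedMailbox:
    """Serializes complete query-response exchanges so they cannot interleave."""

    def __init__(self, mailbox):
        self._mailbox = mailbox
        self._lock = threading.Lock()

    def exchange(self, protocol, payload):
        with self._lock:
            self._mailbox.write_request(protocol, payload)
            return self._mailbox.read_response()


# Without mediation, a discovery query issued mid-exchange clobbers the response.
raw = Mailbox()
raw.write_request("cma", "challenge")
raw.write_request("discovery", "list")   # a second user barges in
clobbered = raw.read_response()          # the CMA user now reads discovery data

# With mediation, each caller gets the answer to its own query.
mediated = MediatedMailbox(Mailbox())
results = {}


def worker(protocol, payload):
    results[protocol] = mediated.exchange(protocol, payload)


threads = [threading.Thread(target=worker, args=(p, d))
           for p, d in [("cma", "challenge"), ("discovery", "list")]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A lock like this only helps when every user goes through the same mediator, which is why direct DOE use by the TEE or firmware was raised as a separate, unsolved problem.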
-
Shared Virtual Addressing (SVA) for in-kernel users [video][slides]
- Jacob Pan began by explaining that his topic is not a new feature, but rather an effort to update the current implementation/codebase so that support can be added for making DMA requests with PASID and for SVA (which concerns sharing Page Tables between the CPU and IOMMU, which in turn implies DMA).
- Jacob explained the implications of a system without PASID support, that is, DMA without SVA: the DMA address and the CPU virtual address go through almost symmetrical paths in terms of MMU and IOMMU, and CPU TLBs and IOTLBs. Jacob continued to explain that when there is no PASID support, everything is almost duplicated (in terms of functionality and most likely code) and there is no overlap, which forces the device to only use Physical Address based DMA.
- Jacob then explained the difference with a system that has PASID support, that is, DMA with SVA: the MMU and IOMMU and the TLBs (CPU and IOTLB), even though still separate, now share a CPU Page Table (and therefore the mappings), allowing both the CPU and the device to use Virtual Address based DMA (both in user and kernel mode).
- Jacob also highlighted several issues with the current implementation, especially mentioning that it currently insecurely exposes entire kernel mapping without any restriction, and also does not have any means to perform proper IOTLB synchronisation and bypasses current DMA APIs.
- Jacob then moved to propose and explain some potential solutions:
- Physical address bypass mode. This is similar to direct DMA for trusted devices, where such devices can bypass the IOMMU for DMA on a per-PASID basis.
- IOVA mode that is DMA API compatible. Maps a supervisor PASID the same way as the PCI Requester ID (RID) does at the moment.
- A new KVA mode, which introduces new map/unmap APIs, supports strict and fast sub-modes transparently based on whether a device is trusted or not.
- David Woodhouse added that we did have patches once which made the DMA API return 1:1 addresses, with a really trivial "allocator" which only had to do real work when there was a second mapping of a given address. The first mapping just got the same HPA back as the IOVA. It made the first mapping of a given HPA basically lock-free.
- Jason Gunthorpe questioned the portability of the new KVA APIs, since they bypass existing DMA portability features, and questioned how KVA differs from bypass mode, as both (like the performance mode) seem to offer no security. Jason also questioned why a device driver would need or even care about controlling the KVA, trust, etc.
- Jason then suggested that a device driver needs to only care about DMA APIs when running in DMA mode.
- Thomas Gleixner pointed out that this new set of APIs introduces layering violations.
- Jason Gunthorpe added that there is no performance hit when using DMA properly and there is no performance gain when using KVA, especially when DMA runs in the direct mode, which is fast.
- Will Deacon asked about the difference between the KVA and the bypass. What benefits does KVA offer over bypass? And added that somebody mentioned "fine-grained permissions", but the linear map is mostly read/write (R+W) isn't it? Also guessing that execute permissions aren't a huge concern for devices?
- Ben Serebrin asked what's the advantage of using PASID in this case (or any case)?
- Jason Gunthorpe responded to Ben Serebrin stating that some hardware requires PASID to do DMA at all, unfortunately. To which Ben followed up asking which hardware would that be? Also, asking if that's the only reason, and why not use a single PASID and treat it like a normal BDF? And Jason responded adding that it's Intel IDXD.
- Ashok Raj suggested that KVA would allow for devices such as the accelerator to work with the same address without any need to convert the pointer to HPA.
- Jason Gunthorpe continued with the notion that there is no need for an additional API that looks and works the same way as the DMA API, especially when DMA is used properly.
- Participants agreed to continue this very passionate conversation on the mailing list, since the use-case for the KVA is currently not entirely clear, especially the performance benefits.
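The 1:1 allocator that David Woodhouse described can be sketched as follows. This is only an illustration of the idea as summarized above, not the patches he referred to: the class name, the 4 KiB page size and the separate fallback IOVA range are assumptions, and the model is single-threaded, so it says nothing about the lock-free fast path.

```python
PAGE_SIZE = 0x1000  # assumed page size


class IdentityFirstAllocator:
    """First mapping of a page returns IOVA == HPA; later mappings get a fresh IOVA."""

    def __init__(self, fallback_base):
        self._identity_in_use = set()
        # Hypothetical range reserved for non-identity IOVAs; assumed not to
        # overlap with any host physical address handed to map().
        self._next_fallback = fallback_base
        self._iova_to_hpa = {}

    def map(self, hpa):
        if hpa not in self._identity_in_use:
            # Fast path: no allocation work, the IOVA is simply the HPA.
            self._identity_in_use.add(hpa)
            iova = hpa
        else:
            # Slow path: only a second mapping of the same page does real work.
            iova = self._next_fallback
            self._next_fallback += PAGE_SIZE
        self._iova_to_hpa[iova] = hpa
        return iova

    def unmap(self, iova):
        hpa = self._iova_to_hpa.pop(iova)
        if iova == hpa:
            self._identity_in_use.discard(hpa)
        return hpa


allocator = IdentityFirstAllocator(fallback_base=1 << 40)
first = allocator.map(0x2000)    # identity mapping
second = allocator.map(0x2000)   # same page again, so a fallback IOVA is used
```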
-
Status of Dynamic MSIx and IMS opens [video][slides]
- Ashok Raj started by explaining the current challenges in how device drivers handle MSI and MSI-X. One deficiency, for example, is that device drivers have to allocate everything they need (in terms of IRQ handling) in advance at probe time, after which nothing can be dynamically added or removed.
- Ashok, continuing, explained how legacy MSI functions and the limitations it imposes, contrasting it with how MSI-X and IMS are defined. Ashok also briefly mentioned the impact the current implementation has on VFIO and on handling IRQs in guests.
- Thomas Gleixner then asked Ashok Raj and Megha Dey if the new patches would address the current deficiencies he was just talking about? To which Ashok replied that the proposal is concerning work done by Megha that aims to enable the ability to add/remove IRQ vectors dynamically after the probe phase, and thus wouldn't necessarily (at least not yet) resolve all the outstanding issues related to IRQ handling.
- Megha Dey then moved to explain the new API she aims to introduce and explained changes proposed to the MSI-X allocation which would enable IRQ vector allocations after the probe time without negatively impacting the existing legacy drivers using currently supported APIs.
- Megha then asked if there are any comments on the proposed new APIs and added that patches are already available for review and will soon be ready to be submitted to the upstream kernel.
- Ashok Raj then suggested that the IMS core could perform the PASID programming if the device driver supports it, asking Jason Gunthorpe for comment.
- Jason Gunthorpe, in response to Ashok Raj, argued that the format of an interrupt message should not be dependent on the core as each device might and most likely will provide its own implementation depending on how IRQ is implemented in the hardware.
- Thomas Gleixner pointed out that IMS programming PASID as proposed by Ashok Raj is not a generic enough solution when both the guest and host (using translated PASID) use PASID, and that we should avoid device specific hacks.
- The participants discussed the possibility of adding a hypercall to solve the issues of programming guest PASID; however, Thomas Gleixner provided multiple examples where hypercalls would not work or not be needed at all. In response to Thomas Gleixner's comment, Jason Gunthorpe pointed out that Hyper-V currently uses a similar hypercall to program PCIe MSI-X, but people were murky on the details as it was a long time ago.
- Ashok Raj suggested that this would need to be redesigned based on the feedback from the session and a new e-mail thread on the mailing list will be started to track the progress and for the discussion to follow.
-
Unified I/O page table management for passthrough devices, in-kernel API discussion between IOMMU core and /dev/iommu [video][slides]
- Kevin Tian began by explaining what is currently supported in the Linux kernel, briefly explaining the two existing pass-through frameworks: VFIO and vDPA. Kevin explained in depth their MMU logic and Page Table management, and how the current implementation duplicates a lot of the same logic between different frameworks. He explained why this does not scale well, especially in the future, as adding support for more features like PASID/SSID, User-managed Page Tables, I/O Page Fault, IOMMU Dirty Bit, etc., would result in complex logic that most likely would be maintained in more than one place.
- Kevin then proposed to develop a new Unified Framework to handle IOMMU and to centrally manage I/O Page Tables for pass-through devices. Kevin then explained how this new framework would fit within the existing ecosystem and how any new feature added would be then shared with VFIO and vDPA through this common framework.
- Kevin, continuing, explained how the new framework would also propose addition of a /dev/iommu device node as a new interface (also referred to as "iommufd") to facilitate userspace access (uAPI) and to manage the user-initiated DMAs.
- Kevin also mentioned that a first version of the work was sent already as a series of patches.
- Jason Gunthorpe mentioned that this new framework is shaping up to be a future fully-fledged sub-system (not unlike the existing DRM or RDMA sub-systems) that will in the future allow userspace to talk directly to hardware very efficiently without heavy kernel involvement. Jason also added that it is aimed to provide a very generic interface offering the same set of features across all supported architectures and hardware.
- Jason then, while discussing the possible implementation details, suggested that VFIO, for example, would become a shim in front of the new framework, to avoid duplicating code and effort.
- Kevin Tian added that he is actively looking for community collaboration so that the efforts can move a lot quicker and work can be done on different parts of the new framework in parallel. Kevin Tian then asked interested people to reach out on the mailing list.
- Baolu Lu pointed out that in future implementations the "iommufd" interface could copy what VFIO does today, but with the addition of management of multiple security contexts due to the bind/attach semantics being decoupled in the new design.
- A brief discussion took place about how an IOMMU is currently attached to a device and a group, and how this would work going forward. Jörg Rödel, Jason Gunthorpe and Baolu Lu talked about IOMMU groups, domains, security zones/areas, locking (how to approach "refcount") and how a group maps to a device, especially when the IOMMU/hardware cannot differentiate between, or match up, a device and a group. Jörg then explained how PCI comes into the picture, especially with sub-functions of a PCIe device, and how IOMMU groups would have to work with PCI going forward.
- Baolu Lu asked if "iommufd" should be a separate module and whether it should be called "iommufd" or "uiommu". Jörg Rödel and Jason Gunthorpe suggested that it would be a good idea to keep "iommufd" as a module, since all the users are modules today and Kconfig and other mechanisms would handle loading the required dependencies just fine. Everyone seemed to agree that the name "iommufd" would be fine since the kernel has a long-standing history of naming things something-fd.
-
Brain storm some of the features support in Linux for PCIe [video][slides]
- Ashok Raj brought up several open issues in the PCI sub-system that he would like to discuss with the wider audience:
- MPS handling for hot-plug devices
- Extended Tagging
- TC/VC Mapping
- Flattening Portal Bridge (FPB)
- Resizable BARs
- Ashok asked the audience how to handle MPS (and MRRS) for hot-plug devices, specifically devices that cannot support the maximum MPS that was already selected before the device was inserted. Bjorn Helgaas responded that this indeed needs work. Ashok then suggested some possible solutions, and Bjorn agreed that enumerating the hierarchy of devices to reconfigure them to reflect new MPS/MRRS values (similarly to how the BAR reassignment patch from a year ago currently handles this) might be a feasible approach. Ashok added that when devices have incompatible settings we could perhaps refuse to load and bind a driver for such devices, letting the user know that some action has to be taken and perhaps even suggesting a system restart. Ashok then mentioned that he had already begun working on patches for MPS handling.
- Ashok Raj then asked about how to work with extended/enhanced tagging such as the 10-Bit Tagging (aside from existing 8-Bit Tagging support). Bjorn Helgaas mentioned that there is already a series of patches adding 10-Bit Tagging support that has been sent to the mailing list for review. Bjorn then added that this is a new piece of functionality that is aimed to be added very soon and would need to be investigated a little bit more especially if there are concerns about how to handle 10-Bit Tagging going forward.
- Ashok, continuing, asked about TC/VC Mapping, but the question did not get much of a response, as this feature does not seem to be widely used and people had nothing to suggest. Bjorn Helgaas wasn't sure what TC/VC is and how it is currently being used, so he did not make a recommendation on how to offer this feature in Linux.
- Ashok then moved on to the Flattening Portal Bridge (FPB), which is a new feature supported in the upcoming Tiger Lake platform family from Intel, and asked how the Linux kernel could support it, even though at the moment there hasn't been great demand for it. Bjorn Helgaas mentioned that he hasn't seen any patches or requests for this functionality to be supported by Linux for the time being; however, he also mentioned that resource rebalancing in Linux is not very well explored at the moment, and FPB might be one of the potential solutions for that. Jonathan Cameron brought up a potential issue of handling configuration changes on the OS side where things have previously been set in the BIOS. There was no consensus on what to do with FPB going forward.
- Ashok Raj then briefly discussed Resizable BARs, especially for graphics devices, and explained the issue of handling resizing of BARs based on the values originating from the hardware and the device drivers. The conversation that followed concluded that the way things work currently is acceptable and there is nothing to address at the moment, especially since graphics drivers, and primarily the DRM sub-system, handle resizable BARs by themselves (by requesting different sizes as needed) on a device-specific basis. It was also pointed out during the discussion that devices often claim to support very large BARs while drivers often allocate only a fraction of what is needed (knowing best what to do), so allocating large BARs in advance would be both wasteful and problematic. However, as Ashok Raj concluded, supporting the ability in the PCI sub-system to resize BARs when the BIOS does not support it would perhaps be a desirable feature.
- Ashok, slowly wrapping up his presentation, asked if there is anything else we ought to fix in the PCI sub-system, and Jonathan Cameron mentioned that it would be beneficial for kernel developers to get involved with the PCI standard specification draft work a little earlier in the process before the actual specification is finally published (by which time it's often too late to influence changes for the better). Ashok pointed out that a challenge with getting involved with the PCI-SIG group is that specification development is often not open (happening behind closed doors) and decisions are kept private before the final draft is published which can be problematic. No solution has been put forward at this time.
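The MPS hot-plug question discussed above can be illustrated with a small sketch. It assumes only what the PCIe specification defines (Max_Payload_Size values are powers of two from 128 to 4096 bytes); the type and function names are hypothetical and do not correspond to any kernel code or to Ashok's patches.

```rust
/// PCIe defines Max_Payload_Size values as powers of two from 128 to 4096 bytes.
fn is_valid_mps(bytes: u32) -> bool {
    (128..=4096).contains(&bytes) && bytes.is_power_of_two()
}

/// Hypothetical outcome of comparing a hot-added device with the running hierarchy.
#[derive(Debug, PartialEq)]
enum HotplugDecision {
    /// The device supports the MPS already in use, so it can be programmed with it.
    UseCurrent(u32),
    /// The device supports less than the MPS in use: the hierarchy would have to be
    /// reconfigured to `required`, or the device refused.
    NeedsReconfigure { current: u32, required: u32 },
}

/// Compare the MPS selected before insertion with what the new device supports.
/// Returns None if either value is not a legal MPS.
fn decide(current_mps: u32, device_max_mps: u32) -> Option<HotplugDecision> {
    if !is_valid_mps(current_mps) || !is_valid_mps(device_max_mps) {
        return None;
    }
    if device_max_mps >= current_mps {
        Some(HotplugDecision::UseCurrent(current_mps))
    } else {
        Some(HotplugDecision::NeedsReconfigure {
            current: current_mps,
            required: device_max_mps,
        })
    }
}
```

For example, `decide(512, 128)` reports that a hierarchy running at 512 bytes would need to drop to 128 bytes before the new device could be used safely, which is the case the discussion proposed handling by re-enumerating the hierarchy or refusing to bind a driver.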
Open Printing MC
CFP Ends: Sept 10, 2021
The Open Printing microconference focuses on improving and modernizing the way we print in Linux.
Suggested topics:
- Changes in CUPS 2.4.x
- Print sharing changes for mobile
- OAuth support to replace Kerberos
- Printer drivers replaced with Printer Applications
- TLS/X.509 changes
- CUPS in containers
- CUPS 3.0
- Future CUPS development
- Identify support platforms
- Key printing system components
- Discuss integration with Printer Applications and application stores like Snap Store
- Print Management GUI
- Migrating from working with CUPS queues to IPP services
- Handling legacy devices that do not handle IPP services
- Common Print Dialog Backends
- CPDB, CUPS backend.
- Separating GUI toolkits and the print technology support to be independent from each other.
- Printer/Scanner Driver Design and Development
- Design, creation and Snap-packaging of Printer/Scanner Applications.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Open Printing MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Aveek Basu <basu.aveek@gmail.com>
- Till Kamppeter <till.kamppeter@gmail.com>
- Michael Sweet <msweet+lpc@msweet.org>
- Ira McDonald <blueroofmusic@gmail.com>
RISC-V MC
CFP Ends: Sept 10, 2021
The RISC-V microconference focuses on the development of RISC-V.
Suggested topics:
- Platform specification progress, including SBI-0.3 and the future plans for SBI-0.4. There has been significant progress on the platform specifications, including a server profile, that needs discussion.
- Privileged specification progress, possible 1.12 (which is a work in progress at the foundation).
- Support for the V and B specifications, along with questions about the drafts. The V extension is of particular interest, as there are implementations of the draft extensions that are likely to be incompatible with what will eventually be ratified, so we need to discuss what exactly user ABI compatibility means.
- H extension / KVM discussion, which is probably part of the drafts. The KVM port has been hung up on the H extension ratification process, which is unlikely to proceed any time soon. We should discuss other options for a KVM port that avoid waiting for the H extension.
- Support for the batch of SOCs currently landing (JH7100, D1)
- Support for non-coherent systems
- How to handle compliance.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "RISC-V MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Palmer Dabbelt <palmer@dabbelt.com>
- Atish Patra <atish.patra@wdc.com>
Kernel Dependability and Assurance MC
CFP Ends: Sept 10, 2021
The Kernel Dependability and Assurance Microconference focuses on infrastructure to be able to assure software quality and that the Linux kernel is dependable in applications that require predictability and trust.
Suggested topics:
- Identify missing features that will provide assurance in safety critical systems.
- Which test coverage infrastructures are most effective to provide evidence for kernel quality assurance? How should it be measured?
- Explore ways to improve testing framework and tests in the kernel with a specific goal to increase traceability and code coverage.
- Regression Testing for safety: Prioritize configurations and tests critical and important for quality and dependability
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Kernel Dependability and Assurance MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Shuah Khan <skhan@linuxfoundation.org>
- Gabriele Paoloni <paoloni.gabriele@gmail.com>
Kernel Dependability MC Summary
The Kernel Dependability and Assurance Microconference focuses on infrastructure to be able to assure software quality and that the Linux kernel is dependable in applications that require predictability and trust.
-
Runtime redundancy and monitoring for critical subsystem/components [video] [slides]
-
ASIL decomposition: Separate out the simpler component from the complex component by breaking the top level requirement into multiple simpler requirements. ASILB Hypervisor separates out the QM(B) Linux kernel from the ASILB Safe OS. But this is expensive.
-
One way to accomplish ASIL decomposition is separating out the subsystems that have these requirements. The monitor and the element must be separate. Things to consider are:
-
Can we use runtime verification on the monitor?
-
How do we separate the monitor and the element as they are running in the same address space?
-
-
The monitor is driven by events (via trace events?). It keeps variables to save state in read-only mode. It must keep the element from corrupting the monitor variables. One way to accomplish this goal is to make it all read-only, but have the fault handler on write check whether the write is from the monitor and only then perform the write.
-
What about triple faults?
-
The hypervisor will detect a kernel crash and triple fault concern is covered by this.
-
-
Could the monitor be used for security purposes too?
-
Perhaps keep the element's address space from corrupting the kernel. But still need to protect against the kernel itself corrupting the monitor. The monitor must be verified.
-
Outcome: Work on a few ideas to design a monitor, implement a prototype, and get feedback.
-
-
Traceability and code coverage: what we have in Linux and how it contributes to safety [video] [slides]
-
Overview of RH Automotive initiative
-
OS distribution based on RHEL to support automotive safety critical applications
-
Overview of CKI and Kernel CI
-
CKI is RH public facing CI and member of the Kernel CI
-
Importance of code coverage in ISO26262:
-
Coverage is needed to evaluate the completeness of testing, and it can also be used to verify correct traceability from top-level requirements to design elements of the kernel.
-
-
Code Coverage Analysis
-
fetches a pre-built RHEL gcov kernel
-
uses lcov to generate reports
-
multiple reports can be combined to have an overall view
-
Code Coverage Example in the slide set.
-
-
Targeted testing: a tool called KPET triggers tests automatically as soon as a patch is submitted, to verify whether it is covered.
-
Possible improvement for CKI:
-
gating patches that are missing code coverage;
-
Integrating code coverage analysis into the pipeline
-
-
Discussion
-
Who is using code coverage? Is it useful?
-
What is the % coverage target to aim for? Is 90% achievable?
-
How often should we generate coverage reports? Doing it for every patch is expensive.
-
What is the alternative to code coverage?
-
Participants liked the idea of gating patches that are missing targeted tests. Currently this is done manually; the process will be automated by integrating code coverage into the pipeline.
-
What code is covered by customer workloads? gcov has performance impacts that make it inappropriate for customer-deployed kernels. It is hard to get a report if a test misbehaves and stops the test compile. Looking at different parts of the kernel, we are still a long way from making gating patches a feasible feature, but it is a good long-term goal. We need maintainers' commitment to make this happen. There is no single correct coverage target. Nobody knew of any project achieving a 90% target. In most code bases a target between 70% and 80% is considered achievable. Tackling it with maintainers, subsystem by subsystem, is a reasonable way to approach this.
-
Coverage is important to make sure that we have tested all the meaningful states of the system. In automotive, code coverage analysis is used at subsystem level to evaluate the coverage trend and make sure it is positive. Once we have a model for the subsystem, we can identify the states we fail to trigger. RV monitors can be used to understand which path is being taken; the two are correlated. It is important to come up with a strategy to share with maintainers what their coverage looks like in concise and easy-to-read reports. The challenge is to have the requirements that motivate new patches systematically recorded as well, and the appropriate testing tracked. It would be good if requirements were written in a way that helps to derive test cases.
-
Coverage reports are available from syzkaller: https://syzkaller.appspot.com/upstream
-
Coverage reports are very useful for assessing how good the testing is, what's missing, etc. They are the only way to ensure that what developers think is covered is really covered.
-
Outcome: Agreement that having code coverage would be useful, challenge is now how to make it practical and accessible to maintainers. Goal would be to evolve to coverage on patches matched back to requirements.
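The idea of gating a patch on its coverage can be made concrete with a minimal sketch. This is not how CKI or KPET work; it assumes, purely for illustration, that a patch is reduced to a list of changed line numbers in one file and that the coverage report is reduced to a set of executed line numbers. The function names and the threshold parameter are hypothetical.

```rust
use std::collections::HashSet;

/// Fraction of the lines touched by a patch that were executed by the tests.
/// Returns None for an empty patch, where the ratio is undefined.
fn patch_coverage(changed: &[u32], covered: &HashSet<u32>) -> Option<f64> {
    if changed.is_empty() {
        return None;
    }
    let hit = changed.iter().filter(|line| covered.contains(*line)).count();
    Some(hit as f64 / changed.len() as f64)
}

/// Gate decision: accept the patch only if enough of its lines are covered.
/// A patch that changes no lines has nothing to cover and passes.
fn passes_gate(changed: &[u32], covered: &HashSet<u32>, threshold: f64) -> bool {
    match patch_coverage(changed, covered) {
        Some(ratio) => ratio >= threshold,
        None => true,
    }
}
```

With lines 10 to 12 covered, a patch touching lines 10, 11, 20 and 21 has a coverage of 0.5 and would be held back by a 0.7 threshold. The discussion above suggests that any real threshold would have to be agreed with maintainers subsystem by subsystem.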
-
-
Adding kernel-specific test coverage to GCC's -fanalyzer option [video] [slides]
-
fanalyzer is a compiler option to enable static analysis inside the compiler. fanalyzer is not complete nor fully correct; it shows false negatives as well as false positives
-
How to extend the analyzer: detect information leaks (uninitialized kernel memory copied to user space) and untrusted data used by a trusted kernel service. Where should the code to enable this live? Right now David is using his own branch with a few kernel hacks.
-
Across trust boundaries: the suggestion was to use __user (the annotation on the function that does the copy across trust boundaries). David can reinvestigate whether GCC provides an attribute space that the analyzer can pick up on. David also invented the attribute tainted to mark a risk of untrusted data coming in as input to a function. The preference in the kernel is untrusted over tainted, used via a define.
-
Kernel-specific parts are kept within Red Hat until a responsible disclosure path is figured out. Without LTO it is hard to check the whole call tree where data can be tainted. False positives can be quieted by turning off checkers, fixing the false positive, and then turning them back on. This should probably be a config option, so time is not necessarily an issue. Default to off. Sparse is running all the time. There was a request to share the tools so we can be one step ahead of black hats. The best way to integrate: send it to the kernel community to figure out integration.
Outcome: Like to see this in the kernel and available.
-
-
A bug is NOT a bug is NOT a bug [video] [slides]
Differences in bug classes, bug tracking and bug impact. Explore what is a bug, how to track a bug and what is the impact of a bug?
-
What is a bug: a spelling mistake? a compiler warning? race conditions?
-
every commit that doesn't add a feature is a bug-fix
-
-
Exploring which tools are available to: detect bugs, track bugs, and report bugs upstream
-
Classifying different classes of bugs: identified by humans, automated tools, fuzzing, bug reported by humans, reported by bugzilla, and bugs reported by automated testing systems:
-
how are these bugs tracked? how are the fixes linked to the failing tests
-
-
Compiler warning bug report: is the parsing precise? How does the reporting work?
-
Bug report identified by fuzzing: uses its own bug tracking tool
Discussion
Can we find a way to consistently report bugs without all these tools spreading around? Assuming we find a tool to report all the bug classes, how do we present meaningful information to the maintainers?
How can we build something that is able to report bugs fast enough?
-
If a bug gets to linux-next it is still fine as long as it does not land upstream; however, a bug implies rework, which would mean another patch on top of linux-next. Not all spelling mistakes are a security threat. However, typo squatting is a vulnerability, so they could be considered bugs. It depends where the spelling mistake is: in comments, no; in code, it could be. It would be good if we could get to a point where we reject features that break kselftest. A final test by Linus involves a compile test backed by a set of CIs that we can point to, showing that there are no bugs. Doing sanity checks is a good idea as it can spot bugs, even though it doesn't guarantee that there are no integration issues in linux-next. Every maintainer has their own test suite. It isn’t possible to test on all supported hardware: CI rings don’t or can’t host all supported systems and hardware, and it isn’t practical to expect them to. It would be nice to have a checklist to be met by contributors before sending a patch.
-
How do you know which tests to run for a subsystem? Should the maintainer re-run the tests identified for the patch? How do you prove that the tests were run?
-
Threshold for spelling mistakes: Compiler warnings are considered before pulling into Linux-next; spelling mistakes get pulled into linux
Outcome: No clear outcome from this discussion.
-
- Kernel cgroups and namespaces: can they contribute to freedom from interference claims? [video] [slides]
FFI is the lack of cascading failures from one item to another item in the system. Dependent failure analysis as well as fault tree analysis can be very valuable to support FFI claims. Containers are enabled by namespaces and cgroups.
How to mitigate time interference due to storage bandwidth being eaten up? For instance, Intel has CAT technology, but it is never used to prevent this type of interference. From a QE point of view we can develop tests to verify this. How can you isolate the CPU resource itself? The feeling was that you cannot isolate CPUs using cgroups: cgroups allow pinning, but they do not make CPUs invisible. Isolating CPUs can provide the right visibility of available CPUs to applications running on top, or deny a view of resource allocation to prevent security attacks.
Discussion
-
Do we have enough namespaces or cgroups controllers?
-
Controller subsystem negotiation (granularity ok)?
-
Cgroup-v2 not yet controlling RT process, all must be in root cgroup for cpu controller to be enabled...updates?
-
Any thoughts about KVM? (long version: Can HVs enable an IO contention that existing control surfaces cannot ameliorate?)
-
Are virtualized GPU functions under control?
Outcome: No clear outcome from this discussion.
-
-
Kernel testing frameworks [video] [slides]
GCOV: summaries don’t tell whether your testing is good or bad. Kselftest and KUnit: combined, they can achieve the goals. A test plan is needed to think about the paths being tested. These frameworks are kept in the kernel tree, so that tests are kept with the code. KCOV is quite different from GCOV, and one cannot simply replace the other. There's also the testing overview page in the kernel docs which covers the differences between KUnit and kselftest: https://www.kernel.org/doc/html/latest/dev-tools/testing-overview.html
Outcome: Feel free to continue conversation at next ELISA workshop.
System Boot and Security MC
CFP Ends: Sept 15, 2021
The System Boot and Security microconference focuses on the firmware, bootloaders, system boot and security around the Linux system. It also welcomes discussions around legal and organizational issues that hinder cooperation between companies and organizations to bring together a secure system.
Suggested topics:
- TPMs, HSMs, secure elements
- Roots of Trust: SRTM and DRTM
- Intel TXT, SGX, TDX
- AMD SKINIT, SEV
- Ways to improve attestation,
- Integrity Measurement Architecture (IMA)
- TrenchBoot, tboot
- UEFI, coreboot, U-Boot, LinuxBoot, hostboot
- Measured Boot, Verified Boot, UEFI Secure Boot, UEFI Secure Boot Advanced Targeting (SBAT)
- shim
- boot loaders: GRUB2, SeaBIOS, network boot, PXE, iPXE
- u-root
- OpenBMC u-bmc
- Legal, organizational and other similar issues relevant for people interested in system boot and security.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "System Boot and Security MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Daniel Kiper <dkiper@net-space.pl>
- Piotr Król <piotr.krol@3mdeb.com>
- Matthew Garrett <mjg59-plumbers@srcf.ucam.org>
Android MC
CFP Ends: Sept 7, 2021
The Android microconference focuses on cooperation between the Android and Linux communities.
Suggested topics:
- Alignment issues between Android and Cgroups v2: Issues in refactoring Android's use of cgroups to utilize cgroups v2
- Thermal: Issues around performance and thermal handling between the kernel and Android's HAL
- Fuse/device-mapper/other storage: Discuss out-of-tree dm/storage drivers and how they might go upstream or better align with upstream efforts
- In-kernel memory tracking: Tracking/accounting for GPU (and other multi-device shared) memory and how it might fit with cgroups
- fw_devlink: Remaining fw_devlink issues to resolve, now that it's enabled by default.
- Hermetic builds/Kbuild
- GKI updates: What's new this year in GKI and where it's going next year
- Rust in AOSP / Kernel / Binder: How Android is adopting Rust for userland and potentially for kernel drivers
- Android Automotive OS Reference Platform: Details on recent Android Automotive work
- Community devboard/device Collaboration: Ways to better collaborate on enabling various devboards against AOSP, without needing close interlock with Google
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Android MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Karim Yaghmour <karim.yaghmour@opersys.com>
- Todd Kjos <tkjos@google.com>
- John Stultz <john.stultz@linaro.org>
- Amit Pundir <amit.pundir@linaro.org>
- Sumit Semwal <sumit.semwal@linaro.org>
- Sandeep Patil <sspatil@google.com>
- Ric Wheeler <ricwheeler@fb.com>
Android MC Summary
-
- Issues seen bringing up a product with uclamp framework
- Problem: Trouble with uclamp.max parameter when multiple tasks with contradictory requests are co-scheduled on the same CPU.
- Example: background/unimportant task alone on a cpu with low uclamp.max setting => low cpu freq, even over a long time
- But when another task (with default uclamp.max) is co-scheduled on the same CPU, the runqueue-level uclamp.max is lifted completely.
- This causes the frequency to go to max, and the EAS over-utilized threshold to be crossed, which will re-enable periodic load-balancing and negatively impact the task placement.
- Wastes significant amounts of energy (the second task probably doesn't need to run at max frequency if it is short running), and negatively impacts performance with unnecessary task migrations. The worst of both worlds.
- Goal: Change the rules for runqueue-level aggregations and apply the uclamp.max value before doing the util_avg sum, hence clamping how much a task can contribute to the overall runqueue-level request.
- In discussion it was agreed that the problem is real and worth fixing.
- There was some concern that it could still potentially lead to some tasks being unfairly treated.
- The uclamp.max value applied to the background task will make it run for longer, which means it will now need more CPU time. So it is more likely that the background task will still be runnable when an important task starts to run, delaying the important task.
- Two proposals:
- 1) check whether the important task could fit at the current CPU frequency (with the background task's uclamp.max applied), and not lift the runqueue's uclamp.max when that is the case, but fallback to raising the frequency to max otherwise.
- 2) account for the tasks priorities to select a frequency point that will allow the important task to be 'served' fairly.
- Android was now making use of cpu.shares to avoid some of these issues.
- First proposal was unlikely to suffice for a product as it didn't address the issue in a number of cases.
- Second proposal however sounded more promising.
- Arm is discussing this internally and plans to address it, but no patches are ready at this point.
- Wang and Perret offered to help w/ testing on their devices.
- Intel-based platforms have similar issues.
- Outcome: Agreement the problem is real & making changes to the uclamp.max aggregation mechanism was indeed the best chance of tackling it. Likely will follow the design principles of proposal #2 to ensure a fair treatment of CFS tasks at all times.
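The difference between the two aggregation rules described above can be shown with a simplified numeric model. It is a sketch only, not scheduler code: utilization and clamps are plain integers on the usual 0 to 1024 capacity scale, and the struct and function names are hypothetical.

```rust
/// Full CPU capacity on the scheduler's utilization scale.
const SCHED_CAPACITY_SCALE: u32 = 1024;

/// Simplified view of a runnable task: its utilization and its uclamp.max.
#[derive(Clone, Copy)]
struct Task {
    util_avg: u32,
    uclamp_max: u32,
}

/// Behaviour described as the problem: utilizations are summed first, and the
/// runqueue-level clamp is the largest uclamp.max of any runnable task.
fn request_max_aggregated(tasks: &[Task]) -> u32 {
    let util: u32 = tasks.iter().map(|t| t.util_avg).sum();
    let rq_max = tasks
        .iter()
        .map(|t| t.uclamp_max)
        .max()
        .unwrap_or(SCHED_CAPACITY_SCALE);
    util.min(rq_max).min(SCHED_CAPACITY_SCALE)
}

/// Behaviour described as the goal: each task's contribution is clamped by its
/// own uclamp.max before the sum is taken.
fn request_clamp_then_sum(tasks: &[Task]) -> u32 {
    let util: u32 = tasks.iter().map(|t| t.util_avg.min(t.uclamp_max)).sum();
    util.min(SCHED_CAPACITY_SCALE)
}
```

Take a background task with a utilization of 800 and uclamp.max of 200, co-scheduled with a short task with a utilization of 100 and the default uclamp.max of 1024. The first rule requests 900 because the second task lifts the runqueue clamp, while the second rule requests 200 + 100 = 300. When the background task runs alone, both rules request 200.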
-
fw_devlink discussion [video][slides]
- Problem: How do we deal with the missing driver use case?
- goal is to have the default kernel (has fw_devlink=on) boot/probe the consumer devices properly when the supplier drivers aren't present.
- Number of potential solutions discussed (incompatible list, extended timer on each probe, user space interface for loading complete, DT annotations for optional dependencies)
- Decision: Agreed to go with the "Set timer and extend" option since that seems like the best balance between "works by default" vs maintainable.
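The "set timer and extend" idea can be sketched as a deadline that is pushed back every time a driver probes, so that consumers stop waiting for missing suppliers only once probing has gone quiet. The model below uses abstract timestamps in seconds; the type, its methods and the window length are hypothetical and are not the kernel's implementation.

```rust
/// Hypothetical model of a timeout that is extended on each probe.
struct ProbeTimeout {
    /// How long to keep waiting after the most recent probe, in seconds.
    window_secs: u64,
    /// Time at which waiting for missing supplier drivers stops.
    deadline: u64,
}

impl ProbeTimeout {
    fn new(now: u64, window_secs: u64) -> Self {
        Self { window_secs, deadline: now + window_secs }
    }

    /// A driver probed at `now`, so more drivers may still be loading:
    /// push the deadline back. Once expired, the timeout stays expired.
    fn on_probe(&mut self, now: u64) {
        if !self.expired(now) {
            self.deadline = now + self.window_secs;
        }
    }

    /// After expiry, consumers would be allowed to probe without their
    /// missing suppliers.
    fn expired(&self, now: u64) -> bool {
        now >= self.deadline
    }
}
```

With a 10 second window started at time 0, a probe at time 8 moves the deadline from 10 to 18, so the system keeps waiting while modules are still arriving but gives up by default once nothing has probed for a full window.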
- Problem: simple-bus + sync_state() callback issues
- Related to supplier drivers getting sync_state() callback correctly after all their consumers have probed.
- fw_devlink needs to create "proxy device links" between the parent of a consumer and the supplier until the child consumer devices are added during parent probe.
- Simple-bus devices add the child devices without ever probing.
- Nicer to just probe the simple-bus devices using the existing simple-pm-bus driver.
- Discussion:
- "simple-bus" or equivalent devices need to be bound to the simple-pm-bus driver so that sync_state() callbacks work correctly for suppliers.
- However "simple-bus" is sometimes a generic match for some devices which have a real specific driver.
- The child devices are populated by the OF platform code but the device itself is bound to a real/specific driver.
- need an "allow list" of compatible strings to bind to or a "deny list" of compatible devices to bind to.
- Rob wanted deny list and have it populated with all the compatible strings that are supported by the kernel that also list "simple-bus" as a generic match.
- Concern that if we use a deny list, might miss not upstreamed compatible strings & a kernel update would break them (the simple-pm-bus driver would bind when it shouldn't).
- Decision: Go with allow list + allow driver_override
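The allow list plus driver_override decision can be sketched as a single matching function. This is an illustration under stated assumptions, not the simple-pm-bus driver: it assumes the device tree convention that compatible strings are listed from most specific to most generic, it treats only the first entry as decisive, and the function name, parameters and example strings are hypothetical.

```rust
/// Decide whether a generic bus driver should bind to a device.
///
/// * `compatible`: the device's compatible strings, most specific first.
/// * `allow_list`: generic strings the driver is allowed to claim.
/// * `driver_override`: driver name forced from user space, if any.
/// * `driver_name`: the name of the driver making the decision.
fn should_bind(
    compatible: &[&str],
    allow_list: &[&str],
    driver_override: Option<&str>,
    driver_name: &str,
) -> bool {
    // An explicit override wins in either direction.
    if let Some(forced) = driver_override {
        return forced == driver_name;
    }
    // Otherwise bind only when the most specific compatible string is itself
    // on the allow list, so devices with a real, specific driver are left alone.
    match compatible.first() {
        Some(most_specific) => allow_list.iter().any(|allowed| *allowed == *most_specific),
        None => false,
    }
}
```

A node whose only compatible string is "simple-bus" is claimed, while a node listing a hypothetical "vendor,soc-bus" ahead of "simple-bus" is not, unless user space sets driver_override to this driver. Unlike a deny list, strings that were never upstreamed cannot cause an unexpected bind after a kernel update, which was the concern raised in the discussion.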
- Problem: Default asynchronous probing / parallelized module loading
- Default async boot cut down module load time by 50%
- Parallelized module loading cut another 50% but functionality was broken mainly due to poor error handling in downstream drivers.
- Discussion: Should we enable default async probing for DT based systems?
- General feeling of "give it a shot"
- Need a new command line param to do async probing by default.
- Concern around ordering/enumeration issues (esp storage), but it should be solvable
- If a category of devices hit these issues, we could just force synchronous probing at the bus level for those devices? Instead of blocking asynchronous probing for every device.
- Decision: Start off with a command line to enable asynchronous probing by default.
- Problem: Topological probing?
- Devices would only be probe attempted once all their suppliers have probed and won't be added to the deferred probe list before that.
- Avoids blindly kicking off a probe attempt due to unrelated device probing
- No real discussion on this
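Topological probing as described above amounts to ordering devices so that each one is attempted only after all of its suppliers have probed. The sketch below does this in rounds over a map from device name to supplier names; it is a toy model with hypothetical names, not driver core code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns an order in which every device is probed only after all of its
/// suppliers. Returns None if that is impossible, either because of a
/// dependency cycle or because a supplier is never added as a device.
fn probe_order<'a>(suppliers: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<String>> {
    let mut probed: BTreeSet<&str> = BTreeSet::new();
    let mut order = Vec::new();

    while order.len() < suppliers.len() {
        // Devices that have not probed yet and whose suppliers all have.
        let ready: Vec<&str> = suppliers
            .iter()
            .filter(|(dev, sups)| {
                !probed.contains(*dev) && sups.iter().all(|s| probed.contains(s))
            })
            .map(|(dev, _)| *dev)
            .collect();

        if ready.is_empty() {
            return None; // nothing can make progress
        }
        for dev in ready {
            probed.insert(dev);
            order.push(dev.to_string());
        }
    }
    Some(order)
}
```

For a display that needs a clock and a regulator, where the regulator also needs the clock, the result is clock, regulator, display. No device is attempted merely because an unrelated device finished probing, which is the property the proposal was after.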
- Problem: Any DT properties we are missing?
- syscon devices are used to set up regmap and then vendor specific bindings point to syscon devices and wait for them to be ready before they can probe.
- Might be good to track those vendor specific bindings and their dependency on syscon devices. Adding all the vendor bindings seems excessive.
- Maybe that's more incentive to move away from vendor specific bindings.
- Ideas about allowing modules to append to the "supplier bindings" that fw_devlink knows how to parse.
- This has a lot of complexities. So might never get done.
- Decision: none
- Problem: How can we improve debugging friendliness?
- Requested a way to find the probe order after booting up to analyze any variance in boot ups.
- Not exactly, but device links info is available in /sys/class/devlink that can be used to get a sense of ordering.
- For a device link where status is tracked (status file != "untracked"), the supplier would always have finished probing before the consumer started probing.
- Problem: Booting issues with fw_devlink=on
- Some devices don't boot properly without deferred_probe_timeout=30.
- Could be related to the "missing driver" discussion earlier.
- Try if deferred_probe_timeout=1 works. If it does, then it's related. If not, it's probably a driver issue that needs to be fixed.
- Some devices have display issues with fw_devlink=on. Possibly related to implicit assumption of probe ordering and clock enable assumptions from there.
- fw_devlink=on issues are generally related to devices not probing. If the devices probe but you still have issues it's generally a driver issue that needs to be fixed. fw_devlink is just highlighting them.
- However, fw_devlink tries not to change probe ordering when devices follow the driver core design/guarantees.
- Saravana will fix things if folks see reordering when it shouldn't have
- Problem: Regulator sync_state()
- Folks want to bring up regulator sync_state()
- Focus is on making sure fw_devlink=on doesn't break anyone's device.
- Have some ideas on how to do it but time constrained.
-
Android drivers in Rust [video][slides]
- Wedson Almeida Filho presented examples of Rust driver code contrasted with their C counterparts taken from PL061 and NVMe drivers, then opened up for questions/discussion.
- Qualcomm has a Rust version of their SoC info driver and intends to port other drivers for some of their SoC hardware blocks.
- The Rust-for-Linux approach is to write safe abstractions that can be reused by other drivers, and to minimize the amount of 'unsafe' code in drivers.
- Qualcomm devs are more concerned with the integration of the Rust toolchain with their build systems.
- Question on the process of learning Rust and concern about C++-like aspects like scoped locks creeping in; was it hard to adjust to these new patterns?
- Almeida tried to explain that locks in Rust are different from C++ in that the data is wrapped by the lock and inaccessible unless the lock is held; this requires a 'guard', whose destructor unlocks the lock.
- The latter part is not the only way to do it though, we can (if we choose to) implement an unlock function that consumes the guard, rendering the data inaccessible again after the lock is released.
- As for adjusting to new patterns, Almeida said he found it OK.
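The locking pattern described in this answer can be illustrated with the standard library's `std::sync::Mutex`, used here as a userspace stand-in; the kernel's Rust lock types are different and are not shown. The `Counters` struct and the function names are hypothetical.

```rust
use std::sync::{Mutex, MutexGuard};

/// Data that is only reachable while the lock is held.
struct Counters {
    interrupts: u64,
}

/// An explicit unlock that consumes the guard. After this call the caller no
/// longer owns a guard, so the protected data is unreachable again.
fn unlock<T>(guard: MutexGuard<'_, T>) {
    drop(guard);
}

/// Explicit style: take the lock, update, then hand the guard to `unlock`.
fn record_interrupt(shared: &Mutex<Counters>) -> u64 {
    // `lock()` returns a guard; the Counters value is accessed through it.
    let mut guard = shared.lock().unwrap();
    guard.interrupts += 1;
    let seen = guard.interrupts;
    // Any use of `guard` after this line is rejected at compile time.
    unlock(guard);
    seen
}

/// Scoped style: the guard's destructor releases the lock at the end of scope.
fn scoped_increment(shared: &Mutex<Counters>) {
    let mut guard = shared.lock().unwrap();
    guard.interrupts += 1;
}
```

In both styles there is no way to touch `interrupts` without first obtaining a guard from the `Mutex<Counters>`, which is the difference from a C++ scoped lock that merely sits next to the data it protects.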
- Question about the motivation for adding Rust support
- 70-80% of vulnerabilities come from memory-safety issues, which Rust eliminates by construction
- This is a fundamental difference from other languages without garbage collection.
- also helps with correctness because someone writing a driver can focus on their hardware instead of all the small ways in which one can misuse kernel APIs and introduce bugs.
- Question: Is writing drivers in Rust more approachable than in C?
- Almeida thinks so, and it was suggested that this was maybe a better selling point.
- For maintainers it's easier to review and accept a Rust patch because they don't have to worry about classes of bugs.
- Maintainers need to be very careful with unsafe blocks, but Rust requires them to be explicitly marked, so maintainers know about them.
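The locking pattern from the discussion above (data wrapped by the lock, a guard whose destructor unlocks, and an optional unlock function that consumes the guard) can be sketched roughly as follows. This is only an illustration using the userspace `std::sync::Mutex`, not the kernel's Rust lock abstractions, which were not shown in these notes; `DeviceState`, `unlock` and `bump_twice` are hypothetical names made up for the example.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical driver state; the name and field are illustrative only.
struct DeviceState {
    open_count: u32,
}

/// Explicit unlock: takes the guard by value, so after this call the caller
/// can no longer reach the protected data through it.
fn unlock(guard: MutexGuard<'_, DeviceState>) {
    drop(guard); // dropping the guard releases the lock
}

/// Increments the counter twice, once per unlocking style, and returns it.
fn bump_twice() -> u32 {
    // The data lives inside the lock; it cannot be reached without locking.
    let state = Mutex::new(DeviceState { open_count: 0 });

    {
        let mut guard = state.lock().unwrap();
        guard.open_count += 1;
    } // guard goes out of scope here: its destructor releases the lock

    let mut guard = state.lock().unwrap();
    guard.open_count += 1;
    unlock(guard);
    // guard.open_count += 1; // would not compile: `guard` was moved into unlock()

    // Lock once more to read the final value.
    let count = state.lock().unwrap().open_count;
    count
}

fn main() {
    println!("open_count = {}", bump_twice());
}
```

In both styles the compiler, rather than reviewer discipline, rejects any access to `open_count` while the lock is not held.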
-
Improving AOSP Community Collaboration [video][slides]
- Problem: Keeping up with AOSP can be difficult, and documentation is often lacking.
- While there are a few devboards that are supported directly in the AOSP project, outside of AOSP there are a number of interesting efforts to support boards or devices against AOSP.
- Linaro heavily leans on shared experiences with the devboards they support
- Proposal: We create some sort of community to raise awareness of each other's efforts and potentially allow for collaboration between devices outside of AOSP as well.
- Initial internal discussions found not much appetite/funding for a centralized LineageOS style project.
- Proposal: a lightweight wiki + mailing list + IRC channel approach
- A few folks asked what the blockers were to setting up the wiki/mailing list/IRC channel; mainly it is a matter of getting community input so that it is hosted in a neutral way and folks agree on the naming of the group.
- The kernel-team@android.com / android-kernel-team@google.com email aliases were mentioned and are very low volume.
- They are great tools for engaging with Google, but as they aren't public mailing lists, the proposal was more about finding a way for various parties in the community to collaborate.
- It was asked if silicon vendors would be interested in participating, and there is potential, but they are usually focused on the next Android release, not AOSP
- It was suggested that the collaboration page might be a good place to share the libcamera efforts, as that's something that is in progress but would benefit from wider use.
- Outcome: Unfortunately there weren't many suggestions or ideas for names/hosting. The most concrete suggestion was using something like a github io page.
- Will move forward with web page, IRC channel and mailing list to try to provide a lightweight place for folks to collaborate.
- Other general AOSP gripes were brought up
- meson integration and lacking documentation.
- improvements to documentation have been made, but issues like versioning make it complicated.
- New HAL interfaces being developed internally was also brought up as an area that is difficult.
- Cuttlefish suggested as a reference implementation and resource for understanding and validating HALs.
- One proposal suggested a project to create generic HALs that focus on upstream kernel interfaces
- good work done recently on haptics drivers to create a reference implementation.
- v4l2_codec2 might be a similar area.
- Another proposal was sharing device directories between multiple devices
- If there were more dynamic HALs that could do more runtime selection, they might be able to share boot and vendor partitions as well.
- Similar to what Qualcomm does on shipping devices, with System/Vendor images that support a number of their SoCs.
- Keeping the idea to a single SoC seemed reasonable, as purely generic approaches may take too much disk space. However, non-Android distros are able to have generic userlands that support a wide range of form-factor devices, and there was a general push for more usage of common upstream kernel interfaces.
- Separate topic of modularizing binder and ashmem came up as well
- for supporting android environments on top of classic linux distributions.
- bad idea for binder, which should be built in.
- For ashmem, it should be able to be dropped entirely unless one is planning to support the Google Play Store, as some legacy applications which have not been updated still directly make use of the interface.
GPU/media/AI buffer management and interop MC
CFP Ends: Sept 10, 2021
The GPU/media/AI buffer management and interop microconference focuses on Linux kernel support for new graphics hardware that is coming out in the near future. Most vendors are also moving to firmware control of job scheduling, additionally complicating the DRM subsystem's model of open user space for all drivers and API. This has been a lively topic with neural-network accelerators in particular, which were accepted into an alternate subsystem to avoid the open-user space requirement, something which was later regretted.
As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community. Much of the graphics-specific integration will be discussed at XDC the prior week, but particularly with cgroup integration of memory and job scheduling being a topic, plus the already-complicated integration into the memory-management subsystem, input from core kernel developers would be much appreciated.
Suggested topics:
- Handling explicit synchronization without kernel mediation and awareness
- Supporting fence-value reads on multiple architectures in an efficient way
- User space API implications of buffer sharing without guaranteed job completion
- cgroup support and accounting for GPU memory allocation and time-sharing
- API and long-term support implications for closed-source firmware
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "GPU/media/AI buffer management and interop MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
Daniel Stone <daniel@fooishbar.org>
Diversity, Equity and Inclusion MC
CFP Ends: Sept 10, 2021
The Diversity, Equity and Inclusion microconference focuses on identifying how we can improve the diversity of new contributors and retain existing developers and maintainers in the kernel community. As the Linux kernel community turns 30 this year, we need to understand what is working and where we can improve (and how). Experts from the DEI research community will share their perspectives, together with perspectives from Linux community members, to help determine some next steps.
Suggested topics:
- What are the challenges in attracting and retaining a diverse group of developers that are worth focusing on?
- Does the Code of Conduct and Inclusive naming efforts help people of diverse groups feel at home? What else is missing?
- How effective have the kernel mentoring initiatives been? Are there best practices emerging that help the limited pool of mentors be more effective?
- What will be the most effective next steps for advancing Diversity, Equity and Inclusion that will improve the trends, and help us scale?
If you are interested in participating in this microconference and have topics to propose, please use the CfP process, and select "Diversity, Equity and Inclusion MC" for the "Track". More topics will be added based on CfP for this microconference.
MC leads:
- Shuah Khan <shuah@kernel.org>
- Kate Stewart <stewart@linux.com>