The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond. LPC 2018 will be held November 13-15 in Vancouver, BC, Canada. We are looking forward to seeing you there!
The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.
As part of an ongoing research project we have added several features to CRIU: post-copy memory migration, post-copy migration over RDMA and support for cross-architecture checkpoint-restore.
The "plain" post-copy migration is already upstream and, up to hiccups that regularily show up in CI, it can be considered working so there is not much to discuss about it.
The post-copy migration over RDMA aims to reduce the remote page fault latency. We have a working prototype that replaces the TCP transport for memory transfer with RDMA. We still do not have a solid performance evaluation, but if RDMA provides the expected reduction in page fault latency, we are going to work with the CRIU community to get the RDMA support upstream.
The cross-architecture checkpoint-restore is the most peculiar and controversial feature. Various aspects of heterogeneous-ISA execution have been a hot research topic for a while, and we decided to see what it would take to make CRIU capable of migrating an application between architectures.
The idea is simple: if we build the binary for all architectures so that all the symbols have exactly the same address, then restoring a dump created on a different architecture is a matter of transforming the stack and the registers.
This transformation relies on metadata generated by a specialized compiler that produces multiple object files from a single source (one for each architecture), hence the FatELF.
So far we have been able to force CRIU to create a checkpoint of an extended "Hello, World" application on arm64 and restore this application on x86.
Last year we discussed the efforts to bring stacking and namespacing to the LSM subsystem. Over the last year several of the outstanding issues have been resolved (if not always in the most satisfactory way). The path forward for upstreaming stacking is now clear.
This presentation will discuss solutions to the outstanding problems and the current direction for upstreaming LSM stacking, as well as the status of namespacing in the LSM subsystem and what it means for containers.
Google has a large cgroup v1 deployment and has begun planning its migration to cgroup v2. This migration has proven difficult because of our extensive use of cgroup v1 features.
Among the most challenging issues are the transition from multiple hierarchies to a unified one, migration of users who create their own cgroups, custom threaded cgroup management, and the inability to transition a controller between v1 and v2 gradually. Additionally, the cgroup v1 features don't map exactly to ones in cgroup v2, which adds risk during the migration as non-obvious behavior changes have to be debugged and tracked down. The talk will outline these challenges in more detail and describe the approaches we have taken to solve them, including proposals for possible changes to ease this migration for other users of the cgroup v1 interface.
Discussions around a time namespace have been going on for a long time. The first attempt to implement one was in 2006, and since then the topic has appeared on and off in various discussions.
There are two main use cases for time namespace:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.
“It seems like this might be one of the last major obstacles keeping migration from being used in production systems, given that not all containers and connections can be migrated as long as a time dependency is capable of messing it up.” (by github.com/dav-ell)
The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME. The last two clocks are monotonic, but their start points are not well defined (currently it is system start-up time, but POSIX says "since an unspecified point in the past") and differ from one system to another. When a container is migrated from one node to another, all clocks have to be restored into consistent states; in other words, they have to continue running from the same points where they were dumped.
We are going to present a patch set to support offsets for monotonic and boottime clocks. There are a few points to discuss:
Any other features of time namespaces can be also discussed there.
Git: https://github.com/0x7f454c46/linux/tree/wip/time-ns
Wiki: https://criu.org/Time_namespace
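A rough sketch of the intended behavior (names such as CLONE_NEWTIME and /proc/self/timens_offsets follow the work-in-progress patch set and may well change before merging):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080    /* from the patch set, not yet in UAPI */
#endif

int main(void)
{
    if (unshare(CLONE_NEWTIME))     /* children will enter the new namespace */
        return 1;

    /* Shift CLOCK_MONOTONIC by a week; offsets must be written
     * before the first process enters the namespace. */
    FILE *f = fopen("/proc/self/timens_offsets", "w");
    if (!f)
        return 1;
    fprintf(f, "monotonic %d 0\n", 7 * 24 * 3600);
    fclose(f);

    if (fork() == 0) {              /* the child runs inside the namespace */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        printf("monotonic in namespace: %ld\n", (long)ts.tv_sec);
        _exit(0);
    }
    wait(NULL);
    return 0;
}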
Currently, container runtimes face a large attack surface when it comes to a malicious container guest. The most obvious attack surface is the filesystem, and the wide variety of filesystem races and other such tricks that can be used to fool a container runtime into accessing files it shouldn't. To tackle this, most container runtimes have come up with the necessary userspace hacks to work around these issues -- but many of these improvements are inherently flawed because they are not done from kernel-space.
In this session, a discussion of the various kernel APIs that could benefit container runtime security will be opened. Topics on the agenda would be the use of AT_EMPTY_PATH with openat(2), whether there are any more blockers for the AT_NO_JUMPS patchset, and a proposal of AT_THIS_ROOT which would allow for much more secure interaction with
container filesystems.
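As a sketch of how an AT_THIS_ROOT-style interface might be used (the flag is only a proposal, so the value below is purely hypothetical; today the flag must stay commented out and the call remains escapable):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef AT_THIS_ROOT
#define AT_THIS_ROOT 0x8000    /* hypothetical value, not a real flag yet */
#endif

int open_in_container(const char *rootfs, const char *path)
{
    int rootfd = open(rootfs, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (rootfd < 0)
        return -1;
    /* With AT_THIS_ROOT, ".." components and absolute symlinks in
     * "path" would resolve relative to rootfd instead of escaping
     * to the host filesystem. */
    int fd = openat(rootfd, path, O_RDONLY | O_CLOEXEC /* | AT_THIS_ROOT */);
    close(rootfd);
    return fd;
}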
The Linux Plumbers Conference, being the place where most CRIU developers and users regularly meet and exchange news and ideas, has traditionally featured an overview talk about what has happened in and around CRIU since the previous Linux Plumbers Conference.
Now that the checkpoint and restore micro-conference is part of the containers micro-conference, we still want to keep this 'tradition', as it gives us and all other attendees a good overview of what has changed in CRIU, shows how those changes are used in projects building on CRIU, and motivates other projects to make use of the newest CRIU features.
In this talk I will give an overview of what has changed in CRIU since the last Linux Plumbers Conference, give details about certain changes, explain why we (the CRIU developers) think they are important, and show how they are (or could be) used in projects making use of CRIU.
In addition to past changes, we want to hear from the community what changes they would like to see in CRIU to better support their projects using CRIU.
The pivot_root() operation is an essential step in virtualizing a container's root directory. Current pivot_root() semantics require that the new root is not a shared mountpoint; if it is, the pivot_root() operation will not be allowed. However, some containers need to have a virtualized root directory while at the same time having the root directory be a shared mountpoint. This is necessary when mounts between the host and the container are supposed to propagate, in order to have a straightforward mechanism to share mount information. In this talk we will explain the original reason for blocking pivot_root() on shared mountpoints and start a discussion centered around a patchset that is a necessary precondition to safely enabling pivot_root() on shared mountpoints.
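For reference, the sequence a runtime performs today looks roughly like this (a minimal sketch, error handling reduced to early returns; it assumes the caller is already in a new mount namespace):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_root(const char *new_root)
{
    /* The step the patchset wants to make optional: pivot_root()
     * currently refuses shared mounts, so the tree is recursively
     * made private, cutting off host/container propagation. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL))
        return -1;
    /* pivot_root() requires new_root to be a mount point. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL))
        return -1;
    if (chdir(new_root))
        return -1;
    /* Stack the old root on top of the new one, then detach it. */
    if (syscall(SYS_pivot_root, ".", "."))
        return -1;
    return umount2(".", MNT_DETACH);
}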
This talk focuses on our use of CRIU for transparent checkpoint/restore task migrations within Google's shared compute infrastructure. This project began as a means to simplify user applications and increase utilization in our clusters. We've now productionized a sizable deployment of our CRIU-based task migration infrastructure. We'll present our experiences using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.
System resource information, like memory, network and device statistics, is crucial for system administrators to understand the inner workings of their systems, and is increasingly being used by applications to fine-tune performance in different environments.
Getting system resource information on Linux is not a straightforward affair. The best way is to collect the information from procfs or sysfs, but doing so presents many challenges. Each time an application wants a piece of system resource information, it has to open a file, read the content and then parse it to get the actual value. If the application is running in a container, then even reading from procfs directly may give information about resources of the whole system instead of just the container. Libresource tries to fix a few of these problems by providing a standard library with a set of APIs through which we can get system resource information, e.g. memory, CPU, stat, networking and security related information.
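To make the cost concrete, here is the open/read/parse cycle an application must go through today for a single value (a minimal, runnable sketch fetching MemFree from /proc/meminfo; a single library call would replace all of it):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    long kb;

    if (!f)
        return 1;
    /* scan line by line until the wanted field is found */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "MemFree: %ld kB", &kb) == 1) {
            printf("free memory: %ld kB\n", kb);
            break;
        }
    }
    fclose(f);
    return 0;
}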
Libresource provides (or may provide) the following benefits:
While deploying a CRIU-based transparent checkpoint/restore task migration infrastructure at Google, one of the toughest challenges we faced was security. The infrastructure views the applications it runs as inherently untrusted, yet CRIU requires expansive privileges at times in order to successfully checkpoint and restore workloads. We found many cases where malicious workloads could trick CRIU into elevating their privileges during checkpoint/restore. We present our experience in securely checkpointing and restoring untrusted workloads with minimal Linux privileges while enabling the bulk of CRIU functionality. We'll discuss changes required to enable this use case and make the case for an increased emphasis on security in checkpoint/restore.
Several in-house Oracle customers have identified that their large seccomp filters are slowing down their applications. Their filters largely consist of simple allow/deny logic for many syscalls (306 in one case) and for the most part don't utilize argument filtering.
Currently libseccomp generates an if-equal statement for each syscall in the filter. Its pseudocode looks roughly like this:
if (syscall == read)
return allow
if (syscall == write)
return allow
...
# 300 or so if statements later
if (syscall == foo)
return allow
return deny
This is very slow for syscalls that happen to be at the end of the filter. Libseccomp currently allows prioritizing the syscalls to place the most frequent ones at the top of the filter, but this isn't always possible - especially in a cloud situation.
I currently have a working prototype and the numbers are very promising.
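The prototype arranges the filtered syscall numbers into a binary search tree, so the number of comparisons grows logarithmically rather than linearly with the filter size. In pseudocode it looks roughly like this (a sketch; the split values are hypothetical, not what libseccomp actually emits):

if (syscall < 150)          # compare against the median syscall number
    if (syscall < 75)
        # ...keep halving until an if-equal leaf is reached:
        if (syscall == read)
            return allow
    else
        ...
else
    if (syscall < 225)
        ...
    else
        ...
return deny
# worst case: ~log2(306), i.e. about 9, comparisons instead of ~306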
In this prototype, I timed calling getppid() in a loop using a filter similar to one of my customers' seccomp filters. I ran this loop one million times and recorded the min, max, and mean times (in TSC ticks) to call getppid(). (I didn't disable interrupts, so the max time was often large.) I chose to report the minimum time because I feel it best represents the actual time to traverse the filter.
The code for the libseccomp RFE is available here:
https://github.com/seccomp/libseccomp/issues/116
Test case                                              Minimum TSC ticks to make syscall
----------------------------------------------------------------
seccomp disabled                                       138
getppid() at the front of 306-syscall seccomp filter   256
getppid() in middle of 306-syscall seccomp filter      516
getppid() at the end of the 306-syscall filter         1942
getppid() in a binary tree                             312
On non-embedded systems, device management in Linux is a task split between kernelspace and userspace. Since the implementation of the devtmpfs pseudo-filesystem, the kernel is solely responsible for creating device nodes, while udev in userspace is mainly responsible for consistent device naming and permissions. The devtmpfs filesystem, however, is not namespace-aware, so devices always belong to the initial user namespace. In times of SR-IOV-enabled devices it is both possible and necessary to hand off devices to non-initial user namespaces.
For the last couple of months I've been working on enabling userspace to target device events at specific user namespaces. With recent patchsets of mine we have now reached that goal, so userspace can now tie devices to a specific user namespace. This talk aims to explain the concept of namespace-aware device management, the patchsets that were needed to make device management namespace-aware, and possible future improvements.
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Interactive applications, which include everything from real-time games through flight simulators and virtual reality environments, place strong real-time requirements on the whole computing environment to ensure that the correct data are presented to the user at the correct time. This requires two things: first, that the time when the information will be displayed be known to the application so that the correct contents can be computed, and second, that the frame actually be displayed at that time.
These two pieces of information are managed inconsistently through the graphics stack, making it difficult for applications to provide a smooth animation experience to users. And because of the many APIs which lie between the application rendering using OpenGL or Vulkan and the underlying hardware, a failure to correctly handle things at any point along the chain will result in stuttering.

Fixing this requires changes throughout the system: making the kernel provide better control and information about the queuing and presentation of images; changing composited window systems to ensure that images are displayed at the target time and that the actual time of presentation is reported back to applications; and finally adding to rendering APIs like Vulkan to expose control over image presentation times and feedback about when images ended up being visible to the user.
This presentation will first demonstrate the effects of poor display
timing support inherent in the current graphics stack, talk about the
different solutions required at each level of the system and finally
show the working system.
Software that uses a 32-bit integer to represent seconds since the Unix epoch of Jan 1 1970 is affected by that variable overflowing on Jan 19 2038, often in a catastrophic way. Aside from most 32-bit binaries that use timestamps, this includes file systems (e.g. ext3 or xfs), file formats (e.g. cpio, utmp, core dumps), network protocols (e.g. nfs) and even hardware (e.g. real-time clocks or SCSI adapters).
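The arithmetic of the overflow is easy to demonstrate (a minimal sketch; on a platform with 64-bit time_t the two dates print as expected):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    /* 2^31 - 1 seconds after the epoch: the last moment a signed
     * 32-bit time_t can represent. */
    time_t last = (time_t)INT32_MAX;
    printf("last 32-bit second: %s", ctime(&last));    /* Jan 19 2038 */

    /* One second later a 32-bit counter wraps to INT32_MIN,
     * which is interpreted as a date in December 1901. */
    time_t wrapped = (time_t)(int32_t)((uint32_t)INT32_MAX + 1);
    printf("after wraparound:   %s", ctime(&wrapped));
    return 0;
}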
Work has been going on to avoid that overflow in the Linux kernel, with hundreds of patches reworking drivers, file systems and the user space interfaces including over 50 affected system calls.
With much of this activity getting done during 2018, it's time to give an update on what has been achieved in the kernel, what parts remain to be solved, how we will proceed to solve this in user space, and how to use the work in long-lived product deployments.
It is common to see Linux being used in real-time research projects. However, the assumptions made in papers are very often unrealistic. Conversely, researchers argue that the main metric used for PREEMPT RT, although useful, is an oversimplification of the problem.
It is a consensus that academic research helps to improve Linux's state of the art, and vice versa. So how can we reduce the gap between these two task forces? Real-time researchers start papers with a clear definition of the task model, but we do not have a task model for Linux: this is where the gap is.
This talk presents an effort to establish a task model for PREEMPT RT Linux, starting with a description of the operations that influence the timing behavior of tasks and the definition of the relationships among those operations. Finally, the outcomes for Linux, like new metrics for PREEMPT RT and a model validator (a lockdep-like verifier, but for preemption) for the kernel, are discussed.
The SCHED_DEADLINE scheduling policy is anything but done. Even though it has existed in mainline for several years, many features are yet to be implemented; some are already available as immature code, some others only exist as wishes.
In this talk Juri Lelli and Daniel Bristot De Oliveira will give the audience in-depth details of what’s missing, what’s under development and what might be desirable to have. The intent is to provide as much information as possible to people attending, so that a fruitful discussion might be held later on during hallway and micro conference sessions.
Examples of what is going to be presented are:
A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.
Official Networking Track website: http://vger.kernel.org/lpc-networking2018.html
XDP already offers rich facilities for high performance packet
processing, and has seen deployment in several production systems.
However, this does not mean that XDP is a finished system; on the
contrary, improvements are being added in every release of Linux, and
rough edges are constantly being filed down. The purpose of this talk is
to discuss some of these possibilities for future improvements,
including how to address some of the known limitations of the system. We
are especially interested in soliciting feedback and ideas from the
community on the best way forward.
The issues we are planning to discuss include, but are not limited to:
User experience and debugging tools: How do we make it easier for
people who are not familiar with the kernel or XDP to get to grips
with the system and be productive when writing XDP programs?
Driver support: How do we get to full support for XDP in all drivers?
Is this even a goal we should be striving for?
Performance: At high packet rates, every micro-optimisation counts.
Things like inlining function calls in drivers are important, but also
batching to amortise fixed costs such as DMA mapping. What are the
known bottlenecks, and how do we address them?
QoS and rate transitions: How should we do QoS in XDP? In particular,
rate transitions (where a faster link feeds into a slower) are
currently hard to deal with from XDP, and would benefit from, e.g.,
Active Queue Management (AQM). Can we adapt some of the AQM and QoS
facilities in the regular networking stack to work with XDP? Or should
we do something different?
Accelerating other parts of the stack: Tom Herbert started the
discussion on accelerating transport protocols with XDP back in 2016.
How do we make progress on this? Or should we be doing something
different? Are there other areas where we can extend XDP's processing
model to provide useful accelerations?
XDP is a framework for running BPF programs in the NIC driver to allow
decisions about the fate of a received packet at the earliest point in
the Linux networking stack. For the most part the BPF programs rely on
maps to drive packet decisions, maps that are managed for example by a
userspace agent. This architecture has implications for how the system is
configured, monitored and debugged.
An alternative approach is to make the kernel networking tables
accessible by BPF programs. This approach allows the use of standard
Linux APIs and tools to manage networking configuration and state while
still achieving the higher performance provided by XDP. An example of
providing access to kernel tables is the recently added helper to allow
IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites
such as FRR manage the FIB tables, and the XDP packet path benefits by
automatically adapting to the FIB updates in real time. While a huge
first step, a FIB lookup alone is not sufficient for general networking
deployments.
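To give a flavor of the approach, here is a condensed IPv4 forwarding sketch in the spirit of the kernel's xdp_fwd sample (simplified: no VLAN handling, no TTL decrement, return-code handling reduced):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("xdp")
int xdp_fib_fwd(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Ask the kernel FIB -- as maintained by FRR or iproute2 --
     * for this packet's nexthop. */
    struct bpf_fib_lookup fib = {};
    fib.family   = AF_INET;
    fib.tos      = iph->tos;
    fib.tot_len  = bpf_ntohs(iph->tot_len);
    fib.ipv4_src = iph->saddr;
    fib.ipv4_dst = iph->daddr;
    fib.ifindex  = ctx->ingress_ifindex;

    if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
        return XDP_PASS;    /* punt exceptions to the regular stack */

    __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
    __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
    return bpf_redirect(fib.ifindex, 0);
}
char _license[] SEC("license") = "GPL";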
This talk discusses the advantages of making kernel tables available to
XDP programs to create a programmable packet pipeline, what features
have been implemented as of October 2018, key missing features, and
current challenges.
Over the past several years, BPF has steadily become more powerful in multiple
ways: Through building more intelligence into the verifier which allows more
complex programs to be loaded, and through extension of the API such as by
adding new map types and new native BPF function calls. While BPF has its roots
in applying filters at the socket layer, the ability to introspect the sockets
relating to traffic being filtered has been limited.
To build such awareness into a BPF helper, the verifier needs the ability to
track the safety of the calls, including appropriate reference counting upon
the underlying socket. This talk walks through extensions to the verifier to
perform tracking of references in a BPF program. This allows BPF developers to
extend the UAPI with functions that allocate and release resources within the
execution lifetime of a BPF program, and the verifier will validate that the
resources are released exactly once prior to program completion.
Using this new reference tracking ability in the verifier, we add socket lookup
and release function calls to the BPF API, allowing BPF programs to safely find
a socket and build logic upon the presence or attributes of a socket. This can
be used to load-balance traffic based on the presence of a listening
application, or to implement stateful firewalling primitives to understand
whether traffic for this connection has been seen before. With this new
functionality, BPF programs can integrate more closely with the networking
stack's understanding of the traffic transiting the kernel.
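A sketch of what this enables (helper names and signatures as proposed in the series, so details may change under review):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int sk_aware(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    struct tcphdr *tcp;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;
    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;
    tcp = (void *)iph + iph->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    struct bpf_sock_tuple tuple = {
        .ipv4 = { .saddr = iph->saddr, .daddr = iph->daddr,
                  .sport = tcp->source, .dport = tcp->dest },
    };
    /* Acquires a reference the verifier tracks... */
    struct bpf_sock *sk = bpf_sk_lookup_tcp(skb, &tuple,
                                            sizeof(tuple.ipv4), -1, 0);
    if (!sk)
        return TC_ACT_SHOT;    /* stateful-firewall style: no socket, drop */
    /* ...which must be released on every program path, or the
     * verifier rejects the program. */
    bpf_sk_release(sk);
    return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";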
In this talk we describe our experiences in evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200ms. Once we fixed those issues, we proceeded to test in a 6-rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12 to 1000x) at the cost of higher CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results, explore the causes of the higher CPU utilization, and discuss partial solutions to this issue.
Note: We plan to test on a much larger cluster and have those results available before the conference.
Linux bridge is deployed on hosts, hypervisors, container OSes and, in recent years, on data center switches. It is complete in its feature set, with forwarding, learning, proxy and snooping functions. It can bridge Layer 2 domains between VMs, containers, racks, PODs and between data centers, as seen with Ethernet Virtual Private Networks [1, 2]. With Linux bridge deployments moving up the rack, it is now bridging larger Layer 2 domains, bringing in scale challenges. The bridge forwarding database can scale to thousands of entries on a data center switch with hardware acceleration support.
In this paper we discuss performance and operational challenges with a large-scale bridge fdb database and solutions to address them. We will discuss solutions like fdb dst port failover for faster convergence, a faster API for fdb updates from the control plane, and reducing the number of fdb dst ports with lightweight tunnel endpoints for bridging over a tunneling solution (e.g. vxlan).
Though discussed around the deployment scenarios below, most solutions are generic and can be applied to all bridge use-cases:
[1] https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-11
[2] https://www.netdevconf.org/2.2/slides/prabhu-linuxbridge-tutorial.pdf
The eXpress Data Path (XDP) is a new kernel feature, intended to provide fast packet processing as close as possible to the device hardware. XDP builds on top of the extended Berkeley Packet Filter (eBPF) and allows users to write a C-like packet processing program, which can be attached to the device driver's receive queue. When the device observes an incoming packet, the user-defined XDP program is triggered to execute on the packet payload, making the decision as early as possible before handing the packet down the processing pipeline.
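For readers unfamiliar with the form, the smallest possible XDP program illustrates the model (a trivial sketch; real programs parse headers via ctx->data / ctx->data_end before returning a verdict such as XDP_DROP, XDP_TX or XDP_REDIRECT):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_prog(struct xdp_md *ctx)
{
    /* Earliest-possible decision point: this verdict is taken in
     * the driver, before an skb is ever allocated. */
    return XDP_PASS;
}
char _license[] SEC("license") = "GPL";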
P4 is a domain-specific language describing how packets are processed by the data plane of programmable network elements, including network interface cards, appliances, and virtual switches. It provides an abstraction that allows programmers to express existing and future protocol formats without coupling them to any data-plane-specific knowledge. The language is explicitly designed to be protocol-agnostic: a P4 programmer can write their own protocols and load the P4 program into P4-capable network elements.
As a high-level networking language, P4 supports a diverse set of compiler backends and can also express eBPF and XDP programs.
We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages XDP to aim for a high-performance software data plane. The backend generates an eBPF-compliant C representation from a given P4 program, which is passed to clang and LLVM to produce the bytecode. Using conventional eBPF kernel hooks, the program can then be loaded into the eBPF virtual machine in the device driver. The kernel verifier guarantees the safety of the generated code. Any packets received/transmitted from/to this device driver now trigger the execution of the loaded P4 program.
The P4C-XDP project is an open source project hosted at https://github.com/vmware/p4c-xdp/. We provide proof-of-concept sample code under the tests directory, which contains a couple of examples such as basic protocol parsing, checksum recalculation, multiple table lookups, and tunnel protocol en-/decapsulation.
Port mirroring is one of the most common network troubleshooting techniques. SPAN (Switch Port Analyzer) allows a user to send a copy of the monitored traffic to a local or remote device using a sniffer or packet analyzer. RSPAN is similar, but sends the mirrored traffic over a VLAN. ERSPAN extends the port mirroring capability from Layer 2 to Layer 3, allowing the mirrored traffic to be encapsulated in an extension of the GRE (Generic Routing Encapsulation) protocol and sent through an IP network. In addition, ERSPAN carries configurable metadata (e.g., session ID, timestamps), so that the packet analyzer has a better understanding of the packets.
ERSPAN for IPv4 was added into Linux kernel in 4.14, and for IPv6 in
4.16. The implementation includes both transmission and reception and
is based on the existing ip_gre and ip6_gre kernel modules. As a
result, Linux today can act as an ERSPAN traffic source sending the
ERSPAN mirrored traffic to the remote host, or an ERSPAN destination
which receives and parses the ERSPAN packets generated from Cisco or
other ERSPAN-capable switches.
We’ve added both the native tunnel support and metadata-mode tunnel
support. In this paper, we demonstrate three ways to use the ERSPAN
protocol. First, for Linux users, using iproute2 to create native
tunnel net device. Traffic sent to the net device will be
encapsulated with the protocol header accordingly and traffic matching
the protocol configuration will be received from the net device.
Second, for eBPF users, using iproute2 to create metadata-mode ERSPAN
tunnel. With eBPF TC hook and eBPF tunnel helper functions, users can
read/write ERSPAN protocol’s fields in finer granularity. Finally,
for Open vSwitch users, using the netlink interface to create a switch
and programmatically parse, lookup, and forward the ERSPAN packets
based on flows installed from the userspace.
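For the first method, the iproute2 invocation looks roughly like this (illustrative; see ip-link(8) for the authoritative syntax and the ERSPAN v2 options such as erspan_dir and erspan_hwid):

# ERSPAN v1 tunnel, GRE key 100, ERSPAN session index 123
ip link add dev erspan1 type erspan seq key 100 \
        local 10.0.0.1 remote 10.0.0.2 erspan_ver 1 erspan 123
ip link set erspan1 up
# mirror all ingress traffic from eth0 into the tunnel
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall \
        action mirred egress mirror dev erspan1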
AF_XDP is a new socket type for raw frames to be introduced in 4.18
(in linux-next at the time of writing). The current code base offers
throughput numbers north of 20 Mpps per application core for 64-byte
packets on our system, however there are a lot of optimizations that
could be performed in order to increase this even further. The focus
of this paper is the performance optimizations we need to make in
AF_XDP to get it to perform as fast as DPDK.
We present optimizations that fall into two broad categories: ones that are seamless to the application and ones that require additions to the uapi. In the first category we examine the following:
Loosen the requirement for having an XDP program. If the user does
not need an XDP program and there is only one AF_XDP socket bound to
a particular queue, we do not need an XDP program. This should cut
out quite a number of cycles from the RX path.
Wire up busy poll from user space. If the application writer is
using epoll() and friends, this has the potential benefit of
removing the coherency communication between the RX (NAPI) core and
the application core as everything is now done on a single
core. Should improve performance for a number of use cases. Maybe it
is worth revisiting the old idea of threaded NAPI in this context
too.
Optimize for high instruction cache usage through batching, as has been explored in, for example, Cisco's VPP stack and by Edward Cree in his net-next RFC "Handle multiple received packets at each stage".
In the uapi extensions category we examine the following
optimizations:
Support a new mode for NICs with in-order TX completions. In this
mode, the completion queue would not be used. Instead the
application would simply look at the pointer in the TX queue to see
if a packet has been completed. In this mode, we do not need any
backpressure between the completion queue and the TX queue and we
do not need to populate or publish anything in the completion queue
as it is not used. Should improve the performance of TX for in-order
NICs significantly.
Introduce the "type-writer" model where each chunk can contain
multiple packets. This is the model that e.g., Chelsio has in its
NICs. But experiments show that this mode also can provide better
performance for regular NICs as there are fewer transactions on the
queues. Requires a new flag to be introduced in the options field of
the descriptor.
With these optimizations, we believe we can reach our goal of close to
40 Mpps of throughput for 64-byte packets in zero-copy mode. Full
analysis with performance numbers will be presented in the final
paper.
iptables has been the typical tool for creating firewalls on Linux hosts, and we have used it at Facebook for setting up host firewalls on our servers across a variety of tiers. In this proposal, we introduce an eBPF/XDP-based firewall solution which we use for packet filtering and which has parity with our iptables implementation. We discuss various aspects of this; the following is a brief summary of these aspects, which we will detail further in the paper / presentation.
Design and Implementation:
Performance benefits and comparisons with iptables
Ease of policy / config updates and maintenance
Deployment experience:
BPF Program array
Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering
This talk is a continuation of the initial XDP HW-based hints work presented at NetDev 2.1 in Seoul, South Korea.
It will start with focus on showcasing new prototypes to allow an XDP program to request required HW-generated metadata hints from a NIC. The talk will show how the hints are generated by the NIC and what are the performance characteristics for various XDP applications. We also want to demonstrate how such a metadata can be helpful for applications that use AF_XDP sockets.
The talk will then discuss planned upstreaming thoughts, and look to generate more discussion around implementation details, programming flows, etc., with the larger audience from the community.
SCTP is a transport protocol, like TCP and UDP, originating from the SIGTRAN IETF Working Group in the early 2000s with the initial objective of supporting the transport of PSTN signalling over IP networks. It featured multi-homing and multi-streaming from the beginning, and since then there have been a number of improvements that help it serve other purposes too, such as support for Partial Reliability and Stream Scheduling.
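As a reminder of the programming model, requesting several streams on an SCTP socket is a one-liner at setup time (a minimal sketch using the lksctp-tools API):

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>

int open_sctp_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
    /* Ask for 8 outbound/inbound streams at association setup;
     * messages can then be sent on independent stream numbers,
     * avoiding head-of-line blocking between streams. */
    struct sctp_initmsg init = {
        .sinit_num_ostreams  = 8,
        .sinit_max_instreams = 8,
    };
    if (fd >= 0)
        setsockopt(fd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));
    return fd;
}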
Linux SCTP arrived late and was stuck: it wasn't as up to date as the released RFCs, it was far behind other systems such as BSD, and it also suffered from performance problems. In the past two years we dedicated ourselves to addressing these gaps and focused on making many improvements. Now all the features from released RFCs are fully supported in Linux, and some from draft RFCs are already in progress. Besides, we've seen an obvious improvement in performance in various scenarios.
In this talk we will first do a quick review on SCTP basics, including:
Then go through the improvements that were made in the past 2 years,
including:
We will finish by reviewing a list of what is on our radar as well as next
steps, like:
Given its power and complexity, SCTP is destined to face many challenges and threats, but we believe that we have made it, and will continue to make it, better not only than the implementations on other systems but also than other transport protocols.
Please join us, Linux SCTP needs your help too!
The Linux Plumbers 2018 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.
Our objective is to gather leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.
Plans for participants and talks will be similar to last year's (https://blog.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/641).
This proposal is to gather hackers interested in improving the thermal subsystem in the Linux kernel and its interaction with hardware- and userspace-based policies. Nowadays, given the nature of the workloads and the wide spectrum of devices that Linux is used in, the people interested in improving the thermal subsystem come from different backgrounds and bring use cases from diverse thermally constrained systems, ranging from embedded devices to systems with high computing power. Despite the heterogeneity of software solutions to control thermals, the thermal subsystem is still at the core of many of them, including policies that rely on hardware-configured thresholds, interactions with firmware-based control loops, and policies that rely on userspace daemons. Therefore, this micro-conference aims at gathering the thermally interested developers of the community to discuss further improvements of the Linux thermal subsystem.
The current thermal sysfs interfaces are really designed mostly for debugging and are not optimized to handle thermal events in user space. The current notification mechanism using netlink sockets is slow and adds additional computation and latency. The same holds true for sysfs-based reading of temperature, which needs constant polling to identify trends.
In the past I proposed a bridge to Linux IIO, which was agreed to in principle but never merged into mainline Linux. Here I will propose one more solution for further discussion.
Discuss the following topics:
Discussion around some mobile usecases that don't fit very well in the current framework and proposals for possible solutions.
These include virtual sensors, hierarchical thermal zones, support for multiple sensors per thermal zone, and extending governors to tackle temperature ranges.
Thermal governors can respond to an overheat event for a CPU by capping the CPU's maximum possible frequency. This in turn means that the maximum available compute capacity of the CPU is restricted. But the Linux scheduler is unaware of maximum CPU capacity restrictions imposed by thermal activity. This session aims to discuss potential solutions for thermal framework/scheduler interactions.
A discussion around using idle injection as a means to do thermal management
A discussion on creating virtual temperature sensors that act as aggregators for physical sensors, thereby allowing all the framework operations on them.
Discuss miscellaneous topics from the audience and summarize the Thermal MC
There is once more renewed interest in clang kernel builds; this has progressed to the point where some vendors (e.g., some Android devices) are shipping clang-built kernels in production, though not on x86. There are people looking at this from both the toolchain and kernel points of view on Arm and x86, and work on getting KernelCI able to provide good automated testing coverage.
Plumbers seems like a great time to have a BoF to sync up on how to approach things, getting toolchain engineers and kernel engineers from different companies together.
Key participants include Mark Brown, Arnd Bergmann, Todd Kjos,
Nick Desaulniers, Behan Webster and Adhemerval Zanella.
Clang has become a viable C/C++ compiler -- it is used as the primary compiler in Android, OpenMandriva and various BSDs these days.
Most parts of a modern Linux system can be built with Clang - but some core system components, including the kernel and some low-level core libraries (most notably glibc), are exceptions to the rule.
Let's explore what needs to be done to make the core system compatible with both modern toolchains (clang 7.x/master and gcc 8.x/master).
Come chat about getting Gen-Z supported for real in Linux.
Gen-Z (https://genzconsortium.org/) is a new system interconnect that blends capabilities of DDR, PCI, USB, Infiniband and Ethernet. Come to this BOF to discuss how best to integrate Gen-Z into Linux.
Why are Linux copy tools so problematic - not even calling the kernel copy APIs, and having a terrible lack of features compared to other OSes? Why is something as simple as copying a file painful for some network and cluster (and even local) file systems, and which tools (rsync, cp, gio copy etc.) should we extend first to add the missing performance features that many filesystems need?
How can we consistently expose metadata and file system information across groups of file systems that have common information? The integration of a new xstat/statx call last year was progress, but let's discuss whether it should be extended (to allow reporting of additional attributes e.g. for the cloud), and also the status of the proposed "file system info" new system call and the proposed new mount syscall.
In addition, there are key security features (including fs-verity and RichACLs) that can be discussed.
The Linux File System layer is one of the most active areas of the kernel, and changes in the VFS, especially for network and cluster file systems benefit from discussions like these.
[Note: Moved to Kernel Summit Track.]
Linux Plumbers 2018: ZUFS - Zero Copy User-Mode FileSystem - One year Later
One year after its inception there are real, hardened FSs, real performance measurements, and many innovative features. But is it ready for upstream?
A few highlights:
In the talk I will give a short architectural and functional overview, then go over some of the leftover challenges, and finally I hope to engage in an open discussion of how this project should move forward to be accepted into the kernel and gain more users and FS implementations.
Cheers
Boaz
This is probably a better fit as a CfP in either the containers or BPF microconferences.
seccomp is a critical component for ensuring safe containerization of untrusted processes. But at Oracle we are observing that this security often comes with an expensive performance penalty. I would like to start a discussion of how we can improve seccomp's performance without compromising security.
Below is an open RFC I have in libseccomp that should significantly improve its performance when processing large filters. I would like to discuss other performance improvement possibilities - eBPF in general, eBPF hash map support, whitelists vs blacklists, etc. I would gladly take requests and ideas and try to incorporate them into libseccomp and seccomp as appropriate.
https://github.com/seccomp/libseccomp/issues/116
Several in-house Oracle customers have identified that their large seccomp filters are slowing down their applications. Their filters largely consist of simple allow/deny logic for many syscalls (306 in one case) and for the most part don't utilize argument filtering.
After invaluable feedback from Paul Moore and Kees Cook, I have chosen to pursue a cBPF binary tree to improve performance for these customers. A cBPF binary tree requires no kernel changes and should be transparent for all libseccomp users.
I currently have a working prototype and the numbers are very promising. I modified gen_bpf.c and gen_pfc.c to utilize a cBPF binary tree if there are 16 or more syscalls being filtered. I then timed calling getppid() in a loop using one of my customers' seccomp filters. I ran this loop one million times and recorded the min, max, and mean times (in TSC ticks) to call getppid(). (I didn't disable interrupts, so the max time was often large.) I chose to report the minimum time because I feel it best represents the actual time to traverse the filter.
Test case                                              Minimum TSC ticks to make syscall
----------------------------------------------------------------
seccomp disabled                                       138
getppid() at the front of 306-syscall seccomp filter   256
getppid() in middle of 306-syscall seccomp filter      516
getppid() at the end of the 306-syscall filter         1942
getppid() in a binary tree                             312
As shown in the table above, a binary tree can significantly improve syscall performance in the average and worst case scenarios for these customers.
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Historically, kernels that ran on Android devices have typically been 2+ years old compared to mainline (this year's flagship devices are shipping with 4.9 kernels) and because of the challenges associated with updating, most devices in the field are far behind the latest long-term stable (LTS) releases. The Android team has been gradually putting in place the necessary processes and enhancements to permanently bridge this gap. Much of the work on the Android kernel in 2018 focused on improving the ability to update the kernel -- at least to recent LTS levels. This work comprises a significant testing effort to ensure downstream partners that updating to new LTS levels is safe, as well as process work to convince partners that the security benefits of taking LTS patches far outweigh the risk of new bugs. The testing also focuses on ABI consistency (within LTS releases) for interfaces relied upon by userspace and kernel modules. This has resulted in enhancements to the LTP suite and a new proposal to the mailing list for "kernel namespaces".
Additionally, the Android kernel testing benefits from additional tools developed by Google that are enabled via the Clang compiler. Google's devices have been shipping kernels built via Clang for 2 years. The Android team tests and assists in maintaining arm and arm64 kernel builds with clang.
The talk will also cover some of the key features being developed for Android and introduce topics that will be discussed during the Android Micro-Conference.
Heterogeneous computing uses massively parallel devices, such as GPUs, to crunch through huge data sets. This talk intends to present the issues, challenges and problems related to memory management and heterogeneous computing -- issues and problems that stem from having one address space per device, which makes exchanging or sharing data sets between devices and CPUs hard, complex and error-prone.

Solutions involve a unified address space between devices and CPUs, often called SVM (Shared Virtual Memory) or SVA (Shared Virtual Address). In such a unified address space, a virtual address valid on the CPUs is also valid on the devices. The talk will address both hardware and software solutions to this problem. Moreover, it will consider ways to preserve the ability to use device memory in those schemes.

Ultimately this talk is an opportunity to discuss memory placement, as for NUMA architectures, in a world where we not only have to worry about CPUs but also about devices like GPUs and their associated memory.

As if that were not enough, we now also have to worry about the memory hierarchy of each CPU or device: a hierarchy going from fast High Bandwidth Memory (HBM) to main memory (DDR DIMMs), which can be orders of magnitude slower, and finally to persistent memory, which is large in size but slower still and with higher latency.
It is well known that developers do not like writing documentation. But although documenting the code may seem dull and unrewarding, it has definite value for the writer.
When you write the documentation you gain an insight into the algorithms, design (or lack of such), and implementation details. Sometimes you see neat code and say "Hey, that's genius!". But sometimes you discover small bugs or chunks of code that beg for refactoring. In any case, your understanding of the system significantly improves.
I'd like to share the experience I had with Linux memory management documentation: what its state was a few months ago, what has been done, and where we are now.
The work on the memory management documentation is in progress and the question "Where do we want to be?" is definitely a topic for discussion and debate.
The first rule of kernel maintenance is that there are no hard and fast rules. While there are several documents and guidelines on patch contribution, advice on how to serve in a maintainer role has historically been tribal knowledge. This organically grown state of affairs is both a source of strength and a source of friction. It has served the community well to be adaptable to the different personalities and technical problem spaces that inhabit the kernel community. However, that variability also leads to inconsistent experiences for contributors across subsystems, insufficient guidance for new maintainers, and excess stress on current maintainers. As the Linux kernel project expects to continue its rate of growth, it needs to be able both to scale the maintainers it has and to ramp up new ones without necessarily requiring them to make a decade's worth of mistakes to become proficient.
The presentation makes the case for why a maintainer handbook is needed, including frequently committed mistakes and commonly encountered pain points. It broaches the "whys" and "hows" of contributors having significantly different experiences with the Linux kernel project depending on what subsystem they are contributing to. The talk is an opportunity to solicit maintainers in the audience on the guidelines they would reference on an ongoing basis, and it is an opportunity for contributors to voice wish list items when working with upstream. Finally, it will be a call to action to expand the document with subsystem-local rules of the road where those local rules differ from, or go beyond, the common recommendations.
Since 2004 a project has been going on to make the Linux kernel into a true hard real-time operating system. This project has become known as PREEMPT_RT (formerly the "real-time patch"). Over the past decade, there was a running joke that this year PREEMPT_RT would be merged into the mainline kernel, but that has never happened. In actuality, it has been merged in pieces. Examples of what came from PREEMPT_RT include: mutexes, high resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The only thing left is turning spin_locks into mutexes, and that is now mature enough to make its way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged!
Getting PREEMPT_RT into the kernel was a battle, but it is not the end of the war. Once PREEMPT_RT is in mainline, there's still much more work to be done. The RT developers have been so focused on getting RT into mainline, that little has been thought about what to do when it is finally merged. There is a lot to discuss about what to do after RT is in mainline. The game is not over yet.
POSIX condition variables (condvars) provide a commonly used interprocess communication mechanism: threads can queue up and wait for an event before continuing. The glibc implementation of condvars in 2009 was not suitable for use in real-time systems due to a potential priority inversion. A fix has been available and used in many real-time systems since that time. A recent change to glibc to address a POSIX compliance issue with condvars broke that fix and modified the implementation in such a way as to prevent real-time usage of glibc condvars, introducing a new type of priority inversion.
The real-time use case places constraints on condvar usage patterns, such as requiring a PI mutex to be associated with the condvar prior to a signal or broadcast. Most importantly, the implementation must always wake the waiters in priority FIFO order. To address this usage, the librtpi project provides a narrow real-time specific implementation of the condition variable mechanism in a way which can be readily used in lieu of the glibc implementation.
We will discuss the motivation and the current state of the project, as well as the long term strategy for a stable real-time condvar implementation.
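To make the constraint concrete, the usage pattern librtpi targets looks like this (a minimal sketch; it compiles and runs against glibc, but glibc's condvar does not guarantee the priority-FIFO wakeup order described above):

#include <pthread.h>

static pthread_mutex_t lock;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;

static void init_pi_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* the condvar must be paired with a priority-inheritance mutex */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &attr);
}

static void wait_for_event(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                          /* tolerate spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}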
Xenomai offers a wonderful debugging feature: whenever a realtime thread calls a non-rt-safe syscall, SIGXCPU is delivered. That is particularly helpful for users who build their applications on top of libraries, where it is often not clear what the side effects of a library call are.
What options are there to implement something similar to SIGXCPU? A simple prototype using BPF showed the limits of BPF (a similar experience to what Joel describes in "BPFd: Running BCC tools remotely across systems and architectures", https://lwn.net/Articles/744522/).
A short brainstorming session with Steven, Julia and Clark showed that there are a few options to achieve the goal. In this session I would like to discuss the options (or even show what has been achieved so far).
Key participants:
- Steven Rostedt
- Julia Cartwright
- Clark Williams
- Sebastian Andrzej Siewior
- Thomas Gleixner
Julia has worked with an extreme-value-analysis tool that can analyze a lot of data. Given various output runs of jitterdebug, which collects all jitter data (outliers and all), could such a tool be useful for analyzing the data jitterdebug produces?
PREEMPT_RT's current metric, latency, is good. It has helped to guide the development of preempt_rt for more than a decade. However, in real-time analysis, the principal metric for a system is the response time of its tasks. Generally, in addition to the latency, the response time of a task comprises the task's execution time, the blocking time on locks, the overhead associated with scheduling and so on. Although we can think of ways to measure such values on Linux, we 1) don't have a single/standardized way to do this, and 2) don't run regression tests to see if things changed from one version to another.
This talk will discuss these points, collecting ideas on how to proceed toward
the development of new metrics and ways to use them to test the kernel-rt.
Lots of research has been done on issues like CPU affinity, bandwidth inheritance and cgroup support, but nothing has made it into the kernel. Let's make a commitment and push these ideas forward into mainline.
Setting RT priority inside a user namespace is not allowed, even for a mapped root UID. The use case is to be able to run RT-priority processes inside a user namespace. Should there be a way to allow this, subject to cgroup RT limits, if a cgroup is dedicated to the user namespace?
Is there a need to change anything in how we maintain the stable-rt trees? Or should we focus all effort on supporting the mainline tree?
Potential attendees:
- Steven Rostedt
- Sebastian Andrzej Siewior
- Thomas Gleixner
- Tom Zanussi
- Julia Cartwright
The fully preemptive preemption model, along with real-time mutexes, is among the main features of PREEMPT RT.
How do we check that we are respecting all the rules for these features? For example, how do we check that changes in the kernel are not breaking the preemption or locking model?
For locking, we already have an answer: Lockdep!
But how about the preemption model?
The presenter has a preliminary formalization of the preemption model, and he wants to discuss how to implement the validator of the model. Should it be in kernel or user-space? Tracing or a "validator" like lockdep?
Topics
- Binding and Devicetree Source/DTB Validation: update and next steps
  - Binding specification format
  - Validation process
  - How to validate overlays
- Devicetree Specification: update and next steps
- Reducing devicetree memory and storage size
- Overlays
  - Bootloader and Linux kernel implementation update
  - Remaining blockers and issues
  - Use cases
- Devicetree compiler (dtc)
- Next version of DTB/FDT format
  - Motivated by the desire to replace metadata being encoded as normal data (metadata for overlays)
  - Other desired changes should be considered
- Boot and run-time configuration
  - Pain points and needs
- Multi-bus devices
- Feedback from the trenches
  - How DTOs are used in embedded devices in practice
  - In U-Boot and Linux
  - In systems with FPGAs
- Use of devicetrees in small code/data space (e.g. U-Boot SPL)
- Connector node bindings
- FPGA issues
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Containers (or Operating System based Virtualization) are an old
technology; however, the current excitement (and consequent
investment) around containers provides interesting avenues for
research on updating the way we build and manage container technology.
The most active area of research today, thanks to concerns raised by
groups supporting other types of virtualization, is in improving the
security properties of containers.
The first step in improving security is actually being able to measure
it in the first place, so the initial goal of a research programme for
container security involves finding that measure. In this talk I'll
outline one such measure (attack profiles) developed by IBM research,
the useful results that can be derived from it, the problems it has
and the avenues that can be explored to refine future measurements of
containment.
Contrary to popular belief, a "container" doesn't describe one fixed
thing, but instead is a collective noun for a group of isolation and
resource control primitives (in Linux terminology called namespaces
and cgroups) the composition of which can be independently varied. In
the second half of this talk, we'll explore how containment can be
improved by replacing some of the isolation primitives with local
system call emulation sandboxes, a promising technique used by both
the Google gVisor and the IBM Nabla secure container systems. We'll
also explore the question of whether sandboxes are the end point of
container security research or merely point the way to the next
frontier for container abstraction.
Using graphics cards for compute acceleration has been a major shift in technology lately, especially around AI/ML and HPC.
Until now the clear market leader has been the CUDA stack from NVIDIA, which is a closed source solution that runs on Linux. Open source applications like tensorflow (AI/ML) rely on this closed stack to utilise GPUs for acceleration.
Vendor aligned stacks such as AMD's ROCm and Intel's OpenCL NEO are emerging that try to fill the gap for their specific hardware platforms. These stacks are very large, and don't share much if any code. There are also efforts inside groups like Khronos with their OpenCL, SPIR-V and SYCL standards being made to produce something that can work as a useful standardised alternative.
This talk will discuss the possibility of creating a vendor-neutral reference compute stack, based around open source technologies and open source development models, that could execute compute tasks across multiple vendors' GPUs, using SYCL/OpenCL/Vulkan and the open-source Mesa stack as the basis on which to develop tools and features as part of a desktop OS.
This talk doesn't have all the answers, but it wants to get people considering what we can produce in the area.
Side channel attacks are here to stay. What can we do inside the operating system to proactively defend against them? This talk will walk through a few of the ideas that Intel’s Open Source Technology Center is developing to improve our resistance to side channel attacks as part of our new side channel defense project. We would also like to gather ideas from the rest of the community on what our top priorities for side channel defense for the Linux kernel should be.
Plugging in USB sticks, building VM images, and unprivileged containers all give rise to a situation where users are mounting and dealing with filesystem images they have not built themselves, and don't necessarily want to trust.
This leads to the problem of how to mount and read/write those filesystems without opening yourself up to more risk than visiting a web page.
I will survey what has been built already, describe the technical challenges, and outline the problems ahead.
With this talk I hope to unite the various groups across the Linux ecosystem that care about this problem and get the discussion started on how we can move forward.
A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.
Official Networking Track website: http://vger.kernel.org/lpc-networking2018.html
phylib has provided the API Ethernet MAC drivers have used to control
Copper PHYs for many years. However with the advent of MACs/PHYs with
bandwidth of > 1Gbps, SERDES interfaces and fibre optical modules,
phylib is not sufficient. phylink provides an API which MAC drivers
can use to control these more complex and dynamic, possibly
hot-pluggable PHYs. This presentation will explain why phylink is
needed, how it differs from phylib, and describe how to convert a MAC
driver from phylib to phylink in order to make use of its new
features. The kernel support for SFP modules will also be detailed,
including how the MAC needs to handle hot-plugging of the PHY, which
can be copper or fibre.
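As a taste of what such a conversion involves, here is a minimal sketch of the phylink ops a MAC driver supplies, assuming the circa-v4.18 API (check the current headers; the foo_* names are illustrative, not a real driver):

    #include <linux/phylink.h>

    static void foo_validate(struct net_device *ndev, unsigned long *supported,
                             struct phylink_link_state *state)
    {
            /* Clear bits in 'supported' for link modes the MAC cannot do. */
    }

    static void foo_mac_config(struct net_device *ndev, unsigned int mode,
                               const struct phylink_link_state *state)
    {
            /* Program MAC speed/duplex/pause from the resolved state;
             * called for both copper and fibre, including SFP hot-plug. */
    }

    static const struct phylink_mac_ops foo_phylink_ops = {
            .validate   = foo_validate,
            .mac_config = foo_mac_config,
            /* .mac_link_state, .mac_an_restart, .mac_link_up and
             * .mac_link_down omitted here; a real driver needs them all. */
    };

The probe path then creates a phylink instance with these ops in place of the old phy_connect() logic, and the driver starts and stops the link through the phylink_* calls instead of phy_start()/phy_stop().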
This talk is divided into two parts. First we present kTLS, the current kernel's
sockmap BPF architecture for L7 policy enforcement, and the kernel's ULP and
strparser frameworks, which are utilized by both in order to hook into socket
callbacks and determine message boundaries for subsequent processing.
We further elaborate on the challenges we face when trying to combine kTLS with the
power of BPF for the eventual goal of allowing in-kernel introspection and policy
enforcement of application data before encryption. Besides others, this includes a
discussion on various approaches to address the shortcomings of the current ULP layer,
optimizations for strparser, and the consolidation of scatter/gather processing for
kTLS and sockmap as well as future work on top of that.
UDP is a popular foundation for new protocols. It is available across
operating systems without superuser privileges and widely supported
by middleboxes. Shipping protocols in userspace on top of
a robust UDP stack allows for rapid deployment, experimentation
and innovation of network protocols.
But implementing protocols in userspace has limitations. The
environment lacks access to features like high resolution timers
and hardware offload. Transport cost can be high. Cycle count of
transferring large payloads with UDP can be up to 3x that of TCP.
In this talk we present recent and ongoing work, both by the authors
and others, at improving UDP for content delivery.
UDP Segmentation offload amortizes transmit stack traversal by
sending as many as 64 segments as one large fused packet.
The kernel passes this through the stack as one datagram, then
splits it into multiple packets and replicates their network and
transport headers just before handing to the network device.
Some devices can offload segmentation for exact multiples of
segment size. We discuss how partial GSO support combines the
best of software and hardware offload and evaluate the benefits of
segmentation offload over standard UDP.
With these large buffers, MSG_ZEROCOPY becomes effective at
removing the cost of copying in sendmsg, often the largest
single line item in these workloads. We extend this to UDP and
evaluate it on top of GSO.
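To make the interfaces concrete, here is a minimal userspace sketch combining the two (UDP_SEGMENT is in Linux 4.18+; the fallback constants match the upstream values; error handling elided):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103            /* upstream value */
    #endif
    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    ssize_t send_fused(int fd, const char *buf, size_t len)
    {
            int gso_size = 1400;       /* payload bytes per wire segment */
            int one = 1;

            /* The stack treats buf as a single datagram and splits it
             * into gso_size-byte segments only late in transmit. */
            setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
                       &gso_size, sizeof(gso_size));

            /* Pin user pages instead of copying; completion
             * notifications arrive on the socket error queue. */
            setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

            return send(fd, buf, len, MSG_ZEROCOPY);
    }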
Bursting too many segments at once can cause drops and retransmits.
SO_TXTIME adds a release time interface which allows offloading of
pacing to the kernel, where it is both more accurate and cheaper.
We will look at this interface and how it is supported by queuing
disciplines and hardware devices.
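A hedged sketch of that interface (merged for Linux 4.19; the struct and constants come from linux/net_tstamp.h, with fallbacks for older headers):

    #include <linux/net_tstamp.h>     /* struct sock_txtime */
    #include <sys/socket.h>
    #include <string.h>
    #include <time.h>

    #ifndef SO_TXTIME
    #define SO_TXTIME 61
    #define SCM_TXTIME SO_TXTIME
    #endif

    static void sendto_at(int fd, const void *buf, size_t len, __u64 txtime_ns)
    {
            static const struct sock_txtime cfg = { .clockid = CLOCK_TAI };
            char control[CMSG_SPACE(sizeof(__u64))] = {};
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = control, .msg_controllen = sizeof(control),
            };
            struct cmsghdr *cm;

            /* One-time socket setup in real code. */
            setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));

            /* Ask the qdisc (or NIC) to release the packet at txtime_ns. */
            cm = CMSG_FIRSTHDR(&msg);
            cm->cmsg_level = SOL_SOCKET;
            cm->cmsg_type = SCM_TXTIME;
            cm->cmsg_len = CMSG_LEN(sizeof(__u64));
            memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

            sendmsg(fd, &msg, 0);
    }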
Finally, we look at how these transmit savings can be extended to
the forwarding and receive paths through the complement of GSO,
GRO, and local delivery of fused packets.
Among the various ways of using eBPF, OVS has been exploring the power
of eBPF in three ways: (1) attaching eBPF to TC, (2) offloading a subset of
processing to XDP, and (3) bypassing the kernel using AF_XDP.
Unfortunately, as of today, none of the three approaches satisfies the
requirements of OVS. In this presentation, we’d like to share the
challenges we faced and the lessons learned, and to seek feedback from
the community on future directions.
Attaching eBPF to TC started first with the most aggressive goal: we
planned to re-implement the entire feature set of the OVS kernel datapath
under net/openvswitch/* in eBPF code. We worked around a couple of
limitations, for example, the lack of TLV support led us to redefine a
binary kernel-user API using a fixed-length array; and without a
dedicated way to execute a packet, we created a dedicated device for
user to kernel packet transmission, with a different BPF program
attached to handle packet execute logic. Currently, we are working on
connection tracking. Although a simple eBPF map can achieve basic
operations of conntrack table lookup and commit, how to handle NAT,
(de)fragmentation, and ALG are still under discussion.
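As a rough illustration of that "simple eBPF map" approach (this is not the OVS patch set; the names and sizes are made up), lookup-or-commit against a hash map keyed by the flow tuple might look like:

    #include <linux/bpf.h>
    #include "bpf_helpers.h"   /* SEC(), struct bpf_map_def and helper
                                  stubs, as in the kernel's samples/bpf */

    struct ct_key {
            __u32 saddr, daddr;
            __u16 sport, dport;
            __u8  proto;
    };

    struct ct_entry {
            __u64 created_ns;
            __u8  established;
    };

    struct bpf_map_def SEC("maps") ct_table = {
            .type        = BPF_MAP_TYPE_HASH,
            .key_size    = sizeof(struct ct_key),
            .value_size  = sizeof(struct ct_entry),
            .max_entries = 65536,
    };

    static inline void ct_lookup_or_commit(struct ct_key *key)
    {
            struct ct_entry *ent = bpf_map_lookup_elem(&ct_table, key);

            if (!ent) {
                    struct ct_entry fresh = {
                            .created_ns = bpf_ktime_get_ns(),
                    };
                    /* "commit": remember the new connection */
                    bpf_map_update_elem(&ct_table, key, &fresh, BPF_ANY);
            }
            /* NAT, (de)fragmentation and ALGs are exactly what this
             * simple scheme does not cover. */
    }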
One layer below TC sits XDP (eXpress Data Path), a much
faster layer for packet processing, but with almost no extra packet
metadata and limited BPF helpers support. Depending on the complexity
of flows, OVS can offload a subset of its flow processing to XDP when
feasible. However, the fact that XDP supports fewer helper functions
implies that either 1) only a very limited number of flows are eligible
for offload, or 2) more flow processing logic needs to be done in
native eBPF.
AF_XDP represents another form of XDP, with a socket interface for
control plane and a shared memory API for accessing packets from
userspace applications. OVS today has another full-fledged datapath
implementation in userspace, called dpif-netdev, used by DPDK
community. By treating AF_XDP as a fast packet I/O channel, the
OVS dpif-netdev can satisfy almost all existing features. We are
working on building the prototype and evaluating its performance.
RFC patch:
OVS eBPF datapath.
https://www.mail-archive.com/iovisor-dev@lists.iovisor.org/msg01105.html
Over the last 10 years the world has seen NICs go from single port,
single netdev devices, to multi-port, hardware switching, CPU/NFP
having, FPGA carrying, hundreds of attached netdev providing,
behemoths. This presentation will begin with an overview of the
current state of filtering and scheduling, and the evolution of the
kernel and networking hardware interfaces. (HINT: it’s a bit of a
jungle we’ve helped grow!) We’ll summarize the different kinds of
networking products available from different vendors, and show the
workflows of how a user can use the network hardware
offloads/accelerations available and where there are still gaps. Of
particular interest to us is how to have a useful, generic hardware
offload supporting infrastructure (with seamless software fallback!)
within the kernel, and we’ll explain the differences between deploying
an eBPF program that can run in software, and one that can be
offloaded by a programmable ASIC based NIC. We will discuss our
analysis of the cost of an offload, and when it may not be a great
idea to do so, as hardware offload is most useful when it achieves the
desired speed and requires no special software (kernel changes). Some
other topics we will touch on: the programmability exposed by smart
NICs is more than that of a data plane packet processing engine and
hence any packet processing programming language such as eBPF or P4
will require certain extensions to take advantage of the device
capabilities in a holistic way. We’ll provide a look into the future
and how we think our customers will use the interfaces we want to
provide both from our hardware, and from the kernel. We will also go
over the matrix of most important parameters that are shaping our HW
designs and why.
Today every packet reaching Facebook’s network is processed by an XDP-enabled application. We have been using XDP for more than 1.5 years, and this talk is about the evolution of XDP and BPF as driven by our production needs. I’m going to talk about the history of changes in core BPF components, and show why and how they were made: what performance improvements we got (with synthetic and real-world data) and how they were implemented. I’m also going to talk about issues and shortcomings of BPF/XDP which we have found during our operations, as well as some gotchas and corner cases. In the end we are going to discuss what is still missing and which parts could be improved.
Topics and areas of existing BPF/XDP infrastructure which are going to be covered in this talk:
Lessons we have learned while operating XDP:
Missing parts: what could be added, and why:
Currently the Linux kernel implements two distinct datapaths for Open
vSwitch: the OVS kernel datapath and the TC datapath. The latter was
added recently, mainly to allow HW offload, while the former is usually
preferred for SW-based forwarding for functional and performance reasons.
We evaluate both datapaths in a typical forwarding scenario - the PVP
test - using the perf tool to identify bottlenecks in the TC SW dp.
While similar steps usually incur similar costs, the TC SW DP
requires an additional, per-packet, skb_clone, due to a TC actions
constraint.
We propose to extend the existing act infrastructure, leveraging the
ACT_REDIRECT action and the bpf redirect code, to allow clone-free
forwarding from the mirred action, and then re-evaluate the datapaths'
performance: the gap is then almost closed.
Nevertheless, TC SW performance can be further improved by completing
the RCU-ification of the TC actions and expanding the recent
listification infrastructure to the TC (ingress) hook. We also plan to
compare the TC SW datapath with a custom eBPF program implementing an
equivalent flow set, to set a reference value for the target
performance.
eBPF (extended Berkeley Packet Filter) has been shown to be a flexible
kernel construct used for a variety of use cases, such as load balancing,
intrusion detection systems (IDS), tracing and many others. One such
emerging use case revolves around the proposal made by William Tu for
the use of eBPF as a data path for Open vSwitch. However, there are
broader switching use cases developing around the use of eBPF capable
hardware. This talk is designed to explore the bottlenecks that exist in
generalising the application of eBPF further to both container switching as
well as physical switching.
Topics that will be covered include proposals for container isolation through
the use of features such as programmable RSS, the viability of physical
switching using eBPF capable hardware as well as integrations with other
subsystems or additional helper functions which may improve the possible
functionality.
Linux currently provides mechanisms for managing and allocating many of the system resources, such as CPU, memory, etc. Network resource management is more complicated: unlike, say, CPU management, networking deals not only with a local resource but can also deal with a global one. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.
For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the hierarchical token bucket (htb) queuing discipline, which is usually required for anything other than simple shaping needs. A second drawback is the lack of flexibility usually provided by general programming constructs.
We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.
Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. This is not a bad thing, since it has been fifteen years since the "free lunch" of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.
Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.
We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.
Cgroup accounting has significant overhead due to the need to constantly loop over all cpus to update statistics of cpu usage and blocked averages. We have seen that on a 4-socket Haswell with a 4.4 kernel, database benchmarks like TPCC had an 8% performance regression when run under cgroup. On a recent Cannon Lake platform using the latest PCIe SSDs and a 4.18 kernel, the regression in the scheduler has gotten worse, to 12%. We will highlight the bottlenecks in the scheduler with detailed profiles of the hot path. We'd like to explore possible avenues to improve cgroup accounting.
This session will discuss two possible approaches to live-updating Linux running as a hypervisor, without a noticeable effect on running Virtual Machines (VMs). One method is to use a cooperative multi-OSing paradigm to share the same machine between two kernels while the new kernel is booting and the old kernel is still serving the running VM instances: allow the new kernel to live-migrate the drivers from the old kernel by using shadow class drivers, and later live-migrate the running VMs without copying their memory. The second method is to boot the new kernel in a fully virtualized environment that is the same as the underlying hardware, live-migrate the VMs into the newly booted hypervisor, and make the hypervisor transition from the VM environment to bare metal.
Summary:
In this talk I discuss scalability of load balancing algorithms in the task scheduler, and present my work on tracking overloaded CPUs with a bitmap, and using the bitmap to steal tasks when CPUs become idle.
Abstract:
The scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the push and pull sides on large systems. For pulls, the scheduler searches all CPUs in successively larger domains until an overloaded CPU is found, and pulls a task from the busiest group. This is very expensive, so search time is limited by the average idle time, and some domains are not searched. Balance is not always achieved, and idle CPUs go unused.
I propose an alternate mechanism that is invoked after the existing search limits itself and finds nothing. I maintain a bitmap of overloaded CPUs, where a CPU sets its bit when its runnable CFS task count exceeds 1. The bitmap is sparse, with a limited number of significant bits per cacheline. This reduces cache contention when many threads concurrently set, clear, and visit elements. There is a bitmap per last-level cache. When a CPU becomes idle, it finds the first overloaded CPU in the bitmap and steals a task from it. For certain configurations and test cases, this optimization improves hackbench performance by 27%, OLTP by 9%, and tbench by 16%, with a minimal cost in search time. I present schedstat data showing the change in vital scheduler metrics before and after the optimization.
For now the new stealing is confined to the LLC to avoid NUMA effects, but it could be extended to steal across nodes in the future. It could also be extended to the realtime scheduling class. Lastly, the sparse bitmap could be used to track idle cores and idle CPUs and used to optimize balancing on the push side.
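To make the idea concrete, here is a toy userspace rendition of such a sparse bitmap (illustrative only; the actual patch set, its names and its locking differ):

    #define LINE_BYTES 64                    /* one cacheline per group */
    #define USED_BITS  8                     /* significant bits per line */
    #define LONGS_PER_LINE (LINE_BYTES / (int)sizeof(unsigned long))

    /* The bit for CPU n lives in cacheline n / USED_BITS, at bit
     * n % USED_BITS, so concurrent set/clear from different CPU groups
     * touch different cachelines instead of fighting over one dense word. */
    static unsigned long overload[1024 * LONGS_PER_LINE];

    static inline unsigned long *cpu_word(int cpu)
    {
            return &overload[(cpu / USED_BITS) * LONGS_PER_LINE];
    }

    static inline void mark_overloaded(int cpu)
    {
            __atomic_fetch_or(cpu_word(cpu), 1UL << (cpu % USED_BITS),
                              __ATOMIC_RELAXED);
    }

    static inline void clear_overloaded(int cpu)
    {
            __atomic_fetch_and(cpu_word(cpu), ~(1UL << (cpu % USED_BITS)),
                               __ATOMIC_RELAXED);
    }

    /* First overloaded CPU, or -1: one cacheline scanned per USED_BITS CPUs. */
    static inline int first_overloaded(int ncpus)
    {
            for (int cpu = 0; cpu < ncpus; cpu += USED_BITS) {
                    unsigned long w = *cpu_word(cpu);
                    if (w)
                            return cpu + __builtin_ctzl(w);
            }
            return -1;
    }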
1) Scalability of scheduler idle cpu and core search on systems with large number of cpus
Current select_idle_sibling first tries to find a fully idle core using select_idle_core, which can potentially search all cores; if that fails, it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. These searches don't scale for large llc domains and will only get worse with more cores in the future. Spending too much time in the scheduler hurts the performance of very context-switch-intensive workloads. A more scalable way to do the search is desirable, one that is not O(no. of cpus) or O(no. of cores) in the worst case.
2) Scalability of idle cpu stealing on systems with large number of cpus and domains
When a cpu becomes idle it tries to steal threads from other overloaded cpus using idle_balance. idle_balance does more work because it searches widely for the busiest CPU to offload, so to limit its CPU consumption, it declines to search if the system is too busy. A more scalable/lightweight way of stealing is desirable so that we can always try to steal with very little cost.
3) Discuss workloads that use pipes and can benefit from pipe busy waits
When a pipe is full or empty a thread goes to sleep immediately. If the wakeup happens very quickly, the sleep/wakeup overhead can hurt a very context-switch-sensitive workload that uses pipes heavily. A few microseconds of busy waiting before sleeping can avoid the overhead and improve performance. Network sockets have a similar capability. So far hackbench with pipes shows huge improvements; we want to discuss other potential use cases.
Huge pages are essential to addressing performance bottlenecks
since the base page sizes are not changing while the amount of memory is
ever increasing. Huge pages can address TLB misses but also memory
overhead in the Linux kernel that arises through page faults and other
compute intensive processing of small pages. Huge pages are required
with contemporary high speed NVME ssds to reach full throughput because
the I/O overhead can be reduced and large contiguous memory I/O can then
be scheduled by the devices. However, using huge pages often requires the
modification of applications if transparent huge pages cannot be used.
Transparent huge pages also require application specific setup to work
effectively.
Flexible workqueues: Currently we have two pool setups for workqueues: 1) the per-cpu workqueue pool and 2) the unbound workqueue pool. The former requires the users of workqueues to have some knowledge of cpu online state, as shown in:
https://lore.kernel.org/lkml/20180625224332.10596-2-paulmck@linux.vnet.ibm.com/T/#u
The latter (unbound workqueue) only has one pool per NUMA node, and that may hurt scalability if we want to run multiple tasks in parallel inside a NUMA node.
Therefore, there is a clear requirement for a workqueue setup that provides a flexible level of parallelism (i.e. one that can run as many tasks as possible in parallel while saving users from worrying about races with cpu hotplug).
We'd like to have a session to talk about requirement and possible solution.
Certain CPU-intensive tasks in the kernel can benefit from multithreading, such as zeroing large ranges of memory, initializing massive state (struct page) at boot, VFIO page pinning, XFS quotacheck, and freeing memory on munmap/exit. There is currently no interface that provides this service. ktask is a framework built on workqueues that splits up the work, chooses the number of threads to use, synchronizes these threads, and load balances the work between them. I want to discuss current issues with this work, including allowing ktask threads to play well with the scheduler, cgroup awareness so ktask threads are throttled appropriately, and appropriately enabling ktask according to power management settings.
The mmap_sem has long been a contention point in the memory management
subsystem. In this session some mmap_sem related topics will be
discussed. Some optimizations have been merged in the upstream kernel to
avoid holding mmap_sem for write for excessive periods of time in the
munmap path, by downgrading the write mmap_sem to read. And some
optimizations are under discussion on the mailing list, i.e. releasing
mmap_sem earlier for page cache readahead, and speculative page fault.
There is still optimization room by figuring out just what mmap_sem
protects. It covers access to many fields in the mm_struct structure.
It is also used for the virtual memory area (VMA) red-black tree, the
process VMA list, and various fields within the VMA structure itself.
Finer-grained locks, i.e. a range lock or a per-VMA lock, might be
better replacements for mmap_sem to reduce contention.
Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other areas where there is ongoing work include the low memory killer, dynamically-allocated Binder devices, kernel namespaces, EAS, userdata FS checkpointing and DT.
The working planning document is here:
https://docs.google.com/spreadsheets/d/1ymzOB4wapccX6t1b11T2-m9ny8buN7EuUqhCxrsmKe4
As of Linux 4.18, there are more than 30000 exported symbols in the kernel that can be used by loadable kernel modules. These exported symbols are all part of a global namespace, and there seems to be consensus among kernel devs that the export surface is too large, and hard to reason about. This talk describes a series of patches that introduce symbol namespaces, in order to more clearly delineate the export surface.
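A hedged sketch of the shape this takes (the macro names match the series as posted for review and may change before merging): the exporting module tags a symbol with a namespace, and any consumer must declare the dependency explicitly.

    /* In the module exporting the symbol: tag it with a namespace. */
    EXPORT_SYMBOL_NS(usb_stor_transparent_scsi_command, USB_STORAGE);

    /* A module using that symbol must import the namespace, otherwise
     * modpost complains about the unresolved namespace dependency. */
    MODULE_IMPORT_NS(USB_STORAGE);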
Android's transition from the in-kernel lowmemorykiller to the userspace LMKD introduced a number of challenges, including low-latency memory pressure detection and signaling, improving kill decision mechanisms, and improving SIGKILL delivery latency. This talk will focus on the memory pressure detection mechanism based on PSI (Pressure Stall Information), recently developed by Johannes Weiner. It will also present the PSI monitor module, currently under development.
The Binder driver currently does allow for the allocation of multiple binder devices through a Kconfig option. However, this means the number of binder devices the kernel will allocate is hard-coded and cannot be changed at runtime. This is inconvenient for scenarios where processes wish to allocate binder devices on the fly and the number of needed devices is not known in advance. For example, we are running large system container workloads where each container wants at least one binder device that is not shared with any other container. The number of running containers can change dynamically, which causes binder devices to be freed or allocated on demand. In this session I want to propose and discuss two distinct approaches to solving this problem:
1. /dev/binder-control: A new control device /dev/binder-control is added through which processes can allocate a free binder device or add a new one to the system.
2. binderfs: A new binderfs filesystem is added. Each mount of binderfs in a new mount (and IPC) namespace will be a new instance, similar to how devpts works. Ideally, binderfs would be mountable from non-initial user namespaces. This idea is similar to earlier proposals of a lofs (filesystem for loop devices).
This session hopes to start a fruitful discussion around the feasibility of this feature and nurture a technical discussion around the actual implementation.
Despite the continuous and encouraging improvements, AOSP stable kernels still have a certain delta with respect to mainline. Some features are still unique to AOSP (e.g. WALT or SchedTune), others are back-ports from mainline (e.g. the idle loop optimization). Whenever an existing feature is modified, or a new/backported feature is proposed for an AOSP stable kernel, apart from a thorough code review on gerrit, we would like to increase our confidence in the quality of the changes by testing their impact on a few key areas: interactivity, energy efficiency and performance. This slot will be dedicated to describing a possible solution; the main goal is to collect feedback on how to increase its adoption by AOSP common kernel contributors.
Android A/B updates allow rolling back updates that fail to boot, by rolling back the system and vendor partitions. But if an update modifies the userdata partition before failing, those modifications cannot be rolled back, and Android does not support updated userdata with old system/vendor partitions. If the file system supports snapshots, use them! We are adding snapshot support to F2FS. If there is no filesystem support, consider a block-level solution. We will discuss a dm-snap based solution vs. a new solution from Google called dm-bow.
The Android OS is stored on signed, read-only partitions. Sizing these partitions is difficult. After a few years, a major OS update may no longer fit in a device's existing partitions. To use space more intelligently, we are introducing a userspace partitioning system based on dm-linear, similar to LVM.
Android Oreo introduced some device tree bindings that MUST be specified for Android to be able to mount core partitions early (before SELinux enforcement) in the boot sequence for Project Treble. This talk is about the plans to get rid of that kernel / device tree dependency: instead, have a single global fstab for all devices. It will also cover how the device tree fstab is to be deprecated.
Android uses ashmem for sharing memory regions. We are trying to migrate all use cases of ashmem to memfd so that we can possibly remove the ashmem driver from staging in the future, while also benefiting from using memfd for shared memory in Android and contributing to improving memfd upstream. Note that staging drivers are also not ABI and generally can be removed at any time. This talk is about the current open challenges with this: patches recently sent to LKML, technical difficulties, userspace requirements, etc. One of the big difficulties is the need for a "pinning" interface. John Stultz has proposed a vrange syscall before. It would be good to reach some consensus on the direction we should go in this regard.
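For reference, a minimal sketch of the memfd side of such a conversion (userspace, glibc 2.27+; the sealing shown is illustrative of what replaces ashmem's size guarantees, while pinning has no direct memfd counterpart):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int make_region(size_t size)
    {
            int fd = memfd_create("ashmem-replacement", MFD_ALLOW_SEALING);

            if (fd < 0 || ftruncate(fd, size) < 0)
                    return -1;

            /* Freeze the size before handing the fd to other processes. */
            fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK);
            return fd;  /* share via binder/unix socket, mmap on both sides */
    }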
Android kernels are a cocktail of upstream, Android common kernels and large amounts of out-of-tree vendor code to support the SoCs and board peripherals. Android Pie now requires the board peripherals to be described using a device tree overlay. It is recommended that the drivers for those peripherals be loaded at boot time as kernel modules.
This discussion is intended to evaluate, seek feedback on, and find possible hurdles in taking this a step further for Android devices: having the SoC code loaded as kernel modules as well. This obviously facilitates faster core kernel updates on Android devices. More importantly, Android and the Linux kernel can move together year over year without having to worry about older kernels.
A short update on what Google is doing to help move partners away from proprietary interfaces for their display drivers and towards DRM/KMS and upstreaming.
Follow-up discussion to previous years about remaining work to be done to get ION driver merged upstream.
Introduction to Cuttlefish VM
Refereed track talk
Collective progress report from Android MC 2018
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Now that the XArray is in, it's time to make use of it. I've got a git tree which converts every current user of the radix tree to the XArray as well as converting some users of the IDA to the XArray.
The XArray may also be a great replacement for a list_head in your data structure.
I can also talk about currently inappropriate uses for the XArray and what I might do in future to make the XArray useful for more users.
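A minimal sketch of the conversion target (illustrative; struct my_obj is a stand-in for whatever the radix tree or IDA held):

    #include <linux/xarray.h>

    struct my_obj;

    static DEFINE_XARRAY(objects);

    int stash(unsigned long id, struct my_obj *obj)
    {
            /* xa_store takes its own lock; it returns the old entry,
             * or an error pointer which xa_err() unwraps. */
            return xa_err(xa_store(&objects, id, obj, GFP_KERNEL));
    }

    struct my_obj *fetch(unsigned long id)
    {
            return xa_load(&objects, id);   /* RCU-safe lookup */
    }

    void drop(unsigned long id)
    {
            xa_erase(&objects, id);
    }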
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Being a traditional tool with a long history, strace has been making every effort to overcome various deficiencies in the kernel API. Unfortunately, some of these workarounds are fragile, and in some cases no workaround is possible. In this talk maintainers of strace will describe these deficiencies and propose extensions to the kernel API so that tools like strace could work in a more reliable way.
Problem: there is no kernel API to find out whether the tracee is entering or exiting a syscall.
Current workarounds: strace does its best to sort out and track ptrace events. This works in most cases, but when strace attaches to a tracee that is inside exec, whose first syscall stop is syscall-exit-stop instead of syscall-enter-stop, the workaround is fragile, and in the infamous case of int 0x80 on x86_64 there is no reliable workaround.
Proposed solution: extend the ptrace API with PTRACE_GET_SYSCALL_INFO request.
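A sketch of the kind of structure such a request could fill in (hedged: the field names here are illustrative, modeled on the version posted for review, and may change):

    struct ptrace_syscall_info {
            __u8  op;               /* entry, exit, or seccomp stop */
            __u32 arch;             /* AUDIT_ARCH_* of the tracee */
            __u64 instruction_pointer;
            __u64 stack_pointer;
            union {
                    struct {        /* syscall-enter-stop */
                            __u64 nr;
                            __u64 args[6];
                    } entry;
                    struct {        /* syscall-exit-stop */
                            __s64 rval;
                            __u8  is_error;
                    } exit;
            };
    };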
Problem: there is no kernel API to invoke wait4 syscall with changed signal mask.
Current workarounds: strace does its best to implement a race-free workaround, but it is way too complex and hard to maintain.
Proposed solution: add wait6 syscall which is wait4 with additional signal mask arguments, like pselect vs select and ppoll vs poll.
Problem: time precision provided by struct rusage is too low for strace -c nowadays.
Current workarounds: none.
Proposed solution: when adding wait6 syscall, change struct rusage argument to a different structure with fields of type struct timespec instead of struct timeval.
Problem: PID namespaces have been introduced without a proper kernel API to translate between tracer and tracee views of pids. This causes confusion among strace users, e.g. https://bugzilla.redhat.com/1035433
Current workarounds: none.
Proposed solution: add translate_pid syscall, e.g. https://lkml.org/lkml/2018/7/3/589
Problem: there are no consistent declarative syscall descriptions; this forces every user to reinvent their own wheel and to catch up with the kernel.
Current workarounds: a lot of manual work has been done in strace to implement parsers of all syscalls. Some of these parsers are quite complex and hard to test. Other projects, e.g. syzkaller, implement their own representation of syscall ABI.
Proposed solution: provide declarative descriptions for all syscalls consistently.
Formal methods have a reputation of being difficult, accessible mostly to academics and of little use to the typical kernel hacker. This talk aims to show how, without "formal" training, one can use such tools for the benefit of the Linux kernel. It will introduce a few formal models that helped find actual bugs in the Linux kernel and start a discussion around future uses from modelling existing kernel implementation (e.g. cpu hotplug, page cache states, mm_user/mm_count) to formally specifying new design choices. The introductory examples are written in PlusCal (an algorithm language based on TLA+) but no prior knowledge is required.
Providing a consistent and predictable performance experience for applications is an important goal for cloud providers. Creating isolated job domains in a multi-tenant shared environment can be extremely challenging. At Google, performance isolation challenges due to memory bandwidth has been on the rise with newer workloads. This talk covers our attempt to understand and mitigate isolation issues caused by memory bandwidth saturation.
The recent Intel RDT support in Linux helps us both monitor and manage memory bandwidth use on newer platforms. However, it still leaves a large chunk of our fleet at risk of memory bandwidth issues. The talk covers three aspects of our isolation attempts:
We believe the problems and trends we have observed are universally applicable. We hope to inform and initiate discussion around common solutions across the community.
Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. In this talk, Daniel Xu will cover why the Linux kernel OOM killer is surprisingly ineffective and how oomd, a newly open-sourced userspace OOM killer, does a more effective and reliable job. Not only does the switch from kernel space to userspace result in a more flexible solution, it also directly translates to better resource utilization. His talk will also do a deep dive into the Linux kernel changes and improvements necessary for oomd to operate.
The GNU Toolchain and Clang/LLVM play a critical role at the nexus of the Linux Kernel, the Open Source Software ecosystem, and computer hardware. The rapid innovation and progress in each of these components requires greater cooperation and coordination. This Toolchain Microconference will explore recent developments in the toolchain projects, the roadmaps, and how to address the challenges and opportunities ahead as the pace of change continues to accelerate.
Current successes and future challenges for the GNU Toolchain. This talk will discuss recent improvements in GCC, GLIBC, GDB and Binutils, and future directions. How can the GNU Toolchain better engage the Linux kernel community?
CET is a security enhancement technology coming in upcoming Intel hardware. This talk will cover all the changes in software required to enable CET in the marketplace. The changes are all-encompassing, affecting the kernel, linker, compilers, libraries, applications, etc.
In 2018, people are still using glibc 2.17, which was released in February 2013, on SKX, even though the current glibc 2.28 release has new memory, string and math functions optimized for SKX. The same thing will happen five years from now.
The CPU runtime C library for x86-64, libcpu-rt-c, provides the latest memory and string functions from glibc. It is binary compatible with any x86-64 OS, and can be linked directly or loaded via LD_PRELOAD.
This is an RFC session to survey kernel features that glibc lacks (such as termios2), features glibc might require in order to correctly implement some standards (such as pthread cancellation), and how to improve communication between kernel and GNU toolchain developers.
The 32-bit RISC-V glibc port is not currently upstream so we've taken the opportunity to leave the 32-bit Linux ABI a bit slushy in the hope that we can avoid any known to be legacy interfaces. The last major interface remaining that we plan on deprecating is the 32-bit time_t interface, and while we don't want to delay our glibc release just to have a clean time_t we think it's possible to get everything done in time.
This session exists to determine if this is feasible, and assuming it is feasible how we can go about doing it.
This session gives a brief introduction to the new features introduced in AArch64 with Armv8.5 and an overview of how these features will make it into toolchains in upcoming releases.
BPF is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP, tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.
BPF has seen a lot of progress since last year's Plumbers conference and many of the discussed BPF tracing Microconference improvements have been tackled since then such as the introduction of BPF type format (BTF) to name one.
This year's BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year's event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.
Official BPF MC website: http://vger.kernel.org/lpc-bpf2018.html
BPF MC opening session.
Google servers classify, measure, and shape their outgoing traffic. The original implementation is based on Linux kernel traffic control (TC). As server platforms scale so does their network bandwidth and number of classified flows, exposing scalability limits in the TC system - specifically contention on the root qdisc lock.
Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and low-overhead prequeue ameliorate the contention up to a point. But they cannot fully resolve it. Recent changes to the Linux kernel make it possible to move classification, measurement, and packet mangling outside this critical section, potentially scaling to much higher rates while simultaneously shaping more flows and applying more flexible policies.
By moving classification and measurement to BPF at the new TC egress hook, servers avoid taking a lock millions of times per second. Running BPF programs at socket connect time with TCP_BPF converts overhead from per-packet to per-flow. The programmability of BPF also allows us to implement entirely new functions, such as runtime-configurable congestion control, first-packet classification and socket-based QoS policies. It enables faster deployment cycles, as this business logic can be updated dynamically from a user agent. The discussion will focus on our experience converting an existing traffic shaping system to a solution based on BPF, and the issues we’ve encountered during testing and debugging.
Compile-once run-everywhere can make deployment simpler and may consume fewer resources on the target host, e.g. no LLVM compiler or kernel devel package needed there. Currently, BPF programs for networking can be compiled once and run over different kernel versions. But BPF programs for tracing cannot, since they access kernel internal headers, and those headers are subject to change between kernel versions.
But compile-once run-everywhere for tracing is not easy. BPF programs can access anything in the kernel headers, including data structures, macros and inline functions. To achieve this goal, we need (1) to preserve header-level accesses in the BPF program, and (2) to abstract the header info of vmlinux. Right before program load on the target host, some kind of resolution is done for the BPF program against the running kernel, so that the resulting program is just like one compiled against the host kernel headers.
In this talk, we will explore how BTF could be used by both bpf program and vmlinux to advance the possibility of bpf program compile-once and run-everywhere.
BPF program writers today who build and distribute their programs as ELF objects typically write their programs using one of a small set of (mostly) similar headers that establish norms around ELF section definitions. One such norm is the presence of a "maps" section which allows maps to be referenced within BPF instructions using virtual file descriptors. When a BPF loader (e.g., iproute2) opens the ELF, it loads each map referred to in this section, creates a real file descriptor for that map, then updates all BPF instructions which refer to the same map to specify the real file descriptor. This allows symbolic referencing of maps without requiring writers to implement their own loaders or recompile their programs every time they create a map.
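For reference, the convention in question looks roughly like this (here in the samples/bpf flavor; iproute2's map struct differs in detail):

    #include <linux/bpf.h>
    #include "bpf_helpers.h"   /* SEC(), struct bpf_map_def, helpers */

    struct bpf_map_def SEC("maps") rx_count = {
            .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(__u64),
            .max_entries = 1,
    };

    SEC("xdp")
    int count(struct xdp_md *ctx)
    {
            __u32 key = 0;
            /* The kernel sees a real map fd here only after the loader
             * has patched the instruction referring to rx_count. */
            __u64 *val = bpf_map_lookup_elem(&rx_count, &key);

            if (val)
                    (*val)++;
            return XDP_PASS;
    }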
This discussion will take a look at how to provide similar symbolic referencing for static data. Existing implementations already templatize information such as MAC or IP addresses using C macros, then invoke a compiler to replace such static data at load time, at a cost of one compilation per load. By extending the support for static variables into ELF sections, programs could be written and compiled once then reloaded many times with different static data.
Currently, BPF can not support basic loops such as for, while, do/while, etc. Users work around this by forcing the compiler to "unroll" these control flow constructs in the LLVM backend. However, this only works up to a point. Unrolling increases instruction count and complexity on the verifier and further LLVM can not easily unroll all loops. The result is developers end up writing code that is unnatural, iterating until they find a version that LLVM will compile into a form the verifier backend will support.
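The unrolling workaround described above looks like this in practice (illustrative fragment; the bound must be a compile-time constant so clang can flatten the loop into straight-line code the verifier can walk):

    static inline __u32 sum_parts(const __u32 *part)
    {
            __u32 acc = 0;

    #pragma clang loop unroll(full)
            for (int i = 0; i < 16; i++)
                    acc += part[i];
            return acc;
    }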
We developed a verifier extension to detect bounded loops here,
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/log/?h=wip/bpf-loop-detection
This requires building a DOM tree (computationally expensive) and then matching loop patterns to find loop invariants to verify loops terminate. In this discussion we would like to cover the pros and cons of this approach. As well as discuss another proposal to use explicit control flow instructions to simplify this task.
The goal of this discussion would be to come to a consensus on how to proceed to make progress on supporting bounded loops.
eBPF has 64-bit general purpose registers, therefore 32-bit architectures normally need to use a register pair to model them, and need to generate extra instructions to manipulate the high 32 bits in the pair. Some of the overhead incurred could be eliminated if the JIT compiler knows that only the low 32 bits of a register are of interest. This can be discovered through data flow (DF) analysis techniques: either classic iterative DF analysis, or a "path-sensitive" version based on the verifier's code path walker.
In this talk, implementations of both versions of the DF analyser will be presented. We will first see how a classic, def-use-chain based eBPF DF analyser looks, and the possibility of integrating it with the previously proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which could potentially serve as a library. Then, another, "path-sensitive" DF analyser based on the existing verifier code path walker will be presented. We will discuss how function calls, path pruning and path switching affect the implementation. Finally, we will summarize the pros and cons of each, and see how each of them could be adapted to 64-bit and 32-bit architecture back-ends.
Also, eBPF has 32-bit sub-registers and associated ALU32 instructions; enabling them (-mattr=+alu32) in LLVM code-gen lets the generated eBPF sequences carry more 32-bit information, which could potentially ease the work of the flow analyser. This will be briefly discussed in the talk as well.
eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
Complex software usually depends on many different components, which sometimes perform background tasks with side effects not directly visible to their users. Without proper tools it can be hard to identify which component is responsible for performance hits or undesired behaviors.
We were challenged to implement D-Bus observability tools in embedded environments based on ARM32 or ARM64 kernels, both with 32-bit userspace. While we found bcc-tools, an open source BPF compiler collection, useful, it turned out to lack support for 32-bit environments. We extended bcc-tools with support for 32-bit architectures. Using bcc-tools we created Linux eBPF programs – small programs written in a subset of the C language, loaded from userspace and executed in kernel context. We attached them to uprobes and kprobes – special kinds of breakpoints in user and kernel space. While this worked on an ARM32 kernel based system, we faced another problem: the ARM64 kernel lacked support for uprobes set in 32-bit binaries. The 64-bit ARM Linux kernel was extended with the ability to probe 32-bit binaries.
We propose to discuss challenges we faced trying to implement bcc-tools based tracing tools on ARM devices. We present a working solution to overcome lack of support for 32-bit architectures in bcc-tools, leaving space for discussion about other ways to achieve the same result. We also introduce 32-bit instruction probing in ARM64 kernel - a solution that we found very useful in our case. As a proof of concept we present tools that monitor D-Bus usage in ARM32 or ARM64 kernel based system with 32-bit userspace. We list what needs to be done for complete eBPF-based tools to be fully usable on ARM.
eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual machine, which can be used to execute simple programs injected by the user at various hooks in the kernel, on the occurrences of events such as incoming packets. eBPF was designed to simplify the work of in-kernel just-in-time compilers, i.e. translation of eBPF intermediate representation to CPU machine code. Upstream Linux kernel currently contains JITs for all major 64-bit instruction set architectures (ISAs) (x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit translators (ARM, x86, also NFP - Netronome Flow Processor).
The eBPF virtual machine, with its clearly defined semantics, is a very good vehicle for enabling programming of custom hardware. From storage devices to networking processors, most host I/O controllers today are built on, or accompanied by, general purpose processing cores, e.g. ARM. As vendors try to expose more and more capabilities of their hardware, using a general purpose machine definition like eBPF to inject code into hardware directly allows us to avoid the creation of vendor-specific APIs.
In this talk I will describe the eBPF offload mechanisms which exist today in the Linux kernel and how they compare to other offloading stacks, e.g. for compute or graphics. I will present proof-of-concept work on reusing existing eBPF JITs for a non-host architecture (e.g. the ARM JIT on x86) to program an emulated device, followed by a short description of the eBPF offload for NFP hardware as an example of a real-life offload.
eBPF-based traffic policer as a replacement* of Hierarchical Token Bucket queuing discipline.
The key idea is the two rate three color marker (RFC 2698) algorithm, whose inputs are committed and peak rates with the corresponding burst sizes, and whose output is a color, or category, assigned to a packet. There are conforming, exceeding and violating categories. An action is applied to the violating category - either drop or DSCP remark. Another action may optionally be applied to the exceeding category.
Close-up of the eBPF implementation**. Write intensiveness is a cornerstone: an update of the available tokens is required on each packet, as well as tracking of time. A naive implementation and its exposure to data races on multi-core systems; the problem of updating both the timestamp and the number of available tokens atomically. Slicing the timeline into chunks the size of the burst duration as a solution for the races: each packet maps into its chunk, so there is no need to update a global timestamp. Two approaches to storing timeline chunks: a bpf LRU hash map, or a block of timeline chunks in a bpf array, circulating over the block. Pros and cons of the latter approach: lock-free, with the bpf array as the only data structure used, vs. an increased amount of locked memory.
Combining several policers: a linear chain of policers instead of a hierarchy, passing a packet over the chain. Dealing with bandwidth underutilization when the first K policers in a chain conform a packet and policer K+1 rejects it. The commutative property of chained policers. Interaction with UDP and TCP: TCP reacts to a drop by changing its congestion window, which affects the actual rate.
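A stripped-down fragment of the timeline-chunk idea (illustrative, not the production code; the naive reset shown here is still racy, which is exactly the kind of detail the talk goes into):

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    #define NCHUNKS  64
    #define BURST_NS (10 * 1000 * 1000ULL)   /* burst duration per chunk */

    struct chunk {
            __u64 slice;    /* which time slice this chunk last served */
            __u64 tokens;   /* bytes still allowed within that slice */
    };

    struct bpf_map_def SEC("maps") chunks = {
            .type        = BPF_MAP_TYPE_ARRAY,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(struct chunk),
            .max_entries = NCHUNKS,
    };

    static inline int conforming(__u32 pkt_len, __u64 burst_bytes)
    {
            __u64 slice = bpf_ktime_get_ns() / BURST_NS;
            __u32 idx = slice % NCHUNKS;
            struct chunk *c = bpf_map_lookup_elem(&chunks, &idx);

            if (!c)
                    return 0;
            if (c->slice != slice) {        /* first packet in this slice */
                    c->slice = slice;       /* naive: reset still races */
                    c->tokens = burst_bytes;
            }
            if (c->tokens < pkt_len)
                    return 0;               /* violating category */
            c->tokens -= pkt_len;           /* naive non-atomic update */
            return 1;                       /* conforming */
    }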
Deep packet inspection seems to be a largely unexplored area of BPF use cases. The 4096 instruction limit and the lack of loops make such implementations non-straightforward for many protocols. Using XDP and socket filters, at Red Sift, we implemented DNS and TLS handshake detection to provide better monitoring for our clusters. We learned that while the protocol implementation is not necessarily straightforward, the BPF VM provides a reasonably safe environment for DPI-style parsing. When coupled with our Rust userspace implementation, it can provide information and functionality that previously would have required userspace intercepting proxies or middleboxes, at a comparable performance to iptables-style packet filters. Further work is needed to explore how we can turn this into a more comprehensive, active component, mainly due to the BPF VM restrictions around 4096 instruction programs.
BPF trace tools such as bcc/trace and bpftrace can attach to Systemtap USDT (user application statically defined tracepoints) probes. These probes can be created by a macro imported from "sys/sdt.h" or by a provider file. Either way, Systemtap will register those probes as entries in the note section of the ELF file with the name of the probe, its address and the arguments as assembly locations. This approach is fairly simple, easy to parse and non-intrusive. Unfortunately, it is also obsolete and lacks features such as typed arguments and built-in dynamic instrumentation. Since BPF tools are growing in popularity, it makes sense to create a new enhanced format to fix these shortcomings.
We can discuss and make decisions about the future of USDT probes used by BPF trace tools. Some possible alternatives are: extend Systemtap USDT to introduce these new features or extend kernel tracepoints so that user applications can also register them.
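For reference, the macro form of such a probe, which ends up as an ELF note recording the probe's name, address and argument locations (the provider and probe names here are made up):

    #include <stddef.h>
    #include <sys/sdt.h>

    void handle_request(long req_id, size_t len)
    {
            /* Emits a nop at the probe site plus an ELF note describing
             * it; bcc/trace or bpftrace attach a uprobe here by name. */
            DTRACE_PROBE2(myapp, request__start, req_id, len);
            /* ... actual work ... */
    }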
The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based 'strace' like tool, augmenting the syscall arguments provided by the tracepoints with integer->strings tables automatically generated from the kernel headers, showing the paths associated with fds, pid COMMs, etc.
That is enough for integer arguments; pointer arguments need either kprobes put in special locations, which is fragile and so far has been implemented only for getname_flags (open, etc. filenames), or eBPF hooked into the syscall enter/exit tracepoints to collect pointer contents right after the existing tracepoint payload.
This has been done to some extent and is present in the kernel sources in tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in perf for using BPF C programs as event names, automagically built and loaded via clang/llvm and sys_bpf(). 'perf trace' hooks this into the existing beautifiers which, seeing that extra data, use it to get the filename, struct sockaddr_in, etc.
This was done for a bunch of syscalls. What is left is to get this all automated using BTF, to allow passing filters attached to the syscalls, to select which syscalls should be traced, and to use a pre-compiled augmented_syscalls.c, just selecting which bits of the object should be used. These open issues in streamlining the process, so as to avoid requiring the clang toolchain, will be the matter of this discussion.
BPFtrace is a high-level tracing language powered by BPF. Inspired by awk and C, as well as predecessor tracers such as DTrace and SystemTap, BPFtrace has a clean and simple syntax which empowers users to easily create BPF programs and attach them to kprobes, uprobes, and tracepoints.
We can discuss the future of this work, including BTF integration for kprobe struct arguments, and solicit feedback.
The existence and power of eBPF provides a generic execution engine at the kernel level. We have been exploring leveraging the power of eBPF as a way to integrate DTrace more into the existing tracing framework that has matured within the Linux kernel. While DTrace comes with some more lightweight ways for getting probes fired, and while it has a pretty nice userlevel consumer with useful features, there should be no need to duplicate a lot of effort on the level of processing probe events and generating data for the consumer.
We want to move forward with modifying DTrace to make use of the eBPF subsystem, and propose and contribute extensions to eBPF (and most likely some other tracing related subsystems) to provide more support for not only DTrace but tracing tools in general. In order to contribute things that benefit more than just us, we need to get together and talk, so let's get it started...
One year after its inception there are real, hardened filesystems.
Many innovative features. But is it ready for upstream?
A few highlights:
In the talk I will give a short architectural and functional overview, then go over some of the leftover challenges.
And finally I hope to engage in an open discussion of how this project should move forward to be accepted into the kernel, gain more users and more FS implementations.
Case-insensitive file name lookup is a recurrent topic for Linux filesystems, and its stalled development has regained traction in the past few years, thanks to applications in platforms like Valve's SteamOS and Android. Despite aiming to simplify the file lookup operation from a user point of view, the actual implementation of encoding and case-insensitivity awareness carries an outstanding number of issues and corner cases, since human languages don't directly correlate to arbitrary case folding and encoding composition premises; these require a clear behavioral definition from the file system layer in order to get it right. File system developers are invited to come discuss these premises and what is expected from an in-kernel common encoding and case-insensitive abstraction for file systems.
Steal time due to hypervisor overcommitment is a widespread and well-understood phenomenon in virtualized environments.
However, sometimes steal appears even when a hypervisor is not overcommitted. This talk will lay out our quest for better utilization of hypervisor hardware by reducing steal. We will talk about kernel heuristics causing it, how we handle disabling these heuristics by implementing a userspace daemon and the issues that arise from this.
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
The physical memory management in the Linux kernel is mostly based on single page allocations, but there are many situations where larger physically contiguous memory needs to be allocated. Some are for the benefit of userspace (e.g. huge pages), others for better performance in the kernel (SLAB/SLUB, networking, and others).
Making sure that contiguous physical memory is available for allocation is far from trivial, as pages are reclaimed for reuse roughly in least-recently-used (LRU) order, which is typically different from their physical placement. The freed memory is thus fragmented. The kernel has two complementary mechanisms to defragment free memory. One is memory compaction, which migrates used pages to make the free pages contiguous. The other is page grouping by mobility, which tries to make sure that pages that cannot be migrated are grouped together, so the rest of the pages can be effectively compacted. Both mechanisms employ various heuristics to balance the success of large allocations against their overhead in terms of latencies due to processor and lock usage.
The talk will discuss the two mechanisms, focusing on the known problems and their possible solutions, that have been proposed by several memory management developers.
WireGuard [1] [2] is a new network tunneling mechanism written for Linux, which, after three years of development, is nearly ready for upstream. It uses a formally proven cryptographic protocol, custom tailored for the Linux kernel, and has already seen very widespread deployment, in everything from smart phones to massive data center clusters. WireGuard uses a novel timer mechanism to hide state from userspace, and in general presents userspace with a "stateless" and "declarative" system for establishing secure tunnels. The codebase is also remarkably small and has been written with a number of defense-in-depth techniques. Integration into the larger Linux ecosystem is advancing at a healthy rate, with recent patches for systemd and NetworkManager merged. There is also ongoing work on combining WireGuard with automatic configuration and mesh routing daemons on Linux. This talk will focus on a wide variety of WireGuard's innards and its tentacles into other projects. The presentation will walk through WireGuard's integration into the netdev subsystem, its unique use of network namespaces, why kernel space is necessary, and the various hurdles of designing a cryptographic protocol specifically with kernel constraints in mind. It will also examine a practical approach to formal verification, suitable for kernel engineers and not just academics, and connect those ideas with our extensive continuous integration testing framework across multiple kernel architectures and versions. As if that were not already enough, we will also take a close look at the interesting performance aspects of doing high-throughput CPU-bound computations in kernel space while still keeping latency to a minimum. On the topic of smartphones, the talk will examine power efficiency techniques of both the implementation and the protocol design, our experience integrating this into Android kernels, and the relationship between cryptographic secrets and smartphone suspend cycles. Finally, we will look carefully at the WireGuard userspace API and its usage in various daemons and managers. In short, this presentation will examine the networking and cryptography design, the kernel engineering, and the userspace integration considerations of WireGuard.
[1] https://www.wireguard.com
[2] https://www.wireguard.com/papers/wireguard.pdf
Lockdep (the deadlock detector in the Linux kernel) is a powerful tool to detect deadlocks, and has been used for a long time by kernel developers. However, when it comes to read/write lock deadlocks, lockdep has only limited detection support. Making this worse, some major architectures (x86 and arm64) have switched, or are trying to switch, their rwlock implementations to queued rwlocks. For example, we found deadlock cases that happened in the kernel but could not be detected by lockdep.
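A minimal sketch of such a case (lock and function names are made up): with queued rwlocks, a reader that arrives after a queued writer waits behind it, so a recursive read_lock() that is harmless with an unfair rwlock can deadlock.

    #include <linux/spinlock.h>

    static DEFINE_RWLOCK(demo_lock);

    /* CPU 0 */
    static void reader_path(void)
    {
            read_lock(&demo_lock);  /* step 1: first reader gets in */
            /* ... CPU 1 runs writer_path() here (step 2) ... */
            read_lock(&demo_lock);  /* step 3: the recursive read is queued
                                     * behind the waiting writer -> deadlock */
            read_unlock(&demo_lock);
            read_unlock(&demo_lock);
    }

    /* CPU 1, between steps 1 and 3 */
    static void writer_path(void)
    {
            write_lock(&demo_lock); /* step 2: queues, waiting for the reader
                                     * on CPU 0 to drop the lock */
            write_unlock(&demo_lock);
    }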
To improve this situation, a patchset to support read/write deadlock detection in lockdep has been posted to LKML and is now at v6. Although it has received several positive reviews, some details about the reasoning behind its correctness, among other things, still need more discussion.
This topic will give a brief introduction to rwlock-related deadlocks (recursive read deadlocks) and how we can tweak lockdep to detect them. It will focus on the detection algorithm and its correctness, but also cover some implementation details.
This topic will provide the opportunity to discuss the reasoning and the overall design with core locking developers, along with the opportunity to discuss usage scenarios with potential users. The expected result is a cleaner plan for upstreaming this work, and more developers educated on how to use it in their own work.
Most modern microprocessors employ complex instruction execution pipelines such that many instructions can be 'in flight' at any given point in time. The efficiency of this pipelining is typically measured as how many instructions complete per CPU cycle, a metric variously called Instructions Per Cycle (IPC) or, inversely, Cycles Per Instruction (CPI). Various factors affect this metric, and hazards are primary among them. Different types of hazards exist: data hazards, structural hazards and control hazards. A data hazard arises where data dependencies exist between instructions in different stages of the pipeline. A structural hazard arises when the same processor hardware is needed by more than one instruction in flight at the same time. Control hazards arise from branch mispredictions. Information about these hazards is critical for analyzing performance issues and for tuning software to overcome them. Modern processors export such hazard data in Performance Monitoring Unit (PMU) registers. In this talk, we propose an arch-neutral extension to perf to export the hazard data that different architectures present in different ways. We also present how this extension has been applied to the IBM Power processor, the APIs, and example output.
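For context, today's perf interface already exposes some pipeline-stall information through generic hardware events; a minimal userspace sketch using the existing API (not the proposed extension, and subject to per-architecture event availability) could look like this:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
            struct perf_event_attr attr;
            uint64_t stalls;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_STALLED_CYCLES_BACKEND;
            attr.disabled = 1;

            /* count for the current task, on any CPU */
            fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0) {
                    perror("perf_event_open");
                    return 1;
            }

            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            /* ... run the workload under measurement ... */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

            if (read(fd, &stalls, sizeof(stalls)) == sizeof(stalls))
                    printf("backend stall cycles: %llu\n",
                           (unsigned long long)stalls);
            close(fd);
            return 0;
    }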
The focus will be on power management frameworks, task scheduling in relation to power/energy optimization, and platform power management mechanisms. The goal is to facilitate cross framework and cross platform discussions that can help improve power and energy-awareness in Linux.
An updated proposal for Energy Aware Scheduling has been posted and discussed on LKML this year [1]. The patch set introduces an independent Energy Model framework holding the active power costs of CPUs, and changes the scheduler's wake-up balancing code to use this newly available information when deciding on which CPU a task should run.
This session aims at discussing the open problems identified during the review as well as possible improvements to other areas of the scheduler to further improve energy efficiency.
[1] https://lore.kernel.org/lkml/20181016101513.26919-1-quentin.perret@arm.com/
The Linux scheduler is able to drive frequency selection, when the schedutil cpufreq governor is in use, based on task utilization aggregated at the CPU level. The CPU utilization is then used to select the frequency that best fits the workload generated by the tasks. The current translation of utilization values into a frequency selection is pretty simple: we just go to max for RT tasks, or to the minimum frequency that can accommodate the utilization of DL+FAIR tasks.
While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks we can aim at better frequency selection that takes into consideration hints coming from user-space.
Utilization clamping is a mechanism that allows the utilization generated by RT and FAIR tasks to be filtered within a range defined from user-space, either per task or per task group. The clamped utilization requirements of RUNNABLE tasks are aggregated at the CPU level and used to enforce a minimum and/or maximum frequency for that CPU.
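For illustration, a hedged sketch of the proposed per-task interface via an extended sched_setattr(); the struct layout, field names, and flag values below follow the posted patchset and may well change before (or if) the work is merged:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* extended sched_attr as proposed by the patchset */
    struct sched_attr_clamp {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;
            uint64_t sched_deadline;
            uint64_t sched_period;
            uint32_t sched_util_min;        /* proposed clamp fields */
            uint32_t sched_util_max;
    };

    #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20  /* values as in the patchset */
    #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

    int main(void)
    {
            struct sched_attr_clamp attr = {
                    .size           = sizeof(attr),
                    .sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN |
                                      SCHED_FLAG_UTIL_CLAMP_MAX,
                    /* utilization is expressed in the 0..1024 range */
                    .sched_util_min = 256,  /* treat as at least ~25% busy */
                    .sched_util_max = 768,  /* treat as at most ~75% busy */
            };

            /* apply the clamps to the calling task */
            return syscall(SYS_sched_setattr, 0, &attr, 0);
    }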
This session is meant to give an update on the most recent LKML posting of the utilization clamping patchset and to open a discussion on how to better progress this proposal.
The venerable menu governor does some things that are quite questionable in my view. First, it includes timer wakeups in the pattern-detection data and mixes them up with wakeups from other sources, which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (because it knows the time until the next timer event, and that is later than the expected wakeup time). Second, it applies the extra exit-latency limit based on the predicted idle duration and on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep-length correction factors are not correlated to the list of available idle states in any way whatever, and different correction factors are used depending on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular.
A major rework of the menu governor would be required to address these issues and it is likely that the performance of some workloads would suffer from that. That raises the question of whether or not to try to improve the menu governor or to introduce an entirely new one to replace it, or to do both these things simultaneously.
The Generic PM domains framework (genpd) keeps evolving to deal with new problems. Lately, we have, for example, seen genpd incorporate support for active-state power management and for multiple PM domains per device. Let's walk through these new changes and discuss their impact.
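For example, a consumer-side sketch using these additions (error handling trimmed, the performance-state value is arbitrary, and helper names may still evolve):

    #include <linux/err.h>
    #include <linux/errno.h>
    #include <linux/pm_domain.h>

    static int demo_attach_domains(struct device *dev)
    {
            struct device *pd0, *pd1;

            /* a device spanning multiple PM domains attaches to each by index */
            pd0 = genpd_dev_pm_attach_by_id(dev, 0);
            pd1 = genpd_dev_pm_attach_by_id(dev, 1);
            if (IS_ERR_OR_NULL(pd0) || IS_ERR_OR_NULL(pd1))
                    return -ENODEV;

            /* active-state ("performance state") management while powered on */
            return dev_pm_genpd_set_performance_state(pd0, 2);
    }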
While new technologies in platform power management continue to evolve, we need to look at ways to ensure it's independent of the OSPM. Custom vendor solutions for power management and device/system configuration lead to fragmentation.
ACPI solved the problem for some market segments by abstracting details, but we still need an alternative for the traditional embedded/mobile market. ARM SCMI continues to address concerns in a few of these functional areas, but there is still a lot of resistance to moving away from direct control of power resources in the OS. Examples include:
a. Voltage dependencies for clocks (DVFS) [1] - genpd and performance domain integration
b. Generic cpufreq governor for devfreq [2]
c. On-chip interconnect API [3]
This session aims at reaching some consensus and guidelines going forward to avoid further fragmentation.
[1] https://www.spinics.net/lists/linux-clk/msg27587.html
[2] https://patchwork.ozlabs.org/cover/916114/
[3] https://patchwork.kernel.org/cover/10562761/
Due to high performance demands, systems tend to be over-provisioned, so it is not possible to run every component at peak power simultaneously. Even if each component can report power and accept power limits, there is no kernel-level framework to coordinate them. IPA addresses part of this, but on the systems in question thermal limits are usually not the problem; sudden power overdraw is the bigger issue (particularly on unlocked systems). In addition, without a proper power balance among components, they can starve each other. For example, Intel KabyLake-G has four big power consumers: the CPUs, two GPUs and memory. If the CPUs take most of the power, graphics performance suffers, as the GPU cannot handle requests in a timely manner. So power has to be managed at run time based on workload demand.
Runtime PM allows drivers to automatically suspend devices that have not been used for a defined amount of time. This autosuspend feature is really efficient for handling bursts of activity on a device, by optimizing the number of runtime PM suspend/resume calls. However, the runtime PM timers used for this are based entirely on jiffies granularity, which raises problems for embedded ARM platforms that want to optimize their energy usage as much as possible. For example, the minimum timeout value on arm64 is between 4 and 8 ms.
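For reference, the usual driver-side autosuspend pattern looks roughly like the sketch below; the delay passed in is ultimately rounded to jiffies, which is where the coarse granularity comes from:

    #include <linux/pm_runtime.h>

    static void demo_setup_autosuspend(struct device *dev)
    {
            /* 5 ms requested, but effectively rounded up to jiffies */
            pm_runtime_set_autosuspend_delay(dev, 5);
            pm_runtime_use_autosuspend(dev);
            pm_runtime_enable(dev);
    }

    static void demo_io_done(struct device *dev)
    {
            /* instead of suspending immediately, (re)arm the autosuspend
             * timer so bursts of activity don't bounce the device */
            pm_runtime_mark_last_busy(dev);
            pm_runtime_put_autosuspend(dev);
    }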
The session will discuss the impact of switching runtime PM over to using hrtimers and a more fine-grained time scale. It will also highlight the advantages and drawbacks of the changes relative to the current situation.
Modern SoCs have multiple CPUs and DSPs that generate a lot of data flowing through the on-chip interconnects. The topologies can be multi-tiered and complex. These buses are designed to handle use cases with high data throughput, but as the workload varies they need to be scaled to avoid wasting power. Furthermore, the priority between masters can vary depending on the running use case, like video playback or CPU-intensive tasks. The purpose of this new API is to allow drivers to express their QoS needs for interconnect paths between the various SoC components. The requests from drivers are aggregated, and the system configures the interconnect hardware for the optimal performance and power profile.
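A consumer-side sketch of the proposed API, with function names as in the posted series [3]; the endpoint IDs and bandwidth values are made-up placeholders, and the exact signatures may differ as the series evolves:

    #include <linux/err.h>
    #include <linux/interconnect.h>

    /* hypothetical endpoint IDs, normally from a SoC-specific header */
    #define DEMO_MASTER_DSP 1
    #define DEMO_SLAVE_DDR  512

    static int demo_request_bandwidth(struct device *dev)
    {
            struct icc_path *path;

            /* look up the DSP-to-DDR path through the interconnect topology */
            path = icc_get(dev, DEMO_MASTER_DSP, DEMO_SLAVE_DDR);
            if (IS_ERR(path))
                    return PTR_ERR(path);

            /* request average / peak bandwidth (in kbps) for this use case;
             * the framework aggregates requests from all consumers */
            return icc_set(path, 1000000, 2500000);
    }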
The session will discuss the following:
- How the consumer drivers can determine their bandwidth needs.
- How to support different QoS configurations based on whether each CPU/DSP device is active or sleeping.
Remote DMA Microconference
Opening RDMA session with agenda, announcements and some statistics from last year.
Discussion of the best way to govern 3rd-party memory registration, and whether it is acceptable to implement RDMA-specific functionality (in this case, page fault handling) inside the kernel in order to avoid exposing additional interfaces.
RDMA, DAX and persistent memory co-existence.
Explore the limits of what is possible without using On Demand Paging Memory Registration. Discuss 'shootdown' of userspace MRs.
Dirtying pages obtained with get_user_pages() can oops ext4; discuss open solutions.
Poor performance of get_user_pages on very large virtual ranges.
No standardized API to allocate regions to user space
Carry over from last year
RDMA and PCI peer to peer transactions. IOMMU issues. Integration with HMM. How to expose PCI BAR memory to userspace and other drivers as a DMA target.
Work on solving RDMA's distinct lack of public tests.
Provide a better framework for all drivers to test with, and a framework for basic testing in userspace.
Worst remaining unfixed syzkaller bugs and how to try to fix them.
Attempt to close on the remaining tasks to complete the project.
Restore fork() support to userspace
Let's gather together and try to plan for next year.
The momentum behind the RISC-V ecosystem is really commendable, and its open nature has a large role in its growth. It has allowed contributions from both the academic and industry communities, leading to an unprecedented number of hardware design proposals in a very short span of time. Soon, a wider variety of RISC-V based hardware boards and extensions will be available, allowing a larger choice of applications not limited to embedded micro-controllers. The RISC-V software ecosystem also needs to grow across the stack so that RISC-V can be a true alternative to existing ISAs. Linux kernel support holds the key to this.
The primary objective of the RISC-V track at Plumbers is to initiate a community-wide discussion about design problems and ideas for different Linux kernel features, implemented or to be implemented. We believe this will also result in a significant increase in active developer participation in code review and patch submission, which will lead to a better and more stable kernel for RISC-V.
There has been a lot of talk about the need for a RISC-V platform specification to standardise various key components.
One of them is Platform Level Interrupt Controller (PLIC) and local interrupts. We also need a stable yet extensible firmware interface for RISC-V to support virtualization and power management extensions.
Another area that can be discussed is a standard boot flow for any RISC-V based unix platform.
This is a proposal to make SBI a flexible and extensible interface. It is based on the foundational policy of RISC-V, i.e., modularity and openness. The current RISC-V SBI defines only a few mandatory functions, such as the inter-processor interrupt (IPI) interface, timer reprogramming, the serial console, and memory barrier instructions. Many important functionalities, such as power management and CPU hotplug, are not yet defined due to the difficulty of accommodating modifications without breaking backward compatibility with the current interface.
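For context, a minimal sketch of how a legacy SBI call is made from supervisor mode (the wrapper name is made up): the function ID goes in a7, the argument in a0, and the ecall instruction traps into the SBI implementation.

    /* legacy SBI call convention (RV64 shown; RV32 passes 64-bit
     * arguments in register pairs): function ID in a7, argument in a0,
     * return value back in a0; SBI_SET_TIMER is legacy function ID 0 */
    #define SBI_SET_TIMER 0

    static inline long sbi_call1(unsigned long fid, unsigned long arg0)
    {
            register unsigned long a0 asm("a0") = arg0;
            register unsigned long a7 asm("a7") = fid;

            asm volatile ("ecall" : "+r" (a0) : "r" (a7) : "memory");
            return a0;
    }

    /* e.g. program the next timer interrupt */
    static inline void sbi_set_timer(unsigned long stime_value)
    {
            sbi_call1(SBI_SET_TIMER, stime_value);
    }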
RISC-V is currently focused on the embedded market. However, RISC-V already has a design for a superior vector unit that could make the architecture very relevant for High Performance Computing, because it would provide superior floating point performance. There are other issues, though, that would need to be addressed in the RISC-V architecture, like the ability to handle large memory sizes properly, scalable locking, and scalable I/O. This is an overview of things that may have to be addressed to make RISC-V competitive in the HPC area. These features overlap to some extent with what is also needed to enable cloud computing, and we will briefly go into how that could be accomplished. Ideally (in the far and distant future), I would like RISC-V to cover all areas of computing, so that a single instruction set can be used for all use cases in our company and our support overhead can be drastically reduced, since we would no longer have to deal with multiple architectures for different use cases.
Power management needs to be designed from the ground up for RISC-V.
The RISC-V ISA is still missing a key aspect of modern computing: virtualization support. The spec is currently in a draft state, although most of the key elements are there. We can discuss the next steps for getting hypervisors running, at least in QEMU. We can also discuss getting the spec ratified and included in the official RISC-V ISA.
Andes Technology has been involved in RISC-V Linux development since mid-2017 and has submitted 20+ patches to enhance functionality. We will discuss the challenges of supporting features such as loadable modules, perf, ELF attributes, ASID, cache coherence, and AndeStar V5 extension-specific problems.
Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.
Over the past few years the graphics subsystem has been spearheading experiments in running things differently: Pre-merge CI wrapped around mailing lists using patchwork, committer model as a form of group maintainership on steroids, and other things. As a result the graphics people have run into some interesting new corner cases of the kernel's "patches carved on stone tablets" process.
On the other hand the freedesktop.org project, which provides all the server infrastructure for the graphics subsystem, is undergoing a big reorganization of how it provides its services. The biggest change is migrating all source hosting over to a GitLab instance.
This talk will go into the why of these changes and detail what is definitely going to change, and what is being looked into more as experiments with open outcomes.
The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container aware and are designed to be resilient to failures, evictions due to resource contention and scheduled maintenance events can reduce overall efficiency due to the time required to rebuild complex application state. This talk discusses the ongoing use of the open source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.
The eXpress Data Path (XDP) has been gradually integrated into the Linux kernel over several releases. XDP offers fast and programmable packet processing in kernel context. The operating system kernel itself provides a safe execution environment for custom packet processing applications, in the form of eBPF programs executed in device driver context. XDP provides a fully integrated solution working in concert with the kernel's networking stack. Applications are written in higher-level languages such as C and compiled via LLVM into eBPF bytecode, which the kernel statically analyses for safety and JIT-translates into native instructions. This is an alternative approach to kernel-bypass mechanisms like DPDK and netmap.
This talk gives a practically focused introduction to XDP, describing and giving code examples for the programming environment provided to the XDP developer. The programmer needs to change their mindset a bit when coding for this XDP/eBPF execution environment. XDP programs are often split between eBPF code running on the kernel side and a userspace control plane. The control-plane API is not predefined and is up to the programmer, with userspace manipulating shared eBPF maps.
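To give a flavour of the programming environment, a minimal XDP program could look like the sketch below (names are illustrative); it compiles with clang -target bpf and simply validates packet bounds before letting traffic continue up the stack:

    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    SEC("xdp")
    int xdp_demo_prog(struct xdp_md *ctx)
    {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* drop anything too short to hold a 14-byte Ethernet header;
             * the verifier requires this bounds check before any access */
            if (data + 14 > data_end)
                    return XDP_DROP;

            return XDP_PASS;        /* hand the packet to the normal stack */
    }

    char _license[] SEC("license") = "GPL";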
Lively discussion among top level kernel developers about interesting topics raised during Plumbers and elsewhere.
The main purpose of the Linux Plumbers 2018 live kernel patching miniconference is to involve all stakeholders in an open discussion about the remaining issues that need to be solved in order to make live patching of the Linux kernel (more or less) feature complete.
The focus will be on features that have been proposed (some even with a preliminary implementation) but not yet finished, with the ultimate goal of sorting out the remaining issues.