Linux Plumbers Conference 2018

America/Vancouver

Description

November 13-15 2018, Vancouver, BC

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2018 will be held November 13-15 in Vancouver, BC, Canada.  We are looking forward to seeing you there!

    • 09:00–17:30
      Containers MC Junior/Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior/Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

      • 09:00
        Opening session 10m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100
        Speaker: Stéphane Graber (Canonical Ltd.)
      • 09:15
        News from academia: FatELF, RDMA and CRIU 30m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        As part of an ongoing research project we have added several features to CRIU: post-copy memory migration, post-copy migration over RDMA and support for cross-architecture checkpoint-restore.

        The "plain" post-copy migration is already upstream and, up to hiccups that regularily show up in CI, it can be considered working so there is not much to discuss about it.

        The post-copy migration over RDMA aims to reduce the remote page fault latency. We have a working prototype that replaces the TCP transport for memory transfer with RDMA. We still do not have a solid performance evaluation, but if RDMA provides the expected reduction in page fault latency, we are going to work with the CRIU community to get the RDMA support upstream.

        The cross-architecture checkpoint-restore is the most peculiar and controversial feature. Various aspects of heterogeneous ISA execution have been a hot research topic for a while, and we decided to see what it would take to make CRIU capable of migrating an application between architectures.

        The idea is simple: if we create the binary for all architectures so that all the symbols have exactly the same address, then restoring a dump created on a different architecture is a matter of transforming the stack and the registers.

        This transformation relies on the metadata generated by the specialized compiler that generates multiple object files from a single source (one for each architecture), hence the FatELF.

        So far we have been able to force CRIU to create a checkpoint of an extended "Hello, World" application on arm64 and restore this application on x86.

        Speakers: Joel Nider (IBM), Mike Rapoport (IBM)
      • 09:50
        Stacking and LSM Namespacing Redux 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        Last year we discussed the efforts to bring stacking and namespacing to the LSM subsystem. Over the last year several of the outstanding issues have been resolved (if not always in the most satisfactory way). The path forward for upstreaming stacking is now clear.

        This presentation will discuss solutions to outstanding problems and the current direction for upstreaming LSM stacking, as well as the status of namespacing in the LSM subsystems and what it means for containers.

        Speakers: Casey Schaufler (Intel), John Johansen (Canonical)
      • 10:10
        Challenges in migrating a large cgroup deployment from v1 to v2 20m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        Google has a large cgroup v1 deployment and we have begun planning our migration to cgroup v2. This migration has proven difficult because of our extensive use of cgroup v1 features.
        Among the most challenging issues are the transition from multiple hierarchies to a unified one, the migration of users who create their own cgroups, custom threaded cgroup management, and the inability to transition a controller between v1 and v2 more gradually. Additionally, the cgroup v1 features don't map exactly to ones in cgroup v2, which means that there is additional risk during the migration where non-obvious behavior changes have to be debugged and tracked down. The talk will outline these challenges in more detail and describe the approaches we have taken to solve them, including proposals for possible changes to ease this migration for other users of the cgroup v1 interface.

        Speakers: Kamil Yurtsever (Google), Shakeel Butt (Google)
      • 10:30
        Break 30m
      • 11:00
        Time Namespaces 30m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        Discussions around a time namespace have been going on for a long time. The first attempt to implement it was in 2006, and since then the topic has appeared on and off in various discussions.

        There are two main use cases for time namespace:
        1. change date and time inside a container;
        2. adjust clocks for a container restored from a checkpoint.

        “It seems like this might be one of the last major obstacles keeping migration from being used in production systems, given that not all containers and connections can be migrated as long as a time dependency is capable of messing it up.” (by github.com/dav-ell)

        The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME. The last two clocks are monotonic, but their start points are not well defined (currently it is system start-up time, but POSIX says “since an unspecified point in the past”) and differ between systems. When a container is migrated from one node to another, all clocks have to be restored into consistent states;
        in other words, they have to continue running from the same points where they were dumped.
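
        As a rough illustration of the problem (a minimal sketch, not part of the patch set; only standard clock_gettime() calls are used): the monotonic and boottime clocks can be read as below, and the printed values will differ between the dump host and the restore host unless an offset is applied.

          #include <stdio.h>
          #include <time.h>

          int main(void)
          {
              struct timespec mono, boot;

              /* Both clocks start from an unspecified point (in practice, boot),
               * so their values are only meaningful on the host they were read on. */
              clock_gettime(CLOCK_MONOTONIC, &mono);
              clock_gettime(CLOCK_BOOTTIME, &boot);

              printf("CLOCK_MONOTONIC: %ld.%09ld\n", (long)mono.tv_sec, mono.tv_nsec);
              printf("CLOCK_BOOTTIME:  %ld.%09ld\n", (long)boot.tv_sec, boot.tv_nsec);
              return 0;
          }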

        We are going to present a patch set to support offsets for monotonic and boottime clocks. There are a few points to discuss:

        • Clone flags exhaustion. Currently there is only one unused clone flag bit left, and it may be worth using it to extend the arguments of the clone system call.
        • Realtime clock implementation details. Is having a simple offset enough? What should happen when the date and time are changed on the host? Is there a need to adjust vfs modification and creation times?

        Any other features of time namespaces can also be discussed.

        Git: https://github.com/0x7f454c46/linux/tree/wip/time-ns

        Wiki: https://criu.org/Time_namespace

        Speakers: Andrei Vagin, Dmitry Safonov (Arista Networks)
      • 11:35
        Improving *at(2) to make more secure container runtimes 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        Currently, container runtimes are faced with a large attack surface when it comes to a malicious container guest. The most obvious attack surface is the filesystem, and the wide variety of filesystem races and other such tricks that can be used to trick a container runtime into accessing files it shouldn't. To tackle this, most container runtimes have come up with the necessary userspace hacks to work around these issues -- but many of the improvements are inherently flawed as they are not done from kernel-space.

        In this session we will open a discussion of the various kernel APIs that could benefit container runtime security. Topics on the agenda include the use of AT_EMPTY_PATH with openat(2), whether there are any more blockers for the AT_NO_JUMPS patchset, and a proposal for AT_THIS_ROOT, which would allow for much more secure interaction with container filesystems.
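
        For context, the kind of userspace workaround runtimes resort to today is sketched below (illustrative only; the helper name open_beneath() is ours and real runtimes do considerably more): a path is walked one component at a time under a root directory fd, refusing symlinks, which is racy and verbose compared to a single kernel-side flag such as the proposed AT_THIS_ROOT.

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <string.h>

          /* Illustrative only: open a single path component strictly beneath
           * the directory referred to by rootfd, refusing symlinks and "..". */
          static int open_beneath(int rootfd, const char *name)
          {
              if (strchr(name, '/') || strcmp(name, "..") == 0)
                  return -1;
              return openat(rootfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
          }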

        Speaker: Christian Brauner (Canonical)
      • 11:55
        Another year with CRIU: News from the developers 30m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        The Linux Plumbers Conference, being the place where most CRIU developers and users regularly meet and exchange news and ideas, has traditionally had an overview talk about what has happened in and around CRIU since the previous Linux Plumbers Conference.

        Now that the checkpoint and restore micro-conference is part of the containers micro-conference, we still want to keep this 'tradition', as it gives us and all other attendees a good overview of what has changed in CRIU, shows how those changes are used by projects built on CRIU, and may motivate other projects to make use of the newest CRIU features.

        In this talk I will give an overview of what has changed in CRIU since the last Linux Plumbers Conference, give details about certain changes and why we (the CRIU developers) think they are important, and show how they are (or could be) used in projects making use of CRIU.

        In addition to past changes, we want to hear from the community what changes they would like to see in CRIU to better support their projects using CRIU.

        Speaker: Adrian Reber (Red Hat)
      • 14:00
        pivot_root() & MS_SHARED 30m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        The pivot_root() operation is an essential step in virtualizing a
        container's root directory. Current pivot_root() semantics require that a mountpoint is not a shared mountpoint. If it is, the pivot_root() operation will not be allowed. However, some containers need to have a virtualized root directory while at the same time have the root directory be a shared mountpoint. This is necessary when mounts between the host and the container are supposed to propagate in order to have a
        straightforward mechanism to share mount information. In this talk we will explain the original reason for blocking pivot_root() on shared mountpoints and start a discussion centered around a patchset that is a necessary precondition to safely enable pivot_root() on shared mountpoints.
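
        For context, a minimal sketch of what runtimes have to do today (illustrative only; the helper name pivot_into() is ours, not from the patchset): propagation is switched off recursively before pivot_root(), which is precisely what prevents sharing mounts with the host.

          #include <sys/mount.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* Illustrative only: the usual pivot_root() dance, made to work by
           * first making every mount private, i.e. giving up propagation. */
          static int pivot_into(const char *new_root)
          {
              if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
                  return -1;
              if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
                  return -1;
              if (chdir(new_root) < 0)
                  return -1;
              /* pivot_root(".", ".") stacks the old root on top of the new one;
               * it is then detached so only the new root remains visible. */
              if (syscall(SYS_pivot_root, ".", ".") < 0)
                  return -1;
              return umount2(".", MNT_DETACH);
          }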

        Speakers: Christian Brauner (Canonical), Ram Pai (IBM)
      • 14:35
        Task Migration at Google Using CRIU 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        This talk focuses on our use of CRIU for transparent checkpoint/restore task migrations within Google's shared compute infrastructure. This project began as a means to simplify user applications and increase utilization in our clusters. We've now productionized a sizable deployment of our CRIU-based task migration infrastructure. We'll present our experiences using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.

        Speaker: Andy Tucker (Google)
      • 14:55
        libresource - Getting system resource information with standard APIs 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        System resource information, like memory, network and device statistics, is crucial for system administrators to understand the inner workings of their systems, and is increasingly being used by applications to fine-tune performance in different environments.

        Getting system resource information on Linux is not a straightforward affair. The best way is to collect the information from procfs or sysfs, but doing so presents many challenges. Each time an application wants a piece of system resource information, it has to open a file, read the content and then parse it to get the actual information. If the application is running in a container, then even reading from procfs directly may give information about resources on the whole system instead of just the container. Libresource tries to fix a few of these problems by providing a standard library with a set of APIs through which we can get system resource information, e.g. memory, CPU, stat, networking and security related information.
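
        For reference, the status quo without such a library looks roughly like this (a minimal sketch; the helper name meminfo_kb() is illustrative): every consumer re-implements the open/read/parse boilerplate, and none of it is cgroup or namespace aware. A call such as meminfo_kb("MemAvailable") then returns the parsed value in kilobytes.

          #include <stdio.h>
          #include <string.h>

          /* Illustrative only: answer "how many kB does this /proc/meminfo field
           * report?" the way applications have to do it today. */
          static long meminfo_kb(const char *field)
          {
              char line[256];
              long kb = -1;
              FILE *f = fopen("/proc/meminfo", "r");

              if (!f)
                  return -1;
              while (fgets(line, sizeof(line), f)) {
                  if (strncmp(line, field, strlen(field)) == 0 &&
                      line[strlen(field)] == ':') {
                      sscanf(line + strlen(field) + 1, "%ld", &kb);
                      break;
                  }
              }
              fclose(f);
              return kb;
          }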

        Libresource provides (or may provide) the following benefits:

        • Virtualization: In cases where an application is running in a virtualized environment using cgroups or namespaces, reading from the /proc and /sys file-systems might not give correct information as these are not cgroup aware. The library API will take care of this; e.g. if a process is running in a cgroup, the library should provide information which is local to that cgroup.
        • Ease of use: Currently applications need to read this information mostly from the /proc and /sys file-systems. In most cases complex string parsing is involved, which has to be done in application code. With the library APIs an application can get the information directly, and all the string parsing, if any, is done by the library.
        • Stability: If the format in which the information is provided in the /proc or /sys file-system changes, then the application code has to change to align with those changes. Also, if a better way to get the information appears in the future, such as a syscall or a sysconf, the application code again needs to change to benefit from it. The library will take care of such changes and the application will never have to change its code.
        Speaker: Rahul Yadav (Oracle)
      • 15:15
        Securely Migrating Untrusted Workloads with CRIU 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        While deploying a CRIU-based transparent checkpoint/restore task migration infrastructure at Google, one of the toughest challenges we faced was security. The infrastructure views the applications it runs as inherently untrusted, yet CRIU requires expansive privileges at times in order to successfully checkpoint and restore workloads. We found many cases where malicious workloads could trick CRIU into elevating their privileges during checkpoint/restore. We present our experience in securely checkpointing and restoring untrusted workloads with minimal Linux privileges while enabling the bulk of CRIU functionality. We'll discuss the changes required to enable this use case and make the case for an increased emphasis on security in checkpoint/restore.

        Speaker: Radoslaw Burny (Google)
      • 15:30
        Break 30m
      • 16:00
        State of shiftfs 20m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100
        Speaker: Seth Forshee (Canonical)
      • 16:25
        Using a cBPF Binary Tree in Libseccomp to Improve Performance 15m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        Several in-house Oracle customers have identified that their large seccomp filters are slowing down their applications. Their filters largely consist of simple allow/deny logic for many syscalls (306 in one case) and for the most part don't utilize argument filtering.

        Currently libseccomp generates an if-equal statement for each syscall in the filter. Its pseudocode looks roughly like this:

          if (syscall == read)
             return allow
         if (syscall == write)
             return allow
         ...
         # 300 or so if statements later
         if (syscall == foo)
             return allow
         return deny
        

        This is very slow for syscalls that happen to be at the end of the filter. Libseccomp currently allows prioritizing the syscalls to place the most frequent ones at the top of the filter, but this isn't always possible - especially in a cloud situation.
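
        A binary tree (equivalently, a binary search over the sorted syscall numbers) brings this down to roughly log2(306) ≈ 9 comparisons in the worst case. The sketch below shows the equivalent logic in plain C; the actual prototype emits this as a tree of cBPF conditional jumps, and its exact layout may differ.

          #include <stddef.h>

          /* Illustrative only: binary search over a sorted array of allowed
           * syscall numbers, ~9 comparisons instead of up to 306. */
          static int syscall_allowed(const int *allowed, size_t n, int nr)
          {
              size_t lo = 0, hi = n;

              while (lo < hi) {
                  size_t mid = lo + (hi - lo) / 2;

                  if (allowed[mid] == nr)
                      return 1;             /* allow */
                  if (allowed[mid] < nr)
                      lo = mid + 1;
                  else
                      hi = mid;
              }
              return 0;                     /* deny */
          }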

        I currently have a working prototype and the numbers are very promising.
        In this prototype, I timed calling getppid() in a loop using a filter similar to one of my customer's seccomp filters. I ran this loop one million times and recorded the min, max, and mean times (in TSC ticks) to call getppid().
        (I didn't disable interrupts, so the max time was often large.) I chose to report the minimum time because I feel it best represents the actual time to make the syscall through the filter.

        The code for the libseccomp RFE is available here:
        https://github.com/seccomp/libseccomp/issues/116

        Test case                                               Minimum TSC ticks to make syscall

        seccomp disabled                                        138
        getppid() at the front of the 306-syscall filter        256
        getppid() in the middle of the 306-syscall filter       516
        getppid() at the end of the 306-syscall filter          1942
        getppid() in a binary tree                              312

        Speaker: Tom Hromatka (Oracle)
      • 16:45
        Uevent in namespaces 30m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100

        On non-embedded systems, device management in Linux is a task split between kernelspace and userspace. Since the introduction of the devtmpfs pseudo filesystem, the kernel is solely responsible for creating device nodes while udev in userspace is mainly responsible for consistent device naming and permissions. The devtmpfs filesystem, however, is not namespace aware. As such, devices always belong to the
        initial user namespace. In times of SR-IOV enabled devices it is both possible and necessary to hand off devices to non-initial user namespaces.
        Over the last couple of months I’ve been working on enabling userspace to target device events to specific user namespaces. With recent patchsets of mine we have now reached that goal; userspace can now tie devices to a specific user namespace. This talk aims to explain the concept of namespace-aware
        device management, the patchsets that were needed to make it possible, and possible future improvements.

        Speaker: Christian Brauner
      • 17:20
        Container IDs 10m Junior/Ballroom-AB

        Junior/Ballroom-AB

        Sheraton Vancouver Wall Center

        100
        Speaker: Stéphane Graber (Canonical Ltd.)
    • 09:00–12:30
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
      • 09:00
        Linux's Code of Conduct Panel 45m
        Speakers: Chris Mason (Facebook), Greg Kroah-Hartman (Linux Foundation), Mishi Choudhary (Linux Foundation), Olof Johansson (Facebook)
      • 09:45
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

      • 10:30
        Break 30m
      • 11:00
        Virtio as a universal communication format 45m
        Speaker: Michael S. Tsirkin (Red Hat)
      • 11:45
        GCMA: Guaranteed Contiguous Memory Allocator 45m
        Speaker: SeongJae Park
    • 09:00–12:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 09:00
        Improving Graphics Interactivity - It's All in the Timing 45m

        Interactive applications, which include everything from real time
        games through flight simulators and virtual reality environments,
        place strong real-time requirements on the whole computing environment
        to ensure that the correct data are presented to the user at the
        correct time. This requires two things; the first is that the time
        when the information will be displayed be known to the application so
        that the correct contents can be computed, and second that the frame
        actually be displayed at that time.

        These two pieces of information are managed inconsistently through the
        graphics stack, making it difficult for applications to provide a
        smooth animation experience to users. And because of the many APIs
        which lie between the application rendering using OpenGL or Vulkan and
        the underlying hardware, a failure to correctly handle things at
        any point along the chain will result in visible stutter.

        Fixing this requires changes throughout the system: making the
        kernel provide better control and information about the queuing and
        presentation of images; changes in composited window systems to
        ensure that images are displayed at the target time and that the
        actual time of presentation is reported back to applications; and
        finally additions to rendering APIs like Vulkan to expose control
        over image presentation times and feedback about when images ended
        up being visible to the user.

        This presentation will first demonstrate the effects of poor display
        timing support inherent in the current graphics stack, talk about the
        different solutions required at each level of the system and finally
        show the working system.

        Speaker: Keith Packard (Hewlett Packard Enterprise)
      • 09:45
        The end of time, 19 years to go 45m

        Software that uses a 32-bit integer to represent seconds since the Unix epoch of Jan 1 1970 is affected by that variable overflowing on Jan 19 2038, often in a catastrophic way. Aside from most 32-bit binaries that use timestamps, this includes file systems (e.g. ext3 or xfs), file formats (e.g. cpio, utmp, core dumps), network protocols (e.g. nfs) and even hardware (e.g. real-time clocks or SCSI adapters).
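
        The failure mode itself takes only a few lines to demonstrate (a minimal sketch; it simply prints what a 32-bit time_t binary would compute): the signed 32-bit seconds counter runs out on 2038-01-19 at 03:14:07 UTC and wraps back to 1901.

          #include <stdint.h>
          #include <stdio.h>
          #include <time.h>

          int main(void)
          {
              int32_t last = INT32_MAX;                        /* 2038-01-19 03:14:07 UTC */
              int32_t wrapped = (int32_t)((uint32_t)last + 1); /* wraps to -2^31 -> 1901  */

              printf("last 32-bit second: %s", ctime(&(time_t){ last }));
              printf("one second later:   %s", ctime(&(time_t){ wrapped }));
              return 0;
          }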

        Work has been going on to avoid that overflow in the Linux kernel, with hundreds of patches reworking drivers, file systems and the user space interfaces including over 50 affected system calls.

        With much of this activity getting done during 2018, it's time to give an update on what has been achieved in the kernel, what parts still remain to be solved, how we will proceed to solve this in user space, and how to use the work in long-lived product deployments.

        Speaker: Mr Arnd Bergmann (Linaro)
      • 10:30
        Break 30m
      • 11:00
        Mind the gap - between real-time Linux and real-time theory 45m

        It is common to see Linux being used in real-time research projects. However, the assumptions made in papers are very often unrealistic. In turn, researchers argue that the main metric used for PREEMPT RT, although useful, is an oversimplification of the problem.

        It is a consensus that academic research helps to improve Linux’s state of the art, and vice versa. So how can we reduce the gap between these two communities? Real-time researchers start papers with a clear definition of the task model. But we do not have a task model for Linux: this is where the gap is.

        This talk presents the effort to establish a task model for PREEMPT RT Linux, starting with a description of the operations that influence the timing behavior of tasks and then defining the relationships between those operations. Finally, the outcomes for Linux, like new metrics for PREEMPT RT and a model validator for the kernel (a lockdep-like verifier, but for preemption), are discussed.

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 11:45
        SCHED_DEADLINE desiderata and slightly crazy ideas 45m

        The SCHED_DEADLINE scheduling policy is anything but done. Even though it has existed in mainline for several years, many features are yet to be implemented; some are already available as immature code, some others only exist as wishes.

        In this talk Juri Lelli and Daniel Bristot De Oliveira will give the audience in-depth details of what’s missing, what’s under development and what might be desirable to have. The intent is to provide as much information as possible to people attending, so that a fruitful discussion might be held later on during hallway and micro conference sessions.

        Examples of what is going to be presented are:

        • Non-root usage
        • CGroup support
        • Re-working RT Throttling to use DL servers
        • Better Priority Inheritance (AKA proxy execution)
        • Schedulability improvements
        • Better support for tracing
        Speakers: Juri Lelli (Red Hat, Inc.), Daniel Bristot de Oliveira (Red Hat, Inc.)
    • 09:00–18:00
      Networking Track Junior/Ballroom-C (Sheraton Vancouver Wall Center)

      Junior/Ballroom-C

      Sheraton Vancouver Wall Center

      67

      A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.

      Official Networking Track website: http://vger.kernel.org/lpc-networking2018.html

      • 09:00
        Welcome 20m

        Opening welcome, announcements, etc.

      • 09:20
        XDP - Challenges and Future Work 35m

        XDP already offers rich facilities for high performance packet
        processing, and has seen deployment in several production systems.
        However, this does not mean that XDP is a finished system; on the
        contrary, improvements are being added in every release of Linux, and
        rough edges are constantly being filed down. The purpose of this talk is
        to discuss some of these possibilities for future improvements,
        including how to address some of the known limitations of the system. We
        are especially interested in soliciting feedback and ideas from the
        community on the best way forward.
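
        For readers less familiar with XDP, a deliberately trivial example of what an XDP program looks like is shown below (not from this talk; it would be compiled with clang targeting BPF and attached with iproute2 or a libbpf-based loader). Everything interesting happens before the kernel has even built an SKB.

          #include <linux/bpf.h>

          #ifndef SEC
          #define SEC(name) __attribute__((section(name), used))
          #endif

          /* Minimal illustrative XDP program: let every packet continue up
           * the normal networking stack. */
          SEC("xdp")
          int xdp_pass_all(struct xdp_md *ctx)
          {
              return XDP_PASS;
          }

          char _license[] SEC("license") = "GPL";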

        The issues we are planning to discuss include, but are not limited to:

        • User experience and debugging tools: How do we make it easier for
          people who are not familiar with the kernel or XDP to get to grips
          with the system and be productive when writing XDP programs?

        • Driver support: How do we get to full support for XDP in all drivers?
          Is this even a goal we should be striving for?

        • Performance: At high packet rates, every micro-optimisation counts.
          Things like inlining function calls in drivers are important, but also
          batching to amortise fixed costs such as DMA mapping. What are the
          known bottlenecks, and how do we address them?

        • QoS and rate transitions: How should we do QoS in XDP? In particular,
          rate transitions (where a faster link feeds into a slower) are
          currently hard to deal with from XDP, and would benefit from, e.g.,
          Active Queue Management (AQM). Can we adapt some of the AQM and QoS
          facilities in the regular networking stack to work with XDP? Or should
          we do something different?

        • Accelerating other parts of the stack: Tom Herbert started the
          discussion on accelerating transport protocols with XDP back in 2016.
          How do we make progress on this? Or should we be doing something
          different? Are there other areas where we can extend XDP's processing
          model to provide useful accelerations?

        Speakers: Jesper Dangaard Brouer (Red Hat), Toke Høiland-Jørgensen (Karlstad University)
      • 09:55
        Leveraging Kernel Tables with XDP 35m

        XDP is a framework for running BPF programs in the NIC driver to allow
        decisions about the fate of a received packet at the earliest point in
        the Linux networking stack. For the most part the BPF programs rely on
        maps to drive packet decisions, maps that are managed for example by a
        userspace agent. This architecture has implications on how the system is
        configured, monitored and debugged.

        An alternative approach is to make the kernel networking tables
        accessible by BPF programs. This approach allows the use of standard
        Linux APIs and tools to manage networking configuration and state while
        still achieving the higher performance provided by XDP. An example of
        providing access to kernel tables is the recently added helper to allow
        IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites
        such as FRR manage the FIB tables, and the XDP packet path benefits by
        automatically adapting to the FIB updates in real time. While a huge
        first step, a FIB lookup alone is not sufficient for general networking
        deployments.
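
        The shape of that helper is sketched below (an illustrative fragment, not a complete forwarding program; header parsing and MAC rewriting are omitted, and the declarations come from the kernel tree's bpf_helpers.h):

          #include <linux/bpf.h>
          #include "bpf_helpers.h"   /* SEC() and helper declarations, kernel tree */

          SEC("xdp")
          int xdp_fwd(struct xdp_md *ctx)
          {
              struct bpf_fib_lookup fib = {};

              /* ... parse Ethernet/IP headers and fill the addresses here
               * (omitted for brevity) ... */
              fib.ifindex = ctx->ingress_ifindex;

              if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) ==
                  BPF_FIB_LKUP_RET_SUCCESS) {
                  /* fib.dmac/fib.smac now hold the rewrite information. */
                  return bpf_redirect(fib.ifindex, 0);
              }
              return XDP_PASS;    /* let the kernel stack handle the rest */
          }

          char _license[] SEC("license") = "GPL";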

        This talk discusses the advantages of making kernel tables available to
        XDP programs to create a programmable packet pipeline, what features
        have been implemented as of October 2018, key missing features, and
        current challenges.

        Speaker: David Ahern (Cumulus Networks)
      • 10:30
        Break 30m
      • 11:00
        Building Socket-aware BPF Programs 35m

        Over the past several years, BPF has steadily become more powerful in multiple
        ways: Through building more intelligence into the verifier which allows more
        complex programs to be loaded, and through extension of the API such as by
        adding new map types and new native BPF function calls. While BPF has its roots
        in applying filters at the socket layer, the ability to introspect the sockets
        relating to traffic being filtered has been limited.

        To build such awareness into a BPF helper, the verifier needs the ability to
        track the safety of the calls, including appropriate reference counting upon
        the underlying socket. This talk walks through extensions to the verifier to
        perform tracking of references in a BPF program. This allows BPF developers to
        extend the UAPI with functions that allocate and release resources within the
        execution lifetime of a BPF program, and the verifier will validate that the
        resources are released exactly once prior to program completion.

        Using this new reference tracking ability in the verifier, we add socket lookup
        and release function calls to the BPF API, allowing BPF programs to safely find
        a socket and build logic upon the presence or attributes of a socket. This can
        be used to load-balance traffic based on the presence of a listening
        application, or to implement stateful firewalling primitives to understand
        whether traffic for this connection has been seen before. With this new
        functionality, BPF programs can integrate more closely with the networking
        stack's understanding of the traffic transiting the kernel.
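
        A sketch of how the new helpers are meant to be used from a TC classifier program follows (illustrative only, based on the proposed API shape; parsing of the 4-tuple is omitted and the declarations come from the kernel tree's bpf_helpers.h):

          #include <linux/bpf.h>
          #include <linux/pkt_cls.h>
          #include "bpf_helpers.h"   /* SEC() and helper declarations, kernel tree */

          SEC("classifier")
          int sk_aware(struct __sk_buff *skb)
          {
              struct bpf_sock_tuple tuple = {};
              struct bpf_sock *sk;

              /* ... fill tuple.ipv4.{saddr,daddr,sport,dport} from the packet
               * (header parsing omitted for brevity) ... */

              sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                     BPF_F_CURRENT_NETNS, 0);
              if (sk) {
                  /* A matching local socket exists: apply policy here. The
                   * verifier insists this reference is released exactly once. */
                  bpf_sk_release(sk);
              }
              return TC_ACT_OK;
          }

          char _license[] SEC("license") = "GPL";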

        Speaker: Joe Stringer (Cilium)
      • 11:35
        Experiences Evaluating DC-TCP 35m

        In this talk we describe our experiences in evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200ms. Once we fixed those issues, we proceeded to test in a 6 rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12 to 1000x) at a cost of larger CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results and explore the causes of the larger CPU utilization as well as discuss partial solutions to this issue.

        Note: We plan to test on a much larger cluster and have those results available before the conference.

        Speakers: Lawrence Brakmo (Facebook), Boris Burkov (Facebook), Greg Leclercq (Facebook), Murat Mugan (Facebook)
      • 12:10
        Scaling Linux Bridge Forwarding Database 35m

        Linux bridge is deployed on hosts, hypervisors, container OSes and, in recent years, on data center switches. It is complete in its feature set, with forwarding, learning, proxy and snooping functions. It can bridge Layer-2 domains between VMs, containers, racks, PODs and between data centers, as seen with Ethernet Virtual Private Networks [1, 2]. With Linux bridge deployments moving up the rack, it is now bridging larger Layer-2 domains, bringing in scale challenges. The bridge forwarding database can scale to thousands of entries on a data center switch with hardware acceleration support.

        In this paper we discuss performance and operational challenges with a large-scale bridge fdb database and solutions to address them. We will discuss solutions like fdb dst port failover for faster convergence, a faster API for fdb updates from the control plane, and reducing the number of fdb dst ports with lightweight tunnel endpoints for bridging over a tunneling solution (e.g. VXLAN).

        Though discussed around the deployment scenarios below, most solutions are generic and can be applied to all bridge use cases:

        • Multi-chassis link aggregation scenarios where Linux bridge is part of the active-active switch redundancy solution
        • Ethernet VPN solutions where Linux bridge forwarding database is extended to reach Layer-2 domains over a network overlay like VxLAN

        [1] https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-11
        [2] https://www.netdevconf.org/2.2/slides/prabhu-linuxbridge-tutorial.pdf

        Speakers: Roopa Prabhu (Cumulus Networks), Nikolay Aleksandrov (Cumulus Networks)
      • 14:00
        P4C-XDP: Programming the Linux Kernel Forwarding Plane Using P4 35m

        The eXpress Data Path (XDP) is a new kernel feature, intended to provide
        fast packet processing as close as possible to device hardware. XDP
        builds on top of the extended Berkeley Packet Filter (eBPF) and allows
        users to write a C-like packet processing program, which can be attached
        to the device driver’s receiving queue. When the device observes an
        incoming packet, the user-defined XDP program is triggered to execute on
        the packet payload, making the decision as early as possible before
        handing the packet down the processing pipeline.

        P4 is a domain-specific language describing how packets are processed by
        the data plane of programmable network elements, including network
        interface cards, appliances, and virtual switches. It provides an
        abstraction that allows programmers to express existing and future
        protocol format without coupling it to any data plane specific
        knowledge. The language is explicitly designed to be protocol-agnostic.
        A P4 programmer can write their own protocols and load the P4 program
        into P4-capable network elements.
        As a high-level networking language, P4 supports a diverse set of compiler
        backends and also possesses the capability to express eBPF and XDP programs.

        We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages
        XDP to aim for a high performance software data plane. The backend
        generates an eBPF-compliant C representation from a given P4 program
        which is passed to clang and llvm to produce the bytecode. Using
        conventional eBPF kernel hooks the program can then be loaded into the
        eBPF virtual machine in the device driver. The kernel verifier
        guarantees the safety of the generated code. Any packets
        received/transmitted from/to this device driver now trigger the
        execution of the loaded P4 program.

        The P4C-XDP project is an open source project hosted at
        https://github.com/vmware/p4c-xdp/. We provide proof-of-concept sample
        code under the tests directory, which contains a couple of examples such
        as basic protocol parsing, checksum recalculation, multiple tables
        lookups, and tunnel protocol en-/decapsulation.

        Speakers: Fabian Ruffy (University of British Columbia), Mihai Budiu (VMware), William Tu (VMware)
      • 14:35
        ERSPAN Support for Linux 35m

        Port mirroring is one of the most common network troubleshooting
        techniques. SPAN (Switch Port Analyzer) allows a user to send a copy
        of the monitored traffic to a local or remote device using a sniffer
        or packet analyzer. RSPAN is similar, but sends the mirrored traffic
        on a VLAN. ERSPAN extends the port mirroring capability from Layer 2
        to Layer 3, allowing the mirrored traffic to be encapsulated in an
        extension of the GRE (Generic Routing Encapsulation) protocol and sent
        through an IP network. In addition, ERSPAN carries configurable
        metadata (e.g., session ID, timestamps), so that the packet analyzer
        has better understanding of the packets.

        ERSPAN for IPv4 was added into Linux kernel in 4.14, and for IPv6 in
        4.16. The implementation includes both transmission and reception and
        is based on the existing ip_gre and ip6_gre kernel module. As a
        result, Linux today can act as an ERSPAN traffic source sending the
        ERSPAN mirrored traffic to the remote host, or an ERSPAN destination
        which receives and parses the ERSPAN packets generated from Cisco or
        other ERSPAN-capable switches.

        We’ve added both the native tunnel support and metadata-mode tunnel
        support. In this paper, we demonstrate three ways to use the ERSPAN
        protocol. First, for Linux users, using iproute2 to create native
        tunnel net device. Traffic sent to the net device will be
        encapsulated with the protocol header accordingly and traffic matching
        the protocol configuration will be received from the net device.
        Second, for eBPF users, using iproute2 to create metadata-mode ERSPAN
        tunnel. With eBPF TC hook and eBPF tunnel helper functions, users can
        read/write ERSPAN protocol’s fields in finer granularity. Finally,
        for Open vSwitch users, using the netlink interface to create a switch
        and programmatically parse, lookup, and forward the ERSPAN packets
        based on flows installed from the userspace.

        Speakers: William Tu (VMware), Greg Rose (VMware)
      • 15:10
        The Path to DPDK Speeds for AF_XDP 35m

        AF_XDP is a new socket type for raw frames to be introduced in 4.18
        (in linux-next at the time of writing). The current code base offers
        throughput numbers north of 20 Mpps per application core for 64-byte
        packets on our system, however there are a lot of optimizations that
        could be performed in order to increase this even further. The focus
        of this paper is the performance optimizations we need to make in
        AF_XDP to get it to perform as fast as DPDK.

        We present optimizations that fall into two broad categories: ones that
        are seamless to the application and ones that require additions to
        the uapi. In the first category we examine the following:

        • Loosen the requirement for having an XDP program. If the user does
          not need an XDP program and there is only one AF_XDP socket bound to
          a particular queue, we do not need an XDP program. This should cut
          out quite a number of cycles from the RX path.

        • Wire up busy poll from user space. If the application writer is
          using epoll() and friends, this has the potential benefit of
          removing the coherency communication between the RX (NAPI) core and
          the application core as everything is now done on a single
          core. Should improve performance for a number of use cases. Maybe it
          is worth revisiting the old idea of threaded NAPI in this context
          too.

        • Optimize for high instruction cache usage through batching as has
          been explored in for example Cisco's VPP stack and Edward Cree in
          his net-next RFC "Handle multiple received packets at each stage".

        In the uapi extensions category we examine the following
        optimizations:

        • Support a new mode for NICs with in-order TX completions. In this
          mode, the completion queue would not be used. Instead the
          application would simply look at the pointer in the TX queue to see
          if a packet has been completed. In this mode, we do not need any
          backpressure between the completion queue and the TX queue and we
          do not need to populate or publish anything in the completion queue
          as it is not used. Should improve the performance of TX for in-order
          NICs significantly.

        • Introduce the "type-writer" model where each chunk can contain
          multiple packets. This is the model that e.g., Chelsio has in its
          NICs. But experiments show that this mode also can provide better
          performance for regular NICs as there are fewer transactions on the
          queues. Requires a new flag to be introduced in the options field of
          the descriptor.

        With these optimizations, we believe we can reach our goal of close to
        40 Mpps of throughput for 64-byte packets in zero-copy mode. Full
        analysis with performance numbers will be presented in the final
        paper.

        Speakers: Björn Töpel (Intel), Magnus Karlsson (Intel)
      • 15:45
        Break 20m
      • 16:05
        eBPF / XDP Based Firewall and Packet Filtering 35m

        iptables has been the typical tool for creating firewalls on Linux hosts. We have used it at Facebook for setting up host firewalls on our servers across a variety of tiers. In this proposal, we introduce an eBPF/XDP-based firewall solution which we use for packet filtering and which has parity with our iptables implementation. We discuss various aspects of this. Following is a brief summary of these aspects, which we will detail further in the paper / presentation.

        • Design and Implementation:

          • We use BPF tables (maps, LPM tries, and arrays) to match on the appropriate packet header contents (a sketch of such a map appears after this list)
          • The heart of the firewall is an eBPF filter which parses a packet and does lookups against all relevant maps, collecting the matching values. A logical rule set is applied to these collected values. This logical set reads similarly to a human-readable, high-level firewall policy. With iptables rules, amidst all the verbose matching criteria inline in every rule, such a policy-level representation is hard to infer.
        • Performance benefits and comparisons with iptables

          • iptables does packet matching linearly against each rule until a match is found. In our proposal, we use BPF Tables (maps) containing keys for all rules, making packet matching highly efficient. We then apply the policy using the collected results, which results in a considerable speedup over iptables.
        • Ease of policy / config updates and maintenance

          • The network administrator owns the firewall while the app developers typically require opening ports for their applications to work. With our approach of using an eBPF filter, we create a logical separation between the filter which enforces the policy and the contents of the associated maps which represent the specific ports and prefixes that need to be filtered. The policy is owned by the network administrator (example: ports open to the internet, ports open from within specific prefixes, drop everything else). The data (port numbers, prefixes, etc.) can now belong to a separate logical section which presents application developers with a predetermined destination to update their data (example: a file containing ports opened to internal subnets, etc.). This reduces friction between the two different functions and reduces human errors.
        • Deployment experience:

          • We deploy this solution in our edge infrastructure to implement our firewall policy.
          • We update configuration, reload filters and contents of the various maps containing keys and values for filtering
        • BPF Program array

          • We use the power of the BPF program array to chain different programs like rate limiters, firewalls, load balancers, etc. These are building blocks to create a rich, highly performant networking solution
        • Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering

          • We present a proposal which can translate existing iptables rules to a better-performing eBPF program with mostly user-space processing and validation.
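
        As referenced above, a sketch of the kind of prefix-match map involved (illustrative only; the map name, sizes and the classic bpf_helpers.h definition style are ours, and the production layout may differ):

          #include <linux/bpf.h>
          #include "bpf_helpers.h"   /* SEC(), struct bpf_map_def, helpers */

          struct lpm_v4_key {
              __u32 prefixlen;          /* LPM trie keys must start with this */
              __u32 addr;               /* IPv4 address, network byte order */
          };

          struct bpf_map_def SEC("maps") src_prefixes = {
              .type        = BPF_MAP_TYPE_LPM_TRIE,
              .key_size    = sizeof(struct lpm_v4_key),
              .value_size  = sizeof(__u32),       /* e.g. a rule/verdict id */
              .max_entries = 1024,
              .map_flags   = BPF_F_NO_PREALLOC,   /* required for LPM tries */
          };

          static __attribute__((always_inline)) __u32 match_src(__u32 saddr)
          {
              struct lpm_v4_key key = { .prefixlen = 32, .addr = saddr };
              __u32 *verdict = bpf_map_lookup_elem(&src_prefixes, &key);

              return verdict ? *verdict : 0;
          }
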
        Speakers: Anant Deepak (Facebook), Puneet Mehra (Facebook), Richard Huang (Facebook)
      • 16:40
        XDP Acceleration Using NIC Metadata, Continued 35m

        This talk is a continuation of the initial XDP HW-based hints work presented at NetDev 2.1 in Seoul, South Korea.

        It will start by showcasing new prototypes that allow an XDP program to request required HW-generated metadata hints from a NIC. The talk will show how the hints are generated by the NIC and what the performance characteristics are for various XDP applications. We also want to demonstrate how such metadata can be helpful for applications that use AF_XDP sockets.

        The talk will then discuss planned upstreaming thoughts, and look to generate more discussion around implementation details, programming flows, etc., with the larger audience from the community.

        Speakers: P. J. Waskiewicz (Intel), Neerav Parikh (Intel)
      • 17:15
        Linux SCTP is Catching Up and Going Above! 35m

        SCTP is a transport protocol, like TCP and UDP, originating from the SIGTRAN
        IETF Working Group in the early 2000s with the initial objective of
        supporting the transport of PSTN signalling over IP networks. It featured
        multi-homing and multi-stream from the beginning, and since then there
        have been a number of improvements that help it serve other purposes too,
        such as support for Partial Reliability and Stream Scheduling.
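
        (As a minimal illustration of the socket API surface only, not of the kernel work described below: an SCTP one-to-one style socket is created like a TCP socket but with IPPROTO_SCTP; one-to-many associations use SOCK_SEQPACKET, and the extended API lives in lksctp-tools.)

          #include <netinet/in.h>
          #include <sys/socket.h>

          int main(void)
          {
              /* One-to-one style SCTP socket; requires SCTP support in the
               * running kernel (the sctp module). */
              int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

              return fd < 0 ? 1 : 0;
          }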

        Linux SCTP arrived late and was stuck. It wasn't as up to date as the
        released RFCs, while it was also far behind other systems such as BSD,
        and also suffered from performance problems. In the past 2 years, we
        were dedicated to ensuring that these features were addressed and
        focused on making many improvements. Now all the features from released
        RFCs have been fully supported in Linux, and some from draft RFCs are
        already ongoing. Besides, we've seen an obvious improvement in performance
        in various scenarios.

        In this talk we will first do a quick review on SCTP basics, including:

        • Background: Why SCTP is used for PSTN Signalling Transport, why other
          applications are using or will use SCTP.
        • Architecture: The general SCTP structures and procedures implemented in
          Linux kernel.
        • VS TCP/UDP: An overview of functions and applicability of SCTP, TCP and
          UDP.

        Then go through the improvements that were made in the past 2 years,
        including:

        • SCTP-related projects in Linux: Other than kernel part, there are also
          lksctp-tools, sctp-tests, tahi-sctp, etc.
        • Features implemented lately: RFC ones like Stream Scheduling, Message
          Interleaving, Stream Reconfig, Partially Reliable Policy, and many
          CMSGs, SndInfos, Socket APIs.
        • Improvements made recently: Big patchsets like SCTP Offload, Transport
          Hashtable, SCTP Diag and Full SELinux support.
        • VS BSD: We will take a look at the difference between Linux and BSD now
          regarding SCTP. You will be surprised to see that we've gone further
          than other systems.

        We will finish by reviewing a list of what is on our radar as well as next
        steps, like:

        • Ongoing features: SCTP NAT and SCTP CMT, two big important features are
          ongoing and already taking form, and more Performance Improvements in
          kernel have also been started.
        • Code refactor: New Congestion Framework will be introduced, which will
          be more flexible for SCTP to extend more Congestion Algorithms.
        • Hardware support: HW CRC Checksum and GSO will definitely make performance
          better, for which a new segment logic for both .segment and HW that works
          for SCTP chunks is needed.
        • RFC docs improvements: We believe that more extensions and revisions will
          make SCTP more widespread.

        Given its power and complexity, SCTP is destined to face many challenges
        and threats, but we believe that we have already made it, and will continue
        to make it, better than the implementations on other systems, and better than other transport protocols.
        Please join us, Linux SCTP needs your help too!

        Speakers: Marcelo Ricardo Leitner (Red Hat), Xin Long (Red Hat)
    • 09:00–12:30
      Testing & Fuzzing MC Pavillion/Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-C

      Sheraton Vancouver Wall Center

      58

      The Linux Plumbers 2018 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.

      Our objective is to gather leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.

      Plans for participants and talks will be similar to last year's (https://blog.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/641).

    • 09:00–12:40
      Thermal MC Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77

      This proposal is to gather hackers interested in improving the thermal subsystem in the Linux kernel and its interaction with hardware- and userspace-based policies. Nowadays, given the nature of the workloads and the wide spectrum of devices that Linux is used in, the people interested in improving the thermal subsystem come from different backgrounds and bring use cases of diverse thermally constrained systems, ranging from embedded devices to systems with high computing power. Despite the heterogeneity of software solutions to control thermals, the thermal subsystem is still at the core of many of them, including policies that rely on hardware-configured thresholds, interactions with firmware-based control loops, or policies that rely on userspace daemons. Therefore, this micro-conference aims at gathering the thermal-interested developers of the community to discuss further improvements to the Linux thermal subsystem.

      • 09:00
        Thermal user space interfaces: Optimizing user kernel interfaces 25m

        The current thermal sysfs interfaces are really designed mostly for debugging and are not optimized to handle thermal events in user space. The current notification mechanism using netlink sockets is slow and adds additional computation and latency. The same holds true for sysfs-based reading of temperature, which needs constant polling to identify a trend.
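
        For reference, the polling pattern being criticized looks like this (a minimal sketch; thermal_zone0 is just an example zone and the helper name is ours):

          #include <stdio.h>

          /* Illustrative only: the poll-and-parse step userspace repeats today
           * to spot a temperature trend; the value is in millidegrees Celsius. */
          static long read_temp_mc(void)
          {
              long temp = -1;
              FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");

              if (f) {
                  if (fscanf(f, "%ld", &temp) != 1)
                      temp = -1;
                  fclose(f);
              }
              return temp;
          }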

        In the past I proposed a bridge to Linux IIO, which was agreed in principle but never merged into mainline Linux. Here I will propose one more solution for further discussion.

        Speaker: Srinivas Pandruvada (Intel)
      • 09:25
        Improvements on thermal zone mode 20m

        Discuss the following topics:

        1. The thermal framework does not handle disabled thermal zones properly. For example, when a thermal zone is disabled, the thermal framework may still poke the thermal sensor; in addition, the per-thermal-zone polling timer never stops for the disabled thermal zone.
        2. The thermal framework does not support registering a disabled thermal zone (which would be enabled later) correctly.
        3. Some thermal zones may need to switch between enabled and disabled frequently for some reason, e.g. device runtime suspend. In this case, any userspace tool that polls the temperature of the thermal zone may get error messages occasionally. We need to define the kernel thermal zone driver status and the userspace behavior for this case.
        Speaker: Rui Zhang (Intel)
      • 09:45
        Embedded Thermal usecases and how to handle them 25m

        Discussion around some mobile use cases that don't fit very well in the current framework, and proposals for possible solutions.

        These include virtual sensors, hierarchical thermal zones, support for multiple sensors per thermal zone, and extending governors to tackle temperature ranges.

        Speaker: Amit Kucheria
      • 10:10
        Scheduler interactions with thermal management 25m

        Thermal governors can respond to an overheat event for a CPU by capping the CPU's maximum possible frequency. This in turn
        means that the maximum available compute capacity of the CPU is restricted. But the Linux scheduler is unaware of maximum CPU capacity restrictions placed due to thermal
        activity. This session aims to discuss potential solutions to thermal framework / scheduler interactions.

        Speaker: Ms Thara Gopinath (Linaro)
      • 10:30
        Break 30m
      • 11:00
        Idle injection 25m

        A discussion around using idle injection as a means to do thermal management

        Speaker: Daniel Lezcano (Linaro)
      • 11:25
        Better support for virtual temperature sensors 25m

        A discussion on creating virtual temperature sensors that act as aggregators for physical sensors, thereby allowing all the framework operations to be used on them

        Speaker: Eduardo Valentin (Linux)
      • 11:50
        Audience topics and summary 40m

        Discuss miscellaneous topics from the audience and summarize the Thermal MC

        Speakers: Amit Kucheria, Eduardo Valentin (Linux)
    • 14:00–18:00
      Birds of a feather (BoF) Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77
      • 14:00
        Building the kernel with clang 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        There is once more renewed interest in clang kernel builds; this has progressed to the point where some vendors (e.g., some Android devices) are shipping builds in production, though not on x86. There are people looking at this from both the toolchain and kernel points of view on Arm and x86, and work on getting KernelCI able to provide good automated testing coverage.

        Plumbers seems like a great time to have a BoF to sync up about how to approach things, getting toolchain engineers and kernel engineers from different companies together.

        Key participants include Mark Brown, Arnd Bergmann, Todd Kjos,
        Nick Desaulniers, Behan Webster and Adhemerval Zanella.

        Speaker: Mark Brown
      • 14:30
        LLVM/Clang 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        Clang has become a viable C/C++ compiler -- it is used as the primary compiler in Android, OpenMandriva and various BSDs these days.
        Most parts of a modern Linux system can be built with Clang, but some core system components, including the kernel and some low-level core libraries (most notably glibc), are exceptions to the rule.
        Let's explore what needs to be done to make the core system compatible with both modern toolchains (clang 7.x/master and gcc 8.x/master).

        Speaker: Mr Bernhard Rosenkränzer (Linaro)
      • 15:00
        Gen-Z on Linux 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        Come chat about getting Gen-Z supported for real in Linux.

        Gen-Z (https://genzconsortium.org/) is a new system interconnect that blends capabilities of DDR, PCI, USB, Infiniband and Ethernet. Come to this BOF to discuss how best to integrate Gen-Z into Linux.

        Speakers: Keith Packard (Hewlett Packard Enterprise), Betty Dall (HPE), Rocky Craig (Hewlett Packard Enterprise), Jim Hull (Hewlett Packard Enterprise)
      • 15:30
        break 30m Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77
      • 16:00
        Solving Linux File System Pain Points 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        Why are Linux copy tools so problematic - not even calling the kernel copy APIs, and lacking features compared to other operating systems? Why is something as simple as copying a file painful for some network and cluster (and even local) file systems, and which tools (rsync, cp, gio copy etc.) should we extend first to add the missing performance features that many file systems need?
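
        For reference in that first discussion, the in-kernel copy path is reachable from userspace via copy_file_range(); a minimal sketch of a copy tool using it (glibc 2.27+ wrapper assumed, error handling trimmed) could look like:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/stat.h>
        #include <unistd.h>

        /* Copy src to dst through the kernel's copy offload path, which lets
         * filesystems use reflink or server-side copy where available. */
        int kernel_copy(const char *src, const char *dst)
        {
                int in = open(src, O_RDONLY);
                int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                struct stat st;
                ssize_t n;

                if (in < 0 || out < 0 || fstat(in, &st) < 0)
                        return -1;
                for (off_t left = st.st_size; left > 0; left -= n) {
                        n = copy_file_range(in, NULL, out, NULL, left, 0);
                        if (n <= 0)
                                return -1;
                }
                close(in);
                close(out);
                return 0;
        }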

        How can we consistently expose metadata and file system information across groups of file systems that have common information? The integration of the new xstat/statx call last year was progress, but let's discuss whether it should be extended (to allow reporting of additional attributes, e.g. for the cloud), and also the status of the proposed new "file system info" system call and the proposed new mount syscall.
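
        As a reference point for that part of the discussion, a minimal statx() call requesting one of the already-extended attributes (the birth time) looks roughly like the following; the glibc 2.28+ wrapper and current uapi field names are assumed:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/stat.h>
        #include <stdio.h>

        int print_btime(const char *path)
        {
                struct statx stx;

                if (statx(AT_FDCWD, path, AT_STATX_SYNC_AS_STAT,
                          STATX_BASIC_STATS | STATX_BTIME, &stx) == -1)
                        return -1;
                if (stx.stx_mask & STATX_BTIME)  /* not all filesystems report it */
                        printf("born: %lld\n", (long long)stx.stx_btime.tv_sec);
                return 0;
        }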

        In addition, there are key security features (including fs-verity and RichACLs) that can be discussed.

        The Linux File System layer is one of the most active areas of the kernel, and changes in the VFS, especially for network and cluster file systems benefit from discussions like these.

        Speaker: Steven French
      • 16:30
        Linux Plumbers 2018: ZUFS - Zero Copy User-Mode FileSystem - One year Later 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        [Note: Moved to Kernel Summit Track.]

        One year after its inception there are real, hardened filesystems, real performance measurements, and many innovative features. But is it ready for upstream?

        Few highlights:

        • All VFS APIs working, including DAX, mmap, IOCTLs, xattrs, ACLs, ...
          (missing quota)
        • IO API changed (From last year)
        • Support for ASYNC operations
        • Support for both pmem and regular block devices
        • Support for private memory pools
        • ZTs multi-channel and dynamic channel allocation
        • And many more ...

        In the talk I will give a short architectural and functional overview.
        Then I will go over some of the leftover challenges.
        And finally hope to engage in an open discussion of how this project should move forward to be accepted into the Kernel, gain more users and FS implementations.

        Cheers
        Boaz

        Speaker: Mr Harrosh Boaz
      • 17:00
        seccomp and libseccomp performance improvements 30m Pavillion/Ballroom-D

        Pavillion/Ballroom-D

        Sheraton Vancouver Wall Center

        77

        This is probably a better fit as a CfP in either the containers or BPF microconferences.

        seccomp is a critical component to ensuring safe containerization of untrusted processes. But at Oracle we are observing that this security often comes with an expensive performance penalty. I would like to start a discussion of how we can improve seccomp's performance without compromising security.

        Below is an open RFC I have in libseccomp that should significantly improve its performance when processing large filters. I would like to discuss other performance improvement possibilities - eBPF in general, eBPF hash map support, whitelists vs blacklists, etc. I would gladly take requests and ideas and try to incorporate them into libseccomp and seccomp as appropriate.

        https://github.com/seccomp/libseccomp/issues/116

        Several in-house Oracle customers have identified that their large seccomp filters are slowing down their applications. Their filters largely consist of simple allow/deny logic for many syscalls (306 in one case) and for the most part don't utilize argument filtering.
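
        For illustration, a filter of that shape built with the libseccomp API might look like the sketch below (link with -lseccomp); the syscall list is a made-up, much shorter stand-in for the real 306-syscall filters:

        #include <seccomp.h>
        #include <stddef.h>

        int install_allowlist(void)
        {
                const int allowed[] = { SCMP_SYS(read), SCMP_SYS(write),
                                        SCMP_SYS(close), SCMP_SYS(exit_group),
                                        SCMP_SYS(getppid) };
                scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL); /* deny by default */
                int rc = -1;
                size_t i;

                if (!ctx)
                        return -1;
                for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
                        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0))
                                goto out;
                rc = seccomp_load(ctx);   /* generates and installs the cBPF filter */
        out:
                seccomp_release(ctx);
                return rc;
        }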

        After invaluable feedback from Paul Moore and Kees Cook, I have chosen to pursue a cBPF binary tree to improve performance for these customers. A cBPF binary tree requires no kernel changes and should be transparent for all libseccomp users.

        I currently have a working prototype and the numbers are very promising. I modified gen_bpf.c and gen_pfc.c to utilize a cBPF binary tree if there are 16 or more syscalls being filtered. I then timed calling getppid() in a loop using one of my customer's seccomp filters. I ran this loop one million times and recorded the min, max, and mean times (in TSC ticks) to call getppid(). (I didn't disable interrupts, so the max time was often large.) I chose to report the minimum time because I feel it best represents the actual time to traverse the syscall.

        Test Case                      minimum TSC ticks to make syscall
        ----------------------------------------------------------------
        seccomp disabled                                             138
        getppid() at the front of 306-syscall seccomp filter         256
        getppid() in middle of 306-syscall seccomp filter            516
        getppid() at the end of the 306-syscall filter              1942
        getppid() in a binary tree                                   312
        

        As shown in the table above, a binary tree can significantly improve syscall performance in the average and worst case scenarios for these customers.

        Speaker: Tom Hromatka
    • 14:00 18:00
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
      • 14:00
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

      • 14:45
        When eBPF meets FUSE: Improving Performance of User File Systems 45m
        Speaker: Ashish Bijlani (Georgia Institute of Technology)
      • 15:30
        Break 30m
      • 16:00
        Building Stable Kernel Trees with Machine Learning 45m
        Speakers: Julia Lawall, Sasha Levin
      • 16:45
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

      • 17:30
        TAB Elections 30m
    • 14:00 17:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 14:00
        Android and the kernel: herding billions of penguins, one version at a time 45m

        Historically, kernels that ran on Android devices have typically been 2+ years old compared to mainline (this year's flagship devices are shipping with 4.9 kernels) and because of the challenges associated with updating, most devices in the field are far behind the latest long-term stable (LTS) releases. The Android team has been gradually putting in place the necessary processes and enhancements to permanently bridge this gap. Much of the work on the Android kernel in 2018 focused on improving the ability to update the kernel -- at least to recent LTS levels. This work comprises a significant testing effort to ensure downstream partners that updating to new LTS levels is safe, as well as process work to convince partners that the security benefits of taking LTS patches far outweigh the risk of new bugs. The testing also focuses on ABI consistency (within LTS releases) for interfaces relied upon by userspace and kernel modules. This has resulted in enhancements to the LTP suite and a new proposal to the mailing list for "kernel namespaces".

        Additionally, the Android kernel testing benefits from additional tools developed by Google that are enabled via the Clang compiler. Google's devices have been shipping kernels built via Clang for 2 years. The Android team tests and assists in maintaining arm and arm64 kernel builds with clang.

        The talk will also cover some of the key features being developed for Android and introduce topics that will be discussed during the Android Micro-Conference.

        Speaker: Sandeep Patil (Google)
      • 14:45
        Heterogeneous Memory Management 45m

        Heterogeneous computing uses massively parallel devices, such as GPUs, to crunch through huge data-sets. This talk intends to present the issues, challenges and problems related to memory management and heterogeneous computing: issues and problems that come from having one address space per device, which makes exchanging or sharing data-sets between devices and CPUs hard, complex and error prone.

        Solutions involve a unified address space between devices and CPU, often called SVM (Shared Virtual Memory) or SVA (Shared Virtual Address). In such a unified address space, a virtual address valid on the CPUs is also valid on the devices. The talk will address both hardware and software solutions to this problem. Moreover it will consider ways to preserve the ability to use device memory in those schemes.

        Ultimately this talk is an opportunity to discuss memory placement, as with NUMA architectures, in a world where we not only have to worry about CPUs but also about devices like GPUs and their associated memory.

        As if that were not enough, we now also have to worry about the memory hierarchy of each CPU or device. This memory hierarchy goes from fast High Bandwidth Memory (HBM) to main memory (DDR DIMMs), which can be an order of magnitude slower, and finally to persistent memory, which is large in size but slower still and with higher latency.

        Speaker: Jerome Glisse (Red Hat)
      • 15:30
        Break 30m
      • 16:00
        Documenting Linux MM for fun and for ... fun 45m

        It is well known that developers do not like writing documentation. But although documenting the code may seem dull and unrewarding, it has definite value for the writer.

        When you write the documentation you gain an insight into the algorithms, design (or lack of such), and implementation details. Sometimes you see neat code and say "Hey, that's genius!". But sometimes you discover small bugs or chunks of code that beg for refactoring. In any case, your understanding of the system significantly improves.

        I'd like to share the experience I had with the Linux memory management documentation: what its state was a few months ago, what has been done, and where we are now.

        The work on the memory management documentation is in progress and the question "Where do we want to be?" is definitely a topic for discussion and debate.

      • 16:45
        Towards a Linux Kernel Maintainer Handbook 45m

        The first rule of kernel maintenance is that there are no hard and fast rules. While there are several documents and guidelines on patch contribution, advice on how to serve in a maintainer role has historically been tribal knowledge. This organically grown state of affairs is both a source of strength and a source of friction. It has served the community well to be adaptable to the different personalities and technical problem spaces that inhabit the kernel community. However, that variability also leads to inconsistent experiences for contributors across subsystems, insufficient guidance for new maintainers, and excess stress on current maintainers. As the Linux kernel project expects to continue its rate of growth, it needs to be able both to scale the maintainers it has and to ramp up new ones without necessarily requiring them to make a decade's worth of mistakes to become proficient.

        The presentation makes the case for why a maintainer handbook is needed, including frequently committed mistakes and commonly encountered pain points. It broaches the "whys" and "hows" of contributors having significantly different experiences with the Linux kernel project depending on what subsystem they are contributing to. The talk is an opportunity to solicit maintainers in the audience on the guidelines they would reference on an ongoing basis, and it is an opportunity for contributors to voice wish list items when working with upstream. Finally, it will be a call to action to expand the document with subsystem-local rules of the road where those local rules differ from, or go beyond, the common recommendations.

        Speaker: Dan Williams (Intel Open Source Technology Center)
    • 14:00 18:25
      RT MC Pavillion/Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-C

      Sheraton Vancouver Wall Center

      58

      Since 2004 a project has been going on trying to make the Linux kernel into a true hard Real-Time operating system. This project has become known as PREEMPT_RT (formerly the "real-time patch"). Over the past decade, there was a running joke that this year PREEMPT_RT would be merged into the mainline kernel, but that has never happened. In actuality, it has been merged in pieces. Examples of what came from PREEMPT_RT include: mutexes, high resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The only thing left is turning spin_locks into mutexes, and that is now mature enough to make its way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged!

      Getting PREEMPT_RT into the kernel was a battle, but it is not the end of the war. Once PREEMPT_RT is in mainline, there's still much more work to be done. The RT developers have been so focused on getting RT into mainline, that little has been thought about what to do when it is finally merged. There is a lot to discuss about what to do after RT is in mainline. The game is not over yet.

      • 14:00
        Real-Time Condition Variables: librtpi 20m

        POSIX condition variables (condvars) provide a commonly used interprocess communication mechanism. Threads can queue up and wait for an event before continuing. The glibc implementation of condvars in 2009 was not suitable for use in real-time systems due to a potential priority inversion. A fix has been available and used in many real-time systems since that time. A recent change to glibc to address a POSIX compliance issue with condvars broke that fix and modified the implementation in such a way as to prevent real-time usage of glibc condvars by introducing a new type of priority inversion.

        The real-time use case places constraints on condvar usage patterns, such as requiring a PI mutex to be associated with the condvar prior to a signal or broadcast. Most importantly, the implementation must always wake the waiters in priority FIFO order. To address this usage, the librtpi project provides a narrow real-time specific implementation of the condition variable mechanism in a way which can be readily used in lieu of the glibc implementation.
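
        As a rough illustration of that usage pattern (shown here with plain pthread APIs rather than the librtpi interface itself), the PI mutex is the one passed to the wait:

        #include <pthread.h>
        #include <stdbool.h>

        static pthread_mutex_t lock;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static bool ready;

        static void init_pi_lock(void)
        {
                pthread_mutexattr_t attr;

                pthread_mutexattr_init(&attr);
                /* Priority inheritance: a low-priority lock holder is boosted
                 * while a high-priority thread waits on the lock. */
                pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
                pthread_mutex_init(&lock, &attr);
                pthread_mutexattr_destroy(&attr);
        }

        static void wait_for_event(void)
        {
                pthread_mutex_lock(&lock);
                while (!ready)
                        pthread_cond_wait(&cond, &lock); /* PI mutex associated here */
                pthread_mutex_unlock(&lock);
        }

        static void post_event(void)
        {
                pthread_mutex_lock(&lock);
                ready = true;
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
        }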

        We will discuss the motivation and the current state of the project, as well as the long term strategy for a stable real-time condvar implementation.

        Speakers: Darren Hart, Julia Cartwright
      • 14:20
        BPF is not a -rt debugging tool 25m

        Xenomai offers a wonderful debugging feature: whenever a realtime thread calls a non-rt-safe syscall, SIGXCPU is delivered. That is particularly helpful for users who build their application on top of libraries. Often it is not clear what the side effects of a library call are.

        What options are there to implement something similar to SIGXCPU? A simple prototype using BPF showed the limits of BPF (a similar experience to the one Joel describes in "BPFd: Running BCC tools remotely across systems and architectures" https://lwn.net/Articles/744522/).

        A short brainstorming session with Steven, Julia and Clark showed that there are a few options to achieve the goal. In this session I would like to discuss the options (or even show what has been achieved so far).
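
        One sketch of an option in this space, separate from the BPF prototype mentioned above, is to (ab)use seccomp: a small classic-BPF filter can mark a chosen set of syscalls with SECCOMP_RET_TRAP, which delivers SIGSYS (rather than SIGXCPU) to the offending thread. A minimal, illustrative version (architecture check omitted):

        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/prctl.h>
        #include <stddef.h>

        /* Deliver SIGSYS whenever this thread issues syscall @nr. */
        static int trap_on_syscall(int nr)
        {
                struct sock_filter filter[] = {
                        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                 offsetof(struct seccomp_data, nr)),
                        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
                        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
                        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                };
                struct sock_fprog prog = {
                        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
                        .filter = filter,
                };

                if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                        return -1;
                return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
        }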

        Key participants:
        - Steven Rostedt
        - Julia Cartwright
        - Clark Williams
        - Sebastian Andrzej Siewior
        - Thomas Gleixner

      • 14:45
        Data analysis of Jitter and more 25m

        Julia has worked with an Extreme-Value-Analysis tool that can analyze a lot of data. Given various output runs of jitterdebug, which collects all jitter data (outliers and all), could that tool be useful for analysing the data it produces?

      • 15:10
        Beyond the latency: New metrics for the real-time kernel 20m

        The PREEMPT_RT's current metric, the latency, is good. It helped to guide the
        development of preempt_rt for more than a decade. However, in real-time
        analysis, the principal metric for the system is the response time of tasks.
        Generally, in addition to the latency, the response time of a task comprises the
        task's execution time, the blocking time on locks, the overhead associated with
        scheduling and so on. Although we can think of ways to measure such values on
        Linux, we 1) don't have a single/standardized way to do this, and 2) don't do
        regression tests to see if things changed from one version to another.

        This talk will discuss these points, collecting ideas on how to proceed toward
        the development of new metrics and ways to use them to test the kernel-rt.
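
        For reference, one way to write the relationship sketched above, borrowing
        the classical fixed-priority response-time recurrence and adding the
        latency term explicitly (an illustrative formulation, not necessarily the
        metric the talk will propose), is:

            R_i = L_i + C_i + B_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil C_j

        where L_i is the scheduling latency, C_i the task's execution time, B_i
        the blocking time on locks, and the sum accounts for preemption by
        higher-priority tasks j with periods T_j.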

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 15:30
        Break 30m
      • 16:00
        SCHED_DEADLINE what's next 25m

        Lots of research has been done on various issues like CPU affinity, bandwidth inheritance, and cgroup support; but nothing has made it into the kernel. Let's make a commitment and push these ideas forward into mainline.
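
        For context, a task opts into SCHED_DEADLINE through sched_setattr(); a minimal sketch follows (the struct is declared locally since glibc does not export it, the 10ms/30ms budget and period are arbitrary examples, and CAP_SYS_NICE is required):

        #define _GNU_SOURCE
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdint.h>

        #ifndef SCHED_DEADLINE
        #define SCHED_DEADLINE 6
        #endif

        struct sched_attr {
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime;
                uint64_t sched_deadline;
                uint64_t sched_period;
        };

        int become_deadline_task(void)
        {
                struct sched_attr attr = {
                        .size           = sizeof(attr),
                        .sched_policy   = SCHED_DEADLINE,
                        .sched_runtime  = 10 * 1000 * 1000,  /* 10ms budget ...  */
                        .sched_deadline = 30 * 1000 * 1000,  /* ... due in 30ms  */
                        .sched_period   = 30 * 1000 * 1000,  /* ... every 30ms   */
                };

                /* 0 == current thread */
                return syscall(SYS_sched_setattr, 0, &attr, 0);
        }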

      • 16:25
        RT priority inside usernamespace 25m

        Setting RT priority inside a user namespace is not allowed, even for a mapped root uid. The use case is to be able to run RT-priority processes inside a user namespace. Should there be a way to allow this, subject to cgroup RT limits, if a cgroup is dedicated to the user namespace?

      • 16:50
        What to do after PREEMPT_RT is accepted into mainline 20m

        Is there a need to change anything about how we maintain the stable-rt trees? Or should we focus all effort on supporting the mainline tree?

        Potential attendees:
        - Steven Rostedt
        - Sebastian Andrzej Siewior
        - Thomas Gleixner
        - Tom Zanussi
        - Julia Cartwright

      • 17:10
        How can we catch problems that can break the PREEMPT_RT preemption model? 20m

        The fully preemptive preemption model and real-time mutexes are the main features of PREEMPT_RT.

        How do we check if we are respecting all the rules for these features, e.g., how do we check if changes in the kernel are not breaking the preemption or the locking model?

        For locking, we already have an answer: Lockdep!

        But how about the preemption model?

        The presenter has a preliminary formalization of the preemption model, and he wants to discuss how to implement the validator of the model. Should it be in kernel or user-space? Tracing or a "validator" like lockdep?

    • 18:00 21:00
      Welcome Party 3h
    • 09:00 12:30
      Device Tree MC Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77

      Topics
      Binding and Devicetree Source/DTB Validation: update and next steps
      Binding specification format
      Validation Process and Process
      How to validate overlays
      Devicetree Specification: update and next steps

      Reducing devicetree memory and storage size

      Overlays

      Bootloader and Linux kernel implementation update
      Remaining blockers and issues
      Use cases
      Devicetree compiler (dtc)

      Next version of DTB/FDT format
      Motivated by desire to replace metadata being encoded as normal data (metadata for overlays)
      Other desired changes should be considered
      Boot and Run-time Configuration
      Pain points and needs
      Multi-bus devices

      Feedback from the trenches

      how DTOs are used in embedded devices in practice
      in U-Boot and Linux
      in systems with FPGAs
      Use of devicetrees in small code/data space (e.g. U-Boot SPL)

      Connector node bindings

      FPGA issues

    • 09:00 12:30
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
    • 09:00 12:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 09:00
        Exploring New Frontiers in Container Technology 45m

        Containers (or Operating System based Virtualization) are an old
        technology; however, the current excitement (and consequent
        investment) around containers provides interesting avenues for
        research on updating the way we build and manage container technology.
        The most active area of research today, thanks to concerns raised by
        groups supporting other types of virtualization, is in improving the
        security properties of containers.

        The first step in improving security is actually being able to measure
        it in the first place, so the initial goal of a research programme for
        container security involves finding that measure. In this talk I'll
        outline one such measure (attack profiles) developed by IBM research,
        the useful results that can be derived from it, the problems it has
        and the avenues that can be explored to refine future measurements of
        containment.

        Contrary to popular belief, a "container" doesn't describe one fixed
        thing, but instead is a collective noun for a group of isolation and
        resource control primitives (in Linux terminology called namespaces
        and cgroups) the composition of which can be independently varied. In
        the second half of this talk, we'll explore how containment can be
        improved by replacing some of the isolation primitives with local
        system call emulation sandboxes, a promising technique used by both
        the Google gVisor and the IBM Nabla secure container systems. We'll
        also explore the question of whether sandboxes are the end point of
        container security research or merely point the way to the next
        Frontier for container abstraction.

        Speaker: James Bottomley (IBM)
      • 09:45
        Open Source GPU compute stack - Not dancing the CUDA dance 45m

        Using graphics cards for compute acceleration has been a major shift in technology lately, especially around AI/ML and HPC.

        Until now the clear market leader has been the CUDA stack from NVIDIA, which is a closed source solution that runs on Linux. Open source applications like tensorflow (AI/ML) rely on this closed stack to utilise GPUs for acceleration.

        Vendor aligned stacks such as AMD's ROCm and Intel's OpenCL NEO are emerging that try to fill the gap for their specific hardware platforms. These stacks are very large, and don't share much if any code. There are also efforts inside groups like Khronos with their OpenCL, SPIR-V and SYCL standards being made to produce something that can work as a useful standardised alternative.

        This talk will discuss the possibility of creating a vendor neutral reference compute stack, based around open source technologies and open source development models, that could execute compute tasks across multiple vendors' GPUs: using SYCL/OpenCL/Vulkan and the open-source Mesa stack as the basis for a future stack that tools and features could be developed on top of as part of a desktop OS.

        This talk doesn't have all the answers, but it wants to get people considering what we can produce in the area.

        Speaker: David Airlie
      • 10:30
        Break 30m
      • 11:00
        Proactive Defense Against CPU Side Channel Attacks 45m

        Side channel attacks are here to stay. What can we do inside the operating system to proactively defend against them? This talk will walk through a few of the ideas that Intel’s Open Source Technology Center are developing to improve our resistance to side channel attacks as part of our new side channel defense project. We would also like to gather ideas from the rest of the community on what our top priorities for side channel defense for the Linux kernel should be.

        Speaker: Kristen Accardi
      • 11:45
        Untrusted Filesystems 45m

        Plugging in USB sticks, building VM images, and unprivileged containers all give rise to a situation where users are mounting and dealing with filesystem images they have not built themselves, and don't necessarily want to trust.

        This leads to the problem of how to mount and read/write those filesystems without opening yourself up to more risk than visiting a web page.

        I will survey what has been built already, describe the technical challenges, and outline the problems ahead.

        With this talk I hope to unite the various groups across the Linux ecosystem that care about this problem and get the discussion started on how we can move forward.

        Speaker: Eric Biederman
    • 09:00 18:00
      Networking Track Junior/Ballroom-C (Sheraton Vancouver Wall Center)

      Junior/Ballroom-C

      Sheraton Vancouver Wall Center

      67

      A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.

      Official Networking Track website: http://vger.kernel.org/lpc-networking2018.html

      • 09:00
        Daily opening, announcements, etc. 20m
      • 09:55
        Combining kTLS and BPF for Introspection and Policy Enforcement 35m

        This talk is divided into two parts. First we present kTLS and the current kernel's
        sockmap BPF architecture for L7 policy enforcement, as well as the kernel's ULP and
        strparser frameworks, which are utilized by both in order to hook into socket callbacks
        and determine message boundaries for subsequent processing.
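
        As a reference point, attaching the "tls" ULP and handing the kernel the
        TX crypto state of an already-completed TLS handshake looks roughly like
        the sketch below (constants are guarded in case older userspace headers
        lack them):

        #include <linux/tls.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>
        #include <string.h>

        #ifndef SOL_TCP
        #define SOL_TCP 6
        #endif
        #ifndef TCP_ULP
        #define TCP_ULP 31
        #endif
        #ifndef SOL_TLS
        #define SOL_TLS 282
        #endif

        int enable_ktls_tx(int sock, const unsigned char *key,
                           const unsigned char *iv, const unsigned char *salt,
                           const unsigned char *rec_seq)
        {
                struct tls12_crypto_info_aes_gcm_128 ci;

                memset(&ci, 0, sizeof(ci));
                ci.info.version = TLS_1_2_VERSION;
                ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
                memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
                memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
                memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
                memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

                /* Attach the "tls" ULP, then install TX state: subsequent
                 * send() payloads are encrypted by the kernel. */
                if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")))
                        return -1;
                return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
        }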

        We further elaborate on the challenges we face when trying to combine kTLS with the
        power of BPF for the eventual goal of allowing in-kernel introspection and policy
        enforcement of application data before encryption. Besides others, this includes a
        discussion on various approaches to address the shortcomings of the current ULP layer,
        optimizations for strparser, and the consolidation of scatter/gather processing for
        kTLS and sockmap as well as future work on top of that.

        Speakers: Daniel Borkmann (Cilium), John Fastabend (Cilium)
      • 10:30
        Break 30m
      • 11:00
        Optimizing UDP for Content Delivery with GSO, Pacing and Zerocopy 35m

        UDP is a popular foundation for new protocols. It is available across
        operating systems without superuser privileges and widely supported
        by middleboxes. Shipping protocols in userspace on top of
        a robust UDP stack allows for rapid deployment, experimentation
        and innovation of network protocols.

        But implementing protocols in userspace has limitations. The
        environment lacks access to features like high resolution timers
        and hardware offload. Transport cost can be high. Cycle count of
        transferring large payloads with UDP can be up to 3x that of TCP.

        In this talk we present recent and ongoing work, both by the authors
        and others, at improving UDP for content delivery.

        UDP Segmentation offload amortizes transmit stack traversal by
        sending as many as 64 segments as one large fused packet.
        The kernel passes this through the stack as one datagram, then
        splits it into multiple packets and replicates their network and
        transport headers just before handing to the network device.
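
        A minimal sketch of the transmit-side API (UDP_SEGMENT, available since
        Linux 4.18; the 1400-byte segment size is an arbitrary example) looks
        like:

        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <sys/types.h>

        #ifndef SOL_UDP
        #define SOL_UDP 17
        #endif
        #ifndef UDP_SEGMENT
        #define UDP_SEGMENT 103
        #endif

        /* Hand the stack one large buffer; the kernel (or the NIC, with
         * hardware support) splits it into 1400-byte datagrams. */
        ssize_t send_gso(int fd, const struct sockaddr_in *dst,
                         const char *buf, size_t len)
        {
                int gso_size = 1400;

                /* Per-socket setting; it can also be passed per call via cmsg. */
                if (setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size,
                               sizeof(gso_size)))
                        return -1;
                return sendto(fd, buf, len, 0,
                              (const struct sockaddr *)dst, sizeof(*dst));
        }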

        Some devices can offload segmentation for exact multiples of
        segment size. We discuss how partial GSO support combines the
        best of software and hardware offload and evaluate the benefits of
        segmentation offload over standard UDP.

        With these large buffers, MSG_ZEROCOPY becomes effective at
        removing the cost of copying in sendmsg, often the largest
        single line item in these workloads. We extend this to UDP and
        evaluate it on top of GSO.

        Bursting too many segments at once can cause drops and retransmits.
        SO_TXTIME adds a release time interface which allows offloading of
        pacing to the kernel, where it is both more accurate and cheaper.
        We will look at this interface and how it is supported by queuing
        disciplines and hardware devices.

        Finally, we look at how these transmit savings can be extended to
        the forwarding and receive paths through the complement of GSO,
        GRO, and local delivery of fused packets.

        Speaker: Willem de Bruijn (Google)
      • 11:35
        Bringing the Power of eBPF to Open vSwitch 45m

        Among the various ways of using eBPF, OVS has been exploring the power
        of eBPF in three: (1) attaching eBPF to TC, (2) offloading a subset of
        processing to XDP, and (3) by-passing the kernel using AF_XDP.
        Unfortunately, as of today, none of the three approaches satisfies the
        requirements of OVS. In this presentation, we'd like to share the
        challenges we faced and the lessons we learned, and seek feedback from
        the community on the future direction.

        Attaching eBPF to TC started first with the most aggressive goal: we
        planned to re-implement the entire features of OVS kernel datapath
        under net/openvswitch/* into eBPF code. We worked around a couple of
        limitations, for example, the lack of TLV support led us to redefine a
        binary kernel-user API using a fixed-length array; and without a
        dedicated way to execute a packet, we created a dedicated device for
        user to kernel packet transmission, with a different BPF program
        attached to handle packet execute logic. Currently, we are working on
        connection tracking. Although a simple eBPF map can achieve basic
        operations of conntrack table lookup and commit, how to handle NAT,
        (de)fragmentation, and ALG are still under discussion.

        Moving one layer below TC is called XDP (eXpress Data Path), a much
        faster layer for packet processing, but with almost no extra packet
        metadata and limited BPF helpers support. Depending on the complexity
        of flows, OVS can offload a subset of its flow processing to XDP when
        feasible. However, the fact that XDP supports fewer helper functions
        implies that either 1) only a very limited number of flows are eligible
        for offload, or 2) more flow processing logic needs to be done in
        native eBPF.
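
        For readers less familiar with XDP, a minimal program of the kind such
        an offload would be built from might look like the sketch below
        (libbpf-style section annotations assumed; the drop-all-UDP rule is
        purely illustrative):

        /* Compile with: clang -O2 -target bpf -c xdp_sketch.c */
        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/ip.h>
        #include <linux/in.h>

        #define SEC(name) __attribute__((section(name), used))
        #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        #define bpf_htons(x) __builtin_bswap16(x)
        #else
        #define bpf_htons(x) (x)
        #endif

        SEC("xdp")
        int xdp_drop_udp(struct xdp_md *ctx)
        {
                void *data = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                struct iphdr *iph;

                /* Bounds checks keep the verifier happy. */
                if ((void *)(eth + 1) > data_end)
                        return XDP_PASS;
                if (eth->h_proto != bpf_htons(ETH_P_IP))
                        return XDP_PASS;
                iph = (void *)(eth + 1);
                if ((void *)(iph + 1) > data_end)
                        return XDP_PASS;

                return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";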

        AF_XDP represents another form of XDP, with a socket interface for
        control plane and a shared memory API for accessing packets from
        userspace applications. OVS today has another full-fledged datapath
        implementation in userspace, called dpif-netdev, used by DPDK
        community. By treating the AF_XDP as a fast packet-I/O channel, the
        OVS dpif-netdev can satisfy almost all existing features. We are
        working on building the prototype and evaluating its performance.

        RFC patch:
        OVS eBPF datapath.
        https://www.mail-archive.com/iovisor-dev@lists.iovisor.org/msg01105.html

        Speakers: William Tu (VMware), Joe Stringer (Isovalent), Yi-Hung Wei (VMware), Yifeng Sun (VMware)
      • 14:00
        What's Happened to the World of Networking Hardware Offloads? 35m

        Over the last 10 years the world has seen NICs go from single port,
        single netdev devices, to multi-port, hardware switching, CPU/NFP
        having, FPGA carrying, hundreds of attached netdev providing,
        behemoths. This presentation will begin with an overview of the
        current state of filtering and scheduling, and the evolution of the
        kernel and networking hardware interfaces. (HINT: it’s a bit of a
        jungle we’ve helped grow!) We’ll summarize the different kinds of
        networking products available from different vendors, and show the
        workflows of how a user can use the network hardware
        offloads/accelerations available and where there are still gaps. Of
        particular interest to us is how to have a useful, generic hardware
        offload supporting infrastructure (with seamless software fallback!)
        within the kernel, and we’ll explain the differences between deploying
        an eBPF program that can run in software, and one that can be
        offloaded by a programmable ASIC based NIC. We will discuss our
        analysis of the cost of an offload, and when it may not be a great
        idea to do so, as hardware offload is most useful when it achieves the
        desired speed and requires no special software (kernel changes). Some
        other topics we will touch on: the programmability exposed by smart
        NICs is more than that of a data plane packet processing engine and
        hence any packet processing programming language such as eBPF or P4
        will require certain extensions to take advantage of the device
        capabilities in a holistic way. We’ll provide a look into the future
        and how we think our customers will use the interfaces we want to
        provide both from our hardware, and from the kernel. We will also go
        over the matrix of most important parameters that are shaping our HW
        designs and why.

        Speakers: Jesse Brandeburg (Intel), Anjali Singhai Jain (Intel)
      • 14:35
        XDP 1.5 Years In Production. Evolution and Lessons Learned. 35m

        Today every packet reaching Facebook's network is processed by an XDP-enabled application. We have been using it for more than 1.5 years, and this talk is about the evolution of XDP and BPF, which has been driven by our production needs. I'm going to talk about the history of changes in core BPF components, and will show why and how they were done, what performance improvements we got (with synthetic and real-world data) and how they were implemented. I'm also going to talk about issues and shortcomings of BPF/XDP which we have found during our operations, as well as some gotchas and corner cases. In the end we are going to discuss what is still missing and which parts could be improved.

        Topics and areas of existing BPF/XDP infrastructure which are going to be covered in this talk:

        • why helpers such as bpf_adjust_head/bpf_adjust_tail have been added
        • unittesting and microbenchmarking with bpf_prog_test_run: how to add test coverage for your BPF program and track regressions (we are going to cover how Spectre affected the BPF kernel infrastructure and what tweaks have been made to get some performance back)
        • how map-in-map helps us to scale and make sure that we don't waste memory
        • NUMA aware allocation for BPF maps
        • inline lookups for BPF arrays/map-in-map

        Lessons which we have learned during operation of XDP:

        • BPF instruction counts vs complexity
        • How to attach more than one XDP program to the interface
        • when LLVM and the verifier disagree: some tricks to force LLVM to generate proper BPF
        • we will briefly discuss HW limitations: NIC bandwidth vs packets-per-second performance

        Missing parts: what and why could be added:

        • the need for hardware checksumming offload
        • bounded loops: what they would allow us to do
        Speaker: Nikita V. Shirokov (Facebook)
      • 15:10
        Keynote: "This Talk Is Not About XDP: From Resource Limits to SKB Lists" 25m
        Speaker: David Miller (Red Hat Inc.)
      • 15:35
        Break 25m
      • 16:00
        TC SW Datapath: A Performance Analysis 35m

        Currently the Linux kernel implements two distinct datapaths for Open
        vSwitch: the ovskdp and the TC DP. The latter has been added recently
        mainly to allow HW offload, while the former is usually preferred for
        SW based forwarding due to functional and performance reasons.

        We evaluate both datapaths in a typical forwarding scenario - the PVP
        test - using the perf tool to identify bottlenecks in the TC SW dp.
        While similar steps usually incur similar costs, the TC SW DP
        requires an additional, per packet, skb_clone, due to a TC actions
        constraint.

        We propose to extend the existing act infrastructure, leveraging the
        ACT_REDIRECT action and the bpf redirect code, to allow clone-free
        forwarding from the mirred action and then re-evaluate the datapaths
        performances: the gap is then almost already closed.

        Nevertheless, TC SW performance can be further improved by completing
        the RCU-ification of the TC actions and extending the recent
        listification infrastructure to the TC (ingress) hook. We also plan to
        compare the TC SW datapath with a custom eBPF program implementing the
        equivalent flow set, to establish a reference value for the target
        performance.

        Speakers: Paolo Abeni (Red Hat), Davide Caratti (Red Hat), Eelco Chaudron (Red Hat), Marcelo Ricardo Leitner (Red Hat)
      • 16:35
        Using eBPF as an Abstraction for Switching 35m

        eBPF (extended Berkeley Packet Filter) has been shown to be a flexible
        kernel construct used for a variety of use cases, such as load balancing,
        intrusion detection systems (IDS), tracing and many others. One such
        emerging use case revolves around the proposal made by William Tu for
        the use of eBPF as a data path for Open vSwitch. However, there are
        broader switching use cases developing around the use of eBPF capable
        hardware. This talk is designed to explore the bottlenecks that exist in
        generalising the application of eBPF further to both container switching as
        well as physical switching.

        Topics that will be covered include proposals for container isolation through
        the use of features such as programmable RSS, the viability of physical
        switching using eBPF capable hardware as well as integrations with other
        subsystems or additional helper functions which may improve the possible
        functionality.

        Speaker: Nick Viljoen (Netronome)
      • 17:10
        BPF Host Network Resource Management 35m

        Linux currently provides mechanisms for managing and allocating many of the system resources such as CPU, Memory, etc. Network resource management is more complicated since networking deals not only with a local resource, such as CPU management does, but can also deal with a global resource. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.

        For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the Hierarchical Queuing Discipline (htb) which is usually required for anything other than simple shaping needs. A second drawback is the lack of flexibility usually provided by general programming constructs.

        We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.

        Speakers: Lawrence Brakmo (Facebook), Alexei Starovoitov (Facebook)
      • 17:45
        Closing 15m
    • 09:00 12:30
      Performance and Scalability MC Pavillion/Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-C

      Sheraton Vancouver Wall Center

      58

      Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the ``free lunch'' of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.

      Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.

      We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.

      • 09:00
        Scheduler task accounting for cgroups 15m

        Cgroup accounting has significant overhead due to the need to constantly loop over all cpus to update statistics of cpu usage and blocked averages. We have seen that on a 4 socket Haswell, database benchmarks like TPCC had an 8% performance regression, at the time of Haswell and the 4.4 kernel, when run under cgroups. On the recent Cannon Lake platform, using the latest PCIe SSDs and the 4.18 kernel, the regression in the scheduler has gotten worse, to 12%. We will highlight the bottlenecks in the scheduler with detailed profiles of the hot path. We'd like to explore possible avenues to improve cgroup accounting.

        Speaker: Tim Chen
      • 09:15
        Seamlessly update hypervising kernel 15m

        Discuss two possible approaches to live update Linux that runs as a hypervisor without a noticeable effect on running Virtual Machines (VM). One method is to use cooperative multi-OSing paradigm to share the same machine between two kernels while the new kernel is booting, and the old kernel is still serving the running VM instances. Allow the new kernel to live migrate the drivers from the old kernel by using shadow class drivers, and later do the live migration of running VMs without copying their memory. The second method is to boot new kernel in a fully virtualized environment, that is the same as the underlying hardware, live migrate the VMs into the newly booted hypervisor, and make the hypervisor transition from the VM environment to bare metal.

        Speaker: Pavel Tatashin
      • 09:30
        Load balancing via scalable task stealing 30m

        Summary:
        In this talk I discuss scalability of load balancing algorithms in the task scheduler, and present my work on tracking overloaded CPUs with a bitmap, and using the bitmap to steal tasks when CPUs become idle.

        Abstract:
        The scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the push and pull sides on large systems. For pulls, the scheduler searches all CPUs in successively larger domains until an overloaded CPU is found, and pulls a task from the busiest group. This is very expensive, so search time is limited by the average idle time, and some domains are not searched. Balance is not always achieved, and idle CPUs go unused.

        I propose an alternate mechanism that is invoked after the existing search limits itself and finds nothing. I maintain a bitmap of overloaded CPUs, where a CPU sets its bit when its runnable CFS task count exceeds 1. The bitmap is sparse, with a limited number of significant bits per cacheline. This reduces cache contention when many threads concurrently set, clear, and visit elements. There is a bitmap per last-level cache. When a CPU becomes idle, it finds the first overloaded CPU in the bitmap and steals a task from it. For certain configurations and test cases, this optimization improves hackbench performance by 27%, OLTP by 9%, and tbench by 16%, with a minimal cost in search time. I present schedstat data showing the change in vital scheduler metrics before and after the optimization.
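
        A toy sketch of the idea (not the actual implementation, and omitting the per-cacheline sparseness trick) could look like:

        #include <stdatomic.h>

        /* One bit per CPU in the last-level-cache domain (<= 64 here). */
        static _Atomic unsigned long overload_mask;

        /* Called when a CPU's runnable CFS task count crosses 1. */
        static void update_overload(int cpu, int nr_cfs_runnable)
        {
                unsigned long bit = 1UL << cpu;

                if (nr_cfs_runnable > 1)
                        atomic_fetch_or(&overload_mask, bit);
                else
                        atomic_fetch_and(&overload_mask, ~bit);
        }

        /* Called when @self goes idle; returns a CPU to steal from, or -1. */
        static int find_steal_victim(int self)
        {
                unsigned long mask = atomic_load(&overload_mask) & ~(1UL << self);

                return mask ? __builtin_ctzl(mask) : -1;
        }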

        For now the new stealing is confined to the LLC to avoid NUMA effects, but it could be extended to steal across nodes in the future. It could also be extended to the realtime scheduling class. Lastly, the sparse bitmap could be used to track idle cores and idle CPUs and used to optimize balancing on the push side.

        Speaker: Steven Sistare (Oracle)
      • 10:00
        Scheduler and pipe sleep wakeup scalability 30m

        1) Scalability of scheduler idle cpu and core search on systems with large number of cpus

        Current select_idle_sibling first tries to find a fully idle core using select_idle_core which can potentially search all cores and if it fails it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. These don't scale for large llc domains and will only get worse with more cores in future. Spending too much time in the scheduler will hurt performance of very context switch intensive workloads. A more scalable way to do the search is desirable which is not O(no. of cpus) or O(no. of cores) in worst case.

        2) Scalability of idle cpu stealing on systems with large number of cpus and domains

        When a cpu becomes idle it tries to steal threads from other overloaded cpus using idle_balance. idle_balance does more work because it searches widely for the busiest CPU to offload, so to limit its CPU consumption, it declines to search if the system is too busy. A more scalable/lightweight way of stealing is desirable so that we can always try to steal with very little cost.

        3) Discuss workloads that use pipes and can benefit from pipe busy waits

        When a pipe is full or empty a thread goes to sleep immediately. If the sleep/wakeup happens very fast, the cost of the sleep/wakeup overhead can hurt a very context-switch-sensitive workload that is using pipes heavily. A few microseconds of busy wait before sleeping can avoid the overhead and improve the performance. Network sockets have a similar capability. So far hackbench with pipes shows huge improvements; we want to discuss other potential use cases.

        Speaker: Subhra Mazumdar
      • 10:30
        Break 30m
      • 11:00
        Promoting huge page usage 30m

        Huge pages are essential to addressing performance bottlenecks
        since the base page sizes are not changing while the amount of memory is
        ever increasing. Huge pages can address TLB misses but also memory
        overhead in the Linux kernel that arises through page faults and other
        compute intensive processing of small pages. Huge pages are required
        with contemporary high speed NVME ssds to reach full throughput because
        the I/O overhead can be reduced and large contiguous memory I/O can then
        be scheduled by the devices. However, using huge pages often requires the
        modification of applications if transparent huge pages cannot be used.
        Transparent huge pages also require application specific setup to work
        effectively.
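
        For illustration, the two application-side paths alluded to above, transparent huge pages via madvise() versus explicit hugetlbfs pages, might be requested like this (sizes and alignment are arbitrary examples):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <stdlib.h>

        #define SZ (512UL << 20)   /* 512 MiB working set */

        void *alloc_thp_friendly(void)
        {
                void *p;

                /* 2 MiB-aligned region, hinted for transparent huge pages. */
                if (posix_memalign(&p, 2UL << 20, SZ))
                        return NULL;
                madvise(p, SZ, MADV_HUGEPAGE);
                return p;
        }

        void *alloc_hugetlb(void)
        {
                /* Explicit huge pages; needs pages reserved beforehand,
                 * e.g. via /proc/sys/vm/nr_hugepages. */
                void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                return p == MAP_FAILED ? NULL : p;
        }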

        Speakers: Christopher Lameter (Jump Trading LLC), Mike Kravetz
      • 11:30
        Workqueues and CPU Hotplug 15m

        Flexible workqueues: Currently we have two pool set-ups for workqueues: 1) the per-cpu workqueue pool and 2) the unbound workqueue pool. The former requires the users of workqueues to have some knowledge of cpu online state, as shown in:

        https://lore.kernel.org/lkml/20180625224332.10596-2-paulmck@linux.vnet.ibm.com/T/#u

        The latter (unbound workqueue) only has one pool per NUMA node, and that may hurt scalability if we want to run multiple tasks in parallel inside a NUMA node.

        Therefore, there is a clear requirement for a workqueue set-up that provides a flexible level of parallelism (i.e. one that can run as many tasks as possible while saving users from worrying about races with cpu hotplug).
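
        For reference, the two existing set-ups being contrasted look roughly like the following kernel-side sketch (all names here are made up for illustration):

        #include <linux/init.h>
        #include <linux/module.h>
        #include <linux/workqueue.h>

        static struct workqueue_struct *percpu_wq;
        static struct workqueue_struct *unbound_wq;

        static void demo_work_fn(struct work_struct *work)
        {
                /* ... the deferred work itself ... */
        }
        static DECLARE_WORK(demo_work, demo_work_fn);

        static int __init wq_demo_init(void)
        {
                /* Bound (per-cpu) pool: work runs on the submitting CPU, so
                 * users must care about cpu online/offline state. */
                percpu_wq = alloc_workqueue("wq_demo_percpu", 0, 0);

                /* Unbound pool: hotplug-agnostic, but one worker pool per
                 * NUMA node limits parallelism within a node. */
                unbound_wq = alloc_workqueue("wq_demo_unbound", WQ_UNBOUND, 0);

                if (!percpu_wq || !unbound_wq) {
                        if (percpu_wq)
                                destroy_workqueue(percpu_wq);
                        if (unbound_wq)
                                destroy_workqueue(unbound_wq);
                        return -ENOMEM;
                }

                queue_work(unbound_wq, &demo_work);
                return 0;
        }

        static void __exit wq_demo_exit(void)
        {
                flush_work(&demo_work);
                destroy_workqueue(percpu_wq);
                destroy_workqueue(unbound_wq);
        }

        module_init(wq_demo_init);
        module_exit(wq_demo_exit);
        MODULE_LICENSE("GPL");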

        We'd like to have a session to talk about the requirements and possible solutions.

        Speaker: Boqun Feng
      • 11:45
        ktask: Parallelizing CPU-intensive kernel work 15m

        Certain CPU-intensive tasks in the kernel can benefit from multithreading, such as zeroing large ranges of memory, initializing massive state (struct page) at boot, VFIO page pinning, XFS quotacheck, and freeing memory on munmap/exit. There is currently no interface that provides this service. ktask is a framework built on workqueues that splits up the work, chooses the number of threads to use, synchronizes these threads, and load balances the work between them. I want to discuss current issues with this work, including allowing ktask threads to play well with the scheduler, cgroup awareness so ktask threads are throttled appropriately, and appropriately enabling ktask according to power management settings.

        Speaker: Daniel Jordan
      • 12:00
        Reducing the number of users of mmap_sem 15m

        The mmap_sem has long been a contention point in the memory management
        subsystem. In this session some mmap_sem related topics will be
        discussed. Some optimizations have been merged into the upstream kernel to
        avoid holding mmap_sem for write for an excessive period of time in the
        munmap path, by downgrading the write mmap_sem to read. And some
        optimizations are under discussion on the mailing list, e.g. releasing
        mmap_sem earlier for page cache readahead, and the speculative page fault.
        There is still room for optimization by figuring out just what mmap_sem
        protects. It covers access to many fields in the mm_struct structure.
        It is also used for the virtual memory area (VMA) red-black tree, the
        process VMA list, and various fields within the VMA structure itself.
        Finer grained locks, e.g. a range lock or per-VMA locks, might better
        replace mmap_sem to reduce contention.

        Speaker: Yang Shi (Alibaba Group)
      • 12:15
        Performance and scalability MC Closing 15m
        Speakers: Daniel Jordan, Pavel Tatashin, Ying Huang
    • 14:00 17:40
      Android MC Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77

      Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other areas where there is ongoing work include the low memory killer, dynamically-allocated Binder devices, kernel namespaces, EAS, userdata FS checkpointing and DT.

      The working planning document is here:
      https://docs.google.com/spreadsheets/d/1ymzOB4wapccX6t1b11T2-m9ny8buN7EuUqhCxrsmKe4

      • 14:00
        Symbol namespaces 15m

        As of Linux 4.18, there are more than 30000 exported symbols in the kernel that can be used by loadable kernel modules. These exported symbols are all part of a global namespace, and there seems to be consensus among kernel devs that the export surface is too large, and hard to reason about. This talk describes a series of patches that introduce symbol namespaces, in order to more clearly delineate the export surface.

        Speaker: Martijn Coenen (Google)
      • 14:15
        Android usage of memory pressure signals in userspace low memory killer 15m

        Android's transition from in-kernel lowmemorykiller to userspace LMKD introduced a number of challenges including low-latency memory pressure detection and signaling, improving kill decision mechanisms and improving SIGKILL delivery latency. This talk will focus on the memory pressure detection mechanism based on PSI (Pressure Stall Information) recently developed by Johannes Weiner. It will also present PSI Monitor module currently in development stage.

        Speaker: Suren Baghdasaryan (Google)
      • 14:30
        Dynamically Allocated Binder Devices 15m

        The Binder driver currently does allow for the allocation of multiple binder devices through a kconfig option. However, this means the number of binder devices the kernel will allocate is hard-coded and cannot be changed at runtime. This is inconvenient for scenarios where processes wish to allocate binder devices on the fly and the number of needed devices is not known in advance. For example, we are running large system container workloads where each container wants at least one binder device that is not shared with any other container. The number of running containers can change dynamically which causes binder devices to be freed or allocated on demand. In this session I want to propose and discuss two distinct approaches to how this problem can be solved:
        1. /dev/binder-control: A new control device /dev/binder-control is added through which processes can allocate a free binder device or add a new one to the system.
        2. binderfs: A new binderfs filesystem is added. Each mount of binderfs in a new mount (and ipc) namespace will be a new instance similar to how devpts works. Ideally, binderfs would be mountable from non-initial user namespaces. This idea is
        similar to earlier proposals of a lofs (filesystem for loop devices).
        This session hopes to start a fruitful discussion around the feasibility of this feature and nurture a technical discussion around the actual implementation.

        Speaker: Christian Brauner
      • 14:45
        How to be better citizens: from changes review to changes testing 15m

        Despite continuous and encouraging improvements, AOSP stable kernels still have a certain delta with respect to mainline. Some features are still unique to AOSP (e.g. WALT or SchedTune), others are back-ports from mainline (e.g. the idle loop optimization). Whenever an existing feature is modified, or a new/backported feature is proposed for an AOSP stable kernel, apart from a great code review on gerrit, we would like to increase our confidence in the quality of the changes by testing their impact on a few key areas: interactivity, energy efficiency and performance. This slot will be dedicated to describing a possible solution; the main goal is to collect feedback on how to increase its adoption by AOSP common kernel contributors.

        Speaker: Patrick Bellasi (ARM)
      • 15:00
        Userdata FS Checkpointing 15m

        Android A/B updates allow roll back of updates that fail to boot, rolling back the system and vendor partitions. BUT if an update modifies the userdata partition before failing, those modifications cannot be rolled back, and Android does not support updated userdata with old system/vendor partitions. If the file system supports snapshots, use them! We are adding snapshot support to F2FS. If there is no filesystem support, consider a block level solution. We will discuss a dm-snap based solution vs a new solution from Google called dm-bow.

        Speakers: Daniel Rosenberg (Google), Paul Lawrence
      • 15:15
        LVM, Device Mapper and DM-Linear 15m

        The Android OS is stored on signed, read-only partitions. Sizing these partitions is difficult. After a few years, a major OS update may no longer fit in a device's existing partitions. To use space more intelligently, we are introducing a userspace partitioning system based on dm-linear, similar to LVM.

        Speaker: David Anderson
      • 15:30
        Break 30m
      • 16:00
        Android DTS fstab node requirements 15m

        Android Oreo introduced some device tree bindings that MUST be specified for Android to be able to mount core partitions early (before SELinux enforcement) in the boot sequence for Project Treble. This talk is about the plans to get rid of that kernel / device tree dependency: instead, have a single global fstab for all devices, and deprecate the device tree fstab.

        Speaker: Sandeep Patil (Google)
      • 16:15
        How to Get Ashmem Out of Staging 15m

        Android uses ashmem for sharing memory regions. We are trying to migrate all use cases of ashmem to memfd so that we can eventually remove the ashmem driver from staging, while also benefiting from using memfd for shared memory in Android and contributing to improving memfd upstream. Note that staging drivers are not ABI and can generally be removed at any time. This talk is about the current open challenges with this: patches recently sent to LKML, technical difficulties, userspace requirements, etc. One of the big difficulties is the lack of a "pinning" interface; John Stultz has proposed a vrange syscall before. It would be good to reach some consensus on the direction we should go in this regard.
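
        As a rough illustration of the direction (a minimal sketch, not Android's actual implementation; the region name and size are arbitrary, and it needs glibc 2.27+ for the memfd_create() wrapper), a memfd-backed region with sealing looks like this:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* anonymous, fd-backed memory region that can be passed to
                 * other processes, much like an ashmem region */
                int fd = memfd_create("example-region", MFD_ALLOW_SEALING);
                if (fd < 0) {
                        perror("memfd_create");
                        return 1;
                }
                ftruncate(fd, 4096);

                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                strcpy(p, "shared data");

                /* prevent any fd holder from resizing the region; F_SEAL_WRITE
                 * could be added once no writable mappings remain.  Note that
                 * seals do not replicate ashmem's pin/unpin (purging)
                 * semantics, which is part of what makes the migration hard. */
                fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK);

                printf("%s\n", p);
                munmap(p, 4096);
                close(fd);
                return 0;
        }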

        Speaker: Joel Fernandes (Google)
      • 16:30
        Readiness of ARM64 Kernels for Running on Any Device 15m

        Android kernels are a cocktail of upstream, Android common kernels, and large amounts of out-of-tree vendor code to support the SoCs and board peripherals. Android Pie now requires the board peripherals to be described using a device tree overlay. It is recommended that the drivers for those peripherals be loaded at boot time as kernel modules.

        This discussion is intended to evaluate, seek feedback on, and find possible hurdles in taking this a step further for Android devices, so that the SoC code is loaded as kernel modules as well. This obviously facilitates faster core kernel updates on Android devices. More importantly, Android and the Linux kernel can move together year over year without having to worry about older kernels.

        Speaker: Sandeep Patil (Google)
      • 16:45
        DRM/KMS for Android 15m

        A short update on what Google is doing to help move partners away from proprietary interfaces for their display drivers and towards DRM/KMS and upstreaming.

        Speaker: Alistair Strachan (Google)
      • 17:00
        ION Upstreaming Update 15m

        Follow-up discussion to previous years about remaining work to be done to get ION driver merged upstream.

        Speaker: Laura Abbott
      • 17:15
        Cuttlefish 15m

        Introduction to Cuttlefish VM

        Speaker: Alistair Strachan
      • 17:30
        Android and Linux Kernel: Herding billions of penguins, one version at a time 5m

        Refereed track talk

        Speaker: Sandeep Patil (Google)
      • 17:35
        Progress Report 5m

        Collective progress report from Android MC 2018

        Speaker: Karim Yaghmour (Opersys inc.)
    • 14:00 17:30
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
      • 14:00
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

      • 14:45
        Elivepatch: Flexible Distributed Linux Kernel Live Patching 45m
        Speaker: Alice Ferrazzi
      • 15:30
        Break 30m
      • 16:00
        XArray 45m

        Now that the XArray is in, it's time to make use of it. I've got a git tree which converts every current user of the radix tree to the XArray as well as converting some users of the IDA to the XArray.

        The XArray may also be a great replacement for a list_head in your data structure.

        I can also talk about currently inappropriate uses for the XArray and what I might do in future to make the XArray useful for more users.
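
        For those who have not looked at the new API yet, a minimal kernel-side sketch of the core calls (struct my_obj stands in for whatever used to live in a radix tree; based on the in-tree XArray documentation):

        #include <linux/xarray.h>

        struct my_obj;                          /* your own object type */

        static DEFINE_XARRAY(my_objects);       /* replaces a radix tree root */

        static int remember(unsigned long id, struct my_obj *obj)
        {
                /* xa_store() returns the old entry or an errno-carrying
                 * entry; xa_err() turns that into 0 or a negative errno */
                return xa_err(xa_store(&my_objects, id, obj, GFP_KERNEL));
        }

        static struct my_obj *lookup(unsigned long id)
        {
                return xa_load(&my_objects, id);
        }

        static void forget(unsigned long id)
        {
                xa_erase(&my_objects, id);
        }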

        Speaker: Matthew Wilcox
      • 16:45
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

    • 14:00 17:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 14:00
        What could be done in the kernel to make strace happy 45m

        Being a traditional tool with a long history, strace has been making every effort to overcome various deficiencies in the kernel API. Unfortunately, some of these workarounds are fragile, and in some cases no workaround is possible. In this talk maintainers of strace will describe these deficiencies and propose extensions to the kernel API so that tools like strace could work in a more reliable way.

        1

        Problem: there is no kernel API to find out whether the tracee is entering or exiting syscall.

        Current workarounds: strace does its best to sort out and track ptrace events. This works in most cases, but when strace attaches to a tracee that is inside exec, its first syscall stop is syscall-exit-stop instead of syscall-enter-stop and the workaround is fragile; in the infamous case of int 0x80 on x86_64 there is no reliable workaround.

        Proposed solution: extend the ptrace API with PTRACE_GET_SYSCALL_INFO request.
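
        A rough sketch of how a tracer might consume such a request, following the proposal as posted to LKML (constant and field names may still change before anything is merged; handle_entry/handle_exit are the tracer's own hypothetical callbacks):

        #include <sys/ptrace.h>

        /* struct ptrace_syscall_info would come from the proposed uapi header */
        struct ptrace_syscall_info info;

        /* ... after a syscall stop has been reported for 'pid' ... */
        if (ptrace(PTRACE_GET_SYSCALL_INFO, pid, sizeof(info), &info) > 0) {
                if (info.op == PTRACE_SYSCALL_INFO_ENTRY)
                        handle_entry(info.entry.nr, info.entry.args);
                else if (info.op == PTRACE_SYSCALL_INFO_EXIT)
                        handle_exit(info.exit.rval, info.exit.is_error);
        }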

        2

        Problem: there is no kernel API to invoke the wait4 syscall with a changed signal mask.

        Current workarounds: strace does its best to implement a race-free workaround, but it is way too complex and hard to maintain.

        Proposed solution: add wait6 syscall which is wait4 with additional signal mask arguments, like pselect vs select and ppoll vs poll.

        3

        Problem: time precision provided by struct rusage is too low for strace -c nowadays.

        Current workarounds: none.

        Proposed solution: when adding wait6 syscall, change struct rusage argument to a different structure with fields of type struct timespec instead of struct timeval.
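
        Taken together, proposals 2 and 3 amount to something like the following purely hypothetical prototype (no such syscall exists today; the name, argument order and struct are illustrative only):

        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>

        struct rusage_ts {                  /* struct rusage, but with ...   */
                struct timespec ru_utime;   /* ... timespec-precision fields */
                struct timespec ru_stime;
                /* remaining rusage fields unchanged */
        };

        /* wait4 semantics plus a signal mask, analogous to pselect/ppoll */
        pid_t wait6(pid_t pid, int *wstatus, int options,
                    struct rusage_ts *usage,
                    const sigset_t *sigmask, size_t sigsetsize);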

        4

        Problem: PID namespaces have been introduced without a proper kernel API to translate between tracer and tracee views of pids. This causes confusion among strace users, e.g. https://bugzilla.redhat.com/1035433

        Current workarounds: none.

        Proposed solution: add translate_pid syscall, e.g. https://lkml.org/lkml/2018/7/3/589

        5

        Problem: there are no consistent declarative syscall descriptions; this forces every user to reinvent their own wheel and catch up with the kernel.

        Current workarounds: a lot of manual work has been done in strace to implement parsers of all syscalls. Some of these parsers are quite complex and hard to test. Other projects, e.g. syzkaller, implement their own representation of syscall ABI.

        Proposed solution: provide declarative descriptions for all syscalls consistently.

        Speakers: Dmitry Levin (BaseALT), Elvira Khabirova (BaseALT), Eugene Syromyatnikov (RedHat)
      • 14:45
        Formal Methods for Kernel Hackers 45m

        Formal methods have a reputation of being difficult, accessible mostly to academics and of little use to the typical kernel hacker. This talk aims to show how, without "formal" training, one can use such tools for the benefit of the Linux kernel. It will introduce a few formal models that helped find actual bugs in the Linux kernel and start a discussion around future uses from modelling existing kernel implementation (e.g. cpu hotplug, page cache states, mm_user/mm_count) to formally specifying new design choices. The introductory examples are written in PlusCal (an algorithm language based on TLA+) but no prior knowledge is required.

        Speaker: Catalin Marinas
      • 15:30
        Break 30m
      • 16:00
        Managing Memory Bandwidth Antagonism at Scale 45m

        Providing a consistent and predictable performance experience for applications is an important goal for cloud providers. Creating isolated job domains in a multi-tenant shared environment can be extremely challenging. At Google, performance isolation challenges due to memory bandwidth have been on the rise with newer workloads. This talk covers our attempt to understand and mitigate isolation issues caused by memory bandwidth saturation.

        The recent Intel RDT support in Linux helps us both monitor and manage memory bandwidth use on newer platforms. However, it still leaves a large chunk of our fleet at risk of memory bandwidth issues. The talk covers three aspects of our isolation attempts:

        1. At Google, in Borg, we run all applications in containers. Our first attempt was to estimate memory bandwidth utilization for each container on all supported platforms by using existing performance counters. The talk will cover details of our approximation methodology and issues we identified in monitoring, as well as some usage trends across different workloads.
        2. The second part of our effort was focused on building actuators and policies for memory bandwidth control. We will cover multiple iterations of our enforcement efforts at node and cluster level with production use-cases and lessons learnt.
        3. For newer platforms, we attempted to use Intel RDT support via the resctrl interface. We ran into issues on both the monitoring and isolation side. We’ll discuss the fixes and workarounds we used and changes we proposed for resource-control support in Linux.

        We believe the problems and trends we have observed are universally applicable. We hope to inform and initiate discussion around common solutions across the community.

        Speakers: Mr Rohit Jnagal (Google Inc), Mr David Lo (Google Inc), Mr Dragos Sbirlea (Google Inc)
      • 16:45
        oomd: a userspace OOM killer 45m

        Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. In this talk, Daniel Xu will cover why the Linux kernel OOM killer is surprisingly ineffective and how oomd, a newly open-sourced userspace OOM killer, does a more effective and reliable job. Not only does the switch from kernel space to userspace result in a more flexible solution, but it also directly translates to better resource utilization. His talk will also do a deep dive into the Linux kernel changes and improvements necessary for oomd to operate.

        Speaker: Mr Daniel Xu (Facebook)
    • 14:00 17:30
      Toolchain MC Junior/Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior/Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The GNU Toolchain and Clang/LLVM play a critical role at the nexus of the Linux Kernel, the Open Source Software ecosystem, and computer hardware. The rapid innovation and progress in each of these components requires greater cooperation and coordination. This Toolchain Microconference will explore recent developments in the toolchain projects, the roadmaps, and how to address the challenges and opportunities ahead as the pace of change continues to accelerate.

      • 14:00
        GCC and the GNU Toolchain: The Essential Collection 20m

        Current successes and future challenges for the GNU Toolchain. This talk will discuss the recent improvement in GCC, GLIBC, GDB and Binutils and future directions. How can the GNU Toolchain better engage the Linux kernel community?

        Speaker: David Edelsohn (IBM Research)
      • 14:20
        Support for Control-flow Enforcement Technology 20m

        CET is a security enhancement technology coming in upcoming Intel hardware. This paper will talk about all the changes in the software that are required to enable CET in the marketplace. The changes are all-encompassing affecting the kernel, linker, compilers, libraries, applications etc.

        Speaker: Mr H. J. Lu (Intel)
      • 14:40
        GLIBC API to access x86 specific platform features CPU run-time library for C 20m

        In 2018, people are still using glibc 2.17, which was released in February 2013, on SKX, even though the current released glibc 2.28 has new memory, string and math functions optimized for SKX. The same thing will happen five years from now.

        The CPU runtime C library for x86-64, libcpu-rt-c:

        It provides the latest memory and string functions from glibc, is binary compatible with any x86-64 OS, and can be linked directly or loaded via LD_PRELOAD.

        Speaker: Mr H. J. Lu (Intel)
      • 15:00
        Improve glibc and kernel interaction 30m
        • Ideas to improve glibc and Kernel interaction

        This is an RFC session to discuss kernel features that glibc lacks (such as termios2), features glibc might require to correctly implement some standards (such as pthread cancellation), and how to improve communication between kernel and GNU toolchain developers.

        Speaker: Mr Adhemerval Zanella (Linaro)
      • 15:30
        Break 30m
      • 16:00
        RISC-V 32-bit time_t kernel ABI 20m

        The 32-bit RISC-V glibc port is not currently upstream, so we've taken the opportunity to leave the 32-bit Linux ABI a bit slushy in the hope that we can avoid any known-to-be-legacy interfaces. The last major interface remaining that we plan on deprecating is the 32-bit time_t interface, and while we don't want to delay our glibc release just to have a clean time_t, we think it's possible to get everything done in time.

        This session exists to determine if this is feasible, and assuming it is feasible how we can go about doing it.

        Speakers: Mr Atish Patra (Western Digital), Mr Palmer Dabbelt (SiFive)
      • 16:20
        Toolchain plans for Armv8.5 20m

        This session gives a brief introduction to the new features introduced in AArch64 with Armv8.5 and an overview of how these features will make it into toolchains in upcoming releases.

        Speaker: Mr Ramana Radhakrishnan (Arm)
      • 16:40
        TBD/Open Discussion 20m
    • 09:00 16:00
      BPF MC Pavillion/Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-C

      Sheraton Vancouver Wall Center

      58

      BPF is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP, tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.

      BPF has seen a lot of progress since last year's Plumbers conference and many of the discussed BPF tracing Microconference improvements have been tackled since then such as the introduction of BPF type format (BTF) to name one.

      This year's BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year's event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.

      Official BPF MC website: http://vger.kernel.org/lpc-bpf2018.html

      • 09:00
        Welcome 10m

        BPF MC opening session.

        Speakers: Alexei Starovoitov (Facebook), Daniel Borkmann (Cilium)
      • 09:10
        Scaling Linux Traffic Shaping with BPF 20m

        Google servers classify, measure, and shape their outgoing traffic. The original implementation is based on Linux kernel traffic control (TC). As server platforms scale so does their network bandwidth and number of classified flows, exposing scalability limits in the TC system - specifically contention on the root qdisc lock.

        Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and low-overhead prequeue ameliorate the contention up to a point. But they cannot fully resolve it. Recent changes to the Linux kernel make it possible to move classification, measurement, and packet mangling outside this critical section, potentially scaling to much higher rates while simultaneously shaping more flows and applying more flexible policies.

        By moving classification and measurement to BPF at the new TC egress hook, servers avoid taking a lock millions of times per second. Running BPF programs at socket connect time with TCP_BPF converts overhead from per-packet to per-flow. The programmability of BPF also allows us to implement entirely new functions, such as runtime configurable congestion control, first-packet classification and socket-based QoS policies. It enables faster deployment cycles, as this business logic can be updated dynamically from a user agent. The discussion will focus on our experience converting an existing traffic shaping system to a solution based on BPF, and the issues we’ve encountered during testing and debugging.
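
        As a minimal illustration of the shape of such a program (not Google's implementation; the class mapping, map layout and helper header path are assumptions), a cls_bpf program attached at the tc egress hook can classify and count without serializing on the root qdisc lock:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include "bpf_helpers.h"   /* SEC(), bpf_map_def and map helpers, as
                                      shipped with the kernel samples/selftests */

        struct bpf_map_def SEC("maps") tx_bytes = {
                .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
                .key_size    = sizeof(__u32),
                .value_size  = sizeof(__u64),
                .max_entries = 4,                /* one counter per class */
        };

        SEC("classifier")
        int tc_egress(struct __sk_buff *skb)
        {
                __u32 class = skb->mark & 3;     /* invented class mapping */
                __u64 *bytes = bpf_map_lookup_elem(&tx_bytes, &class);

                if (bytes)
                        *bytes += skb->len;      /* per-CPU: no shared lock */

                skb->priority = class;           /* steer to a per-class qdisc */
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";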

        Speakers: Willem de Bruijn (Google), Eddie Hao (Google), Vlad Dumitrescu (Google)
      • 09:30
        Compile-Once Run-Everywhere BPF Programs? 20m

        Compile-once and run-everywhere can make deployment simpler and may consume fewer resources on the target host, e.g., no llvm compiler or kernel devel package. Currently, bpf programs for networking can compile once and run over different kernel versions. But bpf programs for tracing cannot, since they access kernel internal headers and these headers are subject to change between kernel versions.

        But compile-once run-everywhere for tracing is not easy. BPF programs could access anything in the kernel headers, including data structures, macros and inline functions. To achieve this goal, we need (1) preserving header-level accesses for the bpf program, and (2) abstracting header info of vmlinux. Right before program load on the target host, some kind of resolution is done for the bpf program against the running kernel so the resulting program behaves just like one compiled against the host kernel headers.

        In this talk, we will explore how BTF could be used by both bpf program and vmlinux to advance the possibility of bpf program compile-once and run-everywhere.

        Speakers: Yonghong Song (Facebook), Alexei Starovoitov (Facebook)
      • 09:50
        ELF relocation for static data in BPF 20m

        BPF program writers today who build and distribute their programs as ELF objects typically write their programs using one of a small set of (mostly) similar headers that establish norms around ELF section definitions. One such norm is the presence of a "maps" section which allows maps to be referenced within BPF instructions using virtual file descriptors. When a BPF loader (e.g., iproute2) opens the ELF, it loads each map referred to in this section, creates a real file descriptor for that map, then updates all BPF instructions which refer to the same map to specify the real file descriptor. This allows symbolic referencing of maps without requiring writers to implement their own loaders or recompile their programs every time they create a map.

        This discussion will take a look at how to provide similar symbolic referencing for static data. Existing implementations already templatize information such as MAC or IP addresses using C macros, then invoke a compiler to replace such static data at load time, at a cost of one compilation per load. By extending the support for static variables into ELF sections, programs could be written and compiled once then reloaded many times with different static data.
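
        For context, the existing convention looks roughly like the sketch below (an iproute2-style map definition; the discussion is about extending the same symbolic-relocation idea to a section carrying static data, for which the section and variable names here are invented):

        #include <linux/bpf.h>

        #define SEC(name) __attribute__((section(name), used))

        /* "maps" section entry as consumed by loaders such as iproute2: the
         * loader creates the map, obtains a real fd and patches every BPF
         * instruction that references the policy_map symbol to use that fd */
        struct bpf_elf_map {
                __u32 type;
                __u32 size_key;
                __u32 size_value;
                __u32 max_elem;
                __u32 flags;
                __u32 id;
                __u32 pinning;
        };

        struct bpf_elf_map SEC("maps") policy_map = {
                .type       = BPF_MAP_TYPE_HASH,
                .size_key   = sizeof(__u32),
                .size_value = sizeof(__u64),
                .max_elem   = 1024,
        };

        /* The idea under discussion: give static data the same treatment so a
         * loader can patch values per deployment without recompiling. */
        __u32 SEC("data.config") sampling_rate = 0;   /* patched at load time */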

        Speakers: Joe Stringer (Cilium), Daniel Borkmann (Cilium)
      • 10:10
        BPF control flow, supporting loops and other patterns 20m

        Currently, BPF cannot support basic loops such as for, while, do/while, etc. Users work around this by forcing the compiler to "unroll" these control flow constructs in the LLVM backend. However, this only works up to a point. Unrolling increases instruction count and complexity on the verifier and, further, LLVM cannot easily unroll all loops. The result is developers end up writing code that is unnatural, iterating until they find a version that LLVM will compile into a form the verifier backend will support.

        We developed a verifier extension to detect bounded loops here,

        https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/log/?h=wip/bpf-loop-detection

        This requires building a DOM tree (computationally expensive) and then matching loop patterns to find loop invariants to verify that loops terminate. In this discussion we would like to cover the pros and cons of this approach, as well as discuss another proposal to use explicit control flow instructions to simplify this task.

        The goal of this discussion would be to come to a consensus on how to proceed to make progress on supporting bounded loops.
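
        For reference, a typical instance of today's workaround, which the bounded-loop work aims to make unnecessary (a minimal sketch; header parsing elided):

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>

        #define SEC(name) __attribute__((section(name), used))
        #define MAX_HDRS 8

        SEC("classifier")
        int parse(struct __sk_buff *skb)
        {
                int i;

                /* clang must fully unroll this loop today, otherwise the
                 * generated back-edge is rejected by the verifier; bounded
                 * loop support would let the loop survive as written */
        #pragma unroll
                for (i = 0; i < MAX_HDRS; i++) {
                        /* parse one header, break out when done */
                }
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";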

        Speaker: John Fastabend (Cilium)
      • 10:30
        Break 30m
      • 11:00
        Efficient JIT to 32-bit architectures through data flow analysis 20m

        eBPF has 64-bit general purpose registers, therefore 32-bit architectures normally need to use a register pair to model them and need to generate extra instructions to manipulate the high 32 bits of the pair. Some of these overheads could be eliminated if the JIT compiler knows that only the low 32 bits of a register are of interest. This can be known through data flow (DF) analysis techniques, either the classic iterative DF analysis or a "path-sensitive" version based on the verifier's code path walker.

        In this talk, implementations for both versions of the DF analyser will be presented. We will first see how a def-use chain based classic eBPF DF analyser looks, and will see the possibility to integrate it with the previously proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which could potentially serve as a library. Then, another "path-sensitive" DF analyser based on the existing verifier code path walker will be presented. We will discuss how function calls, path prune, and path switch affect the implementation. Finally, we will summarize pros and cons for each, and will see how each of them could be adapted to 64-bit and 32-bit architecture back-ends.

        Also, eBPF has 32-bit sub-registers and associated ALU32 instructions; enabling them (-mattr=+alu32) in LLVM code-gen could let the generated eBPF sequences carry more 32-bit information, which could potentially ease the flow analysis. This will be briefly discussed in the talk as well.

        Speaker: Jiong Wang (Netronome)
      • 11:20
        eBPF Debugging Infrastructure - Current Techniques and Additional Proposals 20m

        eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.

        The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.

        Speaker: Quentin Monnet (Netronome)
      • 11:40
        eBPF-based tracing tools under 32 bit architectures 20m

        Complex software usually depends on many different components, which sometimes perform background tasks with side effects not directly visible to their users. Without proper tools it can be hard to identify which component is responsible for performance hits or undesired behaviors.

        We were challenged to implement D-Bus observability tools in embedded, ARM32 or ARM64 kernel based environments, both with 32-bit userspace. While we found bcc-tools, an open source compiler collection, useful, it appeared that it lacked support for 32-bit environments. We extended bcc-tools with support for 32-bit architectures. Using bcc-tools we created Linux eBPF programs – small programs written in a subset of the C language, loaded from user-space and executed in kernel context. We attached them to uprobes and kprobes - user and kernel space special kinds of breakpoints. While this worked on an ARM32 kernel based system, we faced another problem - the ARM64 kernel lacked support for uprobes set in 32-bit binaries. The 64-bit ARM Linux kernel was extended with the ability to probe 32-bit binaries.

        We propose to discuss challenges we faced trying to implement bcc-tools based tracing tools on ARM devices. We present a working solution to overcome lack of support for 32-bit architectures in bcc-tools, leaving space for discussion about other ways to achieve the same result. We also introduce 32-bit instruction probing in ARM64 kernel - a solution that we found very useful in our case. As a proof of concept we present tools that monitor D-Bus usage in ARM32 or ARM64 kernel based system with 32-bit userspace. We list what needs to be done for complete eBPF-based tools to be fully usable on ARM.

        Speakers: Maciej Slodczyk (Samsung), Adrian Szyndela (Samsung)
      • 12:00
        Using eBPF as a heterogeneous processing ABI 20m

        eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual machine, which can be used to execute simple programs injected by the user at various hooks in the kernel, on the occurrences of events such as incoming packets. eBPF was designed to simplify the work of in-kernel just-in-time compilers, i.e. translation of eBPF intermediate representation to CPU machine code. Upstream Linux kernel currently contains JITs for all major 64-bit instruction set architectures (ISAs) (x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit translators (ARM, x86, also NFP - Netronome Flow Processor).

        The eBPF generic virtual machine with clearly defined semantics makes it a very good vehicle for enabling programming of custom hardware. From storage devices to networking processors, most host I/O controllers today are built around, or accompanied by, general purpose processing cores, e.g. ARM. As vendors try to expose more and more capabilities of their hardware, using a general purpose machine definition like eBPF to inject code into hardware directly allows us to avoid the creation of vendor-specific APIs.

        In this talk I will describe the eBPF offload mechanisms which exist today in the Linux kernel and how they compare to other offloading stacks, e.g. for compute or graphics. I will present proof-of-concept work on reusing existing eBPF JITs for a non-host architecture (e.g. an ARM JIT on x86) to program an emulated device, followed by a short description of the eBPF offload for NFP hardware as an example of a real-life offload.

        Speaker: Jakub Kicinski (Netronome)
      • 12:20
        Traffic policing in eBPF: applying token bucket algorithm 20m

        An eBPF-based traffic policer as a replacement* for the Hierarchical Token Bucket queuing discipline.

        The key idea is the two rate three color marker (RFC 2698) algorithm, whose inputs are the committed and peak rates with the corresponding burst sizes, and whose output is a color or category assigned to a packet. There are conforming, exceeding and violating categories. An action is applied to the violating category - either drop or DSCP remark. Another action may optionally be applied to the exceeding category.

        Close-up of the eBPF implementation**. Write intensiveness is a cornerstone: an update of the available tokens is required on each packet, as well as tracking of time. A naive implementation and its exposure to data races on multi-core systems. The problem of updating both the timestamp and the number of available tokens atomically. Slicing the timeline into chunks of the size of the burst duration as a solution for races, mapping each packet into its chunk, so there is no need to update a global timestamp. Two approaches to storing timeline chunks: a bpf LRU hash map and a block of timeline chunks in a bpf array. Circulating over a block of timeline chunks. Pros and cons of the latter approach: lock-free with a bpf array as the only data structure used vs. an increased amount of locked memory.

        Combining several policers. A linear chain of policers instead of a hierarchy. Passing a packet over the chain. Dealing with bandwidth underutilization when the first K policers in a chain conform a packet and the (K+1)-th rejects it. Commutative property of chained policers. Interaction with UDP and TCP. TCP reacts to drops by changing the congestion window, which affects the actual rate.

        * Note that it's a replacement, not an alternative: the eBPF based implementation doesn't assume putting packets into queues.
          ** Since the action is per packet, the eBPF program should be attached to the tc chain; it doesn't work with cgroups.
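
        A plain-C sketch of the two rate three color marker logic at the core of the proposal (single-threaded and color-blind, ignoring the multi-core and timeline-chunking concerns described above):

        #include <stdint.h>

        enum color { GREEN, YELLOW, RED };   /* conform / exceed / violate */

        struct trtcm {
                uint64_t cir, pir;           /* committed/peak rate, bytes/s */
                uint64_t cbs, pbs;           /* burst sizes, bytes */
                uint64_t tc, tp;             /* current token counts */
                uint64_t last_ns;            /* last update timestamp */
        };

        static enum color trtcm_mark(struct trtcm *m, uint64_t now_ns,
                                     uint32_t len)
        {
                uint64_t delta = now_ns - m->last_ns;

                /* refill both buckets from the elapsed time, capped at the
                 * configured burst sizes */
                m->tc += m->cir * delta / 1000000000ull;
                if (m->tc > m->cbs)
                        m->tc = m->cbs;
                m->tp += m->pir * delta / 1000000000ull;
                if (m->tp > m->pbs)
                        m->tp = m->pbs;
                m->last_ns = now_ns;

                if (m->tp < len)
                        return RED;          /* violating: drop or remark DSCP */
                m->tp -= len;
                if (m->tc < len)
                        return YELLOW;       /* exceeding */
                m->tc -= len;
                return GREEN;                /* conforming */
        }
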
        Speaker: Julia Kartseva (Facebook)
      • 12:40
        Break 1h 20m
      • 14:00
        In-kernel protocol aware filtering 20m

        Deep packet inspection seems to be a largely unexplored area of BPF use cases. The 4096 instruction limit and the lack of loops make such implementations non-straightforward for many protocols. Using XDP and socket filters, at Red Sift, we implemented DNS and TLS handshake detection to provide better monitoring for our clusters. We learned that while the protocol implementation is not necessarily straightforward, the BPF VM provides a reasonably safe environment for DPI-style parsing. When coupled with our Rust userspace implementation, it can provide information and functionality that previously would have required userspace intercepting proxies or middleboxes, at a comparable performance to iptables-style packet filters. Further work is needed to explore how we can turn this into a more comprehensive, active component, mainly due to the BPF VM restrictions around 4096 instruction programs.

        Speaker: Peter Parkanyi (Red Sift)
      • 14:20
        Enhancing User Defined Tracepoints 20m

        BPF trace tools such as bcc/trace and bpftrace can attach to Systemtap USDT (user application statically defined tracepoints) probes. These probes can be created by a macro imported from "sys/sdt.h" or by a provider file. Either way, Systemtap will register those probes as entries in the note section of the ELF file with the name of the probe, its address and the arguments as assembly locations. This approach is fairly simple, easy to parse and non-intrusive. Unfortunately, it is also obsolete and lacks features such as typed arguments and built-in dynamic instrumentation. Since BPF tools are growing in popularity, it makes sense to create a new enhanced format to fix these shortcomings.

        We can discuss and make decisions about the future of USDT probes used by BPF trace tools. Some possible alternatives are: extend Systemtap USDT to introduce these new features or extend kernel tracepoints so that user applications can also register them.
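
        For reference, the sys/sdt.h mechanism under discussion is just a macro that emits an entry in the .note.stapsdt ELF section, with untyped arguments described as assembly locations (build with the systemtap sdt development headers installed; provider and probe names below are arbitrary):

        #include <sys/sdt.h>

        static int process_request(int id, long size)
        {
                /* records the probe site and its two arguments in the note
                 * section; tools such as bcc/trace and bpftrace can attach
                 * to myapp:request_start through their USDT support */
                DTRACE_PROBE2(myapp, request_start, id, size);
                /* ... real work ... */
                return 0;
        }

        int main(void)
        {
                return process_request(42, 4096);
        }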

        Speaker: Matheus Marchini (Sthima)
      • 14:40
        Augmenting syscalls in 'perf trace' using eBPF 20m

        The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based 'strace' like tool, augmenting the syscall arguments provided by the tracepoints with integer->strings tables automatically generated from the kernel headers, showing the paths associated with fds, pid COMMs, etc.

        That is enough for integer arguments; pointer arguments need either kprobes put in special locations, which is fragile and has so far been implemented only for getname_flags (open, etc. filenames), or eBPF hooked into the syscall enter/exit tracepoints to collect pointer contents right after the existing tracepoint payload.

        This has been done to some extent and is present in the kernel sources in tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in perf for using BPF C programs as event names, automagically using clang/llvm to build and load them via sys_bpf(). 'perf trace' hooks this into the existing beautifiers which, seeing that extra data, use it to get the filename, struct sockaddr_in, etc.

        This was done for a bunch of syscalls. What is left is to get this all automated using BTF, to allow passing filters attached to the syscalls, to select which syscalls should be traced, and to use a pre-compiled augmented_syscalls.c, just selecting which bits of the object should be used. These open issues around streamlining the process to avoid requiring the clang toolchain will be the matter of this discussion.

        Speaker: Arnaldo Carvalho de Melo (Red Hat)
      • 15:00
        bpftrace - high-level tracing language powered by BPF 20m

        BPFtrace is a high-level tracing language powered by BPF. Inspired by awk and C, as well as predecessor tracers such as DTrace and SystemTap, BPFtrace has a clean and simple syntax which empowers users to easily create BPF programs and attach them to kprobes, uprobes, and tracepoints.

        We can discuss the future of this work, including BTF integration for kprobe struct arguments, and solicit feedback.

        Speaker: Matheus Marchini (Sthima)
      • 15:20
        eBPF as execution engine for DTrace 20m

        The existence and power of eBPF provides a generic execution engine at the kernel level. We have been exploring leveraging the power of eBPF as a way to integrate DTrace more into the existing tracing framework that has matured within the Linux kernel. While DTrace comes with some more lightweight ways for getting probes fired, and while it has a pretty nice userlevel consumer with useful features, there should be no need to duplicate a lot of effort on the level of processing probe events and generating data for the consumer.

        We want to move forward with modifying DTrace to make use of the eBPF subsystem, and propose and contribute extensions to eBPF (and most likely some other tracing related subsystems) to provide more support for not only DTrace but tracing tools in general. In order to contribute things that benefit more than just us, we need to get together and talk, so let's get it started...

        Speaker: Kris van Hees (Oracle)
    • 09:00 12:30
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
      • 09:00
        ZUFS - Zero Copy User-Mode FileSystem - One Year Later 45m

        One year after its inception there are real, hardened filesystems and
        many innovative features. But is it ready for upstream?

        Few highlights:

        • All VFS APIs working, including DAX, mmap, IOCTLs, xattrs, ACLs, ...
          (missing: quota)
        • IO API changed (From last year)
        • Support for ASYNC operations
        • Support for both pmem and regular block devices
        • Support for private memory pools
        • ZTs multi-channel and dynamic channel allocation
        • And many more ...

        In the talk I will give a short architectural and functional overview. Then I will go over some of the leftover challenges.

        And finally hope to engage in an open discussion of how this project should move forward to be accepted into the Kernel, gain more users and FS implementations.

        Speaker: Mr Boaz Harrosh
      • 09:45
        Filename encoding and case-insensitive filesystems 45m

        Case-insensitive file name lookup is a recurrent topic on Linux filesystems, but its stalled development has regained traction in the past few years, thanks to its applications in platforms like Valve's SteamOS and Android. Despite aiming at simplifying the file lookup operation from a user point of view, since human languages don't directly correlate to arbitrary case folding and encoding composition premises, the actual implementation of encoding and case-insensitive awareness carries an outstanding number of issues and corner cases, which require a clear behavioral definition from the file system layer in order to get it right. File system developers are invited to come discuss such premises and what is expected from an in-kernel common encoding and case-insensitive abstraction for file systems.

        Speaker: Gabriel Krisman Bertazi (Collabora)
      • 10:30
        Break 30m
      • 11:00
        Who stole my CPU? Steal time mitigation at Digital Ocean 45m

        Steal time due to hypervisor overcommitment is a widespread and well-understood phenomenon in virtualized environments.
        However, sometimes steal appears even when a hypervisor is not overcommitted. This talk will lay out our quest for better utilization of hypervisor hardware by reducing steal. We will talk about the kernel heuristics causing it, how we handle disabling these heuristics by implementing a userspace daemon, and the issues that arise from this.

        Speakers: Leonid Podolny (DigitalOcean), Vineeth Remanan Pillai (DigitalOcean)
      • 11:45
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

    • 09:00 12:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 09:00
        The hard work behind large physical allocations in the kernel 45m

        The physical memory management in the Linux kernel is mostly based on single page allocations, but there are many situations where a larger physically contiguous area of memory needs to be allocated. Some are for the benefit of userspace (e.g. huge pages), others for better performance in the kernel (SLAB/SLUB, networking, and others).

        Making sure that contiguous physical memory is available for allocation is far from trivial, as pages are reclaimed for reuse roughly in least-recently-used (LRU) order, which is typically different from their physical placement. The freed memory is thus fragmented. The kernel has two complementary mechanisms to defragment the free memory. One is memory compaction, which migrates used pages to make the free pages contiguous. The other is page grouping by mobility, which tries to make sure that pages that cannot be migrated are grouped together, so the rest of the pages can be effectively compacted. Both mechanisms employ various heuristics to balance the success of large allocations against their overhead in terms of latencies due to processor and lock usage.

        The talk will discuss the two mechanisms, focusing on the known problems and their possible solutions, that have been proposed by several memory management developers.

        Speaker: Vlastimil Babka (SUSE)
      • 09:45
        WireGuard: Next-Generation Secure Kernel Network Tunnel 45m

        WireGuard [1] [2] is a new network tunneling mechanism written for
        Linux, which, after three years of development, is nearly ready for
        upstream. It uses a formally proven cryptographic protocol, custom
        tailored for the Linux kernel, and has already seen very widespread
        deployment, in everything from smart phones to massive data center
        clusters. WireGuard uses a novel timer mechanism to hide state from
        userspace, and in general presents userspace with a "stateless" and
        "declarative" system of establishing secure tunnels. The codebase is
        also remarkably small and has been written with a number of defense in
        depth techniques. Integration into the larger Linux ecosystem is
        advancing at a healthy rate, with recent patches for systemd and
        NetworkManager merged. There is also ongoing work into combining
        WireGuard with automatic configuration and mesh routing daemons on
        Linux. This talk will focus on a wide variety of WireGuard’s innards
        and tentacles onto other projects. The presentation will walk through
        WireGuard's integration into the netdev subsystem, its unique use of
        network namespaces, why kernel space is necessary, and the various
        hurdles that have gone into designing a cryptographic protocol
        specifically with kernel constraints in mind. It will also examine a
        practical approach to formal verification, suitable for kernel
        engineers and not just academics, and connect the ideas of that with
        our extensive continuous integration testing framework across multiple
        kernel architectures and versions. As if that was not already enough,
        we will also take a close look at the interesting performance aspects
        of doing high throughput CPU-bound computations in kernel space while
        still keeping latency to a minimum. On the topic of smartphones, the
        talk will examine power efficiency techniques of both the
        implementation and of the protocol design, our experience in
        integrating this into Android kernels, and the relationship between
        cryptographic secrets and smartphone suspend cycles. Finally, we will
        look carefully at the WireGuard userspace API and its usage in various
        daemons and managers. In short, this presentation will examine the
        networking and cryptography design, the kernel engineering, and the
        userspace integration considerations of WireGuard.

        [1] https://www.wireguard.com
        [2] https://www.wireguard.com/papers/wireguard.pdf

        Speaker: Jason Donenfeld
      • 10:30
        Break 30m
      • 11:00
        Recursive read deadlocks and Where to find them 45m

        Lockdep (the deadlock detector in the Linux kernel) is a powerful tool to detect deadlocks, and has been used for a long time by kernel developers. However, when it comes to read/write lock deadlock detection, lockdep only has limited support. What makes this limited support worse is that some major architectures (x86 and arm64) have switched, or are trying to switch, their rwlock implementation to queued rwlocks. As one example, we found deadlock cases that happened in the kernel but could not be detected with lockdep.

        To improve this situation, a patchset to support read/write deadlock detection in lockdep has been posted to LKML and is now at v6. Although it received several positive reviews, some details about the reasoning of its correctness and other things still need more discussion.

        This topic will give a brief introduction on rwlock related deadlocks (recursive read deadlocks) and how we can tweak lockdep to detect them. It will focus on the detection algorithm and its correctness, but also some implementation details.

        This topic will provide the opportunity to discuss the reasoning and the overall design with some core lock developers, along with the opportunity to discuss the usage scenarios with potential users. The expected result is that we have a cleaner plan for upstreaming this and that more developers get educated on how to use it in their work.
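
        For readers unfamiliar with the problem class, a minimal illustration of a recursive-read deadlock on a queued rwlock (kernel-style pseudocode, not taken from a real driver; the numbers mark one problematic interleaving):

        #include <linux/spinlock.h>

        static DEFINE_RWLOCK(lock_a);

        /* runs on CPU 0 */
        static void reader(void)
        {
                read_lock(&lock_a);     /* 1: first reader gets in           */
                /* ... */
                read_lock(&lock_a);     /* 3: recursive read; with a queued
                                         *    (fair) rwlock this waits behind
                                         *    the writer queued in step 2,
                                         *    which itself waits for step 1
                                         *    to be released -> deadlock     */
                read_unlock(&lock_a);
                read_unlock(&lock_a);
        }

        /* runs on CPU 1 */
        static void writer(void)
        {
                write_lock(&lock_a);    /* 2: queues behind CPU 0's reader and
                                         *    blocks any new readers          */
                write_unlock(&lock_a);
        }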

        Speaker: Boqun Feng
      • 11:45
        Enhancing perf to export processor hazard information 45m

        Most modern microprocessors employ complex instruction execution pipelines such that many instructions can be 'in flight' at any given point in time. The efficiency of this pipelining is typically measured in how many instructions get completed per CPU cycle, a metric variously called Instructions Per Cycle (IPC) or its inverse, Cycles Per Instruction (CPI). Various factors affect this metric, and hazards are primary among them. Different types of hazards exist - data hazards, structural hazards and control hazards. A data hazard is the case where data dependencies exist between instructions in different stages of the pipeline. A structural hazard is when the same processor hardware is needed by more than one instruction in flight at the same time. Control hazards are of the branch misprediction kind. Information about these hazards is critical for analyzing performance issues and also for tuning software to overcome such issues. Modern processors export such hazard data in Performance Monitoring Unit (PMU) registers. In this talk, we propose an arch-neutral extension to perf to export the hazard data presented in different ways by different architectures. We also present how this extension has been applied to the IBM Power processor, the APIs and example output.

        Speaker: Mr Madhavan Srinivasan (IBM Linux Technology Center)
    • 09:00 12:30
      Power Management and Energy-awareness MC Junior/Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior/Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The focus will be on power management frameworks, task scheduling in relation to power/energy optimization, and platform power management mechanisms. The goal is to facilitate cross framework and cross platform discussions that can help improve power and energy-awareness in Linux.

      • 09:00
        Energy-aware scheduling 25m

        An updated proposal for Energy Aware Scheduling has been posted and discussed on LKML during this year [1]. The patch set introduces an independent Energy Model framework holding active power cost of CPUs, and changes the scheduler's wake-up balancing code to use this newly available information when deciding on which CPU a task should run.

        This session aims at discussing the open problems identified during the review as well as possible improvements to other areas of the scheduler to further improve energy efficiency.

        [1] https://lore.kernel.org/lkml/20181016101513.26919-1-quentin.perret@arm.com/

        Speakers: Dietmar Eggemann (ARM), Quentin Perret (ARM)
      • 09:25
        Expressing per-task/per-cgroup performance hints 25m

        The Linux scheduler is able to drive frequency selection, when the schedutil cpufreq governor is in use, based on task utilization aggregated at the CPU level. The CPU utilization is then used to select the frequency which best fits the task's generated workload. The current translation of utilization values into a frequency selection is pretty simple: we just go to max for RT tasks or to the minimum frequency which can accommodate the utilization of DL+FAIR tasks.

        While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks we can aim at some better frequency driving which can take into consideration hints coming from user-space.

        Utilization clamping is a mechanism which allows filtering the utilization generated by RT and FAIR tasks within a range defined from user-space, either for a task or for task groups. The clamped utilization requirements of RUNNABLE tasks are aggregated at the CPU level and used to enforce its minimum and/or maximum frequency.

        This session is meant to give an update on the most recent LKML posting of the utilization clamping patchset and to open a discussion on how to better progress this proposal.

        Speakers: Morten Rasmussen (Arm), Patrick Bellasi (Arm Ltd.)
      • 09:50
        Towards improved selection of CPU idle states 20m

        The venerable menu governor does some things that are quite questionable in my view. First, it includes timer wakeups in the pattern detection data and mixes them up with wakeups from other sources, which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (because it knows the time until the next timer event and that is later than the expected wakeup time). Second, it uses an extra exit latency limit based on the predicted idle duration and depending on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep length correction factors are not correlated to the list of available idle states in any way whatsoever, and different correction factors are used depending on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular.

        A major rework of the menu governor would be required to address these issues and it is likely that the performance of some workloads would suffer from that. That raises the question of whether or not to try to improve the menu governor or to introduce an entirely new one to replace it, or to do both these things simultaneously.

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 10:10
        Generic power domains (genpd) framework improvements 20m

        The Generic PM domains framework (genpd) keeps evolving to deal with new problems. Lately, we have for example seen genpd incorporate support for active-state power management and also support for multiple PM domains per device. Let's walk through these new changes that have been made and discuss their impact.

        Speaker: Ulf Hansson (Linaro)
      • 10:30
        Break 25m
      • 10:55
        Firmware interfaces for power management vs direct control of resources 25m

        While new technologies in platform power management continue to evolve, we need to look at ways to ensure it's independent of the OSPM. Custom vendor solutions for power management and device/system configuration lead to fragmentation.

        ACPI solved the problem for some market segments by abstracting details, but we still need an alternative for the traditional embedded/mobile market. ARM SCMI continues to address concerns in a few of these functional areas, but there still is a lot of resistance to move away from direct control of power resources in the OS. Examples include:

        a. Voltage dependencies for clocks (DVFS) [1] - genpd and performance domain integration
        b. Generic cpufreq governor for devfreq [2]
        c. On-chip interconnect API [3]

        This session aims at reaching some consensus and guidelines going forward to avoid further fragmentation.

        [1] https://www.spinics.net/lists/linux-clk/msg27587.html
        [2] https://patchwork.ozlabs.org/cover/916114/
        [3] https://patchwork.kernel.org/cover/10562761/

        Speaker: Sudeep Holla (ARM)
      • 11:20
        Runtime power sharing among CPUs, GPUs and others 25m

        Due to high performance demands, systems tend to be over-provisioned, so that it is not possible to run each component at peak power. Even if each component has the capability to report power and set power limits, there is no kernel level framework to achieve that. IPA addresses part of it, but on the systems in question thermal limits usually are not a problem; sudden power overdraw is a bigger issue (particularly on unlocked systems). In addition, without a proper power balance among components, they can starve each other. For example, in Intel KabyLake-G there are 4 big power consumers: CPUs, two GPUs and memory. If the CPUs take most of the power, it will hurt graphics performance, as the GPU will not be able to handle requests in a timely manner. So the power has to be managed at run time based on the workload demand.

        Speaker: Srinivas Pandruvada (Intel)
      • 11:45
        Runtime PM timer granularity issue 20m

        Runtime PM allows drivers to automatically suspend devices that have not been used for a defined amount of time. This autosuspend feature is really efficient for handling bursts of activity on a device by optimizing the number of runtime PM suspend/resume calls. However, the runtime PM timers used for that are fully based on jiffies granularity which raises problems for some embedded ARM platforms that want to optimize their energy usage as much as possible. For example, the minimum timeout value on arm64 is between 4 and 8 ms.

        The session will discuss the impact of switching runtime PM over to using hrtimers and a more fine-grained time scale. It will also highlight the advantages and drawbacks of the changes relative to the current situation.
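
        For context, a driver using today's jiffies-based autosuspend typically looks like the sketch below (standard runtime PM calls; the 5 ms delay is just an example of a value that the current jiffies granularity cannot honour precisely on all configurations):

        #include <linux/pm_runtime.h>

        static void my_setup(struct device *dev)
        {
                /* allow runtime suspend 5 ms after the last activity */
                pm_runtime_set_autosuspend_delay(dev, 5);
                pm_runtime_use_autosuspend(dev);
                pm_runtime_enable(dev);
        }

        static int my_do_io(struct device *dev)
        {
                int ret = pm_runtime_get_sync(dev);   /* resume if suspended */
                if (ret < 0)
                        return ret;

                /* ... talk to the hardware ... */

                pm_runtime_mark_last_busy(dev);       /* restart the idle timer */
                pm_runtime_put_autosuspend(dev);      /* may suspend after delay */
                return 0;
        }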

        Speaker: Vincent Guittot (Linaro)
      • 12:05
        On-chip Interconnect API Proposal 25m

        Modern SoCs have multiple CPUs and DSPs that generate a lot of data flowing through the on-chip interconnects. The topologies could be multi-tiered and complex. These buses are designed to handle use cases with high data throughput, but as the workload varies they need to be scaled to avoid wasting power. Furthermore, the priority between masters can vary depending on the running use case like video playback or CPU intensive tasks. The purpose of this new API is to allow drivers to express their QoS needs for interconnect paths between the various SoC components. The requests from drivers are aggregated and the system configures the interconnect hardware to the most optimal performance and power profile.

        The session will discuss the following:
        - How the consumer drivers can determine their bandwidth needs.
        - How to support different QoS configurations based on whether each CPU/DSP device is active or sleeping.

        Speaker: Vincent Guittot (Linaro)
    • 09:00 17:30
      RDMA MC Junior/Ballroom-C (Sheraton Vancouver Wall Center)

      Junior/Ballroom-C

      Sheraton Vancouver Wall Center

      67

      Remote DMA Microconference

      • 09:00
        Welcome 20m

        Opening RDMA session with agenda, announcements and some statistics from last year.

        Speakers: Mr Jason Gunthorpe, Leon Romanovsky
      • 09:20
        Container and namespaces for RDMA topics 40m
        • Remaining sticky situations with container namespaces in sysfs and legacy all-namespace operation
        • Remaining CM issues
        • Security isolation problems
        Speakers: Doug Ledford, Parav Pandit
      • 10:00
        Remote page faults over RDMA 30m

        Discussion of the best way to govern 3rd party memory registration and if it is acceptable to implement RDMA-specific functionality (in this case, page fault handling) inside the kernel in order to avoid exposing additional interfaces.

        Speakers: Joel Nider, Mike Rapoport (IBM)
      • 10:30
        Break 30m
      • 11:00
        RDMA and get_user_pages 1h

        RDMA, DAX and persistent memory co-existence.

        Explore the limits of what is possible without using On Demand Paging Memory Registration. Discuss 'shootdown' of userspace MRs.

        Dirtying pages obtained with get_user_pages() can oops ext4; discuss open solutions.

        Speakers: Dan Williams (Intel Open Source Technology Center), Jan Kara, John Hubbard (NVIDIA), Matthew Wilcox
      • 12:00
        Very large Contiguous regions in userspace 30m

        Poor performance of get_user_pages on very large virtual ranges.
        No standardized API to allocate regions to user space
        Carry over from last year

        Speakers: Christopher Lameter (Jump Trading LLC), Mike Kravetz
      • 14:00
        RDMA and PCI peer to peer 1h

        RDMA and PCI peer to peer transactions. IOMMU issues. Integration with HMM. How to expose PCI BAR memory to userspace and other drivers as a DMA target.

        Speaker: Stephen Bates
      • 15:00
        Improving testing of RDMA with syzkaller, RXE and Python 30m

        Address RDMA's distinct lack of public tests.
        Provide a better framework for all drivers to test with, and a framework for basic testing in userspace.

        Worst remaining unfixed syzkaller bugs and how to try to fix them.

        Speakers: Jason Gunthorpe, Noa Osherovich
      • 15:30
        Break 30m
      • 16:00
        IOCTL conversion and new kABI topics 30m

        Attempt to close on the remaining tasks to complete the project.

        Restore fork() support to userspace

        Speaker: Jason Gunthorpe (Mellanox Technologies)
      • 16:30
        RDMA BoF and Closing Session 1h

        Let's gather together and try to plan next year.

        Speakers: Mr Jason Gunthorpe (Mellanox Technologies), Leon Romanovsky
    • 09:00 12:30
      RISC-V MC Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77

      The momentum behind the RISC-V ecosystem is really commendable, and its open nature has a large role in its growth. It has allowed contributions from both the academic and industry communities, leading to an unprecedented number of hardware design proposals in a very short span of time. Soon, a wider variety of RISC-V based hardware boards and extensions
      will be available, allowing a larger choice of applications not limited to embedded micro-controllers. The RISC-V software ecosystem also needs to grow across the stack so that RISC-V can be a true alternative to existing ISAs. Linux kernel support holds the key to this.

      The primary objective of the RISC-V track at Plumbers is to initiate a community wide discussion about the design problems/ideas for different Linux kernel features, implemented or to be implemented. We believe that this will also result in a significant increase in active developer participation in code review/patch submissions, which will definitely lead to a better and more stable kernel for RISC-V.

      • 09:00
        RISC-V Platform Specification Kick-Off 30m

        There has been a lot of talk about the need for a RISC-V platform specification to standardise various key components.
        One of them is the Platform-Level Interrupt Controller (PLIC) and local interrupts. We also need a stable yet extensible firmware interface for RISC-V to support virtualization and power management extensions.
        Another area that can be discussed is a standard boot flow for any RISC-V based Unix platform.

        Speaker: Palmer Dabbelt (SiFive)
      • 09:30
        Supervisor Binary Interface (SBI) extension in RISC-V 30m

        This is a proposal to make SBI a flexible and extensible interface. It is based on the foundational policy of RISC-V, i.e. modularity and openness. The current RISC-V SBI defines only a few mandatory functions, such as an inter-processor interrupt (IPI) interface, timer reprogramming, a serial console, and memory barrier instructions. Many important functionalities such as power management and CPU hotplug are not yet defined, due to the difficulty of accommodating modifications without breaking backward compatibility with the current interface.
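
        For context, a hedged sketch of the existing legacy (v0.1) SBI calling convention that the proposal would extend: the call number goes in register a7, the first argument in a0, and an ecall traps into the SBI implementation in machine mode, with the result returned in a0. The details below are simplified (e.g. the RV32 split of the 64-bit timer value across two registers is omitted) and may differ from the eventual extensible spec.

        /* Legacy SBI call numbers (subset). */
        #define SBI_SET_TIMER        0
        #define SBI_CONSOLE_PUTCHAR  1

        /* Issue an SBI call: call ID in a7, first argument in a0, result back in a0. */
        static inline long sbi_call(long which, long arg0)
        {
            register long a0 asm("a0") = arg0;
            register long a7 asm("a7") = which;

            asm volatile ("ecall"
                          : "+r" (a0)
                          : "r" (a7)
                          : "memory");
            return a0;
        }

        static inline void sbi_console_putchar(int ch)
        {
            sbi_call(SBI_CONSOLE_PUTCHAR, ch);
        }

        static inline void sbi_set_timer(unsigned long stime_value)
        {
            /* Simplified: on RV32 the 64-bit timer value spans two argument registers. */
            sbi_call(SBI_SET_TIMER, stime_value);
        }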

        Speaker: Atish Patra (Western Digital)
      • 10:00
        High performance computing in RISC-V 30m

        RISC-V is currently focused on the embedded market. However, RISC-V already has a design for a superior vector unit that could make the architecture very relevant for High Performance Computing, because it would provide superior floating point performance. There are other issues, though, that would need to be addressed in the RISC-V architecture, like the ability to handle large memory sizes properly, scalable locking and scalable I/O. This is an overview of things that may have to be addressed to make RISC-V competitive in the HPC area. These features overlap to some extent with what is also needed to enable cloud computing, and we will briefly cover how that could be accomplished. Ideally, in the far and distant future, I would like to see RISC-V cover all areas of computing, so that a single instruction set can be used for all use cases in our company and our support overhead can be drastically reduced, since we would no longer have to deal with multiple architectures for different use cases.

        Speaker: Christopher Lameter
      • 10:30
        Break 30m
      • 11:00
        Power Management in RISC-V 30m

        Power management needs to be designed from the ground up for RISC-V.

        Speaker: Paul Walmsley (SiFive)
      • 11:30
        RISC-V hypervisor Spec - The Good, the Bad and the Ugly 30m

        The RISC-V ISA is still missing a key aspect of modern computing: virtualization support. The hypervisor spec is currently in a draft state, although most of the key elements are there. We can discuss what the next steps are in order to start getting hypervisors running, at least in QEMU. We can also discuss getting the spec ratified and included in the official RISC-V ISA.

        Speaker: Alistair Francis (Western Digital)
      • 12:00
        Experiences from Andes Technology 30m

        Andes Technology has been involved in RISC-V Linux development since mid-2017 and has submitted 20+ patches to enhance functionality. We will discuss challenges in supporting features such as loadable modules, perf, ELF attributes, ASID, cache coherence, and AndeStar V5 extension-specific problems.

        Speakers: Alan Kao (Andes Technology), Zong Li (Andes Technology)
    • 14:00 17:30
      Kernel Summit Track Junior/Ballroom-D (Sheraton Vancouver Wall Center)

      Junior/Ballroom-D

      Sheraton Vancouver Wall Center

      67
      • 14:00
        Concurrency with tools/memory-model 45m
        Speakers: Andrea Parri, Paul McKenney (IBM Linux Technology Center)
      • 14:45
        SoC Maintainer Group 45m
        Speaker: Olof Johansson (Facebook)
      • 15:30
        Break 30m
      • 16:00
        Multiple Time Domains 45m
        Speaker: Thomas Gleixner
      • 16:45
        TBD / Unconference 45m

        Send e-mail to the ksummit-discuss list with a subject line prefix of [TOPIC] if you would like to schedule a session.

    • 14:00 18:30
      LPC Main Track Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 14:00
        Migrating to Gitlab 45m Pavillion/Ballroom-AB

        Pavillion/Ballroom-AB

        Sheraton Vancouver Wall Center

        35

        Over the past few years the graphics subsystem has been spearheading experiments in running things differently: pre-merge CI wrapped around mailing lists using patchwork, a committer model as a form of group maintainership on steroids, and other things. As a result the graphics people have run into some interesting new corner cases of the kernel's "patches carved on stone tablets" process.

        On the other hand the freedesktop.org project, which provides all the server infrastructure for the graphics subsystem, is undergoing a big reorganization of how it provides its services. The biggest change is migrating all source hosting over to a GitLab instance.

        This talk will go into the why of these changes and detail what is definitely going to change, and what is being looked into more as experiments with open outcomes.

        Speaker: Daniel Vetter (Intel)
      • 14:45
        Task Migration at Scale Using CRIU 45m Pavillion/Ballroom-AB

        Pavillion/Ballroom-AB

        Sheraton Vancouver Wall Center

        35

        The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container aware and are designed to be resilient to failures, evictions due to resource contention and scheduled maintenance events can reduce overall efficiency due to the time required to rebuild complex application state. This talk discusses the ongoing use of the open source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.
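
        As a rough illustration of the mechanics (a hedged sketch using CRIU's C client library libcriu, not Google's internal integration; the PID, paths and option choices are made up for this example), checkpointing a task tree looks roughly like this:

        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <criu/criu.h>

        /* Dump the task tree rooted at 'pid' into 'images_dir' using libcriu. */
        static int checkpoint_task(int pid, const char *images_dir)
        {
            int fd, ret;

            fd = open(images_dir, O_DIRECTORY);
            if (fd < 0) {
                perror("open images dir");
                return -1;
            }

            if (criu_init_opts() < 0)
                return -1;

            criu_set_pid(pid);                 /* root of the task tree to dump */
            criu_set_images_dir_fd(fd);
            criu_set_log_file("dump.log");
            criu_set_log_level(4);
            criu_set_leave_running(false);     /* stop the tree after a successful dump */

            ret = criu_dump();                 /* the images can then be restored elsewhere */
            if (ret < 0)
                fprintf(stderr, "criu_dump failed: %d\n", ret);
            return ret;
        }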

        Speakers: Mr Victor Marmol (Google), Mr Andy Tucker (Google)
      • 15:30
        Break 30m Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

        Pavillion/Ballroom-AB

        Sheraton Vancouver Wall Center

        35
      • 16:00
        A practical introduction to XDP 45m Pavillion/Ballroom-AB

        Pavillion/Ballroom-AB

        Sheraton Vancouver Wall Center

        35

        The eXpress Data Path (XDP) has been gradually integrated into the Linux kernel over several releases. XDP offers fast and programmable packet processing in kernel context. The operating system kernel itself provides a safe execution environment for custom packet processing applications, in the form of eBPF programs, executed in device driver context. XDP provides a fully integrated solution working in concert with the kernel's networking stack. Applications are written in higher level languages such as C and compiled via LLVM into eBPF bytecode, which the kernel statically analyses for safety and JIT-translates into native instructions. This is an alternative approach compared to kernel-bypass mechanisms (like DPDK and netmap).

        This talk gives a practically focused introduction to XDP, describing and giving code examples for the programming environment provided to the XDP developer. The programmer needs to change their mindset a bit when coding for this XDP/eBPF execution environment. XDP programs are often split between eBPF code running kernel-side and a userspace control plane. The control plane API is not predefined and is up to the programmer, with userspace manipulating shared eBPF maps.
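
        To give a flavour of what such a program looks like (a minimal, hedged sketch in the spirit of the talk, not the speakers' own code; the program name, section name and interface in the load command below are illustrative), here is an XDP program that drops all IPv6 packets and passes everything else:

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/types.h>

        #define SEC(NAME) __attribute__((section(NAME), used))

        /* Constant byte swap; assumes a little-endian BPF target. */
        #define bpf_htons(x) ((__u16)((((x) & 0x00ffU) << 8) | (((x) & 0xff00U) >> 8)))

        SEC("xdp")
        int xdp_drop_ipv6(struct xdp_md *ctx)
        {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;

            /* The in-kernel verifier insists on explicit bounds checks before access. */
            if ((void *)(eth + 1) > data_end)
                return XDP_PASS;

            if (eth->h_proto == bpf_htons(ETH_P_IPV6))
                return XDP_DROP;

            return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";

        Compiled with clang -O2 -target bpf -c xdp_drop_ipv6.c -o xdp_drop_ipv6.o, it can (assuming a recent iproute2) be attached with something like "ip link set dev eth0 xdp obj xdp_drop_ipv6.o sec xdp"; the userspace control plane would then typically adjust behaviour at runtime through shared eBPF maps rather than reloading the program.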

        Speakers: Jesper Dangaard Brouer (Red Hat), Andy Gospodarek (Broadcom)
      • 17:30
        Closing Plenary Session: Kernel Panel 1h Pavillion/Ballroom-AB

        Pavillion/Ballroom-AB

        Sheraton Vancouver Wall Center

        35

        Lively discussion among top level kernel developers about interesting topics raised during Plumbers and elsewhere

        • Kernel Panel 40m

    • 14:00 17:00
      Live kernel patching MC Junior/Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior/Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The main purpose of the Linux Plumbers 2018 Live kernel patching miniconference is to involve all stakeholders in an open discussion about the remaining issues that need to be solved in order to make live patching of the Linux kernel (more or less) feature complete.

      The focus is on features that have been proposed (some even with a preliminary implementation) but not yet finished, with the ultimate goal of sorting out the remaining issues.

      • 14:00
        Livepatch patch creation tooling 20m
        Speaker: Nicolai Stange
      • 14:20
        GCC optimizations and their impacts on live patching 20m
        Speaker: Miroslav Benes
      • 14:40
        Livepatch callback state management 20m
        Speaker: Nicolai Stange
      • 15:00
        User space live patching (libpulp) 15m
        Speaker: Joao Moreira
      • 15:15
        Livepatch stable trees 15m
        Speaker: Jason Baron
      • 15:30
        Break 30m
      • 16:00
        Elivepatch - flexible distributed live patch generation 15m
        Speaker: Alice Ferrazzi
      • 16:15
        Livepatch s390x consistency model 10m
        Speaker: Joe Lawrence
      • 16:25
        Livepatch arm64 support 10m
      • 16:35
        Objtool powerpc support 10m
        Speaker: Kamalesh Babulal
    • 17:30 18:30
      LPC Main Track Pavillion/Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-C

      Sheraton Vancouver Wall Center

      58
      • 17:30
        Closing Plenary Session: Kernel Panel 1h
    • 17:30 18:30
      LPC Main Track Pavillion/Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion/Ballroom-D

      Sheraton Vancouver Wall Center

      77
      • 17:30
        Closing Plenary Session: Kernel Panel 1h
    • 18:30 22:30
      Closing Party 4h