Linux Plumbers Conference 2019

Europe/Lisbon

Description

September 9-11, Lisbon, Portugal

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2019 will be held September 9-11 in Lisbon, Portugal.  We are looking forward to seeing you there!

    • Distribution Kernels MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      The upstream kernel community is where active kernel development happens but the majority of kernels deployed do not come directly from upstream but distributions. "Distribution" here can refer to a traditional Linux distribution such as Debian or Gentoo but also Android or a custom cloud distribution. The goal of this Microconference is to discuss common problems that arise when trying to maintain a kernel.

      Expected topics
      Backporting kernel patches and how to make it easier
      Consuming the stable kernel trees
      Automated testing for distributions
      Managing ABIs
      Distribution packaging/infrastructure
      Cross distribution bug reporting and tracking
      Common distribution kconfig
      Distribution default settings
      Which patch sets are distributions carrying?
      More to be added based on CfP for this microconference

      "Distribution kernel" is used in a very broad manner. If you maintain a kernel tree for use by others, we welcome you to come and share your experiences.

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Laura Abbott labbott@redhat.com

      • 1
        Upstream 1st: Tools and workflows for multi kernel version juggling of short term fixes, long term support, board enablement and features with the upstream kernel

        Having maintained a distribution agnostic reference kernel (Yocto), an operating
        system vendor kernel (Wind River) and finally a semi-conductor kernel (Xilinx),
        there are a lot of obvious workflows and tools that are used to deliver kernels
        and support them after release.

        The less than obvious workflows (and tools) are often related to distro kernel
        tree maintenance and balancing the needs of short term fixes (often security
        related), with a model that allows long term support, all in trees that may be
        carrying specific features or board support that are destined for upstream
        eventually. Many methods to juggle these demands are ad-hoc or specific to the
        various distros.

        If a tree is not (somewhat) history clean, and patch history is not tracked
        over time, moving to a new kernel version, understanding why a change was made
        or debugging a problem are made much harder.

        All the competing demands are coupled with the need to have development supported
        with the goal of getting changes into the mainline kernel. Understanding the
        technical solutions (tools), workflows (tools + social) and how to support the
        community at large to reduce everyone's workload is often given limited time.
        Stepping back and looking at the different solutions that maintainers are using
        may highlight common patterns and opportunities to collaborate/standardize on
        various techniques. Less-than-ideal solutions are also valuable as lessons
        learned and are worth sharing.

        Speaker: Bruce Ashfield (Xilinx)
      • 2
        Using Yocto to build a distro and maintain a kernel tree

        We'd like to spend a few minutes to provide some background around how we're using Yocto to produce kernel builds as well as bigger images that contain userspace as well, and then try to address some of the issues we're seeing with this process.

        There are a few topics we'd like to discuss with the room:

        • Using a single kernel branch for multiple, very different projects?
        • Working with kernel config fragments?
        • Reproducible kernel builds/cloning sources?
        • Is there anything saner than cve-check for pointing out known security vulnerabilities?
        Speakers: Senthil Rajaram, Sasha Levin
      • 3
        Making it easier for distros to package kernel source

        Every distro has to package the kernel tree using their own unique package
        files. Some parts of the process are built-in to the kernel source and are
        easy: build, install, and headers. Some parts are not: configs, devel
        package, userspace tools package, tests, distro versioning, changelogs,
        custom patches, etc.

        This discussion revolves around some of the issues and difficulties a
        distro maintainer faces when packaging the kernel source code. What changes
        can we agree to push upstream to make our lives easier.

        Further, discuss possibilities of plugging in distro packaging into the
        kernel source tree (through external means or internal hooks). This allows
        developers to quickly build (from a common devel env) a particular
        distro-like kernel for proper testing.

        Sample topics include:
        * config maintainence for distros
        * top-level Makefile hooks for distros
        * make devel_install -like command
        * distro versioning

        Speaker: Don Zickus (Red Hat)
      • 4
        Monitoring and Stabilizing the In-Kernel ABI

        The Kernel's API and ABI exposed to Kernel modules is not something that is usually maintained in upstream. Deliberately. In fact, the ability to break APIs and ABIs can greatly benefit the development. Good reasons for that have been stated multiple times. See e.g. Documentation/process/stable-api-nonsense.rst.
        The reality for distributions might look different though. Especially - but not exclusively - enterprise distributions aim to guarantee ABI stability for the lifetime of their released kernels while constantly consuming upstream patches to improve stability and security for said kernels. Their customers rely on both: upstream fixes and the ability to use the released kernels with out-of-tree modules that are compiled and linked against the stable ABI.

        In this talk I will give a brief overview about how this very same requirement applies to the Kernels that are part of the Android distribution. The methods presented here are reasonable measures to reduce the complexity of the problem by addressing issues introduced by ABI influencing factors like build toolchain, configurations, etc.

        While we focus on Android Kernels, the tools and mechanisms are generally useful for Kernel distributors that aim for a similar level of stability. I will talk about the tools we use (like e.g. libabigail), how we automate compliance checking and eventually enforce ABI stability.

        Speaker: Matthias Maennich (Google)
      • 5
        KernelCI applied to distributions

        While kernelci.org as a project is dedicated to testing the
        upstream Linux kernel, the same KernelCI software may be reused
        for alternative purposes. One typical example is distribution
        kernels, which often track a stable branch but also carry some
        extra patches and a specific configuration. Aside from covering
        a particular downstream branch, having a separate KernelCI
        instance also makes it possible to add specific tests that cover
        user-space functionality.

        A key aspect of KernelCI however is that the moving part remains
        the kernel revision. It is in theory possible to cover a full OS
        image with moving parts in user-space too, but that is not
        something it was originally designed for - hence an interesting
        subject for discussion.

        Speaker: Guillaume Tucker (Collabora Limited)
      • 6
        Automatically testing distribution kernel packages

        Provide better kernel packages to the distribution users, is a really hot topic in distributions, as the kernel package is the fundamental part of the distribution.
        One of the way to provide a better quality kernel is to implement a quality control by using automated tests.
        Each distributions are probably using different tools and tests suits.
        Let's share our knowledge and which tools are using.

        Which Continuous integrations tools are better to use? (buildbot, jenkins)
        What kernel tools are better to use for testing (lpt, kselftest)

        Speaker: Alice Ferrazzi
      • 7
        Distros and Syzkaller - Why bother?

        Syzkaller is run on Upstream and Stable trees. When paired with KASAN it has proven its usefulness uncovering large numbers of Out-of-Bounds (OOB) and Use-after-free (UAF) bugs. These results are readily available on the syzbot dashboard. What do distros gain by running Syzkaller?

        Distros regularly add features to their kernels, fix bugs and add third party drivers. Syzkaller testing focused on these changes and additions can uncover bugs and detect regressions.

        Syzkaller can be part of a distro's continuous integration (CI) strategy. Dedicated Syzkaller CI servers can be running the distro's next release candidate, only being halted and restarted as features, bug fixes or third party drivers are added.

        How can distros collaborate? There are many third party drivers common to all distros. Distros can collaborate on the Syzkaller testing framework for these drivers. Likewise for features that are going Upstream.

    • Kernel Summit Track Floriana/room-III (Corinthia Hotel Lisbon)

      Floriana/room-III

      Corinthia Hotel Lisbon

      100

      This year, the Maintainer's and Kernel Summit will be at the Corinthia Hotel in Lisbon, Portugal, September 9th -- 12th. The Kernel Summit will be held as a track during the Linux Plumbers Conference September 9th -- 11th. The Maintainer's Summit will be held afterwards, on September 12th. As in previous years, the "Maintainer's Summit" is an invite-only, half-day event, where the primary focus will be process issues around Linux Kernel Development.

      The "Kernel Summit" is organized as a track which is run in parallel with the other tracks at the Linux Plumber's Conference (LPC), and is open to all registered attendees of LPC. The goal of the Kernel Summit track will be to provide a forum to discuss specific technical issues that would be easier to resolve in person than over e-mail.

      We will reserving roughly some Kernel Summit slots for last-minute discussions that will be scheduled during the week, in an "unconference style". This allows ideas that come up in hallway discussions, and in the LPC miniconferences, to be given
      scheduled, dedicated times for discussion.

      • 8
        Reworking of KVA allocator in Linux kernel Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Hello.

        I would like to give a talk about KVA allocator in the kernel and about
        improvements i have done.

        See below the presentation:

        ftp://vps418301.ovh.net/incoming/Reworking_of_KVA_allocator_in_Linux_kernel.pdf

        Thank you in advance!

        --
        Vlad Rezki

        Speaker: Mr Uladzislau Rezki
      • 9
        Touch but don’t look: Running the kernel in execute only memory Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Execute only memory can protect from attacks that involve reading executable code. This feature already exists on some CPUs and is enabled for userspace.

        This talk will explain how we are working on creating a virtualized “not-readable” permission bit for guest page tables for x86 and the impact to the kernel. This bit can be used to create execute-only memory for userspace programs as done on other architectures, but newly also kernel text itself. This project has a working POC, but requires extra care being taking in the kernel going forward around certain code patterns in order for the kernel to run in execute only. This will be the main “call to action” of the talk.

        The talk will cover three areas:

        -Benefits of execute only memory

        As was covered in the talk last year by Kristen Accardi, execute only memory can protect code diversification schemes like KASLR, ASLR, and especially fined grained ASLR. This would be a brief summary and will also touch on some attacks that involve reading kernel text

        -How we are implementing this across QEMU, KVM, and the guest Linux Kernel.

        The solution is sort of novel and interesting it itself, but most of the talk will be about kernel impact of this feature on not the hypervisor implementation. The gist of the solution involves pretending to the guest that the CPU has one less physical address bit than it actually does, so what looks to the guest like a reserved bit looks to the CPU like a physical address bit. Our proposed new KVM APIs can allow userspace VMMs to duplicate memory such that this bit selects from differently permission-ed copies of the same guest physical memory. Intel EPT has the ability to create execute only guest physical memory, so by having the second half of the memory as execute only, we can make a bit that can mark guest virtual memory as execute only.

        -Proposed APIs for using execute only memory in userspace and changes and restrictions required to the Linux kernel in order for it to map its own executable code as execute only.

        Our POC required making surprisingly few changes to the Linux kernel, however there were impacts especially around features that involve modifying or mapping new executable code. Long term, however, supporting this feature fully would involve the community agreeing that going forward, code patterns that violate execute only memory would not be allowed in the kernel.

        Speaker: Rick Edgecombe (Intel)
      • 11:30 AM
        Break Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
      • 10
        Maple Tree Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        The Red-Black tree and Radix tree are used in many places in the kernel to store ranges. Both of these trees have drawbacks when used for ranges. The Red-Black tree requires writing your own insertion & search code. It is also designed with the assumption that memory accesses are cheap, which is no longer true. The Radix tree performs acceptably well when ranges are aligned to a power of 2, but has awful worst-case performance.

        The Maple tree is a fast, cache efficient tree with a simple API. It supports contiguous ranges efficiently, while suffering only minor penalties for discontiguous ranges. Single entries are also supported as a range of length one.

        The Maple tree can optionally track free ranges to allow for more efficient allocation. In order to allow it to be used as the basis for the page cache, it will need support for search marks as well as handling reclamation of shadow entries. Other potential users of the Maple tree want more than the three search marks currently supported by the Radix tree.

        We want to discuss requirements with potential users of the Maple tree, and to present development since the last Plumbers conference where the broad outlines of the tree were first presented.

        Speaker: Mr Liam Howlett (Oracle)
      • 11
        The list is our process: An analysis of the kernel's email-based development process Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Implementing safety-critical systems usually requires adhering to meticulously defined development processes that specify how code is supposed to be developed, integrated and reviewed, driven by the assumption that a disciplined approach leads to reliably high quality. While known to produce code that can satisfy the highest quality standards, Linux kernel development does not follow such strict patterns, although it is certainly far from a random process. But how can we ensure the quality of a mostly informal approach?

        Our work aims at identifying core properties, strengths and weaknesses in the development process by tracking the evolution of components from initial submissions on mailing lists to the final merged contributions.

        This talk will:

        • introduce heuristics to identify the evolution of patches on the mailing list and match patch emails against their included git commit counterparts.

        • present our publicly available data set of kernel-related email available, curated large-scale data set from more than 200 kernel-related mailing lists

        We discuss observations and insights and we draw, ranging form simpler questions like how long the average time from the first version of a patch submission to its final inclusion is, down to a categorisation and analysis of off-list patches and ignored patches.

        We particularly seek interaction with experts from the community to discuss benefits and limitations of our approach. We will show how we would like to make this information available in the patchwork tool, and present prototypes of tools and development process analyses that that we would like to refine so that they are useful to Linux kernel developers and maintainers in their every day work. We hope this work can contribute to a future kernel maintainers handbook.

        Speakers: Mr Ralf Ramsauer (OTH Regensburg), Prof. Wolfgang Mauerer (OTH Regensburg), Lukas Bulwahn (BMW AG)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 12
        Upstream Graphics: Too little, too late Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        DRM is merging new drivers at a brisk pace, and with lima and panfrost to support ARM Mali GPUs the last obvious gap in not yet reverse-engineered hardware is getting closed. Plus new features, more contributors, more patches - in general upstream graphics is as healthy as it's never been before.

        Time for some celebratory drinks, except this talk will be none of that. Now that we've achieved the goal of supporting all things graphics in upstream, the struggles didn't disappear. The promised land of "Upstream First" is leaving a rather sour aftertaste.

        This talk will go through all the ways companies and teams have tried to ship graphics drivers using upstream, and how they all go wrong.

        It will, unfortunately, not present solutions.

        Speaker: Daniel Vetter (Intel)
      • 13
        Deep Argument Inspection and Seccomp Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
        Speaker: Christian Brauner
      • 4:30 PM
        Break Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
      • 14
        Inline Encryption Support Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Storage hardware with built-in “inline” encryption support is becoming increasingly common, especially on mobile SoCs running Android; it's also now part of the UFS and eMMC standards. These devices en/decrypt data between the application processor and disk without generating disk latency or cpu overhead. Inline encryption hardware can be programmed to hold multiple encryption keys simultaneously and can be dynamically reprogrammed to use any of these programmed encryption keys to en/decrypt a particular request. This makes this new class of storage ideal for supporting fscrypt (file-based encryption). Unfortunately, there isn’t currently a unified approach for supporting inline encryption hardware in the Linux kernel.

        We’ve sent out an RFC patchset to add support for inline encryption to the block subsystem, UFS driver, f2fs, and fscrypt
        (https://www.spinics.net/lists/linux-block/msg40330.html).
        We’ll discuss our approach including:

        • How the filesystem communicates an encryption key to inline
          encryption hardware for each struct bio it submits.
        • How to add support for inline encryption to storage drivers.
        • Support for layered devices like device mapper.
        • A software crypto fallback.
        • How this work can make future encryption tasks cleaner - like
          metadata encryption, file-based encryption on removable storage and
          the possibility of unifying how fscrypt, dm-crypt, and eCryptfs
          implement encryption.
        Speaker: Satya Tangirala
      • 15
        TAB Elections Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
    • LPC Refereed Track Floriana/room-II (Corinthia Hotel Lisbon)

      Floriana/room-II

      Corinthia Hotel Lisbon

      200
      • 16
        oomd2 and beyond: a year of improvements Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. Perhaps surprisingly, the kernel does not often handle this well. oomd builds on top of recent kernel development to effectively implement OOM killing in userspace. This results in a faster, more predictable, and more accurate handling of OOM scenarios.

        oomd has gained a number of new features and interesting deployments in the last year. The most notable feature is a complete redesign of the control plane which enables arbitrary but "gotcha"-free configurations. In this talk, Daniel Xu will cover past, present, future, and path-not-taken development plans along with experiences gained from overseeing large deployments of oomd.

        Speaker: Daniel Xu (Facebook)
      • 17
        Core Scheduling: Taming Hyper-Threads to be secure Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Last couple of years, we have witnessed an onslaught of vulnerabilities in the design and architecture of cpus. It is interesting and surprising to note that the vulnerabilities are mainly targeting the features designed to improve the performance of cpus - most notable being the hyperthreading(smt). While some of the vulnerabilities could be mitigated in software and cpu microcodes, couple of others didn't have any satisfiable mitigation other than making sure that smt is off and every context switch needed to flush the cache to clear the data used by the task that is being switched out. Turning smt off is not a viable alternative to many production scenarios like cloud environment where you lose a considerable amount of computing power by turning off smt. To address this, there have been community efforts to keep smt on while trying to make sure that non-trusting applications are never run concurrently in the hyperthreads of the core, they have been widely called as core scheduling.

        This talk is about the development, testing and profiling efforts of core scheduling in the community. There were multiple proof of concepts - while differing in the design, ultimately trying to make sure that only mutually trusted applications run concurrently on the core. We discuss the design, implementation and performance of the POCs. We also discuss the profiling attempts to understand the correctness and performance of the patches - various powerful kernel features that we leveraged to get the most time sensitive data from the kernel to understand the effect of scheduler with the core scheduling feature. We plan to conclude with a brief discussion of the future directions of core scheduling.

        The core idea about core scheduling is to have smt on and make sure that only trusted applications run concurrently on siblings of a core. If there are no group of trusting applications runnable on the core, we need to make sure that remaining siblings should idle while applications run in isolation on the core. This should also consider the performance aspects of the system. Theoretically it is impossible to reach the same level of performance where the cores are allowed to any runnable applications. But if the performance of core scheduling is worse than or same as the smt off situation, we do not gain anything from this feature other than the added complexity in the scheduler. So the idea is to achieve a considerable boost in performance compared to smt-off for the majority of production workloads.

        Security boundary is another aspect of critical importance in core scheduling. What should be considered as a trust boundary? Should it be at the user/group level, process level or thread level? Should kernel be considered trusty by applications or vice-versa? With virtualization and nested virtualization in picture, this gets even more complicated. But answers to most of these questions are environment and workload dependent and hence these are implemented as policies rather than hardcoding in the code. And then arises the question - how the policies should be implemented? Kernel has a variety of mechanisms to implement these kind of policies and the proof of concepts posted upstream mainly uses cgroups. This talk also discusses other viable options for implementing the policies.

        Speakers: Julien Desfossez (DigitalOcean), Vineeth Remanan Pillai
      • 11:30 AM
        Break Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
      • 18
        Scaling performance profiling infrastructure for data centers Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Understanding Application performance and utilization characteristics is critically important for cloud-based computing infrastructure. Minor improvements in predictability and performance of tasks can result in large savings. Google runs all workloads inside containers and as such, cgroup performance monitoring is heavily utilized for profiling. We rely on two approaches built on Linux performance monitoring infrastructure to provide task, machine, and fleet performance views and trends. A sampling approach collect metrics across the machine and try to attribute it back to cgroups while a counting approach tracks when a cgroup is scheduled and maintains state per cgroup. There are number of trade-offs associated with both approaches. We will present an overview and associated use-cases for both approaches at Google.

        As the servers have gotten bigger, number of cores and containers on a machine have grown significantly. With the bigger scale, interference is a bigger problem for multi-tenant machines and performance profiling becomes even more critical. However, we have hit multiple issues in scaling the underlying Linux performance monitoring infrastructure to provide fresh and accurate data for our fleet. The performance profiling has to deal with the following issues:

        • Interference: To be tolerated by workloads, monitoring overhead
          should be minimal - usually below 2%, some latency-sensitive workloads
          are certainly even less tolerant than that. As we gain more
          introspection into our workloads, we end up having to use more and
          more events, to pinpoint certain bottlenecks. That unavoidably
          incurs event multiplexing as the number of core hardware counters is
          very limited compared to containers profiled and number of events monitored. Adding counters is not free in hardware and similarly in the kernel as more
          work registers must be saved and restored on context switches which can cause jitters for applications being profiled.
        • Accuracy: Sampling at machine level reduces some of the associated costs, but attributing the counters back to containers is lossy and we see a large drop in accuracy of profiling. The attribution gets progressively worse as we move to bigger machines with large number of threads. The attribution errors severely limit the granularity of performance improvements and degradations we can measure in our fleet.
        • Kernel overheads: Perf_events event multiplexing is a complex and expensive algorithm that is especially taxing when run in cgroup mode. As implemented, scheduling of cgroup events is bound by the number of cgroup events per-cpu and not the number of counters, unlike regular per-cpu monitoring. To get a consistent view of activity on a server, Google needs to periodically count events per-cgroup. Cgroup monitoring is preferred over per-thread monitoring because Google workloads tend to use an extensive number of threads, so that would be prohibitively expensive to use. We have explored ways to avoid these scaling issues and make event multiplexing faster.
        • User-space overheads: The bigger the machines, the larger the volume of profiling data generated. Google relies extensively on the perf record tool to collect profiles. There are significant user-space overheads to merge the per-cpu profiles and post-process for attribution. As we look to make perf-record multi-threaded for scalability, data collection and merging becomes yet another challenge.
        • Symbolization overheads : Perf tools rely on /proc/PID/maps to understand process mappings and to symbolize samples. The parsing and scanning of /proc/PID/maps is time-consuming with large overheads. It is also riddled with race conditions as processes are created and destroyed during parsing.

        These are some of the challenges we have encountered while using perf_events and the perf tool at scale. To continue to make this infrastructure popular, it needs to adapt to new hardware and data-center realities fast now. We are planning to share our findings and optimizations followed by an open discussion on how to best solve these challenges.

        Speakers: Rohit Jnagal, Stephane Eranian (Google Inc), Ian Rogers (Google Inc)
      • 19
        printk: Why is it so complicated? Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The printk() function has a long history of issues and has undergone many iterations to improve performance and reliability. Yet it is still not an acceptable solution to reliably allow the kernel to send detailed information to the user. And these problems are even magnified when using a real-time system. So why is printk() so complicated and why are we having such a hard time finding a good solution?

        This talk will briefly cover the history of printk() and why the recent major rework was necessary. It will go through the details of the rework and why we believe it solves many of the issues. And it will present the issues still not solved (such as fully synchronous console writing), why these issues are particularly complex and controversial, and review some of the proposed solutions for moving forward.

        This talk may be of particular interest to developers with experience or interest in lockless ring buffers, memory barriers, and NMI-safe synchronization.

        Speaker: John Ogness (Linutronix GmbH)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 20
        What does remote attestation buy you? Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        TPM remote attestation (a mechanism allowing remote sites to ask a computer to prove what software it booted) was an object of fear in the open source community in the 2000s, a potential existential threat to Linux's ability to interact with the free internet. These concerns have largely not been realised, and now there's increasing interest in ways we can use remote attestation to improve security while avoiding privacy concerns or attacks on user freedom.

        More modern uses of remote attestation include simplifying deployment of machines to remote locations, easy recovery of systems with nothing more than a network connection, automatic issuance of machine identity tokens, trust-based access control to sensitive resources and more. We've released a full implementation, so this presentation will discuss how it can be tied in to various layers of the Linux stack in ways that give us new functionality without sacrificing security or freedom.

        Speaker: Matthew Garrett (Google)
      • 21
        Linux kernel fastboot on the way Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Linux kernel fastboot is critical for all kinds of platforms: from embedded/smartphone to desktop/cloud, and it has been hugely improved over years. But, is it all done? Not yet!

        This topic will first share the optimizations done for our platform, which cut the kernel (inside a VM) bootime from 3000ms to 300ms, and then list the future potential optimization points.

        Here are our optimizations:
        1. really enable device drivers' asynchronous probing, like i915 to improve boot parallelization
        2. deferred memory init leveraging memory hotplug feature
        3. Optimize rootfs mounting (including storage driver and mounting)
        4. kernel modules and configs optimization
        5. reduce the hypervisor cost
        6. tools for profiling/analyzing

        Potential optimizations spots for future, which needs discussion and collaboration from the whole community:
        1. how to make maximal use of multi-core and effectively distribute boot tasks to each core
        2. smp init for each CPU core costs about 8ms, a big burden for large systems
        3. force highest cpufreq as early as possible (kernel decompress time)
        4. devices enumeration for firmware (like ACPI) set to be parallel
        5. in-kernel deferred memory init (for 4GB+ platform)
        6. user space optimization like systemd

        Speaker: Mr Feng Tang
      • 4:30 PM
        Break Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
      • 22
        Red Hat joins CI party, brings cookies Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        For the past couple of years the CKI ("cookie") project at Red Hat has been transforming the way the company tests kernels, going from staged testing to continuous integration. We've been testing patches posted to internal maillists, responding with our results, and last year we started testing stable queues maintained by Greg KH, posting results to the "stable" maillist.

        Now we'd like to expand our efforts to more upstream maillists, and join forces with CI systems already out there. We'll introduce you to the way our CI works, which tests we run, our extensive park of hardware, and how we report results. We'd like to hear what you need from a CI system, and how we can improve. We'd like to invite you to cooperation, both long-term, and right there, at a hackfest organized during the conference.

        Naturally, real cookies will make an appearance.

        Speakers: Nikolai Kondrashov (Red Hat), Veronika Kabatova (Red Hat)
      • 23
        Challenges of the RDMA subsystem Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The RDMA subsystem in Linux (drivers/infiniband) is now becoming widely used and deployed outside its traditional use case of HPC. This wider deployment is creating demand for new interactions with the rest of the kernel and many of these topics are challenging. 

        This talk will include a brief overview of RDMA technology followed by an examination & discussion of the main areas where the subsystem has presented challenges in Linux: 

        Very complex user API. An overview of the current design, and some reflection on historical poor choices 

        The DMA from user space programming model and the challenge matching that to the DMA API in Linux 

        Development of user space drivers along with kernel drivers 

        Delegation of security decisions to HW 

        Interaction with file systems, DAX, and the page cache for long term DMA 

        Inter-operation with GPU, DMABUF, VFIO and other direct DMA subsystems 

        Growing breadth of networking functionality and overlap with netdev, virtio, and nvme 

        Fragmentation of wire protocols and resulting HW designs 

        Placing high performance as paramount and how this results in HW restrictions limiting the architecture and APIs of the subsystem 

        The advent of new general computation acceleration hardware is seeing new drivers proposed for Linux that have many similar properties to RDMA. These emerging drivers are likely to face these same challenges and can benefit from lessons learned. 

        RDMA has been a successful mini-conference at the last three LPC events, and this talk is intended to complement the proposed RDMA micro-conference this year. This longer more general topic is intended to engage people unfamiliar with the RDMA subsystem and the detailed topics that would be included in the RDMA track. 

        The main goal would be to help others in the kernel community have more background for RDMA and its role when making decisions. In part this proposal is motivated by the number of times I heard the word 'RDMA' mentioned at LSF/MM.  Often as some opaque consumer of some feature. 

        Jason Gunthorpe is a Sr. Principal Engineer at Mellanox and has been the co-maintainer for the RDMA subsystem for the last year and a half. He has 20 years' experience working with the Linux kernel and in RDMA and InfiniBand technologies.

        Speaker: Mr Jason Gunthorpe (Mellanox Technologies)
    • Networking Summit Track Floriana/room-I (Corinthia Hotel Lisbon)

      Floriana/room-I

      Corinthia Hotel Lisbon

      180
      • 24
        Linux Kernel VxLan with Multicast Routing for flood handling Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        The Linux kernel VxLan driver supports two ways of handling flooded traffic to multiple remote VxLan termination end points (VTEPS):
        (a) Head end replication: where the VxLan driver sends a copy of the packet to each participating remote VTEPs
        (b) Use of multicast routing to forward to participating remote VTEPs

        (b) is generally preferred for both hardware and software VTEP deployments because it scales better. The kernel VxLan driver supports (b) with static config today. One has to specify the multicast group with the outgoing uplink interface for VxLan multicast replication to work. This is mostly ok for deployments where VTEPs are deployed on the host/hypervisor. When deploying Linux VTEPs on the Top-Of-the-Rack (TOR) switches in a data center CLOS network, it is impossible to configure the outgoing interface statically. Typically a multicast routing protocol like PIM is used to dynamically calculate multicast trees and install forwarding paths for multicast traffic.

        In this talk we will cover:
        - Vxlan Multicast deployment scenarios with Vxlan VTEPs at the TOR switches
        - Current challenges with integrating Vxlan Multicast replication in a dynamic multicast routing environment
        - Solutions to these challenges: (a) Patches to fix routing of locally generated multicast packets (need for ip_mr_output) (b) Patches to VxLan driver to allow multicast replication without a static outgoing interface
        - Scale
        - Futures on VxLan deployments in multicast environment

        Speaker: Roopa Prabhu (Roopa)
      • 25
        BPF packet capture helpers, libbpf interfaces Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Packet capture is useful from a general debugging standpoint, and is useful in particular in debugging BPF programs that do packet processing. For general debugging, being able to initiate arbitrary packet capture from kprobes and tracepoints is highly valuable (e.g. what do the packets that reach kfree_skb() - representing error codepaths - look like?). Arbitrary packet capture is distinct from the traditional concept of pre-defined hooks, and gives much more flexibility in probing system behaviour. For packet-processing BPF programs, packet capture can be useful for doing things such as debugging checksum errors. The intent of this proposal is to help drive discussion around how to ease use of such features in BPF programs, namely:

        • should additional BPF helper(s) be provided to format packet data suitable for libpcap interpretation?
        • should libbpf provide interfaces for retrieving packet capture data?
        • should interfaces be provided for pushing filters?

        Note that while there has been some work in this area already, such as

        https://new.blog.cloudflare.com/xdpcap/

        ...it seems like such efforts would be made much simpler if APIs were provided.

        Speaker: Alan Maguire (Oracle)
      • 11:30 AM
        Break Floriana/room-I (Corinthia Hotel Lisbon)

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 26
        Multipath TCP Upstreaming Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Multipath TCP (MPTCP) is an increasingly popular protocol that members of the kernel community are actively working to upstream. A Linux kernel fork implementing the protocol has been developed and maintained since March 2009. While there are some large MPTCP deployments using this custom kernel, an upstream implementation will make the protocol available on Linux devices of all flavors.

        MPTCP is closely coupled with TCP, but an implementation does not need to interfere with operation of normal TCP connections. Our roadmap for MPTCP in Linux begins with the server use case, where connections and additional TCP subflows are generally initiated by peer devices. This will start with RFC 6824 compliance, but with a minimal feature set to limit the code footprint for initial review and testing.

        The MPTCP upstreaming community has shared a RFC patch set on the netdev list that shows our progress and how we plan to build around the TCP stack. We'll share our roadmap for how this patch set will evolve before final submission, and discuss how this first step will differ from the forked implementation.

        Once we have merged our baseline code, we have plans to continue development of more advanced features for managing subflow creation (path management), scheduling outgoing packets across TCP subflows, and other capabilities important for client devices that initiate connections. This includes making use of a userspace path manager, which has an alpha release available already. In future kernel releases we will make use of additional TCP features and optimize MPTCP performance as we get more feedback from kernel users.

        Both the communication and the code are public and open. You can find us at mptcp@lists.01.org and https://is.gd/mptcp_upstream

        Speakers: Mat Martineau (Intel), Matthieu Baerts (Tessares)
        Paper
        Slides
      • 27
        Programmable socket lookup with BPF Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        At Netconf 2019 we have presented a BPF-based alternative to steering
        packets into sockets with iptables and TPROXY extension. A mechanism
        which is of interest to us because it allows (1) services to share a
        port number when their IP address ranges don't overlap, and (2) reverse
        proxies to listen on all available port numbers.

        The solution adds a new BPF program type BPF_INET_LOOKUP, which is
        invoked during the socket lookup. The BPF program is able to steer SKBs
        by overwriting the key used for listening socket lookup. The attach
        point is associated with a network namespace.

        Since then, we have been reworking the solution to follow the existing
        pattern of using maps of socket references for redirecting packets, that
        is REUSEPORT_SOCKARRAY, SOCKMAP, or XSKMAP. We expect to publish the
        next version of BPF_INET_LOOKUP RFC patch set, which addresses the
        feedback from Netconf, in August.

        During LPC 2019 BPF Microconference we would like to briefly recap on
        how BPF-driven socket lookup compares to classic bind()-based dispatch,
        TPROXY packet steering, and socket dispatch on TC ingress currently in
        development by Cilium.

        Next we would like discuss low-level implementation challenges. How to
        best ensure that packet delivery to connected UDP sockets remains
        unaffected? Can a BPF_INET_LOOKUP program co-exist with reuseport
        groups? Is there a possibility of code sharing with REUSEPORT_SOCKARRAY
        implementation?

        Following the implementation discussion, we will touch on performance
        aspects, that is what is the observed cost of running BPF during socket
        lookup both in SYN flood and UDP flood scenarios.

        Finally, we want to go into the usability of user-space API. Redirection
        with a BPF map of sockets raises a question who populates the map, and
        if existing network applications like NGINX need to be modified in any
        way to receive traffic steered with this new mechanism.

        The desired outcome of the discussion is to identify steps needed to
        graduate the patch set from an RFC series to a ready-for-review
        submission.

        Speakers: Jakub Sitnicki (Cloudflare), Lorenz Bauer (Cloudflare), Marek Majkowski (Cloudflare)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 28
        XDP bulk packet processing Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        It is well known that batching can often improve software performance. This is
        mainly because it utilizes the instruction cache in a more efficient way.
        From the networking perspective, the size of driver's packet processing
        pipeline is larger than the sizes of instruction caches. Even though NAPI
        batches packets over the full stack and driver execution, they are processed
        one by one by many large sub systems in the processing path. Initially this
        was raised by Jesper Brouer. With Edward Cree's listifying SKBs idea, the
        first implementation results look promising. How can we take this a step
        further and apply this technique to the XDP processing pipeline?

        To do that, the proposition is to back down from preparing xdp_buff struct
        one-by-one, passing it to XDP program and then acting on it, but instead we
        would prepare in driver an array of XDP buffers to be processed. Then, we
        would have only a single call per NAPI budget to XDP program, which would give
        us back a list of actions that driver needs to take. Furthermore, the number
        of indirect function calls, gets reduced, as driver gets to jited BPF program
        via indirect function call.

        In this talk I would like to present the proof-of-concept of described idea,
        which was yielding around 20% better XDP performance for dropping packets with
        touching headers memory (modified xdp1 from linux kernel's bpf samples).

        However, the main focus of this presentation should be a discussion about a
        proper, generic implementation, which should take place after showing out the
        POC, instead of the current POC. I would like to consider implementation
        details, such as:
        - would it be better to provide an additional BPF verifier logic, that when
        properly instrumented (make use of prologue/epilogue?), would emit BPF
        instructions responsible for looping over XDP program, or should we have the
        loop within the XDP programs?
        - the mentioned POC has a whole new NAPI clean Rx interrupt routine; what
        should we do to make it more generic in order to make driver changes
        smaller?
        - How about batching the XDP actions? Do all the drops first, then Tx/redirect,
        then the passes. Would that pay off?

        Speaker: Maciej Fijałkowski
      • 29
        LAG and hardware offload to support RDMA and IO virtualized interfaces Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Link Aggregation (LAG) is traditionally served by bonding driver. Linux bonding driver supports all LAG modes on almost any LAN drivers - in the software. However modern hardware features like SR-IOV-based virtualization and state full offloads such as RDMA are currently not well supported by this model. One of possible options to solve that is to implement LAG functionality entirely in NIC's hardware or firmware. In our presentation we present another approach, where LAG functionality for state full offloads such as RDMA and IO virtualization is implemented mostly in software, with very limited support from existing Hardware and firmware. A concept that should make the solution more generic without complicating the HW any further.

        The presentation is focused on 3 areas: implementation of active-backup mode for RDMA and virtual functions, usage of RX hash value to implement flow-based active-active mode and new active-active mode for virtual functions.

        Proposed implementation of the active-backup mode for RDMA is done in RDMA and LAN drivers. An application continues using direct HW support for RDMA. LAN driver (with the help of RDMA driver) observes notifications from the bonding driver and accordingly controls low-level TX scheduling and RX rules for RDMA queues. The same mechanism can be used to transparently redirect network virtual functions from active to backup. We further explore the use of RX hash to implement active-active mode.

        Speakers: Mr Vivek Kashyap (Intel), Ms Anjali Singhai Jain (Intel), Dr Piotr Uminski (Intel)
      • 4:30 PM
        Break Floriana/room-I (Corinthia Hotel Lisbon)

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 30
        netfilter hardware offloads Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        With the advent of the the flow rule and flow block API, ethtool_rx, netfilter and tc can share the same infrastructure to represent hardware offloads.

        This presentation discusses the reuse of the existing infrastructure originally implemented by tc, such as the netdev_ops->ndo_setup_tc() interface and the TC_SETUP_CLSFLOWER classifier.

        Speaker: Mr Pablo Neira
      • 31
        SwitchDev offload optimizations Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Linux has a nice SW bridge implementation which provides most of the classic
        Ethernet switching features. DSA and SwitchDev frameworks allow us to
        represent HW switch devices in Linux and potentially offload the SW forwarding
        to HW.

        But the offloading facilities are not perfect, and there seem to be room for
        further improvements:

        • Limiting the flooding of L2-Multicast traffic. IGMP snooping can limit the
          flooding of L3 traffic, but L2-Multicast traffic are always flooded.

        • Today all bridge slave interfaces are put into promiscuous mode to allow
          learning/flooding. But if the bridge is offloaded with HW capable of doing
          learning/learning, then this should not be necessary.

        • When not put into promiscuous mode, the struct net_device structure has a
          list of multicast addresses which should be received by the interface. But
          when VLAN sub-interfaces are created, the VLAN information is lost when
          addresses are installed in the mc list.

        • The assumption in the bridge code is that all multicast frames goes to the
          CPU. But what would it actually take only to request the needed multicast
          frames to the CPU?

        • Challenges in adding new redundancy and protection protocols to the kernel,
          and how to offload such protocols to HW.

        The intend with the talk is to present some of the issues we are facing in
        adding DSA/SwitchDev drivers for existing and near-time future HW. I will have few solutions to present, but will give our thoughts on how it may be solved. Hopefully with will result in good discussions and input from the audience.

        Background information: I'm working on a SwitchDev driver for a yet to be
        released HW Ethernet switch. It will be a TSN switch targeting industrial
        networks, with HW accelerators to implement redundancy protocols. CPU power are very limited, and latency are extremely important, which is why it is important for us to improve the HW offload facilities.

        Speaker: Mr Allan Nielsen
    • RISC-V MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      The Linux Plumbers 2019 RISC-V MC will continue the trend established in 2018 [2] to address different relevant problems in RISC-V Linux land.

      The overall progress in RISC-V software ecosystem since last year has been really impressive. To continue the similar growth, RISC-V track at Plumbers will focus on finding solutions and discussing ideas that require kernel changes. This will also result in a significant increase in active developer participation in code review/patch submissions which will definitely lead to a better and more stable kernel for RISC-V.

      Expected topics
      RISC-V Platform Specification Progress, including some extensions such as power management - Palmer Dabbelt
      Fixing the Linux boot process in RISC-V (RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result of this developers can use the same boot loaders to boot Linux on RISC-V as they do in other architectures, but there's more work to be done) - Atish Patra
      RISC-V hypervisor emulation [5] - Alistair Francis
      RISC-V hypervisor implementation - Anup Patel
      NOMMU Linux for RISC-V - Damien Le Moal
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Atish Patra (atish.patra@wdc.com) or Palmer Dabbelt (palmer@dabbelt.com)

      • 32
        RISC-V Platform Specification Progress

        The RISC-V UNIX-Class platform specification working group started in May and aims to have a first release by the end of the year. This talk will discuss where we are and where we're going.

        Speakers: Palmer Dabbelt (SiFive), ATISH PATRA (Western Digital)
      • 33
        Fixing the Linux boot process in RISC-V

        RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result of this developers can use the same boot loaders to boot Linux on RISC-V as they do in other architectures, but there's more work to be done. We will discuss the current state of the boot flow and pending issues.

        Speaker: ATISH PATRA (Western Digital)
      • 34
        Introduce an implementation of IOMMU in linux-riscv

        IOMMU is a very popluar equipment for both embed and server virtualization area. In the topic we'll focus on embed area and shared virtual address.

        Firstly, we'll talk about the value of IOMMU for the embed system and what the benefit we could get from IOMMU in our cost-down embed system.

        Secondly, Guo will share the experience on the IOMMU implementation, eg: How to keep the same asid with CPU and IOMMU in hardware. How to share CPU's page table with IOMMU for user space address.

        Lastly, let's have a free discussion on riscv mmu, iommu and SVA related issues.

        Speakers: Mr Ren Guo (c-sky.com (belong to Alibaba.com)), Mr Han Mao (c-sky.com (belong to Alibaba.com))
      • 35
        Introduce an implementation of perf trace in riscv system

        RISC-V trace spec draft have defined some trace format, we'll share our implementation of linux perf trace based on the spec. How to deal with SMP perf issues, how to verify our design in qemu, demonstrate a demo of perf trace with riscv-qemu.

        Lastly, let's discuss perf issues from PMU to trace, any riscv perf topic.

        Speakers: Mr Guo Ren, Mr Han Mao (c-sky.com (belong to Alibaba.com))
      • 11:30 AM
        Break
      • 36
        Early HPC uses cases for RISC V

        The current main uses cases of RISC V center on embedded uses and small configurations. However, RISC V seems to be also a useful platform to do High Performance Computing and may be able to deliver custom solutions that can go well beyond what the traditional processor vendors can offer. There are already efforts underway to use ARM for that purpose but those approaches are constrained by limits placed on that platform through licensing. It is natural to expect a move to RISC V there as well.

        This talk is looking at use cases in HPC such as to create custom compute solutions replacing GPUs and numerous vector processing extensions of typical processors. HPC users often feel constrained by the limits on the implementations provided to them and are hopeful that RISC V will offer a heretofore unavailable flexibility for them.

        Other further use cases may be customizing access to newer forms of memory (such as HBM, Persistent memory, DDR5/6 and other approaches) as well as providing implementations of fast packed processing for High Speed Networks (such as Infiniband, NVlink and Ethernet). The problem of line rate processing at 100Gbps and higher may actually require the development of custom processors to have a reasonable way to process data at these speeds.

        Speaker: Christopher Lameter (Jump Trading LLC)
      • 37
        RISC-V hypervisor implementation

        The RISC-V hypervisor extension is carefully designed to be compliant with both Type-1 and Type-2 hypervisors. We have ported Xvisor (Type-1) and KVM (Type-2) for RISC-V architecture. In this session, we share our experience porting these hypervisors and also discuss future work on RISC-V hypervisors.

        Speaker: Mr Anup Patel (Western Digital)
      • 38
        RISC-V Hypervisor ISA Emulation

        This presentation discusses the work done to add the RISC-V Hypervisor Extension support to QEMU. This allows everyone to use QEMU as a development platform for porting Hypervisors to RISC-V. This can be seen by the recent effort to port KVM to RISC-V.

        This presentation will discuss how the RISC-V Hypervisor extension works and how it is different to other common architectures Hypervisor support. It will talk about how the extension was implemented in QEMU and problems that were identified with the draft specification in the process. Finally it will conclude with the current upstream status and any pending work related to both QEMU and the RISC-V Hypervisor specification in general, including current Hypervisor project porting status.

        We are also looking for feedback on existing issues in the RISC-V Hypervisor specification and possible solutions. This will help in making a more software friendly and robust specification.

        Speaker: Mr Alistair Francis
      • 39
        Taking RISC-V to the Datacenter

        What's it going to take to allow us to make the benefits of the RISC-V
        architecture available in centralized computing systems? Are there some
        things we need to be working on right now to pave the way for future
        success here? How can the state of the ARM architecture help us
        understand this problem?

        This presentation will explore the technical decisions made in designing
        a data-center scale ARM server. Then, highlight the technical and
        product differences between x86 and ARM systems, and show where RISC-V
        is heading in relation to those. Finally, describe a few places where
        focusing the RISC-V Linux architecture in a particular direction may
        help enable datacenter-class machines while still allowing the existing
        embedded Linux roadmap to succeed.

        Speaker: Keith Packard (SiFive)
      • 40
        RISCV NOMMU/M-Mode Linux

        This presentation will discuss the work ongoing to implement Linux kernel
        support for RISCV hardware lacking a memory management unit (MMU). A side effect
        of this work is also the ability to execute the kernel directly in M-Mode and
        how this is implemented while keeping most of the architecture code unmodified.
        The presentation will include examples of testing environment builds, discuss
        the support state of userspace toolchains and C libraries and will present the
        direct application of this work to a real hardware platform (Kendryte K210 SoC).

        Speaker: Damien Le Moal (Western Digital)
    • Tracing MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      The Linux Plumbers 2019 is pleased to welcome the Tracing microconference again this year. Tracing is once again picking up in activity. New and exciting topics are emerging.

      There is a broad list of ways to perform Tracing in Linux. From the original mainline Linux tracer, Ftrace, to profiling tools like perf, more complex customized tracing like BPF and out of tree tracers like LTTng, systemtap and Dtrace. Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

      Expected topics
      bpf tracing – Anything to do with BPF and tracing combined
      libtrace – Making libraries from our tools
      Packaging – Packaging these libraries
      babeltrace – Anything that we need to do to get all tracers talking to each other
      Those pesky tracepoints – How to get what we want from places where trace events are taboo
      Changing tracepoints – Without breaking userspace
      Function tracing – Modification of current implementation
      Rewriting of the Function Graph tracer – Can kretprobes and function graph tracer merge as one
      Histogram and synthetic tracepoints – Making a better interface that is more intuitive to use
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Steven Rostedt (rostedt@goodmis.org)

      • 41
        drgn: Programmable Debugging

        drgn (https://github.com/osandov/drgn) is a programmable debugger that makes it easy to introspect and debug state in the kernel. With drgn, it's possible to explore and analyze data structures with the full power of Python. See the LWN coverage of the presentation at LSF/MM: https://lwn.net/Articles/789641/. This presentation will demonstrate the capabilities of drgn, discuss future plans, and explore ways that the kernel and surrounding ecosystem can make introspection easier and more powerful.

        Speaker: Omar Sandoval
      • 42
        Kernel Boot Time Tracing

        Tracing kernel boot is useful when we chase a bug in device and machine initialization, boot performance issue etc. Ftrace already supports to enable basic tracing features in kernel cmdline. However, since the cmdline is very limited and too simple, it is hard to enable complex features which are recently introduced, e.g. multiple kprobe events, trigger actions, and event histogram.
        To solve this limitation, I introduce a boot time tracing feature on new structured kernel cmdline, which allows us to write complex tracing features in treed key-value style text file.
        In this talk, I would like to discuss how this solves the boot time tracing, and the syntax of tracing subsystem for this structured kernel cmdline.

        Speaker: Masami Hiramatsu (Linaro Ltd.)
      • 43
        Sharing PMU counters across compatible perf events

        Hardware PMU counters are limited resources. When there are more perf events than the available hardware counters, it is necessary to use time multiplexing, and the perf events could not run 100% of time.

        On the other hand, different perf events may measure the same metric, e.g., instructions. We call these perf events "compatible perf events". Technically, one hardware counter could serve multiple compatible events at the same time. However, current perf implementation doesn't allow compatible events to share hardware counters.

        There are efforts to enable sharing of compatible perf events. To the best of our knowledge, the latest attempt was https://lkml.org/lkml/2019/2/26/823. Unfortunately, we haven't make much progress on this front.

        At Facebook we are investing on user space sharing of compatible performance counters to reduce the need for time multiplexing and the cost of context switch when monitoring the same events in several threads and cgroups. A kernel solution would be preferable.

        In the Tracing MC, we would like to discuss how we can enable PMU sharing compatible perf events. This topic may open other discussions in perf subsystem. We think this would be a fun section.

        Speakers: Song Liu, David Carrillo Cisneros (Facebook)
      • 44
        A trace-cmd front end interface to ftrace histogram, triggers and synthetic events.

        Ftrace histograms, based on triggers and synthetic events were implemented few years ago by Tom Zanussi. They are very powerful instrument for analyzing the kernel internals, using ftrace events, but its user interface is very complex and hard to use. This proposal is to discuss possible ways to define more easy to use and intuitive interface to this feature, using trace-cmd application.

        Speaker: Tzvetomir Stoyanov
      • 11:30 AM
        Break
      • 45
        Unifying trace processing ecosystems with Babeltrace

        Babeltrace started out as the reference implementation of a Common
        Trace Format (CTF) reader. As the project evolved, many
        trace manipulation use-cases (merging, trimming, filtering,
        conversion, analysis, etc.) emerged and were implemented either
        as part of the Babeltrace project, on top of its APIs or through
        custom tools.

        Today, as more tracers emerged, each using their own trace format, the
        tracing ecosystem has become fragmented making tools exclusive to
        certain tracers. The newest version of Babeltrace aims at bridging
        the gap between the various tracing ecosystems by making it easy
        to implement trace processing tools over an agnostic trace IR.

        The discussion will aim at identifying the work needed to accommodate
        the various tracers and their associated tooling (scripts, graphical
        viewers, etc.) over the next releases.

        Speaker: Jérémie Galarneau (EfficiOS/LTTng/Babeltrace)
      • 46
        libtrace - making libraries of our tracing tools

        I would like to discuss how to implement a series of libraries for all the tracing tools that are out there, and have a repository that at least points to them. From libftrace, libperf, libdtrace to liblttng and libbabletrace.

        Speaker: Steven Rostedt
      • 47
        bpftrace

        bpftrace is a high level tracing language running on top of BPF: https://github.com/iovisor/bpftrace

        We'll talk about important updates from the past year, including improved tracing providers and new language features, and we'll also discuss future plans for the project.

        Speaker: Mr Alastair Robertson (Yellowbrick)
      • 48
        BPF Tracing Tools: New Observability for Performance Analysis

        Many new BPF tracing tools are about to be published, deepening our view of kernel internals on production systems. This session will summarize what has been done and what will be next with BPF tracing, discussing the challenges with taking kernel and application analysis further, and the potential kernel changes needed.

        Speaker: Brendan Gregg (Netflix)
    • Birds of a feather (BoF) Ametista/room-I (Corinthia Hotel Lisbon)

      Ametista/room-I

      Corinthia Hotel Lisbon

      50

      Our BoF session proposes topics as informal meeting during the conference. The topic lead (submitter) will drive the conversations on the area of interest described in each BoF.

      The attendees group together based on a shared interest and carry out discussions without any pre-planned agenda.

      • 49
        Kernel Debugging Tools

        For many years developers have leveraged gdb or crash to look at kernel crash dumps on linux. Although those tools have served us well, it can sometimes be difficult to navigate the crash dump to find the information you really need. In this talk, we would like to present some new tools that make it easier to debug kernel crash dumps and enhance kernel developer's ability to root causes problems the first time they happen. We will present information about the following tools:

        • crash-python
        • drgn
        • sdb

        In addition, we're looking for interested members to join the kernel debugging community to continue to build on these tools, provide feedback, and help generate ideas on how we can make kernel crash dump debugging simpler.

        Speakers: George Wilson (Delphix), Omar Sandoval, Serapheim Dimitropoulos
      • 50
        Wayland

        Wayland is getting close to being ready for day 2 day generic desktop use, close but there still are many small issues to tackle, see e.g. :
        https://hansdegoede.livejournal.com/21944.html
        https://hansdegoede.livejournal.com/22212.html

        The purpose of this microconference is to get people together to discuss the various open issues, try to come up with solutions for some of them and possibly implement some of them.

        Expected audience

        Anyone who is present at plumbers and is interested in furthering Wayland support.

        Expected Topics:
        -Discussion about allowing apps run by other users to connect through Wayland, e.g. apps run by sudo
        -Should apps (games) be able to change the monitor resolution, should this be a Wayland protocol extension or a portal
        -Getting the compositor out of the way for fullscreen games (unredirect support)
        -Unified API for monitor configuration à la xrandr to allow commandline configuration of monitor settings?
        -More to be added based on CfP for this microconference

        Possible speakers/participants which I know plan to be present at plumbers are Alberto Ruiz, Benjamin Berg, Christian Kellner and me.

        I also expect Benjamin and or Christian to be willing to co-host the Microconf with me, but I still need to ask them.

        Speaker: Hans de Goede (Red Hat)
      • 4:30 PM
        Break
      • 51
        Having one, unified eBPF network packet filter, no more, no less.

        For long time, The kernel have contained two mechanisms with similar packet filtering functionality: tc filter (with chains) and iptables/nftables.

        As eBPF is starting to take over, once again we seem to have two mechanisms with similar functionality: BPFilter and the newly suggested OVS-eBPF datapath (on top on tc).

        As we move to using eBPF, I'd like to discuss the possibility of uniting those two functionalities, both the BPFilter and OVS-eBPF path, into a single one and let go of all the duplicate code.

      • 52
        Upstream kernel CI

        Testing the upstream kernel is not an easy task. The burden is
        still largely put on developers, although several projects are
        now covering parts of it such as 0-day, LKFT, CKI, Coccinelle,
        syzkaller and kernelci.org. While they all tend to have their
        own speciality, they also face a lot of similar challenges.

        This BoF is to give an opportunity to exchange ideas and bring
        together people from the upstream kernel testing community. Are
        there ways to share kernel builds, platforms or code between
        projects to remove duplication of efforts? Which open tools are
        you using, and is there a need for anything new? Which areas of
        the kernel are suffering the most from a lack of test coverage?
        How does one even power up a dev board in a lab?

        Tackling these problems requires a lot of energy. Last but not
        least and thanks to Collabora, attendees will be offered food
        (if the venue permits it)!

        Speaker: Guillaume Tucker (Collabora Limited)
    • Scheduler MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      The Linux Plumbers 2019 Scheduler Microconference is about all scheduler topics, which are not Realtime

      Potential topics:
      - Load Balancer Rework - prototype
      - Idle Balance optimizations
      - Flattening the group scheduling hierrachy
      - Core scheduling
      - Proxy Execution for CFS
      - Improving scheduling latency with SCHED_IDLE task
      - Scheduler tunables - Mobile vs Server
      - nohz
      - LISA for scheduler verification

      We plan to continue the discussions that started at OSPM in May'19 and get a wider audience outside the core scheduler developers at LPC.

      Potential attendees:
      Juri Lelli
      Vincent Guittot
      Subhra Mazumdar
      Daniel Bristot
      Dhaval Giani
      PeterZ
      Paul Turner
      Rik van Riel
      Patrick Bellasi
      Morten Rasmussen
      Dietmar Eggman
      Steven Rostedt
      Thomas Gleixner
      Viresh Kumar
      Phil Auld
      Waiman Long
      Josef Bacik
      Joel Fernandes
      Paul McKenney
      Alessio Balsini
      Frederic Weisbecker

      This microconference is picking scheduler topics which are not RT, but this should take place either immediately before or after that MC.

      MC leads:
      Juri Lelli juri.lelli@redhat.com, Vincent Guittot vincent.guittot@linaro.org, Daniel Bristot de Oliveira bristot@redhat.com, Subhra Mazumdar subhra.mazumdar@oracle.com, Dhaval Giani dhaval.giani@gmail.com

      • 53
        Core scheduling

        There have been two different approaches proposposed on the LKML over the past year on core scheduling. One was the coscheduling approach by Jan Schönherr, originally posted at https://lkml.org/lkml/2018/9/7/1521 and the next version posted at https://lkml.org/lkml/2018/10/19/859

        Upstream chose a different route and decided to modify CFS, and only do "core-scheduling". Vineeth picked up the patches from Peter Zijlstra. This is a discussion on how we can further that work, especially when there are security implications such as L1TF and MDS, which make important this work to go upstream.

        Aubrey Li will talk about Core scheduling: Fixing when fast instructions go slow

        Keeping system utilization high is important both to keep costs down and to keep energy efficiency up. That often means tightly packing compute jobs and using the latest processor features. However, these approaches can be at odds when a new processor feature like AVX512 is used. The performance of latency critical jobs can be reduced by 10% if co-located with deep learning training jobs. These jobs use AVX512 instructions to accelerate wide vector operations. Whenever a core executes AVX512 instructions, the core automatically reduces its frequency. This can lead to a significant overall performance loss for a non-AVX512 job on the same core. In this presentation, we will discuss how to preserve performance while still allowing AVX512-based acceleration.

        AVX512 task detection
        - From user space, PMU events can be used but it's expensive.
        - In the kernel, I proposed to expose process AVX512 usage elapsed time as a heuristic hint.
        - Discuss an interface for tasks in cgroup.

        AVX512 task isolation
        - Discuss kernel space solution, if the recent proposal core scheduling can be leveraged for isolation.
        - Discuss user space solution, if user space job scheduler is better than kernel scheduler

        Speakers: Mr Aubrey Li, Jan Schönherr, Hugo Reis, Vineeth Remanan Pillai
      • 54
        Proxy Execution

        Proxy execution can be considered as a generalization of the real-time priority inheritance mechanism. With proxy execution a task can run using the context of some other task that is "willing" to let the first task run as this improves performace for both. With this topic I'd like to detail about progress that has been made after the initial RFC posting on LKML and discuss about open problems and questions.

        Speaker: Juri Lelli (Red Hat)
      • 55
        Making SCHED_DEADLINE safe for kernel kthreads

        Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods, not millisecond, not seconds, not minutes, not even hours, but days. Given that RCU CPU stall warnings are issued whenever an RCU grace period fails to complete within a few tens of seconds, the system did not suffer silently. Although one could argue that people should avoid abusing sched_setattr(), people are human and humans make mistakes. Responding to simple mistakes with RCU CPU stall warnings is all well and good, but a more severe case could OOM the system, which is a particularly unhelpful error message.

        It would be better if the system were capable of operating reasonably despite such abuse. Several approaches have been suggested.

        First, sched_setattr() could recognize parameter settings that put kthreads at risk and refuse to honor those settings. This approach of course requires that we identify precisely what combinations of sched_setattr() parameters settings are risky, especially given that there are likely to be parameter settings that are both risky and highly useful.

        Second, in theory, RCU could detect this situation and take the "dueling banjos" approach of increasing its priority as needed to get the CPU time that its kthreads need to operate correctly. However, the required amount of CPU time can vary greatly depending on the workload. Furthermore, non-RCU kthreads also need some amount of CPU time, and replicating "dueling banjos" across all such Linux-kernel subsystems seems both wasteful and error-prone. Finally, experience has shown that setting RCU's kthreads to real-time priorities significantly harms performance by increasing context-switch rates.

        Third, stress testing could be limited to non-risky regimes, such that kthreads get CPU time every 5-40 seconds, depending on configuration and experience. People needing risky parameter settings could then test the settings that they actually need, and also take responsibility for ensuring that kthreads get the CPU time that they need. (This of course includes per-CPU kthreads!)

        Fourth, bandwidth throttling could treat tasks in other scheduling classes as an aggregate group having a reasonable aggregate deadline and CPU budget. This has the advantage of allowing "abusive" testing to proceed, which allows people requiring risky parameter settings to rely on this testing. Additionally, it avoids complex progress checking and priority setting on the part of many kthreads throughout the system. However, if this was an easy choice, the SCHED_DEADLINE developers would likely have selected it. For example, it is necessary to determine what might be a "reasonable" aggregate deadline and CPU budget. Reserving 5% seems quite generous, and RCU's grace-period kthread would optimally like a deadline in the milliseconds, but would do reasonably well with many tens of milliseconds, and absolutely needs a few seconds. However, for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might well need a full CPU each! (This happens when the CPU being offloaded generates a high rate of callbacks.)

        The goal of this proposal is therefore to generate face-to-face discussion, hopefully resulting in a good and sufficient solution to this problem.

        Speaker: Paul McKenney (IBM Linux Technology Center)
      • 4:30 PM
        Break

        Coffee, Tea and Snacks

      • 56
        CFS load balance rework

        The cfs load_balance has became more and more complex over the years and has reached the point where policy can't be explained sometimes. Furthermore, available metrics have evolved and load balance doesn't always take full advantage of it to calculate the imbalance. It's probably the good time to do a rework of the load balance code as proposed in this patchset:
        https://lkml.org/lkml/2019/7/19/594
        In addition to this patchset , we could discuss the next evolution that could be done on the load_balance

        Speaker: Vincent Guittot (Linaro)
      • 57
        flattening the hierarchy discussion

        There is a presentation in the refereed track on flattening the CPU controller runqueue hierarchy, but it may be useful to have a discussion on the same topic in the scheduler microconference.

        Speaker: Rik van Riel (Facebook)
      • 58
        Scheduler domains and cache bandwidth

        The Linux Kernel scheduler represents a system's topology by the means of
        scheduler domains. In the common case, these domains map to the cache topology
        of the system.

        The Cavium ThunderX is an ARMv8-A 2-node NUMA system, each node containing
        48 CPUs (no hyperthreading). Each CPU has its own L1 cache, and CPUs within
        the same node will share a same L2 cache.

        Running some memory-intensive tasks on this system shows that, within a
        given NUMA node, there are "socklets" of CPUs. Executing those tasks
        (which involve the L2 cache) on CPUs of the same "socklet" leads to a reduction
        of per-task memory bandwidth.
        On the other hand, running those same tasks on CPUs of different "socklets"
        (but still within the same node) does not lead to such a memory bandwidth
        reduction.

        While not truly equivalent to sub-NUMA clustering, such a system could benefit
        from a more fragmented scheduler domain representation, i.e. grouping these
        "socklets" in different domains.

        This talk will be an opportunity to discuss ways for the scheduler to leverage
        this topology characteristic and potentially change the way scheduler domains
        are built.

        Speaker: Valentin Schneider (Arm Ltd)
      • 59
        TurboSched: Core capacity Computation and other challenges

        Turbosched is a proposed scheduler enhancement that aims to sustain turbo frequencies for a longer duration by explicitly marking small tasks that are known to be jitters and pack them on a smaller number of cores. This ensures that the other cores will remain idle, and the energy thus saved can be used by CPU intensive tasks for sustaining higher frequencies for a longer duration.

        The current TurboSched RFCv4 (https://lkml.org/lkml/2019/7/25/296) has some challenges:

        • Core Capacity Computation: Spare core capacity defines the upper bound for task packing above which jitter tasks should not be packed further into a core, else it hurts the performance of the other tasks running on that core. To achieve this we need a mechanism to compute the capacity of the cores in terms of its active SMT threads. But the computation of CPU Capacity itself if arguable and non-reliable in case of CPU hotplug events. This makes the TurboSched to have unexpected behavior in case of hotplugs or in presence of asymmetric CPU capacities. The discussion also involves the use of other parameters like nr_running with utilization to decide upper bound for task packing.

        • Interface: There are multiple approaches to mark a small-task as a jitter. A cgroup based approach is favorable to the distros as it is a well-understood interface requiring minimal modification for the existing tools. However, the kernel community has expressed objection to this interface since whether a task is jitter or not is a task-attribute and not a task-group attribute. Further, a task being a jitter is not a resource-partition problem, which is what cgroup aims to solve. The other approach would be to define this via a sched_attribute which can be updated via an existing syscall. Finally, we can support both the approaches as discussed on LWN https://lwn.net/Articles/792471/

        • Limiting the Search Domain for packing: On systems with a large number of CPUs, searching all the CPUs where the small-tasks should be packed can be expensive in the task-wakeup path. Hence we should limit the
          domain of CPUs over which the search is conducted. In the current implementation, TurboSched uses the DIE domain to pack tasks on PowerPC, but certain architectures might prefer the LLC or the NUMA domains. Thus we need to discuss a unified way of describing the search domain which can work across all architectures.

        This topic is a continuation from the OSPM talk and aims to mitigate these problems generic across architectures.

        Speaker: Parth Shah
      • 60
        Task latency-nice

        Currently there is no user control on how much time scheduler should spend searching for CPUs when scheduling a task. It is hardcoded logic based on some heuristics that doesn't work well in many cases. e.g. very short running tasks. Provide a new latency-nice property user can set for a task (similar to nice value) that controls the search time and also potentially the preemption logic. Also discuss best interfaces to have this (potentially Cgroups).

        Speaker: Subhra Mazumdar
    • VFIO/IOMMU/PCI MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high performance systems (eg RDMA, peer-to-peer, CCIX, PCI ATS (Address Translation Service)/PRI(Page Request Interface), enabling Shared Virtual Addressing (SVA) between devices and CPUs), that require the kernel to coordinate the PCI devices, the IOMMUs they are connected to and the VFIO layer used to managed them (for userspace access and device passthrough) with related kernel interfaces that have to be designed in-sync for all three subsystems.

      The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.

      Following up the successful LPC 2017 VFIO/IOMMU/PCI microconference, the Linux Plumbers 2019 VFIO/IOMMU/PCI track will therefore focus on promoting discussions on the current kernel patches aimed at VFIO/IOMMU/PCI subsystems with specific sessions targeting discussion for kernel patches that enable technology (eg device/sub-device assignment, peer-to-peer PCI, IOMMU enhancements) requiring the three subsystems coordination; the microconference will also cover VFIO/IOMMU/PCI subsystem specific tracks to debate patches status for the respective subsystems plumbing.

      Tentative topics for discussion:

      VFIO
      Shared Virtual Addressing (SVA) interface
      SRIOV/PASID integration
      Device assignment/sub-assignment
      IOMMU
      IOMMU drivers SVA interface consolidation
      IOMMUs virtualization
      IOMMU-API enhancements for mediated devices/SVA
      Possible IOMMU core changes (like splitting up iommu_ops, better integration with device-driver core)
      DMA-API layer interactions and how to get towards generic dma-ops for IOMMU drivers
      PCI
      Resources claiming/assignment consolidation
      Peer-to-Peer
      PCI error management
      PCI endpoint subsystem
      prefetchable vs non-prefetchable BAR address mappings (cacheability)
      Kernel NoSnoop TLP attribute handling
      CCIX and accelerators management
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Bjorn Helgaas bjorn@helgaas.com, Lorenzo Pieralisi lorenzo.pieralisi@arm.com, Joerg Roedel joro@8bytes.org, and Alex Williamson alex.williamson@redhat.com

      • 61
        User interfaces for per-group default domain type

        This topic will discuss 1) why do we need per-group default domain type, 2) how it solves the problems in the real IOMMU driver, and 3) the user interfaces.

        Speaker: Baolu Lu
      • 3:25 PM
        VFIO/IOMMU/PCI speaker change
      • 62
        Status of Dual Stage SMMUv3 integration

        Since August 2018 I have been working on SMMUv3 nested stage integration
        at IOMMU/VFIO levels, to allow virtual SMMUv3/VFIO integration.

        This shares some APIs with the Intel and ARM SVA series (cache invalidation,
        fault reporting) but also introduces some specific ones to pass information
        about guest stage 1 configuration and MSI bindings.

        In this session I would like to discuss the upstream status and get a chance
        to clarify open points. This is also an opportunity to synchronize about the VFIO fault reporting requirements for recoverable errors.

        Speaker: Eric Auger (Red Hat)
      • 3:55 PM
        VFIO/IOMMU/PCI speaker change
      • 63
        PASID Management in Linux

        PASID (Process Address Space ID) is a PCIe capability that enables sharing of a single device across multiple isolated address domains. It has been becoming a hot term in I/O technology evolution. e.g. it is foundation of SVM and SIOV. Combined with the usages of PASID and the configuration difference due to architecture difference across vendors, it brings an interesting topic on PASID management in Linux. Especially regards to software complexity for VM live migration support in cloud. This talk will review the PASID usages and configuration methods, then elaborate the gaps for PASID management. Finally propose a solution and start talks with peers.

        Speaker: Mr Pan Jacob (Intel)
      • 4:25 PM
        VFIO/IOMMU/PCI speaker change
      • 64
        Architecture considerations for vfio/iommu handling

        While x86 is probably the most prominent platform for vfio/iommu development and usage, other architectures also see quite a bit of movement. These architectures are similar to x86 in some parts and quite different in others; therefore, sometimes issues come up that may be surprising to folks mostly working on more common platforms.

        For example, PCI on s390 is using special instructions. QEMU needs to fill in 'real' values for some memory-layout values for devices passed via vfio and needs a way to retrieve them.

        Other architectures (e.g. ARM) may also have some unusual requirements not obvious to people not working on those platforms. It seems beneficial to at least raise awareness of those issues so that we don't end up with interfaces/designs that are hard to implement or not sufficient on less common platforms.

        Speaker: Cornelia Huck
      • 4:45 PM
        VFIO/IOMMU/PCI main break
      • 65
        Optional or reduced PCI BARs

        Modern PCI graphics devices may contain several gigabytes of memory mapped in its BAR. This trend is continuing into storage with NVMe devices containing large Controller Memory Buffers and Persistent Memory Regions.

        Some PCI hierarchies are resource constrained and cannot fit as many devices as desired. In NVMe's case, it's preferable to enumerate and attach all devices rather than use the entire memory window for one or two devices with large, optional BARs.

        Current PCI core architecture will prevent a PCI device from being enabled if any of the BARs are unset. This proposal is about a way to hint at the PCI layer that some BARs are optional and could be omitted or reduced (by limiting it at the bridge window) in order to keep such devices enabled.

        Speaker: Jonathan Derrick
      • 5:30 PM
        VFIO/IOMMU/PCI speaker change
      • 66
        PCI Resources assignment policies

        This is meant to be a rather open discussion on PCI resource assignment policies. I plan to discuss a bit what the different arch/platforms do today, how I've tried to consolidate it, then we can debate the pro/cons of the different approaches and decide where to go from there.

        Speaker: Benjamin Herrenschmidt (Amazon AWS)
      • 6:00 PM
        VFIO/IOMMU/PCI speaker change
      • 67
        Implementing NTB controller using PCIe endpoint

        A PCI-Express non-transparent bridge (NTB) is a point-to-point PCIe bus
        connecting 2 host systems. NTB functionality can be achieved in a platform
        having 2 endpoint instances. Here each of the endpoint instance will be
        connected to an independent host and the hosts can communicate with each other
        using endpoint as a bridge. The endpoint framework and the "new" NTB EP
        function driver should configure the endpoint instances in such a way that the
        transactions from one endpoint is routed to the other endpoint instance. The
        host will see the connected endpoint as an NTB port and the existing NTB tools
        (ntb_pingpong, ntb_perf) in Linux kernel could be used.

        Speaker: Mr Kishon Vijay Abraham I
      • 6:20 PM
        VFIO/IOMMU/PCI speaker change
      • 68
        Use IOMMU to prevent DMA attacks from Thunderbolt devices

        The Thunderbolt vulnerabilities are public and have a nice name as Thunderclap (https://thunderclap.io/) nowadays. This topic will introduce what kind of vulnerabilities we have identified with Linux and how we are fixing them.

        Speaker: Baolu Lu
    • You, Me, and IoT MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      Live session notes are on Etherpad:
      https://etherpad.net/p/LPC2019_IoT

      The Internet of Things (IoT) has been growing at an incredible pace as of late.

      Some IoT application frameworks expose a model-based view of endpoints, such as

      on-off switches
      dimmable switches
      temperature controls
      door and window sensors
      metering
      cameras
      

      Other IoT application frameworks provide direct device access, by creating real and virtual device pairs that communicate over the network. In those cases, writing to the virtual /dev node on a client affects the real /dev node on the server. Examples are

      GPIO (/dev/gpiochipN)
      I2C (/dev/i2cN)
      SPI (/dev/spiN)
      UART (/dev/ttySN)
      

      Interoperability (e.g. ZigBee to Thread) has been a large focus of many vendors due to the surge in popularity of voice-recognition in smart devices and the markets that they are driving. Corporate heavyweights are in full force in those economies. OpenHAB, on the other hand, has become relatively mature as a technology and vendor agnostic open-source front-end for interacting with multiple different IoT frameworks.

      The Linux Foundation has made excellent progress bringing together the business community around the Zephyr RTOS, although there are also plenty of other open-source RTOS solutions available. The linux-wpan developers have brought 6LowPan to the community, which works over 802.15.4 and Bluetooth, and that has paved the way for Thread, LoRa, and others. However, some closed or quasi-closed standards must rely on bridging techniques mainly due to license incompatibility. For that reason, it is helpful for the kernel community to preemptively start working on application layer frameworks and bridges, both community-driven and business-driven.

      For completely open-source implementations, experimental results have shown results with Greybus, with a significant amount of code already in staging. The immediate benefits to the community in that case are clear. There are a variety of key subjects below the application layer that come into play for Greybus and other frameworks that are actively under development, such as

      Device Management
      are devices abstracted through an API or is a virtual /dev node provided?
      unique ID / management of possibly many virtual /dev nodes and connection info
      Network Management
      standards are nice (e.g. 802.15.4) and help to streamline in-tree support
      non-standard tech best to keep out of tree?
      userspace utilities beyond command-line (e.g. NetworkManager, NetLink extensions)
      Network Authentication
      re-use machinery for e.g. 802.11 / 802.15.4 ?
      generic approach for other MAC layers ?
      Encryption
      in userspace via e.g. SSL, /dev/crypto
      Firmware Updates
      generally different protocol for each IoT framework / application layer
      Linux solutions should re-use components e.g. SWUpdate
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      This Microconference will be a meeting ground for industry and hobbyist contributors alike and promises to shed some light on the what is yet to come. There might even be a sneak peak at some new OSHW IoT developer kits.

      The hope is that some of the more experienced maintainers in linux-wpan, LoRa and OpenHAB can provide feedback and suggestions for those who are actively developing open-source IoT frameworks, protocols, and hardware.

      MC leads
      Christopher Friedt chris@friedt.co, Jason Kridner jkridner@beagleboard.org, and Drew Fustini drew@beagleboard.org

      • 69
        Greybus for IoT

        Greybus is an RPC like protocol on top UniPro bus that has been designed for the Project ARA. This goal of that project was to develop a modular smartphone. Greybus gives the ability to the host to control remotely the buses (such as i2c or spi) of the modules.
        Although Project ARA has been aborted, Greybus has been merged to Linux kernel, and it is still maintained by the community.
        Greybus has been designed for modular smartphones, but there are many others pertinents use cases for it:
        - IoT, to let a Linux base station directly control sensors, and avoid writing complex firmware for the modules
        - USB, to control peripherals on the board using existing Linux drivers
        - To control system-on-chip hardware peripherals managed by a small core, with messages sent from a larger CPU This approach would be more generic that writing a custom protocol on top of RPMSG
        The intent of this talk is to briefly present Greybus, how we could use it for general purpose, and talk about the work in progress, that would make it possible.

        Speaker: Mr Alexandre Bailon (BayLibre)
      • 70
        Over the Air (OTA) Updates: State of the Union? Democratize?

        IoT applications, be they Autonomous Cars [1] or Health Care or Smart Home or Factory Automation, the IoT devices (sensors and actuators), gateways, and cloud/datacenter endpoints need software and/or firmware updates, to fix security issues, patch bugs, and/or release new features. IoT with its numerous remote devices and gateways presents a large attack surface, making the application of security patches as they become available especially important. Let us review key OTA Update requirements, available open source solutions, and how to ease adoption through the introduction of a standard API, one that abstracts the complexities and trade-offs. The underlying implementation would be selected based on the application needs and where in IoT architectural stack a given node lies (device/edge/cloud).

        Most of us are familiar with OTA in the context of our mobile phones. A large proportion of Tesla’s success and customer confidence stems from its OTA update support [2], for example a braking distance issue fix, accepting only signed updates, and rolling out new features.

        What are key OTA requirements [3]?
        1) Ability to upgrade the bootloader, kernel, root filesystem, firmware, applications, device specific data.
        2) Robust - Never “brick” the device.
        3) Atomic - success or failure, nothing partial with undefined behavior.
        4) Automated – not requiring human interaction during the process
        5) Auditable – logs – what got updated
        6) Preserve user data (customizations etc)
        7) Signed, accepting updates only from trusted sources.
        8) Secure communication channel.

        Note: We shall drop the bootloader in item 1 because it is a transient power-on, process is rarely a source of runtime bugs.

        What are OTA implementation considerations [4]?
        Inline or Shadow Partition?
        OTA is not easy and there are many implementation options with their respective trade-offs. Should the update happen in-place or use a shadow partition to copy over? What size should the partition be? The shadow partition approach is certainly safer just in case there is a power glitch at either the recipient or server node, or a network glitch or some other error condition that could corrupt an in-place update. Roll-back in case of a corrupted update is easy with shadow partitions because the original boot image is intact.
        Block-based or file update?
        The former is a complete image, easy to verify with a version number and hash signature, making it simpler to manage across a fleet of devices. The latter is more concise but should issues arise in the application of the patch, the system could become inoperable.
        Trusted Source?
        Can the update payload be trusted? Is it signed by a trusted entity? This requires the nodes have the public key and certificate of the trusted entity.
        Where can updates be obtained?
        Perhaps a vendor specific site. Possibly even a public site if open source.
        Update Push or Pull? Frequency?
        Should nodes poll for updates or should a management application push updates to nodes?
        Secure transmission?
        Is the transmission channel secured using TLS/HTTPS or over a VPN?

        What open source projects address OTA?
        The projects below vary in their robustness, network bandwidth needs, and the types of hardware they support.

        OSTree [6,7]
        Provides a git like approach to version control for Linux operating systems that does an atomic complete filesystem update. The userspace solution can operate either standalone or be layered with a package manager for a hybrid solution.

        Balena.io [8]
        balenaOS Yocto Linux based host OS that comes packaged with balenaEngine, a lightweight docker-compatible container engine. A device supervisor runs in its own container and allows pulling new code even if the application code crashes.

        SWUpdate [9,10]
        SWUpdate is a Linux Update agent with the goal to provide an efficient and safe way to update an embedded system. SWUpdate supports local and remote updates, multiple update strategies and it can be well integrated in the Yocto build system by adding the meta-swupdate layer. Supports updating FPGAs and Microcontrollers.

        Swupd [11]
        swupd is an operating system software manager and update program that operates at a file-level to enable verifiable integrity and update efficiency.

        Mender.io [12]
        An open source update manager for embedded devices based on the client-server model with security and robustness.

        How can we make OTA Update Easier?

        Linux is the King/Queen of IoT, be it on small form factor highly resource constrained devices or on server class gateways. The OTA implementation depends upon the node, whose selection depends upon the demands of the use case. What if we could abstract away the nuances of the implementation and ease consumption, along the lines of Libvirt [13] for virtual machines that abstracts away for Cloud orchestrators machine architecture (ARM, X86) and hypervisor implementation (KVM/Xen/ESXi/HyperV/ACRN). What if we introduce “update” akin to reboot, with configuration and action sub-commands?

        update config source <source-url>
        update config key  <add|delete|update> <name> <public-key>
        update config schedule <monthly|hourly|minute> <integer>
        update config log <log path> // defaults to syslog
        update config verify [true|false] // verifies signature publickey
        update config boot-retry-limit <integer> 
        update config secure [true|false]  // mutual authentication [14]
        update [--secure [true|false]] [--verify [true|false]] [--source <source-url>] [--now] [--noreboot]
           // values specified override config settings
           // typically reboot after update
        

        Should no update implementation exist, these methods should gracefully fail reporting an error to the default log location. An update implementation when installed overwrites the default methods, which typically report “Not implemented. Consider installing X, Y or Z”

        Future Enhancements:
        We defer for the future supporting more secure WiFi access for IoT and OTA such as wpa3 [15]. Also in this vein is use of secure storage media such as self-encrypting-drives and read-only memory [16].

        References:
        1. https://www.slideshare.net/leonanavi/software-over-the-air-sota-for-automotive-grade-linux-agl
        2. https://electrek.co/2017/07/17/tesla-fleet-hack-elon-musk/
        3. https://mender.io/learn/whitepapers/_resources/Software%20Updates.pdf
        4. https://www.embedded.com/design/operating-systems/4461019/OTA-updates-for-Embedded-Linux--part-1-----Fundamentals-and-implementation
        5. https://elinux.org/Secure_OTA_Update
        6. https://ostree.readthedocs.io/en/latest/manual/introduction/
        7. https://samthursfield.wordpress.com/2014/01/08/os-level-version-control/
        8. https://www.balena.io/what-is-balena/
        9. https://github.com/sbabic/swupdate
        10. http://events17.linuxfoundation.org/sites/events/files/slides/ELC2017_SWUpdate.pdf
        11. https://clearlinux.org/documentation/clear-linux/concepts/swupd-about
        12. https://mender.io
        13. https://libvirt.org
        14. https://searchsecurity.techtarget.com/definition/mutual-authentication
        15. https://www.linux.com/news/wpa3-how-and-why-wi-fi-standard-matters
        16. https://openiotelcna2017.sched.com/event/9J5i/surviving-in-the-wilderness-integrity-protection-and-system-update-patrick-ohly-intel-gmbh

        Speaker: Dr Malini Bhandaru (VMware)
      • 71
        Implementing LoRa, FSK and further LPWAN interfaces

        This talk will give an overview of LoRa and related wireless technologies and their role in IoT infrastructure. An initial RFC for a socket interface had been submitted last summer as proof of concept - a linux-lora.git staging tree and linux-lpwan mailing list have been in use for collaboratively iterating on patches towards a mergeable proposal. Open topics include abandoning PF_LORA in favor of PF_PACKET and how to layer PF_LORAWAN on top of LoRa and FSK; on the driver side the LoRa gateway chipset SX1301/SX1308 has run into problems with clk/spi/reset, and no solution for expanding from DT to ACPI has been found yet; adding protocol families and testing them on a large range of devices has not been easy, and while many of these wireless technologies share design principles they have so far been unable to share any code on the PHY layer. 6LoWPAN and SCHC are candidates for higher-level soft-MACs. 3D-UNB is one of multiple candidates for getting similar treatment to LoRa.

        Speaker: Mr Andreas Färber (SUSE)
      • 4:30 PM
        Break
      • 72
        IoT from the point of view of view of a generic and enterprise distribution

        Having been focused on IoT in Fedora for Red Hat for 3 years and the wider Arm and embedded ecosystem for a lot longer and dealing with customers that are looking to prototype large scale IoT deployments for a range of use cases while using a distribution similar to what they use in their data centre but with IoT use cases, increased security I have a bunch of war wounds and ideas about the things that work, the things that need work and the things that don't work.

        The core pieces are there but there's bits missing or are incomplete, covering gpio and sensors, bluetooth and various wireless technologies through to security such as secure boot, TPM2s and IMA what are the technologies that users and customers are asking for and how can they be improved in Linux to make it easier for generic but IoT focused distros that need to address wide use cases in as generic a means as possible?

        This talk will cover the technologies being used and what makes it hard for end users to consume them in order to aide discussion. How we can take things that in some cases are developed on a single device running a single variant of Linux and how we can improve the overall ecosystem on Linux.

        Speaker: Peter Robinson (Red Hat)
      • 73
        The ieee802154 and 6lowpan Kernel Subsystems

        This talk will put the spotlight on the linux-wpan project, which brings IEEE 802.15.4 and 6LoWPAN support to the Linux Kernel. Designed for low-power devices these protocols are ideal for the use in some IoT applications. Over the last years IEEE 802.15.4 support has slowly found its way into the mainline kernel. The 6LoWPAN code is shared with the Bluetooth stack and the ieee802154 subsystem itself is growing in functionality.

        The talk will give an overview of the implemented functionality in the ieee802154 and 6lowpan subsystems and their use from userspace for the data (socket) and control (netlink) planes. It will describe the current hardware support, header compression techniques used in 6lowpan and areas where the stack is currently limited. We will close with a comparison of linux-wpan against other IEEE 802.15.4 stacks (Zephyr, RIOT, OpenThread).

        Speaker: Mr Stefan Schmidt
      • 74
        Using Greybus, mikroBus and PocketBeagle to consolidate kernel IoT sensor/actuator development

        Many "drivers" for IoT sensors and actuators live outside kernel space through efforts that seek to provide abstractions not sufficiently handled in the kernel today. This is resulting in great code fragmentation that can be resolved by better understanding the developer needs and communicating an achievable collaborative approach. Pushing the interface to these devices off to userspace is not the Linux-way.

        We'll look at the problems projects like MRAA/UPM, Adafruit_Blinka and numerous other projects from IoT tooling and breakout board providers are seeking to solve outside the kernel. These include providing libraries that support a very broad array of sensors, that help build understanding of the sensors themselves, make it easy to augment sensor parameters, and, at least for Adafruit_Blinka, include running the same interface code on microcontrollers.
        Using these userspace libraries also aid in rapid prototyping by avoiding the step of configuring the kernel to use these sensors on non-probable busses.

        Using Greybus, it is possible to, in a more flexible and secure way than device tree overlays, add IoT sensors in a rapid-prototyping fashion. See

        Using mikroBus, it is possible to collaborate across a large number of embedded Linux development platforms across a large number of IoT sensors and actuators. This is at least partially thanks to the almost 700 different available Click boards and a number of add-on daughter boards for embedded Linux development boards to interface to them.

        Speakers: Jason Kridner (BeagleBoard.org), Drew Fustini (OSH Park)
    • 75
      Welcome Reception Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

      Sete/Colinas-Restaurant

      Corinthia Hotel Lisbon

      20

      Sete Colinas Terrace

      19:00-21:00

      Same location as Lunch.

    • Birds of a feather (BoF) Ametista/room-I (Corinthia Hotel Lisbon)

      Ametista/room-I

      Corinthia Hotel Lisbon

      50

      Our BoF session proposes topics as informal meeting during the conference. The topic lead (submitter) will drive the conversations on the area of interest described in each BoF.

      The attendees group together based on a shared interest and carry out discussions without any pre-planned agenda.

      • 76
        Improving Buffered I/O

        What I'd like to get to is to discuss that buffered IO basically sucks for databases with high throughput, and direct IO sucks for databases that aren't individually well tuned, and is not adaptive to memory pressure at all.

        Buffered IO is slow, until recently only synchronous, has double buffering issues and writeback is hard to control.

        Direct IO requires that the application's equivalent of the page-cache is well tuned for the workload - but most installations don't have DBAs to do so, and in a lot of environments it's unrealistic to give all databases the peak required memory. In contrast to that the kernel page-cache adapts reasonably to changing workloads, caching data for the applications that need it most.

        Input both from the developers of other databases and from the kernel side would be very welcome.

      • 11:30 AM
        Break
      • 77
        Linux Perf advancements for compute intensive and server systems

        Modern server and compute intesive systems are naturally built around several top performance CPUs with large amount of cores and equipped by shared memory that spans a number of NUMA domains. Compute intensive workloads usually implement highly parallel CPU bound cyclic codes performing mathematics calculations that reference data located in the shared memory. Performance observability and profiling of these workloads on such systems have unique characteristics and impose specific requirements on software performance tools. The requirements include tools CPU scalability, coping with high rate and volume of collected performance data as well as NUMA awareness. In order to fulfill that requirements a number of extensions have been implemented in Linux Perf tool that are currently a part of the Linux kernel source tree:
        https://marc.info/?l=linux-kernel&m=154149439404555&w=2,
        https://marc.info/?l=linux-kernel&m=154149439404555&w=2,
        https://marc.info/?l=linux-kernel&m=155293062518459&w=2 .

        Speaker: Alexey Budankov
      • 78
        Tracing MC follow-up BoF

        Follow up on the tracing microconference

        Topics to be discussed:
        - Perf related events
        - Histogram sql syntaxes

    • Kernel Summit Track Floriana/room-III (Corinthia Hotel Lisbon)

      Floriana/room-III

      Corinthia Hotel Lisbon

      100

      This year, the Maintainer's and Kernel Summit will be at the Corinthia Hotel in Lisbon, Portugal, September 9th -- 12th. The Kernel Summit will be held as a track during the Linux Plumbers Conference September 9th -- 11th. The Maintainer's Summit will be held afterwards, on September 12th. As in previous years, the "Maintainer's Summit" is an invite-only, half-day event, where the primary focus will be process issues around Linux Kernel Development.

      The "Kernel Summit" is organized as a track which is run in parallel with the other tracks at the Linux Plumber's Conference (LPC), and is open to all registered attendees of LPC. The goal of the Kernel Summit track will be to provide a forum to discuss specific technical issues that would be easier to resolve in person than over e-mail.

      We will reserving roughly some Kernel Summit slots for last-minute discussions that will be scheduled during the week, in an "unconference style". This allows ideas that come up in hallway discussions, and in the LPC miniconferences, to be given
      scheduled, dedicated times for discussion.

      • 79
        Memory management bits in arch/* Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        There is a lot of similar and duplicated code in architecture specific
        bits of memory management.

        For instance, most architectures have

        #define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO)
        

        for allocating page table pages and many of them use similar, if not
        identical, implementation of pte_alloc_one*().

        But that's only the tip of the iceberg.

        There are several early_alloc() or similarily called routines that do

        if (slab_is_available())
            return kzalloc();
        else
            return memblock_alloc();
        

        Some other trivial examples are free_initmem(), free_initrd_mem()
        which were nearly identical accross many architectures until very
        recently.

        More complex cases are per-cpu initialization, passing of memory topology
        to the generic MM, reservation of crash kernel, mmap of vdso etc. They
        are not really duplicated, but still are very similar in at least
        several architectures.

        While factoring out the common code is an obvious step to take, I
        believe there is also room for refining arch <-> mm interface to avoid
        adding extra HAVE_ARCH_NO_BOOTMEM^w^wWHAT_NOT and then searching for the
        ways to get rid of them.

        Speaker: Mike Rapoport (IBM)
      • 80
        replacing mmap_sem with finer grained locks Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        In the linux kernel, most operations affecting a process's address space are protected by by mmap_sem (a per-process read-write semaphore).

        This simple design is increasingly a problem for multi-threaded applications, and often causes threads that operate on separate parts of their address space to end up blocking on each other due to false sharing issues - mmap_sem only supports locking the entire address space at once, so it can't take into consideration that the operations are not overlapping.

        I would like to discuss:
        1- The sort of blocking issues that are seen today due to the current mmap_sem design;
        2- mmap_sem mitigations that have been introduced over time, and have kept the situation bearable but not fundamentally solved the issue;
        3- try to discuss from first principles how the MM data structures and locking mechanisms would have to evolve to support finer grained MM locking, and how to progressively migrate the current MM codebase towards such a finer grained MM locking scheme;
        4- (hopefully) present early results with a fine grained MM locking prototype.

        Speaker: Michel Lespinasse (Google)
      • 11:30 AM
        Break Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
      • 81
        Killing the mmap_sem's contention Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Big systems are becoming more common these days. Having thousands of CPUs is
        no more a dream and some applications are attempting to spread over all
        these CPUs by creating threads.
        This leads to contention on the mm->mmap_sem which is protecting the memory
        layout shared by these threads.
        There were multiple attempts to get rid of the mmap_sem's contention or the
        mmap_sem itself, Speculative Page Fault, RangeLock, Scalable Address Spaces
        Using RCU Balanced Trees...
        Unfortunately, these attempts didn't last enough to reach the upstream
        state. One the reason could be the major impact they are implying on the MM
        code or that they are only addressing part of the overall picture (SPF).
        Last discussions at the LSF/MM summit were not leading to an agreement on a
        solution (see LWN coverage).
        This topic is presenting one of emerging solution which didn't get the time
        to be proposed at the last LSF/MM. It is based on discussion some folks had
        at the end of the summit, trying to brainstorm a way to move to a split lock
        mechanism, as it was done for the PTE locking, removing the
        mm->page_table_lock.
        Currently, this work is still in progress and some deviations on the original
        design are expected to happen, so kind of split lock is the current option
        but this may change in the meantime.
        This topic is linked to the use of a Maple Tree to replace both the VMA RB
        tree and the VMA double linked list. Matthew Wilcox and Liam R. Howlett are
        working on.

        Speakers: Mr Jérôme Glisse, Mr Laurent Dufour
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 82
        Tracing Data Access Pattern with Bounded Overhead and Best-effort Accuracy Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        Background

        Memory pressure is inevitable in many environments. A decade size survey[1] of DRAM to CPU ratio in virtual machines and physical machines for data centers implies that the pressure will be even more common and severe. As an answer to this problem, heterogeneous memory systems utilizing recently evolved memory devices such as non-volatile memory along with the DRAM are rising.

        Nevertheless, because such devices are not only denser and cheaper but also obviously slower than DRAM, more optimal memory management is required.
        For this reason, various novel approaches[2,3,4] have proposed and discussed.
        One common goal of general memory management mechanisms including such approaches is placing each data object in proper location according to its data access pattern. Thus, knowing the data access pattern of given workloads is key.

        The Linux kernel is utilizing pseudo-LRU scheme for the purpose but it sacrificed too much accuracy for low overhead. Some approaches[3,5,6] track the access pattern based on page table access bit but such a technique could incur arbitrarily high overhead[3,6] or low accuracy[5] as the size of target workloads grows.

        Our solution

        We are developing a data access pattern profiling technique and tools that allow users to control the upper bound of profiling overhead while providing a best-effort quality of the result regardless of the size of the target workload.
        Basically, the solution is implemented on page table access bit sampling, which is widely used from other approaches[5]. In this approach, users can control the upper bound of the profiling overhead by setting the total number of the sampling regions.

        What differentiates ours from the others is its adaptive classification of sampling regions. The algorithm adaptively merges and splits each sampling region so that every data item in each region to have a similar data access pattern. In this way, our mechanism can minimize the number of sampling regions while maximizing the profiling accuracy.

        Implementation

        We implemented the mechanism as a kernel module that interacts with userspace via the 'debugfs' interface. We also provide userspace tools that help the use of the interface and visualization of the profiled results.

        Expected users

        We believe our mechanism could be used by both kernel space and user space.

        In kernel space, the aforementioned heterogeneous memory management approaches[2,3] could directly use this mechanism for efficient and effective data access pattern exploitation. Furthermore, this can be used by many traditional memory management subsystems that relying on the kernel's pseudo-LRU or naive assumptions. For example, selection of victim pages for page cache eviction or swap, pages to be promoted or demoted to or from huge pages (THP), target pages to compact nearby could use this.

        In userspace, system administrators or application programmers could use this tool to quickly analyze their workloads. The result can be used for a various way. Administrators might use the result to know a performance-effective working set size of their workloads and operate their system more efficiently.
        Programmers would optimize their programs using madvise()-like system calls[5] to give data access pattern hints to the system.

        Evaluations

        We applied this resulting tools to more than twenty of various realistic workloads including scientific, machine learning, and big data applications and confirmed that it provides effective and efficient profiling. For the confirmation, we visualized the output and compared with manual code review. We also evaluated its usefulness by manually estimating the performance-effective working set size and optimizing with madvise() system calls. Our performance-effective working set size was similar to that we found using time-consuming repetitive experiments and the optimization improved the performance under memory pressure situation up to 2x.

        Future plans

        We are planning to open source this tool and submit the patchset to LKML for upstream merge.

        Expected results of this talk

        We hope this talk to help discussions about the effective and efficient way to get data access pattern and how to use the data from memory management systems.
        Also, we would like to hear back kernel core developers' comments for upstreaming of this tool.

        References

        [1] Nitu, Vlad, et al. "Welcome to zombieland: practical and energy-efficient memory disaggregation in a datacenter." Proceedings of the Thirteenth EuroSys Conference. ACM, 2018.
        [2] "NUMA nodes for persistent-memory management." https://lwn.net/Articles/787418/
        [3] "Proactively reclaiming idle memory." https://lwn.net/Articles/787611/
        [4] "[RFCv2 0/6] introduce memory hinting API for external process." https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
        [5] "Cache Modeling and Optimization using Miniature Simulations." https://www.usenix.org/conference/atc17/technical-sessions/presentation/waldspurger
        [6] "Idle Page Tracking." https://www.kernel.org/doc/html/latest/admin-guide/mm/idle_page_tracking.html

        Speaker: SeongJae Park
      • 83
        Interrupt Message Store: A scalable interrupt mechanism for the cloud Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        With virtualization being the key to the success of cloud computing, Intel's
        introduction of the Scalable IO Virtualization (SIOV) aims to further the cause by shifting the creation of assignable virtual devices from hardware to a more software assisted approach. Using SIOV, a device resource can be mapped directly to guest or other user space drivers for near native DMA (Direct Memory Access) performance. This flexible composition of direct assignable devices a.k.a. Assignable Device Interfaces (ADIs) is device specific and light weight, thus making them highly scalable. SIOV enables simpler device designs by unchaining hardware from costly PCI standard and can help address limitations associated with direct device assignment.

        Until now, message signaled interrupts (MSI and MSI-X) were the de facto standard for device interrupt mechanism and could support up to 2048 interrupts per device. But now with SIOV, there is a need to support a large number of ADIs (>2048), through a matching scalable interrupt management mechanism to service these ADIs.

        Interrupt message storage (IMS) is conceived as a scalable albeit device specific interrupt mechanism to meet such a demand. It allows non-PCI standard storage and enumeration of MSI address/data pair to reduce hardware overhead and achieve scalability. The size, location, and storage format for IMS is device-specific; some devices may implement IMS as on-device storage, while other devices may implement IMS in host memory.

        Also, one of the limitations with the current Linux MSI-X code is that PCIe device MSI-x enablement and allocation is static. i.e. device driver gets only one chance to enable MSI-X vectors, usually during probe. With IMS, we aim to make the vector negotiation with the OS dynamic, deferring vector allocation to post probe phase, where actual demand information is available.

        Through this presentation, the audience can view the internals of the complex
        and ever evolving Linux interrupt subsystem and understand how IMS can fit into the maze of interrupt domains, chips, remapping etc. Also, an initial IMS Linux implementation will be presented with highlights on some of the unique implementation challenges. We will conclude by demonstrating a test case using the SIOV enabled device as an example for a complete view of IMS in a scalable virtualization environment.

        Speaker: Megha Dey
      • 4:30 PM
        Break Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
      • 84
        Kernel documentation Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        What could be more fun than talking about kernel documentation? Things we
        could get into:

        • The state of the RST transition, what remains to be done, whether it's
          all just useless churn that makes the documentation worse, etc.

        • Things we'd like to improve in the documentation toolchain.

        • Overall organization of Documentation/ and moving docs when the need
          arises. It seems I end up fighting about this more than just about
          anything else, but I think it's important to organize our docs for the
          convenience of the people using them.

        • The ultimate vision for kernel docs (for now). RST conversion and
          imposing some organization are important, but they will not,
          themselves, give us a coherent set of documentation. What can we do to
          have documentation that is useful, current, and maintainable, rather
          than the dusty attic we have now?

        Speaker: Jonathan Corbet (Linux Plumbers Conference)
    • LPC Refereed Track Floriana/room-II (Corinthia Hotel Lisbon)

      Floriana/room-II

      Corinthia Hotel Lisbon

      200
      • 85
        BPF is eating the world, don't you see? Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The BPF VM in the kernel is being used in ever more scenarios where running a restricted, validated program in kernel space provides a super powerful mix of flexibility and performance which is transforming how a kernel work.

        That creates challenges for developers, sysadmins and support engineers, having tools for observing what BPF programs are doing in the system is critical.

        A lot has been done recently in improving tooling such as perf and bpftool to help with that, trying to make BPF fully supported for profiling, annotating, tracing, debugging.

        But not all perf tools can be used with JITed BPF programs right now, areas that need work, such as putting probes and collecting variable contents as well as further utilizing BTF for annotation are areas that require interactions with developers to gather insights for further improvements so as to have the full perf toolchest available for use with BPF programs.

        These recent advances and this quest for feedback about what to do next should be the topic of this talk.

        Speaker: Arnaldo Carvalho de Melo (Red Hat Inc.)
      • 86
        Maintaining out of tree patches over the long term Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The PREEMPT_RT patchset is the longest existing large patchset living outside the Linux kernel. Over the years, the realtime developers had to maintain several stable kernel versions of the patchset. This talk will present the lessons learned from this experience, including workflow, tooling and release management that has proven over time to scale. The workflow deals with upstream changes and changes to the patchset itself. Now that the PREEMPT_RT patchset is about the be merged upstream, we want to share our toolset and methods with others who may be able to benefit from our experience.

        This talk is for people who want to maintain an external patchset with stable releases.

        Speakers: Daniel Wagner, Daniel Bristot de Oliveira (Red Hat, Inc.), Steven Rostedt, Tom Zanussi, John Kacur
      • 11:30 AM
        Break Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
      • 87
        Integration of PM-runtime with System-wide Power Management Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        There are two flavors of power management supported by the Linux kernel: system-wide PM based on transitions of the entire system into sleep states and working-state PM focused on controlling individual components when the system as a whole is working. PM-runtime is part of working-state PM concerned about the opportunity to put devices into low-power states when they are not in use.

        Since both PM-runtime and system-wide PM act on devices in a similar way (that is, they both put devices into low-power states and possibly enable them to generate wakeup signals), optimizations related to the handling of already suspended devices can be made, at least in principle. In particular:
        * It should be possible to avoid resuming devices already suspended by runtime PM during system-wide PM transitions to sleep states.
        * It should be possible to leave devices suspended during system-wide PM transitions to sleep states in PM-runtime suspend while resuming the system from those states.
        * It should be possible to re-use PM-runtime callbacks in device drivers for the handling of system-wide PM.

        These optimizations are done by some drivers, but making them work in general turns out to be a hard problem. They are achieved in different ways by different drivers and some of them are in effect only in specific platform configurations. Moreover, there are no general guidelines or recipes that driver writers can follow in order to arrange for these optimizations to take place. In an attempt to start a discussion on approaching this problem space more consistently, I will give an overview of it, describe the solutions proposed and used so far and suggest some changes that may help to improve the situation.

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 88
        Kernel Address Space Isolation Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Recent vulnerabilities like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS) have shown that the cpu hyper-threading architecture is very prone to leaking data with speculative execution attacks.

        Address space separation is a proven technology to prevent side channel vulnerabilities when speculative execution attacks are used. It has, in particular, been successfully used to fix the Meltdown vulnerability with the implementation of Kernel Page Table Isolation (KPTI).

        Kernel Address Space Isolation aims to use address spaces to isolate some parts of the kernel to prevent leaking sensitive data under speculative execution attacks.

        A particularly good example is KVM. When running KVM, a guest VM can use speculative execution attacks to leak data from the sibling hyper-thread, thus potentially accessing data from the host kernel, from the hypervisor or from another VM, as soon as they run on the same hyper-thread.

        If KVM can be run in an address space containing no sensitive data, and separated from the full kernel address space, then KVM would be immune from leaking secrets no matter on which cpu it is running, and no matter what is running on the sibling hyper-threads.

        A first proposal to implement KVM Address Space Isolation has recently been submitted and got some good feedback and discussions:

        https://lkml.org/lkml/2019/5/13/515

        This presentation would show progress and challenges faced while implementing KVM Address Space Isolation. It also looks forward to discuss the possibility to have a more generic kernel address space isolation framework (not limited to KVM), and how it can be interfaced with the current memory management subsystem in particular.

        MERGED with:

        Address space isolation has been used to protect the kernel from the
        userspace and userspace programs from each other since the invention of
        the virtual memory.

        Assuming that kernel bugs and therefore vulnerabilities are inevitable
        it might be worth isolating parts of the kernel to minimize damage
        that these vulnerabilities can cause.

        Recently we've implemented a proof-of-concept for "system call
        isolation (SCI)" mechanism that allows running a system call with
        significantly reduced page tables. In our model, the accesses to a
        significant part of the kernel memory generate page faults, thus
        giving the "core kernel" an opportunity to inspect the access and
        refuse it on a pre-defined policy.

        Our first target for the system call isolation was an attempt to
        prevent ROP gadget execution [1], and despite its weakness it makes a
        ROP attack harder to execute and as a nice side effect SCI can be used
        as Spectre mitigation.

        Another topic of interest is a marriage between namespaces and address
        spaces. For instance, the kernel objects that belong to a particular
        network namespace can be considered as private data and they should
        not be mapped in other network namespaces.

        This data separation greatly reduces the ability of a tenant in one
        namespace to exfiltrate data from a tenant in a different namespace
        via a kernel exploit because the data is no longer mapped in the
        global shared kernel address space.

        We believe it would be helpful to discuss the general idea of address
        space isolation inside the kernel, both from the technical aspect of
        how it can be achieved simply and efficiently and from the isolation
        aspect of what actual security guarantees it usefully provides.

        [1] https://lore.kernel.org/lkml/1556228754-12996-1-git-send-email-rppt@linux.ibm.com/

        Speakers: Alexandre Chartre (Oracle), James Bottomley (IBM), Mike Rapoport (IBM), Joel Nider (IBM Research)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 89
        Enabling TPM based system security features Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Nowadays all consumer PC/laptop devices contain TPM2.0 security chip (due to Windows hardware requirements). Also servers and embedded devices increasingly carry these TPMs. It provides several security functions to the system and the user, such as smartcard-like secure keystore and key operations, secure secret storage, bruteforce-protected access control, etc.

        These capabilities can be used in a multitude of scenarios and use cases, including disk encryption, device authentication, user authentication, network authentication, etc. of desktops/laptops, servers, IoTs, mobiles, etc.
        Utilizing the TPM requires several layers of software; the driver (inside the kernel), tpm middleware (a TSS implementation), security middleware (e.g. pkcs11), applications (e.g. ssh).

        This talk first gives an architectural overview of the hard-/software components involved in typical use cases. Then we will dive into a set of concrete use cases and on different ways in which they can be built up; these use cases will be related to device/user authentication around pkcs11 and openssl implementations.

        The talk will end with a list of software and works in progress for introducing TPM functionality to core applications. Finally, a list of potential projects for extending the utilization of the TPM in core software is presented. This latter list shall then drive the discussion on which software is missing or which software has cotributors attending that would like to include such features or which software is currently missing on the list. The current lists of core software are available and updated at https://tpm2-software.github.io/software

        Keywords: core libraries, device support, security, tpm, tss

        Speaker: Mr Andreas Fuchs (Fraunhofer SIT)
      • 90
        Utilizing tools made for "Big Data" to analyse Ftrace data - making it fast and easy Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Tools based on low level tracing tend to generate large amounts of data, typically outputted in some kind of text or binary format. On the other hand the predefined data analysis features of those tools are often useless when it comes to solving a nontrivial or very user-specific problem. This is when the possibility to make sophisticated analysis via scripting can be extremely useful.

        Fast and easy scripting inside the tracing data is possible if we take advantage of the already existing infrastructure, originally developed for the purposes of the "Big Data" and ML industries. A PoC interface for accessing Ftrace data in Python (via NumPy arrays) will be demonstrated, together with few examples of analysis scripts. Currently the prototype of the interface is implemented as an extension of KernelShark. This is a work in progress, and we hope to receive advice from experts in the field to make sure the end result works seamlessly for them.

        Speaker: Yordan Karadzhov (VMware)
      • 4:30 PM
        Break Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
      • 91
        CPU controller on a single runqueue Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The cgroups CPU controller in the Linux scheduler is implemented using hierarchical runqueues, which introduces a lot of complexity, and incurs a large overhead with frequently scheduling workloads. This presentation is about a new design for the cgroups CPU controller, which uses just one runqueue, and instead scales the vruntime by the inverse of the task priority. The goal is to make people familiar with the new design, so they know what is going on, and do not need to spend a month examining kernel/sched/fair.c to figure things out.

        Speaker: Rik van Riel (Facebook)
      • 92
        Formal verification made easy (and fast)! Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Linux is complex, and formal verification has been gaining more and more attention because independent "asserts" in the code can be ambiguous and not cover all the desired points. Formal models aim to avoid such problems of natural language, but the problem is that "formal modeling and verification" sound complex. Things have been changing.

        What if I say it is possible to verify Linux behavior using a formal method?

        • Yes! We already have some models; people have been talking about it, but they seem to be very specific (Memory, Real-time...).

        What if I say it is possible to model many Linux subsystems, to auto-generate code from the model, to run the model on-the-fly, and that this can be as efficient as just tracing?

        • No way!

        Yes! It is! It is hard to believe, I know.

        In this talk, the author will present a methodology based on events and state (automata), and how to model Linux' complex behaviors with small and intuitive models. Then, how to transform the model into efficient C code, that can be loaded into the kernel on-the-fly to verify Linux! Experiments have also shown that this can be as efficient as tracing (sometimes even better)!

        This methodology can be applied on many the kernel subsystems, and the idea of this talk is also to discuss how to proceed towards a more formally verified Linux!

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
    • Networking Summit Track Floriana/room-I (Corinthia Hotel Lisbon)

      Floriana/room-I

      Corinthia Hotel Lisbon

      180
      • 93
        XDP: the Distro View Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        It goes without saying that XDP is wanted more and more by everyone. Of course, the Linux distributions want to bring to users what they want and need. Even better if it can be delivered in a polished package with as few surprises as possible: receiving bug reports stemming from users' misunderstanding and from their wrong expectations does not make good experience neither for the users nor for the distro developers.

        XDP presents interesting challenges to distros: from the initial enablement (what config options to choose) and security considerations, through user supportability (packets "mysteriously" disappearing, tcpdump not seeing everything), through future extension (what happens after XDP is embraced by different tools, some of those being part of the distro, how that should interact with users' XDP programs?), to more high level questions, such as user perception ("how comes my super-important use case cannot be implemented using XDP?").

        Some of those challenges are long solved, some are in progress or have good workarounds, some of them are yet unsolved. Some of those are solely the distro's responsibility, some of them need to be addressed upstream. The talk will present the challenges of enabling XDP in a distro. While it will also mention the solved ones, its main focus are the problems currently unsolved or in progress. We'll present some ideas and welcome discussion about possible solutions using the current infrastructure and about future directions.

        Speakers: Jiri Benc (Red Hat), Dr Toke Høiland-Jørgensen (RedHat), Jesper Dangaard Brouer (Red Hat)
      • 11:30 AM
        Break Floriana/room-I (Corinthia Hotel Lisbon)

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 94
        Life at a Networking Vendor -- Keeping up with the Joneses Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Working for a networking hardware vendor can be an extremely rewarding experience for a kernel developer. The rate at which new features are accepted in the kernel also provides lots of motivation to develop new features that showcase hardware capabilities. This could be done by adding new support for dataplane offloads via cls flower, netfilter, or switchdev (if we still think it exists!). In-driver support for pre-SKB packet processing via XDP and AF_XDP also provide a chance for developers to search for new software optimizations in their driver receive and transmit path.

        In addition to thinking about what is happening upstream, developers at hardware vendors regularly find themselves managing internal and external expectations from those responsible for developing features that are not always exclusive to the Linux kernel. This could range from frameworks like DPDK and VPP that run on Linux or completely different OSes/stacks to functionality that is available without software interaction.

        There is no quicker way to develop new features and resolve issues than to have direct contact with hardware and firmware developers. The goal of this talk will be to share some experiences balancing the expectations of customers and partners along with those of the community.

        Speaker: Andy Gospodarek (Broadcom)
      • 95
        Future ipv4 unicast extensions Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        IPv4's success story was in carrying unicast packets
        worldwide.
        Service sites still need IPv4 addresses for everything,
        since the majority of Internet client nodes don't yet
        have IPv6 addresses. IPv4 addresses now cost 15 to 20
        dollars apiece (times the size of your network!) and
        the price is rising.

        The IPv4 address space includes hundreds of millions of
        addresses reserved for obscure (the ranges 0/8, and
        127/16), or obsolete (225/8-231/8) reasons, or for
        "future use" (240/4 - otherwise known as class E).
        Instead of leaving these IP addresses unused, we have
        started an effort to make them usable, generally. This
        work stalled out 10 years ago, because IPv6 was going
        to be universally deployed by now, and reliance on IPv4
        was expected to be much lower than it in fact still is.

        We have been reporting bugs and sending patches to
        various vendors. For Linux, we have patches accepted
        in the kernel and patches pending for the
        distributions, routing daemons, and userland tools.
        Slowly but surely, we are decontaminating these IP
        addresses so they can be used in the near future.

        Many routers already handle many of these addresses,
        or can easily be configured to do so, and so we are
        working to expand unicast treatment of these addresses
        in routers and other OSes. We plan an authorized
        experiment to route some of these addresses globally,
        monitor their reachability from different parts of the
        Internet, and talk to ISPs who are not yet treating
        them as unicast to update their networks.

        Wouldn't it be a better world with a few hundred
        million more IPv4 addresses in it?

        Speaker: Dr Dave Täht (Bufferbloat.net)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 96
        Making the Kubernetes Service Abstraction Scale using eBPF Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        In this talk, we will present a scalable re-implementation of the Kubernetes service abstraction with the help of eBPF. We will discuss recent changes in the kernel which made the implementation possible, and some changes in the future which would simplify the implementation.

        Kubernetes is an open-source container orchestration multi-component distributed system. It provides mechanisms for deploying, maintaining and scaling applications running in containers across a multi-host cluster. Its smallest scheduling unit is called a pod. A pod consists of multiple co-located containers. Each pod has its own network namespace and is addressed by an unique IP address in a cluster. Network connectivity to and among pods is handled by an external plugin.

        Multiple pods which provide the same functionality can be grouped into services. Each service is reachable within a cluster via its virtual IP address allocated by Kubernetes. Also, a service can be exposed to outside of a cluster via the public IP address of a cluster host IP address and a port which is allocated by Kubernetes. Each request sent to a service is load-balanced to any of its pods.

        Kube-proxy is a Kubernetes component which is responsible for the service abstraction implementation. The default implementation is based on Netfilter's iptables. For each service and its pods it creates couple rules in the nat table which do a load-balancing to pods. For example, for the "nginx" service which virtual IP address is 10.107.41.178 and which is running two pods with IP addresses 10.217.1.154 and 10.217.1.159 the following relevant iptables rules are created:

        
        -A KUBE-SERVICES -d 10.107.41.178/32 -p tcp -m comment --comment "default/nginx: cluster IP" -m tcp --dport 80 -j KUBE-SVC-253L2MOZ6TC5FE7P
        
        -A KUBE-SVC-253L2MOZ6TC5FE7P -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-PCCJCD7AQBIZDZ2N
        -A KUBE-SVC-253L2MOZ6TC5FE7P -j KUBE-SEP-UFVSO22B5A7KHVMO
        
        -A KUBE-SEP-PCCJCD7AQBIZDZ2N -s 10.217.1.154/32 -j KUBE-MARK-MASQ
        -A KUBE-SEP-PCCJCD7AQBIZDZ2N -p tcp -m tcp -j DNAT --to-destination 10.217.1.154:80
        -A KUBE-SEP-UFVSO22B5A7KHVMO -s 10.217.1.159/32 -j KUBE-MARK-MASQ
        -A KUBE-SEP-UFVSO22B5A7KHVMO -p tcp -m tcp -j DNAT --to-destination 10.217.1.159:80
        
        

        It has been demonstrated [1][2][3] that kube-proxy due to its foundational technologies (Netfilter, iptables) is one of the major pain points when running Kubernetes at large scale from performance, reliability, and operations perspective.

        Cilium is an open-source networking and security plugin for container orchestration systems, such as Kubernetes. Unlike the majority of such networking plugins, it heavily relies on eBPF technology which lets one to dynamically reprogram the kernel.

        The most recent Cilium v1.6 release brings the implementation in eBPF of the Kubernetes service abstraction. This allows one to run a Kubernetes cluster without kube-proxy. Thus, it makes Kubernetes no longer dependent on Netfilter/iptables. This improves scalability and reliability of a Kubernetes cluster.

        No Kubernetes knowledge is required. The talk might be relevant for those who are interested in container networking with eBPF (loadbalancing, NAT).

        [1]: https://sched.co/MPch
        [2]: https://bit.ly/2xKk2pr
        [3]: https://bit.ly/2WU7BCN

        Speakers: Mr Borkmann Daniel (Cilium), Mr Pumputis Martynas (Cilium)
      • 97
        Making Networking Queues a First Class Citizen in the Kernel Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        XDP (the eXpress Data Path) is a new method in Linux to process
        packets at L2 and L3 with really high performance. XDP has already
        been deployed for use cases involving ingress packet filtering, or
        transmission back through the ingress interface, are already well
        supported today. However, as we expand the use cases that involve the
        XDP_REDIRECT action, e.g., to send packets to other devices, or
        zero-copy them to userspace sockets, it becomes challenging to retain
        the high performance of the simpler operating modes.

        One of the keys to get good performance for these advanced use cases,
        is effective use of dedicated hardware queues (on both Rx and Tx), as
        this makes it possible to split traffic over multiple CPUs, with no
        synchronization overhead in the fast path. The problem with using
        hardware queues like this is that they are a constrained resource, but
        are hidden from the rest of the kernel: Currently, each driver
        allocates queues according to its own whims, and users have little or
        no control over how the queues are used or configured.

        In this presentation we discuss an abstraction that makes it possible
        to keep track of queues in a vendor-neutral way: We implement a new
        submodule in the Linux networking core that drivers can register their
        queues to. Other pieces of code can then allocate and free individual
        queues (or sets of them) satisfying certain properties (e.g., "a Tx/Rx
        pair", or "one queue per core"). This submodule also makes sure that
        the queues get IDs that are hardware independent, so that they can
        easily be used by other components. We show how this could be exposed
        to userspace, and how it can interact with the existing REDIRECT
        primitives, such as device maps.

        Finally if there is time, we would like to discuss a related problem:
        often a userspace program wants to express its configuration not in
        terms of queue IDs, but in terms of a set of packets it wants to
        process (e.g., by specifying an IP address). So how do we change user
        space APIs that use queue IDs to be able to use something more
        meaningful such as properties of the packet flow that a user wants? To
        solve this second problem, we propose to introduce a new bind option
        in AF_XDP that takes a simple description of the traffic that is
        desired (e.g. "VLAN ID 2", "IP address fc00:dead:cafe::1", or "all
        traffic on a netdev"). This hides queue IDs from userspace, but will
        use the new queue logic internally to allocate and configure an
        appropriate queue.

        Speakers: Magnus Karlsson (Intel), Björn Töpel (Intel), Jesper Dangaard Brouer (RedHat), Toke Höiland-Jörgensen (RedHat), Jakub Kicinski (Netronome), Maxim Mikityanskiy (Mellanox)
      • 4:30 PM
        Break Floriana/room-I (Corinthia Hotel Lisbon)

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 98
        Seamless transparent encryption with BPF and Cilium Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Providing encryption in dynamic environments where nodes are added and removed on-the-fly and services spin-up and are then torn-down frequently, such as Kubernetes, has numerous challenges. Cilium, an open source software package for providing and transparently securing network connectivity, leverages BPF and the Linux encryption capabilities to provide L3/L7 encryption and authentication at the node and service layers. Giving users the ability to apply encryption either to entire nodes or on specified services. Once configured through a high level feature flag (--enable-encrypt-l3, --enable-encrypt-l7) the management is transparent to the user. Cilium will manage and ensure traffic is encrypted allowing for auditing of encrypted/unencrypted flows via a monitoring interface to ensure compliance.

        In this talk we will show how Cilium accomplishes this in the Linux datapath and control plane. As well as discuss how Cilium with Linux and BPF fits into the evolving encryption standards and frameworks such as IPsec, mTLS, Secure Production Identity Framework For Everyone (SPIFFE), and Istio. Looking forward we propose a set of extensions to the Linux kernel, specifically to the BPF infrastructure, to ease the adoption and improve the efficiency of these protocols. Specifically, we will look at a series of BPF helpers, possible hardware support, scaling to thousands of nodes, and transparently enforcing policy on encrypted sessions.

        Finally to show this is not mere slide-ware we will show a demo Cilium implementing transparent encryption.

        Speaker: Mr John Fastabend (Isovalent)
      • 99
        Ethernet Cable Diagnostic using Netlink Ethtool API Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Many Ethernet PHYs contain hardware to perform diagnostics of the
        Ethernet cable. Breaks in the cable and shorts within a twisted pair
        or to other pairs can be detected, and an estimate to the length along
        the cable to the fault can be made. The talk will explain, at a high
        level, how such diagnostics work, sending pulses down the cables and
        looking for reflections. There is no standardization on such
        diagnostics, and what information the PHY reports varies between
        vendors. The ongoing work to allow ethtool to make use of a netlink
        socket makes the ethtool API much more flexiable. This flexibility has
        been used to provide a generic API to request a PHY performs
        diagnostics tests and to report the results. Some aspects of this API
        will be discussed, using the Marvell PHYs as examples. The talk aims
        to spread knowledge on this work and encourage driver writers to
        implement diagnostics for other PHYs.

        Speaker: Andrew Lunn
    • Open Printing MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      Videos of the Topics:

      Printing in Linux as of today (00:0)
      Aveek Basu and Till Kamppeter

      Common Print Dialog Backends (29:14)
      Rithvik Patibandia

      Working with SANE to make IPP scanning a reality (51:51)
      Aveek Basu

      Printer/Scanner Application - The new format for printer and scanner drivers (1:15:30)
      Till Kamppeter

      The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service (2:05:10)
      Till Kamppeter

      3D Printing without the use of any slicer (2:20:04)
      Aveek Basu

      The Open Printing (OP) organisation works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and Unix-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.

      We maintain cups-filters which allows CUPS to be used on any Unix-based (non-macOS) system. Open Printing also maintains the Foomatic database which is a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.

      Today it is very hard to think about printing in UNIX based OSs without the involvement of Open Printing. Open Printing has been successful in implementing driverless printing following the IPP standards proposed by the PWG as well.

      Proposed Topics:

      Working with SANE to make IPP scanning a reality. We need to make scanning work without device drivers similar to driverless printing.
      Common Print Dialog Backends.
      Printer/Scanner Applications - The new format for printer and scanner drivers. A simple daemon emulating a driverless IPP printer and/or scanner.
      The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service. Controlling tools like cups-browsed (or perhaps also the print dialog backends?) to make the user's print dialogs only showing the relevant ones or to create printer clusters.
      3D Printing without the use of any slicer. A filter that can convert a stl code to a gcode.
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Till Kamppeter (till.kamppeter@gmail.com ) or Aveek Basu (basu.aveek@gmail.com)

      • 100
        Printing in Linux as of today

        Today’s is a scenario when we can not think of having either a mobile phone or a laptop or a tablet. With the progress of technology and having all these handheld devices, we have been able to get many of our documents digitized. However, whatever advancements we see in this space of documentation, it is still very hard to find someone who did not have the need to print or scan a hard copy. Even today a critical agreement gets signed over a hard copy so do most of our banking documents or promo advertisements in a supermarket.

        The OpenPrinting (OP) organization works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and UNIX-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.
        We maintains cups-filters which allows CUPS to be used on any Unix-based (non-macOS) system. OpenPrinting maintains the Foomatic database which is a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.

        OpenPrinting has been doing a commendable job in improving the way the world prints on a UNIX based system. The projects that we maintain are taken up by almost all the Linux distributions and most recently Google Chrome OS. It is also used by most of the printer manufacturers to support printing. Today it is very hard to think about printing in these OSs without the involvement of OpenPrinting. We have been successful in implementing the driverless printing following the IPP standards proposed by the PWG. Because of that, today someone can think of printing from a Linux box by just connecting a printer over network or USB. Now using a printer has become as simple as using a thumb drive.

        A short showcase on printing in Linux.

        Speakers: Aveek Basu, Till Kamppeter
      • 101
        Common Print Dialog Backends

        The OpenPrinting project “Common Print Dialog Backends” provides a D-Bus interface to separate the print dialog GUI from the communication with the actual printing system (CUPS, Google Cloud Print, e.t.c.) having each printing system being supported with a backend and these GUI-independent backends working with all print dialogs (GTK/GNOME, Qt/KDE, LibreOffice, e.t.c.). This allows for easily updating all print dialogs when something in a print technology changes, as only the appropriate backend needs to get updated. Also new print technologies can get easily introduced by adding a new backend.
        For quickly getting this concept into the Linux distributions we need these important tasks to be done.
        The CUPS backend tells the print dialog only about printer-specific user-settable options, not about general options implemented in CUPS or cups-filters and so being available for all print queues. These are options like N-up, reverse order, selected pages, e.t.c. as they are only common for CUPS and not necessarily available with other print technologies like Google Cloud Print, they should get reported to the print dialog by the CUPS backend.
        A print dialog should allow to print into a (PDF) file. This should be implemented in a new print dialog backend. [DONE: https://github.com/OpenPrinting/cpdb-backend-file]
        As it will take time until GTK4 with its new print dialog is out, we should get support for the new Common Print Dialog Backends concept for the current GTK3 dialog. As this dialog has its own backend concept one simply would need an “adapter” backend to get from the old concept to the new, common concept.
        [Qt print dialog integration]

        Speakers: Rithvik Patibandla, Till Kamppeter
      • 102
        Working with SANE to make IPP scanning a reality

        Printing at today’s date has progressed a lot and the world is already utilising the benefits of driverless printing. In today’s scenario it is very hard to think of a printer without a scanner. But unfortunately a technology like driverless scanning has yet to see the light of the day. In today’s date you cannot think of using a scanner without a scanner driver. We want to discuss more on this and what needs to be done to get rid of this problem.

        Version 2.0 and newer of the Internet Printing Protocol (IPP) supports polling the full set of capabilities of a printer and if the printer supports a known Page Description Language (PDL), like PWG Raster, Apple Raster, PCLm, or PDF, it is possible to print without printer-model-specific software (driver) or data (PPD file), so-called “driverless” printing. This concept was introduced for printing from smartphones and IoT devices which do not hold a large collection of printer drivers. Driverless printing is already fully supported under Linux. Standards following this scheme are IPP Everywhere, Apple AirPrint, Mopria, and Wi-Fi Direct Print. As there are many multi-function devices (printer/scanner/copier all-in-one) which use the IPP, the Printing Working Group (PWG) has also worked out a standard for IPP-based scanning, “driverless” scanning, to also allow scanning from a wide range of client devices, independent of which operating systems they are running. Conventional scanners are supported under Linux via the SANE (Scanner Access Now Easy) system and require drivers specific to the different scanner models. Most of them are written based on reverse-engineering due to lack of support by the scanner manufacturers. To get driverless scanning working with the software the users are used to the best solution is to write a SANE module for driverless IPP scanning. This module will then automatically support all IPP scanners, thousands of scanners where many of them do not yet exist.
        Another application for driverless IPP scanning is sharing local scanners which are accessed with SANE. Instead of the SANE frontend being a UI, either command line or graphical, it could be a daemon which emulates an IPP scanner on the network, executing the client’s scan requests on the local scanner.
        This way the client only needs to support IPP scanning, no driver for the actual scanner is needed and the client can be of any operating system or device type, including mobile phones, tablets, IoT, e.t.c.

        Speaker: Aveek Basu
      • 11:30 AM
        Break

        Break

      • 103
        Printer/Scanner Applications - The new format for printer and scanner drivers

        The upstream author of CUPS has deprecated the classic way to implement printer drivers, describing the printer's capabilities in PPD (PostScript Printer Description) files and providing filters to convert standard PDLs (Page Description Languages) into the printer's own, often proprietary data format. With the background of PostScript not being the standard PDL any more, most modern (even the cheapest) printers being IPP driverless printers (using standard PDLs and printer's capabilities can get polled from the printer via IPP), and modern systems using sandboxed application packaging (Snappy, Flatpak, e.t.c.) the new Printer Application concept got introduced.
        A Printer Application is a (simple) daemon emulating a driverless IPP printer (can be in the local network but also simply on localhost). Like a physical printer this daemon advertises itself via DNS-SD, takes get-printer-attributes IPP requests and answers with printer capability info so that the client can create a local print queue pointing to it, takes print jobs, converts them to the physical printer's data format and sends them off to the printer.
        This way the client "sees" a driverless IPP printer and the Printer Application is the printer driver (printer-model-specific software to make the printer work). So with the driver being connected to the system's printing stack only via IP and no consisting of files spread into directories of the printing stack, both the printing stack and the driver can be in separate, sandboxed applications, provided as sandboxed packages in the app stores of the appropriate packaging systems (Snappy, Flatpak, e.t.c.). And this allows the driver not depending on a specific operating system distribution any more. A printer manufacturer only needs to make a driver "for Snappy", not for Ubuntu Desktop/Server, Ubuntu Core, Red Hat, SUSE, e.t.c. making development and testing much easier and cheaper.
        And one can even go further: As the Printer Working Group (PWG) also has created an IPP driverless scanning standard, we can create Scanner Applications emulating a driverless IPP scanner and internally using scanner drivers, like SANE, to communicate with the scanner, allowing the same form of OS-distribution-independent sandboxed driver packages for ANY scanner, especially also stand-alone scanners without printing engine.
        For multi-function printers one could also have a combined Printer/Scanner application. Any such Printer and/or Scanner Application can even provide an IPP System Service interface to allow configuring the driver without need of specialized GUI applications on the client.
        We have a Google Summer of Code student working on a framework for Printer Applications, to convert classic printer drivers into Printer Applications to kick off the new standard.
        In this session we will present the new format, its integration into real life systems, problems we got into during the work with our student, and how to present it to hardware manufacturers as the new way to go.

        Speaker: Till Kamppeter
      • 104
        The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service

        Very common in the daily life of computer users are printer setup tools, these GUI applications where you configure a queue for a new printer which you want to use. You select the printer from auto-detected ones and choose a driver for it, nowadays it gets rather common that the driver is selected automatically. You also set option defaults, like Letter/A4, print quality, …
        With the advent of driverless IPP printers and automatic setup of network printers the classic printer setup tool gets less important. Especially one sees this on smartphones and tablets which do not even have a printer setup tool and option settings and default printers are selected in the print dialogs.
        But this does not mean that the time of printer setup tools is over, especially in larger networks they can help getting an overview of the available printers, controlling tools like cups-browsed (or perhaps also the print dialog backends?) to make the user's print dialogs only showing the relevant ones or to create printer clusters.
        Also the printers itself could be configured with a printer setup tool when they support the new IPP System Service standard, an interface which allows remote administration of IPP network printers, similar to what you can do with the printer's web interface but with a standardized client GUI.
        In this session we will talk about new possibilities in printer setup tools and their implementation. Ideas are:
        Client GUI for IPP System Service - Administration of network printers
        Configuring cups-browsed - GUI for printer list filtering, printer clustering, …
        Configuring Common Print Dialog Backends
        More ideas are naturally welcome.

        Speaker: Till Kamppeter
      • 105
        3D Printing without the use of any slicer.

        Currently to print an stl model in a 3D printer the same needs to be sliced first into a gcode to be understandable by a 3D printing software. In Linux we do not have any filter that can convert a stl code to a gcode. First we plan to discuss on what is the current scenario and then what can we do to fit in Linux.

        Speaker: Aveek Basu
    • Testing and Fuzzing MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      The Linux Plumbers 2019 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.

      Videos of the Topics:

      kernelCI: testing a broad variety of hardware (00:00)
      Kevin Hillman and Guillaume Tucker

      Dealing with complex test suites (32:29)
      Guillaume Tucker

      GWP-ASAN (52:42)
      Dmitry Vyukov

      Fighting uninitialized memory in the kernel (1:13:06)
      Alexander Potapenko

      syzbot (1:26:53)
      Dmitry Vuykov

      Collabora/unification around unit testing frameworks (1:48:49)
      Knut Omang - Sorry for the very low audio at the start. Microphone problem

      All about Kselftest (2:19:25)
      Shuah Khan

      Potential topics:
      Defragmentation of testing infrastructure: how can we combine testing infrastructure to avoid duplication.
      Better sanitizers: Tag-based KASAN, making KTSAN usable, etc.
      Better hardware testing, hardware sanitizers.
      Are fuzzers "solved"?
      Improving real-time testing.
      Using Clang for better testing coverage.
      Unit test framework. Content will most likely depend on the state of the patch series closer to the event.
      Future improvement for KernelCI. Bringing in functional tests? Improving the underlying infrastructure?
      Making KMSAN/KTSAN more usable.
      KASAN work in progress
      Syzkaller (+ fuzzing hardware interfaces)
      Stable tree (functional) testing
      KernelCI (autobisect + new testing suites + functional testing)
      Kernel selftests
      Smatch
      Our objective is to gather leading developers of the kernel and it’s related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we could build on the experience of the participants of this MC to create solid plans for the upcoming year.

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Sasha Levin levinsasha928@gmail.com and Dhaval Giani dhaval.giani@gmail.com

      • 106
        kernelCI: testing a broad variety of hardware

        kernelCI: testing a broad variety of hardware

        The Linux kernel runs on an extremely wide range of hardware, but
        with the rapid pace of kernel development, it's difficult to ensure
        the full range of supported hardware is adequately tested.

        The kernelCI project is a small, but growing project, focused on
        testing the core kernel on diverse set of architectures, boards and
        compilers using distributed labs to test hardware anywhere on the
        planet.

        The goal of this presentation is to give a very brief overview of the
        project, and discuss the near-term future goals and plans.

        Recently added:
        - support for clang-build kernels
        - more arches: ARC, RISC-V, MIPS

        The future:
        - official Linux Foundation project launching
        - more tests: subsystem-focused test suites
        - more labs with more hardware
        - scaling of infrastructure
        - better reporting

        Speakers: Kevin Hilman (BayLibre), Guillaume Tucker (Collabora Limited)
      • 107
        Dealing with complex test suites

        Boot testing is already hard to do well on a wide variety of
        hardware. However it is only scratching the surface of the
        kernel code base. To take projects such as Kernel CI to the next
        level and increase coverage, functional tests are becoming the
        next big thing on the list. Large test suites that run close to
        the hardware are very hard to tame. Some projects such as
        ezbench could become very helpful outside of its initial
        territory that is Intel graphics. But to start with, let us try
        to define the problem space and take a look at the state of the
        art in this area to then come up with ideas that apply to
        upstream kernel functional testing.

        Speaker: Guillaume Tucker (Collabora Limited)
      • 108
        GWP-ASAN

        In this talk Dmitry will introduce the idea of GWP-ASAN, a sampling tool that finds use-after-free and heap-buffer-overflows bugs in production environments. GWP-ASan supplements the normal slab allocator and chooses random allocations to 'sample'. These sampled allocations are placed into a special guarded pool, which is based upon the traditional 'Electric Fence Malloc Debugger' idea. Dmitry will share experiences of using such tool in user-space and speculate about how useful such tool would be for kernel.

        Speaker: Dmitry Vyukov (Google)
      • 109
        Fighting uninitialized memory in the kernel

        During the last two years, KMSAN (a detector of uses of uninitialized
        memory based on compiler instrumentation) has found more than a
        hundred bugs in the upstream kernel.
        We'll discuss the current status of the tool, some of its findings and
        implementation challenges. Ideally, I'd like to get more people to
        look at the code, as finding bugs in particular subsystems may require
        deeper knowledge of those subsystems.
        Another thing that'll be covered is the new stack and heap
        initialization features that will hopefully prevent most of the bugs
        related to uninitialized memory in the kernel.

        Speaker: Alexander Potapenko (Google)
      • 11:30 AM
        Break
      • 110
        syzbot: update and open problems

        In this talk, Dmitry will share updates on syzkaller/syzbot since last year: USB fuzzing, bisection, memory leaks. Talk about open problems: testability of kernel components; test coverage; syzbot process.

        Speaker: Dmitry Vyukov (Google)
      • 111
        Collaboration/unification around unit testing frameworks

        From the initial reactions and interest I have seen wrt. KTF
        (http://heim.ifi.uio.no/~knuto/ktf/, https://github.com/oracle/ktf)
        and the discussions on LKML around KUnit (https://lkml.org/lkml/2018/11/29/82),
        it seems there's a general belief that some form of unit test framework
        like these can be a good addition to the tools and infrastructure already available
        in the kernel.

        It seems however that different people have different notions about what
        and how such a framework should ideally look, and what features belong there.
        I'd like to see if we can bring that discussion forward by focusing on
        some of these items, where people seem to have quite differing views
        depending on where they come from. Here is a non extensive list of
        some topics that seems to pop up when this gets discussed:

        • "Purity" of unit testing - what constitutes a "unit" in the kernel?
        • Testing kernel code - user space vs kernel space? (both useful)
        • Immediate development/debugging requirements vs longer term needs
        • Driver/hardware interaction testing?
        • "Neat"-factor
        • ease of use
        • Network testing (more than 1 kernel involved)
        • How to best integrate with existing test infrastructure in the
          kernel
        • Unification and simpliciation options
          ...

        I'd like to make a short intro into this, and hopefully we can have some
        good exchange based on that.

        Speaker: Dr Knut Omang (Oracle)
      • 112
        All about Kselftest

        Kselftest started out as an effort to enable a developer-focused regression test framework in the kernel to ensure the quality of new kernel releases. Today it is an integral part of the Linux Kernel development process to qualify Linux mainline and stable release candidates.

        Shuah will go over the Kselftest framework, how to write tests that work well with the framework for effective reporting of results. In addition, Shuah will discuss how the framework is tailored for developers as well as users to serve their individual and unique needs and discuss future plans.

        Speakers: Shuah Khan (The Linux Foundation), Anders Roxell, Dan Rue
    • Toolchains MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      The goal of the Toolchains Microconference is to focus on specific topics related to the GNU Toolchain and Clang/LLVM that have a direct impact in the development of the Linux kernel.

      The intention is to have a very practical MC, where toolchain and kernel hackers can engage and, together:

      Identify problems, needs and challenges.
      Propose, discuss and agree on solutions for these specific problems.
      Coordinate on how to implement the solutions, in terms of interfaces, patches submissions, etc in both kernel and toolchain component.
      

      Consequently, we will discourage vague and general "presentations" in favor of concreteness and to-the-point discussions, encouraging the participation of everyone present.

      Examples of topics to cover:

      Header harmonization between kernel and glibc.
      Wrapping syscalls in glibc.
      eBPF support in toolchains.
      Potential impact/benefit/detriment of recently developed GCC optimizations on the kernel.
      Kernel hot-patching and GCC.
      Online debugging information: CTF and BTF
      

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Jose E. Marchesi jose.marchesi@oracle.com and Elena Zannoni ezannoni@gmail.com

      • 113
        Analyzing changes to the binary interface exposed by the Kernel to its modules

        Operating system distributors often face challenges that are somewhat
        different from that of upstream kernel developers. For instance, some
        kernel updates often need to stay at least binary compatible with
        modules that might be "out of tree" for some time.

        In that context, being able to automatically detect and analyze
        changes to the binary interface exposed by the kernel to its module
        does have some noticeable value.

        The Libabigail framework is capable of analyzing ELF binaries along
        with their accompanying debug info in the DWARF format, detect and
        report changes in types, functions, variables and ELF symbols. It has
        historically supported that for user space shared libraries and
        application so we worked to make it understand the Linux kernel
        binaries.

        In this presentation, we are going to present the current support of
        ABI analysis for Linux Kernel binaries, especially the kind of
        information that Libabigail consumes from DWARF and thus what it would
        need from an alternative debug info format.

        We hope the presentation will lead to discussions on topics revolving
        around what it would take to adapt Libabigail to the emerging
        alternate debug info formats and if that would make sense at all.

        Speaker: Mr Dodji Seketeli (Red Hat)
      • 114
        Wrapping system calls in glibc

        The glibc project decided a while back that it wants to add wrappers for
        system calls which are useful for general application usage. However,
        that doesn't mean that all those missing system calls are added
        immediately.

        System call wrappers still need documentation in the manual, which
        can be difficult in areas where there is no consensus how to describe
        the desired semantics (e.g., in the area of concurrency). Copyright
        assignment to the FSF is needed for both the code and the manual update,
        but can usually be performed electronically these days, and is reasonably
        straightforward. On top of that, the glibc project is seriously
        constrained by available reviewer bandwidth.

        Some more specific notes:

        Emulation of the system call is not required. It has been historically
        very problematic. The only thing that has not come back to bite us is
        checking if a new flag argument is zero and call the old, equivalent
        system call instead in this case.

        Wrapper names should be architecture-independent if at all possible.
        Sharing system call names as much as possible between architectures in
        the UAPI headers helps with that.

        Mutiplexing system calls are difficult to wrap, particularly if the
        types and number of arguments vary. Previous attempts to use varargs
        for this have led to bugs. For example, open/openat would not pass
        down the mode flag for O_TMPFILE initially, or cannot be called with
        a non-variadic prototype/function pointer on some architectures. We
        wouldn't want to wrap ipc or socketcall (even if they had not been
        superseded), and may wrap futex as separate functions.

        We strongly prefer if a system call that is not inherently
        architecture-specific (e.g., some new VFS functionality) is enabled
        for all architectures in the same kernel release.

        When it comes to exposing the system call, we prefer to use ssize_t or
        size_t for buffer sizes (even if the kernel uses int or unsigned int),
        purely for documentation purposes. Flag arguments should not be long
        int because it is unclear whether in the future more than 32 flags will
        be added on 64-bit architectures. Except for pthread_* functions, error
        reporting is based on errno and special return values.

        Passing file offsets through off64_t * arguments is fine with us.
        Otherwise, off64_t parameter passing tends to vary too much.

        If constants and types related to a particular system call are defined
        in a separate header which does not contain much else, we can include
        that from the glibc headers if available. As result, new kernel flags
        will become available to application developers immediately once they
        install newer kernel headers. This may not work for multiplexing
        system calls, of course, even if we wrap the multiplexer.

        Speakers: Dmitry Levin (BaseALT), Florian Weimer, Maciej W. Rozycki
      • 115
        Security feature parity between GCC and Clang

        There are many security features common to both GCC and Clang, but there is a growing set of features that are missing from GCC and present in Clang, missing from Clang and present in GCC, or missing in both. This session seeks to enumerate and discuss these areas, with the eye toward finding next steps forward (or at least elevating development priority).

        Potential areas of focus:
        - LTO (especially link speed)
        - forward-edge CFI (software and hardware support)
        - backward-edge CFI (software and hardware support)
        - stack variable auto-initialization
        - caller-saved register wipe on function return
        - integer overflow detection
        - stack clash protection
        - implicit fall-through
        - memory tagging

        Speaker: Kees Cook (Google)
      • 11:30 AM
        AM Break
      • 116
        Update on the LLVM port of the Linux Kernel

        This topic will cover how the LLVM port of the linux kernel is going, where it’s being used, and some of the pain points still plaguing those efforts. The issues the kernel port is having almost always are the same issues that other projects have porting from gcc to clang.

        A lot of updates have been made to both the kernel and to llvm/clang which are making both projects better.

        Speaker: Behan Webster
      • 117
        Compact C Type Format Support in the GNU toolchain

        A brief introduction to CTF and its recent addition to the GNU toolchain: what is it for, what's there now, what improvements are planned, and why you might want to use this stuff rather than DWARF.

        What cool things might we be able to do now that C programs can inspect their own types cheaply? What cool things might we be able to do if we extend this to other languages, so C programs could introspect into other languages' type systems?

        A particular focus of interest will be finding out how CTF could help BTF, and vice versa: they are doing similar but slightly different things, and surely the two schemes could cooperate to the benefit of both.

        Speakers: Nick Alcock (Oracle Corporation), Indu Bhagat (Oracle Corporation)
      • 118
        eBPF support in the GNU Toolchain

        This proposal covers the ongoing effort about adding eBPF support to the GNU Toolchain.

        Binutils support is already upstream [1]. This includes a CGEN cpu description, assembler, disassembler and linker. A GCC backend will be submitted for inclusion upstream before September.

        Both the binutils and GCC ports will be briefly described, and then a list of points will be discussed with the kernel community, and also with the llvm people present.

        The main goals of the sessions are:
        1) to ensure the port is useful to the eBPF community and
        2) to agree on ABI (with the kernel) and interoperability (with llvm.)

        [1] https://sourceware.org/ml/binutils/2019-05/msg00306.html

        Speaker: Mr Jose E. Marchesi (Oracle Inc, GNU Project)
    • Android MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      Building on the Treble and Generic System Image work, Android is
      further pushing the boundaries of upgradibility and modularization with
      a fairly ambitious goal: Generic Kernel Image (GKI). With GKI, Android
      enablement by silicon vendors would become independent of the Linux
      kernel running on a device. As such, kernels could easily be upgraded
      without requiring any rework of the initial hardware porting efforts.
      Accomplishing this requires several important changes and some of the
      major topics of this year's Android MC at LPC will cover the work
      involved. The Android MC will also cover other topics that had been the
      subject of ongoing conversations in past MCs such as: memory, graphics,
      storage and virtualization.

      Proposed topics include:

      Generic Kernel Image
      ABI Testing Tools
      Android usage of memory pressure signals in userspace low memory killer
      Testing: general issues, frameworks, devices, power, performance, etc.
      DRM/KMS for Android, adoption and upstreaming dmabuf heaps upstreaming
      dmabuf cache managment optimizations
      kernel graphics buffer (dmabuf based)
      SDcardfs
      uid stats
      vma naming
      vitualization/virtio devices (camera/drm)
      libcamera unification
      These talks build on the continuation of the work done last year as reported on the Android MC 2018 Progress report. Specifically:

      Symbol namespaces have gone ahead
      There is continued work on using memory pressure signals for uerspace low memory killing
      Userfs checkpointing has gone ahead with an Android-specific solution
      The work continues on common graphics infrastructure
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Karim Yaghmour karim.yaghmour@opersys.com, Todd Kjos tkjos@google.com, Sandeep Patil sspatil@google.com, and John Stultz john.stultz@linaro.org

      • 119
        Generic Kernel Image (GKI) progress

        A year ago at Linux Plumbers, we talked about a generic Android kernel that boots
        and runs reasonably well on any Android device. This talk shares the progress we've made so far on many fronts. A summary of those work streams, problems we discovered along the way and our plans for them. We will talk about our short term goals and long term vision to get Android device kernels as close to the mainline as possible.

        Speaker: Sandeep Patil (Google)
      • 120
        Monitoring and Stabilizing the In-Kernel ABI

        The Kernel's API and ABI exposed to Kernel modules is not something
        that is usually maintained in upstream. Deliberately. In fact, the
        ability to break APIs and ABIs can greatly benefit the development.
        Good reasons for that have been stated multiple times. See e.g.
        Documentation/process/stable-api-nonsense.rst.
        The reality for distributions might look different though. Especially
        - but not exclusively - enterprise distributions aim to guarantee ABI
        stability for the lifetime of their released kernels while constantly
        consuming upstream patches to improve stability and security for said
        kernels. Their customers rely on both: upstream fixes and the ability
        to use the released kernels with out-of-tree modules that are compiled
        and linked against the stable ABI.

        In this talk I will give a brief overview about how this very same
        requirement applies to the Kernels that are part of the Android
        distribution. The methods presented here are reasonable measures to
        reduce the complexity of the problem by addressing issues introduced
        by ABI influencing factors like build toolchain, configurations, etc.

        While we focus on Android Kernels, the tools and mechanisms are
        generally useful for Kernel distributors that aim for a similar level
        of stability. I will talk about the tools we use (like e.g.
        libabigail), how we automate compliance checking and eventually
        enforce ABI stability.

        Speaker: Matthias Männich (Google)
      • 121
        Solving issues associated with modules and supplier-consumer dependencies

        GKI or any ARM64 Linux distro needs a single ARM64 kernel that works across all SoCs. But having a single ARM64 kernel that works across all SoCs has a lot of hurdles. One of them, is getting all the SoC specific devices to be handed off cleanly from the bootloader to the kernel even when all their drivers are loaded as modules. Getting this to work correctly involves proper ordering of events like module loading, device initialization and device boot state clean up. This discussion is about the work that's being done in the upstream kernel to automate and facilitate the proper ordering of these events.

        Speaker: Saravana Kannan (Google)
      • 122
        Android Virtualization (esp. Camera, DRM)

        An update on how we plan to enable multimedia testing on our 'cuttlefish' virtual platform. Overview of missing components for graphics virtualization.

        Speaker: Alistair Delva (Google)
      • 123
        libcamera: Unifying camera support on all Linux systems

        The libcamera project was started at the end of 2018 to unify camera support on all Linux systems (regular Linux distributions, Chrome OS and Android). In 9 months it has produced an Android Camera HAL implementing the LIMITED profile for Chrome OS, and work is in progress to implement the FULL profile. Two platforms are currently supported (Intel IPU3 and Rockchip ISP), with work on additional platforms ongoing.

        First-class Android support doesn't only depend on the effort put on libcamera, but requires cooperation with the Android community and industry. In particular, libcamera has reached a point where it needs to discuss the following topics:

        • Feedback from the Android community on the overall architecture
        • Feedback from SoC vendors on the device-specific interfaces and device support in general
        • Next development steps for libcamera to support the LEVEL 3 profile
        • Contribution of libcamera to Project Treble and integration in AOSP
        • Future of the Android Camera HAL API and feedback from libcamera team

        Discussions regarding the shortcomings of the Linux kernel APIs for Android camera support, and how to address them, is also on-topic as libcamera suffers from the same issues.

        As the Linux Plumbers Conference will gather developers from the Google Android teams, from the Android community, from the Linux kernel media community and from the libcamera project, we strongly believe this is a unique occasion to design the future of camera support in Linux systems all together.

        Speaker: Laurent Pinchart (Ideas on Board Oy)
      • 124
        Emulated storage features (eg sdcardfs)

        Update and discussion of emulated storage on Android

        Speaker: Daniel Rosenberg (Google)
      • 4:30 PM
        Break
      • 125
        Eliminating WrapFS hackery in Android with ExtFUSE (eBPF/FUSE)

        This work proposes to adopt Extended FUSE (ExtFUSE) framework for improving the performance of Android SDCard FUSE daemon, thereby eliminating a need for out-of-tree WrapFS hackery in the Android kernel.

        ExtFUSE leverages eBPF framework for developing extensible FUSE file systems. It allows FUSE daemon in Android to register “thin” eBPF handlers that can serve metadata as well as data I/O file system requests right in the kernel to improve performance. Our evaluation with Android SDCardFS under ExtFUSE shows about 90% improvement in app launch latency with less than thousand lines of eBPF code in the kernel. In the presentation, I will share my findings and progress made to get feedback from the Android kernel developers.

        Overall, this work benefits millions of Android devices that are currently running out-of-tree WrapFS-based code in the kernel for emulating FAT functionality and enforcing custom security checks.

      • 126
        How we're using ebpf in Android networking

        A short update on eBPF in Android networking:

        • how we're using ebpf in Android P on 4.9+ for statistics collection
          and Q on 4.9+ for xlat464 offload, with a focus on the sorts of
          problems we've run into
        • where we'd like to go, ie. future plans with regard to xlat464/forwarding/nat
          offload and XDP.
      • 127
        Linaro Kernel Functional Testing (LKFT): functional testing of android common kernels

        As part of the Android Microconference:

        Linux Kernel Functional Test is a system to detect kernel regressions across the range of mainline, LTS and Android Common kernels. It is able to run a variety of operating systems from Linux to Android across an array of systems under test. You're probably thinking in terms of standard test suites like CTS, VTS, LTP, kselftest and so on and you're be right. We'll talk about how things have been going over the past year and some of the challenges face when testing at scale.

        The 'F' in LKFT is for Functional, and during this interactive session we will explore how to continue to make strides beyond pass/fail tests. Kernel regressions aren't just an option that once worked now is failing. They also include degradation in performance. The session will explore the recent add to LKFT involving the Energy Aware Scheduler (EAS) with boards that have power probes on hardware. Last we'll talk about audio and some things we've been exploring with testing the audio stack on Android.

        Speaker: Tom Gall (Linaro)
      • 128
        Handling memory pressure on Android

        Topic will discuss how Android framework utilizes new kernel features
        to better handle memory pressure. This includes app compaction, new
        kill strategies and improved process tracking using pidfds.

        Speaker: Suren Baghdasaryan (Google)
      • 129
        DMABUF Developments

        To discuss recent developments and directions with DMABUF:
        * DMABUF Heaps/ION destaging
        * Better DMABUF ownership state machine documentation
        * DMABUF cache maintenance optimizations
        * Kernel graphics buffer idea

        Speakers: Sumit Semwal, John Stultz (in absentia)
      • 130
        DRM/KMS for Android, adoption and upstreaming

        A short update on the status of DRM/KMS ecosystem adoption and how Google is improving verification of the DRM display drivers in Android devices.

        Speaker: Alistair Delva (Google)
      • 131
        scheduler: uclamp usage on Android

        Android has been using an out-of-tree schedtune cgroup controller for
        task performance boosting of time-sensitive processes. Introduction of
        utilization clamping (uclamp) feature in the Linux kernel opens up an opportunity to adopt an upstream mechanism for achieving this goal. The talk will present our plans on adopting uclamp in Android.

        Speaker: Suren Baghdasaryan (Google)
      • 132
        ARM v8.5 Memory Tagging Extension

        What is MTE and why we do need to add the support for the Linux Userspace? Memory Tagging is an ARMv8.5 extension and provides architectural support for run-time detection of various classes of memory errors. It can be used to aid with software debugging to eliminate vulnerabilities before they can be exploited (i.e. bounds violations, use-after-free,use-after-return, use-out-of-scope and use-before-initialisation).

        What does MTE support for a Linux Userspace application mean? We can divide this topic in two main parts: userspace awareness (initialization, relaxation of the ABI, paging support for the tags, swapping) and userspace debugging (enable tagging in the userspace memory allocator).

        The presentation will briefly introduce the MTE concepts trying to put them in the context of what is required for the Linux OS support. It will focus then on the enablement of the ARMv8.5 extension in the userspace trying to analyze the challenges that we faced during the endeavor: memory alignment, tags management, memory impact, etc.

        Speaker: Vincenzo Frascino (ARM)
    • Birds of a feather (BoF) Ametista/room-I (Corinthia Hotel Lisbon)

      Ametista/room-I

      Corinthia Hotel Lisbon

      50

      Our BoF session proposes topics as informal meeting during the conference. The topic lead (submitter) will drive the conversations on the area of interest described in each BoF.

      The attendees group together based on a shared interest and carry out discussions without any pre-planned agenda.

      • 133
        Linux in Safety Critical Systems

        It looks like there may well be enough critical folks present to have a good BOF about safety and linux. Topics can include safety processes and methodologies, tooling to support analysis, security update concerns, etc. Basically, if you're interested in using Linux in safety critical systems come join, and we'll see where the conversation goes.

        Speakers: Kate Stewart (Linux Foundation), Lukas Bulwahn (BMW AG)
      • 134
        Formal Methods for the Linux Kernel

        This BoF session aims to bring together Linux kernel developers who have an interest in formal methods (or formal methods experts with an interest in kernel development). Topics for discussion:

        • A poll of formal methods currently used in the context of the Linux kernel: SPIN, TLA+, CBMC, herd, plain English etc.
        • High level design specification vs. low level algorithm modelling. What properties people seek to verify?
        • Bridging the gap between formal models and the actual code: built-in run-time verification (e.g. lockdep), CBMC-based kernel self-tests, event trace analysis. Any other suggestions?
        • How to encourage wider adoption of formal methods by kernel developers (e.g. help reduce the ramp-up time)
        • Potential for a consolidated repository of formal specs (or in-kernel directory)
        Speaker: Catalin Marinas
      • 4:30 PM
        Break
      • 135
        Persistent Memory as Memory

        Discussion of using Persistent Memory as first- (or second-) class memory.

        Google has a successful prototype of a software-managed "Transparent" mode for 3dXPoint / AEP memory, but we're working on re-designing this into something that is more supportable and at least partially upstreamable.

        We want to open a discussion of how we can represent this "swap"-like use of AEP sensibly.

        Speaker: Jonathan Adams (Google)
      • 136
        Civil communication in practice: What does it mean to you as an open source developer?

        Code review is a collaborative activity involving sentiments and emotions that can affect developers' productivity, creativity, and contribution satisfaction. Discussions in a code review environment in open source could get spirited at times as people from diverse backgrounds and interests are part of it. As a consequence, open source communities have become introspective and started to think about the extent to which the differences in communication styles during code reviews can actually affect the morale of the community. Even though many open source projects have started to establish a code of conduct formalizing ground rules for communication between participants with the goal to make everyone comfortable in contributing to the open source project, we still have a need to understand how communication and feelings surrounding it happen in practice.

        To address those needs, we propose a BOF with the Linux Community. The goal is to do a short survey focusing on analyzing e-mails from the Linux Kernel Mailing List (LKML) to understand the differences in communication styles and how they impact the Linux community. As a result of this BOF, we will be able to provide valuable information to help communities write their guidelines for code reviews or tools to improve communication in a code review environment.

        Speakers: Isabella Ferreira (Polytechnique Montréal), Kate Stewart (Linux Foundation), Shuah Khan (The Linux Foundation), Daniel German (University of Victoria), Bram Adams (Polytechnique Montréal)
    • Containers and Checkpoint/Restore MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

      Last year's edition covered a range of subjects and a lot of progress has been made on all of them. There is a working prototype for an id shifting filesystem some distributions already choose to include, proper support for running Android in containers via binderfs, seccomp-based syscall interception and improved container migration through the userfaultfd patchsets.

      Last year's success has prompted us to reprise the microconference this year. Topics we would like to cover include:

      Android containers
      Agree on an upstreamable approach to shiftfs
      Securing containres by rethinking parts of ptrace access permissions, restricting or removing the ability to re-open file descriptors through procfs with higher permissions than they were originally created with, and in general how to make procfs more secure or restricted.
      Adoption and transition of cgroup v2 in container workloads
      Upstreaming the time namespace patchset
      Adding a new clone syscall
      Adoption and improvement of the new mount and pidfd APIs
      Improving the state of userfaultfd and its adoption in container runtimes
      Speeding up container live migration
      Address space separation for containers
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      Etherpad: https://etherpad.net/p/LPC2019_Containers_and_Checkpoint_Restore

      MC leads
      Stéphane Graber stgraber@stgraber.org, Christian Brauner christian@brauner.io, and Mike Rapoport mike.rapoport@gmail.com

      • 137
        Opening session
        Speaker: Stéphane Graber (Canonical Ltd.)
      • 138
        CRIU and the PID dance

        CRIU only restores processes with the same PID the processes used to have during checkpointing. As there is no interface to create a process with a certain PID like fork_with_pid() CRIU does the PID dance to restore the process with the same PID as before checkpointing.

        The PID dance consists of open()ing /proc/sys/kernel/ns_last_pid, write()ing PID-1 to /proc/sys/kernel/ns_last_pid and close()ing it. Then CRIU does a clone() and a getpid() to see if the clone() resulted in the desired PID. If the PID does not match, CRIU aborts the restore.

        This PID dance is slow, racy and requires CAP_SYS_ADMIN.

        Fortunately the newly introduced clone3() offers the possibility to be extended to support clone3() with a certain/desired PID. There are currently (July 2019) discussions how to extend clone3() to be able to use it with a certain PID. By the time LPC has started these patches will probably be already posted. With these patches it should be possible to solve the problems that the PID dance is slow and racy.

        Which leaves the problem of CAP_SYS_ADMIN. This is a problem for CRIU because it is the major reason why CRIU needs to be run as root during restore. If the root and CAP_SYS_ADMIN requirement could be somehow relaxed it would solve the problems for people running CRIU as non-root for container migration as reported during last year's LPC and it would also open up easy CRIU usage in areas like HPC with MPI based checkpointing and restoring running as non-root.

        In this talk we want to give some background how and why CRIU does the PID dance, we want to present our changes based on clone3() to be able to create a process with a certain PID. Then we would like to get feedback from the community if a rootless restore is important and how to relax the CAP_SYS_ADMIN requirement and how this relaxation could be implemented.

        Speaker: Adrian Reber (Red Hat)
      • 139
        Address Space Isolation for Container Security

        Containers are generally percieved less secure than virtual
        machines. Without going into a theological argument about the actual
        state of the affairs, we suggest to explore the possibility of using
        address space isolation inside the kernel to make containers even more
        secure.

        Assuming that kernel bugs and therefore vulnerabilities are inevitable
        it is worth isolating parts of the kernel to minimize damage that
        these vulnerabilities can cause.

        One way to create such isolation is to assign an address space to the
        Linux namespaces, so that tasks running in namespace A have different
        view of kernel memory mappings than the tasks running in namespace B.

        For instance, by keeping all the objects in a network namespace
        private, we can achieve levels of isolation equivalent to running a
        separated network stack.

        Another possible usecase is isolating address spaces for different
        user namespaces.

        Beside marrying namespaces with address spaces we also considering
        implementaiton of isolated memory mappings using mmap()/madvise() so
        that a region of the caller's memory would be hidden from the rest of
        the system.

        We are going to give a short update on current status of our research
        and we are going to discuss implications of the address space
        isolation and possible future directions:

        • What are the trade-offs between letting user-space to control the
          isolation or keeping the control completely in-kernel.

        • What should be user-visible interface for address space management?
          Does it need to be on/off switch at kernel command line or do we
          need runtime knobs for that? Or maybe even "address space namespace"
          or "address space cgroup"?

        • How can we evaluate the security improvements beyond empiric
          obvservation that when less code and data are mapped, there are less
          vulnerabilities exposed?

        Speakers: Mike Rapoport, James Bottomley (IBM)
      • 140
        Seccomp Syscall Interception

        Recently the kernel landed seccomp support for SECCOMP_RET_USER_NOTIF which enables a process (watchee) to retrieve a fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (watcher). The watcher will then be able to receive seccomp messages about the syscalls having been performed by the watchee.

        We have integrated this feature into userspace and currently make heavy use of this to intercept mknod() syscalls in user namespaces aka in containers.
        If the mknod() syscall matches a device in a pre-determined whitelist the privileged watcher will perform the mknod syscall in lieu of the unprivileged watchee and report back to the watchee on the success or failure of its attempt. If the syscall does not match a device in a whitelist we simply report an error.

        This talk is going to show how this works and what limitations we run into and what future improvements we plan on doing in the kernel.

        Speaker: Mr Christian Brauner
      • 141
        Update on Task Migration at Google Using CRIU

        Over the last year we have worked on expanding the task migration using CRIU in Google. The talk will discuss how in some cases the kernel interfaces are lacking for the purpose of migration:

        • Lack of support for reading rseq configuration which means that it requires userspace support to migrate users of rseq properly.
        • Lack of support for reading what cgroup events the users have registered for.
        • Many kernel C/R interfaces are protected by CAP_SYS_ADMIN which we deemed unsafe to have for the migrator agent - CAP_RESTORE could be the solution.

        We will discuss new challenges which we have encountered while developing the migration technology further:

        • The lack of clean error classification in CRIU forced us to parse the migration logs.
        • Lack of support for some less often used kernel features in CRIU (e.g. O_PATH, PR_SET_CHILD_SUBREAPER).
        • Migrating containers while also changing the IP of the container is hard but in many cases could be done with little effort on the library or user side.
        • We have finalized streaming migration support on our side and in the process we have realized that the hitless migration is infeasible for our latency sensitive users.
        Speaker: Kamil Yurtsever (Google)
      • 4:30 PM
        Break
      • 142
        Secure Image-less Container Migration

        Container runtimes, engines and orchestrators provide a production-grade, robust, high-performing, but also relatively self-managing, self-healing infrastructure using innovative open-source technologies.

        CRIU allows the running state of containerised applications to be preserved as a collection of files that can be used to create an equivalent copy of the applications at a later time, and possibly on a different system.

        However, for a live migration mechanism to be effective it is very important to minimize the down-time of these applications without compromising security. Therefore, in this talk we discuss new features of CRIU that enable seamless live migration based on direct communication mechanism between source and destination nodes, in order to avoid the generation of intermediate image files and to keep only necessary state information cached in memory.

        Speakers: Mr Radostin Stoyanov (University of Aberdeen), Dr Martin Kollingbaum (University of Aberdeen)
      • 143
        Using the new mount API with containers

        The Linux kernel has recently acquired a new API for creating mounts. This allows a greater range of parameter and parameter values to be specified, including, in the future, container-relevant information such as the namespaces that a mount should use.

        Future developments of this API also need to work out how to deal with upcalling from the kernel to gain parameters not directly supplied, such as DNS records, automount configurations or configuration overrides, whilst preventing namespacing violations through the upcall.

        Speaker: Mr David Howells (Red Hat)
      • 144
        Can we agree on what needs to happen to get shiftfs upstream

        Since Canonical is now shipping it I think we can all agree it solves a problem and we just need to get the patches into shape for upstream submission. Can we discuss a pathway for doing that.

        Speakers: James Bottomley (IBM), Christian Brauner, Mr Seth Forshee (Canonical)
      • 145
        Securing Container Runtimes with openat2 and libpathrs

        Userspace has (for a long time) needed a mechanism to restrict path resolution. Obvious examples are those of FTP servers, Web Servers, archiving utilities, and now container runtimes. While the fundamental issue with privileged container runtimes opening paths within an untrusted rootfs was known about for many years, the recent CVEs (CVE-2018-15664 and CVE-2019-10152 being the most recent) to that effect has brought more light to the issue.

        This is an update on the work briefly discussed during LPC 2018, complete with redesigned patches and a new userspace library that will allow for backwards-compatibility on older kernels that don't have openat2(2) support. In addition, the patchset now has new semantics for "magic links" (nd_jump_link-style "symlinks") that will protect against several file descriptor re-opening attacks (such as CVE-2016-9962 and CVE-2019-5736) that have affected all sorts of container runtimes and other programs. It also provides the ability for userspace to further restrict the re-opening capabilities of O_PATH descriptors.

        In order to facilitate easier (safe) use of this interface, a new userspace library (libpathrs) has been developed which makes use of the new openat2(2) interfaces while also having userspace emulation of openat2(RESOLVE_IN_ROOT) for older kernels. The long-term goal is to switch the vast majority of userspace programs that deal with potentially-untrusted directory trees to use libpathrs and thus avoid all of these potential attacks.

        The important parts of this work (and its upstream status) will be outlined and then discussion will open up on what outstanding issues might remain.

        Speaker: Mr Aleksa Sarai (SUSE LLC)
      • 6:30 PM
        Break
      • 146
        Using kernel keyrings with containers

        The kernel contains a keyrings facility for handling tokens for filesystems and other kernel services to use. These are frequently disabled for container environments, however, because they were not made namespace aware by the authors of the user-namespace and others.

        Unfortunately, this lack prevents various things from working inside containers. To get around this, keys are now being tagged with a namespace tag that allows keys operating in different namespaces to coexist in the same keyring and restrictions have been placed on joining session keyrings across namespaces.

        This still isn't sufficient to make them truly useful here. Intended future developments include: granting a permit to use a key to a container; adding per-container keyrings; request-key upcall namespacing.

        Speaker: Mr David Howells (Red Hat)
      • 147
        Cgroup v1/v2 Abstraction Layer

        Abstract

        We have cgroup v1 users who want to switch to cgroup v2, but there
        currently isn't an upstream migration story for them. (Previous
        LPC talks have focused on the issues of migrating from v1 to v2, but
        no substantial upstream solution has come to fruition.)

        The goal of this talk is to discuss the cgroup v1 to v2 migration
        path and gauge community interest in a cgroup v1/v2 abstraction
        layer.

        Problem Statement

        Several Oracle products have very, very long product lifetimes and
        are designed to run on a wide range of Linux kernels and systemd
        versions. These products are encountering difficulties as cgroups
        continues to grow and change. Older kernels only support v1, but v2
        is the future in newer kernels with v1 effectively in maintenance mode.
        Newer versions of systemd have started to abstract the cgroup interface,
        but upgrading older systems to newer versions of systemd is often not
        feasible. Ultimately, long-lifespan products are spending an increasing
        and inordinate amount of time and effort managing their cgroup interfaces.

        There is interest within Oracle to create a cgroup abstraction layer
        that will allow long-lived products to utilize the most advanced
        cgroups features available on every supported system. Ideally these
        products will be able to rely upon a library to abstract away the
        low-level cgroup implementation details on that system.

        Audience

        Anyone interested in cgroups

        Why Should the Audience Attend and/or Care

        • We would like to develop a cgroups abstraction layer in the next year
          or so. We would love to collaborate with others to build and design a
          solution that can help the entire community

        • Do other people/companies have interest in an abstraction layer? We
          want to hear other use cases and needs to better serve as many people
          as possible

        • Is there already something out there that we can utilize and build on?

        • Given the wide array of users and use cases, the library will likely
          need to have bindings for today's most popular languages - python, go
          java, etc.

        • There are a multitude of API possibilities. What level(s) of abstraction
          are of interest to the community? e.g.
          GiveMeCpus(cgname=foo, cpu_count=2, exclusive=True, numa_aligned=True, ...)
          CgroupCreate(cgname=foo, secure_from_sidechannel=True, ...)

        Speaker: Tom Hromatka
      • 148
        CRIU: Reworking vDSO proxification, syscall restart

        We have a number of unsolved time and vdso related issues in CRIU.

        • Syscall restart: if a task Checkpoint interrupted a syscall, on restore CRIU blindely starts again the syscall (executing SYSCALL/SYSENTER/INT80/etc instruction with the original regset). It works OKish, but not with time blocking syscalls i.e., poll(), nanosleep(), futex() and etc. For this purpose, Glibc and vDSO use restart_syscall(). Which won't work in CRIU as kernel is not aware of interrupted syscall. To solve those issues I suggest to extend PTRACE_GET_SYSCALL_INFO with information from task_struct->restart_block. This way on restore criu will be able to adjust syscall arguments on application Restore.
        • vDSO proxification. There is a chance that between Checkpoint and Restore events vDSO code may change. That may be in example, migration to another node or updating the kernel on the very same node. The old vDSO code can't be used anymore as vvar physical page can be missing [migration to an older kernel] or it may have different offsets. CRIU deals with that by mmaping old vdso code and patching entries with jumps to a new vdso. That's far from being perfect: the original application could have being Checkpointed while executing vdso code, but luckily we haven't got any reports about crashes on restore so far! Addressing this problem, we could add symbol table to vvar and got/plt tables to vdso, allowing CRIU to do linker job on restore by patching relocations on older vdso to newer vvar. The other approach would be making proxification process more correct: we could single-step application on Checkpoint from bytes that might be patched on Restore (JUMP_PATCH_SIZE). But additional trouble would be signals which may have being delivered while application was executing the very same bytes. That can be solved probably with hijacking SA_RESTORER..
        Speakers: Dmitry Safonov, Andrei Vagin
      • 149
        Closing session
        Speaker: Stéphane Graber (Canonical Ltd.)
    • Power Management and Thermal Control MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      The focus of this MC will be on power-management and thermal-control frameworks, task scheduling in relation to power/energy optimizations and thermal control, platform power-management mechanisms, and thermal-control methods. The goal is to facilitate cross-framework and cross-platform discussions that can help improve power and energy-awareness and thermal control in Linux.

      Prospective topics:

      CPU idle-time management improvements
      Device power management based on platform firmware
      DVFS in Linux
      Energy-aware and thermal-aware scheduling
      Consumer-producer workloads, power distribution
      Thermal-control methods
      Thermal-control frameworks
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Rafael J. Wysocki (rafael@kernel.org) and Eduardo Valentin (edubezval@gmail.com)

      • 150
        Multiple thermal zones representation

        The current design of the thermal framework forces the usage of a governor with a thermal zone thus limiting the scope of the decisions.
        The question of the multiple thermal zones representation and how they are handled by a governor was put several times on the table but without a clear consensus.
        In order to go forward in this area, this MC topic proposes a simple design with a hierarchical thermal zones representation and how they can be managed by a governor. The design keeps the compatibility with the current flat representation.

        Speaker: Daniel Lezcano (Linaro)
      • 151
        Performance guarantees under thermal pressure

        Performance capping due to thermal limitations is common scenario particularly in mobile systems. Today user-space has no information about what level of performance that can be expected worst case and SCHED_DEADLINE can admit reservations which are impossible to fulfill.
        The purpose of the this topic is to discuss what level guarantees the kernel should provide. Should the kernel have a platform specific or tunable sustained performance level?

        Speaker: Morten Rasmussen (Arm)
      • 152
        Task-centric thermal management

        Thermally unsustainable compute demand is in most systems controlled by reducing performance through disabling performance states on specific CPUs or other devices in the system. It provides an efficient method to ensure the system doesn't overheat, however, it doesn't take the actual workload into account which could be better served if the performance caps were applied differently.
        The intention with this topic is to discuss the idea of controlling tasks, i.e. compute demand (potentially from user-space), instead of controlling devices directly.

        Speaker: Morten Rasmussen (Arm)
      • 153
        Improving producer-consumer type workload performance

        When each CPU core can independently control its performance states, then there is performance loss on some benchmarks compared to the case when there are no independent performance states. There are couple of options to indicate to the cpufreq drivers when a producer thread wakes a consumer thread: One sending some hints like we do for IO boost or give boost PELT utilization. But there is a challenge in cleanly identifying a producer/consumer relationship in scheduler code. There are several ways a thread can wait and get signaled to wake in Linux.
        They don't end up in one place in scheduler code to cleanly implement. I experimented a case where futex are used between producer and consumers, where a hint is passed when cpufreq drivers to give small boost.
        The idea here is to discuss:
        - Shall we solve this problem?
        - How to unify wait and wake up functions?
        - Is it better to give a hint or boost PELT utilization of the consumer?

        Speaker: Srinivas Pandruvada
      • 4:40 PM
        Break
      • 154
        Device power management based on platform firmware

        Continuing the attempts to reducing fragmentation in power management on ARM platforms, there are discussions if something similar to ACPI can be done.i.e. device centric power management.

        Currently, a device has power, performance, reset, and clock domains associated with it. SCMI provides interface to deal with these domains directly. This was simpler approach to start with the SCMI specification to keep the OSPM related changes minimal. So for a given device it's power, performance, reset, clock,...etc domains need to be known and appropriate requests should be made on those domains when needed. Since this list seem to ever growing on ARM platforms, like pinmux, gpio, iomux,...etc, the current approach is not sustainable for long.

        Instead of this, there's a thought on making these device centric and drive it.
        So OSPM need not care which power/perf/reset/clock domain it belongs. All the details are abstracted from OSPM completely.

        This talk is to discuss and understand where how to drive this platform firmware based device power management from Linux kernel. Which existing subsystem to reuse ?

        Speaker: Mr Sudeep Holla (ARM)
      • 155
        Taking suspend/resume validation to the next level

        At LPC 2015, we introduced analyze_suspend, a new open source tool to show where the time goes during Linux suspend/resume. Now called "sleepgraph", it has evolved in a number of ways over the last four years. Most importantly, it is now the core of a framework that we use for suspend/resume endurance testing.

        Endurance testing has allowed us to identify, track, report and sometimes fix issues that developers used to dismiss as "unreproducible".

        But to improve Linux suspend/resume quality further, we need more people testing different machines and reporting bugs. This is an appeal for ideas how the power of the broader open source community can be harnessed to improve Linux suspend/resume quality.

        Speaker: Len Brown (Intel Open Source Technology Center)
      • 156
        C-state latency measurement infrastructure

        We in Intel developed instrumentation for measuring C-state wake latency. The instrumentation, which we call "waltr" (WAke up Latency Tracer) consists of user-space and kernel modules parts.

        In principle, waltr works by scheduling delayed interrupts and measuring the wake latency close to the x86 'mwait' x86 instruction. This requires an external device equipped with high precision clock and capable of delayed interrupts. We have been
        using the Intel i210 Ethernet adapter for these purposes. But theoretically this
        could be a completely different device, e.g., a GFX card.

        The C-state latency measurement instrumentation should be very useful for the open-source community and we would like to upstream the kernel parts of it. We are seeking for feedback on how to properly modify the kernel in a maintainable and reusable way, to benefit everyone.

        Here are few examples for the dilemmas have.
        * How do we design a framework for compliant devices like the i210 adapter?
        * What would be the right user-space API for the delayed interrupts provider?
        * How do we take snapshots of C-state counters and deliver them to user-space?

        I am asking for a 20-30 minutes time-slot. And I am hoping to talk to people more about this in hallway discussions.

        Speaker: Artem Bityutskiy
      • 157
        CPU Idle Time Management Improvements

        There are some improvements in the CPU idle time management to be made, like switching over to using time in nanoseconds (64-bit), reducing overhead and some governor modifications (including possible deprecation of the menu governor) which need to be discussed.

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 6:40 PM
        Break
      • 158
        Power Management and Thermal Control BoF Sessions
    • Birds of a feather (BoF) Ametista/room-I (Corinthia Hotel Lisbon)

      Ametista/room-I

      Corinthia Hotel Lisbon

      50

      Our BoF session proposes topics as informal meeting during the conference. The topic lead (submitter) will drive the conversations on the area of interest described in each BoF.

      The attendees group together based on a shared interest and carry out discussions without any pre-planned agenda.

      • 159
        Csky Intro - what's the meaning of a new arch for linux

        The csky architecture officially merged the main line in linux-4.20. Before that, eight architectures have just been removed from the main line. Many people ask what is the meaning of csky upstream? Also includes our colleagues. Here, we will give some examples to introduce the progress of the csky architecture in the past six months and the value and significance of linux-csky. This is an open discussion about the csky architecture and any questions are welcomed.

        Speakers: Mr Guo Ren, Mr Han Mao (c-sky.com (belong to Alibaba.com))
      • 160
        New hardware with modern I2C address conflicts

        For some time now, special camera setups exist having features which are challenging for I2C address layouts as we know them in Linux: a) a high-speed serial link which can embed I2C communication (e.g. GMSL or FPD-Link III) and b) the ability to reprogram the client addresses of the I2C devices on the camera.

        The use case for these cameras is to run multiple of them in parallel, and not just a single one. To be easily pluggable, they don't have a way to configure the I2C addresses they need. They use initially all the same I2C addresses and rely on software to reprogram them and sort out that problem.

        The really tricky thing is now that they are connected to the same serial high speed link. As a result, all the clients with initially equal addresses sit (more or less, depending on the link) on the same I2C bus as well and need to be carefully reprogrammed one-by-one to a unique address.

        The camera setup above is the primary example we are facing right now. Some early implementations for GMSL and FPD-Link exist with different approaches to map the I2C topology. However, there might be other hardware facing very similar problems. We definitely want to have you in the room.

        An introductory talk gives a few more details of current implementations, and explains the current problems in abstracting all this. From there on, we hope to have gathered enough highly interested people for discussion, opinions, and brainstorming. The goal is, of course, to enhance the I2C core to provide reasonable support for such scenarios which will be beneficial for all users like these high speed links.

        Speaker: Wolfram Sang
      • 11:30 AM
        Ametista
      • 161
        Application-specific accelerators

        Application-specific accelerators are going to start showing up in larger numbers in the times ahead. Today there's often no suitable subsystem for them to aggregate into, and the first of them have landed under drivers/misc for the time being.

        The goal of this BoF is to introduce and discuss the ground rules for a new drivers/accel subsystem, how it fits in with other subsystems, and expectations of contributions in the short and medium term.

        Speaker: Olof Johansson
      • 162
        PCI microconference follow-up

        Discussion around topics related
        to PCI specifications and microconference follow up

        • Root complex integrated endpoints
        • Native host controllers link management
        • VFIO/IOMMU/PCI follow up
    • Databases MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      Databases utilize and depend on a variety of kernel interfaces and are critically dependent on their specification, conformance to specification, and performance. Failure in any of these results in data loss, loss in revenue, or degraded experience or if discovered early, software debt. Specific interfaces can also remove small or large parts of user space code creating greater efficiencies.

      This microconference will get a group of database developers together to talk about how their databases work, along with kernel developers currently developing a particular database-focused technology to talk about its interfaces and intended use.

      Database developers are expected to cover:

      The architecture of their database;
      The kernel interfaces utilized, particularly those critical to performance and integrity
      What is a general performance profile of their database with respect to kernel interfaces;
      What kernel difficulties they have experienced;
      What kernel interfaces are particularly useful;
      What kernel interfaces would have been nice to use, but were discounted for a particular reason;
      Particular pieces of their codebase that have convoluted implementations due to missing syscalls; and
      The direction of database development and what interfaces to newer hardware, like NVDIMM, atomic write storage, would be desirable.
      The aim for kernel developers attending is to:

      Gain a relationship with database developers;
      Understand where in development kernel code they will need additional input by database developers;
      Gain an understanding on how to run database performance tests (or at least who to ask);
      Gain appreciation for previous work that has been useful; and
      Gain an understanding of what would be useful aspects to improve.
      The aim for database developers attending is to:

      Gain an understanding of who is implementing the functionality they need;
      Gain an understanding of kernel development;
      Learn about kernel features that exist, and how they can be incorporated into their implementation; and
      Learn how to run a test on a new kernel feature.
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Daniel Black daniel@linux.ibm.com

      • 163
        Open Session

        Quick introduction of people. Frame discussion. Will be quick I promise.

        Speaker: Daniel Black (IBM)
      • 164
        io_uring - excitement - looking for feedback & potential issues

        many devs are excited about the progress reported on this new stuff, but is it followed / considered by kernel devs.? what kind of gain to expect? any potential issues or feedback to share?

        Speaker: Dimitri KRAVTCHUK
      • 165
        disk write barriers

        for example, for a write-ahead logging, one needs to guarantee that writes to log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is an overkill for this.

        Speaker: Sergei Golubchik
      • 166
        Filesystem atomic writes / O_ATOMIC

        seems like the patches proposed by Fusion-io devs for general O_ATOMIC support within Linux kernel are in stand-by since 6 years.. -- any plans to address it ?.. What is the main reason to not guarantee atomicy of O_DIRECT writes on flash drives? -- seems like most of flash storage vendors are able to provide atomic writes support on HW level, and just SW level (kernel/FS/etc.) is missed.. The main benefit for MySQL/InnoDB is to get a rid of "double write" to protect from data corruption (partially written pages) -- so, every page is written twice, increasing IO write traffic + doubling page write latency + reducing by half flash drive life expectation..

        Speaker: Dimitri KRAVTCHUK
      • 167
        MySQL @EXT4 performance impacts with latest Linux kernels

        since newer kernels (4.14, 5.1, ..) we are observing 50% regression on MySQL IO-bound workloads using EXT4 comparing to the same results on the same HW, but running kernel 3.x or 4.1. Unfortunately we have absolutely no explanation for this regression right now and looking for any available FS layer instrumentation/visibility to understand what is the root problem for such a regression and how it can be by-passed from MySQL code (or fixed if the problem is in EXT4)..

        (more details are expected up to conference date)

        Speaker: Dimitri KRAVTCHUK
      • 168
        MySQL @XFS

        historically XFS was always showing lower performance comparing to EXT4 on most of IO-bound workloads used for MySQL/InnoDB benchmark testing.. However, since the new kernels & XFS arrived, we observed significantly better results on XFS now -vs- EXT4 particularly when InnoDB "double write" is enabled. From the other side, for our big surprise, XFS was doing worse if "double write" was disabled (which is nonsense, because how overall performance can be worse if we do twice less IO writes on the same IO-bound workload?) -- fortunately we found a workaround to by-pass this issue, but still lacking deep understanding of the problem and observation/ visibility details from XFS layer).. -- all is looking like a kind of IO starvation, but how it can be detected on time and ahead?..

        (more details are expected up to conference date)

        Speaker: Dimitri KRAVTCHUK
      • 11:30 AM
        Jade
      • 169
        What SQLite Devs Wish Linux Filesystem Devs Knew About SQLite

        (1) SQLite is the most widely used database in the world. There are probably in excess of 300 billion active SQLite databases on Linux devices. SQLite is a significant client of the Linux filesystem - perhaps the largest single non-streaming client, especially on small devices such as phones.

        (2) Unlike other relational database engines, SQLite tends to live out on the edge of the network, not in the datacenter.

        (3) An SQLite database is a single ordinary file in the filesystem. The database file format is well-defined and stable. The US Library of Congress designates SQLite database files as a recommended format for long-term archive storage of structured data.

        (4) SQLite is not a client/server database. SQLite is a library. The application makes a function call that contains SQL text and SQLite translates that SQL into a sequence of filesystem operations that implement the desired operation, all within the same thread. There is no messaging and no IPC. There is no server process that hangs around to coordinate access to the database file.

        (5) SQLite does not get to choose a filesystem type or mount options. It has to make due with whatever is at hand. Therefore, SQLite really wants to be able to discover filesystem properties at run-time, so that it can tune its behavior for maximum performance and reliability.

        (6) Diagrams showing how SQLite creates the illusion of atomic commit on a non-atomic filesystem.

        Speaker: Dr Richard Hipp (SQLite)
      • 170
        IO: Durability, Errors and Documentation

        Postgres (and many other databases) have, until fairly recently, assumed that IO errors would a) be reliably signalled by fsync/fdatasync/... b) repeating an fsync after a failure would either result in another failure, or the IO operations would succeed.

        That turned out not to be true: See also https://lwn.net/Articles/752063/

        While a few improvements have been made, both in postgres and linux, the situation is still pretty bad.

        From my point of view, a large part of the problem is that linux does not document what error and durability behaviour userspace can expect from certain operations.

        Problematic areas for the kernel:
        - The regular behaviour of durability fs related syscalls are not documented. One extreme example of that is sync_file_range (look at the warning section of the manpage)
        - FS behaviour when encountering IO errors is poorly, if at all, documented. For example: there still is no documentation about the error behaviour of fsync, ext4's errors= operation reads as if it applied to all IO errors, but only applies to metadata errors.
        - There is very little consistency for error behaviour between filesystems. To the degree that XFS will return different data after writeback failed than ext4.
        - There is no usable interface to query / be notified of IO errors
        - the rapid development of thin provisioned storage has increased the likelihood of IO errors drastically, as large parts of the IO stack treat out-of-space on the block level as an IO error

        It seems worthwhile to work together to at least partially clean this up.

        Speakers: Andres Freund (EnterpriseDB / PostgreSQL), Mr Tomas Vondra (Postgresql)
      • 171
        Time series of thread profiles in production

        At MongoDB, we implemented an eBPF tool to collect and display a complete time-series view of information about all threads whether they are on- or off-CPU. This allows us to inspect where the database server spends its time, both in userspace and in kernel. Its minimal overhead allows to deploy it in production.

        This can be an effective method to collect diagnostic information in the field and surface a specific workload which is bound by a syscall. It would be interesting to hear what solution other vendors use to profile in production.

        Speaker: Josef Ahmad (MongoDB Inc.)
      • 172
        New InnoDB REDO log design and MT sync challenges

        since MySQL 8.0 we have a newly redesigned lock-free REDO log implementation. However, this development involved several questions about overall efficiency around MT communications and synchronization. Curiously spinning on CPU showed to be the most efficient on low load.. -- but any plans to implement "generic" MT framework for more efficient execution of any MT apps ?

        Speaker: Mr Pawel OLCHAWA
      • 173
        IP / UNIX Socket Backlog

        there is "backlog" option used in MySQL for both IP and UNIX sockets, but seems like it has a significant overhead on heavy connect/disconnect activity workloads (e.g. like most of Web apps which are doing "connect; SQL query; disconnect") -- any explanation/ reason for this? can it be improved?

        Speaker: Dimitri KRAVTCHUK
      • 174
        IP port -vs- UNIX socket difference on - IP stack is 20-30% slower on MySQL

        MySQL is allowing user sessions connections via IP port and UNIX socket on Linux systems. However, curiously connecting via UNIX socket is delivering up to 30% higher performance comparing to IP local port (loopback).. -- any reason for this? and be "loopback" code improved to match the same level of efficiency as UNIX socket? can the same improvements make over all IP stack to be more efficient?

        Speaker: Dimitri KRAVTCHUK
      • 175
        Regressions due CPU cache issues and missed visibility in Linux/kernel instrumentation

        all MT apps are extremely sensible to CPU cache issues, and MySQL/InnoDB is part of them.. Several times we observed significant regressions (up to 40% and more) due CPU cache miss or simple cache sync due concurrent access to the same variable by several threads, and all "perf" CPU related stats did not show any difference.. Any plans to address it with more deep CPU stats instrumentation?

        Speaker: Mr Pawel OLCHAWA
      • 176
        Syscall overhead from Spectre/Meltdown fixes

        users are very worry about any kind of overhead due kernel patches applied to solve Intel CPU issues (Spectre/Meltdown/etc.) -- what others are observing? what kind of workloads / test cases do you use for evaluation?

        Speaker: Dimitri KRAVTCHUK
      • 177
        Conclusion

        From discussions to code. Where it goes from here?

        Speaker: Daniel Black (IBM)
    • Kernel Summit Track Floriana/room-III (Corinthia Hotel Lisbon)

      Floriana/room-III

      Corinthia Hotel Lisbon

      100

      This year, the Maintainer's and Kernel Summit will be at the Corinthia Hotel in Lisbon, Portugal, September 9th -- 12th. The Kernel Summit will be held as a track during the Linux Plumbers Conference September 9th -- 11th. The Maintainer's Summit will be held afterwards, on September 12th. As in previous years, the "Maintainer's Summit" is an invite-only, half-day event, where the primary focus will be process issues around Linux Kernel Development.

      The "Kernel Summit" is organized as a track which is run in parallel with the other tracks at the Linux Plumber's Conference (LPC), and is open to all registered attendees of LPC. The goal of the Kernel Summit track will be to provide a forum to discuss specific technical issues that would be easier to resolve in person than over e-mail.

      We will reserving roughly some Kernel Summit slots for last-minute discussions that will be scheduled during the week, in an "unconference style". This allows ideas that come up in hallway discussions, and in the LPC miniconferences, to be given
      scheduled, dedicated times for discussion.

      • 178
        Moving the Linux ABI to userspace Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        The ABI between Linux and user software mostly sits at the user/privileged boundary, although many architectures extend this with a small amount of special-case code that sits in userspace, such as in special pages or shared libraries (vDSOs) mapped into each user process [1] that user code can call into.

        The reasons for this are a bit arbitrary: system interface libraries such as glibc and Bionic are maintained as separate projects from the kernel, by different people. The privileged/unprivileged boundary is the de facto demarcation point between projects, because by design only kernel code can run privileged.

        Because Linux's user/privileged boundary and ABI are welded together in this way though, the Linux ABI is forced to evolve (or prevented from doing so) for reasons that have little to do with functionality, such as backwards compatibility for superseded interfaces, and optimisations (e.g., vDSO gettimeofday(), getcpu() etc.).

        Moving implementation of pieces of kernel functionality between privileged space and userspace is currently hard due to the resulting ABI breaks, yet moving functionality into userspace (e.g., into the vDSO) has some interesting potential use cases, such as:

        • Allowing the user/privileged boundary to evolve independently of the kernel ABI.

        • Providing a way to push obsolete, deprecated, redundant and/or regrettable syscalls out of the kernel proper.

        • Making it easier for userspace to refine its own ABI personality: so things like libc, fakeroot etc., can catch and reimplement syscalls in a transparent way.

        • Migrating to a unified library-style ABI instead of relying on a patchwork of bare syscalls, vDSO etc., but without the risk of competing or incompatible implementations.

        Migrating a vDSO function to be implemented in privileged space is straightforward: a stub function can be left in the vDSO for old userspace callers to use: the stub just makes the appropriate syscall.

        The converse is harder, and requires syscall trapping or filtering mechanisms such as BPF or ptrace.

        This presentation will describe some approaches to reflecting syscalls back to userspace, and how feasible they look.

        Things I aim to cover:

        • What mechanisms can be used?

        • How expensive are they, and what breaks?

        • What's the likely overhead of doing all syscalls through a vDSO or similar?

        [1] e.g.,
        Documentation/ABI/stable/vdso
        Documentation/arm/kernel_user_helpers.txt

        Speaker: Dave Martin (ARM Limited)
      • 179
        KUnit - Unit Testing for the Linux Kernel Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        KUnit is a new lightweight unit testing and mocking framework for the Linux kernel. Unlike Autotest and kselftest, KUnit is a true unit testing framework; it does not require installing the kernel on a test machine or in a VM (however, KUnit still allows you to run tests on test machines or in VMs if you want) and does not require tests to be written in userspace running on a host kernel. You can read more about KUnit in this LWN article.

        In the first half of the talk we will provide background on what unit testing is, why we think it is important for the Linux kernel, how KUnit provides a viable unit testing library implementation, and offer a brief demonstration of how it might be used.

        In the second half of the talk we will talk about the future. We will talk about KUnit's roadmap, the challenges that KUnit is facing, how to structure the Linux kernel testing paradigm, and how KUnit fits into it.

        Speaker: Brendan Higgins (Google LLC)
      • 11:30 AM
        Floriana III Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
      • 180
        Reflections on kernel quality, development process and testing Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        In this talk Dmitry will highlight some of the areas for improvement related to release quality, security, and developer experience and productivity. Then try to show that the existing processes, approaches and tools poorly cope with the current scale and rate of change and don't provide adequate quality and developer experience. Lastly Dmitry will advocate that only pervasive changes to the process, tooling and testing approaches can significantly improve the situation.

        Speaker: Dmitry Vyukov (Google)
      • 181
        Discussions on kselftest Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
        Speaker: Shuah Kahn
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 182
        Decoupling ZRAM from a specific backend Floriana/room-III

        Floriana/room-III

        Corinthia Hotel Lisbon

        100

        ZRAM is a compressed RAM based block device implementation which has gotten a lot of use recently primarily in the Android world. ZRAM consists of the block device front-end, compressor back-end and memory allocator back-end. Compressor back-end is accessed via a common API, and therefore it is easy with ZRAM to select the particular compression algorithm that fits your special purpose. As opposed to
        that, selecting a memory allocator back-end for ZRAM is still not possible because ZRAM is using zsmalloc API directly.

        With that said, zsmalloc is not the only kernel allocator for storing compressed objects. There also are zbud (up to 2 objects per page) and z3fold (up to 3 objects per page). Designed to store only integral number of objects per page, these two have deterministic behavior with low I/O latencies. Compression ratio suffers for these two of course -- by much for zbud and not so much for z3fold.

        Still z3fold might be a better choice as a backend for ZRAM when compression ratio is not as important as keeping latencies low. As a z3fold primary author I keep getting questions when it will be available for use with ZRAM, and keep answering that it has to be a result of a wider consensus. To get closer to this, apart from
        zsmalloc / z3fold comparisons, this talk will describe in detail how the existing zpool API should be extended to match ZRAM requirements and whether there is a performance penalty here as this introduces a level of indirection.

        Speaker: Vitaly Wool
      • 4:30 PM
        Break Floriana/room-III (Corinthia Hotel Lisbon)

        Floriana/room-III

        Corinthia Hotel Lisbon

        100
    • LPC Refereed Track Floriana/room-II (Corinthia Hotel Lisbon)

      Floriana/room-II

      Corinthia Hotel Lisbon

      200
      • 183
        Finding more DRAM Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The demand of DRAM across different platforms is increasing but the cost is not decreasing. Thus DRAM is a major factor of the total cost across all kinds of devices like mobile, desktop or servers. In this talk we will be presenting the work we are doing at Google, applicable to Android, Chrome OS and data center servers, on extracting more memory out of running applications without impacting performance.

        The key is to proactively reclaim idle memory from the running applications. For the Android and Chrome OS, the user space controller can provide hints of the idle memory at the applications level while the servers running multiple workloads, an idle memory tracking mechanism is needed. With such hints the kernel can proactively reclaim memory given that estimated refault cost is not high. Using in-memory compression or second tier memory, the refault cost can be reduced drastically.

        We have developed and deployed the proactive reclaim and idle memory tracking across Google data centers [1]. Defining idle memory as memory not accessed in the last 2 mins, we found 32% idle memory across data centers and we were able to reclaim 30% of this idle memory, while not impacting the performance. This results in 3x cheaper memory for our data centers. 98% of the applications spend only around 0.1% of their CPU on memory compression and decompression. Also the idle memory tracking on average takes less than 11% of a single logical CPU.

        The cost of proactive reclaim and idle memory tracking is reasonable for the data centers cost of ownership of memory, however, it imposes challenges for power constrained devices based on Android and Chrome OS. These devices run diverse applications e.g. Chrome OS can run Android and Linux in a VM. To that end, we are working on making idle memory tracking and proactive reclaim feasible for such devices. Henceforth, we are interested and would like to initiate discussion on making proactive reclaim useful for other use-cases as well.

        [1] Software-Defined Far Memory in Warehouse-Scale Computers, ACM ASPLOS 2019.

        Speakers: Shakeel Butt (Google), Suren Baghdasaryan (Google), Yu Zhao (Google)
      • 184
        Linux Gen-Z Sub-system Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Gen-Z Linux Sub-system

        Discuss design choices for a Gen-Z kernel sub-system and the challenges of supporting the Gen-Z interconnect in Linux.

        Gen-Z is a fabric interconnect that connects a broad range of devices from CPUs, memory, I/O, and switches to other computers and all of their devices. It scales from two components in an enclosure to an exascale mesh. The Gen-Z consortium has over 70 member companies and the first version of the specification was published in 2018. Past history for new interconnects suggests we will see actual hardware products two years after the first specification - in 2020. We propose to add support for a Gen-Z kernel sub-system, a Gen-Z component device driver environment, and user space management applications.

        A Gen-Z sub-system needs support for these Gen-Z features:

        • Registration and enumeration services that are similar to existing
          sub-systems like PCI.
        • Gen-Z Memory Management Unit (ZMMU) provides memory mapping and access to fabric addresses. The Gen-Z sub-system can provide services to track PTE entries for the two types of ZMMU's in the specification: page grid and page table based.
        • Region Keys (R-Keys) - Each ZMMU page can have R-Keys used to validate page access authorization. The Gen-Z sub-system needs to provide APIs for tracking, freeing, and validating R-Keys.
        • Process Address Space Identifier (PASID) - ZMMU requester and responder Page Table Entries (PTEs) contain a PASID. The Gen-Z sub-system needs to provide APIs for tracking PASIDs.
        • Data mover - Transmit and receive data movers are optional elements in bridges and other Gen-Z components. The Gen-Z sub-system can provide a user space interface to a RDMA driver that uses a Gen-Z data mover. For example, a libfabric Gen-Z provider implementation can use a RDMA driver to access data mover queues.
        • UUIDs - Components are identified by UUIDs. The Gen-Z sub-system provides interfaces for tracking UUIDs of local and remote components. A Gen-Z driver binds to a UUID similarly to how a PCI driver binds to a vendor/device id.
        • Interrupt handling - Interrupt request packets in Gen-Z trigger local interrupts. Local components such as bridges and data movers can also be sources of interrupts.

        We will discuss our proposed design for the Gen-Z sub-system illustrated in the following block diagram:

        Gen-Z Sub-system Block Diagram

        Gen-Z fabric management is global to the fabric. The operating system may not know what components on the fabric are assigned to it; the fabric manager decides which components belong to the operating system. Although user space discovery/management is unusual for Linux, it will allow the Gen-Z sub-system to focus on the mechanism of component management rather than the policy choices a fabric manager must make.

        To support user space discovery/management, the Gen-Z sub-system needs interfaces for management services:

        • Fabric managers need read/write access to component control space in order to do fabric discovery and configuration. We propose using /sys files for each control structure and table.
        • User space Gen-Z managers need notification of management events/interrupts from the Gen-Z fabric. We propose using poll on the bridges' device files to communicate events.
        • Local management services pass fabric discovery events from user space to the kernel. Our proposed design uses generic Netlink messages for communication of these component add/remove/modify events.

        We are leveraging our experience with writing Linux bridge drivers for three different Gen-Z hardware bridges in the design of the Gen-Z Linux sub-system. Most recently, we wrote the DOE Exa-scale PathForward project's bridge driver with data movers (https://github.com/HewlettPackard/zhpe-driver). We wrote drivers for the Gen-Z Consortium's demonstration card that supports a block device and a NIC as well as a driver for the bridge in HPE's "The Machine" that is a precursor to Gen-Z.

        From our work so far, here are questions we would like feedback on:

        • We intend to expose control space in /sys so that user space fabric managers can work. We ask for feedback on the proposed hierarchy and mechanisms.
        • Gen-Z uses PASIDs and the sub-system could use generic PASID
          interfaces. Any interest in this elsewhere in the kernel?
        • We have need of generic IOMMU interfaces since Gen-Z ZMMU needs to interface with the IOMMU in a platform independent way. Any interest in this elsewhere in the kernel? We saw some patch sets along these lines.
        • We intend to use generic NetLink for communication between user space and the kernel. Any thoughts on that decision?
        • Gen-Z maps huge address spaces from remote components, and to get good performance those mappings need huge pages. Currently, the kernel does not support this use case. We would like to discuss how best to handle these huge mappings.
        • We wrote a parser for the Gen-Z specification's control structure that generates C structures with bitfields. In general, we know the Linux kernel frowns on bitfields. Are bitfields ok in this context?
        Speakers: Jim Hull (Hewlett Packard Enterprise), Betty Dall (HPE), Keith Packard (Hewlett Packard Enterprise)
      • 11:30 AM
        Floriana II Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
      • 185
        pidfds: Process file descriptors on Linux Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        Traditionally processes are identified globally via process identifiers (PIDs). Due to how pid allocation works the kernel is free to recycle PIDs once a process has been reaped. As such, PIDs do not allow another process to maintain a private, stable reference on a process. On systems under pressure it is thus possible that a PID is recycled without other (non-parent) processes being aware of it. This becomes rather problematic when (non-parent) processes are in charge of managing other processes as is the case for system managers or userspace implementations of OOM killers.

        Over the last months we have been working on solving these and other problems by introducing pidfds – process file descriptors. Among other nice properties, the allow callers to maintain a private, stable reference on a process.

        In this talk we will look at challenges we faced and the different approaches people pushed for. We will see what already has been implement and pushed upstream, look into various implementation details and outline what we have planned for the future.

        Speaker: Mr Christian Brauner
      • 186
        Malloc for everyone and beyond NUMA Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        With heterogeneous computing, program's data (range of virtual addresses) have to move to different physical memory during the lifetime of an application to keep it local to compute unit (CPU, GPU, FPGA, ...). NUMA have been the model used so far but it has assumptions that do not work with all the memory type we now have. This presentation will explore the various types of memory and how we can expose and use them through unified API.

        Speaker: Jerome Glisse (Red Hat)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 187
        Efficient Userspace Optimistic Spinning Locks Floriana/room-II

        Floriana/room-II

        Corinthia Hotel Lisbon

        200

        The most commonly used simple locking functions provided by the pthread library are pthread_mutex and pthread_rwlock. They are sleeping locks and so do suffer from unpredictable wakeup latency limiting locking throughput.

        Userspace spinning locks can potentially offer better locking throughput, but they also suffer other drawbacks like lock holder preemption which will waste valuable CPU time for those lock spinning CPUs. Another spinning lock problem is contention on the lock cacheline when a large number of CPUs are spinning on it.

        This talk presents a hybrid spinning/sleeping lock where a lock waiter can choose to spin in userspace or in the kernel waiting for the lock holder to release the lock. While spinning in the kernel, the lock waiters will queue up so that only the one at the queue head will be spinning on the lock reducing lock cacheline contention. If the lock holder is not running, the kernel lock waiters will go to sleep too so as not to waste valuable CPU cycles. The state of kernel lock spinners will be reflected in the value of lock. Thus userspace spinners can
        monitor the lock state and determine the best way forward.

        This new type of hybrid spinning/sleeping locks combine the best attributes of sleeping and spinning locks. It is especially useful for applications that need to run on large NUMA systems where potentially a large number of CPUs may be pounding on a given lock.

        Speaker: Mr Waiman Long (Red Hat)
      • 4:30 PM
        Break Floriana/room-II (Corinthia Hotel Lisbon)

        Floriana/room-II

        Corinthia Hotel Lisbon

        200
    • Networking Summit Track Floriana/room-I (Corinthia Hotel Lisbon)

      Floriana/room-I

      Corinthia Hotel Lisbon

      180
      • 188
        Scaling container policy management with kernel features Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Cilium is an open source project which implements the Container Network
        Interface (CNI) to provide networking and security functions in modern
        application environments. The primary focus of the Cilium community recently
        has been on scaling these functions to support thousands of nodes and hundreds
        of thousands of containers. Such environments impose a high rate of churn as
        containers and nodes appear and leave the cluster. For each change, the
        networking plugin needs to handle the incoming events and ensure that policy is
        in sync with network configuration state. This creates a strong incentive to
        efficiently interpret and map down cluster events into the required Linux
        networking configuration to minimize the window during which there are
        discrepancies between the desired and realized state in the cluster---something
        that is made possible through eBPF and other kernel features.

        Cilium realizes these policy and container events through the use of many
        aspects of the networking stack, from rules to routes, tc to socket hooks,
        skb->mark to the skb->cb. Modelling the changes to datapath state involves a
        non-trivial amount of work in the userspace daemon to structure the desired
        state from external entities and allow incremental adjustments to be made,
        keeping the amount of work required to handle an event proportional to its
        impact on the kernel configuration. Some aspects of datapath configuration such
        as the implementation of L7 policy have gone through multiple iterations, which
        provides a window for us to explore the past, present and future of transparent
        proxies.

        This talk will discuss the container policy model used by Cilium to apply
        whitelist filtering of requests at layers 3, 4 and 7; memoization techniques
        used to cache intermediate policy computation artifacts; and impacts on
        dataplane design and kernel features when considering large container based
        deployments with high rates of change in cluster state.

        Speaker: Joe Stringer (Cilium.io)
      • 189
        Traffic footprint characterization of workloads using BPF Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Application workloads are becoming increasingly diverse in terms of their network resource requirements and performance characteristics. As opposed to long running monoliths deployed in virtual machines, containerized workloads can be as short lived as few seconds. Today, container orchestrators that schedule these workloads primarily consider their CPU and memory resource requirements since they can easily be quantified. However, network resources characterization isn’t as straight forward. Ineffective scheduling of containerized workloads, which could be throughput intensive or latency sensitive, can lead to adverse network performance. Hence, I propose characterizing and learning network footprints of applications running in a cluster, which can be used while scheduling them in containers/VMs such that their network performance can be improved.

        There is a well-known network issue, which is achieving low latency for mice flows (those that send relatively small amounts of data) by separating them from the elephant flows (those that send a lot of data). I’ve written an eBPF program in C that runs at various hook points in the Linux connection tracking (aka conntrack) kernel functions in order to detect network elephant flow, and attribute them to the container or VM, where the flows ingress or egress from. The agent that loads this eBPF program from user space runs in every host in a cluster. It then feeds this learnt information to a container (or VM) scheduling system such that they can use this information proactively, while scheduling containerized workloads with light network footprint (e.g., microservices, functions) and heavy network footprint (e.g., data analytics, data computational applications) on the same cluster, in order to improve their latency and throughput, respectively.

        eBPF facilitates running the programs with minimal CPU overhead, in a pluggable, tunable and safe manner, and without having to change any kernel code. It’s also worthwhile to discuss how the workload’s learnt network footprint can be used for dynamically allocating or tuning Linux network resources like bandwidth, vcpu/vhost-net allocation, receive-side scaling (RSS) queue mappings, etc.
        I'll submit a paper with the (working) source code snippets and details if the talk is accepted.

        Speaker: Aditi Ghag (VMware)
      • 11:30 AM
        Floriana I Floriana/room-I (Corinthia Hotel Lisbon)

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 190
        Improving Route Scalability with Nexthop Objects Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Route entries in a FIB tend to be very redundant with respect to nexthop configuration with many routes using the same gateway, device and potentially encapsulations such as MPLS. The legacy API for inserting routes into the kernel requires the nexthop data to be included with each route specification leading to duplicate processing verifying the nexthop data, an effect that is magnified as the number of paths in the route increases (e.g., ECMP).

        A new API was recently committed to the kernel for managing nexthops as separate objects from routes. The nexthop API allows nexthops to be created first and then routes can be added referencing the nexthop object. This API allows routes to be managed with less overhead (e.g., dramatically reducing the time to insert routes) and enables new capabilities such as atomically updating a nexthop configuration without touching the route entries using it.

        This talk will discuss the nexthop feature touching on the kernel side implementation, reviewing the userspace API and what to expect for notifications, performance improvements and potential follow on features. While the nexthop API is motivated by Linux as a NOS, it is useful for other networking deployments as well such as routing on the host and XDP.

        Speaker: David Ahern
      • 191
        An Evaluation of Host Bandwidth Manager Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Host Bandwidth Manager (HBM) is a BPF based framework for managing per-cgroupv2 egress and ingress bandwidths in order to provide a better experience to workloads/services coexisting within a host. In particular, HBM allows us to divide a host's egress and ingress bandwidth among workloads residing in different v2 cgroups. Note that although sample BPF programs are included in the BPF patches, one can easily use different algorithms for managing bandwidth.

        This talk presents an evaluation of HBM and associated BPF programs. It explores the performance of various approaches to bandwidth management for TCP flows that use Cubic, Cubic with ECN or DCTCP for their congestion control. For evaluating performance, we consider how well flows can utilize the allocated bandwidth, how many packets are dropped by HBM, increases to RTTs due to queueing, RPC size fairness, as well as RPC latencies. This evaluation is done independently for egress and ingress. In addition, we explore the use of HBM for protecting against incast congestion by also using HBM on the root v2 cgroup.

        Our testing shows that HBM, with the appropriate BPF program, is very effective at managing egress bandwidths regardless of which TCP congestion control algorithm is used, preventing flows from exceeding the allocated bandwidth while allowing them to use most of their allocation. Not surprisingly, effectively managing ingress bandwidth requires ECN, and preferably DCTCP. Finally, we show that using HBM is very effective at preventing packet losses due to incast congestion, as long as we are willing to sacrifice some ingress bandwidth.

        Speaker: Lawrence Brakmo (Facebook)
      • 1:30 PM
        Lunch Sete/Colinas-Restaurant (Corinthia Hotel Lisbon)

        Sete/Colinas-Restaurant

        Corinthia Hotel Lisbon

        20
      • 192
        Closing Plenary (Floriana I/II/III) Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
      • 193
        Bus service for Evening Party Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Buses will start circulating at 7:30PM.

        Last return bus is at 11PM

      • 194
        Closing Party @ Centro Cultural de Belém (CCB) Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180

        Closing Party will be held at the Centro Cultural de Belém (CCB). Accessible by bus starting from the entrance (upstairs) behind the LPC registration desk.

        Last return bus: 11PM

      • 195
        Last Bus service - 11PM Floriana/room-I

        Floriana/room-I

        Corinthia Hotel Lisbon

        180
    • RDMA MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      Following the success of the past 3 years at LPC, we would like to see a 4th RDMA (Remote Direct Memory Access networking) microconference this year. The meetings in the last conferences have seen significant improvements to the RDMA subsystem merged over the years: new user API, container support, testability/syzkaller, system bootup, Soft iWarp, etc.

      In Vancouver, the RDMA track hosted some core kernel discussions on get_user_pages that is starting to see its solution merged. We expect that again RDMA will be the natural microconf to hold these quasi-mm discussions at LPC.

      This year there remain difficult open issues that need resolution:

      RDMA and PCI peer to peer for GPU and NVMe applications, including HMM and DMABUF topics
      RDMA and DAX (carry over from LSF/MM)
      Final pieces to complete the container work
      Contiguous system memory allocations for userspace (unresolved from 2017)
      Shared protection domains and memory registrations
      NVMe offload
      Integration of HMM and ODP
      And several new developing areas of interest:

      Multi-vendor virtualized 'virtio' RDMA
      Non-standard driver features and their impact on the design of the subsystem
      Encrypted RDMA traffic
      Rework and simplification of the driver API
      Previous years:
      2018, 2017: 2nd RDMA mini-summit summary, and 2016: 1st RDMA mini-summit summary

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Leon Romanovsky leon@leon.nu, Jason Gunthorpe jgg@mellanox.com

      • 196
        GUP and ZONE_DEVICE pages

        P2P
        - Suggestion with VFIO (Don)
        - RDMA as the importer, VFIO as the exporter

        get_user_pages() and friends
        - Discussion on future GUP, required to support P2P
        - GUP to SGL?
        - Non struct page based GUP

        hmm_range_fault()
        - Integrating RDMA ODP with HMM
        - 'DMA fault' for ZONE_DEVICE pages

        Speakers: Don Dutile (Red Hat), Jason Gunthorpe (Mellanox Technologies), John Hubbard (NVIDIA)
      • 197
        RDMA, File Systems, and DAX

        For almost 2 years now the use of RDMA with DAX filesystems has been disabled due to the incompatibilities of RDMA and the file system page handling.

        A general consensus has emerged from many conferences and email threads on a path to support RDMA directly to persistent memory which is managed by a filesystem.

        This talk will present the work done since LSFmm to support RDMA and FS DAX.

        Specifically this work requires exclusive layout lease grants to obtain pins.
        Fails truncate operations on file pages which have been given pins. And supports recovery by admins by allowing them to identify offending processes holding these pins.

        Speaker: Mr Ira Weiny
      • 11:30 AM
        Opala
      • 198
        Discussion about IBNBD/IBTRS Upstreaming: Action Items.

        We are going through upstreaming IBNBD/IBTRS 5th iterations, the latest effort is here: https://lwn.net/Articles/791690/.

        We would like to discuss in an open round about the unique features of the driver and the library, whether and how they are beneficial for the RDMA eco-system and what should be the next steps in order to get them upstream.

        A face to face discussion about action items will smooth the path.

        Speakers: Mr Jinpu Wang (1 & 1 IONOS Cloud GmbH), Mr Danil Kipnis (1 & 1 IONOS Cloud GmbH)
      • 199
        Shared IB Objects

        Consider a case of a server with a huge amount of memory and thousands of processes are using it to serve clients requests.

        In such a case, the HCA will have to manage thousands of MRs which will compete for caches and address translation entities.

        The way to improve performance is to allow sharing of IB objects between processes. One process will create several MRs and share them.

        This will reduce the number of address translation entries and cache miss dramatically.

        This talk will cover the implementation of a Shared Object mechanism.

        Speaker: Yuval Shaia (Oracle)
      • 200
        Improving RDMA performance through the use of contiguous memory and larger pages for files.

        As memory sizes grow so do the sizes of the data transferred between RDMA devices. Generally, the Operating system needs to keep track of the state of each of its pieces of memory and that is on Intel x86 a page of 4 KB. This is also connected to hardware providing memory management features such as the processor page tables as well as the MMU features of the RDMA NIC.

        The overhead of the operating system increases as the number of these pages reaches ever higher orders of magnitude. I.e. for 4GB of data one needs 1 million of these page descriptors. Each page descriptor is a 64-byte cache line and thus a 4GB operation requires 64MB of cache lines to be managed.

        A lot of efforts on optimization of I/O focuses on avoiding touching these page descriptors through the use of larger contiguous memory or larger page sizes. This talk gives an overview of the current methods in use to avoid these lowdowns and the work in progress to improve the situation
        and make it less of an effort to avoid these issues.

        Speaker: Christopher Lameter (Jump Trading LLC)
    • Real Time MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      Since 2004 a project has improved the Real-time and low-latency features for Linux. This project has become know as PREEMPT_RT, formally the real-time patch. Over the past decade, many parts of the PREEMPT RT became part of the official Linux code base. Examples of what came from PREEMPT_RT include: Real-time mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The number of patches that need integration has been reduced from previous years, and the pieces left are now mature enough to make their way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged (tm)!

      In the final lap of this race, the last patches are on the way to be merged, but there are still some pieces missing. When the merge occurs, PREEMPT_RT will start to follow a new pace: the Linus one. So, it is possible to raise the following discussions:

      The status of the merge, and how can we resolve the last issues that block the merge;
      How can we improve the testing of the -rt, to follow the problems raised as Linus's tree advances;
      What's next?
      Proposed topics:

      Real-time Containers
      Proxy execution discussion
      Merge - what is missing and who can help?
      Rework of softirq - what is need for the -rt merge
      An in-kernel view of Latency
      Ongoing work on RCU that impacts per-cpu threads
      How BPF can influence the PREEMPT_RT kernel latency
      Core-schedule and the RT schedulers
      Stable maintainers tools discussion & improvements.
      Improvements on full CPU isolation
      What tools can we add into tools/ that other kernel developers can use to test and learn about PREEMPT_RT?
      What tests can we add to tools/testing/selftests?
      New tools for timing regression test, e.g. locking, overheads...
      What kernel boot self-tests can be added?
      Discuss various types of failures that can happen with PREEMPT_RT that normally would not happen in the vanilla kernel, e.g, with lockdep, preemption model.
      The continuation of the discussion of topics from last year's microconference, including the development done during this (almost) year, are also welcome!

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Daniel Bristot de Oliveira bristot@redhat.com

      • 201
        Core Scheduling for RT

        Recently speculative execution techniques have shown that an untrusted application can steal data from another one when both share the same core. To avoid such problems users have to disable SMT, causing non-negligible performance impact. Core-scheduling tries to mitigate the performance problem by allowing trusted applications to run concurrently on siblings of a core while avoiding two untrusted applications to share the same core.

        However, this has a number of ramifications and applications for Real-Time schedulers too. For instance, the Admission Control of SCHED_DEADLINE depends on the number of CPUs, but with core scheduling, the number of CPUs available is a dynamic function. OTOH Real-Time workloads often want SMT disabled for determinism, and core-scheduling gives the capability for a single task to claim an entire core.

        So I propose discussing the impact and possibilities of core-scheduling for Real-Time.

        Speaker: Peter Zijlstra (Intel OTC)
      • 202
        RCU configuration, operation, and upcoming changes for real-time workloads

        RCU has changed a surprising amount over the past few years, what with elimination of many RCU Kconfig options in favor of kernel boot parameters, RCU flavor consolidation, ongoing work on speeding up RCU's handling of offloaded callbacks, and newly started work on providing warnings when RCU's callback handling is overloaded. These changes affect how RCU behaves, and in some cases in ways that affect realtime usage. This talk will summarize the changes relevant to realtime, and outline how this affects configuration and tuning of RCU.

        Speaker: Paul McKenney (IBM Linux Technology Center)
      • 203
        Real-Time Container

        I'd like to review if-how we can build real-time container. It should include but not limited these topics here,

        1. Understanding container Scheduling
        2. Test and evaluations
        3. Possible factors related to latency issues
        4. discussions like tracing containers-leveled metrics
        5. tips
        6. etc.
        Speaker: Tiejun Chen (VMware)
      • 204
        Mathematizing the latency

        We know that reducing the sections with preemption and IRQ disabled reduces the latency, also that IRQs influences on it, but some cases are hard to catch. For example, in the old jump label update, there was a burst of IPIs causing latency spikes. Such non-periodic behavior is hard to mathematize. As a side effect, this adds pessimism to "possible formulas" that tries to define the worst-case latency, mainly regarding IRQs. Daniel would like to discuss his idea about the possible approaches to this problem, without adding unpractical pessimism.

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 11:30 AM
        Esmerelda
      • 205
        Real time softirq mainlining

        Which Real Time softirq implementation do we want for mainline?

        _ Vector-Lock based? (depend on sleeping spinlocks machinery)
        _ Vector masking based?
        _ Other?

        Speaker: Frederic Weisbecker (Suse)
      • 206
        Full dynticks / isolation for Real Time

        _ What is needed upstream for real time support of Full Dynticks and isolation?
        _ Specific requests?

        Speaker: Frederic Weisbecker
      • 207
        PREEMPT_RT: status and Q&A

        In this talk, Thomas Gleixner will present the status of the PREEMPT_RT, along with a section of questions and answers regarding the upstream work and the future of the project.

        Speaker: Thomas Gleixner
    • BPF MC Esmerelda/room-I&II (Corinthia Hotel Lisbon)

      Esmerelda/room-I&II

      Corinthia Hotel Lisbon

      126

      A BPF Microconference will be featured at this year's Linux Plumbers Conference (LPC) in Lisbon, Portugal.

      The goal of the BPF Microconference is to bring BPF developers together to discuss and hash out unresolved issues and to move new ideas forward. The focus of this year's event is on the core BPF infrastructure as well as its many subsystems and related user space tooling.

      The BPF Microconference will be open to all LPC attendees. There is no additional registration required. This is also a great occasion for BPF users and developers to meet face to face and to exchange and discuss developments.

      Similar to last year's BPF Microconference the main focus will be on discussion rather than pure presentation style.

      Therefore, each accepted topic will provide introductory slides with subsequent discussion as the main part for the rest of the allocated time slot. The expected time for one discussion slot is approximately 20 min.

      MC is lead by both BPF kernel maintainers:

      Alexei Starovoitov ast@kernel.org and Daniel Borkmann daniel@cilium.io

      • 208
        Bringing BPF developer experience to the next level

        The way BPF application developers build applications is constantly improving. There are still rough corners, as well as (as of yet) fundamentally inconvenient developer workflows involved (e.g., on-the-fly compilation). The ultimate goal of BPF application development is to provide experience as straightforward and simple as a typical user-land application.

        We'll discuss major pain points with BPF developer experience today and present motivation for solving them. Libbpf and BTF type info integration are at the center of the puzzle that's being put together to provide a powerful and yet less error-prone solution:
        - BPF CO-RE and how it is addressing adapting to ever-changing kernel and facilitates safe and efficient kernel introspection;
        - consistent and safer APIs to load/attach/work with BPF programs;
        - declarative and more powerful ways to define and initialize BPF maps;
        - providing and standardizing BPF-side helper library for all BPF code needs.

        Speaker: Andrii Nakryiko (Facebook)
      • 209
        BPF Debugging

        Debugging BPF program logic is hard these days.
        Developers typically write their programs and
        then checking map values or perf_event outputs
        make sense or not. For tricky issues, temporary
        maps or bpf_trace_printk are used so developer
        can get more insight about what happens. But
        this requires possibly multiple rounds of
        modifying sources, recompilation and redeployment, etc.

        This discussion surrounds creating bpf debugging
        tool, bdb (bpf debugger) similar naming after gdb/lldb.
        This tool should try to do what gdb for ELF execution.
        - specify breakpoints at source/xlated/jitted level
        - retrieve data for registers, stacks and globals(maps)
        and presented at both register and variable level.
        - different conditions to retrieve data, e.g.,
        running 100 times, only if this variable == 1.
        this will require kernel to live patch bpf codes.
        - modifying data (register, stack slot, globals)?
        how does this interact with verifier to ensure safety.
        - this will leverage BTF and existing test_run framework.
        - production debugging vs. qemu debugging
        qemu debugging may be truely single-step.

        Speaker: Yonghong Song
      • 210
        A pure Go BPF library

        At the LSF/MM eBPF track, we discussed the necessity of a common Go
        library to interact with BPF. Since then, Cilium and Cloudflare have
        worked out a proposal to upstream parts of github.com/newtools/ebpf
        and github.com/cilium/cilium/pkg/bpf into a new common library.

        Our goal is to create a native Go library instead of a CGO wrapper
        of C libbpf. This provides superior performance, debuggability and
        ease of deployment. The focus will be on supporting long-running
        daemons interacting with the kernel, such as Cilium or Cloudflare's
        L4 load balancer.

        We’d like to present this proposal to the wider BPF community and
        solicit feedback. We’ll cover the goals and guiding principles we’ve
        set ourselves and our initial roadmap.

        Speakers: Joe Stringer (Isovalent / Cilium), Lorenz Bauer (Cloudflare), Martynas Pumputis
      • 211
        Do we need CAP_BPF_ADMIN?

        Currently, most BPF functionality requires CAP_SYS_ADMIN or CAP_NET_ADMIN. However, in many cases, CAP_SYS_ADMIN/CAP_NET_ADMIN gives the user more than enough permissions. For example, tracing users need to load BPF programs and access BPF maps, so they need CAP_SYS_ADMIN. However, they don't need to modify the system, so CAP_SYS_ADMIN adds significant risk.

        To better control BPF functionality, this is time to think about CAP_BPF_ADMIN (or even multiple CAP_BPF_*s). In this BPF MC, we would like to discuss whether we need CAP_BPF_ADMIN, and what CAP_BPF_ADMIN would look like. We will present survey of major BPF use cases, and identify use cases that may benefit from a new CAP. Then, we will discuss which syscalls/commands should be gated by the new CAP. We expect constructive discussions between the BPF folks and security folks.

        Speaker: Song Liu
      • 4:30 PM
        Break
      • 212
        Reuse host JIT back-end as offload back-end

        eBPF offload is a powerful feature on modern SmartNICs used to accelerate
        XDP or TC based BPF. The current kernel eBPF offload infrastructure was
        introduced for the Netronome NFP based SmartNICs, these were based around a
        proprietary ISA and had some specific verifier requirements.

        In the near future this may be joined by SmartNICs using public ISA's such
        as RISC-V and Arm which also happen to be used as host CPUs. This talk will
        discuss the implications of reusing these ISAs and other back-end features
        for offload to a sea of cores as well as how much of a host CPU back-ends
        can be reused and what additional infrastructure may be needed. As an
        example we will use the current work on a many core RISC-V processor
        ongoing within.

        Speaker: Mr JIONG WANG (Netronome Systems)
      • 213
        Using SCEV to establish pre and post-conditions over BPF code

        Currently, the BPF verifier has to "execute" code at least once and then it can prune branches when it detects the state is the same. In this session we would like to cover a technique called Scalar Evolution (SCEV) which is used by LLVM and GCC to perform optimization passes such as identifying and promoting induction variables and do worst case trip analysis over loops. At its most basic usage SCEV finds the start value of variables, the variables stride and the variables ending value over a block of code. Building a SCEV pass into the BPF verifier would allow us to create a set of pre and post conditions over blocks of BPF codes.

        We see this as potentially useful to avoid "executing" loops in the verifier and instead allowing the verifier to check pre-conditions before entering the loop. And additionally establishing pre and post conditions on function calls to avoid having to execute the verifier on functions repeatedly. We suspect this will likely be necessary to support shared libraries for example.

        The goal of the session will be to do a brief introduction to SCEV. Provide a demonstration of some early prototype work that can build pre and post conditions over blocks of BPF code. Then discuss next steps for possible inclusion.

        Speaker: Mr John Fastabend (Isovalent)
      • 214
        Beyond per-CPU atomics and rseq syscall: subset of eBPF bytecode for the do_on_cpu syscall

        The Restartable Sequences system call [1,2,3,4] introduced in Linux 4.18 has limitations which can be solved by introducing a bytecode interpreter running in inter-processor interrupt context which accesses user-space data.

        This discussion is about the subset of the eBPF bytecode and context needed by this interpreter, and extensions of that bytecode to cover load-acquire and store-conditional memory accesses, as well as memory barrier instructions. The fact that the interpreter needs to allow loading data from userspace (tainted data), which can then be used as address for loads and stores, as well as conditional branches source register, will also be discussed.

        [1] "PerCpu Atomics" http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
        [2] "Restartable sequences" https://lwn.net/Articles/650333/
        [3] "Restartable sequences restarted" https://lwn.net/Articles/697979/
        [4] "Restartable sequences and ops vectors" https://lwn.net/Articles/737662/

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 215
        Kernel Runtime Security Instrumentation (KRSI)

        Existing Linux Security Modules can only be extended by modifying and rebuilding the kernel, making it difficult to react to new threats. The Kernel Runtime Security Instrumentation project (KRSI) (prototype code) aims to help this by providing an LSM that allows eBPF programs to be added to security hooks.

        The talk discusses the need for such an LSM (with representative use cases) and compares it to some existing alternatives, such as Landlock, a separate custom LSM, kprobes+eBPF etc. The second half of the talk outlines the proposed design and interfaces, and includes a live demo.

        KRSI is an LSM that:

        • Allows the attachment of eBPF programs to security hooks.
        • Provides a good ecosystem of safe eBPF helper functions specifically written with security and auditing features in mind.

        This enables the development of a new class of userspace security products that:

        • Reduce the overhead of building and updating the kernel/LSM when a new security vulnerability is discovered.
        • Allows the system owners to choose the format in which the data is audit logged.
          Provide flexibility w.r.t granularity of auditing needed and add new auditing without needing to re-build or update the LSM/Kernel (in contrast to the existing audit framework)

        The intended audience for this talk would be:

        • Security-focused kernel engineers
        • Engineers building user-space security products on Linux.
        • Security Engineers and Admins who care about the time required to deploy security software to detect and prevent a new class of malicious activity.
        Speaker: Mr KP Singh
      • 216
        Map batch processing

        bcc community has long discussed that batch
        dump, lookup and delete will help its typical
        use case, periodically retrieving and deleting
        all samples in the kernel. Without batch APIs,
        bcc typically does
        iterate through all keys (get_next_key API)
        get (key, value) pairs
        iterate through all keys to delete them

        Also, Brian Vazquez
        has proposed BPF_MAP_DUMP command to dump
        more than one entry per syscall call.
        https://www.spinics.net/lists/netdev/msg583538.html

        This discussion will propose new bpf subcommands
        for map batch processing, e.g., batching
        get_next_key/lookup/update/delete/lookup_and_delete.
        discuss its pros and cons etc.

        Looks the subject has been discussed actively in the mailing list.
        If the discussion reached its maturity, we may not need to discuss
        in the conference.

        Speaker: Yonghong Song
    • Birds of a feather (BoF) Ametista/room-I (Corinthia Hotel Lisbon)

      Ametista/room-I

      Corinthia Hotel Lisbon

      50

      Our BoF session proposes topics as informal meeting during the conference. The topic lead (submitter) will drive the conversations on the area of interest described in each BoF.

      The attendees group together based on a shared interest and carry out discussions without any pre-planned agenda.

      • 217
        RCU internals and usage

        This session will focus on answering questions on the internals and the usage of Linux-kernel RCU. However, questions regarding details of the RCU-related patches in the -rt patchset will be deferred to other venues, given that this topic consumed the entire time in the 2018 informal RCU BoF session.

        This is not intended to be a tutorial on RCU basics, though a separate session on this topic might be offered if there is sufficient interest.

        Speaker: Paul McKenney (IBM Linux Technology Center)
      • 218
        Soft Affinity

        When multiple instances of workloads are consolidated in same host it is
        good practice to partition them for best performance. For e.g give a NUMA
        node parition to each instance. Currently Linux kernel provides two
        interfaces to hard parition: sched_setaffinity system call or cpuset.cpus
        cgroup. But this doesn't allow one instance to burst out of its partition
        and use available CPUs from other partitions when they are idle. Running
        all instances free range without any affinity, on the other hand, suffers
        from cache coherence overhead across sockets (NUMA nodes) when all
        instances are busy. To achieve the best of both worlds introduce new Soft Affinity feature that allows the scheduler to chose a preferred set of CPUs when they are idle but burst out of it and use the allowed set if they are all busy.

        Speaker: Subhra Mazumdar
      • 4:30 PM
        Break
    • Live Patching MC Opala/room-I&II (Corinthia Hotel Lisbon)

      Opala/room-I&II

      Corinthia Hotel Lisbon

      126

      The main purpose of the Linux Plumbers 2019 Live Patching microconference is to involve all stakeholders in open discussion about remaining issues that need to be solved in order to make live patching of the Linux kernel and the Linux userspace live patching feature complete.

      The intention is to mainly focus on the features that have been proposed (some even with a preliminary implementation), but not yet finished, with the ultimate goal of sorting out the remaining issues.

      This proposal follows up on the history of past LPC live patching microconferences that have been very useful and pushed the development forward a lot.

      Currently proposed discussion/presentation topic proposals (we've not gone through "internal selection process yet") with tentatively confirmed attendance:

      5 min Intro - What happened in kernel live patching over the last year
      API for state changes made by callbacks [1][2]
      source-based livepatch creation tooling [3][4]
      klp-convert [5][6]
      livepatch developers guide
      userspace live patching
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Jiri Kosina jkosina@suse.cz and Josh Poimboeuf jpoimboe@redhat.com

      • 219
        What happened in kernel live patching over the last year

        A short summary of a development in kernel live patching over the last year. There have been many improvements since LPC in Vancouver, but there are still some outstanding issues. Not all attendees might closely follow live-patching mailing list and therefore the talk should be a good starting point for the microconference.

        Speaker: Miroslav Beneš
      • 220
        Rethinking late module patching

        Current livepatch implementation supports late patching of modules when they are loaded (and unpatching when unloaded). It has caused headaches and LPC microconference is a good opportunity to discuss the future of the feature. There were attempt to deny the module removal. Introduction of patch module dependencies could also simplify the code and issue a lot. On the other hand, such solutions could make livepatch less flexible. It is necessary to weigh their advantages and downsides properly.

        Speaker: Miroslav Beneš
      • 221
        Source-based livepatch creation tooling

        At last year's Live Patching MC, an approach to automating source based live patch creation had been proposed. The implementation made good progress since then, in particular an initial release of the "klp-ccp" utility has been published (https://github.com/SUSE/klp-ccp) recently. Its purpose is to handle the transformation of patched kernel parts into self-contained live patch source code files.

        However, klp-ccp is only part of a larger pipeline and in working further towards fully automated live patch creation, it's worth to discuss how the individual pieces are best glued together.

        Among the open questions are:

        • Can klp-ccp and klp-convert make use of the same source of information for resolving symbols to instances from target kernel?
        • Can we perhaps introduce some convention for accessing the IPA optimization reports created by GCC's -fdump-ipa-clones?
        • Can we introduce some mechanism for obtaining the original kernel compilation's compiler flags each?
        Speaker: Nicolai Stange (SUSE)
      • 222
        Update on objtool - Power

        A quick update on the objtool port on Power, what is the current state and
        what more needs to be done. Also, discuss how do we integrate it upstream.

        Speaker: Mr Kamalesh Babulal
      • 223
        Do we need a Livepatch Developers Guide?

        Over the past few years, kernel engineers have been busy implementing livepatch support features (the consistency model, atomic replace, shadow variables, etc.) to increase potential livepatch patch coverage. At the same time, more and more vendors have adopted livepatching to solve continuous uptime/update problems.

        As the livepatch feature set grows and matures and demand for livepatch patches rise, developers will be seeking guidance and best practices when writing livepatch patches. The kpatch project addressed this with a "Patch Author Guide" on its github site. This document covers several common patch writing FAQ and techniques, but it is currently very kpatch-specific.

        The kernel livepatch subsystem has some great technical documentation in Documentation/livepatch. Talk about the state of those docs and whether they are approachable and complete enough for livepatch patch authors. Do we need a wholesale "Patch Author Guide" or can we adopt some of the same FAQ and technique details from the kpatch guide.

        Speaker: Joe Lawrence (Red Hat)
      • 4:30 PM
        Break
      • 224
        API for state changes made by callbacks

        The discussion should focus on an API for handling state of changes made by callbacks. It was already discussed as a global state handling at the last LPC in Vancouver. New ideas have occurred since then. The discussion should also include patch versioning, stickiness and transition reversal.

        Patches submitted upstream so far:
        https://www.spinics.net/lists/live-patching/msg05063.html
        http://lore.kernel.org/r/20190719074034.29761-1-pmladek@suse.com

        Speaker: Petr Mládek
      • 225
        klp-convert and livepatch relocations

        The kernel already supports special livepatch relocation types enable several interesting livepatch modules use cases:

        • Access to symbols outside of normal C scoping rules
        • Deferred access to yet-to-be loaded kernel module symbols
        • Support for architecture-specific special sections like altinstructions and paravirt instructions

        Although the kernel supports loading livepatch modules with these features, there remains no easy in-kernel means of creating such relocation types. The klp-convert patchset adds this functionality to the kernel build system, reducing dependencies on out-of-tree livepatch build mechanisms.

        Talk about the current state of the klp-convert patchset: what has been implemented, what is being worked on, and what issues are still outstanding.

        Speaker: Joe Lawrence (Red Hat)
      • 226
        Making Livepatching Infrastructure Better

        Currently testing/stressing of livepatching infrastructure is limited to the creation of livepatching module for the reported CVE/Security issues. Continuous testing of the infrastructure is required, it can be achieved by randomly selecting the patch(s) posted over kernel mailing list to improve and fix the bugs seen in the infrastructure. I would like to discuss the in house framework used for testing livepatch. The discussion would help to understand/provide feedback on how/what should be tested. The improvements which can be made to improve the testing coverage.

        Speaker: Mr Kamalesh Babulal
      • 227
        Live patch services

        Discussion about current live patch services and how we can make it more open and flexible.
        How we can make more open source distributions use or make their own live patch services.
        What we are still missing? and what we can share?

        Speaker: Alice Ferrazzi
    • System Boot and Security MC Jade/room-I&II (Corinthia Hotel Lisbon)

      Jade/room-I&II

      Corinthia Hotel Lisbon

      160

      The microconference will focus on various topics related to the open source security, including bootloaders, firmware, BMCs and TPMs. This will help to get together all interested people in one room and discuss current developments and issues hurting the community.

      Potential speakers and key participants: everybody involved or interested in GRUB, iPXE, coreboot, LinuxBoot, SeaBIOS, UEFI, OVMF, TianoCore, IPMI, OpenBMC, TPM, and related projects and technologies.

      It has been an exciting year of progress around the Linux integrity - patches for TPM support have finally been integrated into GRUB, support for a wider range of TPM2 features has been landing in-kernel, IMA and EVM have continued to grow new features and there's a fully-featured free software remote attestation implementation.

      Let's get together and spend a few hours discussing what the remaining painpoints are and what should come next.

      • 228
        Secure and Trusted boot in OpenBMC

        The OpenBMC project has brought modern Linux to the firmware in your new server. A missing piece of this is ensuring the firmware is the image you expect it to be running.

        The next generation of BMC hardware will allow a hardware root of trust to secure the boot chain. This talk will present the a proposed design for trusted boot in OpenBMC.

        Speaker: Joel Stanley (IBM)
      • 229
        UEFI and TianoCore update

        The UEFI forum is rolling out a new "code first" process, to be available for both UEFI and ACPI specifications, in order to speed up time between initial definition and upstream support.

        The UEFI self-certification testsuite (SCT) has been open sourced.

        UEFI interface implementation in U-Boot now sufficient for GRUB use (and more) across multiple distributions..

        Speaker: Leif Lindholm (Linaro, TianoCore, GRUB)
      • 230
        SGX upstreaming status and challenges

        The presentation gives an overview of what has been implemented in the SGX patch set and what there is still left to do. The presentation goes through the known blockers for upstreaming. In particular, access control related issues will be discussed.

        Speaker: Jarkko Sakkinen
      • 231
        TrenchBoot - how to nicely boot system with Intel TXT and AMD SVM

        TrenchBoot is a cross-community OSS integration project for hardware-rooted, late launch integrity of open and proprietary systems. It provides a general purpose, open-source DRTM kernel for measured system launch and attestation of device integrity to trust-centric access infrastructure. TrenchBoot closes the the measurement gap and reduces the need to trust system firmware. This talk will introduce TrenchBoot architecture and recent work within Oracle to launch the Linux kernel directly with Intel TXT or AMD SVM Secure Launch. It will propose mechanisms for integrating a Linux distro into a TrenchBoot system launch. DRTM-enabled capabilities for client, server and embedded platforms will be presented for consideration by the Linux community.

        Speaker: Daniel Kiper
      • 4:30 PM
        Break
      • 232
        TPM2 Security in the face of bus interposers

        TPM2 introduced a plain text authorization scheme with the idea that the system using the TPM should now whether the transport was secure. The presence of interposers on the bus, either as physical devices

        https://www.nccgroup.trust/us/our-research/tpm-genie/

        Or as compromised pre-boot firmware make this threat a reality. A NULL seed based scheme has been proposed for Linux

        https://lore.kernel.org/linux-integrity/1540193596.3202.7.camel@HansenPartnership.com/

        we should discuss if this is the best we can do and if it is how should we extend it to the layers below that use the TPM (like UEFI and grub).

        Speaker: James Bottomley (IBM)
      • 233
        reference Integrity measurements for TPM2 security policy

        Firmware on commodity PCs have used the TPM to store integrity measurements from security relevant components as part of the boot process for some time. Grub2 has recently merged patches that extend this integrity measurement chain through to the launching of the OS kernel. Collecting and storing these measurements in the TPM is a necessary precondition for implementing authorization policy based on the state of the system, but this alone is insufficient.

        This talk will begin by discussing the current state of boot-time integrity measurement collection in UEFI firmware and Grub2. We'll then present a notional use-case implementing security controls based on TPM2 policy mechanisms while describing the plumbing required to enable interaction with the TPM2 device. The remainder of this talk will then discuss the existing gaps in software and tooling required to implement work-flows for managing configuration of the relevant security controls across system install and update operations.

        Speaker: Philip Tricca (Intel)
      • 234
        Non-UEFI-aware measured boot using coreboot, GRUB and TPM2.0

        The main issue in using TPM2.0 in such measured boot solution is that at the
        moment of writing this abstract neither Trusted Grub, nor Linux kernel has
        TPM2.0 implementation. There are of course implementations based on UEFI
        systems, where bootloaders can utilize TCG EFI protocol to handle TPM. However
        other non-UEFI based solutions suffer from lack of TPM2.0 drivers in the
        bootloaders. Taking, for example, coreboot with vboot and measured mode the chain
        of trust ends on at verifying and measuring the MBR code. This limits the
        trusted boot technology for firmware solutions that do not base on UEFI
        specification.

        As TPM2.0 is already supported in coreboot, the next stage would be enabling it
        in GRUB2. As a matter of fact that TPM1.2 has already been enabled in its
        derivative, Trusted GRUB2.0, but we consider it much unsatisfying.

        Chain of trust:

        coreboot + payload -(chain cuts here)-> Trusted GRUB -> kernel

        Establishing a chain of trust will make SRTM (Static Root of Trust for
        Measurement) based on coreboot fully featured. As security solutions are used
        more and more widely it will help coreboot to stay up to date with all the
        competitor's proprietary solutions.

        Speakers: Piotr Król (3mdeb Embedded Systems Consulting), Mr Żygowski Michał (3mdeb Embedded Systems Consulting)
      • 235
        TPM 2.0 Linux sysfs interface

        At the time of writing this paper the Linux kernel supported TPM 1.2
        functionalities in sysfs. To these functionalities we include:

        $ ls /sys/devices/pnp0/00:04/tpm/tpm0
        active  caps  device     enabled  pcrs   ppi    subsystem         timeouts
        cancel  dev   durations  owned    power  pubek  temp_deactivated  uevent
        $ ls /sys/devices/pnp0/00:04/tpm/tpm0/ppi
        request  response  tcg_operations  transition_action  version  vs_operations
        

        We would expect the same or similar level of support for TPM 2.0. At least
        kernel should be able to request localities, change PCR banks, list PCRs, extend
        PCRs, clear TPM, take ownership. For now, the TPM2.0 is unusable in any way.
        Despite enabling all TPM options in the kernel configuration. There is a TPM 2.0
        software stack, however, it has many dependencies and has to be compiled by
        anyone who would like to utilize TPM2.0 (packages in package managers was
        broken for certain distros at the time of writing the document).

        Additionally, Linux has Integrity Measurement Architecture which utilizes TPM to
        attest the rootfs whether it has been maliciously modified. However, the only
        supported TPM is the one in version 1.2. Enabling it is as simple as adding a
        single kernel cmdline parameter: ima_tcb and defining a policy. However, it
        will only work with TPM 1.2 tools like tpm-tools, trousers.

        Speakers: Mr Król Piotr (3mdeb Embedded Systems Consulting), Mr Żygowski Michał (3mdeb Embedded Systems Consulting)
    • 236
      Closing Plenary
    • 237
      Closing Party Centro Cultural de Belém

      Centro Cultural de Belém

      Praça do Império, 1449-003 Lisboa

      Busses will be leaving from the Corinthia Hotel lobby from 19:30