18–20 Sept 2024
Europe/Vienna timezone

Session

VFIO/IOMMU/PCI MC

18 Sept 2024, 10:00

Description

The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays the de-facto standard for connecting high-speed components, and they continue to gain new features with every revision.

These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualisation, and ubiquitous IoT devices.

The VFIO/IOMMU/PCI MC covers the kernel code that enables these new system features. That code centres on coordination between the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), with the related kernel interfaces and userspace APIs designed in sync and cleanly across all three sub-systems.

Following the success of the VFIO/IOMMU/PCI MCs at LPC 2017, 2019, 2020, 2021, 2022, and 2023, the Linux Plumbers Conference 2024 VFIO/IOMMU/PCI track will focus on promoting discussions on the PCI core and on current kernel patches aimed at the VFIO/IOMMU/PCI subsystems. Specific sessions will target discussions requiring coordination between the three subsystems.

See the following video recordings from 2023: LPC 2023 - VFIO/IOMMU/PCI MC.

Older recordings can be accessed through our official YouTube channel at @linux-pci and the archived LPC 2017 VFIO/IOMMU/PCI MC web page at Linux Plumbers Conference 2017, where the audio recordings from the MC track and links to presentation materials are available.

The tentative schedule will provide an update on the current state of VFIO/IOMMU/PCI kernel sub-systems, followed by a discussion of current issues in the proposed topics.

The following results came out of last year's successful Linux Plumbers MC:

  • The first version of the work on improving IRQ throughput using coalesced interrupt delivery with MSI has been sent for review for inclusion in the mainline kernel
  • The work surrounding /dev/iommufd continues: the baseline VFIO support replacing the "Type 1" IOMMU backend has been merged into the mainline kernel, and discussions around introducing an accelerated vIOMMU to KVM are in progress. Both Intel and AMD are working on supporting iommufd in their drivers
  • Changes focused on IOMMU observability and overhead are currently in review to be included in the mainline kernel
  • The initial support for generating DT nodes for discovered PCI devices has been merged into the mainline kernel. Several patches with various fixes have followed since then
  • Following a discussion on cleaning up the PCI Endpoint sub-system, a series has been proposed to move to the genalloc framework, replacing custom allocator code within the endpoint sub-system

Tentative topics that are under consideration for this year include (but are not limited to):

  • PCI

    • Cache Coherent Interconnect for Accelerators (CCIX)/Compute Express Link (CXL) expansion memory and accelerators management
    • Data Object Exchange (DOE)
    • Integrity and Data Encryption (IDE)
    • Component Measurement and Authentication (CMA)
    • Security Protocol and Data Model (SPDM)
    • I/O Address Space ID Allocator (IOASID)
    • INTx/MSI IRQ domain consolidation
    • Gen-Z interconnect fabric
    • ARM64 architecture and hardware
    • PCI native host controllers/endpoints drivers' current challenges and improvements (e.g., state of PCI quirks, etc.)
    • PCI error handling and management, e.g., Advanced Error Reporting (AER), Downstream Port Containment (DPC), ACPI Platform Error Interface (APEI) and Error Disconnect Recovery (EDR)
    • Power management and devices supporting Active-state Power Management (ASPM)
    • Peer-to-Peer DMA (P2PDMA)
    • Resources claiming/assignment consolidation
    • Probing of native PCIe controllers and general reset implementation
    • Prefetchable vs non-prefetchable BAR address mappings
    • Untrusted/external devices management
    • DMA ownership models
    • Thunderbolt, DMA, RDMA and USB4 security
  • VFIO

    • Write-combine on non-x86 architectures
    • I/O Page Fault (IOPF) for passthrough devices
    • Shared Virtual Addressing (SVA) interface
    • Single Root I/O Virtualization (SR-IOV)/Process Address Space ID (PASID) integration
    • PASID in SR-IOV virtual functions
    • Device assignment/sub-assignment
  • IOMMU

    • /dev/iommufd development
    • IOMMU virtualisation
    • IOMMU drivers SVA interface
    • DMA-API layer interactions and the move towards generic dma-ops for IOMMU drivers
    • Possible IOMMU core changes (e.g., better integration with the device-driver core, etc.)

If you are interested in participating in this MC and have topics to propose, please use the Call for Proposals (CfP) process. More topics might be added based on CfP for this MC.

Otherwise, join us in discussing how to help Linux keep up with the new features added to the PCI interconnect specification. We hope to see you there!

Key Attendees:

  • Alex Williamson
  • Arnd Bergmann
  • Ashok Raj
  • Benjamin Herrenschmidt
  • Bjorn Helgaas
  • Dan Williams
  • Eric Auger
  • Jacob Pan
  • Jason Gunthorpe
  • Jean-Philippe Brucker
  • Jonathan Cameron
  • Jörg Rödel
  • Kevin Tian
  • Krzysztof Wilczyński
  • Lorenzo Pieralisi
  • Lu Baolu
  • Marc Zyngier
  • Peter Zijlstra
  • Thomas Gleixner


Presentation materials

  1. Jonathan Cameron (Huawei Technologies R&D (UK))
    18/09/2024, 10:00

    Key takeaway: interrupts are what make this complex.

    The PCIe port driver is an unusual beast:
    - It binds to several Class Codes because they happen to have common features. (PCI Bridges of various types, Root Complex Event Collectors).
    - It then gets ready to register a set of service drivers.
    - Before registering those service drivers it has to figure out what interrupts are in use...

  2. Ilpo Järvinen (Intel)
    18/09/2024, 10:20

    PCIe Bandwidth Controller (bwctrl) is a PCIe portdrv service that allows controlling the PCIe Link Speed for thermal and power-consumption reasons. The Link Speed control is provided through an in-kernel API and, for userspace, through a thermal cooling device. With the advent of PCIe Gen6, the PCIe Link Width will also become controllable in the near future.

    On PCIe side, bwctrl requires...

  3. Liang Yan
    18/09/2024, 10:40

    We encountered a performance bottleneck while testing NCCL on a GPU cluster with 8x H100 GPUs and 8x 400G NIC nodes. Despite a theoretical capacity of 400 Gb/s, our system consistently reached only ~85 Gb/s. The primary issue was identified as communication constraints between GPUs and NICs under the same PCIe switch.

    This session will concisely overview the challenges we experienced, such...

  4. Mr Jason Gunthorpe (NVIDIA Networking)
    18/09/2024, 11:00

    A brief iommufd update and time for any active discussion that needs resolution.

    A discussion on Generic Page Table, which should reach the mailing list as an RFC before the conference. Generic Page Table consolidates the page table code in the iommu layer into something more like the MM, with common algorithms and thin arch-specific helpers. Common algorithms will allow implementing new ops to...

  5. Shivaprasad G Bhat (IBM)
    18/09/2024, 12:00

    The PPC64 implementation of VFIO is spread across two vastly different machine types (pSeries, PowerNV) that try to share a lot of common code driven by the PPC-specific SPAPR IOMMU API.

    Support for PCI device assignment on these sub-architectures has gone through many cycles of breakage and fixes, with ongoing efforts to add support for IOMMUFD, which PPC64 has yet to catch up with. Enhancements[1] to...

  6. James Gowans (Amazon EC2)
    18/09/2024, 12:10

    Live update is a mechanism to support updating a hypervisor in a way that has limited impact to running virtual machines. This is done by pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM processes and then deserialising/resuming the VMs so that they continue running from where they were. When the VMs have DMA devices assigned to them, the IOMMU state and page...

  7. Joel Granados
    18/09/2024, 12:30

    The PCI ATS Extended Capability allows peripheral devices to participate in the
    caching of translations when operating under an IOMMU. Further, the ATS Page
    Request Interface (PRI) Extension allows devices to handle missing mappings.
    Currently, PRI is mainly used in the context of Shared Virtual Addressing,
    requiring support for the Process Address Space Identifier (PASID) capability,
    but...

  8. Srivatsa Vaddagiri (Qualcomm)
    18/09/2024, 12:50

    Platform devices are those discovered via something like a device tree.
    Once discovered, a device is typically available for the life of a VM; in
    other words, platform devices can't be hotplugged in the typical sense.
    Qualcomm, however, has use cases where platform device ownership needs to
    be managed at runtime between VMs. A VM that has ownership of a platform
    device is required...

  9. Manivannan Sadhasivam
    18/09/2024, 13:10

    As a follow-up to last year's 'PCI Endpoint Open Items Discussion', below are the topics for discussion this year:

    1. State of the Virtio support in PCI Endpoint Subsystem
    2. Using QEMU for testing PCI Endpoint Subsystem
    3. Repurposing Interrupt Controllers for Receiving Doorbells in Endpoint Devices
  10. Wei Huang

    PCIe standard TLP processing hints (TPH) allow steering tags (STs) to be attached to PCIe TLP headers to facilitate optimized processing of DMA write requests that target memory space. New AMD hardware, by leveraging TPH, will support smart data cache injection where DMA data will be prefetched into L2 cache of target CCXs rather than DRAM. These new technologies can potentially improve DMA...
