11–13 Dec 2025
Asia/Tokyo timezone

Session

Live Update MC

13 Dec 2025, 15:00

Conveners

Live Update MC

  • Pasha Tatashin
  • Alexander Graf
  • David Matlack (Google)
  • Mike Rapoport

Description

Live Update is a specialized reboot process where selected devices are kept operational and kernel state is preserved and recreated across a kexec. For devices, DMA and interrupts may continue during the reboot.

The primary use-case of Live Update is to enable hypervisor updates in cloud environments with minimal disruption to running virtual machines. During a Live Update, a VM can pause and its state is stored to memory while the hypervisor reboots. PCIe devices attached to those VMs (such as GPUs, NICs, and SSDs) are kept running during the Live Update. After the reboot, VMs are recreated and restored from memory, reattached to devices, and resumed. The disruption is limited to the time it takes to complete this entire process.

With Live Update infrastructure in place, other use-cases may emerge, for example preserving the state of a GPU running LLM workloads, freezing running containers with CRIU, and preserving large in-memory databases.

Live Update and state persistence functionality touches different parts of the kernel, and this microconference aims to bring together people from different subsystems. Upstream support for Live Update is still in its infancy, and many unsolved aspects will benefit from direct communication.

Key problems that will be discussed:

Support for memfd/guest_memfd/hugetlb/tmpfs
Preserving the state of VFIO, IOMMUFD, and IOMMU drivers
Kernel <-> userspace interaction during Live Update
Integration of Live Update with PCI and Device Model
Persistence of movable memory
Leveraging suspend/resume functionality for device state preservation
Optimizing kernel shutdown and boot times
Automated Testing of Live Updates

Key attendees:

Pasha Tatashin
David Matlack
David Rientjes
Chris Li
Bjorn Helgaas
Samiullah Khawaja
Vipin Sharma
Josh Hilke
Changyuan Lyu
Alex Graf
David Woodhouse
James Gowans
Pratyush Yadav
Jason Gunthorpe
Mike Rapoport
Alex Williamson

Presentation materials

  1. Pasha Tatashin
    13/12/2025, 15:00

    Introduce the Live Update Orchestrator (LUO) and its daemon, LUOD, a new framework designed to provide a user API, state machine, resource management, and a resource ownership model for live update operations. Detail the architecture of LUO, explaining its core components and states. The talk will walk through the typical workflow.

    We will cover the current status of the project, including key...

  2. Pratyush Yadav
    13/12/2025, 15:15

    As LUO grows and various subsystems evolve, there will eventually be a need to update the serialization format for a particular subsystem. This new version of the format needs to be understood by the next kernel. This talk will discuss how the versions can be managed and negotiated between the current and the next kernel to ensure live update actually succeeds.

  3. Andrey Ryabinin
    13/12/2025, 15:30

    Hello, I'd like to propose a discussion about KSTATE
    as a solution for [de]serializing the kernel's state.

    Thanks,
    Andrey

  4. Pratyush Yadav
    13/12/2025, 15:45

    The Live Update Orchestrator (LUO) allows userspace to hand over resources identified by file descriptors (FDs) to the next kernel. Memory is one of the most fundamental resources managed by the kernel. Memory can be identified by a FD via memfd. This makes memfd a great candidate for the first LUO user.

    This talk will discuss the design of memfd preservation with LUO, current state of...

  5. Jason Miu
    13/12/2025, 16:00

    The KHO framework's initial design relies on a stateful, linear serialization step that creates a scalability bottleneck on large-memory hosts. This talk will detail our effort at making KHO stateless by changing the data structures which manage preserved physical pages.

    The session will also facilitate an open discussion on future optimizations. We will explore potential designs for making...

  6. Pasha Tatashin
    13/12/2025, 16:15

    An open discussion on improving kexec performance during live updates. Minimizing the downtime associated with reboots is critical. This session will explore potential optimizations, including methods to speed up ACPI discovery, enhance serialization/deserialization, and ideas like orphaned VMs. We invite the audience to contribute their own ideas and experiences to collaboratively identify...

  7. David Matlack (Google), Josh Hilke (KVM Team @ Google)
    13/12/2025, 17:00

    In this talk, we'll discuss our work to support Live Update in VFIO for PCI devices.

    Live Update is a mechanism to quickly update the kernel while running virtual machines using kexec. VFIO is a kernel module to allow devices to be controlled by userspace and virtual machines.

    During our talk we will cover the problems that need to be solved to support Live Updates in VFIO, and our...

  8. Samiullah Khawaja
    13/12/2025, 17:15

    During a kernel live update, devices owned by a virtual machine may continue to perform DMA operations. For these operations to succeed, the IOMMU state must be preserved. Normally, a kexec reboot would reinitialize the IOMMU, causing the loss of all state and pre-existing DMA mappings.

    To prevent this loss and ensure device continuity, the DMA mappings set up for a preserved device must...

  9. Chris Li (Google)
    13/12/2025, 17:30

    Live updating GPU devices is a major use case for the hypervisor Live Update project.

    The PCI Live Update subsystem is built on top of the Live Update Orchestrator to manage a PCI device and its dependent devices during the live update.

    PCI device probing needs heavy modification to support device live update. The device is already running and has device state left by the previous...

  10. Evangelos Petrongonas (Amazon Web Services)
    13/12/2025, 17:45

    At scale, virtualization uncovers hidden bottlenecks, including the cost of PCI configuration space accesses. In SR-IOV deployments with thousands of VFs, each configuration read triggers a hardware transaction. As VFs increase, these accesses scale linearly, leading to longer VM boot times, heavier bus contention, and noticeable startup delays.

    The PCI subsystem today treats every...

  11. Brian Vazquez
    13/12/2025, 18:00

    This microconference proposal aims to facilitate a discussion on the challenges and solutions for adding live-update support to Linux networking drivers, using the IDPF driver as a primary case study. Live-update is a specialized reboot process that preserves selected devices and kernel state, minimizing disruption in cloud environments. A key use case is enabling hypervisor updates while...

  12. Adithya Jayachandran (NVIDIA), Saeed Mahameed (NVIDIA)
    13/12/2025, 18:15

    Restarting a node running a stateful workload for an infrastructure software upgrade can be an extremely costly operation. Modern infrastructure software upgrades must also account for applications which are using accelerators such as GPUs, RDMA NICs and NIC stateful flow accelerators. While these workloads may typically run in isolated VMs, a hypervisor reboot for a kernel update can lead to...
