Description
Live Update is a specialized reboot process in which selected devices are kept operational and kernel state is preserved and restored across a kexec. DMA and interrupts for those devices may continue during the reboot.
The primary use case of Live Update is to enable hypervisor updates in cloud environments with minimal disruption to running virtual machines. During a Live Update, a VM is paused and its state is saved to memory while the hypervisor reboots. PCIe devices attached to those VMs (such as GPUs, NICs, and SSDs) are kept running during the Live Update. After the reboot, VMs are recreated and restored from memory, reattached to their devices, and resumed. The disruption is limited to the time it takes to complete this entire process.
With Live Update infrastructure in place, other use cases may emerge, such as preserving the state of GPUs running LLM workloads, freezing running containers with CRIU, and preserving large in-memory databases.
Live Update and state persistence functionality touches many different parts of the kernel, and this microconference aims to bring together people from the affected subsystems. Upstream support for Live Update is still in its infancy, and there are many unsolved problems that will benefit from direct communication.
Key problems that will be discussed:
Support for memfd/guest_memfd/hugetlb/tmpfs
Preserving the state of VFIO, IOMMUFD, and IOMMU drivers
Kernel <-> userspace interaction during Live Update
Integration of Live Update with PCI and Device Model
Persistence of movable memory
Leveraging suspend/resume functionality for device state preservation
Optimizing kernel shutdown and boot times
Automated testing of Live Updates
Key attendees:
Pasha Tatashin
David Matlack
David Rientjes
Chris Li
Bjorn Helgaas
Samiullah Khawaja
Vipin Sharma
Josh Hilke
Changyuan Lyu
Alex Graf
David Woodhouse
James Gowans
Pratyush Yadav
Jason Gunthorpe
Mike Rapoport
Alex Williamson