Kernel live patching (KLP) makes it possible to apply quick fixes to a live Linux kernel without shutting down the workload to reboot the server. The kpatch tool chain and the livepatch infrastructure generally work well. However, using them on a closely monitored fleet of several million servers uncovers many corner cases. During the deployment of KLP at Meta, we ran into a number of issues, including performance regressions, conflicts with tracing and monitoring tools, and KLP transitions that fail sporadically depending on what the kernel is doing when the patch is applied. In this presentation, we will share our experiences working with KLP at scale, describe the top issues we are facing, and discuss some ideas for future improvements.
First, we will briefly introduce how we build, deploy, and monitor KLPs at scale. We will then present some recent work to improve the KLP infrastructure, including eliminating the performance hit when applying KLPs, making sure KLP works well with various tracing mechanisms, and fixing various corner cases in the kpatch-build tool chain and the livepatch infrastructure. Finally, we would like to discuss the remaining issues with KLP at scale and how to address them. Specifically, we will present the different causes of KLP transition errors and a few ideas and works in progress to address them.