Speaker
Description
Monitoring the kernel on millions of servers in production poses significant challenges of scale and of environment diversity, in both software and hardware. An observability system should allow detecting, debugging, and fixing a large number of issues, while letting engineers focus on the most important ones in terms of spread and severity. This is made more challenging by the fact that Meta runs the newest kernel of any hyperscaler, which usually means we find problems before many others do. Meta uses commonly available tools (netcons, kdump, drgn, etc.) as well as internally developed tools to fulfill these requirements. The session will cover how kernel events and data are collected and aggregated, the tools and data sources being used, and finally how and where data is used in the release process. We'll also discuss the challenges of maintaining observability at Meta's scale, including performance overhead considerations, data volume management, and the balance between granularity and practicality. This talk is relevant for anyone interested in production kernel observability or in operating Linux at scale.