11–13 Dec 2025
Asia/Tokyo timezone

Session

Linux System Monitoring and Observability MC

11 Dec 2025, 10:00

Conveners

Linux System Monitoring and Observability MC

  • Usama Arif
  • Breno Leitao (Meta)

Description

The Linux System Monitoring and Observability Track brings together developers, maintainers, system engineers, and researchers focused on understanding, monitoring, and maintaining the health of Linux systems at scale. This track addresses the needs of engineers managing millions of Linux servers, where proactive monitoring, rapid problem detection, and automated remediation are essential for operational success.

Track Objectives

This track aims to foster collaboration between kernel developers, system administrators, and monitoring tool creators to advance the state of Linux system monitoring and observability.

Target Audience

  • Hyperscaler Engineers: System reliability engineers managing large-scale deployments
  • Kernel Developers: Contributors working on tracing, performance counters, and diagnostic interfaces
  • Monitoring Tool Developers: Creators of observability platforms and diagnostic utilities
  • System Administrators: Operations teams responsible for fleet health and incident response
  • Performance Engineers: Specialists focused on system optimization and bottleneck identification

Key attendees:

  • Usama Arif
  • Guilherme Piccoli
  • Jason Xing (Tencent)
  • Ricardo Canuelo
  • Gavin Guo
  • systemd folks (Daan)
  • Other cloud sysadmins

Key Focus Areas

Kernel Health and Runtime Monitoring

- Real-time kernel health assessment techniques
- Early detection of system degradation

Hardware Integration and Error Detection

- Hardware error correlation and root cause analysis
- Integration between kernel monitoring and hardware telemetry

Problem Correlation

- Virtualization stack monitoring (hypervisor ↔ guest relationships)
- Container runtime observability
- Network stack performance and reliability monitoring
- Storage I/O path analysis and optimization
- BMC information
- Scheduler (sched_ext changes)

Memory Management and Analysis

- Runtime memory profiling techniques
- Out-of-memory (OOM) prediction and prevention
- Memory leak detection and mitigation

Automated Analysis and Remediation

- Automated problem categorization and triage
- Anomaly detection algorithms for system behavior

Visualization and Alerting Infrastructure

- Real-time dashboarding for large-scale deployments
- Historical trend analysis and capacity planning

Tools and frameworks that would fit here

- eBPF/BPF: Advanced tracing and monitoring programs
- ftrace/perf: Low-level kernel tracing infrastructure
- Runtime Sanitizers: KFENCE, KASAN, and similar detection tools
- Hardware Interfaces: EDAC, MCE, ACPI error reporting

System-Level Tools

* bpftrace: Dynamic tracing language and runtime
* systemd: Service monitoring and system state management
* netconsole: Network-based kernel logging

Crash Analysis and Post-Mortem

* kdump/crash/drgn: Kernel crash dump analysis
* Core dump analysis: Userspace failure investigation
* Live debugging: GDB integration and kernel debugging techniques

Performance Analysis

* perf: Hardware performance counter analysis
* Memory profilers: Heap analysis and memory usage optimization
* Runtime memory profilers, such as `below`, strobelight, OpenTelemetry, etc.

Example Topics and Presentations

Previous LPC Presentations That Align With This Track

  • Livepatch Visibility at Scale
  • IOMMU overhead optimizations and observability
  • PWRu: BPF-based Datacenter Telemetry with Kernel Bypass
  • Performance Analysis Superpowers with Enhanced BPF
  • Runtime Verification: Where We've Been, Where We're Going
  • Linux Performance Analysis and Observability

Presentation materials

  1. Mr Gavin Guo
    11/12/2025, 10:00

    Over the past decade, Brendan Gregg's Flamegraph has become an indispensable tool for pinpointing performance bottlenecks. Based on the canonical Flamegraph, it has been evolving into various flavors tailored to address specific performance issues in the production systems. We'll share how the novel approach generates Flamegraphs from latency, memory usage, crashdump, and kern.log traces....

  2. Jason Xing (Tencent)
    11/12/2025, 10:20

    Title

    Methodology and Practice in Observing Kernel Networking

    Abstract

    Blindly enumerating all counters extracted from the kernel and haphazardly monitoring every function in the hot path is hardly practical in production. Three key issues deserve greater attention: 1) performance degradation, 2) ineffective metrics, and 3) the prohibitive cost of massive data storage. In most time,...

  3. Guilherme G. Piccoli (Igalia)
    11/12/2025, 10:40

    Steam Deck is a successful console from Valve that runs on top of FOSS, having Linux as its operating system.

    For regular gamers, the user experience is smooth and they don't even need to think about what's going on under the hood to make such a good experience possible. In particular, there are interesting bits from the tracing system and in-kernel debug features leveraged in order to achieve...

  4. Adin Scannell, Jordan Rome
    11/12/2025, 11:00

    This talk will cover the ongoing effort to evolve bpftrace from an observability tool into a flexible, composable framework that can make many observability tools and drive the larger BPF observability ecosystem, instead of trailing behind it.

    Over the past year, the bpftrace development team has focused on removing obstacles that hinder users from efficiently observing and debugging...

  5. JP Kobryn (Meta)
    11/12/2025, 11:20

    Periodically reading cgroup stat data can be expensive across a large enough fleet. I will discuss work done this year that focused on optimizations in this area and provide some background on the data/rationale that led us there. The presentation will include one technique for avoiding the expensive conversion/formatting involved with reading memory cgroups.

  6. SeongJae Park
    11/12/2025, 11:50

    DAMON simplifies the collection of system and workload data access patterns. However, interpreting this data and transforming it into actionable insights for humans remains a challenge. Representing the data in an actionable format is difficult. While efforts have been made to visualize this data, opinions vary on its accessibility. This session will review past attempts to make the data...

  7. Daniel Gomez
    11/12/2025, 12:10

    We build robust kernel code by properly handling errors and recovering
    gracefully. But many critical error conditions are hard to replicate
    in testing, so error injection becomes essential for validation. Past
    error injection approaches were often considered too intrusive and got
    rejected [1].

    This talk presents moderr, an eBPF tool using libbpf for error injection
    in the kernel module...

  8. Vlad Poenaru (Meta)
    11/12/2025, 12:30

    Monitoring the kernel on millions of servers in production poses significant problems in terms of scale and diversity of the environment, both in terms of software and hardware. An observability system should allow detecting, debugging and fixing a large number of issues, as well as allowing engineers to focus on the most important ones in terms of spread and severity. This is made challenging...

  9. Mauricio Faria de Oliveira (Igalia)
    11/12/2025, 12:50

    The existing page_owner debug feature tracks the stack trace of memory allocations in the system at the page level. It can answer questions like: 'What allocated this page?' and 'How many pages are allocated by what?' -- pointing right at the source code.

    That allows for profiling and monitoring all of the system memory per allocation stack trace to identify trends, leaks, spikes,...

  10. Mr Peace Lee
    11/12/2025, 13:10

    Modern embedded systems such as automotive IVI and custom Linux distributions are becoming increasingly complex, making real-time performance diagnosis difficult using traditional tools like ftrace or perf alone.
    Developers often face fragmented data sources, high analysis overhead, and the need for manual correlation across logs and traces.

    Guider is an open-source, self-contained...
