Linux Plumbers Conference 2025

Name: Linux Plumbers Conference 2025
Start: 2025-12-11T09:00:00+09:00
End: 2025-12-13T23:30:00+09:00

11–13 Dec 2025

Asia/Tokyo timezone

contact@linuxplumbersconf.org

Session

Linux System Monitoring and Observability MC

11 Dec 2025, 10:00

Conveners

Usama Arif
Breno Leitao (Meta)

Description

The Linux System Monitoring and Observability Track brings together developers, maintainers, system engineers, and researchers focused on understanding, monitoring, and maintaining the health of Linux systems at scale. This track addresses the needs of engineers managing millions of Linux servers, where proactive monitoring, rapid problem detection, and automated remediation are essential for operational success.

Track Objectives

This track aims to foster collaboration between kernel developers, system administrators, and monitoring tool creators to advance the state of Linux system monitoring and observability.

Target Audience

Hyperscaler Engineers: System reliability engineers managing large-scale deployments
Kernel Developers: Contributors working on tracing, performance counters, and diagnostic interfaces
Monitoring Tool Developers: Creators of observability platforms and diagnostic utilities
System Administrators: Operations teams responsible for fleet health and incident response
Performance Engineers: Specialists focused on system optimization and bottleneck identification

Key attendees:

Usama Arif
Guilherme Piccoli
Jason Xing (Tencent)
Ricardo Canuelo
Gavin Guo
Systemd folks (Daan)
Other cloud sysadmins

Key Focus Areas

Kernel Health and Runtime Monitoring

- Real-time kernel health assessment techniques
- Early detection of system degradation

Hardware Integration and Error Detection

- Hardware error correlation and root cause analysis
- Integration between kernel monitoring and hardware telemetry

Problem Correlation

- Virtualization stack monitoring (hypervisor ↔ guest relationships)
- Container runtime observability
- Network stack performance and reliability monitoring
- Storage I/O path analysis and optimization
- BMC information
- Scheduler (sched_ext changes)

Memory Management and Analysis

- Runtime memory profiling techniques
- Out-of-memory (OOM) prediction and prevention
- Memory leak detection and mitigation

Automated Analysis and Remediation

- Automated problem categorization and triage
- Anomaly detection algorithms for system behavior

Visualization and Alerting Infrastructure

- Real-time dashboarding for large-scale deployments
- Historical trend analysis and capacity planning

Tools and frameworks that would fit here

- eBPF/BPF: Advanced tracing and monitoring programs
- ftrace/perf: Low-level kernel tracing infrastructure
- Runtime Sanitizers: KFENCE, KASAN, and similar detection tools
- Hardware Interfaces: EDAC, MCE, ACPI error reporting

System-Level Tools

* bpftrace: Dynamic tracing language and runtime
* systemd: Service monitoring and system state management
* netconsole: Network-based kernel logging

Crash Analysis and Post-Mortem

* kdump/crash/drgn: Kernel crash dump analysis
* Core dump analysis: Userspace failure investigation
* Live debugging: GDB integration and kernel debugging techniques

Performance Analysis

* perf: Hardware performance counter analysis
* Memory profilers: Heap analysis and memory usage optimization
* Runtime memory profilers, such as `below`, strobelight, OpenTelemetry, etc.

Example Topics and Presentations

Previous LPC Presentations That Align With This Track

Livepatch Visibility at Scale
IOMMU overhead optimizations and observability
PWRu: BPF-based Datacenter Telemetry with Kernel Bypass
Performance Analysis Superpowers with Enhanced BPF
Runtime Verification: Where We've Been, Where We're Going
Linux Performance Analysis and Observability

103. Extending Flamegraphs for Multi-Dimensional Performance Analysis

Mr Gavin Guo

11 Dec 2025, 10:00

Over the past decade, Brendan Gregg’s Flamegraph has become an indispensable tool for pinpointing performance bottlenecks. The canonical Flamegraph has since evolved into various flavors tailored to specific performance issues in production systems. We'll share how a novel approach generates Flamegraphs from latency, memory usage, crashdump, and kern.log traces...
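As a rough illustration of the input a canonical Flamegraph is built from, the sketch below collapses sampled call stacks into the "folded" text format (semicolon-separated frames followed by a count) that flamegraph.pl consumes. The function name `fold_stacks` and the sample data are hypothetical and are not taken from the talk. Replacing the sample count with another weight, such as latency or bytes allocated, is one way a graph of this kind could represent other dimensions.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def fold_stacks(samples: Iterable[Sequence[str]]) -> List[str]:
    """Collapse call stacks (root frame first) into folded-stack lines."""
    counts = Counter(";".join(frames) for frames in samples if frames)
    # Sort for deterministic output; each line is "frame;frame;... count".
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, root frame first.
samples = [
    ["main", "read_file", "vfs_read"],
    ["main", "read_file", "vfs_read"],
    ["main", "parse"],
]
folded = fold_stacks(samples)
print("\n".join(folded))
```

Each output line can be fed to a Flamegraph renderer as one stack with its weight.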

76. Methodology and Practice in Observing Kernel Networking

Jason Xing (Tencent)

11 Dec 2025, 10:20

Blindly enumerating all counters extracted from the kernel and haphazardly monitoring every function in the hot path is hardly practical in production. Three key issues deserve greater attention: 1) performance degradation, 2) ineffective metrics, and 3) the prohibitive cost of massive data storage. Most of the time,...

293. Proactive and crash-time data collection on Steam Deck

Guilherme G. Piccoli (Igalia)

11 Dec 2025, 10:40

Steam Deck is a successful console from Valve that runs on top of FOSS, having Linux as its operating system.

For regular gamers, the user experience is smooth, and they don’t even need to think about what’s going on under the hood to make such a good experience possible. In particular, there are interesting bits from the tracing system and in-kernel debug features leveraged in order to achieve...

265. From Tool To ToolBox: How bpftrace is evolving to become more composable and expressive

Adin Scannell, Jordan Rome

11 Dec 2025, 11:00

This talk will cover the ongoing effort to evolve bpftrace from an observability tool into a flexible, composable framework that can be used to build many observability tools and drive the larger BPF observability ecosystem instead of trailing behind it.

Over the past year, the bpftrace development team has focused on removing obstacles that hinder users from efficiently observing and debugging...

450. Reading memcg stats more efficiently

JP Kobryn (Meta)

11 Dec 2025, 11:20

Periodically reading cgroup stat data can be expensive across a large enough fleet. I will discuss work done this year that focused on optimizations in this area and provide some background on the data/rationale that led us there. The presentation will include one technique for avoiding the expensive conversion/formatting involved with reading memory cgroups.
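For context on the conversion and formatting cost mentioned above: cgroup v2 exposes `memory.stat` as text, one `name value` pair per line, so the kernel formats every counter as a string and each reader parses it back into integers. The sketch below shows only that reader-side step. The function `parse_memory_stat` and the sample text are hypothetical, and this is not the technique presented in the talk.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse the flat 'name value' lines of a cgroup v2 memory.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        stats[name] = int(value)
    return stats


# Hypothetical sample; a real file has many more counters.
sample = "anon 1048576\nfile 4194304\npgfault 1234\n"
stats = parse_memory_stat(sample)
print(stats["anon"] + stats["file"])
```

On a real system the text would come from a cgroup's `memory.stat` file, and repeating this for every cgroup on every host at each polling interval is where the cost accumulates.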

274. Actionable Data Access Monitoring Output Data and Format

SeongJae Park

11 Dec 2025, 11:50

DAMON simplifies the collection of system and workload data access patterns. However, interpreting this data and transforming it into actionable insights for humans remains a challenge. Representing the data in an actionable format is difficult. While efforts have been made to visualize this data, opinions vary on its accessibility. This session will review past attempts to make the data...

295. Module Error Injection with eBPF

Daniel Gomez

11 Dec 2025, 12:10

We build robust kernel code by properly handling errors and recovering gracefully. But many critical error conditions are hard to replicate in testing, so error injection becomes essential for validation. Past error injection approaches were often considered too intrusive and got rejected [1].

This talk presents moderr, an eBPF tool using libbpf for error injection in the kernel module...

440. Scaling Kernel Production Monitoring @ Meta

Vlad Poenaru (Meta)

11 Dec 2025, 12:30

Monitoring the kernel on millions of servers in production poses significant problems in terms of scale and diversity of the environment, both in terms of software and hardware. An observability system should allow detecting, debugging and fixing a large number of issues, as well as allowing engineers to focus on the most important ones in terms of spread and severity. This is made challenging...

395. Improving page_owner for profiling and monitoring memory usage per allocation stack trace

Mauricio Faria de Oliveira (Igalia)

11 Dec 2025, 12:50

The existing page_owner debug feature tracks the stack trace of memory allocations in the system at the page level. It can answer questions like: 'What allocated this page?' and 'How many pages are allocated by what?' -- pointing right at the source code.

That allows for profiling and monitoring all of the system memory per allocation stack trace to identify trends, leaks, spikes,...
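To make "how many pages are allocated by what" concrete, the sketch below sums pages per allocation stack from page_owner-style text. It rests on a simplified assumption about the record layout: a header line containing `order N`, an optional `PFN` line, then stack frames, with records separated by blank lines. Real output carries more fields and varies across kernel versions. The function `pages_per_stack` and the sample records are hypothetical, and the improved interface proposed in the talk is not shown.

```python
import re
from collections import Counter
from typing import Dict, Tuple

ORDER_RE = re.compile(r"\border (\d+)")


def pages_per_stack(dump: str) -> Dict[Tuple[str, ...], int]:
    """Sum allocated pages per stack trace from page_owner-style records."""
    totals: Counter = Counter()
    # Records are assumed to be separated by blank lines.
    for record in re.split(r"\n\s*\n", dump.strip()):
        lines = [ln.strip() for ln in record.splitlines() if ln.strip()]
        if not lines:
            continue
        match = ORDER_RE.search(lines[0])
        if match is None:
            continue  # not an allocation record in the assumed layout
        # Frames: everything after the header, minus the PFN/flags line.
        frames = tuple(ln for ln in lines[1:] if not ln.startswith("PFN"))
        totals[frames] += 1 << int(match.group(1))  # order N = 2**N pages
    return dict(totals)


# Hypothetical records in the simplified layout described above.
SAMPLE = """\
Page allocated via order 2, mask 0x0, pid 1, tgid 1 (init), ts 100 ns
PFN 4096 type Movable Block 8 type Movable Flags 0x0
 alloc_pages+0x10/0x20
 do_work+0x5/0x30

Page allocated via order 0, mask 0x0, pid 2, tgid 2 (worker), ts 200 ns
PFN 8192 type Movable Block 16 type Movable Flags 0x0
 alloc_pages+0x10/0x20
 do_work+0x5/0x30
"""
totals = pages_per_stack(SAMPLE)
for stack, pages in totals.items():
    print(pages, " <- ".join(stack))
```

Comparing these totals between two snapshots taken some time apart is one simple way to spot stacks whose page count keeps growing.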

78. Guider: Lightweight Real-Time Performance & Fault Monitoring Framework for Embedded Linux Platforms

Mr Peace Lee

11 Dec 2025, 13:10

Modern embedded systems such as automotive IVI and custom Linux distributions are becoming increasingly complex, making real-time performance diagnosis difficult using traditional tools like ftrace or perf alone. Developers often face fragmented data sources, high analysis overhead, and the need for manual correlation across logs and traces.

Guider is an open-source, self-contained...