5–7 Oct 2026
Europe/Prague timezone

Linux System Monitoring and Observability MC

Not scheduled
20m

Speakers

Breno Leitao (Meta), Jason Xing (Tencent), Usama Arif

Description

The Linux System Monitoring and Observability MC brings together developers, maintainers, system engineers, and researchers to tackle unsolved problems in understanding, monitoring, and maintaining the health of Linux systems at scale.

Engineers managing millions of Linux servers face monitoring and observability challenges that no single team can solve alone. This track provides a forum to surface those challenges, share partial approaches, and leave with concrete next steps.

The goal is to bring these engineers together to discuss the direction and strategy for better monitoring of Linux systems.

Track Objectives

  • Surface the most pressing unsolved problems in Linux monitoring and observability
  • Identify gaps in existing kernel interfaces, tooling, and infrastructure
  • Build consensus on priorities and approaches for the upstream community
  • Invite participants to bring their hardest open questions, pain points, and gaps in current tooling, so the community can work toward solutions collaboratively

Target Audience

  • Hyperscaler Engineers: System reliability engineers who encounter monitoring gaps at scale that others may share
  • Kernel Developers: Contributors working on tracing, performance counters, and diagnostic interfaces who want to understand real-world pain points
  • Monitoring Tool Developers: Creators of observability platforms who have hit kernel or infrastructure limitations
  • System Administrators: Operations teams who can articulate what breaks, what's missing, and what's too hard
  • Performance Engineers: Specialists who can identify where current observability falls short for optimization work

Problems involving any of the following (but not limited to) are in scope:

  • eBPF/BPF tracing and monitoring programs: limitations, missing features, safety constraints
  • ftrace/perf: Kernel tracing infrastructure gaps
  • Runtime Sanitizers (KFENCE, KASAN): coverage gaps, performance trade-offs, production usability
  • Hardware Interfaces (EDAC, MCE, ACPI error reporting): missing integrations, inadequate interfaces
  • bpftrace, systemd, netconsole: usability and scalability issues
  • kdump/crash/drgn: crash analysis workflow pain points
  • perf, memory profilers, below, strobelight, OpenTelemetry: analysis gaps and scaling challenges

Topics that were discussed at the microconference and have since made progress:

  • Lack of NMI on some architectures and how to collect information about it. (Breno)
    https://lore.kernel.org/all/rs4igmsjrm6r2aio4nbe5jos3vcqk2u4bjhltjwtj2pn3cquip@kv3grgec7qrb/
    https://lkml.org/lkml/2026/3/30/1280

  • Improvements in the LAVD monitoring system (Gavin Guo)

  • Page owner tracking (Mauricio)
    https://lore.kernel.org/all/20251205231721.104505-1-mfo@igalia.com/

  • Kmemleak detection in the fleet (Breno)
    https://lore.kernel.org/all/20260323-kmemleak_report-v1-1-ba2cdd9c11b9@debian.org/

  • Memory failure and clean crashes
    https://lore.kernel.org/all/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org/

  • Track kernels doing kexec
    https://lore.kernel.org/all/20260309-kho-v8-0-c3abcf4ac750@debian.org/

  • TCP Reset Observability (Jason)
    https://mailarchive.ietf.org/arch/msg/tcpm/d27ntz9UM4tb4-cxJNxfYw8zCSE/

  • Always-on 24x7 network latency monitor (Jason)

  • Who is planning to submit topics being discussed in the MC to SOSP 2026?

  • Relay monitoring (Jason)

  • Diagnostic checks and API in relay, and discussion of future upstreaming

  • Improving memcg statistics collection (JP)
    https://lore.kernel.org/all/20260401203752.643259-1-jp.kobryn@linux.dev/

  • General improvements for DAMON (SJ)
