18–20 Sept 2024
Europe/Vienna timezone

Session

Kernel Memory Management MC

20 Sept 2024, 15:00

Description

Memory management has become exciting again. Some controversial subjects which might merit discussion:

  • Should we add memory policy zones?
  • How far should we go to support CXL?
  • How do we handle page allocation in a memdesc world?
  • Should we switch the slab allocator from partial slabs to sheaves?
  • Can we get rid of non-compound multi-page allocations?
  • What other improvements might we see from mTHP?
  • How might we make allocations guaranteed to not fail?
  • Can we share the pagecache between reflinked files?
  • Is there a better way to share page tables between processes than hugetlb?

Presentation materials

  1. Suren Baghdasaryan, Pasha Tatashin, Sourav Panda (Google)
    20/09/2024, 15:00

    Memory allocation profiling infrastructure provides a low-overhead mechanism to make all kernel allocations in the system visible. This allows for monitoring memory usage, tracking hotspots, detecting leaks, and identifying regressions.
    Unlike previous discussions on the design of this technique, we will now focus on the changes since it was incorporated into the upstream kernel, planned...

  2. Petr Tesařík
    20/09/2024, 15:15

    For decades, Linux memory management has been mostly focused on the needs of
    user space and generic kernel-space users (memory control groups, transparent
    huge pages, compression). Other big changes are good for maintenance and/or
    debugging (removal of DISCONTIGMEM, compaction, kmemleak, folios, removal of
    redundant slab-style allocators and many others). Little has been done for...

  3. SeongJae Park
    20/09/2024, 15:30

    There are two hopes for the Linux kernel. Some people hope the kernel will just work without users' intervention. Meanwhile, others hope the kernel will be extensible so that users can flexibly control it with their proprietary information.

    DAMON is designed and planned to satisfy both parties. Also, because DAMON is part of the memory management subsystem, it should also...

  4. Liam Howlett (Oracle), Lorenzo Stoakes (Oracle)
    20/09/2024, 15:45

    VMA guards are inserted at the start and/or end of VMAs to detect out-of-bounds reads or writes. Currently these guards are represented by an allocated VMA even though almost none of the information in the VMA is used. Sometimes these guards are so numerous that they represent close to half of the VMAs used in a system. Such a large number of underutilized objects represents a potential for...

  5. Rik van Riel (Facebook)
    20/09/2024, 16:00

    Conventional wisdom has held that madvise overhead has been mostly the syscall overhead. However, profiling shows this not to be the case.

    Even on a medium-sized, single-socket system, about half the CPU time spent in MADV_DONTNEED is spent flushing the TLB, and that is just on the calling CPU. Add in handling of the TLB flush IPIs on the other CPUs, and 90-95% of the MADV_DONTNEED overhead is...
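
    For readers less familiar with the call being profiled, the following is a minimal userspace sketch of what MADV_DONTNEED does to a private anonymous mapping on Linux. It is not taken from the talk; the helper name is hypothetical, and it assumes Python 3.8 or later on Linux. It illustrates only the semantics (the page is dropped and later reads back as zeros) and does not measure the TLB flush cost discussed above.

```python
import mmap
import sys


def dontneed_zeroes_private_page() -> bool:
    """Return True if a private anonymous page reads back as zeros
    after MADV_DONTNEED (Linux semantics)."""
    length = mmap.PAGESIZE
    # Private anonymous mapping, not backed by any file.
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:4] = b"data"                # fault the page in and dirty it
        buf.madvise(mmap.MADV_DONTNEED)  # drop the page; stale translations must be flushed
        return buf[:4] == b"\x00" * 4    # next access is served by a fresh zero page
    finally:
        buf.close()


if __name__ == "__main__":
    # The zero-fill behaviour shown here is specific to Linux.
    if sys.platform.startswith("linux"):
        print(dontneed_zeroes_private_page())
```

    Because the kernel has to remove the old translation before the range can be reused, every such call implies TLB invalidation work, which is the cost the abstract attributes most of the overhead to.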

  6. Kundan Kumar (Samsung Semiconductor India Research)
    20/09/2024, 16:15

    Direct and passthrough IO involves mapping user space memory into the kernel. At present, this memory is mapped as an array of pages. Using 4K pages for mapping results in additional overhead due to per-page memory pinning, unpinning, and calculations. Switching to a large folio-based mapping will reduce this overhead.
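
    As a rough illustration of the per-page overhead described above, the sketch below counts how many tracking units a user buffer spans at two granularities. The helper name, the buffer address, and the 64 KiB folio size are assumptions chosen for the example; real buffers need not be backed by aligned folios of uniform size.

```python
def units_spanned(addr: int, length: int, unit: int) -> int:
    """Number of unit-sized, unit-aligned blocks touched by [addr, addr + length)."""
    if length <= 0:
        return 0
    first = addr // unit
    last = (addr + length - 1) // unit
    return last - first + 1


KIB = 1024
buf_addr = 0x7F00_0000_0000  # hypothetical, 64 KiB-aligned user address
buf_len = 1024 * KIB         # 1 MiB I/O buffer

# Per-page tracking with 4 KiB pages.
print(units_spanned(buf_addr, buf_len, 4 * KIB))   # 256
# Per-folio tracking, assuming the buffer is backed by 64 KiB folios.
print(units_spanned(buf_addr, buf_len, 64 * KIB))  # 16
```

    Under these assumptions, a 1 MiB buffer needs 256 pin/unpin operations per page but only 16 per folio, which is the kind of reduction the proposal is aiming for.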

    As part of this proposal, the current GUP implementation needs to be...

  7. Juan Yescas, Kalesh Singh (Google)
    20/09/2024, 17:00

    During the transition to a 16 KB page size system, numerous instances were found where the kernel or userspace relied on the assumption of PAGE_SIZE == 4096. While many functional issues have been resolved, some inherent challenges persist, along with opportunities for optimization in systems with larger page sizes.
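
    The kind of assumption mentioned above can be sketched in userspace as follows. This is an illustrative example, not code from the talk: the helper name is hypothetical, and the 20000-byte request is an arbitrary value. It shows that a length rounded up with a hardcoded 4096 is not page-aligned on a 16 KB system, whereas querying the page size at run time stays correct.

```python
import mmap


def align_up(length: int, page_size: int) -> int:
    """Round length up to the next multiple of page_size (a power of two)."""
    return (length + page_size - 1) & ~(page_size - 1)


request = 20000  # arbitrary request size in bytes

# Hardcoded 4 KiB assumption: fine on 4 KiB systems only.
hardcoded = align_up(request, 4096)
print(hardcoded)           # 20480
print(hardcoded % 16384)   # 4096, so not page-aligned with 16 KiB pages

# The same request on a 16 KiB page system.
print(align_up(request, 16384))  # 32768

# Portable variant: ask the running system for its page size.
portable = align_up(request, mmap.PAGESIZE)
print(portable % mmap.PAGESIZE == 0)  # True
```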

    This work investigates the following key challenges and potential areas of...

  8. Chris Li (Google), Kairui Song (Tencent)
    20/09/2024, 17:15

    The swap system originally only needed to handle 4K and THP-sized swap. As mTHP introduces more size options for swap, it also brings the new challenge of swap fragmentation. The swap subsystem will need some changes to meet the new allocation requirements.

    The presentation will propose some swap allocator approaches to address mTHP swap fragmentation. Some of the patch series have already been sent to the...

  9. Barry Song, Mr Chuanhua Han, Mr Tangquan Zheng
    20/09/2024, 17:30

    In addition to the work by Chris Li and Ryan Roberts on optimizing mTHP swap-out slot allocation [1][2], we at OPPO have several patchsets focused on mTHP swap-in [3][4] and enhancing zsmalloc/zRAM [5] to save and restore compressed mTHP.

    Without mTHP swap-in, mTHP is a one-way ticket: once swapped out, they cannot revert to mTHP. With mTHP swap-in, we make mTHP bidirectional and gain the...

  10. Axel Rasmussen (Google), Guru Anbalagane (Google), Wei Xu (Google), Yuanchu Xie (Google)
    20/09/2024, 17:45
    • Adopting MGLRU in Google's production kernel
    • Predictable DRAM scheduling based on working set
    • Leveraging page table scanning for NUMA and CXL
    • Path for MGLRU to become the default
  11. Yu Zhao (Google)
    20/09/2024, 18:00

    TAO is an umbrella project aiming at a better economy of physical contiguity viewed as a valuable resource. A few examples are:
    1. A multi-tenant system can have guaranteed THP coverage while hosting abusers/misusers of the resource.
    2. Abusers/misusers, e.g., workloads excessively requesting and then splitting THPs, should be punished if necessary.
    3. Good citizens should be rewarded with...

  12. Yu Zhao (Google)
    20/09/2024, 18:15

    There are three types of zones:
    1. The first four zones partition the physical address space of CPU memory.
    2. The device zone provides interoperability between CPU and device memory.
    3. The movable zone commonly represents a memory allocation policy.

    Though originally designed for memory hot removal, the movable zone is instead widely used for other purposes, e.g., CMA and kdump kernel,...
