Core counts keep rising, which means that the Linux kernel continues to encounter interesting performance and scalability issues. This is not a bad thing, since it has been fifteen years since the "free lunch" of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.
Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.
We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.
Cgroup accounting has significant overhead because it must constantly loop over all CPUs to update CPU-usage and blocked-average statistics. On a 4-socket Haswell system running a 4.4 kernel, we have seen database benchmarks such as TPCC regress by 8% when run under a cgroup. On a recent Cannon Lake platform using the latest PCIe SSDs and a 4.18 kernel, the...
Discuss two possible approaches to live-updating a Linux kernel that runs as a hypervisor, without a noticeable effect on the running virtual machines (VMs). One method uses a cooperative multi-OSing paradigm to share the same machine between two kernels while the new kernel is booting and the old kernel is still serving the running VM instances. Allow the new kernel to live migrate the drivers from the...
In this talk I discuss scalability of load balancing algorithms in the task scheduler, and present my work on tracking overloaded CPUs with a bitmap, and using the bitmap to steal tasks when CPUs become idle.
The scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a...
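The bitmap idea described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code presented in the talk: the names `overload_mask`, `mark_overloaded`, `clear_overloaded`, and `pick_steal_victim` are hypothetical, the system is assumed to have at most 64 CPUs so that one machine word holds the whole map, and the victim search is a plain linear scan that wraps around.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64  /* assumption: at most 64 CPUs, so one word suffices */

/* Bit n is set while CPU n has more runnable tasks than it can run. */
typedef struct {
    uint64_t bits;
} overload_mask;

/* Record that `cpu` has become overloaded. */
static void mark_overloaded(overload_mask *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    m->bits |= (uint64_t)1 << cpu;
}

/* Record that `cpu` is no longer overloaded. */
static void clear_overloaded(overload_mask *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    m->bits &= ~((uint64_t)1 << cpu);
}

/*
 * Called when `idle_cpu` runs out of work. Scans upward from
 * idle_cpu + 1, wrapping around, and returns the first overloaded CPU,
 * or -1 if no CPU is overloaded. The idle CPU itself is never returned.
 */
static int pick_steal_victim(const overload_mask *m, int idle_cpu)
{
    assert(idle_cpu >= 0 && idle_cpu < NR_CPUS);
    for (int i = 1; i < NR_CPUS; i++) {
        int cpu = (idle_cpu + i) % NR_CPUS;
        if (m->bits & ((uint64_t)1 << cpu))
            return cpu;
    }
    return -1;
}
```

The point of the model is that an idle CPU consults one compact map instead of inspecting every run queue. A real implementation would also need atomic bit operations and would probably keep one map per cache domain rather than one for the whole system.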
1) Scalability of the scheduler's idle CPU and core search on systems with a large number of CPUs
The current select_idle_sibling first tries to find a fully idle core using select_idle_core, which can potentially search all cores; if that fails, it looks for any idle CPU using select_idle_cpu, which can potentially search all CPUs in the LLC domain. These don't scale for large LLC domains and will...
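The two-phase search described above can be modeled in a few lines of user-space C to show why its cost grows with the size of the LLC domain. This is a toy model, not the kernel's code: the function names `find_idle_core`, `find_idle_cpu`, and `pick_idle` are hypothetical, CPUs are assumed to be numbered so that the hardware threads of one core are adjacent, and `SMT_WIDTH` is assumed to be 2.

```c
#include <assert.h>
#include <stdbool.h>

#define SMT_WIDTH 2  /* assumption: two hardware threads per core */

/*
 * Phase 1: look for a core whose hardware threads are all idle.
 * Returns the first CPU of that core, or -1. Worst case: visits every CPU.
 */
static int find_idle_core(const bool *idle, int nr_cpus)
{
    for (int base = 0; base + SMT_WIDTH <= nr_cpus; base += SMT_WIDTH) {
        bool all_idle = true;
        for (int t = 0; t < SMT_WIDTH; t++)
            if (!idle[base + t])
                all_idle = false;
        if (all_idle)
            return base;
    }
    return -1;
}

/* Phase 2: fall back to any idle CPU. Worst case: visits every CPU again. */
static int find_idle_cpu(const bool *idle, int nr_cpus)
{
    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (idle[cpu])
            return cpu;
    return -1;
}

/* Prefer a fully idle core; otherwise settle for any idle CPU; else -1. */
static int pick_idle(const bool *idle, int nr_cpus)
{
    assert(idle && nr_cpus > 0);
    int cpu = find_idle_core(idle, nr_cpus);
    return cpu >= 0 ? cpu : find_idle_cpu(idle, nr_cpus);
}
```

On a busy system both phases can run to completion, so in this model each wakeup pays for two full scans of the domain.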
Huge pages are essential to addressing performance bottlenecks
since the base page sizes are not changing while the amount of memory is
ever increasing. Huge pages address not only TLB misses but also memory
overhead in the Linux kernel that arises through page faults and other
compute-intensive processing of small pages. Huge pages are required
with contemporary high-speed NVMe SSDs to reach...
Flexible workqueue: Currently there are two kinds of workqueue pool setup: 1) per-CPU workqueue pools and 2) unbound workqueue pools. The former requires users of workqueues to have some knowledge of CPU online state, as shown in:
The latter (unbound workqueue) has only one pool per NUMA node, and that may...
Certain CPU-intensive tasks in the kernel can benefit from multithreading, such as zeroing large ranges of memory, initializing massive state (struct page) at boot, VFIO page pinning, XFS quotacheck, and freeing memory on munmap/exit. There is currently no interface that provides this service. ktask is a framework built on workqueues that splits up the work, chooses the number of threads to...
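The "split up the work and choose a thread count" step can be sketched as plain arithmetic. The following is a hedged illustration, not ktask's interface: `pick_nthreads`, `chunk_for`, and `struct chunk` are hypothetical names, and the policy shown (one thread per `min_chunk` units of work, capped at `max_threads`) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: use one thread per `min_chunk` units of work,
 * but never fewer than one thread nor more than `max_threads`.
 */
static size_t pick_nthreads(size_t total, size_t min_chunk, size_t max_threads)
{
    assert(min_chunk > 0 && max_threads > 0);
    size_t n = (total + min_chunk - 1) / min_chunk;  /* ceil(total / min_chunk) */
    if (n < 1)
        n = 1;
    return n < max_threads ? n : max_threads;
}

/* Half-open range [start, end) of work units assigned to one worker. */
struct chunk {
    size_t start;
    size_t end;
};

/*
 * Divide `total` units as evenly as possible among `nthreads` workers;
 * the first (total % nthreads) workers each take one extra unit.
 */
static struct chunk chunk_for(size_t total, size_t nthreads, size_t idx)
{
    assert(nthreads > 0 && idx < nthreads);
    size_t base = total / nthreads;
    size_t extra = total % nthreads;
    struct chunk c;
    c.start = idx * base + (idx < extra ? idx : extra);
    c.end = c.start + base + (idx < extra ? 1 : 0);
    return c;
}
```

For example, 10 units with a minimum chunk of 3 and a cap of 8 threads yields 4 workers covering [0,3), [3,6), [6,8), and [8,10). Each worker would then run the caller's function, such as zeroing pages, over its own range.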
The mmap_sem has long been a contention point in the memory management
subsystem. This session will discuss several mmap_sem-related topics.
One optimization already merged upstream avoids holding mmap_sem for
write for an excessive period of time in the munmap path by downgrading
the write lock to a read lock. Other optimizations are under discussion
on the mailing list, i.e....