# Strange Kernel Performance Changes? Cache Alignment Matters

Feng Tang
Intel Linux Kernel Team

Linux Plumbers Conference, September 20-24, 2021

## Background

- 0Day (kernel test robot) continuously tests kernel performance and reports regressions and improvements
- Recently there have been many strange cases that are hard to explain, because the bisected culprit commits seem to have nothing to do with the benchmark
- Kernel developers, including Linus, doubted and even challenged the reports: "What?" "Why does it matter?"
- Goal: understand and explain them, and try to mitigate them (make everyone's life easier)
- Hints and ideas are welcome and appreciated!

## Kernel Sections Layout

```
bss
initcall setup
inittext
per cpu load
data
ro data
text
```

```
0000000000000000 D __per_cpu_start
000000000002e000 D __per_cpu_end
ffffffff81000000 T _stext
ffffffff81000000 T _text
ffffffff81e011b7 T _etext
ffffffff82000000 R __start_rodata
ffffffff8249c000 D __end_rodata
ffffffff82600000 D _sdata
ffffffff82876840 D _edata
ffffffff82cc1000 D __init_begin
ffffffff82cc1000 D __per_cpu_load
ffffffff82cef000 T _sinittext
ffffffff82d5663f T _einittext
ffffffff82f03390 T __initcall_start
ffffffff82f03e88 T __initcall_end
ffffffff82f1e000 R __init_end
ffffffff82f2a000 B __bss_start
ffffffff83400000 B __bss_stop
ffffffff8342c000 B _end
```

Source: System.map

## Kernel Sections Layout - II

Text/Data sections layout:

- Text: the .text sections of the individual object files (A.text, B.text, C.text, D.text, E.text, F.text, G.text)
- Data: the .data sections of the individual object files (A.data, B.data, C.data, D.data, E.data, F.data, G.data)

Link order matters (from Makefile)

## Cache Alignment Matters

Most of the strange cases are caused, unnoticed, by underlying cache alignment changes:

- Text (function) alignment
- Data alignment (false sharing)
- HW cache prefetchers
  - Adjacent cache lines prefetch (2N, 2N+1)
  - L2 cache prefetcher

## Text Alignment

- Kernel functions are all linked together compactly
- A one-line code change may change the alignment of the whole kernel text (all functions)
- The earlier a .o gets linked, the more parts it can affect
- Can be
explained, but hard to solve
- A Kconfig or compiler change can greatly affect the result
- Examples
  - [LKP] Re: [mm] fd4d9c7d0c: stress-ng.switch.ops\_per\_sec -30.5% regression
  - [mm/hugetlb] c77c0a8ac4: will-it-scale.per\_process\_ops 15.9% improvement

Figure: functions packed back to back in the text section in link order (A.func1, A.func2, A.func3, B.func1, B.func2, B.func3, B.func4, C.func1, D.func1, D.func2, D.func3).

## Case Study – Text Alignment

- A one-line mm fix patch causes a 30.6% regression for the stress-ng.switch case
  - Change in kmem\_cache\_alloc\_bulk(): c->tid = next\_tid(c->tid);
  - 16 more bytes in the binary for the function: "49 83 40 08 01 addq \$0x1,0x8(%r8)"
- The performance change disappears with forced function alignment

```
old map:
ffffffff812a1880 T kmem_cache_alloc_bulk
ffffffff812a1a80 t kmalloc_large_node
ffffffff812a1b10 t calculate_sizes
ffffffff812a1eb0 t store_user_store
ffffffff812a1f20 t poison_store
ffffffff812a1f90 t red_zone_store
ffffffff812a2000 t order_store

new map:
ffffffff812a1880 T kmem_cache_alloc_bulk
ffffffff812a1a90 t kmalloc_large_node
ffffffff812a1b20 T kmalloc_node        ---> relocated
ffffffff812a1e40 t calculate_sizes
ffffffff812a21e0 t store_user_store
ffffffff812a2250 t poison_store
ffffffff812a22c0 t red_zone_store
ffffffff812a2330 t order_store
```

## Mitigation (Debug) – Text Alignment

Force the start address of every function to be aligned to 64 bytes (merged)

- A black-box check, about which we are not 100% sure
- Kconfig option CONFIG\_DEBUG\_FORCE\_FUNCTION\_ALIGN\_64B
- Far fewer reports after 0Day enabled it
- Why not default on?
10% more kernel size, more TLB usage

## Data Alignment

- The key is cache false sharing
- Data is more complex than text
  - Static layout
    - .data section
    - Specific sections (like percpu)
  - Dynamic allocation: kmalloc/slab/vmalloc
- Debug methods
  - perf-c2c
  - pahole
  - Add padding

## Cache False Sharing

- Data is loaded from memory into the cache at cache line granularity
- Multiple CPUs access data in one cache line
  - All read → fine
  - One writes → bad
- Try to separate such members in hot data structures

## Mitigation (Debug) - Data Alignment

- Force the data sections of every .o file to be aligned (patch posted)
  - Change in the linker script vmlinux.lds.S
  - Debug only, due to the huge size increase
- Per-CPU data
  - Add debug allocation macros to force all percpu data addresses to be aligned
- Kmalloc/slab
  - Force alignment (slab has a parameter)

## HW Cache Prefetcher

- Most platforms have them ON by default, as they are generally helpful
- Transparent to the SW programmer
- Accuracy hugely affects bus bandwidth
- May vary between generations as the algorithms evolve
- Consider them if SW debugging can't help
- Real cases are related to the first 2 types

| Prefetcher | Bit# in MSR 0x1A4 | Description |
|---|---|---|
| L2 hardware prefetcher | 0 | Fetches additional lines of code or data into the L2 cache |
| L2 adjacent cache line prefetcher | 1 | Fetches the cache line that comprises a cache line pair (128 bytes) |
| DCU prefetcher | 2 | Fetches the next cache line into L1-D cache |
| DCU IP prefetcher | 3 | Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines |

https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html

## Adjacent Cache Lines Prefetch

- When one cache line is accessed and
fetched, its adjacent cache line is fetched too
- The 64B cache line is effectively extended into a 128B 'fat cacheline'
- Cannot be detected by tools like perf-c2c

| 128B-aligned address | Cache line (64B) | Adjacent cache line (64B) |
|---|---|---|
| 128 * N | 2N | 2N+1 |
| 128 * (N+1) | 2N+2 | 2N+3 |
| 128 * (N+2) | 2N+4 | 2N+5 |
| 128 * (N+3) | 2N+6 | 2N+7 |
| 128 * (N+4) | 2N+8 | 2N+9 |

## Case Study – HW Prefetcher

- A patch removing a 'struct page\_counter' from 'struct mem\_cgroup' causes a -22.7% regression for will-it-scale/page\_fault2
- The commit is related to the test case, but the regression looks to be alignment related
- 3 hot members (A, B, C) sit in 2 adjacent cache lines, which were not in one 128B chunk but were pulled into one by the commit
  - "False sharing" across 2 cache lines
- Solution: separate them into different 128B chunks

## Mitigation – Selective Isolation

- Goal: make kernel performance more stable (fewer surprises)
- Choose N (10~20) .o files and add 64/128B alignment to one function and one data object in each of them (modules A/D/I in the figure)
- Divide the kernel into N independent capsules, like the capsules in a big ship
  - One changed/broken capsule won't affect the others
- Rule: select more from critical and early-linked modules
- It won't hurt, with a minimal increase in kernel size

## Todos

- Upstream the mitigation and debug patches
- Extend the perf-c2c tool to cover adjacent cache line prefetch
- Explore more about HW prefetchers
- Check the cases which are still not explained
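
To make the 128B 'fat cacheline' idea from the HW prefetcher case study more concrete, here is a minimal user-space sketch in C11. It is not the kernel fix itself: the struct names, member names and helper functions are hypothetical, and it only assumes a 64B cache line with 128B adjacent-line pairing as described in the slides. It contrasts a layout where hot members sit in different 64B lines of the same 128B chunk with one where each hot member starts its own 128B chunk.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof, size_t */

#define CACHE_LINE 64u
#define FAT_LINE   (2u * CACHE_LINE) /* 128B pair used by the adjacent-line prefetcher */

/* Hypothetical "before" layout: hot members spread over two adjacent
 * 64B lines (2N and 2N+1) that belong to the same 128B chunk. */
struct hot_packed {
    alignas(FAT_LINE) long a;   /* cache line 2N   */
    long b;                     /* cache line 2N   */
    alignas(CACHE_LINE) long c; /* cache line 2N+1 */
};

/* Hypothetical "after" layout: every hot member starts its own 128B chunk. */
struct hot_isolated {
    alignas(FAT_LINE) long a;
    alignas(FAT_LINE) long b;
    alignas(FAT_LINE) long c;
};

/* Index of the 64B cache line that holds a given offset. */
static inline size_t line64(size_t off) { return off / CACHE_LINE; }

/* Index of the 128B chunk that holds a given offset. */
static inline size_t chunk128(size_t off) { return off / FAT_LINE; }

/* Non-zero if two offsets fall into the same 128B chunk. */
static inline int same_chunk128(size_t x, size_t y)
{
    return chunk128(x) == chunk128(y);
}

/* Both structs are 128B aligned, so member offsets map directly to chunks. */
static_assert(alignof(struct hot_packed) == FAT_LINE, "packed struct is 128B aligned");
static_assert(offsetof(struct hot_packed, c) == CACHE_LINE, "c starts the second 64B line");
static_assert(offsetof(struct hot_isolated, b) == FAT_LINE, "b starts a new 128B chunk");
static_assert(offsetof(struct hot_isolated, c) == 2 * FAT_LINE, "c starts a new 128B chunk");
```

In `struct hot_packed`, `a` and `c` are in different 64B lines, so a per-cache-line view would not show them as shared, yet they land in one 128B chunk. `struct hot_isolated` avoids that at the cost of growing to 384 bytes, which is why the slides suggest applying such alignment selectively.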