Paul E. McKenney, Meta Platforms Kernel Team Puranjay Mohan, Kernel Developer at Amazon Web Services Linux Plumbers Conference, eBPF Track, September 20, 2024



# Instruction-level BPF memory model

#### History

- "Towards a BPF Memory Model", LPC 2021
  - https://lpc.events/event/11/contributions/941/
- Kangrejos 2023 Hallway Track (with Jose Marchesi)
- "Instruction-Level BPF Memory Model", IETF 118
  - https://datatracker.ietf.org/doc/agenda-118-bpf/
  - https://datatracker.ietf.org/meeting/118/materials/slides-118-bpf-bpf-memory-model-00
- "BPF Memory Model, Two Years On", LPC 2023
  - https://lpc.events/event/17/contributions/1580/
- "Instruction-Level BPF Memory Model", living Google Document
  - https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
- "Instruction-level BPF memory model", LSF/MM/BPF 2024
  - https://lwn.net/Articles/976071/ (video: https://www.youtube.com/watch?v=QG-cLG9PekI)

#### **History**

- "Towards a BPF Memory Model", LPC 2021
  - https://lpc.events/event/11/contributions/941/
- Kangrejos 2023 Hallway Track (with Jose Marchesi)
- "Instruction-Level BPF Memory Model", IETF 118
  - https://datatracker.ietf.org/doc/agenda-118-bpf/
  - https://datatracker.ietf.org/meeting/118/materials/slide
- "BPF Memory Model, Two Years On", LPC
  - https://lpc.events/event/17/contributions
- "Instruction-Level BPF Memory
  - https://docs.google.com/de
- "Instruction-level BPF
  - https://lwn.net/Art

memory-model-00

ocument

SKqkavUQyL2tThJXYWHS15qcbxlsFb0/edit?usp=sharing

MM/BPF 2024

3/1/ (video: https://www.youtube.com/watch?v=QG-cLG9Pekl)

- This model is upstream in the herdtools7 project
- To install and use:

```
git clone https://github.com/herd/herdtools7
cd herdtools7
# Follow instructions in INSTALL.md.
herd7 path/to/BPF/litmus/test
# Sample tests in catalogue/bpf/tests.
```

#### Litmus tests in catalogue/bpf/tests:

CoRR+poonceonce+Once.litmus CoRW+poonceonce+Once.litmus CoWR+poonceonce+Once.litmus CoWW+poonceonce.litmus depencency\_ordered\_before.litmus IRIW+fencembonceonces+OnceOnce.litmus IRIW+poonceonces+OnceOnce.litmus ISA2+poonceonces.litmus LB+fcas-addr-once+once-scas.litmus LB+fcas-ctrlcvg-once+once-scas.litmus LB+fcas-ctrl-once+once-scas.litmus LB+fcas-data-once+once-scas.litmus LB+poonceonces.litmus LockTwice.litmus MP+fcas-addr-fcas+scas-scas.litmus MP+fcas-ctrl-fcas+scas-scas.litmus MP+fcas-data-fcas+scas-scas.litmus MP+fcas-data-fcas+scas-scas-LKMM.litmus MP+poonceonces.litmus MP+pooncerelease+poacquireonce.litmus

R+fencembonceonces litmus R+poonceonces.litmus S+atomiconce+data.litmus SB+fence+fail cmpxchq.litmus SB+fencembonceonces.litmus SB+fence+success cmpxchq.litmus SB+poonceonces.litmus SB+rfionceonce-poonceonces.litmus S+fence+addr.litmus S+fence+ctrl-read.litmus S+fence+ctrl-write.litmus S+fence+data.litmus S+onceatomic+data.litmus S+poonceonces.litmus WRC+poonceonces+Once.litmus WRC+pooncerelease+fencermbonceonce+Once.litmus W+RWC+poll+poaa+pola.litmus X+addr-reads+corr-writes+data-rw.litmus X-test-r2.litmus



Litmus tests in catalogue/bpf/tests:

CoRR+poonceonce+Once.litmus CoRW+poonceonce+Once.litmus CoWR+poonceonce+Once.litmus CoWW+poonceonce.litmus dependency\_ordered\_before.litmus IRIW+fencembonceonces+OnceOnce.litmus IRIW+poonceonces+OnceOnce.litmus ISA2+poonceonces.litmus LB+fcas-addr-once+once-scas.litmus LB+fcas-ctrlcvg-once+once-scas.litmus LB+fcas-ctrl-once+once-scas.litmus LB+fcas-data-once+once-scas.lit LB+poonceonces.litmus LockTwice.litmus MP+fcas-addr-fcas+sca MP+fcas-ctrl-fcas it mus as.litmus MP+fcas-data-fcas -scas-LKMM.litmus MP+fcas-data-fcas+ MP+poonceonces.litmus MP+pooncerelease+poacquireonce.litmus

R+fencembonceonces.litmus R+poonceonces.litmus S+atomiconce+data.li SB+fence+fail cmp SB+fencembonce a.litmus SB+fence+s SB+poop s.litmus SB+ ∠(†mus e.litmus amic+data.litmus ceonces litmus kc+poonceonces+Once.litmus WRC+pooncerelease+fencermbonceonce+Once.litmus W+RWC+poll+poaa+pola.litmus X+addr-reads+corr-writes+data-rw.litmus X-test-r2.litmus

### Example BPF Litmus Test

```
BPF S+fence+data
int x=0; int y=10;
0:r0=x; 0:r1=y;
0:r5=tmp; (* only used for the atomic op in P0 to enforce ordering *)
1:r0=x; 1:r1=y;
P0
                                       | r2 = *(u32 *)(r1 + 0)
*(u32 *)(r0 + 0) = 2
r6 = atomic_fetch_add((u64*)(r5 + 0), r6) | *(u32 *)(r0 + 0) = r2
*(u32 *)(r1 + 0) = 0
exists (1:r2=0 /\ x=2)
```

# And Corresponding herd7 Output

```
$ herd7 -model bpf_lkmm.cat S+fence+data.litmus
Test S+fence+data Allowed
States 3
1:r2=0; [x]=0;
1:r2=10; [x]=2;
1:r2=10; [x]=10;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r2=0 / [x]=2)
Observation S+fence+data Never 0 3
Time S+fence+data 0.00
Hash=a35dc5b17cde70582ebd0ea218dd3ba5
```

# Load-Acquire and Store-Release

https://lore.kernel.org/bpf/20240729183246.4110549-1-yepeilin@google.com/T/

# Load-Acquire and Store-Release

Arbitrarily chose instruction formats

```
r0 = load_acquire((u32 *)(r2 + 0))
store_release((u32 *)(r2 + 0), r8)
```

- Chose RCpc vs. RCsc (ARM64 Idapr vs. Idar)
  - Idapr (but not Idar) can be reordered with earlier stlr
  - ARM tried just ldar, performance forced ldapr
  - RCsc would force bad code on some architectures
- Made store\_release A-commutative (see next slide)

# What is A-Cumulativity???



12

# What is A-Cumulativity???



13

# What Did This Change Take?

```
git diff 4112e1ea..6315dd37 -- stat herd lib/
herd/BPFArch_herd.ml | 14 ++++++++++-
herd/BPFSem.ml
                  20 +++++++++++++++++
herd/libdir/bpf.cat | 11 ++++++---
 lib/BPFBase.ml
 lib/BPFLexer.mll
                    2 ++
 6 files changed, 76 insertions(+), 6 deletions(-)
```

### Demo

- Convert existing LKMM tests to BPF!!!
  - https://github.com/paulmckrcu/litmus
  - Early days, and hopefully replaced by something a bit more formal at some point ;-)

#### Validation: Convert LKMM to BPF

- tools/memory-model/litmus-tests:
  - 35 Total
  - 22 Without RCU, SRCU, locking, and weak barriers
  - 20 Without "if" statements and smp\_store\_mb()
  - 20 Potentially convertable to BPF
  - 20 Compatible LKMM and BPF outcomes

#### Validation: Convert LKMM to BPF

- tools/memory-model/litmus-tests:
  - 35 Total
  - 22 Without RCU, SRCL, cong, and weak barriers
  - 20 Without "is a long of the long of the
  - 20 convicable to BPF
  - compatible LKMM and BPF outcomes

- https://github.com/paulmckrcu/litmus:
  - 5374 Total
  - 2493 Without RCU, SRCU, locking, and weak barriers
  - 2166 Without "if" statements and smp\_store\_mb()
  - 146 Potentially convertable to BPF
  - 133 Excluding casted/unmarked accesses and atomic RMW
  - 126 Compatible LKMM and BPF outcomes
  - 7 With incompatible outcomes

- https://github.com/paulmckrcu/litmus:
  - 5374 Total
  - 2493 Without RCU, SRCU, locking
  - 2166 Without "if" statement to the control of the
  - 146 Potentially
  - 133 Find (1997) Led accesses and atomic RMW
  - and BPF outcomes
  - men incompatible outcomes

https://github.com/paulmckrcu/litmus: **5374 Total** 2493 Without R SRCU, locking 2166 Without "in 146 Potential rd atomic RMW 133 F and BPF outcomes acompatible outcomes

- https://github.com/paulmckrcu/litmus:
  - 5374 Total
  - 2493 Without RCU, SRCU, locking, and weak barriers
  - 2166 Without "if" statements and smp\_store\_mb()
  - 146 Potentially convertable to BPF
  - 133 Excluding casted/unmarked accesses and atomic RMW
  - 126 Compatible LKMM and BPF outcomes
  - 0 With incompatible outcomes

# JIT Complications

# BPF Conditional Jump Instructions

- This weak ordering applies when:
  - Either the src or dst registers depend on a prior load instruction (BPF\_LD or BPF\_LDX), and
  - There is a store instruction (BPF\_ST or BPF\_STX) before control flow converges, and following the conditional jump instruction in program order
  - The restrictions outlined in the "CONTROL DEPENDENCIES" section of Documentation/memory-barriers.txt are faithfully followed
    - Compilers do not understand control dependencies, and happily break them.
    - Optimizations involving conditional-move instructions requires the "before control flow converges" restriction

#### **BPF Instructions To Other Instructions**



#### **BPF Instructions To Other Instructions**



# JIT Complications

- Register Mismatches
- ABI Calling Conventions
- Backend Optimizations

# Register Mismatches

- BPF has R0-R10, real hardware has 16, 32, ...
- Can map BPF R0-R10 to fixed HW registers
  - Usually gives up performance: spills/reloads
- If fewer HW registers, dynamically map
- Many JITs treat R0-R10 as C-language auto variables whose <u>addresses have not been taken</u>

# ABI Calling Conventions

- BPF has calling conventions
- But so do hardware-assembly BPF helpers
- JIT might need to map calling conventions
- Fun when doing stack unwinding: shadow stack

# **Backend Optimizations**

- Inlining complicates stack unwinding and optimizations
- Arithmetic optimizations
  - Multiplication by zero replaced by zero, discarding other operand and computations leading up to it
  - Subtracting an expression from itself is also cancelled
- Type-based inference
  - Range-based tracking of register values permits eliding of branch instructions

# Optimizations Break Dependencies

# Checking Dependencies

- The klitmus tool starts with an LKMM litmus test, then creates a kernel module that tests it
  - Can prove something happens, but cannot prove that something cannot happen
- Use klitmus-like tool translate JIT BPF assembly litmus tests to a kernel module, check for broken dependencies
  - Again, cannot prove breakage does not happen: Still useful

### Where Is the BPF Memory Model?

- Overall direction set in 2021
- Informal instruction-level ordering in late 2023
- Formal model and tools in early 2024
- Handle new load-acquire and store-release instructions in late 2024
  - Adjustments might be needed based on eventual instruction format and semantics
- Verification against LKMM in late 2024 (support for "if" statements still needed)
- Things known to still be left:
  - There might also be a full-barrier instruction
    - Currently emulated with no-operation value-returning atomic operations
  - Comparison of BPF MM against hardware models (klitmus-like tool TBD)

# Summary

### **Summary**

- BPF memory model now has:
  - Prototype load-acquire/store-release handling
  - Automated checking against LKMM
    - Other than some atomics and "if" statements

#### For More Information

- Linux-kernel BPF standards directory (includes instruction definitions)
  - Documentation/bpf/standardization
- The Herd toolsuite for memory-model verification and testing
  - https://github.com/herd/herdtools7 with base memory model
  - https://github.com/puranjaymohan/herdtools7.git with load-acquire/storerelease prototype
- "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
  - Chapter 12 ("Formal Verification")
    - https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

## Questions?

# Backup

#### Review of Informal Model

- BPF Atomic Instructions
- BPF Conditional Jump Instructions
- BPF Load instructions
- BPF Memory-Reference Instructions

#### **BPF Atomic Instructions**

- BPF\_XCHG, BPF\_CMPXCHG
- BPF\_ADD, BPF\_OR, BPF\_AND, BPF\_XOR
- BPF\_FETCH with one of the above

#### **BPF Atomic Instructions 1/3**

- BPF\_XCHG and BPF\_CMPXCHG instructions are fully ordered
- All CPUs and tasks agree that all instructions preceding or following a given BPF\_XCHG or BPF\_CMPXCHG instruction are ordered before or after, respectively, that same instruction
  - Consistent with Linux-kernel atomic\_xchg() and atomic\_cmpxchg(), respectively
  - Alternatively, consistent with the following:
    - smp\_mb(); atomic\_cmpxchg\_relaxed(); smp\_mb();

#### BPF Atomic Instructions 2/3

- BPF\_ADD, BPF\_OR, BPF\_AND, BPF\_XOR instructions are unordered
- CPUs and JITs can reorder these instructions freely
  - Consistent with Linux-kernel
     atomic\_add(), atomic\_or(),
     atomic\_and(), and atomic\_xor() APIs

#### **BPF Atomic Instructions 3/3**

- When accompanied by BPF\_FETCH, BPF\_ADD, BPF\_OR, BPF\_AND, BPF\_XOR instructions are fully ordered
- All CPUs and tasks agree that all instructions preceding or following a given instruction adorned with BPF\_FETCH are ordered before or after, respectively, that same instruction
  - Consistent with Linux-kernel atomic\_fetch\_add(), atomic\_fetch\_or(), atomic\_fetch\_and(), and atomic\_fetch\_xor() APIs

- Modifiers to BPF\_JMP32 and BPF\_JMP instructions:
  - BPF\_JEQ, BPF\_JGT, BPF\_JGE, BPF\_JSET, BPF\_JNE,
     BPF\_JSGT, BPF\_JSGE, BPF\_JLT, BPF\_JLE, BPF\_JSLT,
     and BPF\_JSLE
- Unconditional jump instructions (BPF\_JA, BPF\_CALL, BPF\_EXIT) provide no memory-ordering semantics

- These modifiers to BPF\_JMP32 and BPF\_JMP instructions provide weak ordering:
  - BPF\_JEQ, BPF\_JGT, BPF\_JGE, BPF\_JSET,
     BPF\_JNE, BPF\_JSGT, BPF\_JSGE, BPF\_JLT,
     BPF\_JLE, BPF\_JSLT, and BPF\_JSLE
- Too-smart JITs might need to be careful

- This weak ordering applies when:
  - Either the src or dst registers depend on a prior load instruction (BPF\_LD or BPF\_LDX), and
  - There is a store instruction (BPF\_ST or BPF\_STX) before control flow converges, and
  - The restrictions outlined in the "CONTROL DEPENDENCIES" section of Documentation/memory-barriers.txt are faithfully followed
    - Compilers do not understand control dependencies, and happily break them.
    - Optimizations involving conditional-move instructions requires the "before control flow converges" restriction

- This weak ordering applies when:
  - Either the src or dst registers depend on a prior load instruction (BPF\_LD or BPF\_LDX), and
  - There is a store instruction (BPF\_ST or BPF\_STX) before control flow converges, and following the conditional jump instruction in program order
  - The restrictions outlined in the "CONTROL DEPENDENCIES" section of Documentation/memory-barriers.txt are faithfully followed
    - Compilers do not understand control dependencies, and happily break them.
    - Optimizations involving conditional-move instructions requires the "before control flow converges" restriction

- This weak ordering applies when:
  - Either the src or dst registers depend on a prior logical and (BPF\_LDX), and
  - There is a store instruction (BPF Converges, and following the Converges, and following the Converges of t
  - The restrictions to let the description of Document 100 to 100
    - control dependencies, and happily break them. control dependencies, and happily break them. control dependencies, and happily break them.