### WireGuard and GRO? Improving WireGuard performance

Daniel Borkmann, Anton Protopopov, Martynas Pumputis (Isovalent)



### History (Cilium LB)



"Making the Kubernetes Service Abstraction Scale using BPF", LPC 2019





### History (Cilium Encryption)

- IPsec integration since Cilium v1.4 for inter-container traffic
  - Host stack does encryption via kernel XFRM framework
  - Cannot just bpf\_redirect() to encrypt
  - Tricky integration due to reliance on skb->tc\_index and skb->mark
  - No automated key rotation (no IKE)
- WireGuard in CI to test Cilium LB L3 to L2 netdev redirection
  - Dedicated netdev for encryption (cilium\_wg0)
  - Simple setup and auto key rotation (just exchange pub keys)

### Cilium WireGuard integration






### Cilium WireGuard (userspace) integration

- User-space mode to support WireGuard on kernels older than 5.6 (now deprecated)
  - Relies on TUN device
  - Not intended for production use (cannot withstand cilium-agent restarts)
  - Probably not performant enough (?)

### WireGuard driver vs WireGuard-go

- WireGuard-go got support for UDP GRO/GSO
  - Blog: "Userspace isn't slow, some kernel interfaces are!"
  - Blog: "Surpassing 10Gb/s over Tailscale"



### WireGuard benchmark setup

- AMD Ryzen 9 3950X @ 3.5 GHz, 128G RAM @ 3.2 GHz, PCIe 4.0
- 100Gb/s dual port ConnectX-6 Dx (mlx5), LRO enabled
- PREEMPT\_NONE, IRQs pinned, no SMT, CPU gov: performance
- CPU mitigations compiled out
- BIG TCP enabled
- Git trees: net tree, wireguard-go (12269c27617)



### TCP stream single flow host to host over wire, 1500 MTU (higher is better)



Back to back: AMD Ryzen 9 3950X @ 3.5 GHz, 128G RAM @ 3.2 GHz, PCIe 4.0, ConnectX-6 Dx, mlx5 driver, striding mode, LRO, 1500 and 8k MTU

### Can we still do better for the native driver?

- How does GRO/GSO currently work in the native WireGuard driver?
- GRO:
  - Individual UDP packets (no GRO) go up the stack into WireGuard socket
  - WireGuard decrypts, then aggregates via napi\_gro\_receive(&peer->napi, skb)
- GSO:
  - Stack can send up to 64k GSO packets down into wg device
  - WireGuard segments via skb\_gso\_segment(skb, 0), then encrypts
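
The GSO step above can be pictured with a small user-space model. This is only a sketch of the idea, not the kernel's `skb_gso_segment()`: the `struct segment` type and the `split_payload()` helper are hypothetical names introduced for illustration, and the model ignores headers entirely.

```c
#include <stddef.h>

/* Hypothetical descriptor for one segment of a larger payload. */
struct segment {
        size_t offset; /* byte offset into the original payload */
        size_t len;    /* payload bytes carried by this segment */
};

/*
 * Split payload_len bytes into chunks of at most mss bytes, which is
 * conceptually what software GSO does before each chunk is encrypted.
 * Writes at most max_out descriptors and returns the number of
 * segments the payload needs (0 if mss is 0).
 */
size_t split_payload(size_t payload_len, size_t mss,
                     struct segment *out, size_t max_out)
{
        size_t n = 0, off = 0;

        if (mss == 0)
                return 0;
        while (off < payload_len) {
                size_t len = payload_len - off;

                if (len > mss)
                        len = mss;
                if (n < max_out) {
                        out[n].offset = off;
                        out[n].len = len;
                }
                off += len;
                n++;
        }
        return n;
}
```

Under this model, a 64000-byte payload with a 1400-byte segment size becomes 46 packets, each of which is encrypted separately. Larger GSO packets reaching the device do not reduce the number of encryptions, but they do reduce the number of trips through the stack above the device.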



### Can we still do better for the native driver?

- Low hanging fruit? Two ideas:
- GRO:
  - Instead of sending individual packets up the stack into the UDP socket, why not take a similar approach to xfrm's ESP offload?
  - The GRO handler enqueues the skb internally for decryption and returns ERR\_PTR(-EINPROGRESS) to the GRO engine to signal that the skb has been consumed (GRO\_CONSUMED)
  - Details: see Steffen's "IPsec GRO layer decapsulation"
- GSO:
  - Enable BIG TCP support for the driver to allow even bigger packets to reach the device: netif\_set\_tso\_max\_size(dev, GSO\_MAX\_SIZE) during dev setup

```
/* Only transport data messages of at least the minimum length qualify. */
static bool wg_gro_candidate(struct sk_buff *skb)
{
        if (unlikely(skb->len < sizeof(struct message_header)))
                return false;
        if (SKB_TYPE_LE32(skb) == cpu_to_le32(MESSAGE_DATA) &&
            skb->len >= MESSAGE_MINIMUM_LENGTH)
                return true;
        return false;
}

struct sk_buff *wg_gro_receive(struct sock *sk,
                               struct list_head *head,
                               struct sk_buff *skb)
{
        struct wg_device *wg = sk->sk_user_data;
        int offset = skb_gro_offset(skb);

        if (!pskb_pull(skb, offset))
                return NULL;
        if (!wg_gro_candidate(skb))
                goto out;
        skb_mark_not_on_list(skb);
        PACKET_CB(skb)->ds = ip_tunnel_get_dsfield(ip_hdr(skb), skb);
        /* Queue for decryption; the GRO engine treats the skb as consumed. */
        wg_packet_consume_data(wg, skb);
        return ERR_PTR(-EINPROGRESS);
out:
        /* Not a data packet: restore the headers and take the regular UDP path. */
        skb_push(skb, offset);
        NAPI_GRO_CB(skb)->same_flow = 0;
        NAPI_GRO_CB(skb)->flush = 1;
        return NULL;
}

int wg_socket_init(struct wg_device *wg, u16 port)
{
        struct net *net;
        int ret;
        struct udp_tunnel_sock_cfg cfg = {
                .sk_user_data = wg,
                .encap_type = 1,
                .encap_rcv = wg_receive,
                .gro_receive = wg_gro_receive,
        };
        ...
}
```

- The UDP socket registers the GRO handler via .gro\_receive
- The GRO handler pushes data packets directly for decryption when the GRO engine is invoked from the physical device
- ESP GRO added the INET\_ESP\_OFFLOAD Kconfig knob. Do we need a similar Kconfig knob for WireGuard, or an attribute during device creation?
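
To make the candidate check concrete, here is a user-space sketch of the same classification applied to a raw UDP payload. It assumes the WireGuard data message layout (little-endian 32-bit type field with value 4, a 16-byte header, and a 16-byte authentication tag). The `WG_*` constants and the `is_wg_data_message()` helper are hypothetical names for this example, not kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WG_MESSAGE_DATA    4u  /* data message type (assumed from the protocol) */
#define WG_DATA_HEADER_LEN 16u /* type (4) + receiver index (4) + counter (8) */
#define WG_AUTH_TAG_LEN    16u /* Poly1305 tag on every data message */
#define WG_DATA_MIN_LEN    (WG_DATA_HEADER_LEN + WG_AUTH_TAG_LEN)

/* Read a little-endian u32 regardless of host byte order. */
static uint32_t load_le32(const uint8_t *p)
{
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* True if the payload looks like a data message that could be decrypted early. */
bool is_wg_data_message(const uint8_t *udp_payload, size_t len)
{
        if (len < sizeof(uint32_t))
                return false; /* too short to even carry a type field */
        return load_le32(udp_payload) == WG_MESSAGE_DATA &&
               len >= WG_DATA_MIN_LEN;
}
```

Handshake and cookie messages fail this check and continue through the regular UDP socket path, so only the high-volume data path takes the shortcut.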



Back to back: AMD Ryzen 9 3950X @ 3.5 GHz, 128G RAM @ 3.2 GHz, PCIe 4.0, ConnectX-6 Dx, mlx5 driver, striding mode, LRO, 1500 and 8k MTU

- Rationale: Cilium creates a single cilium\_wg0 device for all east-west Pod/Pod traffic
  - BPF datapath basically bpf\_redirect()'s to cilium\_wg0
- Question: How well does it scale when multiple parallel flows hit cilium\_wg0?

### TCP stream multi flow host to host over wire, 8k MTU (higher is better)




- Potential improvements?
  - Creating multiple WireGuard devices under a bond and then load-balancing based on hash
    - Currently not possible because bond is an L2 device and WireGuard is L3
    - WireGuard lacks .ndo\_set\_mac\_address, and the bond still refuses the device after adding a dummy implementation
    - Probably a new bond mode is needed (?), or fixups for when bond and slave device are both in NOARP mode. Would be nice for bpf\_redirect() in the datapath.
  - Creating multiple WireGuard devices and load-balancing via multipath next hops
    - Works in terms of routing, but WireGuard shows unexpected behavior

```
# ip r
default via 192.168.1.1 dev enp5s0 proto dhcp src 192.168.1.119 metric 100
10.0.0.0/24 dev enp10s0f0np0 proto kernel scope link src 10.0.0.2
10.1.0.1
        nexthop dev wg0 weight 1
        nexthop dev wg1 weight 1
```


- Several WireGuard devices on same node, options tried:
  - Different listen-port but otherwise same peer key/endpoint/allowed-ip settings?
    - Currently buggy: allowed-ips overridden/removed to "none":
      - `peer: xvYlNOXRTf30caylpH5EFgEYluKY0Zp1bZkFEDIfE1I=`
      - `endpoint: 10.0.0.1:9001`
      - `allowed ips: (none)`
  - Different listen-port and different key-pairs, but same endpoint/allowed-ip settings?
    - Same behavior as above (needs fixing)
- What about a WireGuard mode to have inner hash part of outer src port?
  - Downside: Exposes information of different flows, assumes single wg dev per host
- Workaround for test: all properties different (key-pairs/endpoint/allowed-ip)
  - This works for testing the idea, but is not practical for production
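
The "inner hash as part of outer src port" idea can be sketched in user space as follows. This is an illustration under assumptions, not WireGuard code: `struct flow_tuple`, `flow_hash()` and `outer_src_port()` are hypothetical names, FNV-1a stands in for whatever flow hash the kernel would supply, and the port range is an arbitrary choice. The range scaling is in the spirit of how other UDP tunnels derive a source port from a flow hash.

```c
#include <stdint.h>

/* Hypothetical inner flow key (addresses and ports in host byte order). */
struct flow_tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t proto;
};

/* Fold the low nbytes of value into an FNV-1a hash, byte by byte. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t value, unsigned int nbytes)
{
        for (unsigned int i = 0; i < nbytes; i++) {
                h ^= (value >> (8 * i)) & 0xff;
                h *= 16777619u;
        }
        return h;
}

/* Stand-in flow hash; hashes fields one by one to avoid struct padding. */
uint32_t flow_hash(const struct flow_tuple *f)
{
        uint32_t h = 2166136261u;

        h = fnv1a_mix(h, f->saddr, 4);
        h = fnv1a_mix(h, f->daddr, 4);
        h = fnv1a_mix(h, f->sport, 2);
        h = fnv1a_mix(h, f->dport, 2);
        h = fnv1a_mix(h, f->proto, 1);
        return h;
}

/* Scale a 32-bit hash into the inclusive port range [min, max], min <= max. */
uint16_t outer_src_port(uint32_t hash, uint16_t min, uint16_t max)
{
        uint32_t range = (uint32_t)(max - min) + 1;

        return (uint16_t)(min + (uint16_t)(((uint64_t)hash * range) >> 32));
}
```

With distinct outer source ports, a receiver hashing on the outer UDP 4-tuple can spread inner flows across RX queues. As the slide notes, the cost is that an observer can tell inner flows apart from the outer header alone.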

### TCP stream multi flow host to host over wire, 8k MTU (higher is better)




### Transactions per second host to host over wire, 8k MTU (higher is better)




### Other findings from testing:



#### wg: decrypt\_packet():





### Other TODO items

- Once RSS is solved, experiment with CPU locality in terms of encryption/decryption
- \_\_cacheline\_group\_begin/end for data that is mostly used by RX or TX in the hot path
- Atomic queue counter shared across CPUs
- Complete removal of wg driver segmenting skbs?
  - Probably not possible due to nonce as part of wg header
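
The nonce constraint in the last item can be illustrated with a user-space sketch of the data message header, assuming the WireGuard wire format (little-endian type, receiver index, and 64-bit counter, where the counter also serves as the AEAD nonce). The `store_le32()`, `store_le64()` and `wg_write_data_header()` helpers are hypothetical names for this example.

```c
#include <stdint.h>

#define WG_MESSAGE_DATA    4u  /* data message type (assumed from the protocol) */
#define WG_DATA_HEADER_LEN 16u /* type (4) + receiver index (4) + counter (8) */

/* Store integers in little-endian order regardless of host byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
        for (int i = 0; i < 4; i++)
                p[i] = (uint8_t)(v >> (8 * i));
}

static void store_le64(uint8_t *p, uint64_t v)
{
        for (int i = 0; i < 8; i++)
                p[i] = (uint8_t)(v >> (8 * i));
}

/* Write one data message header and return the next unused counter. */
uint64_t wg_write_data_header(uint8_t hdr[WG_DATA_HEADER_LEN],
                              uint32_t receiver_index, uint64_t counter)
{
        store_le32(hdr, WG_MESSAGE_DATA);
        store_le32(hdr + 4, receiver_index);
        store_le64(hdr + 8, counter);
        return counter + 1;
}
```

Every packet on the wire carries its own counter and is authenticated on its own, so a super-packet of N segments consumes N counters and needs N headers. That is why segmentation after encryption, for example by NIC hardware, does not fit the format as it stands.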



### Cilium WireGuard integration: future? (~KubeCon'24)



### Acknowledgements

- Jason A. Donenfeld (WireGuard)
- Jordan Whited & James Tucker (WireGuard-go improvements)
- Sebastian Wicki (initial Cilium integration co-author)
- Cilium, netdev & BPF communities

### Thanks! Questions?

- Cilium + WireGuard: https://docs.cilium.io/en/stable/security/network/encryption-wireguard/
- PoC code: https://github.com/cilium/linux/commits/pr/wg



