This post focuses on the XFRM building blocks Cilium uses to provide its IPsec support. Therefore, only tunnel mode and ESP are discussed, XFRM devices are not described, and particular attention is given to the use of packet marks in XFRM policies and states.

Several others have written on XFRM, usually with a slightly different focus. James Bottomley gave a quick introduction on his blog, with examples of configurations. Andrej Stender’s blog has a very detailed description of the typical packet paths for IPsec gateways.

This guide was initially published as part of the Cilium documentation. I’m republishing it as I have a bit more freedom here (vs. reStructuredText) to format and display the content.

If you find mistakes, you can report them by email or via the other contact methods listed at the bottom.

Overview

IPsec encryption in the Linux kernel relies on XFRM. XFRM is an IP framework intended for packet transformations, from encryption to compression. It is configured via a set of policy and state objects, which for IPsec, correspond to Security Policies and Security Associations.

XFRM Policies and States

At a high level, XFRM policies define what traffic to accept and reject, whereas states define how to perform the encryption and decryption. Policies can match on the direction (out, in, or fwd), the source and destination IP addresses with CIDRs, and the packet mark. As an example, the following policy matches egressing packets with any source IP address, 10.56.1.X destination IP addresses, and 0xcb93eXX packet marks. Policies default to allowing traffic, as done here.

src 0.0.0.0/0 dst 10.56.1.0/24 
	dir out priority 0 
	mark 0xcb93e00/0xffffff00 
	[...]
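
To make the selector semantics concrete, here is a minimal Python sketch of how the example policy above would match a packet. The function names are hypothetical and the direction is not modeled; it only illustrates the CIDR check and the value/mask comparison on the packet mark.

```python
import ipaddress

def mark_matches(packet_mark: int, value: int, mask: int) -> bool:
    """Compare only the bits selected by the mask."""
    return (packet_mark & mask) == (value & mask)

def policy_matches(src: str, dst: str, packet_mark: int) -> bool:
    """Selector check for the example policy (hypothetical helper)."""
    src_ok = ipaddress.ip_address(src) in ipaddress.ip_network("0.0.0.0/0")
    dst_ok = ipaddress.ip_address(dst) in ipaddress.ip_network("10.56.1.0/24")
    return src_ok and dst_ok and mark_matches(packet_mark, 0x0CB93E00, 0xFFFFFF00)

# The low byte of the mark is ignored by the 0xffffff00 mask.
print(policy_matches("10.56.0.17", "10.56.1.238", 0x0CB93E42))
# The destination is outside 10.56.1.0/24, so the policy does not match.
print(policy_matches("10.56.0.17", "10.56.2.1", 0x0CB93E42))
```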

States are relatively similar, except that they are agnostic to the direction and can only match on exact IP addresses (or 0.0.0.0 to match all). The following state will apply to packets with IP addresses 10.56.0.17 -> 10.56.1.238 and the same packet marks as above. In the case of tunnel-mode IPsec, these IP addresses correspond to the outer IP addresses. For ingressing, encrypted packets, the SPI will also be used (discussed below).

src 10.56.0.17 dst 10.56.1.238
	proto esp spi 0x00000003 reqid 1 mode tunnel
	replay-window 0 
	mark 0xcb93e00/0xffffff00 output-mark 0xe00/0xffffff00
	aead rfc4106(gcm(aes)) 0x6254fced5f7a5ea9401b9015ecf10d65eac51a69 128
	anti-replay context: seq 0x0, oseq 0x36, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0

You may notice that nothing specifies if this state should perform encryption or decryption. That’s because it can actually do both. As said above, states are agnostic to the direction of traffic so the same state may theoretically be used for both encryption and decryption. What to do will be determined based on where in the stack the state is matched (ex., decryption on ingress).

Policy Templates

XFRM policies also typically define a template, as below:

src 0.0.0.0/0 dst 10.56.1.0/24 
	dir out priority 0 
	mark 0xcb93e00/0xffffff00 
	tmpl src 10.56.0.17 dst 10.56.1.238
		proto esp spi 0x00000003 reqid 1 mode tunnel

How this template is used depends on the direction. For egressing traffic, the template defines the encoding to perform. For example, the above template will encapsulate packets with an IP header and an ESP header. The IP header will have IP addresses 10.56.0.17 and 10.56.1.238. The ESP header will have SPI 3.

For ingressing and forwarded traffic however, the template acts as an additional filter. The following XFRM policy for example will only allow packets if they are ESP packets with outer IP addresses 10.56.1.238 and 10.56.0.17, in addition to having a packet mark matching 0xd00/0xf00.

src 0.0.0.0/0 dst 10.56.0.0/24 
	dir in priority 0 
	mark 0xd00/0xf00 
	tmpl src 10.56.1.238 dst 10.56.0.17
		proto esp reqid 1 mode tunnel

The template of XFRM OUT policies points to the XFRM state to use for encryption. The IP addresses, the SPI, the protocol, the mode, and the reqid should all match between the XFRM state and the template.
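
As a rough illustration of that rule, the sketch below compares a template and a state field by field, using the values from the examples above. The `Template` and `State` classes and the `template_selects` helper are hypothetical, and the comparison is a strict equality: wildcards (such as a 0.0.0.0 address or an SPI of 0 in a template) are deliberately not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    src: str
    dst: str
    proto: str
    spi: int
    reqid: int
    mode: str

@dataclass(frozen=True)
class State:
    src: str
    dst: str
    proto: str
    spi: int
    reqid: int
    mode: str

def template_selects(tmpl: Template, state: State) -> bool:
    """True if every field shared by the template and the state is equal."""
    fields = ("src", "dst", "proto", "spi", "reqid", "mode")
    return all(getattr(tmpl, f) == getattr(state, f) for f in fields)

tmpl = Template("10.56.0.17", "10.56.1.238", "esp", 0x3, 1, "tunnel")
good = State("10.56.0.17", "10.56.1.238", "esp", 0x3, 1, "tunnel")
stale = State("10.56.0.17", "10.56.1.238", "esp", 0x2, 1, "tunnel")  # older SPI

print(template_selects(tmpl, good))
print(template_selects(tmpl, stale))
```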

XFRM Packet Flows

IPsec and XFRM are represented in the usual Linux networking diagram. There are however several errors in that diagram when it comes to XFRM [1], so I decided to write a new one. It takes inspiration from Andrej Stender’s diagrams and simplifies the overall flow to focus on just what I want to explain here. All pieces related to XFRM are in purple, routing decisions in orange, and the rest in yellow.

[Figure: Packet flow in Netfilter and XFRM subsystems.]


Egress Packet Flow

On egress, packets will first hit one of the XFRM OUT policy blocks. At this point, a lookup is performed against the XFRM OUT policies. If a match is found, the packet goes to the XFRM encode block and the template is used to look up XFRM states. If a state is found, its information is used to encrypt the packet.

The encrypted packet will then navigate again through the OUTPUT and POSTROUTING chains.

Ingress Packet Flow

On ingress, encrypted packets (ex., ESP packets) will hit the XFRM decode after they navigate through the INPUT chain.

In tunnel mode, encrypted packets will typically have one of the server’s IP addresses as the outer destination address, so they should automatically be routed through the INPUT chain. If not, it may be necessary to add IP routes to redirect packets to the INPUT chain. As an example, Cilium identifies IPsec traffic on tc-bpf ingress and marks it with a special value, which is then used to reroute those packets to the INPUT chain.

At the XFRM decode, if packets match an XFRM state, they will be decoded (i.e., decapsulated and decrypted) using the state’s information. The match is based on the source & destination addresses, the mark, the SPI, and the protocol. In case of any decoding error (ex., wrong key), the packet is dropped and an error counter is increased.

As illustrated on the diagram, an XFRM policy matching the packet isn’t required for the decoding to happen (it goes directly to XFRM decode), but it is required for the packet to proceed to a local process or through the FORWARD chain. An XFRM policy with an optional template (i.e., level use) will allow all decrypted packets through. Traffic that was never encrypted, and therefore does not come from XFRM decode, is allowed by default.

After a packet is decrypted, it is recirculated in the stack, as if coming from the interface it was initially received on. More specifically, packets are recirculated before the tc layer, such that they are visible on the tc-bpf hook a second time (once before decryption, once after). The packet mark is preserved when recirculated, so it’s possible to identify and trace packets that have been decrypted and recirculated. The packet mark can also be modified during decryption, using the output-mark field of the XFRM states.
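
The effect of output-mark on the packet mark can be sketched as follows. This assumes the usual value/mask semantics: bits covered by the mask are replaced by the value, and the other bits of the mark are preserved. The helper name is hypothetical and this is an illustration of that assumption, not the kernel code.

```python
def apply_output_mark(mark: int, value: int, mask: int) -> int:
    """Replace the masked bits of a 32-bit mark with `value`, keep the rest."""
    return (mark & ~mask & 0xFFFFFFFF) | (value & mask)

# State from the earlier example: output-mark 0xe00/0xffffff00.
# Only the low byte of the original mark survives.
new_mark = apply_output_mark(0x0CB93E42, 0xE00, 0xFFFFFF00)
print(hex(new_mark))  # prints 0xe42
```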

Output Description of ip xfrm

The example outputs below are from iproute2-6.1.0. More fields will likely appear in newer versions. For example, XFRM states have a dir field in newer kernels (v6.10+), which will likely appear in the ip xfrm state output at some point.

In the ip xfrm output, policies are ordered by date of creation, with newer policies at the top. This is important because, in case two policies match a packet and have the same priority, the newest one is used.

$ ip -s xfrm policy
src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
  dir fwd action allow index 18 priority 2975 share any flag (0x00000000)
  lifetime config:
    limit: soft (INF)(bytes), hard (INF)(bytes)
    limit: soft (INF)(packets), hard (INF)(packets)
    expire add: soft 0(sec), hard 0(sec)
    expire use: soft 0(sec), hard 0(sec)
  lifetime current:
    0(bytes), 0(packets)
    add 2024-06-17 11:24:49 use 2024-06-17 11:25:01
  tmpl src 0.0.0.0 dst 10.92.0.164
    proto esp spi 0x00000000(0) reqid 1(0x00000001) mode tunnel
    level use share any
    enc-mask ffffffff auth-mask ffffffff comp-mask ffffffff

The fields have the following meaning:

  • src 0.0.0.0/0: The CIDR to match against the source IP address.
  • dst 0.0.0.0/0: The CIDR to match against the destination IP address.
  • dir fwd: States the direction. It defines where in the Linux stack this policy will be used, between ingress, egress, and forwarding.
  • action allow: The action to take on matching packets. Packets can only be allowed through (by default) or dropped.
  • index 18: Used to differentiate between different policies which might have the same or overlapping selectors. If not given or if it already exists, it is automatically (re-)generated (cf., `xfrm_gen_index`). The three LSBs encode the direction (ex., 1 for `XFRM_POLICY_OUT`). The MSBs are simply incremented by one (that is, the index is incremented by 8) until a free index is found.
  • priority 2975: States the priority for this policy in case multiple could match the packet. 0 is the highest priority.
  • share any: Always set to `any` and unused today.
  • flag (0x00000000): Set of flags for XFRM policies. Only `XFRM_POLICY_ICMP` (0x2) is supported at the moment; `XFRM_POLICY_LOCALOK` (0x1) is not implemented (anymore?). When `XFRM_POLICY_ICMP` is given, the policy will also apply to ICMP packets whose payload packet matches the policy's selector.
  • lifetime config, limit (bytes and packets): Not implemented and not enforced.
  • lifetime config, expire add and expire use: Various expiration times for the policy, based on the time since the policy was added or the time since the policy was last matched by a packet. When a soft expiration time is reached, a notification is sent to userspace via netlink (`struct xfrm_user_expire`). When a hard limit or expiration time is reached, the policy is deleted.
  • lifetime current, bytes and packets: Not implemented; will always be 0.
  • lifetime current, add and use: Timestamps for when the policy was added and when it was last matched by a packet, to be used if expiration times have been set.
  • tmpl src, dst, proto, spi, reqid, mode: See Policy Templates for how these fields are used.
  • level use: The nonsensical way to indicate this template is optional, the alternative being `level required`. If no XFRM state matching the template is found, the template will be skipped if optional. Otherwise, the packet will be dropped with `XfrmInTmplMismatch`.
  • share any (in the template): Not implemented and will always be `any`.
  • enc-mask ffffffff: Bit mask defining the list of allowed encryption algorithms. See Encryption algorithms in include/uapi/linux/pfkeyv2.h for the list of possible values.
  • auth-mask ffffffff: Bit mask defining the list of allowed authentication algorithms. See Authentication algorithms in include/uapi/linux/pfkeyv2.h for the list of possible values.
  • comp-mask ffffffff: Non-implemented bit mask (was probably defined for compression algorithms).

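
The encoding of the policy index can be illustrated with a short Python sketch. The direction values (0 for in, 1 for out, 2 for fwd) are my understanding of the kernel's UAPI enum; only the value 1 for `XFRM_POLICY_OUT` is stated in this post. The `next_free_index` helper is a hypothetical simplification of what `xfrm_gen_index` is described as doing, not a port of it.

```python
XFRM_POLICY_IN, XFRM_POLICY_OUT, XFRM_POLICY_FWD = 0, 1, 2  # assumed enum values
DIR_NAMES = {XFRM_POLICY_IN: "in", XFRM_POLICY_OUT: "out", XFRM_POLICY_FWD: "fwd"}

def index_direction(index: int) -> str:
    """The three least significant bits of the index encode the direction."""
    return DIR_NAMES.get(index & 7, "unknown")

def next_free_index(direction: int, used: set) -> int:
    """Step by 8 (i.e., increment the MSBs by one) until a free index is found."""
    index = direction
    while index == 0 or index in used:
        index += 8
    return index

# The policy shown above has index 18 and direction fwd: 18 = 2 * 8 + 2.
print(index_direction(18))
print(next_free_index(XFRM_POLICY_FWD, {2, 10}))
```
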
$ ip -s xfrm state
src 10.92.1.189 dst 10.92.0.164
  proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
  replay-window 0 seq 0x00000000 flag (0x00000000)
  mark 0x4db50d00/0xffff0f00 output-mark 0xd00/0xffffff00
  aead rfc4106(gcm(aes)) 0x856f15d0ccabe682286b4286bccf5d595b88b168 (160 bits) 128
  anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
  sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
  lifetime config:
    limit: soft (INF)(bytes), hard (INF)(bytes)
    limit: soft (INF)(packets), hard (INF)(packets)
    expire add: soft 0(sec), hard 0(sec)
    expire use: soft 0(sec), hard 0(sec)
  lifetime current:
    20124(bytes), 83(packets)
    add 2024-06-17 11:15:48 use 2024-06-17 11:16:02
  stats:
    replay-window 0 replay 0 failed 0

The fields have the following meaning:

  • src 10.92.1.189: The IP address to match against the packets' source IP addresses.
  • dst 10.92.0.164: The IP address to match against the packets' destination IP addresses.
  • proto esp: The IPsec protocol to use.
  • spi 0x00000003(3): The Security Parameter Index. A tag to distinguish between multiple IPsec streams that may be using different algorithms and/or keys. Particularly useful during key rotations.
  • reqid 1(0x00000001): An ID only used to ensure the XFRM policy template and the state match. It doesn't seem to be used for anything else in the kernel.
  • mode tunnel: States whether the packet is encapsulated (`tunnel`) or if the ESP header is simply added to the existing packet (`transport`).
  • replay-window 0: Size of the replay window used for the anti-replay checks (i.e., toleration setting).
  • flag (0x00000000): Holds various flags including `XFRM_STATE_ESN` (0x80) for ESN mode.
  • mark 0x4db50d00/0xffff0f00: The value and mask used to match against the packets' marks.
  • output-mark 0xd00/0xffffff00: The value and mask to apply to the packets' marks after they have been encrypted or decrypted.
  • aead rfc4106(gcm(aes)): The type and name of algorithm in use.
  • 0x856f15d0ccabe682286b4286bccf5d595b88b168 (160 bits): The key and its size. It's of course sensitive information that should be treated as such.
  • 128: The ICV length. Which lengths are supported depends on the algorithm in use.
  • anti-replay context, seq 0x0: Holds the current receive-side sequence number, for the anti-replay check.
  • anti-replay context, oseq 0x0: The last emitted sequence number. If this number overflows (on 32 bits), packets are dropped and the error counter `XfrmOutStateSeqError` is increased. In ESN mode, this sequence number is coded on 64 bits.
  • anti-replay context, bitmap 0x00000000: Tracks the sequence numbers that have already been seen in the replay window.
  • sel src 0.0.0.0/0 dst 0.0.0.0/0: An additional filter applying to the decrypted packets, to ensure the inner packets are coming and going where you expect.
  • uid 0: This field appears to be unused (`user` in `struct xfrm_selector`).
  • lifetime config: Various limits and expiration times for the state, based on the number of bytes received, the number of packets received, the time since the state was added, or the time since the state was last used for a packet. When a soft limit or expiration time is reached, a notification is sent to userspace via netlink (`struct xfrm_user_expire`). When a hard limit or expiration time is reached, the state is deleted.
  • lifetime current, bytes and packets: Counters for bytes and packets matched by this state, to be used if limits have been set.
  • lifetime current, add and use: Timestamps for when the state was added and when it was last matched by a packet, to be used if expiration times have been set.
  • stats, replay-window 0: Incremented whenever a packet is received with a sequence number outside the window.
  • stats, replay 0: Incremented whenever a packet is received with a sequence number in the replay window that was already observed.
  • stats, failed 0: Incremented when the checksums for authentication or encryption headers are incorrect (full name `integrity_failed` on kernel's side). `XfrmInStateProtoError` is always incremented when this counter is incremented.
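
To show how the replay-window size, the seq and bitmap fields, and the replay-window and replay counters relate, here is a simplified sliding-window sketch in Python. It is a textbook-style anti-replay check under my own assumptions, not the kernel implementation: it ignores ESN, does not treat sequence number 0 specially, and assumes a window size of at least 1. The `ReplayWindow` class is hypothetical.

```python
class ReplayWindow:
    def __init__(self, size: int):
        self.size = size     # corresponds to the replay-window setting
        self.seq = 0         # highest sequence number accepted so far
        self.bitmap = 0      # bit i set => sequence number (seq - i) was seen
        self.stats = {"replay-window": 0, "replay": 0}

    def accept(self, seq: int) -> bool:
        if seq > self.seq:
            # Newer than anything seen so far: slide the window forward.
            shift = seq - self.seq
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.seq = seq
            return True
        offset = self.seq - seq
        if offset >= self.size:
            # Too old: the sequence number is outside the window.
            self.stats["replay-window"] += 1
            return False
        if self.bitmap & (1 << offset):
            # Inside the window, but this sequence number was already seen.
            self.stats["replay"] += 1
            return False
        self.bitmap |= 1 << offset
        return True

window = ReplayWindow(size=4)
results = [window.accept(s) for s in (1, 3, 2, 2, 10, 5)]
print(results, window.stats)
```

In this run, the second packet with sequence number 2 is a replay, and sequence number 5 arrives after the window has moved to 10, so it falls outside a window of size 4.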

XFRM Errors

All XFRM errors correspond to packet drops. Some of them may also be associated with per-state counters increasing. CONFIG_XFRM_STATISTICS is required to see these error counters in /proc/net/xfrm_stat.

  • XfrmInError: If the kernel fails to allocate memory during decryption.
  • XfrmInBufferError:
    • If a packet is going through too many XFRM states. The maximum is set to XFRM_MAX_DEPTH (6).
    • If too many XFRM policy templates apply to a packet. The maximum is also set to XFRM_MAX_DEPTH (6).
  • XfrmInHdrError:
    • If the SPI portion of the packet is malformed.
    • If the outer IP header is malformed.
  • XfrmInNoStates: If no XFRM IN state was found that matches the AH or ESP packet ingressing on the INPUT chain.
  • XfrmInStateProtoError:
    • If the AH or ESP checksum is incorrect.
    • If the packet’s IPsec protocol (ex., AH, ESP) doesn’t match the protocol specified by the XFRM state.
    • Also includes all protocol-specific errors (ex., from esp_input) listed below:
      • If decryption/encryption fails (ex., because the key specified in the XFRM IN state doesn’t match the key with which the packet was encrypted).
      • If the protocol headers (ex., ESP) or trailers are malformed.
      • If there is not enough memory to perform encryption/decryption.
  • XfrmInStateModeError: If the packet is in IPsec tunnel mode, but the matched XFRM state is in transport mode.
  • XfrmInStateSeqError: If the anti-replay check rejected the packet. If the check failed because the sequence number was outside the window, the replay-window counter of the associated XFRM state will be incremented. If it failed because the sequence number was seen already, the replay counter is incremented instead.
  • XfrmInStateExpired: There can be a delay between when a state expires (hard limits) and when it’s actually deleted. During that time, matching packets are dropped with XfrmInStateExpired on ingress.
  • XfrmInStateMismatch:
    • If the encapsulation protocol of the XFRM state (ex., espinudp in encap field of ip xfrm state) doesn’t match the encapsulation protocol of the packet.
    • If the decrypted packet doesn’t match the selector (sel field) of the used XFRM state.
  • XfrmInStateInvalid: If a received packet matched an XFRM state that is being deleted or that expired.
  • XfrmInTmplMismatch:
    • If a packet matches an XFRM policy with a non-optional template, but the template doesn’t match any of the XFRM states used to decrypt the packet (yes, a packet can be decoded multiple times).
    • If an XFRM state with mode tunnel was used on the packet and it doesn’t match any XFRM policy template.
  • XfrmInNoPols: If the ingressing packet doesn’t match any XFRM policy and the default action is set to block. See ip xfrm policy {get,set}default to view and set the default XFRM policy actions.
  • XfrmInPolBlock: If the packet matches an XFRM IN policy with action block.
  • XfrmOutError:
    • If the kernel fails to allocate memory during encryption.
    • In some cases, if the packet to encrypt is malformed.
  • XfrmOutBundleCheckError: Unused.
  • XfrmOutNoStates: If the packet matched an XFRM OUT policy, but no XFRM state was found that matches the policy’s template.
  • XfrmOutStateProtoError: If a protocol-specific (ex., ESP) encryption error happens.
  • XfrmOutStateModeError: If the packet exceeds the MTU once encapsulated and it shouldn’t be fragmented.
  • XfrmOutStateSeqError: The output sequence number (oseq) of an XFRM state reached its maximum value, UINT32_MAX when not using ESN mode.
  • XfrmOutStateExpired: There can be a delay between when a state expires (hard limits) and when it’s actually deleted. During that time, matching packets are dropped with XfrmOutStateExpired on egress.
  • XfrmOutPolBlock: If the packet matches an XFRM OUT policy with action block.
  • XfrmOutPolDead: Unused. XfrmOutStateInvalid is reported instead for XFRM states that are in the process of being deleted.
  • XfrmOutPolError:
    • If too many XFRM policy templates apply to a packet. The maximum is also set to XFRM_MAX_DEPTH (6).
    • If no XFRM state is found for a non-optional template of the matching XFRM policy.
  • XfrmFwdHdrError: If the packet is malformed when going through the FWD policy check.
  • XfrmOutStateInvalid: If an egressing packet matched an XFRM state that is being deleted or that expired.
  • XfrmOutStateDirError: If the direction of the XFRM state found during the lookup is defined and isn’t XFRM_SA_DIR_OUT. Only on kernels v6.10 and newer.
  • XfrmInStateDirError: If the direction of the XFRM state found during the lookup is defined and isn’t XFRM_SA_DIR_IN. Only on kernels v6.10 and newer.
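
When troubleshooting, it can help to list only the counters that are non-zero. The sketch below assumes each line of /proc/net/xfrm_stat holds a counter name followed by whitespace and a decimal value. The sample text and its values are made up for illustration, and the function name is hypothetical.

```python
def nonzero_xfrm_counters(text: str) -> dict:
    """Return the counters from xfrm_stat-style text whose value is above zero."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > 0:
            counters[parts[0]] = int(parts[1])
    return counters

# Made-up sample in the assumed "name <whitespace> value" format.
SAMPLE = """\
XfrmInError             \t0
XfrmInNoStates          \t12
XfrmInStateProtoError   \t3
XfrmOutNoStates         \t0
"""
print(nonzero_xfrm_counters(SAMPLE))

# On a real system (requires CONFIG_XFRM_STATISTICS):
# with open("/proc/net/xfrm_stat") as f:
#     print(nonzero_xfrm_counters(f.read()))
```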

Performance Considerations

This section describes the data structures used to hold the XFRM policies and states. This is useful to understand when dealing with a large number of states and policies as the information they hold can help improve indexing and speed up the lookups. When dealing with thousands of policies and states, the lookup cost can become non-negligible even when compared to the encryption/decryption cost.

Data Structure for XFRM Policies

XFRM policies are stored in a rather complex data structure made of multiple red-black trees and hash tables. At the root, everything is contained in a resizable hash table indexed by network namespace, IP family, direction, and interface (in case XFRM interfaces are used). Each entry in this resizable hash table contains several red-black trees, which themselves hold the XFRM policies. Those entries are represented by the structure xfrm_pol_inexact_bin.

graph LR
  table(resizable hashtable) --> bin1(...)
  table(resizable hashtable) --> bin(xfrm_pol_inexact_bin)
  table(resizable hashtable) --> bin2(...)
  bin --> rbtree_dst(root_d)
  bin --> rbtree_src(root_s)
  bin --> list_anyany((any;any))
  rbtree_src --> node_src1(( ))
  rbtree_src --> node_src2(( ))
  node_src1 --> node_src3(( ))
  node_src1 --> node_src4(( ))
  node_src2 --> node_src5(( ))
  node_src2 --> node_src6(( ))
  node_src6 --> list_srcany((src;any))
  rbtree_dst --> node_dst1(( ))
  rbtree_dst --> node_dst2(( ))
  node_dst1 --> node_dst3(( ))
  node_dst1 --> node_dst4(( ))
  node_dst2 --> node_dst5(( ))
  node_dst2 --> node_dst6(( ))
  node_dst5 --> rbtree_dstsrc(root)
  node_dst5 --> list_dstany((any;dst))
  rbtree_dstsrc --> node_dstsrc1(( ))
  rbtree_dstsrc --> node_dstsrc2(( ))
  node_dstsrc1 --> node_dstsrc3(( ))
  node_dstsrc1 --> node_dstsrc4(( ))
  node_dstsrc2 --> node_dstsrc5(( ))
  node_dstsrc2 --> node_dstsrc6(( ))
  node_dstsrc6 --> list_dstsrc((src;dst))

Once xfrm_pol_inexact_bin has been retrieved (based on current IP family, namespace, and direction), each of its red-black trees is looked up using the source and destination IP addresses. The root_s tree contains policies sorted by source IP addresses; the root_d tree contains policies sorted by destination IP addresses. In addition, leaf nodes of the root_d tree also contain another tree with policies sorted by source IP addresses. That allows the lookups into root_s and root_d to return three lists of candidate (src_ip; dst_ip) policies from the leaf nodes:

  • A list of (src_ip; any) candidates from root_s.
  • A list of (any; dst_ip) candidates from root_d.
  • A list of (src_ip; dst_ip) candidates from the trees pointed to by the leaf nodes of root_d.

These three lists of candidate XFRM policies are completed by a list of (any; any) candidates directly stored in the xfrm_pol_inexact_bin entry.

Note that an XFRM policy will only be present in one of the four candidate lists, according to its source and destination CIDRs.

These four lists of candidate XFRM policies are then evaluated. The kernel iterates through each list, looking for the highest-priority (lowest priority number) candidate that matches the packet. If two policies match and have the same priority, the newest one is preferred. It’s also only during this linear evaluation of candidates that the packet mark is compared with the policy marks.
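
The final, linear evaluation step can be sketched as follows. The sketch assumes the four candidate lists have already been retrieved via the tree lookups (so addresses are not re-checked here) and models only what is described above: the mark comparison, the priority, and the tie-break in favor of the newest policy. The `Policy` class, its fields, and `select_policy` are hypothetical.

```python
from dataclasses import dataclass
from itertools import chain

@dataclass
class Policy:
    name: str
    priority: int     # lower number = higher priority
    created: int      # larger value = more recently added
    mark_value: int
    mark_mask: int

def select_policy(candidate_lists, packet_mark: int):
    """Pick the matching candidate with the lowest priority number, newest first on ties."""
    best = None
    for pol in chain.from_iterable(candidate_lists):
        # The packet mark is only compared during this linear scan.
        if (packet_mark & pol.mark_mask) != (pol.mark_value & pol.mark_mask):
            continue
        if best is None or (pol.priority, -pol.created) < (best.priority, -best.created):
            best = pol
    return best

src_any = [Policy("a", 10, created=1, mark_value=0xE00, mark_mask=0xF00)]
any_dst = [Policy("b", 0, created=2, mark_value=0xD00, mark_mask=0xF00)]  # mark differs
src_dst = [Policy("c", 10, created=3, mark_value=0xE00, mark_mask=0xF00)]
any_any = [Policy("d", 50, created=4, mark_value=0, mark_mask=0)]

chosen = select_policy([src_any, any_dst, src_dst, any_any], packet_mark=0xE42)
print(chosen.name)
```

Here policy "b" has the best priority but is skipped because of its mark, and "c" wins over "a" because both have priority 10 and "c" is newer.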

Data Structure for XFRM States

XFRM states are organized in four hash tables, with different XFRM fields used for indexing and different purposes:

  • net->xfrm.state_bydst is indexed by source and destination IP addresses as well as reqid.
  • net->xfrm.state_bysrc is indexed only by source and destination IP addresses.
  • net->xfrm.state_byspi is indexed by destination IP address, SPI, and protocol.
  • net->xfrm.state_byseq is indexed by sequence number only.

net->xfrm.state_byspi is used when looking up an XFRM state for ingressing packets. This makes sense to speed up the search as each XFRM state is encouraged to have its own SPI (cf., RFC4301, section 4.1) and the encrypted packets carry the SPI.

When searching for the XFRM state that corresponds to an XFRM policy template (before encryption), net->xfrm.state_bydst is used. That makes sense because the indexing information is what the XFRM policy template provides. That hash table is typically also the one being used when iterating through all XFRM states (ex., when flushing them), but any hash table would do the job for that.

net->xfrm.state_bysrc and net->xfrm.state_byseq are used for various other management tasks, such as looking up an XFRM state to update, answering a netlink query from the user, or checking for existing states before adding a new one.
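
As a mental model of these four indexes, the sketch below stores each state in four Python dictionaries keyed by the fields listed above. It is only an illustration: the `StateTables` class, the dictionary-based states, and the method names are hypothetical, and real hash tables hash these keys into buckets rather than storing them as tuples.

```python
from collections import defaultdict

class StateTables:
    def __init__(self):
        self.bydst = defaultdict(list)  # (src, dst, reqid)
        self.bysrc = defaultdict(list)  # (src, dst)
        self.byspi = defaultdict(list)  # (dst, spi, proto)
        self.byseq = defaultdict(list)  # seq

    def add(self, state: dict) -> None:
        """Insert the same state into all four indexes."""
        self.bydst[(state["src"], state["dst"], state["reqid"])].append(state)
        self.bysrc[(state["src"], state["dst"])].append(state)
        self.byspi[(state["dst"], state["spi"], state["proto"])].append(state)
        self.byseq[state["seq"]].append(state)

    def lookup_ingress(self, dst: str, spi: int, proto: str) -> list:
        """Ingress: the encrypted packet carries the SPI, so use byspi."""
        return self.byspi.get((dst, spi, proto), [])

    def lookup_for_template(self, src: str, dst: str, reqid: int) -> list:
        """Egress: the policy template provides src, dst, and reqid, so use bydst."""
        return self.bydst.get((src, dst, reqid), [])

tables = StateTables()
tables.add({"src": "10.56.0.17", "dst": "10.56.1.238", "reqid": 1,
            "spi": 0x3, "proto": "esp", "seq": 0})

print(len(tables.lookup_ingress("10.56.1.238", 0x3, "esp")))
print(len(tables.lookup_for_template("10.56.0.17", "10.56.1.238", 1)))
```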

Conclusion

There’s still a lot that would need to be covered for this guide to be a complete reference on XFRM/IPsec in Linux. To cite a few, this guide didn’t cover XFRM interfaces, VTIs, ESP-in-TCP, ESP-in-UDP, XFRM offloads, or per-resource child SAs. Nonetheless, it should constitute a good basis to configure and troubleshoot XFRM configurations for IPsec tunneling.


Thanks to Louis, Gray, and Simone for identifying several mistakes and helping with wordsmithing.


  1. For example, after XFRM decode, packets actually hit the tc/qdisc of the ingress device again, as you can see if you hook into tc-bpf.