The Cost of BPF Tail Calls

For an upcoming blog post, I wanted to measure the cost of BPF tail calls. Tail calls allow you to jump from one BPF program to another. Their overhead varied a lot in recent kernels, with a first increase caused by Spectre mitigations and a decrease thanks to improvements in Linux 5.5.

In this blog post, I’ll quickly go over how I measured that overhead for diverse configurations and will then present the results.

An evaluation of tail call costs was presented before at Linux Plumbers 2020. That evaluation focused on extracting realistic numbers for Cloudflare’s environment on two Linux releases, whereas I’m interested in variations across a larger set of kernel releases. I am therefore using cheaper but less realistic hardware.

Methodology

I wanted to measure the cost of both a single tail call and a chain of tail calls. The maximum number of chained tail calls is 33¹, so I wrote the following BPF program.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") progs = {
	.type = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),
	.max_entries = 34,
};

/* Macro to define the Xth BPF program, in its own section. */
#define PROG(X) SEC("action/prog" #X)		\
int bpf_prog ## X(void *ctx) {			\
	bpf_tail_call(ctx, &progs, X+1);	\
	return 0;				\
}

PROG(0)
...
PROG(33)

Then, from userspace, we can fill the tail call map such that a given number of tail calls are performed. For example, to perform 2 tail calls, we update the map such that:

1 maps to program action/prog1;
2 maps to program action/prog2.

When running program action/prog0, it will tail call to action/prog1, which will tail call to action/prog2. action/prog2 will attempt a tail call using 3 as index, fail as there are no such map entry, and exit.

Finally, I rely on BPF_PROG_TEST_RUN to run the chain of programs and retrieve the mean runtime.

Results

I ran all measurements on a low-end t1.small.x86 Packet server with the following specs, and used an Ansible playbook to automate and parallelize the installation of Linux releases.

1x Intel Atom C2750 @ 2.4GHz
Turbo frequency (enabled): 2.6 GHz
8GB of RAM

I first measured the cost of tail call chains of varying lengths on the last three LTS Linux releases, plus v5.5 for which better performance is expected. I instructed BPF_PROG_TEST_RUN to run the chain of programs 100 million times and repeated the experiment 10 times², taking the average over all runs.

In this first plot, retpoline is enabled. We can observe that the difference between the three apparent lines is not very visible until we reach 6 chained tail calls. For shorter tail call chains, differences in measurements tend to be covered by noise.

The performance of v4.19 varies a lot for long tail call chains, causing its mean to be higher than v5.4’s. Those variations seem be have been introduced at some point between v5.2 and v5.3. I did not investigate further³ as I’m more interested in changes introduced by v5.5 and v5.10.

The lower cost of tail calls on v5.5 is expected and is the result of work to compile direct tail calls into jumps, which removed the need for a retpoline. That work is best covered in Cilium v1.7’s blog post. The increased cost in v5.10 is however unexpected.

In the next plot, we’ll zoom in on the apparent performance regression in v5.10 and compare our numbers to those with retpoline disabled. We’ll focus on chains of 33 tail calls, but you can use the slider below the figure to observe shorter chains.

This time we can see a clear difference with and without retpoline before v5.5. Version 5.5 eliminates the difference and lowers the overall cost.

We can also observe that the performance regression was introduced in v5.10; previous releases have similar performance to v5.5’s. Further bisecting narrowed down the regression to commit ebf7d1f (“bpf, x64: rework pro/epilogue and tailcall handling in JIT”). That commit reworked the x86 JIT compiling of tail calls to enable their combination with BPF function calls.

Looking at the code changes, it’s unclear why they would lower performance. I’ll have to retrieve some perf samples (and maybe write a second blog post if we solve it).

A Note on the Standard Deviation

I started this blog post several months ago. It took me a while to finish it because I initially wanted to measure the standard deviation of tail call costs. Easy right?

BPF_PROG_TEST_RUN runs 100 million times and reports the mean of runtimes. If I compute the standard deviation of BPF_PROG_TEST_RUN’s results, I’ll get the standard deviation of means and not the standard deviation itself. The standard deviation of the mean runtimes doesn’t tell us anything on the runtime variations.

I tried to extend BPF_PROG_TEST_RUN to report the standard deviation, but whichever way I implemented it, it impacted the results a lot. Thus, I eventually decided to not include any standard deviation at all in the plots, to not give a false impression of stability. Except for v4.19, the runtime means are very stable, but that doesn’t imply that the actual runtimes are.

Conclusion

To sum up, the performance of tail calls improved a lot in v5.5 and worsened a bit in v5.10. The changes appear to be smaller on the hardware I used than on other hardware. For example, on my i7-9750H CPU clocked at 2.60GHz, the mean runtime decreases by ~10x in v5.5 and increases back by ~4x in v5.10.

Obviously this BPF program isn’t representative of production workloads. It’s hard to say to what extent the performance changes we observed here would affect production workloads.

BPF tail calls are most often used to tackle BPF complexity issues, which I’ll cover in an upcoming blog post :-)

Thanks Céline and Quentin for the reviews!

It looks like the maximum number of tail calls was intended to be 32, but ended up being 33 because of an off-by-one error. The few JIT compilers using a limit of 32 were later updated to 33 for compatibility and consistency. ↩
On the slowest kernel versions, it took 4h to complete. The playbook runs the experiment for each kernel version in parallel on different machines. ↩
Bisecting variations is very time consuming since you need to repeat each experiment many times. I did check that 3193c08 is not the culprit. ↩