I gave a talk at FOSDEM 2020 on the topic of this blog post, with some additional information on the BPF algorithms used in strace. The slides are hosted on this site and the recording on FOSDEM’s website.
On my machine, stracing a kernel build with the --seccomp-bpf option makes it about twice as fast to list connect syscalls [1].
It only slightly slows the build.
Of course, the actual speedup will depend on your machine and the traced workload.
My kernel build is limited by disk writes, so you can expect larger speedups on workloads that are less I/O-bound.
Under the Hood
How does it work? How does a sandboxing facility like seccomp help improve the performance of a debugger?
Seccomp-bpf was introduced in Linux v3.5 by Will Drewry.
It allows userspace processes to attach a cBPF program to the seccomp hook, right before executing syscalls, to decide which syscalls should be allowed or denied.
The cBPF program returns SECCOMP_RET_ALLOW or SECCOMP_RET_KILL to allow or deny a syscall.
Alternatively, it can return SECCOMP_RET_TRACE to notify a ptracer, a process attached to the process doing the syscalls (the tracee).
Why would you want to notify a ptracer process? The use case then was for a sandbox to give control to a userspace process so that it could parse syscall arguments without the limitations of seccomp-bpf [2].
So once you know that the strace process is a ptracer process, how --seccomp-bpf works becomes quite evident: it defines a cBPF program that returns SECCOMP_RET_TRACE for any syscall strace is interested in and SECCOMP_RET_ALLOW for others.
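To make this concrete, here is a minimal sketch of such a filter in C, assuming the only syscall of interest is connect. It is not the program strace actually generates. The names trace_connect_filter, run_filter, and trace_connect_action are made up for this example, and run_filter is a toy userspace evaluator that only understands the three opcodes used here, so that the filter's decision can be checked without installing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/syscall.h>    /* __NR_* syscall numbers */

/* Hypothetical filter: notify the ptracer for connect, allow the rest. */
const struct sock_filter trace_connect_filter[] = {
    /* A = seccomp_data.nr (the syscall number) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* if (A == __NR_connect) go to the next instruction, else skip one */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_connect, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy evaluator for illustration only: it handles just the opcodes used
 * above. The kernel's cBPF interpreter supports many more. */
uint32_t run_filter(const struct sock_filter *prog, size_t len, int nr)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            /* This sketch can only load the syscall number. */
            assert(insn->k == offsetof(struct seccomp_data, nr));
            acc = (uint32_t)nr;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Jump offsets are relative to the next instruction. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            assert(!"opcode not supported by this sketch");
        }
    }
    return SECCOMP_RET_KILL; /* ran off the end: deny, in this sketch */
}

/* Action the hypothetical filter returns for syscall number nr. */
uint32_t trace_connect_action(int nr)
{
    size_t len = sizeof(trace_connect_filter) / sizeof(trace_connect_filter[0]);

    return run_filter(trace_connect_filter, len, nr);
}
```

With this filter, trace_connect_action(__NR_connect) yields SECCOMP_RET_TRACE and any other syscall number yields SECCOMP_RET_ALLOW. A real filter should also check seccomp_data.arch before trusting the syscall number, which this sketch skips for brevity.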
Why does it speed up strace?
Well, strace usually behaves as a very normal ptracer: it intercepts all syscall entries and exits, with two context switches per syscall.
These slow the tracee a lot.
With --seccomp-bpf, we only switch to the strace process in userspace for syscalls the user actually wants to see.
When the strace process is done decoding and displaying the syscall, it can restart the tracee in the kernel with the PTRACE_CONT request.
To perform the same action, but stop the tracee at the next syscall entry or exit, the strace process can use PTRACE_SYSCALL.
Without --seccomp-bpf, strace mostly uses PTRACE_SYSCALL.
When we pass the --seccomp-bpf option, we restart the tracee with PTRACE_CONT at syscall exits and rely on the cBPF program to notify us at the next syscall of interest.
We can’t do the same at syscall entries, however.
Since the cBPF program can only notify us of syscall entries, we need to restart with PTRACE_SYSCALL to stop at the syscall exit.
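The restart rule described above can be sketched as a small decision function. This is an illustration of the logic, not strace's actual code: restart_request and enum stop_kind are names invented for this example, while PTRACE_CONT and PTRACE_SYSCALL are the real ptrace requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/ptrace.h>

/* Hypothetical classification of where the tracee is currently stopped. */
enum stop_kind {
    STOP_AT_SYSCALL_ENTRY,
    STOP_AT_SYSCALL_EXIT,
};

/* Returns the ptrace request to restart the tracee with. */
int restart_request(enum stop_kind stop, bool seccomp_bpf)
{
    assert(stop == STOP_AT_SYSCALL_ENTRY || stop == STOP_AT_SYSCALL_EXIT);

    /* Classic mode: always ask to be stopped at the next entry or exit. */
    if (!seccomp_bpf)
        return PTRACE_SYSCALL;

    /* At an entry we still want to see the matching exit, and the filter
     * cannot notify us of exits, so we must use PTRACE_SYSCALL. */
    if (stop == STOP_AT_SYSCALL_ENTRY)
        return PTRACE_SYSCALL;

    /* At an exit we let the tracee run freely; the filter will notify us
     * at the next syscall of interest. */
    return PTRACE_CONT;
}
```

The returned value would be passed as the first argument to ptrace() when resuming the tracee. The only case that differs from classic tracing is the syscall exit, which is where the uninteresting syscalls get skipped.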
There are two main limitations with the current --seccomp-bpf implementation.
They relate to the fact that seccomp-bpf was originally meant for sandboxing and not tracing.
First, when using --seccomp-bpf, all child processes of the tracee are also traced (same as using -f).
We don’t have a choice.
Once we attach a seccomp-bpf program to a process, all children inherit it.
If these child processes are stopped by a seccomp-bpf program with SECCOMP_RET_TRACE and don’t have a ptracer attached, it won’t end well: the syscall will error with ENOSYS.
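The failure mode for an untraced process can be reproduced with a short C sketch. It forks a child that installs a filter returning SECCOMP_RET_TRACE for getppid (an arbitrary choice for this example), calls getppid with no ptracer attached, and reports the resulting errno through its exit status. The function name errno_of_traced_syscall_without_tracer is made up, and the sketch assumes the environment lets unprivileged processes install seccomp filters.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno the child saw for getppid, 0 if the syscall succeeded,
 * or -1 if the experiment could not be set up (e.g. seccomp unavailable). */
int errno_of_traced_syscall_without_tracer(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Exit status 255 is reserved as the "setup failed" sentinel. */
    assert(ENOSYS != 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: no_new_privs lets us install a filter unprivileged. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);

        /* No ptracer is attached, so nobody handles the notification. */
        errno = 0;
        long ret = syscall(__NR_getppid);
        _exit(ret == -1 ? errno : 0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

On a Linux system where the filter can be installed, the function should return ENOSYS, which is why every process that inherits the filter needs a ptracer.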
Second, --seccomp-bpf does not work on processes attached with strace -p [pid] (processes that already exist).
The Linux kernel simply doesn’t provide a way to attach seccomp-bpf programs to existing processes.
--seccomp-bpf is currently an experimental feature, but if it proves successful, we may enable it by default.
When/if we do, all strace users will get a transparent performance boost.
strace --seccomp-bpf still stops at each syscall of interest and should therefore not be used on production systems!
I’m working on some improvements for the syscall matching algorithm of the cBPF program, which I may describe in a later post.
I’m also hoping to lose at least one of the above limitations in a future version of strace.
This is likely not my last strace post :-)
Thanks to Chen for the original version of the --seccomp-bpf patchset, Dmitry and Eugene for code reviews, and Céline and Yoann for their reviews of this post.
[1] Yes, there are a few connect syscalls during a kernel build. Only to Unix sockets though.
[2] The main limitation is the inability to examine syscall arguments passed by pointers. eBPF wouldn’t have changed that, had it existed. Because of where in the stack the BPF programs are executed, any check on such arguments would be vulnerable to time-of-check-to-time-of-use (TOCTTOU) races.