//! API bits for the Secure Computing facility in the Linux kernel, which allows
//! processes to restrict access to the system call API.
//!
//! Seccomp started life with a single "strict" mode, which only allowed calls
//! to read(2), write(2), _exit(2) and sigreturn(2). It turns out that this
//! isn't that useful for general-purpose applications, and so a mode that
//! uses user-supplied filters was added.
//!
//! Seccomp filters are classic BPF programs. Conceptually, a seccomp program
//! is attached to the calling thread and is executed by the kernel on each
//! syscall that thread makes. The "packet" being validated is the `data`
//! structure, and the verdict is an action that the kernel performs on the
//! calling process. The actions are variations on a "pass" or "fail" result,
//! where a pass allows the syscall to continue and a fail blocks the syscall
//! and returns some sort of error value. See the full list of actions under
//! ::RET for more information. Finally, only word-sized, absolute loads
//! (`ld [k]`) are supported to read from the `data` structure.
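//!
//! As a rough illustration of the above, a filter can be written out as an
//! array of classic BPF instructions. This is a minimal sketch rather than the
//! standard library's API: `Insn`, `stmt` and `jump` are hypothetical helpers
//! that mirror the kernel's `struct sock_filter`, and the x86_64 values
//! (`AUDIT_ARCH_X86_64`, syscall number 39 for getpid(2)) are assumptions
//! chosen for the example.
//!
//! ```zig
//! // Hypothetical mirror of the kernel's `struct sock_filter`.
//! const Insn = extern struct { code: u16, jt: u8, jf: u8, k: u32 };
//!
//! // Classic BPF opcodes from <linux/bpf_common.h>.
//! const LD_W_ABS: u16 = 0x00 | 0x00 | 0x20; // BPF_LD | BPF_W | BPF_ABS
//! const JEQ_K: u16 = 0x05 | 0x10 | 0x00; // BPF_JMP | BPF_JEQ | BPF_K
//! const RET_K: u16 = 0x06 | 0x00; // BPF_RET | BPF_K
//!
//! fn stmt(code: u16, k: u32) Insn {
//!     return .{ .code = code, .jt = 0, .jf = 0, .k = k };
//! }
//!
//! fn jump(code: u16, k: u32, jt: u8, jf: u8) Insn {
//!     return .{ .code = code, .jt = jt, .jf = jf, .k = k };
//! }
//!
//! // Assumed x86_64 values; other targets differ.
//! const AUDIT_ARCH_X86_64 = 0xC000003E;
//! const SYS_getpid = 39;
//! const EPERM = 1;
//!
//! // Fails getpid(2) with EPERM, allows everything else, and kills the
//! // process when the filter runs under an unexpected architecture.
//! const filter = [_]Insn{
//!     stmt(LD_W_ABS, 4), // A = data.arch (byte offset 4)
//!     jump(JEQ_K, AUDIT_ARCH_X86_64, 0, 4), // wrong arch: skip to the kill
//!     stmt(LD_W_ABS, 0), // A = data.nr (byte offset 0)
//!     jump(JEQ_K, SYS_getpid, 0, 1), // getpid: fall through, else allow
//!     stmt(RET_K, 0x00050000 | EPERM), // RET.ERRNO with EPERM as data
//!     stmt(RET_K, 0x7fff0000), // RET.ALLOW
//!     stmt(RET_K, 0x80000000), // RET.KILL_PROCESS
//! };
//! ```
//!
//! The jump fields are relative to the following instruction, so `jf = 4` on
//! the second instruction lands on the final `RET.KILL_PROCESS`.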
//!
//! There are some issues with the filter API that have traditionally made
//! writing filters a pain:
//!
//! 1. Each CPU architecture supported by Linux has its own unique ABI and
//!    syscall API. It is not guaranteed that the syscall numbers and arguments
//!    are the same across architectures, or that they're even implemented. Thus,
//!    filters cannot be assumed to be portable without consulting documentation
//!    like syscalls(2) and testing on target hardware. This also requires
//!    checking the value of `data.arch` to make sure that a filter was compiled
//!    for the correct architecture.
//! 2. Many syscalls take an `unsigned long` or `size_t` argument, the size of
//!    which is dependent on the ABI. Since BPF programs execute in a 32-bit
//!    machine, validation of 64-bit arguments necessitates two load-and-compare
//!    instructions for the upper and lower words.
//! 3. A further wrinkle to the above is endianness. Unlike network packets,
//!    syscall data shares the endianness of the target machine. A filter
//!    compiled on a little-endian machine will not work on a big-endian one,
//!    and vice-versa. For example: Checking the upper 32 bits of `data.arg1`
//!    requires a load at `@offsetOf(data, "arg1") + 4` on little-endian systems
//!    and `@offsetOf(data, "arg1")` on big-endian systems. Endian-portable
//!    filters require adjusting these offsets at compile time, similar to how
//!    e.g. OpenSSH does[1].
//! 4. Calls served by a userspace implementation in the vDSO never enter the
//!    kernel, so they cannot be traced or filtered. Because the vDSO can be
//!    disabled or simply bypassed, the same call may also arrive as a real
//!    syscall, which must be taken into account when writing filters.
//! 5. Software libraries, especially dynamically loaded ones, tend to use
//!    more of the syscall API over time, thus filters must evolve with them.
//!    Static filters can result in reduced or even broken functionality when
//!    calling newer code from these libraries. This is known to happen with
//!    critical libraries like glibc[2].
//!
//! Some of these issues can be mitigated with help from Zig and the standard
//! library. Since the target CPU is known at compile time, the proper syscall
//! numbers are mixed into the `os` namespace under `std.os.SYS` (see the code
//! for `arch_bits` in `os/linux.zig`). Referencing an unimplemented syscall
//! would be a compile error. Endian offsets can also be defined in a similar
//! manner to the OpenSSH example:
//!
//! ```zig
//! const native_endian = @import("builtin").cpu.arch.endian();
//!
//! // Byte offsets of the 32-bit halves within a 64-bit argument.
//! const offset = if (native_endian == .little) struct {
//!     pub const low = 0;
//!     pub const high = @sizeOf(u32);
//! } else struct {
//!     pub const low = @sizeOf(u32);
//!     pub const high = 0;
//! };
//! ```
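//!
//! Offsets like these feed the two load-and-compare instructions needed for a
//! 64-bit argument. The following is a minimal sketch of computing the
//! absolute `ld [k]` offset for either half of an argument. It assumes the
//! `data` layout declared in this file (16 bytes before `arg0`); `Half`,
//! `args_base` and `argOffset` are hypothetical names, not part of the
//! standard library.
//!
//! ```zig
//! const builtin = @import("builtin");
//!
//! const Half = enum { low, high };
//!
//! // Assumed layout: nr (4 bytes), arch (4), instruction_pointer (8), then
//! // six 8-byte arguments.
//! const args_base = 16;
//!
//! // Byte offset of one 32-bit half of argument `n` on the target machine.
//! fn argOffset(n: u32, half: Half) u32 {
//!     const base = args_base + n * @sizeOf(u64);
//!     const little = builtin.cpu.arch.endian() == .little;
//!     return switch (half) {
//!         .low => if (little) base else base + @sizeOf(u32),
//!         .high => if (little) base + @sizeOf(u32) else base,
//!     };
//! }
//! ```
//!
//! A filter would then load `argOffset(1, .high)` and `argOffset(1, .low)` in
//! turn and compare each word against the matching half of the expected value.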
//!
//! Unfortunately, there is no easy solution for issue 5. The most reliable
//! strategy is to keep testing; test newer Zig versions, different libcs,
//! different distros, and design your filter to accommodate all of them.
//! Alternatively, you could inject a filter at runtime. Since filters are
//! preserved across execve(2), a filter could be set up before executing your
//! program, without your program having any knowledge of this happening. This
//! is the method used by systemd[3] and Cloudflare's sandbox library[4].
//!
//! [1]: https://github.com/openssh/openssh-portable/blob/master/sandbox-seccomp-filter.c#L81
//! [2]: https://sourceware.org/legacy-ml/libc-alpha/2017-11/msg00246.html
//! [3]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
//! [4]: https://github.com/cloudflare/sandbox
//!
//! See Also
//! - seccomp(2), seccomp_unotify(2)
//! - https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
const IOCTL = @import("ioctl.zig");

// Modes for the prctl(2) form `prctl(PR_SET_SECCOMP, mode)`
pub const MODE = struct {
    /// Seccomp not in use.
    pub const DISABLED = 0;
    /// Uses a hard-coded filter.
    pub const STRICT = 1;
    /// Uses a user-supplied filter.
    pub const FILTER = 2;
};

// Operations for the seccomp(2) form `seccomp(operation, flags, args)`
pub const SET_MODE_STRICT = 0;
pub const SET_MODE_FILTER = 1;
pub const GET_ACTION_AVAIL = 2;
pub const GET_NOTIF_SIZES = 3;

/// Bitflags for the SET_MODE_FILTER operation.
pub const FILTER_FLAG = struct {
    pub const TSYNC = 1 << 0;
    pub const LOG = 1 << 1;
    pub const SPEC_ALLOW = 1 << 2;
    pub const NEW_LISTENER = 1 << 3;
    pub const TSYNC_ESRCH = 1 << 4;
};

/// Action values for seccomp BPF programs.
/// The lower 16 bits are for optional return data.
/// The upper 16 bits are ordered from least permissive values to most when
/// compared as a signed value (so KILL_PROCESS, 0x80000000, is negative).
pub const RET = struct {
    /// Kill the process.
    pub const KILL_PROCESS = 0x80000000;
    /// Kill the thread.
    pub const KILL_THREAD = 0x00000000;
    pub const KILL = KILL_THREAD;
    /// Disallow and force a SIGSYS.
    pub const TRAP = 0x00030000;
    /// Return an errno.
    pub const ERRNO = 0x00050000;
    /// Forward the syscall to a userspace supervisor to make a decision.
    pub const USER_NOTIF = 0x7fc00000;
    /// Pass to a tracer or disallow.
    pub const TRACE = 0x7ff00000;
    /// Allow after logging.
    pub const LOG = 0x7ffc0000;
    /// Allow.
    pub const ALLOW = 0x7fff0000;

    // Masks for the return value sections.
    pub const ACTION_FULL = 0xffff0000;
    pub const ACTION = 0x7fff0000;
    pub const DATA = 0x0000ffff;
};
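
// A minimal sketch of how the masks above split a filter's return value into
// its action and data parts. The errno value 1 (EPERM) is an assumption used
// only for illustration.
test "RET: composing and splitting a verdict" {
    const std = @import("std");
    const eperm = 1;

    // An ERRNO verdict carries the errno in the lower 16 bits.
    const verdict: u32 = RET.ERRNO | (eperm & RET.DATA);
    try std.testing.expectEqual(@as(u32, RET.ERRNO), verdict & RET.ACTION_FULL);
    try std.testing.expectEqual(@as(u32, eperm), verdict & RET.DATA);

    // Read as a signed value, KILL_PROCESS sorts below KILL_THREAD.
    const kill_process: i32 = @bitCast(@as(u32, RET.KILL_PROCESS));
    try std.testing.expect(kill_process < RET.KILL_THREAD);
}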

pub const IOCTL_NOTIF = struct {
    pub const RECV = IOCTL.IOWR('!', 0, notif);
    pub const SEND = IOCTL.IOWR('!', 1, notif_resp);
    pub const ID_VALID = IOCTL.IOW('!', 2, u64);
    pub const ADDFD = IOCTL.IOW('!', 3, notif_addfd);
};

/// Tells the kernel that the supervisor allows the syscall to continue.
pub const USER_NOTIF_FLAG_CONTINUE = 1 << 0;

/// See seccomp_unotify(2).
pub const ADDFD_FLAG = struct {
    pub const SETFD = 1 << 0;
    pub const SEND = 1 << 1;
};

pub const data = extern struct {
    /// The system call number.
    nr: c_int,
    /// The CPU architecture/system call convention.
    /// One of the values defined in `std.os.linux.AUDIT`.
    arch: u32,
    instruction_pointer: u64,
    arg0: u64,
    arg1: u64,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
};

/// Used with the ::GET_NOTIF_SIZES operation to check if the kernel structures
/// have changed.
pub const notif_sizes = extern struct {
    /// Size of ::notif.
    notif: u16,
    /// Size of ::notif_resp.
    notif_resp: u16,
    /// Size of ::data.
    data: u16,
};

pub const notif = extern struct {
    /// Unique notification cookie for each filter.
    id: u64,
    /// ID of the thread that triggered the notification.
    pid: u32,
    /// Bitmask for event information. Currently set to zero.
    flags: u32,
    /// The current system call data.
    data: data,
};

/// The decision payload the supervisor process sends to the kernel.
pub const notif_resp = extern struct {
    /// The filter cookie.
    id: u64,
    /// The return value for a spoofed syscall.
    val: i64,
    /// Set to zero for a spoofed success or a negative error number for a
    /// failure.
    @"error": i32,
    /// Bitmask containing the decision. Either USER_NOTIF_FLAG_CONTINUE to
    /// allow the syscall or zero to spoof the return values.
    flags: u32,
};
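
// A minimal sketch of the two kinds of response described above. The cookie
// value and the errno (1, EPERM) are placeholders; a real supervisor copies
// `id` from the `notif` it received through IOCTL_NOTIF.RECV.
test "notif_resp: spoofed failure versus continue" {
    const std = @import("std");
    const cookie: u64 = 0x1234;

    // Make the syscall fail with EPERM without executing it.
    const deny = notif_resp{ .id = cookie, .val = 0, .@"error" = -1, .flags = 0 };
    // Let the kernel execute the syscall; `val` and `error` stay zero.
    const pass = notif_resp{
        .id = cookie,
        .val = 0,
        .@"error" = 0,
        .flags = USER_NOTIF_FLAG_CONTINUE,
    };

    try std.testing.expect((deny.flags & USER_NOTIF_FLAG_CONTINUE) == 0);
    try std.testing.expect((pass.flags & USER_NOTIF_FLAG_CONTINUE) != 0);
    try std.testing.expectEqual(deny.id, pass.id);
}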
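
/// Payload for IOCTL_NOTIF.ADDFD, which installs one of the supervisor's file
/// descriptors into the process that triggered the notification.
/// See seccomp_unotify(2).
pub const notif_addfd = extern struct {
    /// The filter cookie.
    id: u64,
    /// Bitmask of ::ADDFD_FLAG values.
    flags: u32,
    /// The supervisor's file descriptor to duplicate.
    srcfd: u32,
    /// The descriptor number to use in the target. Only used when
    /// ADDFD_FLAG.SETFD is set.
    newfd: u32,
    /// Flags to set on the new descriptor, such as O_CLOEXEC.
    newfd_flags: u32,
};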