Qualys Security Advisory
Sequoia: A deep root in Linux's filesystem layer (CVE-2021-33909)
========================================================================
Contents
========================================================================
Summary
Analysis
Exploitation overview
Exploitation details
Mitigations
Acknowledgments
Timeline
========================================================================
Summary
========================================================================
We discovered a size_t-to-int conversion vulnerability in the Linux
kernel's filesystem layer: by creating, mounting, and deleting a deep
directory structure whose total path length exceeds 1GB, an
unprivileged local attacker can write the 10-byte string "//deleted"
to an offset of exactly -2GB-10B below the beginning of a
vmalloc()ated kernel buffer.

We successfully exploited this uncontrolled out-of-bounds write, and
obtained full root privileges on default installations of Ubuntu
20.04, Ubuntu 20.10, Ubuntu 21.04, Debian 11, and Fedora 34
Workstation; other Linux distributions are certainly vulnerable, and
probably exploitable.

Our exploit requires approximately 5GB of memory and 1M inodes; we
will publish it in the near future. A basic proof of concept (a
crasher) is attached to this advisory and is available at:

https://www.qualys.com/research/security-advisories/

To the best of our knowledge, this vulnerability was introduced in
July 2014 (Linux 3.16) by commit 058504ed ("fs/seq_file: fallback to
vmalloc allocation").
========================================================================
Analysis
========================================================================
The Linux kernel's seq_file interface produces virtual files that
contain sequences of records (for example, many files in /proc are
seq_files, and records are usually lines). Each record must fit into
a seq_file buffer, which is therefore enlarged as needed, by doubling
its size at line 242 (seq_buf_alloc() is a simple wrapper around
kvmalloc()):
------------------------------------------------------------------------
168 ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
169 {
170         struct seq_file *m = iocb->ki_filp->private_data;
...
205         /* grab buffer if we didn't have one */
206         if (!m->buf) {
207                 m->buf = seq_buf_alloc(m->size = PAGE_SIZE);
...
210         }
...
220         // get a non-empty record in the buffer
...
223         while (1) {
...
227                 err = m->op->show(m, p);
...
236                 if (!seq_has_overflowed(m)) // got it
237                         goto Fill;
238                 // need a bigger buffer
...
240                 kvfree(m->buf);
...
242                 m->buf = seq_buf_alloc(m->size <<= 1);
...
246         }
------------------------------------------------------------------------
This size multiplication is not a vulnerability in itself, because
m->size is a size_t (an unsigned 64-bit integer, on x86_64), and the
system would run out of memory long before this multiplication
overflows the integer m->size.

Unfortunately, this size_t is also passed to functions whose size
argument is an int (a signed 32-bit integer), not a size_t. For
example, the show_mountinfo() function (which is called at line 227
to format the records in /proc/self/mountinfo) calls seq_dentry() (at
line 150), which calls dentry_path() (at line 530), which calls
prepend() (at line 387):
------------------------------------------------------------------------
135 static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
136 {
...
150         seq_dentry(m, mnt->mnt_root, " \t\n\\");
------------------------------------------------------------------------
523 int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
524 {
525         char *buf;
526         size_t size = seq_get_buf(m, &buf);
...
529         if (size) {
530                 char *p = dentry_path(dentry, buf, size);
------------------------------------------------------------------------
380 char *dentry_path(struct dentry *dentry, char *buf, int buflen)
381 {
382         char *p = NULL;
...
385         if (d_unlinked(dentry)) {
386                 p = buf + buflen;
387                 if (prepend(&p, &buflen, "//deleted", 10) != 0)
------------------------------------------------------------------------
 11 static int prepend(char **buffer, int *buflen, const char *str, int namelen)
 12 {
 13         *buflen -= namelen;
 14         if (*buflen < 0)
 15                 return -ENAMETOOLONG;
 16         *buffer -= namelen;
 17         memcpy(*buffer, str, namelen);
------------------------------------------------------------------------
As a result, if an unprivileged local attacker creates, mounts, and
deletes a deep directory structure whose total path length exceeds
1GB, and if the attacker open()s and read()s /proc/self/mountinfo,
then:

- in seq_read_iter(), a 2GB buffer is vmalloc()ated (line 242), and
  show_mountinfo() is called (line 227);

- in show_mountinfo(), seq_dentry() is called with the empty 2GB
  buffer (line 150);

- in seq_dentry(), dentry_path() is called with a 2GB size (line 530);

- in dentry_path(), the int buflen is therefore negative (INT_MIN,
  -2GB), p points to an offset of -2GB below the vmalloc()ated buffer
  (line 386), and prepend() is called (line 387);

- in prepend(), *buflen is decreased by 10 bytes and becomes a large
  but positive int (line 13), *buffer is decreased by 10 bytes and
  points to an offset of -2GB-10B below the vmalloc()ated buffer
  (line 16), and the 10-byte string "//deleted" is written out of
  bounds (line 17).
========================================================================
Exploitation overview
========================================================================
1/ We mkdir() a deep directory structure (roughly 1M nested
directories) whose total path length exceeds 1GB, we bind-mount it in
an unprivileged user namespace, and rmdir() it.

2/ We create a thread that vmalloc()ates a small eBPF program (via
BPF_PROG_LOAD), and we block this thread (via userfaultfd or FUSE)
after our eBPF program has been validated by the kernel eBPF verifier
but before it is JIT-compiled by the kernel.

3/ We open() /proc/self/mountinfo in our unprivileged user namespace,
and start read()ing the long path of our bind-mounted directory,
thereby writing the string "//deleted" to an offset of exactly
-2GB-10B below the beginning of a vmalloc()ated buffer.

4/ We arrange for this "//deleted" string to overwrite an instruction
of our validated eBPF program (and therefore nullify the security
checks of the kernel eBPF verifier), and transform this uncontrolled
out-of-bounds write into an information disclosure, and into a
limited but controlled out-of-bounds write.

5/ We transform this limited out-of-bounds write into an arbitrary
read and write of kernel memory, by reusing Manfred Paul's beautiful
btf and map_push_elem techniques from:

https://www.thezdi.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification

6/ We use this arbitrary read to locate the modprobe_path[] buffer in
kernel memory, and use the arbitrary write to replace the contents of
this buffer ("/sbin/modprobe" by default) with a path to our own
executable, thus obtaining full root privileges.
========================================================================
Exploitation details
========================================================================
a/ We create a directory whose total path length exceeds 1GB: in
theory, we need to create over 1GB/256B=4M nested directories
(NAME_MAX is 255); in practice, show_mountinfo() replaces each '\\'
character in our long directory with the 4-byte string "\\134", and
we therefore need to create only 1M nested directories.
b/ We fill all large vmalloc holes: we bind-mount (MS_BIND) various
parts of our long directory in several unprivileged user namespaces
and vmalloc()ate large seq_file buffers by read()ing
/proc/self/mountinfo. For example, we vmalloc()ate 768MB of large
buffers in our exploit.
c/ We vmalloc()ate two 1GB buffers and one 2GB buffer (by
bind-mounting our long directory in three different user namespaces,
and by read()ing /proc/self/mountinfo), and we check that "//deleted"
is indeed written to an offset of -2GB-10B below the beginning of our
2GB buffer (i.e., 8182B above the beginning of our first 1GB buffer
-- the "XXX"s are guard pages):
        "//deleted"
             |
      4KB    v    1GB       4KB        1GB        4KB        2GB
-----|---|---+-------------|---|-----------------|---|-----------------|
 ... |XXX| seq_file buffer |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|---+-------------|---|-----------------|---|-----------------|
         |   |                                       |
         \----<----<----<----<----<----<----<----<---/
         8182B                               -2GB-10B
d/ We fill all small vmalloc holes: we vmalloc()ate various small
socket buffers by send()ing numerous NETLINK_USERSOCK messages. For
example, we vmalloc()ate 256MB of small buffers in our exploit.
e/ We create 1024 user-space threads; each thread starts loading an
eBPF program into the kernel, but (via userfaultfd or FUSE) we block
every thread in kernel space (at line 2101), before our eBPF programs
are actually vmalloc()ated (at line 2162):
------------------------------------------------------------------------
2076 static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
2077 {
....
2100         /* copy eBPF program license from user space */
2101         if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
....
2161         /* plain bpf_prog allocation */
2162         prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
------------------------------------------------------------------------
f/ We vfree() our first 1GB seq_file buffer (where "//deleted" was
written out of bounds), and we immediately unblock all 1024 threads;
our eBPF programs are vmalloc()ated into the 1GB hole that we just
vfree()d:
      4KB         1GB       4KB        1GB        4KB        2GB
-----|---|-----------------|---|-----------------|---|-----------------|
 ... |XXX|  eBPF programs  |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|-----------------|---|-----------------|---|-----------------|
g/ Next, (again via userfaultfd or FUSE) we block one of our threads
(at line 12795) after its eBPF program has been validated by the
kernel eBPF verifier but before it is JIT-compiled by the kernel:
------------------------------------------------------------------------
12640 int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
12641               union bpf_attr __user *uattr)
12642 {
.....
12795         print_verification_stats(env);
------------------------------------------------------------------------
h/ Last, we overwrite an instruction of this eBPF program with an
out-of-bounds "//deleted" string (again via our 2GB seq_file buffer),
and therefore nullify the security checks of the kernel eBPF
verifier:
        "//deleted"
             |
      4KB    v    1GB       4KB        1GB        4KB        2GB
-----|---|---+-------------|---|-----------------|---|-----------------|
 ... |XXX|  eBPF programs  |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|---+-------------|---|-----------------|---|-----------------|
         |   |                                       |
         \----<----<----<----<----<----<----<----<---/
         8182B                               -2GB-10B
First, we transform this uncontrolled eBPF-program corruption into an
information disclosure. Our first, uncorrupted eBPF program is deemed
safe by the kernel eBPF verifier ("storage" and "control" are two
basic BPF_MAP_TYPE_ARRAYs, readable and writable from user space via
BPF_MAP_LOOKUP_ELEM and BPF_MAP_UPDATE_ELEM):

- BPF_LD_IMM64_RAW(BPF_REG_2, BPF_PSEUDO_MAP_VALUE, storage) loads
  the address of our storage map (which resides in kernel space and
  whose address is unknown to us) into the eBPF register BPF_REG_2;

- BPF_MOV64_IMM(BPF_REG_2, 0) immediately replaces the contents of
  BPF_REG_2 (the address of our storage map) with the constant value
  0;

- BPF_LD_IMM64_RAW(BPF_REG_3, BPF_PSEUDO_MAP_VALUE, control) loads
  the address of our control map into BPF_REG_3;

- BPF_STX_MEM(BPF_DW, BPF_REG_3, BPF_REG_2, 0) stores the contents
  of BPF_REG_2 (the constant value 0) into our control map.

However, our eBPF-program corruption overwrites the instruction
BPF_MOV64_IMM(BPF_REG_2, 0) with the 8-byte string "deleted", which
translates into the instruction BPF_ALU32_IMM(BPF_LSH, BPF_REG_5,
0x74): a NOP ("no operation"), because our program does not use
BPF_REG_5. As a result, we do not store the constant value 0 into our
control map: instead, we store and disclose the address of our
storage map.
(This information disclosure allowed us to greatly reduce the number
of hardcoded kernel offsets in our exploit: our Ubuntu 20.04 exploit
worked out of the box on Ubuntu 20.10, Ubuntu 21.04, Debian 11, and
Fedora 34.)
Second, we transform our uncontrolled eBPF-program corruption into a
limited but controlled out-of-bounds write. Our second, uncorrupted
eBPF program is also deemed safe by the kernel eBPF verifier
("corrupt" is a 3*64KB BPF_MAP_TYPE_ARRAY):

- BPF_LD_IMM64_RAW(BPF_REG_4, BPF_PSEUDO_MAP_VALUE, corrupt) loads
  the address of our corrupt map into BPF_REG_4;

- BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 3*64KB/2) points BPF_REG_4 to
  the middle of our corrupt map;

- BPF_ALU64_IMM(BPF_SUB, BPF_REG_4, 3*64KB/4) points BPF_REG_4 to
  the first quarter of our corrupt map;

- BPF_LD_IMM64_RAW(BPF_REG_3, BPF_PSEUDO_MAP_VALUE, control) loads
  the address of our control map into BPF_REG_3;

- BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_3, 0) loads a variable
  16-bit offset from our control map into BPF_REG_7;

- BPF_ALU64_REG(BPF_ADD, BPF_REG_4, BPF_REG_7) adds BPF_REG_7 (our
  variable 16-bit offset) to BPF_REG_4, which therefore points safely
  within the bounds of our corrupt map (because BPF_REG_7 is in the
  [0,64KB] range).
However, our eBPF-program corruption overwrites the instruction
BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 3*64KB/2) with the string
"deleted", which translates into BPF_ALU32_IMM(BPF_LSH, BPF_REG_5,
0x74) (a NOP). As a result, the following BPF_ALU64_IMM(BPF_SUB,
BPF_REG_4, 3*64KB/4) points BPF_REG_4 out of bounds and allows us to
read from and write to the struct bpf_map that precedes our corrupt
map in kernel space.
Finally, we transform this limited out-of-bounds read and write into
an arbitrary read and write of kernel memory, by reusing Manfred
Paul's btf and map_push_elem techniques:

- With the arbitrary kernel read we locate the symbol
  "__request_module" and hence the function __request_module(),
  disassemble this function, and extract the address of
  modprobe_path[] from the instructions for "if (!modprobe_path[0])".

- With the arbitrary kernel write we overwrite the contents of
  modprobe_path[] ("/sbin/modprobe" by default) with a path to our
  own executable, and call request_module() (by creating a netlink
  socket), which executes modprobe_path, and hence our own
  executable, as root.
========================================================================
Mitigations
========================================================================
Important note: the following mitigations prevent only our specific
exploit from working (but other exploitation techniques may exist);
to completely fix this vulnerability, the kernel must be patched.

- Set /proc/sys/kernel/unprivileged_userns_clone to 0, to prevent an
  attacker from mounting a long directory in a user namespace.
  However, the attacker may mount a long directory via FUSE instead;
  we have not fully explored this possibility, because we
  accidentally stumbled upon CVE-2021-33910 in systemd: if an
  attacker FUSE-mounts a long directory (longer than 8MB), then
  systemd exhausts its stack, crashes, and therefore crashes the
  entire operating system (a kernel panic).

- Set /proc/sys/kernel/unprivileged_bpf_disabled to 1, to prevent an
  attacker from loading an eBPF program into the kernel. However, the
  attacker may corrupt other vmalloc()ated objects instead (for
  example, thread stacks); we have not investigated this possibility.
========================================================================
Acknowledgments
========================================================================
We thank the PaX Team for answering our many questions about the
Linux kernel. We also thank Manfred Paul, Jann Horn, Brandon Azad,
Simon Scannell, and Bruce Leidl for their exploits and write-ups:

https://www.thezdi.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
https://googleprojectzero.blogspot.com/2020/12/an-ios-hacker-tries-android.html
https://scannell.io/posts/ebpf-fuzzing/
https://github.com/brl/grlh

We thank Red Hat Product Security and the members of
linux-distros@openwall and security@kernel.org for their work on this
coordinated disclosure. We also thank Mitre's CVE Assignment Team.
Finally, we thank Marco Ivaldi for his continued support.
========================================================================
Timeline
========================================================================
2021-06-09: We sent our advisories for CVE-2021-33909 and
CVE-2021-33910 to Red Hat Product Security (the two vulnerabilities
are closely related and the systemd-security mailing list is hosted
by Red Hat).

2021-07-06: We sent our advisories, and Red Hat sent the patches they
wrote, to the linux-distros@openwall mailing list.

2021-07-13: We sent our advisory for CVE-2021-33909, and Red Hat sent
the patch they wrote, to the security@kernel.org mailing list.

2021-07-20: Coordinated Release Date (12:00 PM UTC).