| author | cheloha <cheloha@cvs.openbsd.org> | 2020-08-23 21:38:48 +0000 |
| --- | --- | --- |
| committer | cheloha <cheloha@cvs.openbsd.org> | 2020-08-23 21:38:48 +0000 |
| commit | b2e0fb73546bcfaf752b03fb857d98fbbfc1a6c7 (patch) | |
| tree | 5469db21bd2b0ccf79032c4a845aa7d7215f7bd7 | /lib/libc/arch |
| parent | 6446d40c4e87255ce69ace9d2d4d6505d7e71479 (diff) | |
amd64: TSC timecounter: prefix RDTSC with LFENCE
Regarding RDTSC, the Intel ISA reference says (Vol 2B. 4-545):
> The RDTSC instruction is not a serializing instruction.
>
> It does not necessarily wait until all previous instructions
> have been executed before reading the counter.
>
> Similarly, subsequent instructions may begin execution before the
> read operation is performed.
>
> If software requires RDTSC to be executed only after all previous
> instructions have completed locally, it can either use RDTSCP (if
> the processor supports that instruction) or execute the sequence
> LFENCE;RDTSC.
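To make the quoted advice concrete, here is a minimal sketch of the unfenced and fenced read sequences side by side. The function names are illustrative, not part of the patch:

```c
#include <stdint.h>

/* Unfenced: the TSC may be read before earlier instructions retire. */
static inline uint64_t
rdtsc_naked(void)
{
	uint32_t hi, lo;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Fenced: LFENCE waits for prior instructions to complete locally. */
static inline uint64_t
rdtsc_fenced(void)
{
	uint32_t hi, lo;

	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```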
To mitigate this problem, Linux and DragonFly use LFENCE. FreeBSD and
NetBSD take a more complex route: they selectively use MFENCE, LFENCE,
or CPUID depending on whether the CPU is AMD, Intel, VIA or something
else.
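As a rough sketch of that vendor-conditional approach (all names below are hypothetical, not FreeBSD's or NetBSD's actual code), the selection could be done once at startup via the CPUID vendor string:

```c
#include <stdint.h>
#include <string.h>
#include <cpuid.h>

static uint64_t (*rdtsc_serialized)(void);

static uint64_t
rdtsc_mfence(void)
{
	uint32_t hi, lo;

	asm volatile("mfence; rdtsc" : "=a"(lo), "=d"(hi));
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

static uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

static void
tsc_select_fence(void)
{
	unsigned int eax, ebx, ecx, edx;
	char vendor[13];

	/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX. */
	__get_cpuid(0, &eax, &ebx, &ecx, &edx);
	memcpy(vendor + 0, &ebx, 4);
	memcpy(vendor + 4, &edx, 4);
	memcpy(vendor + 8, &ecx, 4);
	vendor[12] = '\0';

	/* Historically MFENCE serializes RDTSC on AMD, LFENCE on Intel. */
	if (strcmp(vendor, "AuthenticAMD") == 0)
		rdtsc_serialized = rdtsc_mfence;
	else
		rdtsc_serialized = rdtsc_lfence;
}
```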
Let's start with just LFENCE. We only use the TSC as a timecounter on
SSE2 systems so there is no need to conditionally compile the LFENCE.
We can explore conditionally using MFENCE later.
Microbenchmarking on my machine (Core i7-8650) suggests a penalty of
about 7-10% over a "naked" RDTSC. This is acceptable. It's a bit of
a moot point though: the alternative is a considerably weaker
monotonicity guarantee when comparing timestamps between threads,
which is not acceptable.
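A microbenchmark along these lines (a sketch only; the iteration count and loop structure are assumptions, not the actual benchmark used) could look like:

```c
#include <stdint.h>
#include <stdio.h>

#define ITERS 10000000UL

static inline uint64_t
rdtsc(int fenced)
{
	uint32_t hi, lo;

	if (fenced)
		asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	else
		asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

int
main(void)
{
	uint64_t begin, end;
	unsigned long i;

	/* Loop overhead is included in both runs; compare the ratio. */
	begin = rdtsc(1);
	for (i = 0; i < ITERS; i++)
		(void)rdtsc(0);
	end = rdtsc(1);
	printf("naked:  %.2f cycles/read\n", (double)(end - begin) / ITERS);

	begin = rdtsc(1);
	for (i = 0; i < ITERS; i++)
		(void)rdtsc(1);
	end = rdtsc(1);
	printf("lfence: %.2f cycles/read\n", (double)(end - begin) / ITERS);

	return 0;
}
```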
It's worth noting that kernel timecounting is not *exactly* like
userspace timecounting. However, they are similar enough that we can
use userspace benchmarks to make conjectures about possible impacts on
kernel performance.
Concerns about kernel performance, in particular the network stack,
were the blocking issue for this patch. Regarding networking
performance, claudio@ says a 10% slower nanotime(9) or nanouptime(9)
is acceptable and that shaving off "tens of cycles" is a
micro-optimization. There are bigger optimizations to chase down
before such a difference would matter.
There is additional work to be done here. We could experiment with
conditionally using MFENCE. Also, the userspace TSC timecounter
doesn't have access to the adjustment skews available to the kernel
timecounter. pirofti@ has suggested a scheme involving RDTSCP and an
array of skews mapped into user memory. deraadt@ has suggested a
scheme where the skew would be kept in the TCB. However it is done,
access to the skews will improve monotonicity, which remains a problem
with the TSC.
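For illustration only, a hypothetical userspace read along the lines of pirofti@'s suggestion might combine RDTSCP with a kernel-exported skew array; the tk_skew array and its indexing by IA32_TSC_AUX are invented here to sketch the idea, not a real interface:

```c
#include <stdint.h>

static inline uint64_t
rdtscp_skewed(const int64_t *tk_skew)
{
	uint32_t hi, lo, aux;

	/*
	 * RDTSCP waits for prior instructions to complete locally and
	 * also returns IA32_TSC_AUX in ECX; if the kernel programs that
	 * MSR with a CPU number, it can index a per-CPU skew array.
	 */
	asm volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
	return ((uint64_t)lo | ((uint64_t)hi << 32)) + tk_skew[aux];
}
```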
First proposed by kettenis@ and pirofti@. With input from pirofti@,
deraadt@, guenther@, naddy@, kettenis@, and claudio@. Based on
similar changes in Linux, FreeBSD, NetBSD, and DragonFlyBSD.
ok deraadt@ pirofti@ kettenis@ naddy@ claudio@
Diffstat (limited to 'lib/libc/arch')
-rw-r--r--  lib/libc/arch/amd64/gen/usertc.c | 8
1 file changed, 4 insertions, 4 deletions
```diff
diff --git a/lib/libc/arch/amd64/gen/usertc.c b/lib/libc/arch/amd64/gen/usertc.c
index 7529af598da..6c37a8c5cc1 100644
--- a/lib/libc/arch/amd64/gen/usertc.c
+++ b/lib/libc/arch/amd64/gen/usertc.c
@@ -1,4 +1,4 @@
-/*	$OpenBSD: usertc.c,v 1.2 2020/07/08 09:17:48 kettenis Exp $	*/
+/*	$OpenBSD: usertc.c,v 1.3 2020/08/23 21:38:47 cheloha Exp $	*/
 /*
  * Copyright (c) 2020 Paul Irofti <paul@irofti.net>
  *
@@ -19,10 +19,10 @@
 #include <sys/timetc.h>
 
 static inline u_int
-rdtsc(void)
+rdtsc_lfence(void)
 {
 	uint32_t hi, lo;
 
-	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
+	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
 	return ((uint64_t)lo)|(((uint64_t)hi)<<32);
 }
 
@@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_int *tc)
 {
 	switch (tk->tk_user) {
 	case TC_TSC:
-		*tc = rdtsc();
+		*tc = rdtsc_lfence();
 		return 0;
 	}
 
```