Age | Commit message | Author |
|
The object sent to vmm(4) contained file paths and details the
kernel does not need for cpu virtualization as device emulation is
in userland. Effectively, "pull up" the struct members from the
vm_create_params struct to the parent vmop_create_params struct.
This allows us to clean up some of vmd(8) and simplifies switching
to having vmctl(8) open the "kernel" file (SeaBIOS, bsd.rd, etc.),
letting users boot recovery ramdisk kernels.
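Roughly the shape of the split (a sketch with illustrative members,
not the committed layout):

    /* kernel-facing: only what cpu virtualization needs */
    struct vm_create_params {
            size_t  vcp_nmemranges;
            size_t  vcp_ncpus;
            /* ... memory ranges, vm id/name ... */
    };

    /* userland-only details pulled up into the parent */
    struct vmop_create_params {
            struct vm_create_params vmc_params;
            char    vmc_kernel[PATH_MAX];   /* SeaBIOS, bsd.rd, ... */
            char    vmc_disks[4][PATH_MAX];
    };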
ok mlarkin@
|
|
features: while all are appropriate for xsaves/xrstors, the
supervisor-state features aren't for xcr0 but rather for the new XSS_MSR,
making the current names kinda confusing.
Add #defines for masking bits for xcr0 vs XSS.
Add and report the new XSAVE_XFD xsave subfeature bit.
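A sketch of the split (hypothetical define names; the point is that
the two sets of bits go to different destinations):

    /* user states: may be set in XCR0 via xsetbv */
    #define XFEATURE_XCR0_MASK  (XFEATURE_X87 | XFEATURE_SSE | XFEATURE_AVX)
    /* supervisor states: only valid in the IA32_XSS MSR */
    #define XFEATURE_XSS_MASK   (XFEATURE_CET_U | XFEATURE_CET_S)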
ok mlarkin@
|
|
The IDTVEC() and KIDTVEC() macros also get an endbr64, and therefore we need
to change the way that vectors are aliased with a new IDTVEC_ALIAS() macro.
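A sketch of the macro shapes (illustrative, not the committed code):
IDTVEC() now emits the endbr64, so an alias must reuse the original
entry label rather than defining a second entry point.

    #define IDTVEC(name) \
            .text; .global X ## name; X ## name: endbr64
    #define IDTVEC_ALIAS(alias, sym) \
            .global X ## alias; X ## alias = X ## sym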
with guenther, jsg
|
|
ok deraadt@
|
|
"these are fine," mlarkin@
|
|
ever ran on, and it's unlikely to ever be implemented, so remove it.
ok jsg@
|
|
requested by and ok deraadt@
|
|
except alpha. This will put the stack at a random location in the upper
1/4th of the userland virtual address space providing up to 26 additional
bits of randomness in the address. Skip alpha for now since it currently
puts the stack at a (for a 64-bit architecture) very low address. Skip
32-bit architectures for now as well since those have a much smaller
virtual address space and we need more time to figure out what a safe
amount of extra randomization is. These architectures will continue to
use a mildly randomized stack address through the existing stackgap random
mechanism. We will revisit this after 7.3 is released.
This should make it harder for an attacker to find the stack.
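A minimal sketch of the placement, with illustrative names and
constants (not the committed code):

    /* pick a page-aligned stack base in the top quarter of the
     * user VA space; up to 26 bits of page-granular randomness */
    vaddr_t
    stack_base_pick(vaddr_t maxuser)
    {
            vaddr_t quarter = maxuser / 4;
            vaddr_t pages = arc4random() & ((1UL << 26) - 1);

            return (maxuser - quarter) + (pages << PAGE_SHIFT);
    }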
ok deraadt@, miod@
|
|
against classic BROP with a range-checking wrapper in front of copyin() and
copyinstr() which ensures the userland source doesn't overlap the main program
text, ld.so text, signal tramp text (its mapping is hard to distinguish,
so it comes along for the ride), or libc.so text. ld.so tells the kernel
the libc.so text range with msyscall(2). The range checking for 2-4 elements is
done without locking (because all 4 ranges are immutable!) and is inexpensive.
write(sock, &open, 400) now fails with EFAULT. No programs have been
discovered which require reading their own text segments with a system call.
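A sketch of the wrapper's check (illustrative names, not the
committed code):

    /* reject copyin()/copyinstr() sources overlapping immutable
     * text ranges: main program, ld.so (+ sigtramp), libc.so */
    struct text_range { vaddr_t tr_start, tr_end; }; /* end 0 = unused */
    struct text_range nocopyin[4];

    int
    copyin_range_check(const void *uaddr, size_t len)
    {
            vaddr_t s = (vaddr_t)uaddr, e = s + len;
            int i;

            for (i = 0; i < 4; i++) {
                    if (nocopyin[i].tr_end != 0 &&
                        s < nocopyin[i].tr_end && e > nocopyin[i].tr_start)
                            return EFAULT;  /* source overlaps text */
            }
            return 0;
    }

No locks are needed because the ranges, once set, never change.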
On a machine without mmu enforcement, a test program reports the following:
                userland     kernel
ld.so           readable     unreadable
mmap xz         unreadable   unreadable
mmap x          readable     readable
mmap nrx        readable     readable
mmap nwx        readable     readable
mmap xnwx       readable     readable
main            readable     unreadable
libc unmapped?  readable     unreadable
libc mapped     readable     unreadable
ok kettenis, additional help from miod
|
|
Take a simple approach for saving and restoring PKRU if the host
has PKE support enabled. Uses explicit rdpkru/wrpkru instructions
for now instead of xsave.
This functionality is still gated behind amd64 pmap checking for
operation under a hypervisor as well as vmm masking the cpuid bit
for PKU.
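The wrappers look roughly like this (a sketch; the instruction
semantics are architectural, the helper names are not):

    /* rdpkru: ECX must be 0; result in EAX, EDX cleared */
    static inline uint32_t
    rdpkru(void)
    {
            uint32_t eax, edx;
            __asm volatile("rdpkru" : "=a"(eax), "=d"(edx) : "c"(0));
            return eax;
    }

    /* wrpkru: value in EAX, ECX and EDX must be 0 */
    static inline void
    wrpkru(uint32_t pkru)
    {
            __asm volatile("wrpkru" : : "a"(pkru), "c"(0), "d"(0));
    }

Roughly: on VM entry the host value is saved with rdpkru() and the
guest value loaded with wrpkru(); the reverse happens on exit.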
"if your diff is good, then commit it" -deraadt@
|
|
Part of an ongoing effort to move userland-specific information out
of a kernel header and directly into vmd(8). No functional change.
ok mlarkin@
|
|
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with an xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit, so I went to direct register
management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
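The forced value amounts to two bits per key in PKRU (a sketch;
PKRU_XONLY is a hypothetical name):

    /* PKRU: two bits per key, AD = access disable, WD = write disable */
    #define PKRU_AD(key)    (1U << (2 * (key)))
    #define PKRU_WD(key)    (1U << (2 * (key) + 1))

    /* written on every kernel exit: userland may not read (or
     * write) memory tagged with key 1, i.e. PG_XO pages */
    #define PKRU_XONLY      (PKRU_AD(1) | PKRU_WD(1))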
ok kettenis, dv, guenther, etc
|
|
support. The current implementation doesn't handle the transition from
RWX to RW correctly. Also generalize the pmap_write_protect() function
in recognition of the fact that execute permission, write permission,
and in the future read permission on executable pages, are handled by
separate bits.
ok deraadt@, mpi@
|
|
We don't emulate or support most of the EAX=7,ECX=0 feature bits,
so restrict the mask further to just UMIP.
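A sketch of the filtering (assuming the host's SEFF0ECX_UMIP define;
illustrative, not the committed code):

    case 0x07:      /* structured extended features */
            if (subleaf == 0)
                    *ecx &= SEFF0ECX_UMIP;  /* advertise only UMIP */
            break;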
ok deraadt@
|
|
matches tom@'s i386 rev 1.47 change
|
|
This function is only ever called with PROT_NONE or PROT_READ where
PROT_NONE removes the mapping from the page tables and PROT_READ takes
away write permission. Add a KASSERT to make sure no other values are
passed. This KASSERT should be optimized away by any decent compiler.
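The check is a single assertion, roughly:

    /* only these two cases exist; with literal prot arguments a
     * decent compiler folds the check away entirely */
    KASSERT(prot == PROT_NONE || prot == PROT_READ);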
ok deraadt@, mpi@, guenther@
|
|
for execute-only, and the PKU value used by userland to use that key.
|
|
that is compatible with what FreeBSD and NetBSD have. Setting EFI
variables is only allowed at securelevel 0 and below.
Heavily based on work done by Sergii Dmytruk.
ok yasuoka@
|
|
ok deraadt@
|
|
ok deraadt@
|
|
ok deraadt@
|
|
Alder Lake and similar-era Intel platforms introduced new userland
wait instructions. Since vmm was passing this cpuid bit into guests,
some would attempt TPAUSE instructions and trigger invalid-instruction
exceptions, because VMX requires additional configuration before those
instructions work in a guest.
This also adds WAITPKG to i386 and amd64 cpu feature identification.
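A sketch of the guest-visible masking (assuming a SEFF0ECX_WAITPKG
define for the new bit; not the committed code):

    /* hide WAITPKG so guests never attempt TPAUSE/UMWAIT/UMONITOR */
    *ecx &= ~SEFF0ECX_WAITPKG;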
Input from anton@, cheloha@, and guenther@. Tested by jmatthew@.
OK deraadt@.
|
|
and pass it to the kernel.
ok jca@, patrick@
|
|
When booting guests with SeaBIOS, vmd(8) supplied details about the
available guest memory via CMOS registers. Consequently, we've been
carrying some patches in the ports tree to SeaBIOS to fetch this
information like it's the 1990s.
When a vm initializes memory ranges, we now track what each range
represents. This information can be used to supply the e820 memory
map to SeaBIOS via the fw_cfg interface allowing it to properly
communicate memory ranges to a guest operating system. (This will
also allow us to drop some patches from the port.)
Given the ranges can now be marked with a purpose, this also allows
vmm(4) to switch from hard-coded mmio ranges and instead let the
information on the memory range dictate whether vmm should handle
a page fault or send it to vmd for a memory assist.
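An e820 entry handed to SeaBIOS via fw_cfg follows the classic BIOS
layout, roughly (a sketch; type values per the e820 convention):

    struct e820_entry {
            uint64_t addr;      /* base of the range */
            uint64_t size;      /* length in bytes */
            uint32_t type;      /* 1 = usable RAM, 2 = reserved, ... */
    } __packed;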
Tested by Mischa Peters and others. OK mlarkin@.
|
|
Start eliminating it.
ok mpi@ mlarkin@ krw@
|
|
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.
ok mpi@ krw@ mlarkin@
|
|
The initial mmio support for vmd adds support for only specific MOV
and MOVZX instructions. The plan is to begin iterating in-tree on other
missing pieces. All functionality is gated behind an #if for now.
Only change to vmm(4) is reordering register #define's in vmmvar.h.
ok mlarkin@
|
|
Since vmm doesn't support hot-plug vcpus we can reduce complexity
by treating the vcpu list per vm as immutable after creation.
As a consequence, we can use the vm reference count to protect the
lifetime of the vcpus, removing the need for reference counting
individual vcpu objects. With an immutable list, we no longer need
an rwlock protecting it either.
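The resulting usage pattern, sketched (member names illustrative):

    /* a vm reference alone keeps every vcpu on the immutable list
     * alive; no list lock, no per-vcpu refcount */
    refcnt_take(&vm->vm_refcnt);
    SLIST_FOREACH(vcpu, &vm->vm_vcpu_list, vc_vcpu_link) {
            /* safe to use vcpu here */
    }
    refcnt_rele_wake(&vm->vm_refcnt);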
Original diff from dlg@ that I reworked and tested.
ok dlg@, mlarkin@
|
|
this records which physical cpu a vcpu is running on. this is used
by the code that marks a vcpu as having a pending interrupt to check
if the vcpu is currently running. if it thinks the vcpu is running,
it sends a nop IPI to the physical cpu it is running on to trigger
a vmexit, which in turn runs interrupt handling in the guest.
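roughly, with illustrative member and function names:

    struct cpu_info *ci;

    vcpu->vc_intr = 1;                      /* mark interrupt pending */
    ci = vcpu->vc_curcpu;                   /* physical cpu, if running */
    if (ci != NULL)
            x86_send_ipi(ci, X86_IPI_NOP);  /* force a vmexit */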
ok mlarkin@
|
|
Switch amd64 to the clockintr(9) subsystem. There are lots of little
changes, but the big ones are listed here.
When using the local apic timer:
- Run the timer in one-shot mode (see the sketch after this list).
- lapic_delay() is gone. We can't use it to delay(9) when running
the timer in one-shot mode.
- Add a randomized statclock(); stathz = hz.
- Add support for switching to profhz when profiling is enabled;
profhz = stathz * 10.
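A sketch of what one-shot mode implies (illustrative, not the
committed code): every expiry must program the count to the next
scheduled event, instead of relying on a fixed period.

    void
    lapic_timer_oneshot_rearm(uint32_t cycles)
    {
            /* writing the initial-count register arms one expiry */
            lapic_writereg(LAPIC_ICR_TIMER, cycles);
    }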
When using the i8254/mc146818:
- i8254's clockintr() no longer has a monopoly on hardclock().
- mc146818's rtcintr() no longer has a monopoly on statclock().
- In profiling mode, the statclock() will drift very slightly
because (profhz = 1024) does not divide evenly into one billion.
We could avoid this by setting (profhz = 512) instead and
programming the RTC to run at that rate.
Early revisions reviewed by mlarkin@. Extensively tested by mlarkin@
on a variety of physical and virtual hardware. Additional testing
from dv@ and jmc@.
Link: https://marc.info/?l=openbsd-tech&m=166776339203279&w=2
ok kettenis@ mlarkin@
|
|
Not all of the clocks with a delay(9) implementation necessarily keep
ticking across suspend/resume. We need a clean way to reverse
delay_init() during suspend when those clocks stop ticking.
Hence, delay_fini(). delay_fini() resets delay_func() to
i8254_delay() if the given function pointer is the active delay(9)
implementation.
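Roughly (a sketch of the behavior described above):

    void
    delay_fini(void (*fn)(int))
    {
            if (fn == delay_func)
                    delay_func = i8254_delay;   /* fall back */
    }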
ok mlarkin@
|
|
Compute the TSC frequency on AMD family 17h and 19h CPUs using the
PStateDef MSRs.
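The P0 computation from the AMD PPRs, sketched (field layout per the
documents cited in the code comments; illustrative only):

    uint64_t
    tsc_freq_pstatedef(void)
    {
            uint64_t def = rdmsr(0xc0010064);   /* PStateDef, P-state 0 */
            uint64_t fid = def & 0xff;          /* CpuFid[7:0] */
            uint64_t did = (def >> 8) & 0x3f;   /* CpuDfsId[13:8] */

            if ((def & (1ULL << 63)) == 0 || did == 0)  /* PstateEn */
                    return 0;
            /* CoreCOF = 200 MHz * CpuFid / CpuDfsId */
            return 200000000ULL * fid / did;
    }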
Link 1: https://marc.info/?l=openbsd-tech&m=166394236029484&w=2
Link 2: https://marc.info/?l=openbsd-tech&m=166446065916283&w=2
Test list: https://marc.info/?l=openbsd-tech&m=166646389821326&w=2
Reviewed by kettenis@ using the AMD documents cited in the comments.
Maybe reviewed by mlarkin@? I can't remember. He seemed supportive
of the idea at least.
ok kettenis@
|
|
in the future to implement support for things like EFI variables.
ok krw@ (a few others ok'ed earlier incarnations of this diff)
|
|
tweaks from cheloha@; ok deraadt@, sthen@, cheloha@
|
|
to a separate function that gets called after identifycpu() so that
we have the required information to handle the correct MSRs for each
cpu.
Additionally, move the handling of the DE_CFG_SERIALIZE_LFENCE and
IA32_DEBUG_INTERFACE_LOCK MSRs out of identifycpu() to the new
function so that they get set again after a suspend/resume cycle as
well, which in turn fixes TSC sync failures.
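A sketch of the new function's shape (illustrative, not the
committed code):

    void
    cpu_fix_msrs(struct cpu_info *ci)
    {
            /* runs after identifycpu(), on boot and on resume */
            if (strcmp(cpu_vendor, "AuthenticAMD") == 0)
                    wrmsr(MSR_DE_CFG,
                        rdmsr(MSR_DE_CFG) | DE_CFG_SERIALIZE_LFENCE);
    }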
discussed with and input from deraadt@, mlarkin@
|
|
Simplify things by sending any io exits from IN/OUT instructions
to userland instead of trying to emulate anything in the kernel.
vmm was sending most pertinent exits to vmd anyways, so this
functionally changes little.
An added benefit is this solves an issue reported by tb@ where i386
OpenBSD guests would probe for a pc keyboard repeatedly and cause
excessive vm exits. (The emulation in vmm was not properly handling
these port reads.)
While here, make the assignment of the VEI_DIR_{IN,OUT} enum values
explicit rather than assuming the integers the compiler would pick.
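I.e., roughly:

    enum vei_dir {
            VEI_DIR_OUT = 0,    /* values now spelled out explicitly */
            VEI_DIR_IN = 1,
    };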
ok mlarkin@
|
|
Provide the basic information required for a userland assist in
emulating instructions touching mmio regions, sending as much
information as is provided by the host hardware.
No decode or assist provided at the moment by vmd(8).
ok mlarkin@
|
|
c99 6.11.5:
"The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature."
ok guenther@
|
|
ok miod@ guenther@
|
|
Because the clock situation on x86 and amd64 is a terminal
clusterfuck, there are many different ways to delay(9). We need a
rudimentary mechanism for gracefully switching to progressively better
delay(9) implementations as they become available during boot without
riddling the code with ifdefs and function pointer comparisons.
This patch adds delay_init() to both amd64 and i386. If the quality
value passed to delay_init() exceeds the quality value of the current
delay_func, delay_init() changes delay_func to the given function
pointer and updates the quality value. Both platforms start with
delay_func set to i8254_delay() and a quality value of zero: all other
delay(9) implementations are preferable.
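The mechanism, sketched from the description above:

    void
    delay_init(void (*fn)(int), int quality)
    {
            /* only ever upgrade to a better implementation */
            if (quality > delay_quality) {
                    delay_func = fn;
                    delay_quality = quality;
            }
    }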
Idea and patch provided by jsg@. With tons of input, research, and
advice from jsg@.
Link: https://marc.info/?l=openbsd-tech&m=166053729104923&w=2
ok mlarkin@ jsg@
|
|
ok daniel@
|
|
Cyrix CPUs don't support amd64. These defines were probably carried
over from i386 accidentally when the amd64 code was first imported.
ok mlarkin@, jsg@
|
|
Computing a per-CPU TSC skew value is error-prone, especially on
multisocket machines and VMs. My best guess is that larger latencies
appear to the current skew measurement test as TSC desync, and so the
TSC is demoted to a kernel timecounter on these machines or marked
non-monotonic.
This patch eliminates per-CPU TSC skew values. Instead of trying to
measure and correct for TSC desync we only try to detect desync, which
is less error-prone. This approach should allow a wider variety of
machines to use the TSC as a timecounter when running OpenBSD.
In the new sync test, both CPUs repeatedly try to detect whether their
TSC is trailing the other CPU's TSC. The upside to this approach is
that it yields no false positives. The downside to this approach is
that it takes more time than the current skew measurement test. Each
test round takes 1ms, and we run up to two rounds per CPU, so this
patch slows boot down by 2ms per AP.
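One direction of the handshake, sketched (illustrative, not the
committed code):

    volatile uint64_t tsc_peer;     /* peer's latest TSC value */
    volatile int tsc_run, tsc_lagged;

    void
    tsc_test_observe(void)
    {
            while (tsc_run) {
                    /* reading behind the peer's published value
                     * means our TSC trails: desync detected */
                    if (rdtsc_lfence() < tsc_peer)
                            tsc_lagged = 1;
            }
    }

The roles are then swapped so each CPU checks against the other.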
If any CPU fails the sync test, the TSC is marked non-monotonic and a
different timecounter is activated. The TC_USER flag remains intact.
There is no middle ground where we fall back to only using the TSC in
the kernel.
Before running the test, we check for the IA32_TSC_ADJUST register and
reset it if necessary. This is a trivial way to work around firmware
bugs that desync the TSC before we reach the kernel. Unfortunately,
at the moment this register appears to only be available on Intel
processors. I cannot find an equivalent but differently-named MSR for
AMD processors.
Because there is no per-CPU skew value, there is also no concept of
TSC drift anymore.
Miscellaneous notes:
- This patch adds a new timecounter utility function, tc_reset_quality().
Used after sync test failure to mark the TSC non-monotonic.
- I have left TSC_DEBUG enabled for now. Unsure if we should leave it
enabled for release or not. If we disable it we no longer run the
sync test after failing it once. Running the test even after failure
provides information about the desync on every CPU.
- Taking 1ms per test round is fairly conservative. We can experiment
with and discuss shorter test rounds. My main goal with a relatively
long test round is ensuring VMs actually run the test. It would be
bad if a hypervisor interrupted the test for so long that it concealed
desync.
- The use of two test rounds is mostly a diagnostic tool: it would be
very strange if a CPU passed the first round but failed the second.
If we ever saw this in the wild it would indicate something odd.
- Most of the desync seen in test reports is on Ryzen CPUs. I
believe, but cannot prove, that this is due to a widespread
firmware bug on AMD motherboards. Hopefully AMD and/or the
downstream vendors fix it.
- Fixing TSC desync by writing the TSC directly with WRMSR is very
difficult. The TSC is a moving target incrementing very quickly and
compensating for WRMSR overhead is non-trivial. We can experiment
with this, but my confidence is low that we can make it work reliably.
Prompted by deraadt@ and kettenis@ in 2021. Shepherded along by
deraadt@ throughout. Reprompted by Yuichiro Naito several times.
With input from Yuichiro Naito, naddy@, sthen@, dv@, and deraadt@.
Tested by florian@, gnezdo@, sthen@, Josh Rickmar, dv@, Mohamed Aslan,
Hrvoje Popovski, Yuichiro Naito, semarie@, mlarkin@, asou@, jmatthew@,
Renato Aguiar, and Timo Myyra.
Patch v1: https://marc.info/?l=openbsd-tech&m=164330092208035&w=2
Patch v2: https://marc.info/?l=openbsd-tech&m=164558519712957&w=2
Patch v3: https://marc.info/?l=openbsd-tech&m=165698681018991&w=2
Patch v4: https://marc.info/?l=openbsd-tech&m=165835507113680&w=2
Patch v5: https://marc.info/?l=openbsd-tech&m=165923705118770&w=2
"just commit it" deraadt@
|
|
immutable/atomic/owned a la <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1
ok jsg@ mlarkin@
|
|
suggested by and ok mlarkin@
|
|
Previously for __cpu_simple_lock parts. Now only hppa and m88k use
__cpu_simple_lock (and hppa uses atomic.h for it).
ok miod@ visa@
|
|
Unlocking most of vmm last year at k2k21 exposed bugs related to
lifetime management of vm and vcpu objects.
Add reference counts to make sure we don't attempt to teardown vcpu
or vm related objects while a thread is holding a reference. This
also reduces abuse of rwlocks originally intended to protect the
linked lists, cleaning things up quite a bit. While here, also
document assumptions on how struct members are protected for the
next brave soul to wander in.
ok mlarkin@
|
|
UART found on AMD's Ryzen Embedded V1000 family) as an early console.
This requires additional parameters to be passed by the bootloader to the
kernel, so it changes the struct for the BOOTARG_CONSDEV boot argument.
The old struct will still be supported until OpenBSD 7.3 has been released
such that new kernels boot with the old bootloader.
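The extended payload looks roughly like this (a sketch; member names
illustrative):

    struct bios_consdev {
            dev_t    consdev;   /* existing */
            int      conspeed;  /* existing */
            uint64_t consaddr;  /* new: uart register base */
            int      consfreq;  /* new: uart reference clock, Hz */
    };

Old bootloaders pass the shorter struct, which the kernel can
presumably distinguish by the bootarg's size.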
ok anton@, deraadt@
|