Age | Commit message | Author |
|
If any other CPU has not finished wbinvd, the PSP command may fail. To
avoid races, call wbinvd_on_all_cpus_acked(), which waits for
acknowledgement from the IPI handler. Provide a stub so non-MP
kernels still build.
from hshoexer@; OK mlarkin@
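For illustration only (psp_do_cmd() and the softc are placeholders, not the
committed code), the intended call pattern and a plausible non-MP stub look
roughly like this:

    /* flush and wait for every CPU before the firmware touches memory */
    wbinvd_on_all_cpus_acked();
    error = psp_do_cmd(sc, cmd);        /* hypothetical PSP command path */

    /* one possible non-MP stub: a single CPU only needs its own wbinvd */
    #ifndef MULTIPROCESSOR
    #define wbinvd_on_all_cpus_acked()  wbinvd()
    #endif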
|
|
|
|
|
|
Implement wbinvd_on_all_cpus_acked() similar to pmap_tlb_shootpage().
This ensures wbinvd has been executed on all cores when the function
returns. This is needed to avoid psp(4) races.
from hshoexer@; OK mlarkin@
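A rough sketch of that shape, with placeholder names for the IPI dispatch
and the wait counter (the real code follows the pmap_tlb_shootpage()
pattern; only wbinvd(), CPU_BUSY_CYCLE() and atomic_dec_int() are meant as
the actual kernel interfaces):

    volatile unsigned int wbinvd_wait;          /* CPUs still to acknowledge */

    void
    wbinvd_on_all_cpus_acked(void)
    {
            wbinvd_wait = ncpus - 1;
            send_wbinvd_ipi_to_others();        /* placeholder for the broadcast */
            wbinvd();                           /* flush our own caches too */
            while (wbinvd_wait > 0)             /* spin until every handler ran */
                    CPU_BUSY_CYCLE();
    }

    /* IPI handler on each remote CPU */
    void
    wbinvd_ipi_handler(void)
    {
            wbinvd();
            atomic_dec_int(&wbinvd_wait);       /* acknowledge */
    }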
|
|
ok claudio@
|
|
|
|
|
|
vmm(4) doesn't use the VMX VMFUNC instruction.
ok mlarkin@
|
|
vmm(4) doesn't need this information anymore. vmd(8) is the only
consumer of this information.
ok mlarkin@
|
|
Similar to how the fast ipi for tlb flush is implemented, this adds
one for calling INVEPT to invalidate EPT caches on the cpu. This
is the first step to allowing guest memory to not be wired by UVM
and decreases the behavioral differences between Intel and AMD's
nested paging in vmm(4) and pmap(9).
This change does not hook EPT ptes into the PV list, so the ipi is
only used during address space teardown and pte removal. (With the
removal of the "mprotect" ioctl, vmm(4) no longer modifies EPT ptes
other than inserting them and removing them.)
ok mlarkin@
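For illustration, the per-CPU side of such a shootdown might look like the
following; the descriptor layout matches the SDM, but the invept() wrapper
and constant names are only assumptions about vmm(4)'s internals:

    struct invept_desc {
            uint64_t        eptp;               /* EPT pointer of the target pmap */
            uint64_t        reserved;
    };

    /* run on each CPU from the fast IPI handler */
    void
    ept_shoot_handler(struct invept_desc *desc)
    {
            /* drop EPT translations cached for this pmap on this CPU */
            invept(IA32_VMX_INVEPT_SINGLE_CTX, desc);
    }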
|
|
This old ioctl isn't used by vmd(8) and is getting in the way of some
improvements we want to do. It was used by solo5 but the person who was
helping maintain this is no longer involved with that project.
ok dv
|
|
'fine with me' hshoexer, ok bluhm@
|
|
Limit ccp ioctls to processes that pledge vmm. Specific psp device
ioctls for AMD SEV will be allowed for vmd(8).
from hshoexer@; input deraadt@ jsg@
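From the userland side the rule is simple: a process must carry the "vmm"
promise to keep using these ioctls. A purely illustrative example, where
psp_fd is an already open device descriptor and PSP_IOC_GET_PSTATUS is a
hypothetical request name:

    #include <sys/ioctl.h>
    #include <err.h>
    #include <unistd.h>

    void
    query_psp(int psp_fd, void *pst)
    {
            if (pledge("stdio vmm", NULL) == -1)        /* keep the vmm promise */
                    err(1, "pledge");
            if (ioctl(psp_fd, PSP_IOC_GET_PSTATUS, pst) == -1)
                    err(1, "ioctl");
    }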
|
|
Bring the pieces for vmm(4) to support guests with SEV memory
encryption on AMD CPUs. The corresponding vmd(8) changes will
follow.
Emulate cpuid 0x8000001f so the guest can discover SEV features.
Allow vmd(8) to enable SEV on VM creation. Inform vmd(8) about the
c-bit position and ASID assigned to each VCPU.
Note that vmd(8) has to be rebuilt with the new header files.
from hshoexer@; input dv@; OK mlarkin@
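The cpuid part is the easiest to picture. A simplified sketch of the leaf
0x8000001f answer, assuming the usual vmm cpuid-exit switch; the per-vcpu
names are assumptions, while EAX bit 1 (SEV supported) and EBX[5:0] (C-bit
position) follow the architectural definition:

    case 0x8000001f:        /* AMD encrypted memory capabilities */
            if (vcpu_has_sev) {                 /* assumption: per-vcpu SEV flag */
                    *rax = (1UL << 1);          /* SEV supported */
                    *rbx = guest_cbit & 0x3f;   /* C-bit position for the guest */
            } else
                    *rax = *rbx = 0;
            *rcx = 0;                           /* no ASID details exposed */
            *rdx = 0;
            break;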
|
|
When running as SEV guest, as indicated by variable cpu_sev_guestmode,
allocate additional pages for each segment on dma map creation.
These pages are mapped with the PMAP_NOCRYPT attribute, i.e. the
crypt bit is not set in the PTE. Thus, these pages are shared with
the hypervisor.
When the map is loaded with actual pages, the address in the
descriptor is replaced by the corresponding bounce buffer. Using
bus_dmamap_sync(), data is copied from the encrypted pages used by
guest drivers to the unencrypted bounce buffers shared with the
hypervisor, and vice versa.
If the kernel is not running in SEV guest mode, i.e. as a normal
host or a non-SEV guest, no bounce buffers are used.
from hshoexer@; based on ancient code of mickey@; OK kettenis@
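Schematically, the PREWRITE direction of that copy looks like this; only
cpu_sev_guestmode, BUS_DMASYNC_PREWRITE and the dm_segs bookkeeping are
real interfaces, the bounce/orig helpers are made up for the sketch:

    int i;

    if (cpu_sev_guestmode && (op & BUS_DMASYNC_PREWRITE)) {
            /* copy driver data into the unencrypted bounce page the
             * device (i.e. the hypervisor) will actually read */
            for (i = 0; i < map->dm_nsegs; i++)
                    memcpy(bounce_va(map, i), orig_va(map, i),
                        map->dm_segs[i].ds_len);
    }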
|
|
various Intel SoCs. The driver takes care of calling the AML methods
needed to enter low power idle states during suspend-to-idle (S0i).
The driver also implements some debug code that prints the residency of
various power states in dmesg. Based on some earlier code by jcs@
ok jcs@
|
|
Actually determine the C-bit position if we are running as a guest
with SEV enabled. Configure pg_crypt, pg_frame and pg_lgframe
accordingly, using the physical address bit reduction provided by
cpuid.
from hshoexer@; OK mlarkin@
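A simplified picture of that boot-time computation: EBX[5:0] of leaf
0x8000001f is the C-bit position, EBX[11:6] the physical address bit
reduction. The CPUID() helper and the exact mask arithmetic stand in for
the real MD code:

    uint32_t eax, ebx, ecx, edx;
    int cbit;

    CPUID(0x8000001f, eax, ebx, ecx, edx);
    cbit = ebx & 0x3f;                  /* C-bit position in the PTE */
    pg_crypt = 1ULL << cbit;
    /* drop the C-bit from the frame masks (the address-bit reduction
     * is ignored in this sketch) */
    pg_frame = PG_FRAME & ~pg_crypt;
    pg_lgframe = PG_LGFRAME & ~pg_crypt;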
|
|
Designed to let userland peek at AT_HWCAP and AT_HWCAP2 using an already
existing interface coming from FreeBSD. Header bits were snatched from
there. Input & ok kettenis@
libc bump and sets sync will follow soon
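Usage mirrors the FreeBSD interface the commit refers to, roughly (assuming
the FreeBSD-style <sys/auxv.h> header):

    #include <sys/auxv.h>
    #include <stdio.h>

    int
    main(void)
    {
            unsigned long hwcap = 0;

            /* elf_aux_info(3) returns 0 on success */
            if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0)
                    printf("AT_HWCAP = %#lx\n", hwcap);
            return 0;
    }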
|
|
Since vmm handles nested page faults in the vcpu run loop, trying
to avoid trips back to userland, it's possible for the thread to
move host cpus. vmm(4) already updates some local cpu state when
this happens, but also needs to update the host cr3 in the vmcs to
allow vmx to restore the proper cr3 value on the next vm exit.
Additionally, we should be flushing the ept cache on the new cpu.
If the single context flush is available, use that instead of the
global flush.
ok mlarkin@
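A sketch of the "moved to a new cpu" path in the run loop; the field and
constant names follow vmm(4)'s conventions but should be read as
approximations, and "vid" stands for a descriptor holding this vcpu's EPTP:

    if (ci != vcpu->vc_last_pcpu) {
            /* let the next VM exit restore this CPU's kernel cr3 */
            if (vmwrite(VMCS_HOST_IA32_CR3, rcr3()))
                    return (EINVAL);

            /* flush EPT translations cached on this CPU */
            if (single_context_invept_supported)        /* assumed capability flag */
                    invept(IA32_VMX_INVEPT_SINGLE_CTX, &vid);
            else
                    invept(IA32_VMX_INVEPT_GLOBAL_CTX, &vid);
    }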
|
|
Makes as much of the core of vmd mi as possible, pushing x86-isms into separate
compilation units. Adds build logic for arm64, but no emulation
yet. (You can build vmd, but it won't have a vmm device to connect
to.)
Some more cleanup probably needed around interrupt controller
abstraction, but that can come as we implement more than the i8259.
ok mlarkin@
|
|
The C-bit in a page table entry is used by an SEV guest to specify
which pages are to be encrypted and which are not. The latter is needed
to share pages with the hypervisor for virtio(4).
The actual position of the C-bit within a PTE is CPU implementation
dependent and needs to be determined dynamically at system boot.
The position of the C-bit also determines the actual size of the page
frame mask. This will be provided by a separate change.
To be able to use the same kernel as both host and guest, the C-bit
is provided as variable similar to the NX-bit. Same holds for the
page frame masks.
Right now, pg_crypt is set to 0, and pg_frame and pg_lgframe to PG_FRAME
and PG_LGFRAME respectively. Thus the kernel works as a host system
the same as before.
Also introduce a PMAP_NOCRYPT flag. A guest will use this with
busdma to establish unencrypted mappings that can be shared with
the hypervisor.
from hshoexer@; OK mlarkin@
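Roughly how the bit ends up in a PTE (a sketch, not the committed pmap
code; protection bits are reduced to PG_V | PG_RW for brevity):

    pte = (pa & pg_frame) | PG_V | PG_RW;
    if ((flags & PMAP_NOCRYPT) == 0)
            pte |= pg_crypt;    /* 0 on a host or non-SEV guest, so a no-op there */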
|
|
To prepare for mi/md splitting vmd, need to fixup the dev/vmm/vmm.h
mi header. Move the vm_run_params struct and clean up the includes
in vmd.
"sure", mlarkin@
|
|
Enable identifycpu() to discover and show AMD SEV related information
provided by cpuid.
The "crypt bit" for page table entries is stored in amd64_pos_cbit,
although it is not used yet.
Registers ecx and edx provide the number of guests and the minimum ASID
for SEV-only guests. At least the latter value can be configured
in the BIOS, so it is useful to have this information in dmesg.
Therefore define empty bit masks for printf("%b") to get the raw
numbers.
from hshoexer@; OK mlarkin@
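For context, the kernel printf(9) "%b" conversion takes a value plus a
bit-name string whose first byte selects the output base; with no bit names
it simply prints the raw number. A sketch of the idea, not the committed
lines:

    /* "\12" = base 10 and no named bits: prints just the raw value */
    printf(", %b SEV guests, minimum SEV-only ASID %b",
        ecx, "\12", edx, "\12");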
|
|
Having differences between architectures is asking for problems. And
adding a barrier here just makes sense in most cases. This is also what
cpu_relax() provides in Linux land.
ok kettenis@ claudio@
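Roughly what the amd64 version looks like with the barrier (other
architectures use their own hint instruction, or just the barrier), plus a
typical spin loop; the "memory" clobber is what forces the compiler to
reload the tested value on every iteration:

    #define CPU_BUSY_CYCLE()    __asm volatile("pause": : : "memory")

    volatile int ready;

    void
    wait_for_ready(void)
    {
            while (ready == 0)
                    CPU_BUSY_CYCLE();
    }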
|
|
|
|
ok deraadt@, guenther@, mlarkin@, jsg@
|
|
on machines that don't support S3. In its current state it doesn't save
a lot of power, but this should improve over time. Implementation of
wakeup methods is incomplete, which means that some machines can't resume
at the moment.
ok mglocker@, mlarkin@, stsp@, deraadt@
|
|
i386 such that we can call the necessary hooks in the suspend/resume code
without adding #ifdefs. Tweak the arm64 implementation such that we can
call the hooks earlier as this is necessary to mask MSI and MSI-X
interrupts on arm64.
ok deraadt@, mlarkin@
|
|
|
|
|
|
rev 1.17 (2017-5-27) when tlbflushg() stopped using it
|
|
isaphysmem and isaphysmempgs were removed in 1998
ok kettenis@
|
|
cpuid uses into identifycpu(), as they aren't needed anywhere else.
ok kettenis@
|
|
too. This is also much more space efficient.
Reduce the cpu flag noise in dmesg by suppressing lines and registers
that are identical with the previous CPU and show -/+ info if there
are any differences.
particular feedback from deraadt@, kettenis@, jsg@, and dv@
ok deraadt@
|
|
The compiler already translates the generic code into arithmetic
byte-swap instructions or byte-swapping memory load and store
instructions if available on an architecture.
ok deraadt@ guenther@
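As an illustration of the pattern the commit relies on, a portable swap
like this compiles down to a single bswap (or a byte-swapping load/store)
on architectures that have one, so a hand-written MD version buys nothing:

    #include <stdint.h>

    static inline uint32_t
    swap32_generic(uint32_t x)
    {
            return (x >> 24) | ((x >> 8) & 0xff00) |
                ((x << 8) & 0xff0000) | (x << 24);
    }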
|
|
The caches are used primarily to reduce contention on uvm_lock_fpageq() during
concurrent page faults. For the moment only uvm_pagealloc() tries to get a
page from the current CPU's cache. So on some architectures the caches are
also used by the pmap layer.
Each cache is composed of two magazines. The design is borrowed from Jeff
Bonwick's vmem paper and the implementation is similar to dlg@'s pool_cache.
However there is no depot layer and magazines are refilled directly by
the pmemrange allocator.
This version includes splvm()/splx() dances because the buffer cache flips
buffers in interrupt context, so we have to prevent recursive accesses to
the per-CPU magazines.
Tested by naddy@, solene@, krw@, robert@, claudio@ and Laurence Tratt.
ok claudio@, kettenis@
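An illustrative shape of the allocation fast path (the identifiers are not
the real uvm ones, and the magazine size is arbitrary here); only
splvm()/splx() are meant as the actual kernel interfaces:

    struct cpu_pcache_mag {
            struct vm_page  *pages[16];
            int              nitems;
    };
    struct cpu_pcache {
            struct cpu_pcache_mag   mag[2];     /* two magazines per CPU */
    };

    struct vm_page *
    pcache_get(struct cpu_pcache *pc)
    {
            struct vm_page *pg = NULL;
            int i, s;

            s = splvm();        /* the buffer cache may re-enter from interrupt */
            for (i = 0; i < 2; i++)
                    if (pc->mag[i].nitems > 0) {
                            pg = pc->mag[i].pages[--pc->mag[i].nitems];
                            break;
                    }
            splx(s);
            return pg;          /* NULL means: refill from the pmemrange allocator */
    }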
|
|
There's no need to distinguish the "first" time running a vcpu from
the subsequent times because vmm(4) uses in-kernel state tracking
the last vm exit reason to optimize the logic for updating vcpu
registers from userland. While here, clean up the DPRINTF's to make
the Intel VMX logic similar to the AMD SVM.
ok mlarkin@
|
|
|
|
The caches are used primarily to reduce contention on uvm_lock_fpageq() during
concurrent page faults. For the moment only uvm_pagealloc() tries to get a
page from the current CPU's cache. So on some architectures the caches are
also used by the pmap layer.
Each cache is composed of two magazines. The design is borrowed from Jeff
Bonwick's vmem paper and the implementation is similar to dlg@'s pool_cache.
However there is no depot layer and magazines are refilled directly by
the pmemrange allocator.
Tested by robert@, claudio@ and Laurence Tratt.
ok kettenis@
|
|
unused Skylake AVX-512 MDS handler and increases the ci_mds_tmp array to
64 bytes. With help from guenther@
ok deraadt@, guenther@
|
|
ok miod@ guenther@
|
|
In order to continue work on mmio and other instruction emulation,
vmd(8) needs the ability to inject exceptions (like page faults)
from userland.
Refactor the way events are injected from userland, cleaning up how
hardware (external) interrupts are injected in the process.
ok mlarkin@
|
|
level and a numeric mapping of the cpu vendor, both from CPUID(0).
Convert the general use of strcmp(cpu_vendor) to simple numeric
tests of ci_vendor. Track the minimum of all ci_cpuid_level in the
cpuid_level global and continue to use that for what vmm exposes.
AMD testing help matthieu@ krw@
ok miod@ deraadt@ cheloha@
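The effect on callers, schematically (the CPUV_AMD spelling and the callee
are illustrative, not necessarily the committed names):

    /* before: a string compare wherever the vendor matters */
    if (strcmp(cpu_vendor, "AuthenticAMD") == 0)
            setup_amd_quirks();

    /* after: one numeric test on the per-CPU field */
    if (ci->ci_vendor == CPUV_AMD)
            setup_amd_quirks();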
|
|
of later enhancements, removing the save/restore of flags, selectors,
and MSRs: flags are caller-saved and don't need restoring while
selectors and MSRs are auto-restored. The FSBASE, GSBASE, and
KERNELGSBASE MSRs just need the correct values set with vmwrite()
in the "on new CPU?" block of vcpu_run_vmx().
Also, only rdmsr(MSR_MISC_ENABLE) once in vcpu_reset_regs_vmx(),
give symbolic names to the exit-load MSR slots, eliminate
VMX_NUM_MSR_STORE, and #if 0 the vc_vmx_msr_entry_load_{va,pa} code
and definitions as unused.
ok dv@
|
|
present in Intel Atom CPUs, reordering some ASM in return-to-userspace and
start/resume-vmx-guest to reduce the number of kernel values still live in
registers when VERW is used. This mitigation requires updated firmware,
with which affected CPUs report RFDS_CLEAR in dmesg.
Firmware packaging by jsg@ and sthen@
Logic for interpreting Intel's flags by jsg@ after lots of discussion
between him, deraadt@, and me
ok deraadt@
|
|
Xsyscall32 stub and UCODE32 selector, set MSR_CSTAR to zero at CPU
startup, and rezero on ACPI resume and VM exit.
requested a while ago by deraadt@
AMD VM testing chris@
testing and ok krw@
|
|
The code has outgrown the original name for this struct. Both the
external and internal APIs have used the "clockqueue" namespace for
some time when operating on it, and that name is eyeball-consistent
with "clockintr" and "clockrequest", so "clockqueue" it is.
|
|
userspace from cross-process BTI to the kernel. Have each CPU track
the last pmap run on in userspace and the last vmm VCPU in guest-mode
and use the IBPB msr to flush predictors right before running in
userspace on a different pmap or entering guest-mode on a different
VCPU. Codepatch-nop the userspace bits and conditionalize the vmm
bits to keep working if IBPB isn't supported.
ok deraadt@ kettenis@
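A sketch of the check before returning to userspace (the guest-mode side is
analogous); MSR_PRED_CMD and PRED_CMD_IBPB are the architectural names,
while the ci_* field is illustrative, and the whole block is the part that
gets codepatch-nop'd when IBPB isn't supported:

    if (ci->ci_last_upmap != pm) {
            wrmsr(MSR_PRED_CMD, PRED_CMD_IBPB);     /* flush branch predictors */
            ci->ci_last_upmap = pm;
    }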
|
|
requires retpoline. If 0, we should do everything in our power to avoid
pure retpoline (replacing it with a simple thunk where possible), because
by its nature retpoline converts an indirect branch into a direct branch
(push to stack & ret), and therefore it is an IBT (endbr64) bypass method.
This sysctl leverages guenther's decision-making logic in the kernel, which
already uses codepatch to fix the kernel retpoline thunk.
In my opinion, the retpoline-using logic really should be flipped; ROP
execution bypassing IBT to re-enter regular control flow is more dangerous
than spectre.
ok kettenis
|
|
first six entries are in the same order as syscall arguments, such
that syscall() can just use the trapframe as the argument vector
for mi_syscall() and not need to reorder into another buffer on the
stack. This doesn't affect coredump layout or ptrace(2), but does
affect kernel crash dumps.
Possibility noted during miod@'s cleanup of the MD syscall()
implementations
ok mlarkin@ kurt@
|