Upgrade/downgrade operations on a `vmmaplk' are no longer necessary since
vm_map_busy() completely unlocks it (r1.318 of uvm/uvm_map.c).
ok kettenis@
|
|
Change the semantics of vm_map_busy() to be able to completely unlock the
`vmmaplk' instead of downgrading it to a read lock in mlock(2). This is
necessary because uvm_fault_wire() tries to re-grab the same lock.
We now keep track of the thread currently holding the vmmap busy to ensure
it can relock & unbusy the vmmap. The new pattern becomes:
    vm_map_lock(map);
    vm_map_busy(map);   /* prevent other threads from grabbing an exclusive lock */
    vm_map_unlock(map);

    /*
     * Do some stuff generally requiring a tsleep(9).
     */

    vm_map_lock(map);
    vm_map_unbusy(map); /* allow other threads to make progress after unlock */
    vm_map_unlock(map);
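For illustration, a minimal sketch of the owner tracking described above,
assuming a `busy' pointer in struct vm_map and a wakeup channel; both are
assumptions, not the literal committed code:

    void
    vm_map_busy(struct vm_map *map)
    {
            KASSERT(rw_write_held(&map->lock));
            KASSERT(map->busy == NULL);

            /* Record the owner; only this thread may unbusy the map. */
            map->busy = curproc;
    }

    void
    vm_map_unbusy(struct vm_map *map)
    {
            KASSERT(map->busy == curproc);

            map->busy = NULL;
            /* Wake threads sleeping in vm_map_lock() on the busy map. */
            wakeup(&map->flags);
    }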
Fix adapted from NetBSD's r1.249 of uvm/uvm_map.c. Issue reported by
Jacqueline Jolicoeur and exposed by a "wallet refresh" of the Monero App.
Panic hand-copied below:
sleep_finish()
rw_enter()
uvmfault_lookup()
uvm_fault_check()
uvm_fault()
uvm_fault_wire()
uvm_map_pageable_wire()
sys_mlock()
This version skips bumping the map's timestamp if the lock is acquired by the
thread that marked the VM map busy. This prevents a KASSERT() reported by
bluhm@, triggered by regress/misc/posixtestsuite conformance/interfaces/mmap/18-1
ok kettenis@
|
|
Do not grab the `vmmaplk' recursively; prevent a self-deadlock.
Doing so causes "panic: uvm_map_pageable_wire: stale map".
Found by regress/misc/posixtestsuite conformance/interfaces/mmap/18-1
requested by deraadt@
|
|
Change the semantics of vm_map_busy() to be able to completely unlock the
`vmmaplk' instead of downgrading it to a read lock in mlock(2). This is
necessary because uvm_fault_wire() tries to re-grab the same lock.
We now keep track of the thread currently holding the vmmap busy to ensure
it can relock & unbusy the vmmap. The new pattern becomes:
    vm_map_lock(map);
    vm_map_busy(map);   /* prevent other threads from grabbing an exclusive lock */
    vm_map_unlock(map);

    /*
     * Do some stuff generally requiring a tsleep(9).
     */

    vm_map_lock(map);
    vm_map_unbusy(map); /* allow other threads to make progress after unlock */
    vm_map_unlock(map);
Fix adapted from NetBSD's r1.249 of uvm/uvm_map.c. Issue reported by
Jacqueline Jolicoeur and exposed by a "wallet refresh" of the Monero App.
Panic hand-copied below:
sleep_finish()
rw_enter()
uvmfault_lookup()
uvm_fault_check()
uvm_fault()
uvm_fault_wire()
uvm_map_pageable_wire()
sys_mlock()
ok kettenis@
|
|
ever ran on, and it's unlikely to ever be implemented, so remove it.
ok jsg@
|
|
|
|
Mark the VM map as busy instead to prevent any sibling thread from requesting
an exclusive version of the vm_map. This is necessary to let any PG_BUSY page,
found in the UVM vnode object, be released by a sibling in the middle of a
page-fault.
Note: the page-fault handler releases & re-grabs a shared version of the
vm_map lock and expects it to be available to make progress.
Prevent a 3-thread deadlock between msync(2), page-fault and mmap(2). The
deadlock, reported on bugs@ by many, occurred as follows:
  ThreadA faults & grabs the shared `vmmaplk', then releases it before calling
  uvn_get(), which might sleep to allocate pages and mark them as PG_BUSY.
  Once the lock is released, threadB calls uvn_flush(). It sees at least one
  PG_BUSY page and sleeps on the `vmmaplk', waiting for threadA to un-busy
  the page.
  At the same time threadC asked for an exclusive version of the lock and
  sleeps until all readers are done with it. This prevents threadA from
  acquiring a shared version of the lock and finishing the page fault.
This issue is similar to NetBSD's PR #56952 and the fix is from Chuck Silvers.
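The shape of the fix, as a hypothetical excerpt from the msync(2) flushing
path; the exact functions involved and the lock state kept across the flush
differ in the real diff:

    vm_map_lock(map);
    vm_map_busy(map);       /* writers must wait, readers may progress */
    vm_map_unlock(map);

    /*
     * May sleep on a PG_BUSY page owned by a faulting sibling
     * thread; that sibling can still grab the shared `vmmaplk'
     * and finish its fault, un-busying the page.
     */
    error = uvn_flush(uobj, start, end, flags);

    vm_map_lock(map);
    vm_map_unbusy(map);
    vm_map_unlock(map);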
Tested by many on bugs@, thanks!
ok kettenis@
|
|
ok kettenis
|
|
the xonly rules are not applied to a child. This was due to the same
misunderstanding as with msyscall a few days ago.
ok guenther kettenis
|
|
only freshly executed processes were actually locked. (This happened
because I didn't realize that the uvm_map's contents are copied entry
by entry, while other parts are not.)
ok kettenis
|
|
against classic BROP with a range-checking wrapper in front of copyin() and
copyinstr() which ensures the userland source doesn't overlap the main program
text, ld.so text, signal tramp text (its mapping is hard to distinguish,
so it comes along for the ride), or libc.so text. ld.so tells the kernel the
libc.so text range with msyscall(2). The range checking for 2-4 elements is
done without locking (because all 4 ranges are immutable!) and is inexpensive.
write(sock, &open, 400) now fails with EFAULT. No programs have been
discovered which require reading their own text segments with a system call.
On a machine without mmu enforcement, a test program reports the following:

                  userland    kernel
  ld.so           readable    unreadable
  mmap xz         unreadable  unreadable
  mmap x          readable    readable
  mmap nrx        readable    readable
  mmap nwx        readable    readable
  mmap xnwx       readable    readable
  main            readable    unreadable
  libc unmapped?  readable    unreadable
  libc mapped     readable    unreadable
ok kettenis, additional help from miod
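For illustration only, a sketch of such a wrapper; the range array and its
field names (ps_textranges, tr_start/tr_end) are invented here, not the
actual kernel names:

    int
    uvm_check_copyin(struct proc *p, const void *uaddr, size_t len)
    {
            struct process *pr = p->p_p;
            vaddr_t start = (vaddr_t)uaddr, end = start + len;
            int i;

            /* The ranges are immutable after exec, so no lock is needed. */
            for (i = 0; i < pr->ps_ntextranges; i++) {
                    if (start < pr->ps_textranges[i].tr_end &&
                        end > pr->ps_textranges[i].tr_start)
                            return EFAULT;  /* source overlaps x-only text */
            }
            return 0;
    }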
|
|
Correct the logic, still blocking PROT_EXEC
ok anton kettenis
|
|
|
|
read/write operations, so mask out PROT_EXEC to avoid creating a pointless
exec mapping in the kernel.
We probably need this masking for minprot (in the non-UVM_EXTRACT_FIXPROT
case) as well, but I haven't tested that yet.
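The masking itself is a one-liner along these lines (a sketch of the idea,
not the literal diff):

    /*
     * The kernel side only reads/writes this mapping, so an exec
     * mapping in the kernel map would be pointless.
     */
    prot &= ~PROT_EXEC;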
ok kettenis
|
|
|
|
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis
|
|
Also grab the lock in uvm_map_teardown() and uvm_map_deallocate() to
satisfy the assertions. Grabbing the lock there shouldn't be strictly
necessary, because no other reference to the map should exist when the
reaper is holding it, but it doesn't hurt and makes our life easier.
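A sketch of the idea; the uvm_unmap_remove() argument list here is an
assumption, abbreviated from the teardown path:

    void
    uvm_map_teardown(struct vm_map *map)
    {
            struct uvm_map_deadq dead_entries;

            /*
             * No other reference exists at this point, but lock anyway
             * to satisfy the assertions in uvm_unmap_remove().
             */
            vm_map_lock(map);
            TAILQ_INIT(&dead_entries);
            uvm_unmap_remove(map, map->min_offset, map->max_offset,
                &dead_entries, TRUE, FALSE);
            vm_map_unlock(map);

            uvm_unmap_detach(&dead_entries, 0);
    }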
Inputs & tests from Ivo van der Sangen, tb@, gnezdo@, kn@
kettenis@ and tb@ agree with the direction, ok kn@
|
|
If enabled, the debug code currently panics the kernel. To investigate.
|
|
for IMMUTABLE, before traversing for unmap. I didn't copy enough traversal
code for the scan, and thus MAP_FIXED was subtly broken.
test help from tb, ok kettenis miod
|
|
I really want immutable to not allow such transitions either, because it will
help bring code up to the highest standard.
For now, allow this for all processes, until we find out the underlying
reason.
|
|
|
|
go back to the old approach: using a new anon mapping, because it removes
any potential gadgetry pre-placed in the region (by making it zero). But
also bring in a few more validation checks beyond a contiguous mapping -- it
must not be a syscall region, and the protection must be precisely RW.
This does allow sigaltstack() to shoot zero'd MAP_STACK non-immutable regions
into the main stack area (which will soon be immutable). I am not sure we
can keep enforcing immutable on the region after we do the stack (maybe
determine this while doing the validation entry walk?)
Sadly, continued support for sigaltstack() does require selecting the guessed
best compromise.
ok kettenis
|
|
|
|
problem because haphazard use could shoot holes in the address space
(changing permissions, providing opportunities for pivoting, etc). I
tried to write a diff to convert the address space correctly but did
not understand enough about map entries, so instead we mapped new
memory over top of the existing object. Placing a new mapping becomes
infeasible with the upcoming mimmutable model, so here is code that
adds MAP_STACK to the region. It will only do so for a contiguously
mapped region that is non-syscall with permission RW; otherwise it
returns an error.
Food for thought: If we know the region isn't backed by an object,
we should consider zero'ing the region, to block pre-pivot placement?
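A condensed sketch of that validation walk; the iteration helpers and flag
names are paraphrased from uvm_map.h, not copied from the actual diff:

    struct vm_map_entry *entry;
    vaddr_t cursor = start;

    vm_map_lock(map);
    /* First pass: validate the whole range before changing anything. */
    for (entry = uvm_map_entrybyaddr(&map->addr, start);
        entry != NULL && entry->start < end;
        entry = RBT_NEXT(uvm_map_addr, entry)) {
            if (entry->start != cursor ||               /* hole */
                (entry->etype & UVM_ET_SYSCALL) ||      /* syscall region */
                entry->protection != (PROT_READ | PROT_WRITE)) {
                    vm_map_unlock(map);
                    return EINVAL;
            }
            cursor = entry->end;
    }
    if (cursor < end) {         /* range not fully covered */
            vm_map_unlock(map);
            return EINVAL;
    }
    /* Second pass: all checks passed, mark the region MAP_STACK. */
    for (entry = uvm_map_entrybyaddr(&map->addr, start);
        entry != NULL && entry->start < end;
        entry = RBT_NEXT(uvm_map_addr, entry))
            entry->etype |= UVM_ET_STACK;
    vm_map_unlock(map);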
ok kettenis
|
|
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
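For reference, usage from userland looks roughly like this, assuming the
mimmutable(2) prototype int mimmutable(void *addr, size_t len):

    #include <sys/mman.h>
    #include <err.h>

    int
    main(void)
    {
            size_t len = 1UL << 12; /* one page, assuming 4k pages */
            char *p;

            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            if (mimmutable(p, len) == -1)
                    err(1, "mimmutable");
            /* The mapping's permissions and existence are now locked. */
            if (mprotect(p, len, PROT_READ) == -1)
                    warn("mprotect");       /* expected to fail with EPERM */
            if (munmap(p, len) == -1)
                    warn("munmap");         /* expected to fail with EPERM */
            return 0;
    }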
ok kettenis
|
|
|
|
ok miod@ millert@
|
|
to be available to other files. NFC
ok kettenis@ mpi@
|
|
ok millert@, kettenis@
|
|
This introduced a lock ordering issue reported by naddy@, anton@ and syzkaller.
Reported-by: syzbot+739bb901045d9b193bde@syzkaller.appspotmail.com
|
|
to prevent another thread from unmapping the memory and triggering
an assertion or even corrupting random physical memory pages.
This fix is similar to the change in uvm_glue.c rev. 1.74. However in this
case we need to be careful since some sysctl(2) calls look at the map of
the current process. In those cases we must not attempt to lock the map
again.
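A hypothetical excerpt showing the careful case (names simplified):

    struct vm_map *map = &pr->ps_vmspace->vm_map;

    /*
     * Some sysctl(2) paths inspect curproc's own map; locking it
     * again here could deadlock against ourselves.
     */
    if (map != &curproc->p_vmspace->vm_map)
            vm_map_lock_read(map);
    /* ... walk the map entries safely ... */
    if (map != &curproc->p_vmspace->vm_map)
            vm_map_unlock_read(map);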
ok mpi@
Should fix:
Reported-by: syzbot+be89fe83d6c004fcb412@syzkaller.appspotmail.com
|
|
The (known) lock order reversals, which now occur more reliably and much
earlier on WITNESS boots with this diff, knock out syzkaller reports, since
syzkaller stops at the first "crash report":
https://syzkaller.appspot.com/bug?id=81b39e970cd2eb21b97d1b31746c693e300fd2dd
|
|
This is an updated version of uvm_map.c r1.283 "Unwire with map lock held".
The previous version introduced a use-after-free by not unlocking vm_map
locks in uvm_map_teardown(), resulting in dangling references on the
reaper's lock list (thanks visa!).
Lock and unlock the map around uvm_map_teardown() instead.
This code path holds the last reference, hence the lock isn't strictly
needed except for satisfying upcoming locking assertions.
Tested on amd64, arm64, i386, macppc, octeon, sparc64.
This time also with WITNESS enabled (except on sparc64 which builds but does
not boot with WITNESS; this is a known issue).
OK mpi visa
|
|
WITNESS builds broke^W^Wkernels panic on boot as reported by anton and bluhm.
Booting bsd.mp in single-user mode inside VMM shows:
root on sd0a (5f9e458ed30b39ab.a) swap on sd0b dump on sd0b
Enter pathname of shell or RETURN for sh:
witness: lock order reversal:
1st 0xfffffd801f8ce468 vmmaplk (&map->lock)
2nd 0xfffffd801b8162c0 inode (&ip->i_lock)
lock order "&ip->i_lock"(rrwlock) -> "&map->lock"(rwlock) first seen at:
#0 rw_enter_read+0x38
#1 uvmfault_lookup+0x8a
#2 uvm_fault_check+0x32
#3 uvm_fault+0xfb
#4 kpageflttrap+0x12c
#5 kerntrap+0x91
#6 alltraps_kern_meltdown+0x7b
#7 copyout+0x53
#8 ffs_read+0x1f6
#9 VOP_READ+0x41
#10 vn_rdwr+0xa1
#11 vmcmd_map_readvn+0xa0
#12 exec_process_vmcmds+0x88
#13 sys_execve+0x732
#14 start_init+0x26f
#15 proc_trampoline+0x1c
lock order data w1 -> w2 missing
# exit
kernel: protection fault trap, code=0
Stopped at witness_checkorder+0x312: movl 0x10(%r14),%ecx
gkoehler reported faults on poisoned addresses on macppc dual G5.
|
|
WITNESS builds broke as reported by anton and bluhm:
root on sd0a (5ec49b3ad23eb2d4.a) swap on sd0b dump on sd0b
kernel: protection fault trap, code=0
Stopped at witness_checkorder+0x4ec: movl 0x10(%r12),%ecx
https://syzkaller.appspot.com/bug?id=be02b290a93c648986c35370a271aad4135a5044
https://syzkaller.appspot.com/text?tag=CrashLog&x=136e9aa4700000
|
|
Introduce vm_map_assert_{wrlock,rdlock,anylock,unlocked}() in rwlock(9)
fashion and back up function comments about locking assumptions with proper
assertions.
Also add new comments/assertions based on code analysis and sync with
NetBSD as much as possible.
vm_map_lock() and vm_map_lock_read() are used for exclusive and shared
access respectively; currently no code path is purely protected by
vm_map_lock_read() alone, i.e. functions called with a read lock held by the
caller are also called with a write lock elsewhere.
Thus only vm_map_assert_{wrlock,anylock}() are used as of now.
This should help with unlocking UVM-related syscalls.
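For illustration, the write-lock variant presumably reduces to the matching
rwlock(9) assertion; the INTRSAFE branch is an assumption based on how
vm_map_lock() distinguishes the two kinds of maps:

    void
    vm_map_assert_wrlock(struct vm_map *map)
    {
            if ((map->flags & VM_MAP_INTRSAFE) == 0)
                    rw_assert_wrlock(&map->lock);
            else
                    MUTEX_ASSERT_LOCKED(&map->mtx);
    }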
Tested as part of a larger diff through
- amd64 package bulk build by naddy
- amd64, arm64, powerpc64 base builds and regress by bluhm
- amd64 and sparc64 base builds and regress by me
Input mpi
Feedback OK kettenis
|
|
uvm_unmap_remove() effectively requires its caller to lock the vm map.
Even though uvm_map_teardown() is only called after a map's last reference
is dropped and is thus safe from other threads accessing the map, grab the
map's lock in uvm_map_teardown() to satisfy upcoming lock assertions in
uvm_unmap_remove().
Tested as part of a larger diff through
- amd64 package bulk builds by naddy
- amd64, arm64, powerpc64 base builds and regress by bluhm
- amd64 and sparc64 base builds and regress by me
Feedback mpi
OK kettenis
|
|
Pass the correct entry to uvm_fault_unwire_locked().
Reported-by: syzbot+bb2f63f076618e9ed0d3@syzkaller.appspotmail.com
ok kettenis@, deraadt@
|
|
Like the per-amap lock, the `vmobjlock' is principally used to serialize
access to objects in the fault handler to allow faults occurring on
different CPUs and different objects to be processed in parallel.
The fault handler now acquires the `vmobjlock' of a given UVM object as
soon as it finds one. For now a write-lock is always acquired even if
some operations could use a read-lock.
Every pager, corresponding to a different kind of UVM object, now expects
the UVM object to be locked, and some operations, like *_get(), return it
unlocked. This is enforced by assertions checking for rw_write_held().
The KERNEL_LOCK() is now pushed to the VFS boundary in the vnode pager.
To ensure the correct amap or object lock is held when modifying a page,
many uvm_page* operations now assert that the "owner" lock is held.
However, fields of the "struct vm_page" are still being protected by the
global `pageqlock'. To prevent lock ordering issues with the new
`vmobjlock' and to reduce differences with NetBSD this lock is now taken
and released for each page instead of around the whole loop.
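A sketch of the new discipline in the fault handler; simplified, with the
usual uvm naming:

    struct uvm_object *uobj = ufi->entry->object.uvm_obj;

    if (uobj != NULL) {
            /*
             * Write-lock the object as soon as it is found, even
             * where a read-lock might eventually suffice.
             */
            rw_enter(uobj->vmobjlock, RW_WRITE);
            /*
             * Pager methods, e.g. uvn_get(), assert this on entry
             * and may return with the object unlocked.
             */
            KASSERT(rw_write_held(uobj->vmobjlock));
    }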
This commit does not remove the KERNEL_LOCK/UNLOCK() dance. Unlocking
will follow if there is no fallout.
Ported from NetBSD, tested by many, thanks!
ok kettenis@, kn@
|
|
prints the end, which is in the next page. Subtract 1 to avoid confusion.
|
|
Reduce differences with NetBSD, tested by many as part of a larger diff.
ok kettenis@
|
|
This is possible now that amaps & anons are protected by a per-map rwlock.
Tested by many as part of a bigger diff.
ok kettenis@
|
|
This change introduced or exposed a leak of anons which results in system
freezes.
anton@ observed a high number of INUSE for anonpl and semarie@ saw multiple
processes waiting in the fault handler on "flt_noramX", probably the one
related to allocating an anon.
|
|
This is possible now that amaps & anons are protected by a per-map rwlock.
ok kettenis@, jmatthew@
|
|
This is necessary to do this accounting without the KERNEL_LOCK().
ok mvs@, kettenis@
|
|
No functional change.
ok mlarkin@
|
|
ok mpi@
|
|
|
|
ok mpi@
|
|
A rwlock is attached to every amap and is shared with all its anons. The
same lock will be used by multiple amaps if they have anons in common.
This should be enough to get the upper part of the fault handler out of the
KERNEL_LOCK(), which seems to bring up to 20% improvement in builds.
This is based/copied/adapted from the most recent work done in NetBSD, which
is an evolution of the preceding simple_lock scheme.
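A sketch of the sharing, using a hypothetical helper; the field names
(am_lock, an_lock, am_anon) follow the NetBSD layout and are assumptions
here:

    /* Hypothetical helper: attach an anon to an amap slot. */
    void
    amap_attach_anon(struct vm_amap *amap, struct vm_anon *anon, int slot)
    {
            KASSERT(rw_write_held(amap->am_lock));

            /*
             * The anon inherits the amap's lock, so holding the amap
             * lock protects every anon reachable through it; amaps
             * sharing anons thus end up sharing one lock.
             */
            anon->an_lock = amap->am_lock;
            amap->am_anon[slot] = anon;
    }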
Tested by many, thanks!
ok kettenis@, mvs@
|