path: root/sys/kern/kern_exec.c
Age | Commit message | Author
2024-09-30Remove code after exit1() and NOTREACHED comment. Nothing will ever get there.Claudio Jeker
OK mpi@
2024-08-21We do not need the PS_LIBCPIN and PS_PIN flag fields anymore, which wereTheo de Raadt
used during development (for visibility). There is speculation claudio will immediately use these bits for something else.
2024-08-06Stop using KERNEL_LOCK to protect the per process kqueue listClaudio Jeker
Instead of the KERNEL_LOCK use the ps_mtx for most operations. If the ps_klist is modified an additional global rwlock (kqueue_ps_list_lock) is required. This includes the knotes with NOTE_FORK and NOTE_EXIT since in either case a ps_klist is changed. In the NOTE_FORK | NOTE_TRACK case the call to kqueue_register() can sleep, which is why a global rwlock is used. Adjust the reaper() to call knote_processexit() without KERNEL_LOCK. Double lock idea from visa@ OK mvs@
2024-07-08Rework per proc and per process time usage accountingClaudio Jeker
For procs (threads) the accounting happens now lockless by curproc using a generation counter. Callers need to use tu_enter() and tu_leave() for this. To read the proc p_tu struct tuagg_get_proc() should be used. It ensures that the values read are consistent. For processes only the time of exited threads is accumulated in ps_tu and to get the proper process time usage tuagg_get_process() needs to be called. tuagg_get_process() will sum up all procs p_tu plus the ps_tu. This removes another SCHED_LOCK() dependency. Adjust the code in exit1() and exit2() to correctly account for the full run time. For this adjust sched_exit() to do the runtime accounting like it is done in mi_switch(). OK jca@ dlg@
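The generation-counter idea in the entry above can be illustrated in userland C11. This is only a sketch of the general technique (a seqlock-style counter that is odd while the owner is updating), not the kernel's implementation: the struct, the `tu_sketch_*` names, and the field layout are all hypothetical, and the real tu_enter(), tu_leave() and tuagg_get_proc() may differ in signature and detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-thread usage record; not the kernel's struct. */
struct tu_sketch {
	atomic_uint		gen;		/* odd while an update is in flight */
	atomic_uint_fast64_t	ticks;
	atomic_uint_fast64_t	runtime_ns;
};

struct tu_snapshot {
	uint64_t	ticks;
	uint64_t	runtime_ns;
};

static void
tu_sketch_init(struct tu_sketch *tu)
{
	atomic_init(&tu->gen, 0);
	atomic_init(&tu->ticks, 0);
	atomic_init(&tu->runtime_ns, 0);
}

/* Writer side: called only by the thread that owns the record. */
static void
tu_sketch_enter(struct tu_sketch *tu)
{
	unsigned int gen = atomic_load_explicit(&tu->gen, memory_order_relaxed);

	assert((gen & 1) == 0);		/* updates must not nest */
	atomic_store_explicit(&tu->gen, gen + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Update the counters; only valid between enter and leave. */
static void
tu_sketch_add(struct tu_sketch *tu, uint64_t ticks, uint64_t ns)
{
	uint64_t t = atomic_load_explicit(&tu->ticks, memory_order_relaxed);
	uint64_t r = atomic_load_explicit(&tu->runtime_ns, memory_order_relaxed);

	atomic_store_explicit(&tu->ticks, t + ticks, memory_order_relaxed);
	atomic_store_explicit(&tu->runtime_ns, r + ns, memory_order_relaxed);
}

/* Make the counter even again, publishing the update. */
static void
tu_sketch_leave(struct tu_sketch *tu)
{
	unsigned int gen = atomic_load_explicit(&tu->gen, memory_order_relaxed);

	atomic_store_explicit(&tu->gen, gen + 1, memory_order_release);
}

/* Reader side: retry until the counter is even and unchanged across the read. */
static struct tu_snapshot
tu_sketch_read(struct tu_sketch *tu)
{
	struct tu_snapshot snap;
	unsigned int before, after;

	do {
		before = atomic_load_explicit(&tu->gen, memory_order_acquire);
		snap.ticks = atomic_load_explicit(&tu->ticks, memory_order_relaxed);
		snap.runtime_ns = atomic_load_explicit(&tu->runtime_ns,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&tu->gen, memory_order_relaxed);
	} while (before != after || (before & 1));

	return snap;
}
```

The writer never blocks and readers never take a lock; a reader that races with an update simply retries, which is what makes the read "consistent" without a scheduler lock.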
2024-04-02Delete the msyscall mechanism entirely, since mimmutable+pinsyscalls hasTheo de Raadt
replaced it with a more strict mechanism, which happens to be lockless O(1) rather than micro-lock O(1)+O(log N). Also nop-out the sys_msyscall(2) guts, but leave the syscall around for a bit longer so that people can build through it, since ld.so(1) still wants to call it.
2024-01-17Since pinsyscalls(2) applies to all system calls and does a more preciseTheo de Raadt
check earlier, the pinsyscall(SYS_execve) mechanism has become redundant. It needs to be removed delicately since ld.so and static binaries use it. As a first step, neuter the checking code in sys_execve(). Further steps will follow slowly. ok kettenis
2024-01-16The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS inTheo de Raadt
the main program or ld.so, and accept a submission of that information for libc.so from ld.so via pinsyscalls(2). At system call invocation, the syscall number is matched to the specific address it must come from. ok kettenis, gnezdo, testing of variations by many people
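As a rough illustration of "the syscall number is matched to the specific address it must come from", the check can be pictured as a constant-time table lookup. Everything below is assumed for the sake of the example: the table layout, the `SKETCH_NSYSCALLS` size, and the `pin_check_sketch()` name are hypothetical and do not reflect the kernel's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSYSCALLS 8	/* hypothetical table size */

/* Hypothetical pin table: one permitted call-site offset per syscall number. */
struct pin_table_sketch {
	uintptr_t	base;				/* start of the text region */
	uint32_t	offset[SKETCH_NSYSCALLS];	/* 0 = syscall not permitted */
};

/*
 * Return 1 if syscall `code` issued from address `addr` matches the single
 * registered call site, 0 otherwise. The lookup is O(1) and reads only
 * immutable data, so no lock is needed in this sketch.
 */
static int
pin_check_sketch(const struct pin_table_sketch *pt, unsigned int code,
    uintptr_t addr)
{
	assert(pt != NULL);
	if (code >= SKETCH_NSYSCALLS)
		return 0;
	if (pt->offset[code] == 0)
		return 0;
	return addr == pt->base + pt->offset[code];
}
```

A caller would build one such table per text region (main program, ld.so, libc) and reject any system call whose call-site address does not match the entry for its number.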
2023-10-30Use ERESTART for any single_thread_set() error in sys_execve().Claudio Jeker
If single thread is already held by another thread just unwind to userret() wait there and retry the system call later (if at all). OK mpi@
2023-09-29Extend single_thread_set() mode with additional flag attributes.Claudio Jeker
The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter the behaviour of single_thread_set(). This allows explicit control of the SINGLE_DEEP behaviour. If SINGLE_DEEP is set the deep flag is passed to the initial check call and by that the check will error out instead of suspending (SINGLE_UNWIND) or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to single_thread_set() outside of userret. E.g. at the start of sys_execve because the proc is not allowed to call exit1() in that location. SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore returns BEFORE all threads have been parked. Currently this is only used by the ptrace code and should not be used anywhere else. Not waiting for all threads to settle is asking for trouble. This solves an issue by using SINGLE_UNWIND in the coredump case where the code should actually exit in case another thread crashed moments earlier. Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since the call to pledge_fail() is for sure not at the kernel boundary. OK mpi@
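The "mode or-ed with flag attributes" pattern described above can be sketched as a base mode in the low bits plus independent flag bits. The numeric values and the `SK_`/`sk_` names below are invented for illustration; the log does not show the kernel's real constants or how single_thread_set() decodes them.

```c
#include <assert.h>

/* Hypothetical values; the kernel's real constants are not shown in the log. */
#define SK_SINGLE_UNWIND	0x01	/* base mode: unwind to the boundary */
#define SK_SINGLE_EXIT		0x02	/* base mode: other threads exit */
#define SK_SINGLE_MASK		0x0f	/* bits that hold the base mode */
#define SK_SINGLE_DEEP		0x10	/* flag: caller is not at the boundary */
#define SK_SINGLE_NOWAIT	0x20	/* flag: do not wait for threads to park */

/* Extract the base mode from a combined mode-and-flags value. */
static int
sk_base_mode(int flags)
{
	assert((flags & SK_SINGLE_MASK) != 0);	/* a base mode is mandatory */
	return flags & SK_SINGLE_MASK;
}

/* True if the caller asked for the "deep" behaviour. */
static int
sk_is_deep(int flags)
{
	return (flags & SK_SINGLE_DEEP) != 0;
}

/* True unless the caller opted out of waiting for the other threads. */
static int
sk_must_wait(int flags)
{
	return (flags & SK_SINGLE_NOWAIT) == 0;
}
```

With this layout a call site at the start of an exec path would pass something like `SK_SINGLE_UNWIND | SK_SINGLE_DEEP`, and the callee can test each attribute independently of the base mode.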
2023-07-10Add PS_NOBTCFI, a per-process flag indicating that Branch TargetPhilip Guenther
Control Flow Integrity has been disabled for the process. At exec-time, set that flag iff EXEC_NOBTCFI is passed from the ELF exec bits (which set it based on presence of a PT_OPENBSD_NOBTCFI segment). This will be used by the amd64 code. kern_exec.c part by kettenis@ ok guenther@ deraadt@
2023-07-06remove during-development special cases for MNT_WXALLOWED and chrome andTheo de Raadt
IBT/BTI, because many more things are about to work correctly
2023-05-30spellingJonathan Gray
ok jmc@ guenther@ tb@
2023-04-24Abuse the wxallowed flag to decide whether we should enforce branch targetMark Kettenis
or not. The idea is that since /usr/local has wxallowed by default this will enable enforcement for base while leaving ports alone for now. This will help us transition to a state where ports are properly marked and allow us to establish that base is really clean. Also add an exception for chrome. Chrome already appears to be clean on arm64 and this exception can be easily modified for testing other ports. This will screw over people that deliberately disable wxallowed on /usr/local or who don't have a separate partition for /usr/local. We think that is an acceptable compromise for the next months. ok robert@, deraadt@ (who came up with the idea)
2023-02-21for process kills due to execve from non-pinned syscall address, exportTheo de Raadt
a new AEXECVE bit to acct(4), and print it in lastcomm(8) ok bluhm
2023-02-17Validate execve() libc stub location if kernel knows it. (due to ld.soTheo de Raadt
telling the kernel with pinsyscall(2))
2023-02-10Adjust knote(9) APIVisa Hankala
Make knote(9) lock the knote list internally, and add knote_locked(9) for the typical situation where the list is already locked. Remove the KNOTE(9) macro to simplify the API. Manual page OK jmc@ OK mpi@ mvs@
2023-01-13Since the signal trampoline is now execute-only we no longer write itMark Kettenis
into core dumps. As a result backtraces through signal handlers no longer work in gdb and other debuggers. Fix this by keeping a read-only mapping of the signal trampoline in the kernel and writing it into the core dump at the virtual address where it is mapped in the process. ok deraadt@, tb@
2023-01-07Add {get,set}thrname(2) for putting thread names in the kernel andPhilip Guenther
exposed in a new field returned by sysctl(KERN_PROC). Update pthread_{get,set}_name_np(3) to use the syscalls. Show them, when set, in ps -H and top -H output. libc and libpthread minor bumps ok mpi@, mvs@, deraadt@
2023-01-05after a few trap.c were fixed to fault with the right access, theTheo de Raadt
signal trampoline can now be PROT_EXEC (without PROT_READ) everywhere ok kettenis
2022-11-23cache ps_auxinfo inside the kernel, to avoid coredump() reading theMoritz Buhl
copy on userland stack which points at an illicit region. ok kettenis, deraadt
2022-11-17stack growth from setrlimit was never updated to set UVM_ET_STACK onTheo de Raadt
the entries, so the check-sp-at-system-call check failed. Quite strange it took this long to find this. ok kettenis
2022-10-30Simplify setregs() by passing it the ps_strings and switchingPhilip Guenther
sys_execve() to return EJUSTRETURN. setregs() is the MD routine used by sys_execve() to set up the thread's trapframe and PCB such that, on 'return' to userspace, it has the register values defined by the ABI and otherwise zero. It had to set the syscall retval[] values previously because the normal syscall return path overwrites a couple registers with the retval[] values. By instead returning EJUSTRETURN that and some complexity with program-counter handling on m88k and sparc64 goes away. Also, give setregs() a 'struct ps_strings *arginfo' argument so powerpc, powerpc64, and sh can directly get argc/argv/envp values for registers instead of copyin()ing the one in userspace. Improvements from miod@ and millert@ Testing assistance miod@, kettenis@, and aoyama@ ok miod@ kettenis@
2022-10-21the debug "name" parameter to uvm_map_immutable() is no longer neededTheo de Raadt
2022-10-21sigaltstack() was adapted to work on mimmutable regions (an unfortunateTheo de Raadt
compromise...), but it means the stack can be marked immutable again. ok kettenis
2022-10-12The sigaltstack() MAP_STACK re-map mechanism is incompatible with immutableTheo de Raadt
regions, so immutable stack isn't viable yet. There are configure programs which create sigstacks upon their own stacks, and there is no simple fix for the sigaltstack mechanism... discovered by sthen and tb
2022-10-08The stack can also be marked immutable, because we expect no sane programTheo de Raadt
to try to change the permissions of it. We won't know who's trying that until we enable it and see what breaks. A tricky piece relating to setrlimit stack size changing was previously committed. ok kettenis
2022-10-08The signal trampoline and timekeep regions can be marked immutable atTheo de Raadt
execve() time ok kettenis
2022-10-07Add mimmutable(2) system call which locks the permissions (PROT_*) ofTheo de Raadt
memory mappings so they cannot be changed by a later mmap(), mprotect(), or munmap(), which will error with EPERM instead. ok kettenis
2022-08-14remove unneeded includes in sys/kernJonathan Gray
ok mpi@ miod@
2022-02-22Start using new _MAXCOMLEN (a proper string expanded to 24 bytesTheo de Raadt
including the NUL), in all internal interfaces, and expose this in ktrace, core, or proc.h visibility. ok millert
2022-02-07Delete STACKGAPLEN: this exec-time allocation at the top of thePhilip Guenther
original thread's stack hasn't been used since 2015. ok miod@ deraadt@
2021-12-09We only have one syscall table: inline sysent/SYS_MAXSYSCALL andPhilip Guenther
SYS_syscall as the nosys() function into the MD syscall entry routines and the SYSCALL_DEBUG support. Adjust alpha's syscall check to match the other archs. Also, make sysent const to get it into .rodata. With that, 'struct emul' is unused: delete it and all its references ok millert@
2021-12-07Delete the last emulation callbacks: we're Just ELF, so declarePhilip Guenther
exec_elf_fixup() and coredump_elf() in <sys/exec_elf.h> and call them and the MD setregs() directly in kern_exec.c and kern_sig.c Also delete e_name[] (only used by sysctl), e_errno (unused), and e_syscallnames[] (only used by SYSCALL_DEBUG) and constipate syscallnames to 'const char *const[]' ok kettenis@
2021-12-07Continue to delete emulation support: we only have one sigcode andPhilip Guenther
sigobject. Just use the existing globals for the former and use a global for the latter. ok jsg@ kettenis@
2021-12-07Continue to delete emulation support: since we're Just ELF, the sizePhilip Guenther
of the auxinfo is fixed: provide ELF_AUX_WORDS in <sys/exec_elf.h> as a replacement for emul->e_arglen ok millert@
2021-12-06Start to delete emulation support: since we're Just ELF, makePhilip Guenther
copyargs() return 0/1 and merge elf_copyargs() into it. Rename ep_emul_arg and ep_emul_argp to have clearer meaning and type and eliminate ep_emul_argsize as no longer necessary. Make sure ep_auxinfo (nee ep_emul_argp) is initialized as powerpc64 always uses it in setregs(). ok semarie@ deraadt@ kettenis@
2021-03-16handle theoretical case of sigfillsz not being pow2-sized on someTheo de Raadt
architecture. from miod
2021-03-12Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semanticMartin Pieuchot
single_thread_set() is modified to explicitly indicate when waiting until sibling threads are parked is required. This is obviously not required if a traced thread is switching away from a CPU after handling a STOP signal. ok claudio@
2021-03-08Revert commitid: AZrsCSWEYDm7XWuv;Claudio Jeker
Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic. This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.
2021-03-08Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.Martin Pieuchot
single_thread_set() is modified to explicitly indicate when waiting until sibling threads are parked is required. This is obviously not required if a traced thread is switching away from a CPU after handling a STOP signal. ok claudio@
2020-10-15_exit(2), execve(2): tweak per-process interval timer cancellationcheloha
If we fold the for-loop iterating over each interval timer into the helper function the result is slightly tidier than what we have now. Rename the helper function "cancel_all_itimers". Based on input from millert@ and kettenis@.
2020-10-15_exit(2), execve(2): cancel per-process interval timers safelycheloha
During _exit(2) and sometimes during execve(2) we need to cancel any active per-process interval timers. We don't currently do this in an MP-safe way. Both syscalls ignore the locking assumptions documented in proc.h. The easiest way to make them MP-safe is to use setitimer(), just like the getitimer(2) and setitimer(2) syscalls do. To make things a bit cleaner I have added a helper function, cancelitimer(), so the callers don't need to fuss with an itimerval struct. While we're here we can remove the splclock/splx dance from execve(2). It is no longer necessary. ok deraadt@
2020-07-11timekeep_sz now already includes the round_page() adjustment; ok kettenis@Christian Weisgerber
2020-07-07small typoTheo de Raadt
2020-07-06Wire down the timekeep page. If we don't do this, the pagedaemon mayMark Kettenis
page it out and bad things will happen when we try to page it back in from within the clock interrupt handler. While there, make sure we set timekeep_object back to NULL if we fail to map the timekeep page into kernel space. ok deraadt@ (who had a very similar diff)
2020-07-06Add support for timecounting in userland.Paul Irofti
This diff exposes parts of clock_gettime(2) and gettimeofday(2) to userland via libc, liberating processes from the need for a context switch every time they want to count the passage of time. If a timecounter clock can be exposed to userland then it needs to set its tc_user member to a non-zero value. Tested with one or multiple counters per architecture. The timing data is shared through a pointer found in the new ELF auxiliary vector AUX_openbsd_timekeep containing timehands information that is frequently updated by the kernel. Timing differences between the last kernel update and the current time are adjusted in userland by the tc_get_timecount() function inside the MD usertc.c file. This permits a much more responsive environment, quite visible in browsers, office programs and gaming (apparently one is able to fly in Minecraft now). Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others! OK from at least kettenis@, cheloha@, naddy@, sthen@
2020-02-15Consistently perform atomic writes to the ps_flags field of structanton
process. ok bluhm@ claudio@ visa@
2019-12-11Replace p_xstat with ps_xexit and ps_xsigPhilip Guenther
Convert those to a consolidated status when needed in wait4(), kevent(), and sysctl() Pass exit code and signal separately to exit1() (This also serves as prep for adding waitid(2)) ok mpi@
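To make the "consolidated status" idea concrete, here is a small sketch that keeps the exit code and the terminating signal as separate fields and packs them only when a wait-style status is needed. It assumes the traditional encoding (exit code in bits 8 to 15, signal number in the low 7 bits) and ignores the core-dump bit; the struct and function names are hypothetical, and the kernel's own conversion is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/wait.h>

/* Hypothetical pair mirroring the idea of separate exit-code and signal fields. */
struct exit_info_sketch {
	int	xexit;	/* exit code passed to _exit(2), if the process exited */
	int	xsig;	/* terminating signal, or 0 if the process exited */
};

/*
 * Pack the two fields into one wait(2)-style status word, assuming the
 * traditional layout: exit code in bits 8-15, signal in the low 7 bits.
 */
static int
consolidate_status_sketch(const struct exit_info_sketch *ei)
{
	assert(ei != NULL);
	return ((ei->xexit & 0xff) << 8) | (ei->xsig & 0x7f);
}
```

Keeping the fields apart means interfaces that want the exit code and the signal separately can read them directly, while older interfaces that expect a single status word get it built on demand; the standard WIFEXITED(), WEXITSTATUS(), WIFSIGNALED() and WTERMSIG() macros decode the packed value.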
2019-12-01comply with POSIX and make execve() return EACCES for directoriesChristian Weisgerber
ok millert@ deraadt@
2019-11-29Repurpose the "syscalls must be on a writeable page" mechanism toTheo de Raadt
enforce a new policy: system calls must be in pre-registered regions. We have discussed more strict checks than this, but none satisfy the cost/benefit based upon our understanding of attack methods, anyways let's see what the next iteration looks like. This is intended to harden (translation: attackers must put extra effort into attacking) against a mixture of W^X failures and JIT bugs which allow syscall misinterpretation, especially in environments with polymorphic-instruction/variable-sized instructions. It fits in a bit with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash behaviour, particularly for remote problems. Less effective once on-host since the libraries can be read. For static-executables the kernel registers the main program's PIE-mapped exec section valid, as well as the randomly-placed sigtramp page. For dynamic executables ELF ld.so's exec segment is also labelled valid; ld.so then has enough information to register libc's exec section as valid via call-once msyscall(2). For dynamic binaries, we continue to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon. We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many of static-syscall-in-base-binary which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface. This was started at a nano-hackathon in Bob Beck's basement 2 weeks ago during a long discussion with mortimer trying to hide from the SSL scream-conversations, and finished in more comfortable circumstances next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.
ok guenther kettenis mortimer, lots of feedback from others conversations about go with jsing tb sthen