src - OpenBSD base system

Age	Commit message	Author
2024-09-30	Remove code after exit1() and NOTREACHED comment. Nothing will ever get there.	Claudio Jeker
	OK mpi@
2024-08-21	We do not need the PS_LIBCPIN and PS_PIN flag fields anymore, which were	Theo de Raadt
	used during development (for visibility). There is speculation claudio will immediately use these bits for something else.
2024-08-06	Stop using KERNEL_LOCK to protect the per process kqueue list	Claudio Jeker
	Instead of the KERNEL_LOCK use the ps_mtx for most operations. If the ps_klist is modified an additional global rwlock (kqueue_ps_list_lock) is required. This includes the knotes with NOTE_FORK and NOTE_EXIT since in either case a ps_klist is changed. In the NOTE_FORK | NOTE_TRACK case the call to kqueue_register() can sleep, which is why a global rwlock is used. Adjust the reaper() to call knote_processexit() without KERNEL_LOCK. Double lock idea from visa@ OK mvs@
2024-07-08	Rework per proc and per process time usage accounting	Claudio Jeker
	For procs (threads) the accounting now happens lockless by curproc using a generation counter. Callers need to use tu_enter() and tu_leave() for this. To read the proc p_tu struct tuagg_get_proc() should be used. It ensures that the values read are consistent. For processes only the time of exited threads is accumulated in ps_tu and to get the proper process time usage tuagg_get_process() needs to be called. tuagg_get_process() will sum up all procs p_tu plus the ps_tu. This removes another SCHED_LOCK() dependency. Adjust the code in exit1() and exit2() to correctly account for the full run time. For this adjust sched_exit() to do the runtime accounting like it is done in mi_switch(). OK jca@ dlg@
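The generation-counter scheme described in this commit can be illustrated with a small user-space sketch. This is not the kernel code: the struct and the sketch_* function names are hypothetical stand-ins for p_tu, tu_enter(), tu_leave() and tuagg_get_proc(), and it assumes a single writer (the owning thread) and uses sequentially consistent C11 atomics for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-thread usage struct. */
struct tusage_sketch {
	atomic_uint_fast64_t gen;	/* odd while an update is in progress */
	atomic_uint_fast64_t uticks;	/* user-mode ticks */
	atomic_uint_fast64_t sticks;	/* system-mode ticks */
};

/* Plain copy handed back to readers. */
struct tusage_snapshot {
	uint64_t uticks;
	uint64_t sticks;
};

/* Writer side: mark the start of an update (generation becomes odd). */
static void
sketch_enter(struct tusage_sketch *tu)
{
	atomic_fetch_add(&tu->gen, 1);
}

/* Writer side: the update itself, only valid between enter and leave. */
static void
sketch_add(struct tusage_sketch *tu, uint64_t uticks, uint64_t sticks)
{
	atomic_fetch_add(&tu->uticks, uticks);
	atomic_fetch_add(&tu->sticks, sticks);
}

/* Writer side: mark the end of an update (generation becomes even). */
static void
sketch_leave(struct tusage_sketch *tu)
{
	uint_fast64_t prev = atomic_fetch_add(&tu->gen, 1);

	assert((prev & 1) == 1);	/* must pair with sketch_enter() */
	(void)prev;
}

/*
 * Reader side: retry until the generation is even and unchanged across
 * the reads, so the returned fields belong to the same update.
 */
static struct tusage_snapshot
sketch_read(struct tusage_sketch *tu)
{
	struct tusage_snapshot s;
	uint_fast64_t g1, g2;

	do {
		g1 = atomic_load(&tu->gen);
		s.uticks = atomic_load(&tu->uticks);
		s.sticks = atomic_load(&tu->sticks);
		g2 = atomic_load(&tu->gen);
	} while ((g1 & 1) != 0 || g1 != g2);

	return s;
}
```

A reader never blocks the writer in this sketch; it only repeats its reads if an update overlapped them, which is the property that lets the accounting avoid a lock.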
2024-04-02	Delete the msyscall mechanism entirely, since mimmutable+pinsyscalls has	Theo de Raadt
	replaced it with a more strict mechanism, which happens to be lockless O(1) rather than micro-lock O(1)+O(log N). Also nop-out the sys_msyscall(2) guts, but leave the syscall around for a bit longer so that people can build through it, since ld.so(1) still wants to call it.
2024-01-17	Since pinsyscalls(2) applies to all system calls and does a more precise	Theo de Raadt
	check earlier, the pinsyscall(SYS_execve) mechanism has become redundant. It needs to be removed delicately since ld.so and static binaries use it. As a first step, neuter the checking code in sys_execve(). Further steps will follow slowly. ok kettenis
2024-01-16	The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in	Theo de Raadt
	the main program or ld.so, and accept a submission of that information for libc.so from ld.so via pinsyscalls(2). At system call invocation, the syscall number is matched to the specific address it must come from. ok kettenis, gnezdo, testing of variations by many people
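The check described here, matching a syscall number to the one address it may be invoked from, amounts to a constant-time table lookup. The following is a rough sketch of that idea only: the struct layout, the table size and the pin_check() name are assumptions made for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSYSCALLS 8	/* hypothetical table size */

/* Hypothetical per-region pin table: one permitted offset per syscall. */
struct pin_table_sketch {
	uintptr_t base;				/* start of the text region */
	uint32_t offset[SKETCH_NSYSCALLS];	/* 0 means "not registered" */
};

/*
 * Return 1 if syscall number `code` was issued from exactly the address
 * registered for it, 0 otherwise. The lookup is a single array index.
 */
static int
pin_check(const struct pin_table_sketch *t, unsigned int code,
    uintptr_t caller)
{
	assert(t != NULL);

	if (code >= SKETCH_NSYSCALLS)
		return 0;
	if (t->offset[code] == 0)
		return 0;
	return caller == t->base + t->offset[code];
}
```

In this sketch a syscall that is absent from the table is rejected outright, and a registered one is accepted from a single address only.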
2023-10-30	Use ERESTART for any single_thread_set() error in sys_execve().	Claudio Jeker
	If single thread is already held by another thread, just unwind to userret(), wait there and retry the system call later (if at all). OK mpi@
2023-09-29	Extend single_thread_set() mode with additional flag attributes.	Claudio Jeker
	The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter the behaviour of single_thread_set(). This allows explicit control of the SINGLE_DEEP behaviour. If SINGLE_DEEP is set the deep flag is passed to the initial check call and by that the check will error out instead of suspending (SINGLE_UNWIND) or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to single_thread_set() outside of userret. E.g. at the start of sys_execve because the proc is not allowed to call exit1() in that location. SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore returns BEFORE all threads have been parked. Currently this is only used by the ptrace code and should not be used anywhere else. Not waiting for all threads to settle is asking for trouble. This solves an issue by using SINGLE_UNWIND in the coredump case where the code should actually exit in case another thread crashed moments earlier. Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since the call to pledge_fail() is for sure not at the kernel boundary. OK mpi@
2023-07-10	Add PS_NOBTCFI, a per-process flag indicating that Branch Target	Philip Guenther
	Control Flow Integrity has been disabled for the process. At exec-time, set that flag iff EXEC_NOBTCFI is passed from the ELF exec bits (which set it based on presence of a PT_OPENBSD_NOBTCFI segment). This will be used by the amd64 code. kern_exec.c part by kettenis@ ok guenther@ deraadt@
2023-07-06	remove during-development special cases for MNT_WXALLOWED and chrome and	Theo de Raadt
	IBT/BTI, because many more things are about to work correctly
2023-05-30	spelling	Jonathan Gray
	ok jmc@ guenther@ tb@
2023-04-24	Abuse the wxallowed flag to decide whether we should enforce branch target	Mark Kettenis
	or not. The idea is that since /usr/local has wxallowed by default this will enable enforcement for base while leaving ports alone for now. This will help us transition to a state where ports are properly marked and allow us to establish that base is really clean. Also add an exception for chrome. Chrome already appears to be clean on arm64 and this exception can be easily modified for testing other ports. This will screw over people that deliberately disable wxallowed on /usr/local or who don't have a separate partition for /usr/local. We think that is an acceptable compromise for the next months. ok robert@, deraadt@ (who came up with the idea)
2023-02-21	for process kills due to execve from non-pinned syscall address, export	Theo de Raadt
	a new AEXECVE bit to acct(4), and print it in lastcomm(8) ok bluhm
2023-02-17	Validate execve() libc stub location if kernel knows it. (due to ld.so	Theo de Raadt
	telling the kernel with pinsyscall(2))
2023-02-10	Adjust knote(9) API	Visa Hankala
	Make knote(9) lock the knote list internally, and add knote_locked(9) for the typical situation where the list is already locked. Remove the KNOTE(9) macro to simplify the API. Manual page OK jmc@ OK mpi@ mvs@
2023-01-13	Since the signal trampoline is now execute-only we no longer write it	Mark Kettenis
	into core dumps. As a result backtraces through signal handlers no longer work in gdb and other debuggers. Fix this by keeping a read-only mapping of the signal trampoline in the kernel and writing it into the core dump at the virtual address where it is mapped in the process. ok deraadt@, tb@
2023-01-07	Add {get,set}thrname(2) for putting thread names in the kernel and	Philip Guenther
	exposed in a new field returned by sysctl(KERN_PROC). Update pthread_{get,set}_name_np(3) to use the syscalls. Show them, when set, in ps -H and top -H output. libc and libpthread minor bumps ok mpi@, mvs@, deraadt@
2023-01-05	after a few trap.c were fixed to fault with the right access, the	Theo de Raadt
	signal trampoline can now be PROT_EXEC (without PROT_READ) everywhere ok kettenis
2022-11-23	cache ps_auxinfo inside the kernel, to avoid coredump() reading the	Moritz Buhl
	copy on userland stack which points at an illicit region. ok kettenis, deraadt
2022-11-17	stack growth from setrlimit was never updated to set UVM_ET_STACK on	Theo de Raadt
	the entries, so the check-sp-at-system-call check failed. Quite strange it took this long to find this. ok kettenis
2022-10-30	Simplify setregs() by passing it the ps_strings and switching	Philip Guenther
	sys_execve() to return EJUSTRETURN. setregs() is the MD routine used by sys_execve() to set up the thread's trapframe and PCB such that, on 'return' to userspace, it has the register values defined by the ABI and otherwise zero. It had to set the syscall retval[] values previously because the normal syscall return path overwrites a couple registers with the retval[] values. By instead returning EJUSTRETURN that and some complexity with program-counter handling on m88k and sparc64 goes away. Also, give setregs() a 'struct ps_strings *arginfo' argument so powerpc, powerpc64, and sh can directly get argc/argv/envp values for registers instead of copyin()ing the one in userspace. Improvements from miod@ and millert@ Testing assistance miod@, kettenis@, and aoyama@ ok miod@ kettenis@
2022-10-21	the debug "name" parameter to uvm_map_immutable() is no longer needed	Theo de Raadt

2022-10-21	sigaltstack() was adapted to work on mimmutable regions (an unfortunate	Theo de Raadt
	compromise...), but it means the stack can be marked immutable again. ok kettenis
2022-10-12	The sigaltstack() MAP_STACK re-map mechanism is incompatible with immutable	Theo de Raadt
	regions, so immutable stack isn't viable yet. There are configure programs which create sigstacks upon their own stacks, and there is no simple fix for the sigaltstack mechanism... discovered by sthen and tb
2022-10-08	The stack can also be marked immutable, because we expect no sane program	Theo de Raadt
	to try to change the permissions of it. We won't know who's trying that until we enable it and see what breaks. A tricky piece relating to setrlimit stack size changing was previously committed. ok kettenis
2022-10-08	The signal trampoline and timekeep regions can be marked immutable at	Theo de Raadt
	execve() time ok kettenis
2022-10-07	Add mimmutable(2) system call which locks the permissions (PROT_*) of	Theo de Raadt
	memory mappings so they cannot be changed by a later mmap(), mprotect(), or munmap(), which will error with EPERM instead. ok kettenis
2022-08-14	remove unneeded includes in sys/kern	Jonathan Gray
	ok mpi@ miod@
2022-02-22	Start using new _MAXCOMLEN (a proper string expanded to 24 bytes	Theo de Raadt
	including the NUL), in all internal interfaces, and expose this in ktrace, core, or proc.h visibility. ok millert
2022-02-07	Delete STACKGAPLEN: this exec-time allocation at the top of the	Philip Guenther
	original thread's stack hasn't been used since 2015. ok miod@ deraadt@
2021-12-09	We only have one syscall table: inline sysent/SYS_MAXSYSCALL and	Philip Guenther
	SYS_syscall as the nosys() function into the MD syscall entry routines and the SYSCALL_DEBUG support. Adjust alpha's syscall check to match the other archs. Also, make sysent const to get it into .rodata. With that, 'struct emul' is unused: delete it and all its references ok millert@
2021-12-07	Delete the last emulation callbacks: we're Just ELF, so declare	Philip Guenther
	exec_elf_fixup() and coredump_elf() in <sys/exec_elf.h> and call them and the MD setregs() directly in kern_exec.c and kern_sig.c Also delete e_name[] (only used by sysctl), e_errno (unused), and e_syscallnames[] (only used by SYSCALL_DEBUG) and constipate syscallnames to 'const char *const[]' ok kettenis@
2021-12-07	Continue to delete emulation support: we only have one sigcode and	Philip Guenther
	sigobject. Just use the existing globals for the former and use a global for the latter. ok jsg@ kettenis@
2021-12-07	Continue to delete emulation support: since we're Just ELF, the size	Philip Guenther
	of the auxinfo is fixed: provide ELF_AUX_WORDS in <sys/exec_elf.h> as a replacement for emul->e_arglen ok millert@
2021-12-06	Start to delete emulation support: since we're Just ELF, make	Philip Guenther
	copyargs() return 0/1 and merge elf_copyargs() into it. Rename ep_emul_arg and ep_emul_argp to have clearer meaning and type and eliminate ep_emul_argsize as no longer necessary. Make sure ep_auxinfo (nee ep_emul_argp) is initialized as powerpc64 always uses it in setregs(). ok semarie@ deraadt@ kettenis@
2021-03-16	handle theoretical case of sigfillsz not being pow2-sized on some	Theo de Raadt
	architecture. from miod
2021-03-12	Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic	Martin Pieuchot
	single_thread_set() is modified to explicitly indicate when waiting until sibling threads are parked is required. This is obviously not required if a traced thread is switching away from a CPU after handling a STOP signal. ok claudio@
2021-03-08	Revert commitid: AZrsCSWEYDm7XWuv;	Claudio Jeker
	Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic. This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.
2021-03-08	Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.	Martin Pieuchot
	single_thread_set() is modified to explicitly indicate when waiting until sibling threads are parked is required. This is obviously not required if a traced thread is switching away from a CPU after handling a STOP signal. ok claudio@
2020-10-15	_exit(2), execve(2): tweak per-process interval timer cancellation	cheloha
	If we fold the for-loop iterating over each interval timer into the helper function the result is slightly tidier than what we have now. Rename the helper function "cancel_all_itimers". Based on input from millert@ and kettenis@.
2020-10-15	_exit(2), execve(2): cancel per-process interval timers safely	cheloha
	During _exit(2) and sometimes during execve(2) we need to cancel any active per-process interval timers. We don't currently do this in an MP-safe way. Both syscalls ignore the locking assumptions documented in proc.h. The easiest way to make them MP-safe is to use setitimer(), just like the getitimer(2) and setitimer(2) syscalls do. To make things a bit cleaner I have added a helper function, cancelitimer(), so the callers don't need to fuss with an itimerval struct. While we're here we can remove the splclock/splx dance from execve(2). It is no longer necessary. ok deraadt@
2020-07-11	timekeep_sz now already includes the round_page() adjustment; ok kettenis@	Christian Weisgerber

2020-07-07	small typo	Theo de Raadt

2020-07-06	Wire down the timekeep page. If we don't do this, the pagedaemon may	Mark Kettenis
	page it out and bad things will happen when we try to page it back in from within the clock interrupt handler. While there, make sure we set timekeep_object back to NULL if we fail to map the timekeep page into kernel space. ok deraadt@ (who had a very similar diff)
2020-07-06	Add support for timecounting in userland.	Paul Irofti
	This diff exposes parts of clock_gettime(2) and gettimeofday(2) to userland via libc, liberating processes from the need for a context switch every time they want to count the passage of time. If a timecounter clock can be exposed to userland then it needs to set its tc_user member to a non-zero value. Tested with one or multiple counters per architecture. The timing data is shared through a pointer found in the new ELF auxiliary vector AUX_openbsd_timekeep containing timehands information that is frequently updated by the kernel. Timing differences between the last kernel update and the current time are adjusted in userland by the tc_get_timecount() function inside the MD usertc.c file. This permits a much more responsive environment, quite visible in browsers, office programs and gaming (apparently one is able to fly in Minecraft now). Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others! OK from at least kettenis@, cheloha@, naddy@, sthen@
2020-02-15	Consistently perform atomic writes to the ps_flags field of struct	anton
	process. ok bluhm@ claudio@ visa@
2019-12-11	Replace p_xstat with ps_xexit and ps_xsig	Philip Guenther
	Convert those to a consolidated status when needed in wait4(), kevent(), and sysctl() Pass exit code and signal separately to exit1() (This also serves as prep for adding waitid(2)) ok mpi@
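The conversion to a consolidated status can be sketched as follows. This assumes the traditional BSD wait-status encoding (exit code in bits 8 to 15, terminating signal in the low 7 bits); consolidate_status() is a hypothetical helper name, not a function from the kernel source.

```c
#include <assert.h>
#include <sys/wait.h>

/*
 * Combine a separately stored exit code and signal number into one
 * wait(2)-style status word. Assumes the traditional encoding:
 * exit code in bits 8-15, terminating signal in bits 0-6.
 */
static int
consolidate_status(int xexit, int xsig)
{
	assert(xsig >= 0 && xsig < 128);

	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

The resulting value can be decoded with the standard WIFEXITED(), WEXITSTATUS(), WIFSIGNALED() and WTERMSIG() macros, which is why keeping the two fields separate internally does not change what wait4() callers see.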
2019-12-01	comply with POSIX and make execve() return EACCES for directories	Christian Weisgerber
	ok millert@ deraadt@
2019-11-29	Repurpose the "syscalls must be on a writeable page" mechanism to	Theo de Raadt
	enforce a new policy: system calls must be in pre-registered regions. We have discussed more strict checks than this, but none satisfy the cost/benefit based upon our understanding of attack methods, anyways let's see what the next iteration looks like. This is intended to harden (translation: attackers must put extra effort into attacking) against a mixture of W^X failures and JIT bugs which allow syscall misinterpretation, especially in environments with polymorphic-instruction/variable-sized instructions. It fits in a bit with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash behaviour, particularly for remote problems. Less effective once on-host since the libraries can then be read. For static-executables the kernel registers the main program's PIE-mapped exec section valid, as well as the randomly-placed sigtramp page. For dynamic executables ELF ld.so's exec segment is also labelled valid; ld.so then has enough information to register libc's exec section as valid via call-once msyscall(2). For dynamic binaries, we continue to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon. We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many of static-syscall-in-base-binary which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface. This was started at a nano-hackathon in Bob Beck's basement 2 weeks ago during a long discussion with mortimer trying to hide from the SSL scream-conversations, and finished in more comfortable circumstances next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.
ok guenther kettenis mortimer, lots of feedback from others conversations about go with jsing tb sthen