src - OpenBSD base system

Age	Commit message	Author
2022-11-10	Add mbr_get_fstype() and use it to translate MBR dp_typ fields	Kenneth R Westerback
	into FS_* values. Similar to what gpt_get_fstype() does. Code is clearer and better positioned for planned enhancements to spoofing. No intentional functional change.
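A minimal sketch of the kind of lookup this commit describes: translating an MBR partition type byte into a filesystem type. The enum values, table contents, and the name toy_mbr_get_fstype() are hypothetical and chosen for illustration; they are not the kernel's FS_* constants or its actual mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's FS_* values. */
enum toy_fstype {
	TOY_FS_UNUSED,
	TOY_FS_OTHER,
	TOY_FS_BSDFFS,
	TOY_FS_MSDOS,
	TOY_FS_EXT2FS,
	TOY_FS_NTFS
};

struct toy_mbr_map {
	uint8_t		dp_typ;		/* MBR partition type byte */
	enum toy_fstype	fstype;		/* illustrative translation */
};

/* Illustrative table only; the real mapping may differ. */
static const struct toy_mbr_map toy_mbr_table[] = {
	{ 0x00, TOY_FS_UNUSED },
	{ 0x07, TOY_FS_NTFS },
	{ 0x0B, TOY_FS_MSDOS },
	{ 0x0C, TOY_FS_MSDOS },
	{ 0x83, TOY_FS_EXT2FS },
	{ 0xA6, TOY_FS_BSDFFS },
};

/* Return the mapped type, or TOY_FS_OTHER for unknown type bytes. */
enum toy_fstype
toy_mbr_get_fstype(uint8_t dp_typ)
{
	size_t i;

	for (i = 0; i < sizeof(toy_mbr_table) / sizeof(toy_mbr_table[0]); i++) {
		if (toy_mbr_table[i].dp_typ == dp_typ)
			return toy_mbr_table[i].fstype;
	}
	return TOY_FS_OTHER;
}
```

Keeping the translation in one table-driven function means callers no longer need their own switch on dp_typ, which is the clarity benefit the commit message points to.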
2022-11-10	Put CPUs in the lowest P-state just before the final suspend step. The	Mark Kettenis
	firmware probably does this for us on ACPI systems with proper S3 support, but this doesn't happen on systems where we park CPUs in a low-power idle state ourselves. ok deraadt@
2022-11-10	Add support for per-cpu event counters, to be used for clock and IPI	Jonathan Matthew
	counters where the event being counted occurs across all CPUs in the system. Counter instances can be made per-cpu by calling evcount_percpu() after the counter is attached, and this can occur before or after all system CPUs are attached. Per-cpu counter instances should be incremented using evcount_inc(). ok kettenis@ jca@ cheloha@
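The per-cpu counter idea can be sketched in userland as follows. This is a toy model, not the kernel's evcount code: the struct layout, the fixed TOY_NCPU, and every toy_* name are assumptions, and the real implementation also has to deal with CPUs attaching later and with concurrent access.

```c
#include <stdint.h>

#define TOY_NCPU 4	/* assumption: fixed CPU count for this sketch */

struct toy_evcount {
	uint64_t ec_count;		/* shared count, used before conversion */
	uint64_t ec_percpu[TOY_NCPU];	/* one slot per CPU after conversion */
	int	 ec_is_percpu;
};

/* Convert an attached counter to per-cpu mode, keeping the old total. */
void
toy_evcount_percpu(struct toy_evcount *ec)
{
	ec->ec_percpu[0] += ec->ec_count;
	ec->ec_count = 0;
	ec->ec_is_percpu = 1;
}

/* Increment on behalf of the given CPU; works in either mode. */
void
toy_evcount_inc(struct toy_evcount *ec, int cpu)
{
	if (ec->ec_is_percpu)
		ec->ec_percpu[cpu]++;
	else
		ec->ec_count++;
}

/* Readers see a single total summed across all CPU slots. */
uint64_t
toy_evcount_read(const struct toy_evcount *ec)
{
	uint64_t total = ec->ec_count;
	int i;

	for (i = 0; i < TOY_NCPU; i++)
		total += ec->ec_percpu[i];
	return total;
}
```

Each CPU touches only its own slot on the hot path, so clock and IPI interrupts do not contend on one shared word; the cost moves to the comparatively rare read.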
2022-11-10	fix build after 1.298	Jonathan Gray

2022-11-09	Remove kernel lock here since msleep() with PCATCH no longer requires it.	Claudio Jeker
	OK mpi@
2022-11-09	Some limited setsockopt/getsockopt are allowed in pledge "stdio".	Theo de Raadt
	Also allow IPPROTO_TCP:TCP_NODELAY. It is very small kernel code, and will allow some software to drop "inet". Requested by djm.
2022-11-09	Simplify the overly complex VXLOCK handling in spec_close.	Claudio Jeker
	The code only needs to know if the vnode is exclusive locked and this can be done on entry of the function. OK mpi@
2022-11-09	timeout(9): remove TIMEOUT_KCLOCK flag	Scott Soule Cheloha
	I never should have added the TIMEOUT_KCLOCK flag. It is redundant and only serves to complicate the timeout(9) logic. In every place where we check for the flag we can just use timeout.to_kclock. So, remove the flag from <sys/timeout.h> and rewrite all affected logic to use the value of timeout.to_kclock instead. ok kn@
2022-11-09	regen	Martin Pieuchot

2022-11-09	gpt_get_fstype() doesn't modify its parameter so make said	Kenneth R Westerback
	parameter const.
2022-11-09	Mark sched_yield(2) as NOLOCK.	Martin Pieuchot
	All the fields accessed in this syscall are protected by the SCHED_LOCK() so it isn't necessary to wait for another CPU to release the KERNEL_LOCK() before that. ok claudio@
2022-11-08	allow the KERN_AUTOCONF_SERIAL sysctl in pledge'd processes	Robert Nagy
	ok deraadt@
2022-11-08	timeout(9): remove unused, undocumented timeout_in_nsec() interface	Scott Soule Cheloha
	The kernel is not quite ready for timeout_in_nsec(). Remove it and kclock_nanotime(). Both are unused. Prompted by jsg@. ok kn@
2022-11-08	tc_setclock: don't print a warning if tc_windup() rejects inittodr(9) time	Scott Soule Cheloha
	During resume, it isn't necessarily a problem if the UTC time we get from inittodr(9) lags behind the system UTC clock. In particular, if the active timecounter's frequency is low enough, tc_delta() might not overflow across a brief suspend. Remove the misleading warning message. The code is behaving as intended, just not in a way I anticipated when I added the warning message a few years ago. Discovered by kettenis@. Root cause isolated with kettenis@. Link: https://marc.info/?l=openbsd-tech&m=166790845619897&w=2 ok mlarkin@ kettenis@
2022-11-08	Push kernel lock down into ifioctl()	Klemens Nanni
	This is a mechanical diff without semantical changes, locking ioctls individually inside ifioctl() rather than all of them around it. This allows us to unlock ioctls one by one. OK mpi
2022-11-08	Regen	Martin Pieuchot

2022-11-08	Mark mmap(2), munmap(2) and mprotect(2) as NOLOCK.	Martin Pieuchot
	Accesses to data structures used by these syscalls are serialized by the VM map lock with the exception of file mappings which are still protected by the KERNEL_LOCK(). Unlocking this set of syscalls improves most of userland workloads. Tested by many including robert@ (since 2 years), mlarkin@, kn@, sdk@, jca@, aoyama@, naddy@, Scott Bennett and others. Thanks to all! Joint work with kn@. ok robert@, aja@, kettenis@, kn@, deraadt@, beck@
2022-11-07	introduce a new kern.autoconf_serial sysctl that can be used by userland	Robert Nagy
	to monitor state changes of the kernel device tree. input from dnd, ok dlg@, deraadt@
2022-11-07	Nuke last references to d_drivedata.	Kenneth R Westerback

2022-11-05	clockintr(9): initial commit	Scott Soule Cheloha
	clockintr(9) is a machine-independent clock interrupt scheduler. It emulates most of what the machine-dependent clock interrupt code is doing on every platform. Every CPU has a work schedule based on the system uptime clock. For now, every CPU has a hardclock(9) and a statclock(). If schedhz is set, every CPU has a schedclock(), too. This commit only contains the MI pieces. All code is conditionally compiled with __HAVE_CLOCKINTR. This commit changes no behavior yet.
	At a high level, clockintr(9) is configured and used as follows:
	1. During boot, the primary CPU calls clockintr_init(9). Global state is initialized.
	2. Primary CPU calls clockintr_cpu_init(9). Local, per-CPU state is initialized. An "intrclock" struct may be installed, too.
	3. Secondary CPUs call clockintr_cpu_init(9) to initialize their local state.
	4. All CPUs repeatedly call clockintr_dispatch(9) from the MD clock interrupt handler. The CPUs complete work and rearm their local interrupt clock, if any, during the dispatch.
	5. Repeat step (4) until the system shuts down, suspends, or hibernates.
	6. During resume, the primary CPU calls inittodr(9) and advances the system uptime.
	7. Go to step (2). This time around, clockintr_cpu_init(9) also advances the work schedule on the calling CPU to skip events that expired during suspend. This prevents a "thundering herd" of useless work during the first clock interrupt.
	In the long term, we need an MI clock interrupt scheduler in order to (1) provide control over the clock interrupt to MI subsystems like timeout(9) and dt(4) to improve their accuracy, (2) provide drivers like acpicpu(4) a means for slowing or stopping the clock interrupt on idle CPUs to conserve power, and (3) reduce the amount of duplicated code in the MD clock interrupt code. Before we can do any of that, though, we need to switch every platform over to using clockintr(9) and do some cleanup.
	Prompted by "the vmm(4) time bug," among other problems, and a discussion at a2k19 on the subject. Lots of design input from kettenis@. Early versions reviewed by kettenis@ and mlarkin@. Platform-specific help and testing from kettenis@, gkoehler@, mlarkin@, miod@, aoyama@, visa@, and dv@. Babysitting and spiritual guidance from mlarkin@ and kettenis@.
	Link: https://marc.info/?l=openbsd-tech&m=166697497302283&w=2
	ok kettenis@ mlarkin@
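The schedule advance in step (7) can be illustrated with a small sketch: given a periodic event's next expiration and the current uptime, jump the expiration forward by whole periods rather than running every missed event. The function name toy_advance() and its signature are assumptions for illustration and are not taken from the clockintr(9) sources.

```c
#include <stdint.h>

/*
 * Advance *expiration past 'now' in steps of 'period' (nanoseconds).
 * Returns how many periods were skipped or consumed. 'period' must
 * be nonzero. Hypothetical helper, not the kernel's code.
 */
uint64_t
toy_advance(uint64_t *expiration, uint64_t period, uint64_t now)
{
	uint64_t elapsed, count;

	if (now < *expiration)
		return 0;		/* nothing has expired yet */

	elapsed = now - *expiration;
	count = elapsed / period + 1;	/* the expired one plus any missed */
	*expiration += count * period;
	return count;
}
```

With this shape, a long suspend costs one division instead of one callback per missed tick, which is how the "thundering herd" of useless work after resume is avoided.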
2022-11-05	For textrel binaries, skipping immutability on text segments is not enough:	Theo de Raadt
	It needs to be all non-writeable segments, which really means rodata. crt0 and ld.so will need to call mimmutable() later on these regions. ok kettenis
2022-11-03	Style: always use *retval and never retval[0] in syscalls,	Philip Guenther
	to reflect that retval is just a single return value. ok miod@
2022-11-03	Make scdebug_ret() behave like ktrsysret(), showing the off_t value	Philip Guenther
	for lseek() and a single register_t value for all others. ok miod@
2022-11-02	Clean up more ancient history: since 2015 the libc stubs for	Philip Guenther
	fork/vfork/__tfork haven't cared about the second return register. So, stop setting retval[1] in kern_fork.c and stop setting the second return register in the MD child_return() routines. With the above, we have no multi-register return values on LP64, so stop touching that register in the trapframe on those archs. testing miod@ and aoyama@ ok miod@
2022-10-30	Simplify setregs() by passing it the ps_strings and switching	Philip Guenther
	sys_execve() to return EJUSTRETURN. setregs() is the MD routine used by sys_execve() to set up the thread's trapframe and PCB such that, on 'return' to userspace, it has the register values defined by the ABI and otherwise zero. It had to set the syscall retval[] values previously because the normal syscall return path overwrites a couple registers with the retval[] values. By instead returning EJUSTRETURN that and some complexity with program-counter handling on m88k and sparc64 goes away. Also, give setregs() a 'struct ps_strings *arginfo' argument so powerpc, powerpc64, and sh can directly get argc/argv/envp values for registers instead of copyin()ing the one in userspace. Improvements from miod@ and millert@ Testing assistance miod@, kettenis@, and aoyama@ ok miod@ kettenis@
2022-10-27	Unfortunately there are still ugly text-relocation binaries in the wild.	Theo de Raadt
	Libraries are less of a concern, because ld.so can fix them in the right order. So we must scan DYNAMIC for the TEXTREL marker, and not make X LOADs immutable. ld.so will apply changes to the text segment. In upcoming diff, crt0 and ld.so will then apply immutability. ok kettenis
2022-10-27	VMCMD_SYSCALL cannot be incorporated into flags variable, because flags	Theo de Raadt
	is inspected narrowly for base address later. ok kettenis
2022-10-26	Fix handling of PGIDs in wait4(2) that I broke with the previous commit.	Mark Kettenis
	ok anton@, millert@
2022-10-25	regen	Mark Kettenis

2022-10-25	Implement waitid(2) which is now part of POSIX and used by mozilla.	Mark Kettenis
	This includes a change of siginfo_r which is technically an ABI break but this should have no real-world impact since the members involved are never touched by the kernel. ok millert@, deraadt@
2022-10-25	Implement waitid(2) which is now part of POSIX and used by mozilla.	Mark Kettenis
	This includes a change of siginfo_r which is technically an ABI break but this should have no real-world impact since the members involved are never touched by the kernel. ok millert@, deraadt@
2022-10-23	A better workaround for mips64 mimmutable problem. The problem is the	Theo de Raadt
	DT_DEBUG word is inside a R LOAD that gets marked immutable, but ld.so does a mprotect RW + adjustment + mprotect R. DT_DEBUG is specified as being inside the DYNAMIC range, so let's do all the immutables and then, on mips64 only, turn around and make DYNAMIC mutable. That gives us time to see if we can move DT_DEBUG or change what ld.so is doing. discussed at length with kettenis
2022-10-22	automatic immutable for base executable is not ready on mips	Theo de Raadt
	because DT_DEBUG isn't in the right place
2022-10-21	uvm_map_immutable() takes start,end, not start,end	Theo de Raadt
	I juggled my trees incorrectly.
2022-10-21	the debug "name" parameter to uvm_map_immutable() is no longer needed	Theo de Raadt

2022-10-21	sigaltstack() was adapted to work on mimmutable regions (an unfortunate	Theo de Raadt
	compromise...), but it means the stack can be marked immutable again. ok kettenis
2022-10-21	automatically mark immutable certain regions in program&ld.so LOADs.	Theo de Raadt
	The large commented block in elf_load_psection explains the situation. ok kettenis.
2022-10-17	Change pru_abort() return type to void and make pru_abort()	Vitaliy Makkoveev
	optional. We have no interest in the pru_abort() return value. We call it only from soabort(), which is a dummy pru_abort() wrapper and has no return value. Only the connection-oriented sockets need to implement the (*pru_abort)() handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing code for all others; it is never called. ok guenther@
2022-10-16	Rather than marking MAP_STACK on entries for sigaltstack() [2 days ago],	Theo de Raadt
	go back to the old approach: using a new anon mapping because it removes any potential gadgetry pre-placed in the region (by making it zero). But also bring in a few more validation checks beyond contiguous mapping -- it must not be a syscall region, and the protection must be precisely RW. This does allow sigaltstack() to shoot zero'd MAP_STACK non-immutable regions into the main stack area (which will soon be immutable). I am not sure we can keep reinforce immutable on the region after we do stack (like maybe determine this while doing the validation entry walk?) Sadly, continued support for sigaltstack() does require selecting the guessed best compromise. ok kettenis
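The three validation checks named above (contiguous mapping, not a syscall region, protection exactly RW) can be sketched as a walk over a sorted list of map entries. This is a toy model only: struct toy_entry, the TOY_PROT_* values, and toy_stack_range_ok() are hypothetical and do not reflect the layout or API of the real uvm map code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PROT_READ	0x1
#define TOY_PROT_WRITE	0x2
#define TOY_PROT_EXEC	0x4

/* Hypothetical, simplified map entry. */
struct toy_entry {
	uintptr_t start, end;	/* covers [start, end) */
	int	  prot;
	int	  is_syscall;	/* region permitted to issue syscalls */
};

/*
 * Entries must be sorted by start address. Returns 1 if [start, end)
 * is fully covered without holes by entries that are non-syscall and
 * exactly read+write, otherwise 0.
 */
int
toy_stack_range_ok(const struct toy_entry *e, size_t n,
    uintptr_t start, uintptr_t end)
{
	uintptr_t pos = start;
	size_t i;

	if (start >= end)
		return 0;
	for (i = 0; i < n && pos < end; i++) {
		if (e[i].end <= pos)
			continue;	/* entry lies entirely below the range */
		if (e[i].start > pos)
			return 0;	/* hole: not contiguous */
		if (e[i].is_syscall)
			return 0;
		if (e[i].prot != (TOY_PROT_READ | TOY_PROT_WRITE))
			return 0;	/* must be precisely RW */
		pos = e[i].end;
	}
	return pos >= end;
}
```

Requiring an exact protection match rather than "at least RW" is the stricter reading of "precisely RW": an RWX entry fails the check as well.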
2022-10-15	During the MAP_STACK introduction in 2018, sigaltstack() became a	Theo de Raadt
	problem because haphazard use could shoot holes in the address space (changing permissions, providing opportunities for pivoting, etc). I tried to write a diff to convert the address space correctly but did not understand enough about map entries, so instead we mapped new memory over top of the existing object. Placing a new mapping becomes unfeasible with the upcoming mimmutable model, so here is code that adds MAP_STACK to the region. It will only do so for a contiguously mapped region that is non-syscall with permission RW, otherwise it returns an error. Food for thought: If we know the object isn't serviced by an object, we should consider zero'ing the region, to block pre-pivot placement? ok kettenis
2022-10-12	Extend struct todr_chip_handle with a todr_quality member. This allows us	Mark Kettenis
	to assign a quality to each RTC implementation and pick the "best" RTC if a system has multiple RTCs (or multiple interfaces to an RTC). This allows us to prefer a battery-backed I2C RTC over an RTC that is part of the SoC which is only running if the SoC is powered. It also allows us to work around issues with firmware RTC interfaces that may lie to us or even crash the system. This change makes sure the todr_quality member of the struct is always initialized. In most cases the quality will be set to zero; further adjustments of the quality for specific subsystems/architectures will follow. ok cheloha@, patrick@
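A sketch of quality-based selection, under the assumption that a higher quality value wins and that the first-registered handle wins ties. struct toy_todr and toy_todr_pick() are hypothetical; the commit message only establishes that struct todr_chip_handle gains a todr_quality member.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct todr_chip_handle. */
struct toy_todr {
	const char *name;
	int	    quality;	/* assumption: higher is better */
};

/*
 * Called as each RTC attaches: keep the current best unless the
 * candidate has strictly higher quality.
 */
const struct toy_todr *
toy_todr_pick(const struct toy_todr *best, const struct toy_todr *cand)
{
	if (best == NULL || cand->quality > best->quality)
		return cand;
	return best;
}
```

Under these assumptions, a battery-backed I2C RTC given a positive quality would be preferred over an SoC RTC left at zero, and a misbehaving firmware interface could be pushed out of contention with a negative value.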
2022-10-12	The sigaltstack() MAP_STACK re-map mechanism is incompatible with immutable	Theo de Raadt
	regions, so immutable stack isn't viable yet. There are configure programs which create sigstacks upon their own stacks, and there is no simple fix for the sigaltstack mechanism... discovered by sthen and tb
2022-10-11	Give checkdisklabel() a new parameter supplying the dev_t of the	Kenneth R Westerback
	device whose disklabel is being checked. Within checkdisklabel() use this information to discover a device name iff (sic) the label is an obsolete version. Use the name to generate a meaningful warning message asking the user to rewrite the disklabel and thus promote it to the current version. Suggested by, feedback from and ok deraadt@
2022-10-08	The stack can also be marked immutable, because we expect no sane program	Theo de Raadt
	to try to change the permissions of it. We won't know who's trying that until we enable it and see what breaks. A tricky piece relating to setrlimit stack size changing was previously committed. ok kettenis
2022-10-08	The signal trampoline and timekeep regions can be marked immutable at	Theo de Raadt
	execve() time ok kettenis
2022-10-07	sync	Theo de Raadt

2022-10-07	Add mimmutable(2) system call which locks the permissions (PROT_*) of	Theo de Raadt
	memory mappings so they cannot be changed by a later mmap(), mprotect(), or munmap(), which will error with EPERM instead. ok kettenis
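The semantics described here can be modelled in a few lines: once a mapping is marked immutable, later attempts to change its protection fail with EPERM. This is a userland toy model so that it runs anywhere; struct toy_mapping and the toy_* functions are hypothetical and are not the real mimmutable(2) or mprotect(2) interfaces, which operate on address ranges.

```c
#include <errno.h>

#define TOY_PROT_READ	0x1
#define TOY_PROT_WRITE	0x2

/* Hypothetical model of a single mapping. */
struct toy_mapping {
	int prot;
	int immutable;
};

/* Lock the mapping's permissions; there is no way to undo this. */
int
toy_mimmutable(struct toy_mapping *m)
{
	m->immutable = 1;
	return 0;
}

/* Change protection; returns 0 on success or EPERM if locked. */
int
toy_mprotect(struct toy_mapping *m, int prot)
{
	if (m->immutable)
		return EPERM;
	m->prot = prot;
	return 0;
}
```

The one-way nature of the flag is the point of the design: code that runs later, including injected code, cannot relax the permissions that were fixed early in process startup.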
2022-10-03	System calls should not fail due to temporary memory shortage in	Alexander Bluhm
	malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During syscall socket(2) it is ok to wait, this logic was missing for internet pcb. Pfkey and route sockets were already waiting. sonewconn() must not wait when called during TCP 3-way handshake. This logic has been preserved. Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
2022-10-03	Add a second membar producer into counters_zero(). Now it is	Alexander Bluhm
	symmetric to counters_read(). OK jmatthew@
2022-10-01	The syscall table generation awk script was also used by compat layers	Theo de Raadt
	in the past, but those compat layers are gone. Remove support for the "config file" ok miod millert