path: root/sys/kern
Age | Commit message | Author
2022-11-10 | Add mbr_get_fstype() and use it to translate MBR dp_typ fields | Kenneth R Westerback
into FS_* values. Similar to what gpt_get_fstype() does. Code is clearer and better positioned for planned enhancements to spoofing. No intentional functional change.
2022-11-10 | Put CPUs in the lowest P-state just before the final suspend step. The | Mark Kettenis
firmware probably does this for us on ACPI systems with proper S3 support, but this doesn't happen on systems where we park CPUs in a low-power idle state ourselves. ok deraadt@
2022-11-10 | Add support for per-cpu event counters, to be used for clock and IPI | Jonathan Matthew
counters where the event being counted occurs across all CPUs in the system. Counter instances can be made per-cpu by calling evcount_percpu() after the counter is attached, and this can occur before or after all system CPUs are attached. Per-cpu counter instances should be incremented using evcount_inc(). ok kettenis@ jca@ cheloha@
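The per-cpu idea described above can be illustrated with a minimal userland sketch. It is not the kernel's evcount(9) implementation: the struct, the fixed `NCPU`, and the `percpu_counter_*` names are hypothetical and exist only for illustration. Each CPU increments its own slot, and a reader sums all slots.

```c
#include <stdint.h>

#define NCPU 4	/* hypothetical fixed CPU count, for illustration only */

/* Hypothetical counter with one slot per CPU. */
struct percpu_counter {
	uint64_t slots[NCPU];
};

/* Increment the slot that belongs to the given CPU. */
void
percpu_counter_inc(struct percpu_counter *c, unsigned int cpu)
{
	c->slots[cpu % NCPU]++;
}

/* Read the total by summing every CPU's slot. */
uint64_t
percpu_counter_read(const struct percpu_counter *c)
{
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i < NCPU; i++)
		total += c->slots[i];
	return total;
}
```

Keeping one slot per CPU means an increment never touches a slot owned by another CPU; the cost moves to the reader, which has to visit every slot.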
2022-11-10 | fix build after 1.298 | Jonathan Gray
2022-11-09 | Remove kernel lock here since msleep() with PCATCH no longer requires it. | Claudio Jeker
OK mpi@
2022-11-09 | Some limited setsockopt/getsockopt are allowed in pledge "stdio". | Theo de Raadt
Also allow IPPROTO_TCP:TCP_NODELAY. It is very small kernel code, and will allow some software to drop "inet". Requested by djm.
2022-11-09 | Simplify the overly complex VXLOCK handling in spec_close. | Claudio Jeker
The code only needs to know if the vnode is exclusive locked and this can be done on entry of the function. OK mpi@
2022-11-09 | timeout(9): remove TIMEOUT_KCLOCK flag | Scott Soule Cheloha
I never should have added the TIMEOUT_KCLOCK flag. It is redundant and only serves to complicate the timeout(9) logic. In every place where we check for the flag we can just use timeout.to_kclock. So, remove the flag from <sys/timeout.h> and rewrite all affected logic to use the value of timeout.to_kclock instead. ok kn@
2022-11-09 | regen | Martin Pieuchot
2022-11-09 | gpt_get_fstype() doesn't modify its parameter so make said | Kenneth R Westerback
parameter const.
2022-11-09 | Mark sched_yield(2) as NOLOCK. | Martin Pieuchot
All the fields accessed in this syscall are protected by the SCHED_LOCK() so it isn't necessary to wait for another CPU to release the KERNEL_LOCK() before that. ok claudio@
2022-11-08 | allow the KERN_AUTOCONF_SERIAL sysctl in pledge'd processes | Robert Nagy
ok deraadt@
2022-11-08 | timeout(9): remove unused, undocumented timeout_in_nsec() interface | Scott Soule Cheloha
The kernel is not quite ready for timeout_in_nsec(). Remove it and kclock_nanotime(). Both are unused. Prompted by jsg@. ok kn@
2022-11-08 | tc_setclock: don't print a warning if tc_windup() rejects inittodr(9) time | Scott Soule Cheloha
During resume, it isn't necessarily a problem if the UTC time we get from inittodr(9) lags behind the system UTC clock. In particular, if the active timecounter's frequency is low enough, tc_delta() might not overflow across a brief suspend. Remove the misleading warning message. The code is behaving as intended, just not in a way I anticipated when I added the warning message a few years ago. Discovered by kettenis@. Root cause isolated with kettenis@. Link: https://marc.info/?l=openbsd-tech&m=166790845619897&w=2 ok mlarkin@ kettenis@
2022-11-08 | Push kernel lock down into ifioctl() | Klemens Nanni
This is a mechanical diff without semantical changes, locking ioctls individually inside ifioctl() rather than all of them around it. This allows us to unlock ioctls one by one. OK mpi
2022-11-08 | Regen | Martin Pieuchot
2022-11-08 | Mark mmap(2), munmap(2) and mprotect(2) as NOLOCK. | Martin Pieuchot
Accesses to data structures used by these syscalls are serialized by the VM map lock with the exception of file mappings which are still protected by the KERNEL_LOCK(). Unlocking this set of syscalls improves most of userland workloads. Tested by many including robert@ (since 2 years), mlarkin@, kn@, sdk@, jca@, aoyama@, naddy@, Scott Bennett and others. Thanks to all! Joint work with kn@. ok robert@, aja@, kettenis@, kn@, deraadt@, beck@
2022-11-07 | introduce a new kern.autoconf_serial sysctl that can be used by userland | Robert Nagy
to monitor state changes of the kernel device tree. Input from dnd. ok dlg@, deraadt@
2022-11-07 | Nuke last references to d_drivedata. | Kenneth R Westerback
2022-11-05 | clockintr(9): initial commit | Scott Soule Cheloha
clockintr(9) is a machine-independent clock interrupt scheduler. It emulates most of what the machine-dependent clock interrupt code is doing on every platform. Every CPU has a work schedule based on the system uptime clock. For now, every CPU has a hardclock(9) and a statclock(). If schedhz is set, every CPU has a schedclock(), too.
This commit only contains the MI pieces. All code is conditionally compiled with __HAVE_CLOCKINTR. This commit changes no behavior yet.
At a high level, clockintr(9) is configured and used as follows:
1. During boot, the primary CPU calls clockintr_init(9). Global state is initialized.
2. Primary CPU calls clockintr_cpu_init(9). Local, per-CPU state is initialized. An "intrclock" struct may be installed, too.
3. Secondary CPUs call clockintr_cpu_init(9) to initialize their local state.
4. All CPUs repeatedly call clockintr_dispatch(9) from the MD clock interrupt handler. The CPUs complete work and rearm their local interrupt clock, if any, during the dispatch.
5. Repeat step (4) until the system shuts down, suspends, or hibernates.
6. During resume, the primary CPU calls inittodr(9) and advances the system uptime.
7. Go to step (2). This time around, clockintr_cpu_init(9) also advances the work schedule on the calling CPU to skip events that expired during suspend. This prevents a "thundering herd" of useless work during the first clock interrupt.
In the long term, we need an MI clock interrupt scheduler in order to (1) provide control over the clock interrupt to MI subsystems like timeout(9) and dt(4) to improve their accuracy, (2) provide drivers like acpicpu(4) a means for slowing or stopping the clock interrupt on idle CPUs to conserve power, and (3) reduce the amount of duplicated code in the MD clock interrupt code. Before we can do any of that, though, we need to switch every platform over to using clockintr(9) and do some cleanup. Prompted by "the vmm(4) time bug," among other problems, and a discussion at a2k19 on the subject.
Lots of design input from kettenis@. Early versions reviewed by kettenis@ and mlarkin@. Platform-specific help and testing from kettenis@, gkoehler@, mlarkin@, miod@, aoyama@, visa@, and dv@. Babysitting and spiritual guidance from mlarkin@ and kettenis@. Link: https://marc.info/?l=openbsd-tech&m=166697497302283&w=2 ok kettenis@ mlarkin@
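The schedule advance in step (7) can be sketched as follows. This is a minimal illustration, not the code from the commit: `schedule_advance()` and its parameters are hypothetical names, and times are assumed to be plain unsigned counts in a common unit. The helper moves an expiration time forward by whole periods until it lies in the future and reports how many periods were skipped, so that only one event fires after a long suspend instead of one per missed period.

```c
#include <stdint.h>

/*
 * Hypothetical helper: advance *expiration by whole multiples of
 * period until it is strictly later than now.  Returns the number of
 * periods that were skipped (0 if the event has not expired yet).
 */
uint64_t
schedule_advance(uint64_t *expiration, uint64_t period, uint64_t now)
{
	uint64_t elapsed;

	if (period == 0 || now < *expiration)
		return 0;

	/* One division replaces a loop over every missed period. */
	elapsed = (now - *expiration) / period + 1;
	*expiration += elapsed * period;
	return elapsed;
}
```

With an expiration of 100, a period of 10, and a current time of 135, the helper skips four periods and leaves the next expiration at 140.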
2022-11-05 | For textrel binaries, skipping immutability on text segments is not enough: | Theo de Raadt
It needs to be all non-writeable segments, which really means rodata. crt0 and ld.so will need to call mimmutable() later on these regions. ok kettenis
2022-11-03 | Style: always use *retval and never retval[0] in syscalls, | Philip Guenther
to reflect that retval is just a single return value. ok miod@
2022-11-03 | Make scdebug_ret() behave like ktrsysret(), showing the off_t value | Philip Guenther
for lseek() and a single register_t value for all others. ok miod@
2022-11-02 | Clean up more ancient history: since 2015 the libc stubs for | Philip Guenther
fork/vfork/__tfork haven't cared about the second return register. So, stop setting retval[1] in kern_fork.c and stop setting the second return register in the MD child_return() routines. With the above, we have no multi-register return values on LP64, so stop touching that register in the trapframe on those archs. testing miod@ and aoyama@ ok miod@
2022-10-30 | Simplify setregs() by passing it the ps_strings and switching | Philip Guenther
sys_execve() to return EJUSTRETURN. setregs() is the MD routine used by sys_execve() to set up the thread's trapframe and PCB such that, on 'return' to userspace, it has the register values defined by the ABI and otherwise zero. It had to set the syscall retval[] values previously because the normal syscall return path overwrites a couple registers with the retval[] values. By instead returning EJUSTRETURN, that and some complexity with program-counter handling on m88k and sparc64 go away. Also, give setregs() a 'struct ps_strings *arginfo' argument so powerpc, powerpc64, and sh can directly get argc/argv/envp values for registers instead of copyin()ing the one in userspace. Improvements from miod@ and millert@ Testing assistance miod@, kettenis@, and aoyama@ ok miod@ kettenis@
2022-10-27 | Unfortunately there are still ugly text-relocation binaries in the wild. | Theo de Raadt
Libraries are less of a concern, because ld.so can fix them in the right order. So we must scan DYNAMIC for the TEXTREL marker, and not make X LOADs immutable. ld.so will apply changes to the text segment. In upcoming diff, crt0 and ld.so will then apply immutability. ok kettenis
2022-10-27 | VMCMD_SYSCALL cannot be incorporated into flags variable, because flags | Theo de Raadt
is inspected narrowly for base address later. ok kettenis
2022-10-26 | Fix handling of PGIDs in wait4(2) that I broke with the previous commit. | Mark Kettenis
ok anton@, millert@
2022-10-25 | regen | Mark Kettenis
2022-10-25 | Implement waitid(2) which is now part of POSIX and used by mozilla. | Mark Kettenis
This includes a change of siginfo_r which is technically an ABI break but this should have no real-world impact since the members involved are never touched by the kernel. ok millert@, deraadt@
2022-10-25 | Implement waitid(2) which is now part of POSIX and used by mozilla. | Mark Kettenis
This includes a change of siginfo_r which is technically an ABI break but this should have no real-world impact since the members involved are never touched by the kernel. ok millert@, deraadt@
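For readers unfamiliar with the interface, here is a minimal userland sketch of how waitid(2) is typically called, following the POSIX definition rather than anything specific to this commit. The `spawn_and_reap()` helper name is hypothetical. It forks a child that exits with a given code and reaps it with waitid(2), reading the result from the siginfo_t fields.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, then reap
 * it with waitid(2).  Returns the child's exit status, or -1 on error
 * or if the child did not terminate via exit.
 */
int
spawn_and_reap(int code)
{
	siginfo_t info;
	pid_t pid;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);

	/* P_PID selects one specific child; WEXITED waits for termination. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) == -1)
		return -1;
	if (info.si_code != CLD_EXITED)
		return -1;
	return info.si_status;
}
```

Unlike wait4(2), the result comes back in a siginfo_t: `si_code` says how the child changed state and `si_status` carries the exit status or signal number.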
2022-10-23 | A better workaround for mips64 mimmutable problem. The problem is the | Theo de Raadt
DT_DEBUG word is inside a R LOAD that gets marked immutable, but ld.so does a mprotect RW + adjustment + mprotect R. DT_DEBUG is specified as being inside the DYNAMIC range, so let's do all the immutables and then, on mips64 only, turn around and make DYNAMIC mutable. That gives us time to see if we can move DT_DEBUG or change what ld.so is doing. discussed at length with kettenis
2022-10-22 | automatic immutable for base executable is not ready on mips | Theo de Raadt
because DT_DEBUG isn't in the right place
2022-10-21 | uvm_map_immutable() takes start,end, not start,end | Theo de Raadt
I juggled my trees incorrectly.
2022-10-21 | the debug "name" parameter to uvm_map_immutable() is no longer needed | Theo de Raadt
2022-10-21 | sigaltstack() was adapted to work on mimmutable regions (an unfortunate | Theo de Raadt
compromise...), but it means the stack can be marked immutable again. ok kettenis
2022-10-21 | automatically mark immutable certain regions in program & ld.so LOADs. | Theo de Raadt
The large commented block in elf_load_psection explains the situation. ok kettenis.
2022-10-17 | Change pru_abort() return type to the type of void and make pru_abort() | Vitaliy Makkoveev
optional. We have no interest in the pru_abort() return value. We call it only from soabort(), which is a dummy pru_abort() wrapper and has no return value. Only the connection oriented sockets need to implement the (*pru_abort)() handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing code for all others; it isn't called. ok guenther@
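The "optional handler" pattern described here can be sketched in plain C. This is an illustration under stated assumptions, not the kernel's socket code: `struct sock`, `struct proto_ops`, `sock_abort()`, and the two ops tables are hypothetical names. The wrapper returns void and only calls the handler when the protocol provides one.

```c
#include <stddef.h>

struct sock;

/* Hypothetical per-protocol operations table. */
struct proto_ops {
	void (*abort)(struct sock *);	/* optional, may be NULL */
};

/* Hypothetical socket carrying a pointer to its protocol's table. */
struct sock {
	const struct proto_ops *ops;
	int aborted;
};

/* Connection-oriented protocols provide an abort handler. */
static void
stream_abort(struct sock *so)
{
	so->aborted = 1;
}

const struct proto_ops stream_ops = { stream_abort };
const struct proto_ops dgram_ops = { NULL };	/* no handler needed */

/* Wrapper: nothing to return, and nothing to do without a handler. */
void
sock_abort(struct sock *so)
{
	if (so->ops->abort != NULL)
		so->ops->abort(so);
}
```

Making the handler optional lets protocols that never need it leave the slot empty instead of carrying a stub that is never called.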
2022-10-16 | Rather than marking MAP_STACK on entries for sigaltstack() [2 days ago], | Theo de Raadt
go back to the old approach: using a new anon mapping because it removes any potential gadgetry pre-placed in the region (by making it zero). But also bring in a few more validation checks beyond contiguous mapping -- it must not be a syscall region, and the protection must be precisely RW. This does allow sigaltstack() to shoot zero'd MAP_STACK non-immutable regions into the main stack area (which will soon be immutable). I am not sure we can keep reinforce immutable on the region after we do stack (like maybe determine this while doing the validation entry walk?) Sadly, continued support for sigaltstack() does require selecting the guessed best compromise. ok kettenis
2022-10-15 | During the MAP_STACK introduction in 2018, sigaltstack() became a | Theo de Raadt
problem because haphazard use could shoot holes in the address space (changing permissions, providing opportunities for pivoting, etc). I tried to write a diff to convert the address space correctly but did not understand enough about map entries, so instead we mapped new memory over top of the existing object. Placing a new mapping becomes unfeasible with the upcoming mimmutable model, so here is code that adds MAP_STACK to the region. It will only do so for a contiguously mapped region that is non-syscall with permission RW, otherwise it returns an error. Food for thought: If we know the object isn't serviced by an object, we should consider zero'ing the region, to block pre-pivot placement? ok kettenis
2022-10-12 | Extend struct todr_chip_handle with a todr_quality member. This allows us | Mark Kettenis
to assign a quality to an RTC implementation and pick the "best" RTC if a system has multiple RTCs (or multiple interfaces to an RTC). This allows us to prefer a battery-backed I2C RTC over an RTC that is part of the SoC which is only running if the SoC is powered. It also allows us to work around issues with firmware RTC interfaces that may lie to us or even crash the system. This change makes sure the todr_quality member of the struct is always initialized. In most cases the quality will be set to zero; further adjustments of the quality for specific subsystems/architectures will follow. ok cheloha@, patrick@
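Picking the "best" device by a quality value can be sketched as below. This is only an illustration: `struct rtc_handle` and `rtc_pick_best()` are hypothetical names and do not reflect the layout of struct todr_chip_handle, and the tie-breaking rule (first entry wins) is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for an RTC handle with a quality value. */
struct rtc_handle {
	const char *name;
	int quality;
};

/*
 * Return the handle with the highest quality, or NULL if the list is
 * empty.  On a tie the earlier entry wins (an assumption made here).
 */
const struct rtc_handle *
rtc_pick_best(const struct rtc_handle *rtcs, size_t n)
{
	const struct rtc_handle *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best == NULL || rtcs[i].quality > best->quality)
			best = &rtcs[i];
	}
	return best;
}
```

With every quality left at zero the first registered device is chosen, so assigning a higher value to one device is enough to make it preferred.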
2022-10-12 | The sigaltstack() MAP_STACK re-map mechanism is incompatible with immutable | Theo de Raadt
regions, so immutable stack isn't viable yet. There are configure programs which create sigstacks upon their own stacks, and there is no simple fix for the sigaltstack mechanism... discovered by sthen and tb
2022-10-11 | Give checkdisklabel() a new parameter supplying the dev_t of the | Kenneth R Westerback
device whose disklabel is being checked. Within checkdisklabel() use this information to discover a device name iff (sic) the label is an obsolete version. Use the name to generate a meaningful warning message asking the user to rewrite the disklabel and thus promote it to the current version. Suggested by, feedback from and ok deraadt@
2022-10-08 | The stack can also be marked immutable, because we expect no sane program | Theo de Raadt
to try to change the permissions of it. We won't know who's trying that until we enable it and see what breaks. A tricky piece relating to setrlimit stack size changing was previously committed. ok kettenis
2022-10-08 | The signal trampoline and timekeep regions can be marked immutable at | Theo de Raadt
execve() time ok kettenis
2022-10-07 | sync | Theo de Raadt
2022-10-07 | Add mimmutable(2) system call which locks the permissions (PROT_*) of | Theo de Raadt
memory mappings so they cannot be changed by a later mmap(), mprotect(), or munmap(), which will error with EPERM instead. ok kettenis
2022-10-03 | System calls should not fail due to temporary memory shortage in | Alexander Bluhm
malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During syscall socket(2) it is ok to wait, this logic was missing for internet pcb. Pfkey and route sockets were already waiting. sonewconn() must not wait when called during TCP 3-way handshake. This logic has been preserved. Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
2022-10-03 | Add a second membar producer into counters_zero(). Now it is | Alexander Bluhm
symmetric to counters_read(). OK jmatthew@
2022-10-01 | The syscall table generation awk script was also used by compat layers | Theo de Raadt
in the past, but those compat layers are gone. Remove support for the "config file" ok miod millert