path: root/sys/kern
2022-08-22  Move PRU_DISCONNECT request to (*pru_disconnect)().  (Vitaliy Makkoveev)
ok bluhm@
2022-08-22  Move PRU_ACCEPT request to (*pru_accept)().  (Vitaliy Makkoveev)
ok bluhm@
2022-08-21  Move PRU_CONNECT request to (*pru_connect)() handler.  (Vitaliy Makkoveev)
ok bluhm@
2022-08-21  Move PRU_LISTEN request to (*pru_listen)() handler.  (Vitaliy Makkoveev)
ok bluhm@
2022-08-21  Change soabort() return value to void. We are never interested in it.  (Vitaliy Makkoveev)
ok bluhm@
2022-08-20  Move PRU_BIND request to (*pru_bind)() handler.  (Vitaliy Makkoveev)
For protocols which don't support the request, leave the handler NULL. Do the NULL check within the corresponding pru_*() wrapper and return EOPNOTSUPP in that case. This will be done for all upcoming user request handlers. ok bluhm@ guenther@
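A minimal sketch of the wrapper pattern described above, once the per-request handlers live in the 'pr_usrreqs' structure (introduced in the 2022-08-15 commit below); the field layout and the bind argument list are assumptions for illustration, not a copy of the committed code:

	/*
	 * pru_bind() wrapper: protocols that leave the handler NULL
	 * simply get EOPNOTSUPP back.
	 */
	int
	pru_bind(struct socket *so, struct mbuf *nam, struct proc *p)
	{
		if (so->so_proto->pr_usrreqs->pru_bind == NULL)
			return (EOPNOTSUPP);
		return ((*so->so_proto->pr_usrreqs->pru_bind)(so, nam, p));
	}

The same NULL-check shape then repeats in every other pru_*() wrapper, so protocols only implement the requests they actually support.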
2022-08-20  Restore the exemption from start/size checks that OpenBSD (A6) MBR partitions previously enjoyed.  (Kenneth R Westerback)
Found and fix tested by matthieu@
2022-08-16  Remove kqueue-related ktrace points from poll(2) and select(2)  (Visa Hankala)
These ktrace points do not seem useful any longer because the new implementation of poll(2) and select(2) appears to work well. OK deraadt@ mpi@
2022-08-16  Remove obsolete kern.nselcoll sysctl.  (Visa Hankala)
OK millert@ deraadt@
2022-08-15  Revert previous. It was not ok'ed by dlg@.  (Vitaliy Makkoveev)
2022-08-15  Introduce 'pr_usrreqs' structure and move existing user-protocol handlers into it.  (Vitaliy Makkoveev)
We want to split the existing (*pr_usrreq)() into multiple short handlers for each PRU_ request, as was already done for PRU_ATTACH and PRU_DETACH. This is the preparation step; the (*pr_usrreq)() split will be done in the following diffs. Based on reverted diff from guenther@. ok bluhm@
2022-08-15  Stop doing lockless `t_flags' check within task_add(9) and task_del(9).  (Vitaliy Makkoveev)
This doesn't work on MP systems. We do a locked `t_flags' check just after the lockless check, so just remove it. ok dlg@
2022-08-14  remove unneeded includes in sys/kern  (Jonathan Gray)
ok mpi@ miod@
2022-08-13  Introduce the pru_*() wrappers for corresponding (*pr_usrreq)() calls.  (Vitaliy Makkoveev)
This is helpful for the upcoming split of (*pr_usrreq)() into multiple handlers, but it already makes the code more readable. Also add '#ifndef _SYS_SOCKETVAR_H_' guards to sys/socketvar.h. This prevents collisions when both sys/protosw.h and sys/socketvar.h are included together. Both the 'socket' and 'protosw' structures are required to be defined before the pru_*() wrappers, so we need to include sys/socketvar.h from sys/protosw.h. ok bluhm@
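Two small sketches of the mechanics mentioned above. The include guard is the standard C pattern; the wrapper's argument list for (*pr_usrreq)() is an assumption based on the traditional interface, not a copy of the committed code:

	/* sys/socketvar.h: guard against double inclusion */
	#ifndef _SYS_SOCKETVAR_H_
	#define _SYS_SOCKETVAR_H_
	/* ... existing declarations, including struct socket ... */
	#endif /* _SYS_SOCKETVAR_H_ */

	/* One wrapper, still forwarding to the old single handler. */
	static inline int
	pru_shutdown(struct socket *so)
	{
		return ((*so->so_proto->pr_usrreq)(so, PRU_SHUTDOWN,
		    NULL, NULL, NULL, curproc));
	}

Callers switch from open-coded (*pr_usrreq)(so, PRU_SHUTDOWN, ...) invocations to pru_shutdown(so), which is what later makes swapping in dedicated per-request handlers a local change.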
2022-08-13  blist: fix a possible blist corruption with blist_alloc() due to unsigned swblk_t on OpenBSD.  (Sebastien Marie)
reorder the if condition in blst_meta_alloc() in order to check whether the node is the 'Terminator' node first (and leave the loop). DragonFlyBSD is unaffected by it as swblk_t is signed (and the first condition isn't taken). add a regress test for it. while here, move the KASSERT() to KDASSERT(). it is useful but only with DEBUG. ok miod@ todd@
2022-08-12  Revert to pre-r1.249 more laissez-faire checks for valid MBR partitions.  (Kenneth R Westerback)
miod@ (re)discovered an off-by-one in some device size calculations, whether from the ancient misbehaviour of some devices that confuse the number of sectors with the highest valid sector address, or from something newer. Should fix miod@'s octeon boot disk.
2022-08-12  Put more struct vnode fields under splbio().  (Visa Hankala)
Buffer cache related struct vnode fields can be accessed in interrupt context. Be more consistent with the use of splbio(). OK mpi@
2022-08-12  amd64: simplify TSC synchronization testing  (Scott Soule Cheloha)
Computing a per-CPU TSC skew value is error-prone, especially on multisocket machines and VMs. My best guess is that larger latencies appear to the current skew measurement test as TSC desync, and so the TSC is demoted to a kernel timecounter on these machines or marked non-monotonic.

This patch eliminates per-CPU TSC skew values. Instead of trying to measure and correct for TSC desync we only try to detect desync, which is less error-prone. This approach should allow a wider variety of machines to use the TSC as a timecounter when running OpenBSD.

In the new sync test, both CPUs repeatedly try to detect whether their TSC is trailing the other CPU's TSC. The upside to this approach is that it yields no false positives. The downside to this approach is that it takes more time than the current skew measurement test. Each test round takes 1ms, and we run up to two rounds per CPU, so this patch slows boot down by 2ms per AP.

If any CPU fails the sync test, the TSC is marked non-monotonic and a different timecounter is activated. The TC_USER flag remains intact. There is no middle ground where we fall back to only using the TSC in the kernel.

Before running the test, we check for the IA32_TSC_ADJUST register and reset it if necessary. This is a trivial way to work around firmware bugs that desync the TSC before we reach the kernel. Unfortunately, at the moment this register appears to only be available on Intel processors. I cannot find an equivalent but differently-named MSR for AMD processors.

Because there is no per-CPU skew value, there is also no concept of TSC drift anymore.

Miscellaneous notes:

- This patch adds a new timecounter utility function, tc_reset_quality(). Used after sync test failure to mark the TSC non-monotonic.

- I have left TSC_DEBUG enabled for now. Unsure if we should leave it enabled for release or not. If we disable it we no longer run the sync test after failing it once. Running the test even after failure provides information about the desync on every CPU.

- Taking 1ms per test round is fairly conservative. We can experiment with and discuss shorter test rounds. My main goal with a relatively long test round is ensuring VMs actually run the test. It would be bad if a hypervisor interrupted the test for so long that it concealed desync.

- The use of two test rounds is mostly a diagnostic tool: it would be very strange if a CPU passed the first round but failed the second. If we ever saw this in the wild it would indicate something odd.

- Most of the desync seen in test reports is on Ryzen CPUs. I believe, but cannot prove, that this is due to a widespread firmware bug on AMD motherboards. Hopefully AMD and/or the downstream vendors fix it.

- Fixing TSC desync by writing the TSC directly with WRMSR is very difficult. The TSC is a moving target incrementing very quickly and compensating for WRMSR overhead is non-trivial. We can experiment with this, but my confidence is low that we can make it work reliably.

Prompted by deraadt@ and kettenis@ in 2021. Shepherded along by deraadt@ throughout. Reprompted by Yuichiro Naito several times. With input from Yuichiro Naito, naddy@, sthen@, dv@, and deraadt@. Tested by florian@, gnezdo@, sthen@, Josh Rickmar, dv@, Mohamed Aslan, Hrvoje Popovski, Yuichiro Naito, semarie@, mlarkin@, asou@, jmatthew@, Renato Aguiar, and Timo Myyra.

Patch v1: https://marc.info/?l=openbsd-tech&m=164330092208035&w=2
Patch v2: https://marc.info/?l=openbsd-tech&m=164558519712957&w=2
Patch v3: https://marc.info/?l=openbsd-tech&m=165698681018991&w=2
Patch v4: https://marc.info/?l=openbsd-tech&m=165835507113680&w=2
Patch v5: https://marc.info/?l=openbsd-tech&m=165923705118770&w=2

"just commit it" deraadt@
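A rough illustration of the detection loop described above, in the spirit of the new test rather than the committed amd64 code: each CPU repeatedly publishes its latest TSC reading and flags desync if it ever reads a value smaller than one the other CPU has already published. The helper names and the lfence-based serialization are assumptions for this sketch:

	#include <stdint.h>

	static inline uint64_t
	rdtsc_serialized(void)
	{
		uint32_t hi, lo;

		__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
		return ((uint64_t)hi << 32) | lo;
	}

	/*
	 * Run concurrently on two CPUs, each passing its own slot as
	 * `mine' and the other CPU's slot as `theirs'.  Returns nonzero
	 * if this CPU's TSC was ever observed trailing the peer's.
	 */
	static int
	tsc_lag_detected(volatile uint64_t *mine, volatile uint64_t *theirs,
	    int rounds)
	{
		int lag = 0;

		while (rounds-- > 0) {
			uint64_t now = rdtsc_serialized();

			if (now < *theirs)	/* behind a value the peer already saw */
				lag = 1;
			*mine = now;
		}
		return lag;
	}

Because a later local read can never legitimately be smaller than an earlier remote read when the counters are synchronized, any lag observation is a true positive, which matches the "no false positives" claim above.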
2022-08-12  Coverity says multiplying two uint32_t's and assigning them to a uint64_t may not produce the (humanly) obvious result.  (Kenneth R Westerback)
Cast one of them to a (uint64_t) in the hope of invoking the appropriate int promotion god. CID 1519495
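The issue and fix in miniature, with illustrative values and variable names rather than the ones in the disklabel code:

	uint32_t nsectors = 0x00200000;
	uint32_t secsize  = 0x00001000;
	uint64_t bytes;

	bytes = nsectors * secsize;		/* product computed in 32 bits: wraps to 0 */
	bytes = (uint64_t)nsectors * secsize;	/* one operand widened: full 64-bit product */

The assignment target being 64 bits wide does not help; the multiplication itself must be performed at 64-bit width.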
2022-08-11  Don't trust gpt header data read from disk until after its validity is checked.  (Kenneth R Westerback)
Found the hard way by kn@. Cluebats from millert@ and deraadt@. Fix tested by and ok kn@
2022-08-11  Add TCP_INFO support to getsockopt for tcp sessions.  (Claudio Jeker)
TCP_INFO provides a lot of information about the TCP session of this socket. Many processes like to peek at the rtt of a connection, but this also provides a lot of more specialized info for use by e.g. tcpbench(1). While the basic minimal info is available all the time, the more specific data is only populated for privileged processes. This is done to not share data back to userland that could be used to attack a session. TCP_INFO is available to pledge "inet" since pledged processes like chrome tend to use TCP_INFO when available. OK bluhm@
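A small userland sketch of how a process might query TCP_INFO on a connected TCP socket; the tcpi_rtt field name is an assumption about struct tcp_info here, so consult netinet/tcp.h for the real layout and units:

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <stdio.h>

	/* fd is a connected TCP socket, e.g. from connect(2) or accept(2). */
	static void
	show_rtt(int fd)
	{
		struct tcp_info ti;
		socklen_t len = sizeof(ti);

		if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == -1) {
			perror("getsockopt TCP_INFO");
			return;
		}
		printf("rtt: %u\n", ti.tcpi_rtt);	/* smoothed RTT estimate */
	}

An unprivileged caller still gets the structure back; per the message above, the more specific fields are only populated when the process is privileged.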
2022-08-08  Before ypconnect(2) addition, "getpw" was a horrible "hole" that triggered on libc trying to open /var/run/ypbind.lock, so pledge had to BYPASSUNVEIL accesses to this file.  (Theo de Raadt)
We accepted the opening of that file for a small period for build cross-over, but that waiting period ends now.
2022-08-06  Refactor readdoslabel() into a more readable form using various helper functions.  (Kenneth R Westerback)
The refactored code ensures disklabels are read from/written to disk only from/to unused space or an OpenBSD partition. This prevents accidental damage to filesystems that start immediately following an MBR or GPT. The refactored code also finds the disklabel present on the i386/amd64 floppyXX.img, rather than spoofing the media as a single MSDOS partition. Tweak and positive comments from jmatthew@
2022-08-06  blist: use swblk_t type (defined in sys/blist.h)  (Sebastien Marie)
reduce the diff with DragonFlyBSD by using the swblk_t and u_swblk_t types. while here, move the bitmap type (u_swblk_t) to u_int64_t on all archs. it makes the regress test the same on 64-bit and 32-bit archs (and it succeeds on both). ok mpi@
2022-08-02  some ports bootstraps, and go internals, need a bit more time to adapt to the padded syscalls going away.  (Theo de Raadt)
2022-08-01  sync  (Theo de Raadt)
2022-08-01  some ports bootstraps, and go internals, need a bit more time to adapt to the padded syscalls going away.  (Theo de Raadt)
2022-07-29  Replace the swap extent(9) usage by a blist data structure.  (Sebastien Marie)
It makes uvm_swap_free() faster: extents have a cost of O(n*n), which doesn't really scale with gigabytes of swap. Based on initial work from mpi@. The blist implementation comes from DragonFlyBSD. The diff also adds a ddb(4) 'show swap' command to show the blist and help debugging, and fixes some off-by-one errors in the sizes printed during hibernate. ok mpi@
2022-07-26  Only allow changing the domainname (from empty) before securelevel increase.  (Theo de Raadt)
libc YP support has a couple of places where the domainname is cached, and this results in wildly incoherent behaviour which could even be risky. If you want to change the domainname, you will have to reboot. ok beck miod
2022-07-25  Replace selwakeup() with KNOTE() in socket event activation  (Visa Hankala)
Let's try this again now that the kernel locking issue in nfsrv_rcv() has been fixed. The previous attempt of the conversion triggered hangs on NFS servers. This was probably caused by the removal of the kernel-locked section just prior to the socket upcall. The section had masked a locking error in NFS code.
2022-07-23  timecounting: use full 96-bit product when computing elapsed time  (Scott Soule Cheloha)
The timecounting subsystem computes elapsed time by scaling (64 bits) the difference between two counter values (32 bits at most) up into a struct bintime (128 bits). Under normal circumstances it is sufficient to do this with 64-bit multiplication, like this:

	struct bintime bt;

	bt.sec = 0;
	bt.frac = th->tc_scale * tc_delta(th);

However, if tc_delta() exceeds 1 second's worth of counter ticks, that multiplication overflows. The result is that the monotonic clock appears to jump backwards.

When can this happen? In practice, I have seen it when trying to compile LLVM on an EdgeRouter Lite when using an SD card as the backing disk. The box gets stuck in swap, the hardclock(9) is delayed, and we appear to "lose time".

To avoid this overflow we need to compute the full 96-bit product of the delta and the scale. This commit adds TIMECOUNT_TO_BINTIME(), a function for computing that full product, to sys/time.h. The patch puts the new function to use in lib/libc/sys/microtime.c and sys/kern/kern_tc.c. (The commit also reorganizes some of our high resolution bintime code so that we always read the timecounter first.)

Doing the full 96-bit multiplication is between 0% and 15% slower than doing the cheaper 64-bit multiplication on amd64. Measuring a precise difference is extremely difficult because the computation is already quite fast. I would guess that the cost is slightly higher than that on 32-bit platforms. Nobody ever volunteered to test, so this remains a guess.

Thread: https://marc.info/?l=openbsd-tech&m=163424607918042&w=2
6 month bump: https://marc.info/?l=openbsd-tech&m=165124251401342&w=2

Committed after 9 months without review.
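A self-contained sketch of the full-product computation the message describes: the 0.64 fixed-point scale is split into 32-bit halves, so the product of a 32-bit delta and the 64-bit scale (up to 96 bits) can be assembled from two 64-bit multiplications. This mirrors the idea behind TIMECOUNT_TO_BINTIME() but is not claimed to be its exact implementation:

	#include <stdint.h>

	struct bintime_sketch {
		int64_t  sec;
		uint64_t frac;		/* units of 2^-64 seconds */
	};

	static void
	delta_to_bintime(uint32_t delta, uint64_t scale, struct bintime_sketch *bt)
	{
		uint64_t lo, hi;

		/* scale is seconds-per-tick as a 0.64 fixed-point fraction */
		lo = (scale & 0xffffffffULL) * delta;		/* low half of scale */
		hi = (scale >> 32) * delta + (lo >> 32);	/* high half plus carry */

		bt->sec = hi >> 32;				/* bits 64..95 of the product */
		bt->frac = (hi << 32) | (lo & 0xffffffffULL);	/* bits 0..63 */
	}

Neither partial product can overflow a uint64_t, so deltas larger than one second's worth of ticks now land in bt->sec instead of silently wrapping the fraction.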
2022-07-23  kernel: remove global "randompid" toggle  (Scott Soule Cheloha)
Apparently, we used to create several kthreads before the kernel random number generator was up and running. A toggle, "randompid", was needed to tell allocpid() whether it made sense to attempt to allocate random PIDs. However, these days we get e.g. arc4random(9) into a working state before any kthreads are spawned, so the toggle is no longer needed. Thread: https://marc.info/?l=openbsd-tech&m=165541052614453&w=2 Very nice historical context provided by miod@. probably ok miod@ deraadt@
2022-07-20  the _pad_ system calls from 2021/12/23 can go away  (Theo de Raadt)
ok guenther
2022-07-20  sync  (Theo de Raadt)
2022-07-20  the _pad_ system calls from 2021/12/23 can go away  (Theo de Raadt)
ok guenther
2022-07-18  Restrict pledge("vminfo") callers to read-only swapctl(2) operations.  (Jeremie Courreges-Anglas)
Those are the read-only operations allowed for non-root users: SWAP_NSWAP and SWAP_STATS. Users of pledge("vminfo") in base which also call swapctl(2) with said commands: top(1) and pstat(8). No regression spotted with top(1) and pstat(8) -s/-T. ok deraadt@
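A sketch of the read-only usage pattern that stays permitted under pledge("vminfo"), roughly what top(1) and pstat(8) do; error handling is minimal and the swapent field names are the documented ones from swapctl(2), used here on the assumption they have not changed:

	#include <sys/types.h>
	#include <sys/swap.h>
	#include <err.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct swapent *sep;
		int i, nswap;

		if (pledge("stdio vminfo", NULL) == -1)
			err(1, "pledge");

		/* SWAP_NSWAP: number of configured swap devices */
		if ((nswap = swapctl(SWAP_NSWAP, NULL, 0)) == -1)
			err(1, "swapctl");
		if (nswap == 0)
			return 0;

		if ((sep = calloc(nswap, sizeof(*sep))) == NULL)
			err(1, "calloc");

		/* SWAP_STATS: per-device usage, allowed without root */
		if (swapctl(SWAP_STATS, sep, nswap) == -1)
			err(1, "swapctl");
		for (i = 0; i < nswap; i++)
			printf("%s: %d/%d blocks in use\n",
			    sep[i].se_path, sep[i].se_inuse, sep[i].se_nblks);
		free(sep);
		return 0;
	}

Any other swapctl(2) command from a process pledged only "vminfo" falls outside the promise and is no longer allowed.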
2022-07-18  Delete the YPACTIVE toggling code when "getpw" code access/open are done to /var/run/ypbind.lock.  (Theo de Raadt)
"getpw" now only allows ypconnect(2) and the minimum unveil bypasses. Still allow open/access to the file for a little while, because getpwent/getgrent/etc were opening it unconditionally to hint for YPACTIVE. That code should be deleted before 7.2
2022-07-18  the domainname is under root control, but because we are producing a path inside ypconnect(), it is best if we prevent "../" problems.  (Theo de Raadt)
so reject domainnames containing '/'. discussed with jca
2022-07-18  For opening up the bindings file in ypconnect(2), bail out early if chrooted.  (Theo de Raadt)
issue pointed out by semarie
2022-07-17  backout last step: the path checks are too strong until everyone has a new libc..  (Theo de Raadt)
2022-07-17  the PLEDGE_YPACTIVE "hack" bit related to "getpw" pledge goes away.  (Theo de Raadt)
libc no longer accesses /var/run/ypbind.lock to trigger extra permissions for userland-opening of files & sockets to engage with ypserver for YP/LDAP lookups. libc now uses the super secret special ypconnect() system call to perform socket-setup. Delete some other things which are no longer reached via libc/rpc ok jmatthew, miod
2022-07-15  Allow ypconnect() in "getpw"  (Theo de Raadt)
Annotate two blocks relating to ypbind.lock that will be deleted once libc switches over to the new mechanism.
2022-07-15  sync  (Theo de Raadt)
2022-07-15  pledge "getpw" would notice access to /var/run/ypbind.lock, and grant "inet" rights, so that libc/yp could access YP services via a fairly complex 'protocol' including file access, sockets, etc.  (Theo de Raadt)
This YP protocol is also used by ypldap -- this is our way of bringing 'NIS' services into libc without monster sub-libraries. I have managed to remove this "inet" right by creating a new ypconnect() system call, which performs parts of the yp_bind.c dance inside the kernel. It checks if the domainname is set, looks for a binding file with an advisory lock, reads it to get the IP and udp/tcp port numbers, and then establishes a connected socket direct to that ypserv. This socket has a SS_YP flag set, and non-required system calls are prohibited. libc maintains the lifetime of this socket so a process should never see it, but it seems safer to block udp re-connect and other calls even in non-pledge mode. Userland changes to use this will follow in a few days. Lots of help from claudio and jmatthew, also ok miod
2022-07-10  Remove trailing whitespace. No code change.  (Mike Larkin)
2022-07-09  Unwrap klist from struct selinfo as this code no longer uses selwakeup().  (Visa Hankala)
OK jsg@
2022-07-05  Remove old poll/select wakeup mechanism.  (Visa Hankala)
Also remove unneeded seltrue() and selfalse(). OK mpi@ jsg@
2022-07-02  Unlock peer in the SOCK_STREAM and SOCK_SEQPACKET error path.  (Vitaliy Makkoveev)
Reported-by: syzbot+a648408d6a58fd40b59a@syzkaller.appspotmail.com by anton@
2022-07-02  Remove unused device poll functions.  (Visa Hankala)
Also remove unneeded includes of <sys/poll.h> and <sys/select.h>. Some addenda from jsg@. OK miod@ mpi@
2022-07-01  Make fine-grained unix(4) domain socket locking.  (Vitaliy Makkoveev)
Use the per-socket `so_lock' rwlock(9) instead of the global `unp_lock' which locks the whole layer. The PCBs of unix(4) sockets are linked to each other and we need to lock them both. This introduces a lock ordering problem, because when thread (1) holds the lock on `so1' and tries to lock `so2', thread (2) could hold the lock on `so2' and be trying to lock `so1'. To solve this we always lock sockets in a strict order. For sockets which are already accessible from userland, we always lock the socket with the smallest memory address first. Sometimes we need to unlock a socket before locking its peer and then lock it again. We use reference counters to prevent destruction of the connected peer during the relock. We also handle the case where the peer socket was replaced by another socket. For newly connected sockets, which are not yet exported to userland by accept(2), we always lock the listening socket `head' first. This allows us to avoid an unwanted relock within the accept(2) syscall. ok claudio@
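A minimal sketch of the address-ordering rule for the already-accepted case; `so_lock' is the per-socket rwlock(9) the message introduces, while the helper name itself is only illustrative:

	#include <sys/rwlock.h>
	#include <sys/socketvar.h>

	/*
	 * Lock two connected unix(4) sockets without deadlocking by
	 * always taking the one at the lower memory address first.
	 */
	static void
	unp_lock_pair(struct socket *so1, struct socket *so2)
	{
		if (so1 == so2) {
			rw_enter_write(&so1->so_lock);
			return;
		}
		if (so1 < so2) {
			rw_enter_write(&so1->so_lock);
			rw_enter_write(&so2->so_lock);
		} else {
			rw_enter_write(&so2->so_lock);
			rw_enter_write(&so1->so_lock);
		}
	}

Since both threads agree on the same global order (the sockets' addresses), the circular wait described above cannot form.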