path: root/sys/kern
Age  Commit message  Author
2024-04-17  dogetrusage() must be called with the KERNEL_LOCK held for now. (Claudio Jeker)
OK mpi@
2024-04-15  Don't take solock() in soreceive() for udp(4) sockets. (Vitaliy Makkoveev)
These sockets are not connection oriented, they don't call pru_rcvd(), but they have splicing ability and they set `so_error'. The splicing ability is the biggest problem. However, we can hold `sb_mtx' around `ssp_socket' modifications together with solock(), so `sb_mtx' is enough for the isspliced() check in soreceive(). The unlocked `so_sp' dereference is fine, because we set it only once for the whole socket lifetime and we do this before the `ssp_socket' assignment. We also need to take sblock() before splicing sockets, so sosplice() and soreceive() are both serialized. Since `sb_mtx' is required to unsplice sockets too, it also serializes somove() with soreceive() regardless of the somove() caller. sosplice() was reworked to accept a standalone sblock() for udp(4) sockets. soreceive() performs unlocked `so_error' checks and modification. Previously, we had no ability to predict which concurrent soreceive() or sosend() thread would fail and clean `so_error'. With this unlocked access we could have sosend() and soreceive() threads which fail together. `so_error' is stored to the local `error2' variable because `so_error' could be overwritten by a concurrent sosend() thread. Tested and ok bluhm
2024-04-15  Regen after sigsuspend and __thrsigdivert unlock (Claudio Jeker)
2024-04-15  sigsuspend and __thrsigdivert no longer require the KERNEL_LOCK since (Claudio Jeker)
dosigsuspend() no longer needs it. OK mvs@ mpi@
2024-04-13  correct indentation (Jonathan Gray)
no functional change, found by smatch warnings ok miod@ bluhm@
2024-04-12  Split single TCP inpcb table into IPv4 and IPv6 parts. (Alexander Bluhm)
With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-04-11  Don't take solock() in soreceive() for SOCK_RAW inet sockets. (Vitaliy Makkoveev)
For inet sockets solock() is the netlock wrapper, so soreceive() can be performed simultaneously with exclusively locked code paths. These sockets are not connection oriented, they don't call pru_rcvd(), they can't be spliced, and they don't set `so_error'. There is nothing to protect with solock() in the soreceive() path. The `so_rcv' buffer is protected by the `sb_mtx' mutex(9), but since it is released, sblock() is required to serialize concurrent soreceive() and sorflush() threads. The current sblock() is a kind of rwlock(9) implementation, so introduce the `sb_lock' rwlock(9) and use it directly for that purpose. sorflush() and its callers were refactored to avoid solock() for raw inet sockets. This was done to avoid stopping packet processing. Tested and ok bluhm.
2024-04-11  Take solock_shared() in soo_stat(). (Vitaliy Makkoveev)
Only unix(4) and tcp(4) sockets set a (*pru_sense)() handler. The rest of soo_stat() is read-only access. ok bluhm
2024-04-10  Remove `head' socket re-locking in sonewconn(). (Vitaliy Makkoveev)
uipc_attach() releases solock() because it should be taken after the `unp_gc_lock' rwlock(9) which protects the `unp_link' list. For this reason, the listening `head' socket should be unlocked too while sonewconn() calls uipc_attach(). This could be reworked because the `so_rcv' sockbuf now relies on the `sb_mtx' mutex(9). The last `unp_link' foreach loop within unp_gc() discards sockets previously marked as UNP_GCDEAD. These sockets are not accessed from userland. The only exception is the sosend() threads of connected sending peers, but they only sbappend*() mbuf(9) to `so_rcv'. So it's enough to unlink the mbuf(9) chain with `sb_mtx' held and discard it locklessly. Please note, the existing SS_NEWCONN_WAIT logic was never used because the listening unix(4) socket is protected from concurrent unp_detach() by the vnode(9) lock; however, `head' was re-locked every time. ok bluhm
2024-04-10  Unlock dosigsuspend() and with that some aspects of ppoll and pselect (Claudio Jeker)
Change p_sigmask from atomic back to non-atomic updates. All changes to p_sigmask are only allowed by curproc (the owner). There is no need for atomic instructions here. p_sigmask is mostly accessed by curproc, with the exception of ptsignal(). In ptsignal() p_sigmask is now only read once unless an SSLEEP proc gets the signal. In that case, recheck p_sigmask before wakeup to ensure that no unnecessary wakeup happens. Add some KASSERT(p == curproc) to ensure this precondition. sigabort() is special since it is also called by ddb, but apart from that it only works for curproc. With and OK mvs@ OK mpi@
2024-04-05  sync (Theo de Raadt)
2024-04-05  msyscall(2) goes away (Theo de Raadt)
2024-04-05  no one calls msyscall() anymore. (Theo de Raadt)
2024-04-02  Implement SO_ACCEPTCONN in getsockopt(2) (Claudio Jeker)
Requested by robert@ OK mvs@ millert@ deraadt@
2024-04-02  Remove wrong "temporary udp error" comment in filt_so{read,write}(). Not (Vitaliy Makkoveev)
only udp(4) sockets set and check `so_error'. No functional changes. ok bluhm
2024-04-02  Delete the msyscall mechanism entirely, since mimmutable+pinsyscalls has (Theo de Raadt)
replaced it with a more strict mechanism, which happens to be lockless O(1) rather than micro-lock O(1)+O(log N). Also nop-out the sys_msyscall(2) guts, but leave the syscall around for a bit longer so that people can build through it, since ld.so(1) still wants to call it.
2024-04-02  remove useless whitespace; from Jia Tan (Theo de Raadt)
2024-03-31  Allow listen(2) only on sockets of type SOCK_STREAM or SOCK_SEQPACKET. (Vitaliy Makkoveev)
The listen(2) man(1) page clearly prohibits sockets of other types. Reported-by: syzbot+00450333592fcd38c6fe@syzkaller.appspotmail.com ok bluhm
2024-03-31  Mark `so_rcv' sockbuf of udp(4) sockets as SB_OWNLOCK. (Vitaliy Makkoveev)
sbappend*() and soreceive() of SB_MTXLOCK-marked sockets use the `sb_mtx' mutex(9) for protection, meanwhile the buffer usage check and the corresponding sbwait() sleep are still serialized by solock(). Mark udp(4) as SB_OWNLOCK to avoid solock() serialization and rely on the `sb_mtx' mutex(9). The `sb_state' and `sb_flags' modifications must be protected by `sb_mtx' too. ok bluhm
2024-03-30  Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled. (Martin Pieuchot)
Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside the loop iterating over sleeping threads as part of wakeup_proc(). When such tracepoints were enabled they could result in another wakeup(9), possibly corrupting the sleepqueue. Rewrite wakeup(9) in two stages: first dequeue threads from the sleepqueue, then call setrunnable() and possible tracepoints for each of them. This requires moving unsleep() outside of setrunnable() because it messes with the sleepqueue. ok claudio@
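The two-stage rewrite described above can be sketched in userland C. This is a minimal illustration, not the kernel's code: struct sleeper and wakeup_two_stage() are hypothetical stand-ins for the sleepqueue entries and wakeup(9), showing why the queue is fully consistent before any per-thread work (setrunnable(), tracepoints) runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sleepqueue entry. */
struct sleeper {
	struct sleeper *next;
	const void *chan;	/* wait channel, like p_wchan */
	int running;
};

/*
 * Stage 1: unlink every thread sleeping on `chan' from the queue.
 * Stage 2: mark each one runnable only after the queue is consistent,
 * so work triggered per-thread (e.g. a tracepoint causing a nested
 * wakeup) cannot corrupt the queue mid-iteration.
 * Returns the number of threads woken.
 */
static int
wakeup_two_stage(struct sleeper **queue, const void *chan)
{
	struct sleeper *batch = NULL, **pp = queue, *s;
	int n = 0;

	/* Stage 1: dequeue matching sleepers onto a private list. */
	while ((s = *pp) != NULL) {
		if (s->chan == chan) {
			*pp = s->next;
			s->next = batch;
			batch = s;
		} else
			pp = &s->next;
	}
	/* Stage 2: per-thread work; the shared queue is untouched. */
	while ((s = batch) != NULL) {
		batch = s->next;
		s->next = NULL;
		s->running = 1;	/* setrunnable() + tracepoints here */
		n++;
	}
	return n;
}
```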
2024-03-29  Remove one global variable duplicating uvmexp.pagesize. (Miod Vallat)
ok guenther@ deraadt@
2024-03-28  sys (Theo de Raadt)
2024-03-28  Delete pinsyscall(2) [which was specific only to SYS_execve] now (Theo de Raadt)
that it has been replaced with pinsyscalls(2) [which tells the kernel the location of all system calls in libc.so]. Floated to various people before release, but it was prudent to wait.
2024-03-27  Introduce SB_OWNLOCK to mark sockets which `so_rcv' buffer modified (Vitaliy Makkoveev)
outside socket lock. The `sb_mtx' mutex(9) is used for this case and it should not be released between the `so_rcv' usage check and the corresponding sbwait() sleep. Otherwise wakeup() could sometimes be lost. ok bluhm
2024-03-26  Improve spinning in mtx_enter(). (Alexander Bluhm)
Instead of calling mtx_enter_try() in each spinning loop, do it only if the result of a lockless read indicates that the mutex has been released. This avoids some expensive atomic compare-and-swap operations. Up to 5% reduction of spinning time during kernel build can be seen on an 8-core amd64 machine. On other machines there was no visible effect. A test on powerpc64 revealed a bug in the mtx_owner declaration: not the variable was volatile, but the object it points to was. Move the volatile declaration in struct mutex to avoid a hang when going to multiuser. from Mateusz Guzik; input kettenis@ jca@; OK mpi@
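The spin-then-try idea can be sketched as a userland spinlock. All names here (spin_mtx, spin_mtx_enter, ...) are illustrative, not the kernel's; C11 atomics stand in for the volatile-qualified owner pointer that the commit's volatile fix concerns.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative mutex: the owner pointer is NULL when the lock is free. */
struct spin_mtx {
	_Atomic(void *) mtx_owner;
};

/* One expensive atomic compare-and-swap attempt. */
static int
spin_mtx_try(struct spin_mtx *m, void *self)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&m->mtx_owner,
	    &expected, self);
}

/* Spin with cheap plain loads; only attempt the CAS once the
 * lockless read observes the mutex as released. */
static void
spin_mtx_enter(struct spin_mtx *m, void *self)
{
	for (;;) {
		if (atomic_load_explicit(&m->mtx_owner,
		    memory_order_relaxed) == NULL &&
		    spin_mtx_try(m, self))
			return;
		/* A real kernel would pause the CPU here. */
	}
}

static void
spin_mtx_leave(struct spin_mtx *m)
{
	atomic_store(&m->mtx_owner, NULL);
}
```

Under contention this keeps the cache line in shared state while spinning, instead of bouncing it with failed CAS attempts.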
2024-03-26  Use `sb_mtx' to protect `so_rcv' receive buffer of unix(4) sockets. (Vitaliy Makkoveev)
This makes re-locking unnecessary in the uipc_*send() paths, because it's enough to lock one socket to prevent the peer from concurrent disconnection. As a little bonus, one unix(4) socket can perform simultaneous transmission and reception, with one exception for uipc_rcvd(), which still requires the re-lock for connection oriented sockets. The socket lock is not held while filt_soread() and filt_soexcept() are called from uipc_*send() through sorwakeup(). However, the unlocked access to `so_options', `so_state' and `so_error' is fine. The receiving socket can't be or become a listening socket. It also can't be disconnected concurrently. This makes the SO_ACCEPTCONN, SS_ISDISCONNECTED and SS_ISCONNECTED bits immutable, which are cleared and set respectively. `so_error' is set on the peer sockets only by unp_detach(), which also can't be called concurrently on the sending socket. This is also true for filt_fiforead() and filt_fifoexcept(). For other callers like kevent(2) or doaccept() the socket lock is still held. ok bluhm
2024-03-25  Move the "no (hard) linking directories" and "no cross-mount links" (Philip Guenther)
checks from all the filesystems that support hardlinks at all into the VFS layer. Simplify the EPERM description in link(2). ok miod@ mpi@
2024-03-25  regen (Vitaliy Makkoveev)
2024-03-25  Unlock shutdown(2). (Vitaliy Makkoveev)
ok bluhm
2024-03-22  Use sorflush() instead of direct unp_scan(..., unp_discard) to discard (Vitaliy Makkoveev)
dead unix(4) sockets. The difference between direct unp_scan() and sorflush() is the mbuf(9) chain. In the first case it is still linked to `so_rcv'; in the second it is not. This is required to make the `sb_mtx' mutex(9) the only `so_rcv' sockbuf protection and to remove socket re-locking from most of the uipc_*send() paths. The unlinked mbuf(9) chain doesn't require any protection, so this allows performing the sleeping unp_discard() locklessly. Also, the mbuf(9) chain of the discarded socket still contains addresses of file descriptors and it is much safer to unlink it before FRELE()ing them. This is the reason to commit this diff standalone. ok bluhm
2024-03-22  pledge: Allow the AUDIO_GETDEV ioctl in "audio" (Alexandre Ratchov)
ok deraadt, kn, phessler
2024-03-17  Do UNP_CONNECTING and UNP_BINDING flags check in uipc_listen() and (Vitaliy Makkoveev)
return EINVAL if set. This prevents a concurrent solisten() thread from making this socket listening while the socket is unlocked. Reported-by: syzbot+4acfcd73d15382a3e7cf@syzkaller.appspotmail.com ok mpi
2024-03-05  Revert m_defrag() mbuf alignment to IP header. (Alexander Bluhm)
m_defrag() is intended as a last resort to make DMA transfers to the hardware. Therefore page alignment is more important than IP header alignment. The reason why the mbuf returned by m_defrag() was switched to IP header alignment was that ether_extract_headers() failed in the em(4) driver with TSO on sparc64. This has been fixed by using memcpy(). The alignment change in m_defrag() is too late in the 7.5 release process. It may affect several drivers on different architectures. The bus dmamap for ixl(4) on sun4v expects page alignment. Such alignment issues and TSO mbuf mapping for IOMMU need more thought. OK deraadt@
2024-03-01  Protect pool_get() with kernel lock in sys_ypconnect(). (Alexander Bluhm)
Pool namei_pool is initialized with IPL_NONE as the filesystem always runs with the kernel lock. So pool_get() also needs the kernel lock in sys_ypconnect(). OK kn@ deraadt@
2024-02-28  No need to kick a CPU twice when putting a thread on its runqueue. (Martin Pieuchot)
From Christian Ludwig, ok claudio@
2024-02-25  clockintr: rename "struct clockintr_queue" to "struct clockqueue" (Scott Soule Cheloha)
The code has outgrown the original name for this struct. Both the external and internal APIs have used the "clockqueue" namespace for some time when operating on it, and that name is eyeball-consistent with "clockintr" and "clockrequest", so "clockqueue" it is.
2024-02-25  clockintr.h, kern_clockintr.c: add 2023, 2024 to copyright range (Scott Soule Cheloha)
2024-02-25  New accounting flag ABTCFI to indicate signal SIGILL + code ILL_BTCFI (Theo de Raadt)
has occurred in the process. ok various people
2024-02-24  clockintr: rename clockqueue_reset_intrclock to clockqueue_intrclock_reprogram (Scott Soule Cheloha)
The function should be in the clockqueue_intrclock namespace. Also, "reprogram" is a better word for what the function actually does.
2024-02-23  timecounting: start system uptime at 0.0 instead of 1.0 (Scott Soule Cheloha)
OpenBSD starts the system uptime clock at 1.0 instead of 0.0. We inherited this behavior from FreeBSD when we imported kern_tc.c. patrick@ reports that this causes a problem in sdmmc(4) during boot: the sdmmc_delay() call in sdmmc_init() doesn't block for the full 250ms. This happens because the system hardclock() starts at 0.0 and executes about hz times, rapidly, to "catch up" to 1.0. This instantly expires the first hz timeout ticks, hence the short sleep. Starting the system uptime at 0.0 fixes the problem. Prompted by patrick@. Tested by patrick@. In snaps since Feb 19 2023. Thread: https://marc.info/?l=openbsd-tech&m=170830229732396&w=2 ok patrick@ deraadt@
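The catch-up behaviour described above amounts to simple arithmetic; the sketch below is illustrative (catchup_ticks is a hypothetical helper, not kernel code): a clock that starts at `start_seconds' runs hardclock() roughly start_seconds * hz extra times at boot, instantly expiring that many timeout ticks.

```c
/* Number of "catch-up" hardclock() runs at boot when the uptime
 * clock starts at start_seconds instead of 0.0. */
static int
catchup_ticks(int start_seconds, int hz)
{
	return start_seconds * hz;
}
```

With the old 1.0 start and HZ=100, the first ~100 timeout ticks expired immediately, which is what cut the sdmmc_delay() sleep short; starting at 0.0 makes this zero.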
2024-02-23  timeout: make to_kclock validation more rigorous (Scott Soule Cheloha)
In kern_timeout.c, the to_kclock checks are not strict enough to catch all plausible programmer mistakes. Tighten them up:
- timeout_set_flags: KASSERT that kclock is valid
- timeout_abs_ts: KASSERT that to_kclock is KCLOCK_UPTIME
We can also add to_kclock validation to softclock() and db_show_timeout(), which may help to debug memory corruption:
- softclock: panic if to_kclock is not KCLOCK_NONE or KCLOCK_UPTIME
- db_show_timeout: print warning if to_kclock is invalid
Prompted by bluhm@ in response to a syzbot panic. Hopefully these changes help to narrow down the root cause. Link: https://syzkaller.appspot.com/bug?extid=49d3f7118413963f651a Reported-by: syzbot+49d3f7118413963f651a@syzkaller.appspotmail.com ok bluhm@
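The checks above can be sketched in userland C with assert() standing in for KASSERT. The struct and constants are illustrative stand-ins, not the kernel's definitions:

```c
#include <assert.h>

/* Illustrative kclock values mirroring kern_timeout.c's scheme. */
#define KCLOCK_NONE	(-1)	/* tick-based timeout */
#define KCLOCK_UPTIME	0	/* uptime clock */
#define KCLOCK_MAX	1

struct timeout_sketch {
	int to_kclock;
};

/* timeout_set_flags-style check: the kclock must be a valid value. */
static void
timeout_set_kclock(struct timeout_sketch *to, int kclock)
{
	assert(kclock == KCLOCK_NONE ||
	    (kclock >= 0 && kclock < KCLOCK_MAX));
	to->to_kclock = kclock;
}

/* timeout_abs_ts-style check: only KCLOCK_UPTIME timeouts may be
 * scheduled at an absolute time. */
static void
timeout_abs_check(const struct timeout_sketch *to)
{
	assert(to->to_kclock == KCLOCK_UPTIME);
}
```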
2024-02-21  Keep mbuf data alignment intact in m_defrag() (Claudio Jeker)
The recent TSO support in em(4) triggered an alignment error on the TCP header. In em(4) m_defrag() is called before setting up the TSO dma bits, and with that the TCP header was suddenly no longer aligned. Like other mbuf functions, preserve the data alignment in m_defrag() to prevent such unaligned packets. With help and OK bluhm@ mglocker@
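The alignment-preserving idea can be sketched as follows. This is a minimal illustration, not the mbuf code: aligned_copy_dst and ALIGN_MASK are hypothetical names. When copying packet data into a fresh buffer, keeping the payload at the same offset modulo the platform alignment means a header that was aligned before the copy stays aligned after it.

```c
#include <stdint.h>

/* Illustrative platform alignment mask (commonly sizeof(long) - 1). */
#define ALIGN_MASK	(sizeof(long) - 1)

/* Pick the destination inside a fresh, aligned buffer so the copied
 * data keeps the old data pointer's offset within an alignment unit. */
static void *
aligned_copy_dst(void *new_base, uintptr_t old_data)
{
	/* new_base is assumed to be at least long-aligned. */
	return (char *)new_base + (old_data & ALIGN_MASK);
}
```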
2024-02-14  Enable the pool gc thread on m88k MULTIPROCESSOR kernels now that (Miod Vallat)
pmap_unmap_direct() has been fixed; also tested by aoyama@
2024-02-12  Pass protosw instead of domain structure to soalloc() to get real (Vitaliy Makkoveev)
`pr_type'. The corresponding domain is referenced as `pr_domain'. Otherwise dp->dom_protosw->pr_type of inet sockets always points to inetsw[0]. ok bluhm
2024-02-12  kernel: disable hardclock() on secondary CPUs (Scott Soule Cheloha)
There is no useful work left for secondary CPUs to do in hardclock(). Disable cq_hardclock on secondary CPUs and remove the now-unnecessary early-return from hardclock(). This change reduces every system's normal clock interrupt rate by (HZ - HZ/10) per secondary CPU. For example, an 8-core machine with a HZ=100 kernel should see its clock interrupt rate drop from ~1600 to ~960. Thread: https://marc.info/?l=openbsd-tech&m=170750140915898&w=2 ok kettenis@
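The rate arithmetic above can be spelled out. This helper is illustrative only; it counts just the hardclock() saving, while the quoted ~1600 to ~960 totals include the machine's other periodic clock interrupts as well:

```c
/* Interrupts per second saved: each secondary CPU's hardclock() rate
 * drops from hz to hz/10, i.e. a saving of (hz - hz/10) per CPU. */
static int
clockintr_rate_saving(int ncpu, int hz)
{
	return (ncpu - 1) * (hz - hz / 10);
}
```

For 8 CPUs at HZ=100 this gives 630 interrupts/s saved, consistent with the quoted drop from ~1600 to ~960.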
2024-02-11  Release `sb_mtx' mutex(9) before sbunlock(). (Vitaliy Makkoveev)
ok bluhm
2024-02-11  Use `sb_mtx' instead of `inp_mtx' in receive path for inet sockets. (Vitaliy Makkoveev)
In soreceive(), we only touch the `so_rcv' socket buffer, which has its own `sb_mtx' mutex(9) for protection. So we can avoid solock() in this path - it's enough to hold `sb_mtx' in soreceive() and around the corresponding sbappend*(). But not right now :) Currently we use the shared netlock for some inet sockets in the soreceive() path. To protect the `so_rcv' buffer we use the `inp_mtx' mutex(9) and the pru_lock() to acquire this mutex(9) in the socket layer. But the `inp_mtx' mutex belongs to the PCB. We initialize the socket before the PCB, and tcp(4) sockets can exist without a PCB, so use the `sb_mtx' mutex(9) to protect sockbuf stuff. This diff mechanically replaces `inp_mtx' with `sb_mtx' in the receive path, only for sockets which already use `inp_mtx'. All other sockets are left as is; they will be converted later. Since `sb_mtx' is optional, the new SB_MTXLOCK flag is introduced. If this flag is set in `sb_flags', the `sb_mtx' mutex(9) should be taken. The new sb_mtx_lock() and sb_mtx_unlock() were introduced to hide this check. They are temporary and will be replaced by mtx_enter() when all of this area is converted to the `sb_mtx' mutex(9). Also, the new sbmtxassertlocked() function is introduced to throw the corresponding assertion for SB_MTXLOCK-marked buffers. Currently only sbappendaddr() calls it. This function is also temporary and will be replaced by MTX_ASSERT_LOCKED() later. ok bluhm
2024-02-10  On kernels without ucom(4) support, 'sysctl hw.ucomnames' should return (Theo de Raadt)
the empty string, rather than an error. ok krw
2024-02-09  dt(4): move interval/profile entry points to dedicated clockintr callback (Scott Soule Cheloha)
To improve the utility of dt(4)'s interval and profile probes we need to move the probe entry points from the fixed-frequency hardclock() to a dedicated clock interrupt callback so that the probes can fire at arbitrary frequencies.
- Remove entry points for interval/profile probes from hardclock().
- Merge dt_prov_profile_enter(), dt_prov_interval_enter(), and dt_prov_profile_fire() into one function, dt_clock(). This is the now-unified callback for interval/profile probes. dt_clock() will consume multiple events during a single execution if it is delayed, but on platforms with high quality interrupt clocks this should be rare.
- Each struct dt_pcb gets its own clockintr handle, dp_clockintr.
- In struct dt_pcb, replace dp_maxtick/dp_nticks with dp_nsecs, the PCB's sampling period. Asynchronous probes must initialize dp_nsecs to a non-zero value during dtpv_alloc().
- In struct dt_pcb, replace dp_cpuid with dp_cpu so that dt_ioctl_record_start() knows where to bind the PCB's dp_clockintr.
- dt_ioctl_record_start() binds, staggers, and starts all interval/profile PCBs on the given dt_softc. Each dp_clockintr is given a reference to its enclosing PCB so that dt_clock() doesn't need to search for it. The staggering sort-of simulates the current behavior under hardclock().
- dt_ioctl_record_stop() unbinds all interval/profile PCBs. The CL_BARRIER ensures that dp_clockintr's PCB reference is not in use by dt_clock() so that the PCB may be safely freed upon return from dt_ioctl_record_stop(). Blocking while holding dt_lock is not ideal, but in practice blocking in this spot is rare and dt_clock() completes quickly on all but the oldest hardware. An extremely unlucky thread could block for every interval/profile PCB on the softc, but this is implausible.
DT_FA_PROFILE values are up-to-date for amd64, i386, and macppc. Somebody with the right hardware needs to check-and-maybe-fix the values on octeon, powerpc64, and sparc64. Joint effort with mpi@.
Thread: https://marc.info/?l=openbsd-tech&m=170629371821879&w=2 ok mpi@
2024-02-09  clockintr: add clockintr_unbind() (Scott Soule Cheloha)
The clockintr_unbind() function cancels any pending execution of the given clock interrupt object's callback and severs the binding between the object and its host CPU. Upon return from clockintr_unbind(), the clock interrupt object may be rebound with a call to clockintr_bind(). The optional CL_BARRIER flag tells clockintr_unbind() to block if the clockintr's callback function is executing at the moment of the call. This is useful when the clockintr's arg is a shared reference and the caller needs to be certain the reference is inactive. Now that clockintrs can be bound and unbound repeatedly, there is more room for error. To help catch programmer errors, clockintr_unbind() sets cl_queue to NULL. Calls to other API functions after a clockintr is unbound will then fault on a NULL dereference. clockintr_bind() also KASSERTs that cl_queue is NULL to ensure the clockintr is not already bound. These checks are not perfect, but they do catch some common errors. With input from mpi@. Thread: https://marc.info/?l=openbsd-tech&m=170629367121800&w=2 ok mpi@
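The bind/unbind lifecycle and its error-catching can be modelled in a few lines. The *_sketch names below are hypothetical stand-ins for the clockintr API, showing how poisoning cl_queue on unbind turns use-after-unbind into an immediate assertion instead of silent misuse:

```c
#include <assert.h>
#include <stddef.h>

struct clockqueue_sketch { int cpu; };

struct clockintr_sketch {
	struct clockqueue_sketch *cl_queue;	/* NULL when unbound */
	void (*cl_func)(void *);
	void *cl_arg;
};

static void
clockintr_bind_sketch(struct clockintr_sketch *cl,
    struct clockqueue_sketch *cq, void (*fn)(void *), void *arg)
{
	/* Catch double-bind, as the real clockintr_bind() KASSERTs. */
	assert(cl->cl_queue == NULL);
	cl->cl_queue = cq;
	cl->cl_func = fn;
	cl->cl_arg = arg;
}

static void
clockintr_unbind_sketch(struct clockintr_sketch *cl)
{
	/* Poison the queue pointer so later API calls fault or assert
	 * instead of silently using a stale binding. */
	cl->cl_queue = NULL;
}

static void
clockintr_schedule_sketch(struct clockintr_sketch *cl)
{
	assert(cl->cl_queue != NULL);	/* must be bound */
	/* scheduling on cl->cl_queue would happen here */
}
```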