path: root/sys/kern
Age  Commit message  Author
2024-01-10  Split UDP PCB table into IPv4 and IPv6.  Alexander Bluhm
Having two hash tables instead of a common one reduces table size and contention on the per-table lock. The address family is always known in advance. The lookups and loops are more specific. OK sashan@
2024-01-07  Error out if one syscall ever takes more than 6 arguments.  Miod Vallat
This is not necessarily wrong per se, but would need special consideration, as not all platforms are currently able to process more than six syscall arguments (and upcoming diffs will rely upon reasonably-sized argument lists), so better break now and reconsider later if need be. ok deraadt@
2024-01-03  Run connect(2) in parallel within inet domain.  Alexander Bluhm
This unlocks soconnect() for UDP, rip, rip6 and divert. It takes shared net lock in combination with per socket lock. TCP and GRE still use exclusive net lock when connecting. OK mvs@
2024-01-01  copyright++;  Jonathan Gray
2023-12-21  Remove logic and comments related to INDIR now that they aren't supported anymore.  Miod Vallat
ok tb@ deraadt@, no need to regen anything
2023-12-19  Release inpcb mutex while calling sbwait().  Alexander Bluhm
As sbwait() may sleep, holding any mutex is not allowed. Call pru_unlock() before sbwait() in soreceive(). Bug spotted by sashan@; OK sashan@ mvs@
2023-12-19  sync  Theo de Raadt
2023-12-19  the 4th argument of pinsyscalls() is now "number of pin elements", not "size of the storage of the pin elements"  Theo de Raadt
2023-12-19  soreceive() must not hold mutex when calling sblock().  Alexander Bluhm
In my recent commit I missed that sblock() may sleep while soreceive() holds the inpcb mutex. Call pru_lock() after sblock(). Reported-by: syzbot+f79c896ec019553655a0@syzkaller.appspotmail.com Reported-by: syzbot+08b6f1102e429b2d4f84@syzkaller.appspotmail.com OK mvs@
2023-12-18  Run bind(2) system call in parallel.  Alexander Bluhm
For protocols that care about locking, use the shared net lock to call sobind(). Use the per socket rwlock together with the shared net lock. This affects the protocols UDP, raw IP, and divert. Move the inpcb mutex locking into soreceive(); it is only used there. Add a comment to describe the current implementation of inpcb locking. OK mvs@ sashan@
2023-12-15  provide the pieces for ktrace/kdump to observe pinsyscall violations.  Theo de Raadt
(not used yet, because the pinsyscall changes are still being worked on) ok kettenis
2023-12-14  Workaround for broken clang which has a broken -fno-zero-initialized-in-bss implementation.  Claudio Jeker
Set nkmempages to -1 by default instead of 0 so that the value ends up in the data section. This way config(8) is able to alter the value as promised. See also: https://github.com/llvm/llvm-project/issues/74632 OK miod@
2023-12-14  Bring default logic to set nkmempages into the 21st century.  Claudio Jeker
The new logic is: up to 1G of physmem, use physical memory / 4; above 1G, add an extra 16MB per 1G of memory. Clamp it down depending on the available kernel virtual address space:
- up to and including 512M -> 64MB (macppc, arm, sh)
- between 512M and 1024M -> 128MB (hppa, i386, mips, luna88k)
- over 1024M clamp to VM_KERNEL_SPACE_SIZE / 4
The result is much more malloc(9) space on 64bit archs with lots of memory and large kva space. Note: amd64 only has 4G of kva and therefore nkmempages is limited to 262144. As a side-effect NKMEMPAGES_MAX and nkmempages_max are no longer used. Tested and OK miod@
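As a rough standalone illustration of the sizing rule above (a sketch, not the kernel code; it assumes a 4K page size and the thresholds quoted in the message):

/*
 * Sketch of the nkmempages sizing rule described above.
 * Assumes a 4K page size; the real kernel code may differ in detail.
 */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define MEG		(1024ULL * 1024)
#define GIG		(1024ULL * MEG)

static uint64_t
nkmempages_guess(uint64_t physmem, uint64_t kva)
{
	uint64_t bytes, limit;

	/* Up to 1G of physmem: physmem / 4; above that, add 16MB per extra 1G. */
	if (physmem <= GIG)
		bytes = physmem / 4;
	else
		bytes = GIG / 4 + (physmem - GIG) / GIG * 16 * MEG;

	/* Clamp according to the available kernel virtual address space. */
	if (kva <= 512 * MEG)
		limit = 64 * MEG;
	else if (kva <= 1024 * MEG)
		limit = 128 * MEG;
	else
		limit = kva / 4;
	if (bytes > limit)
		bytes = limit;

	return bytes / PAGE_SIZE;
}

int
main(void)
{
	/* amd64-like case: plenty of RAM but only 4G of kva. */
	printf("%llu\n", (unsigned long long)nkmempages_guess(64 * GIG, 4 * GIG));
	return 0;
}

With 64G of RAM and 4G of kva the clamp wins and the sketch prints 262144, matching the amd64 figure above.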
2023-12-12  put pinsyscalls(2) into the "always" group  Theo de Raadt
2023-12-12  sync  Theo de Raadt
2023-12-12  remove support for syscall(2) -- the "indirection system call"  Theo de Raadt
because it is a dangerous alternative entry point for all system calls, and thus incompatible with the precision system call entry point scheme we are heading towards. This has been a 3-year mission: First perl needed a code-generated wrapper to fake syscall(2) as a giant switch table, then all the ports were cleaned with relatively minor fixes, except for "go". "go" required two fixes -- 1) a framework issue with old library versions, and 2) like perl, a fake syscall(2) wrapper to handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over the place in the "go" ecosystem because the "go developers" are plan9-loving unix-hating folk who tried to build an ecosystem without allowing "ioctl". ok kettenis, jsing, afresh1, sthen
2023-12-11  Implement per-CPU caching for the page table page (vp) pool and the PTE descriptor (pted) pool in the arm64 pmap implementation.  Mark Kettenis
This significantly reduces the side-effects of lock contention on the kernel map lock that is (incorrectly) translated into excessive page daemon wakeups. This is not a perfect solution but it does lead to significant speedups on machines with many CPU cores. This requires adding a new pmap_init_percpu() function that gets called at the point where the kernel is ready to set up the per-CPU pool caches. Dummy implementations of this function are added for all non-arm64 architectures. Some other architectures can probably benefit from providing an actual implementation that sets up per-CPU caches for pmap pools as well. ok phessler@, claudio@, miod@, patrick@
2023-12-10  sync  Theo de Raadt
2023-12-10  pinsyscalls(2) 2nd argument can be "uint *" instead of "void *"  Theo de Raadt
ok kettenis
2023-12-07  sync  Theo de Raadt
2023-12-07  Add a stub pinsyscalls() system call that simply returns 0 for now, before future work where ld.so(1) will need this new system call.  Theo de Raadt
Putting this in the kernel ahead of time will save some grief. ok kettenis
2023-11-29  regen syscalls  Alexander Bluhm
2023-11-29  Unlock bind(2) syscall.  Alexander Bluhm
For internet sockets sobind() runs with exclusive net lock due to solock(). For unix domain sockets uipc_bind() grabs the kernel lock itself. So sys_bind() is MP safe. Add NOLOCK flag to avoid kernel lock. OK mvs@
2023-11-29  Cleanup kmeminit_nkmempages().  Claudio Jeker
NKMEMPAGES_MIN was removed a long time ago on all archs so there is no need to keep it. Also initialize nkmempages_max at compile time since sparc (with variable page size) is long gone as well. No objection from miod@
2023-11-28  correct spelling of FALLTHROUGH  Jonathan Gray
2023-11-24  Fix comments longer than 80 columns.  ASOU Masato
ok miod@
2023-11-21  Fix kernel build without option PTRACE, but with dt(4).  Alexander Bluhm
Since revision 1.26 dt_ioctl_get_auxbase() calls process_domem(). Build the latter function into the kernel if the pseudo device dt is enabled. from Matthias Pitzl; OK claudio@
2023-11-15  Constify disk_map()'s path argument  Klemens Nanni
The disklabel UID passed in is not modified; reflect that and allow callers to use 'const char *'. OK miod
2023-10-30  Do not truncate MSG_EOR in recvmsg().  Alexander Bluhm
The soreceive() code depends on the fact that MSG_EOR is set on the last mbuf of the chain. In sbappendcontrol() move MSG_EOR to the end like sbcompress() does. This fixes MSG_EOR handling for SOCK_SEQPACKET sockets with a control message. bug reported by Eric Wong; analysed, tested and OK claudio@
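A hypothetical userland test along these lines exercises the fixed case: send one record carrying both a control message and MSG_EOR over a SOCK_SEQPACKET socketpair, then check that recvmsg(2) still reports MSG_EOR (sketch only; the names and the choice of SCM_RIGHTS are illustrative):

/* Sketch: MSG_EOR plus a control message on a SOCK_SEQPACKET socket. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int sv[2], fd = 0;	/* pass stdin, just to have an SCM_RIGHTS payload */
	char buf[16] = "hello";
	char cbuf[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { .iov_base = buf, .iov_len = 5 };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == -1)
		return 1;

	/* Build a record with both data and a control message, marked MSG_EOR. */
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	if (sendmsg(sv[0], &msg, MSG_EOR) == -1)
		return 1;

	/* Receive the record and check that MSG_EOR survived. */
	memset(&msg, 0, sizeof(msg));
	iov.iov_len = sizeof(buf);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	if (recvmsg(sv[1], &msg, 0) == -1)
		return 1;
	printf("MSG_EOR %s\n", (msg.msg_flags & MSG_EOR) ? "set" : "missing");
	return 0;
}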
2023-10-30  Use ERESTART for any single_thread_set() error in sys_execve().  Claudio Jeker
If single thread is already held by another thread, just unwind to userret(), wait there and retry the system call later (if at all). OK mpi@
2023-10-24  Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside.  Claudio Jeker
Clean up these code paths and make them machine independent.
- when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code, change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop.
Since the last two cases require a context switch to the idle proc, factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
2023-10-20  Avoid assertion failure when splitting mbuf cluster.  Alexander Bluhm
m_split() calls m_align() to initialize the data pointer of the newly allocated mbuf. If the new mbuf will be converted to a cluster, this is not necessary. If additionally the new mbuf is larger than MLEN, this can lead to a panic. Only call m_align() when a valid m_data is needed. This is the case if we do not reference the existing cluster, but memcpy() the data into the new mbuf. Reported-by: syzbot+0e6817f5877926f0e96a@syzkaller.appspotmail.com OK claudio@ deraadt@
2023-10-17  clockintr: move callback-specific API behaviors to "clockrequest" namespace  Scott Soule Cheloha
The API's behavior when invoked from a callback function is impossible to document. Move the special behavior into a distinct namespace, "clockrequest".
- Add a 'struct clockrequest'. Basically a stripped-down 'struct clockintr' for exclusive use during clockintr_dispatch().
- In clockintr_queue, replace the "cq_shadow" clockintr with a "cq_request" clockrequest. They serve the same purpose.
- CLST_SHADOW_PENDING -> CR_RESCHEDULE; different namespace, same meaning.
- CLST_IGNORE_SHADOW -> CLST_IGNORE_REQUEST; same meaning.
- Move shadow branch in clockintr_advance() to clockrequest_advance().
- clockintr_request_random() becomes clockrequest_advance_random().
- Delete dead shadow branches in clockintr_cancel(), clockintr_schedule().
- Callback functions now get a clockrequest pointer instead of a special clockintr pointer: update all prototypes, callers.
No functional change intended.
2023-10-12  timeout: add TIMEOUT_MPSAFE flag  Scott Soule Cheloha
Add a TIMEOUT_MPSAFE flag to signal that a timeout is safe to run without the kernel lock. Currently, TIMEOUT_MPSAFE requires TIMEOUT_PROC. When the softclock() is unlocked in the future this dependency will be removed. On MULTIPROCESSOR kernels, softclock() now shunts TIMEOUT_MPSAFE timeouts to a dedicated "timeout_proc_mp" bucket for processing by the dedicated softclock_thread_mp() kthread. Unlike softclock_thread(), softclock_thread_mp() is not pinned to any CPU and runs at IPL_NONE. Prompted by bluhm@. Lots of input from bluhm@. Joint work with mvs@. Prompt: https://marc.info/?l=openbsd-tech&m=169646019109736&w=2 Thread: https://marc.info/?l=openbsd-tech&m=169652212131109&w=2 ok mvs@
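A rough sketch of how a driver could opt into this, assuming the five-argument timeout_set_flags(9) form (timeout, fn, arg, kclock, flags) and a hypothetical mydev driver; check timeout(9) for the authoritative interface:

/*
 * Sketch only: opting a timeout into the MP-safe softclock path.
 * Assumes timeout_set_flags(9) takes (timeout, fn, arg, kclock, flags).
 */
#include <sys/param.h>
#include <sys/timeout.h>

struct mydev_softc {
	struct timeout	sc_tick;	/* hypothetical per-device timeout */
};

void
mydev_tick(void *arg)
{
	/* Runs without the kernel lock: touch only properly locked state here. */
}

void
mydev_start_tick(struct mydev_softc *sc)
{
	/* TIMEOUT_MPSAFE currently requires TIMEOUT_PROC, as noted above. */
	timeout_set_flags(&sc->sc_tick, mydev_tick, sc, KCLOCK_NONE,
	    TIMEOUT_PROC | TIMEOUT_MPSAFE);
	timeout_add_sec(&sc->sc_tick, 1);
}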
2023-10-11  kernel: expand fixed clock interrupt periods to 64-bit values  Scott Soule Cheloha
Technically, all the current fixed clock interrupt periods fit within an unsigned 32-bit value. But 32-bit multiplication is an accident waiting to happen. So, expand the fixed periods for hardclock, statclock, profclock, and roundrobin to 64-bit values. One exception: statclock_mask remains 32-bit because random(9) yields 32-bit values. Update the initclocks() comment to make it clear that this is not an accident.
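The hazard is easy to demonstrate in isolation: a product of two 32-bit values silently wraps, while widening the period to 64 bits keeps the arithmetic exact (standalone illustration, not kernel code):

/* Why 32-bit period arithmetic is an accident waiting to happen. */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t period32 = 1000000000U / 100;	/* 10 ms tick period in ns */
	uint64_t period64 = period32;
	uint32_t ticks = 1000;			/* ten seconds worth of ticks */

	/* The 32-bit product wraps: 10^10 does not fit in 32 bits. */
	printf("32-bit: %u\n", period32 * ticks);
	/* The 64-bit product is exact. */
	printf("64-bit: %llu\n", (unsigned long long)(period64 * ticks));
	return 0;
}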
2023-10-11  clockintr: move clockintr_schedule() into public API  Scott Soule Cheloha
Prototype clockintr_schedule() in <sys/clockintr.h>.
2023-10-11  clockintr_stagger: rename parameters: "n" -> "numer", "count" -> "denom"  Scott Soule Cheloha
Rename these parameters to align the code with the forthcoming manpage. No functional change.
2023-10-08  clockintr: move intrclock wrappers from sys/clockintr.h to kern_clockintr.c  Scott Soule Cheloha
intrclock_rearm() and intrclock_trigger() are not part of the public API, so there's no reason to implement them in sys/clockintr.h. Move them to kern_clockintr.c.
2023-10-06  In sys___thrsigdivert() switch tsleep_nsec() to use the nowake ident channel instead of inventing its own one.  Claudio Jeker
OK kettenis@ mvs@
2023-10-01  Add sysctl hw.ucomnames to list 'fixed' paths to USB serial ports.  Kenneth R Westerback
Suggested by deraadt@, USB route idea from kettenis@. Feedback from anton@, man page improvements from deraadt@, jmc@, schwarze@. ok deraadt@ kettenis@
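Reading the new variable from userland is an ordinary two-step sysctl(2) call (size query, then fetch). The sketch below assumes the identifier is exported as HW_UCOMNAMES under CTL_HW; running "sysctl hw.ucomnames" from the shell shows the same string:

/* Sketch: read hw.ucomnames; assumes the HW_UCOMNAMES MIB constant. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int mib[2] = { CTL_HW, HW_UCOMNAMES };
	size_t len;
	char *names;

	if (sysctl(mib, 2, NULL, &len, NULL, 0) == -1)
		return 1;
	if ((names = malloc(len)) == NULL)
		return 1;
	if (sysctl(mib, 2, names, &len, NULL, 0) == -1)
		return 1;
	printf("%s\n", names);
	free(names);
	return 0;
}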
2023-09-29  Extend single_thread_set() mode with additional flag attributes.  Claudio Jeker
The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter the behaviour of single_thread_set(). This allows explicit control of the SINGLE_DEEP behaviour.
If SINGLE_DEEP is set, the deep flag is passed to the initial check call and by that the check will error out instead of suspending (SINGLE_UNWIND) or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to single_thread_set() outside of userret. E.g. at the start of sys_execve because the proc is not allowed to call exit1() in that location.
SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore returns BEFORE all threads have been parked. Currently this is only used by the ptrace code and should not be used anywhere else. Not waiting for all threads to settle is asking for trouble.
This solves an issue by using SINGLE_UNWIND in the coredump case where the code should actually exit in case another thread crashed moments earlier. Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since the call to pledge_fail() is for sure not at the kernel boundary. OK mpi@
2023-09-25  ddb(4): clockintr: print cl_arg address when displaying a clockintr  Scott Soule Cheloha
2023-09-24  kern_clockintr.c: remove extra newline  Scott Soule Cheloha
2023-09-23  Fix unreliable sys_setsockopt() with consistent use of M_WAIT  Jan Klemkow
Also remove useless NULL check. ok bluhm@
2023-09-22  Make `logread_filterops' MP safe.  Vitaliy Makkoveev
For that purpose use the `log_mtx' mutex(9) protecting the message buffer. ok bluhm
2023-09-21  Move code inside exit1() to better spots.  Claudio Jeker
- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing the p->p_fd cache. lim_free() can potentially sleep and so needs to be above the line where p_stat is set to SDEAD.
With and OK jca@
2023-09-19  Improve the output of ddb "show proc" command  Claudio Jeker
Include missing fields -- like the sleep channel and message -- and show both the PID and TID of the proc. Also add '/t' as an argument that can be used to specify a proc by TID instead of by address. OK mpi@
2023-09-19  Add a KASSERT for p->p_wchan == NULL to setrunqueue()  Claudio Jeker
There is the same check in sched_chooseproc() but that is too late to know where the bad insertion into the runqueue was done. OK mpi@
2023-09-19  Before coredump or in pledge_fail use SINGLE_UNWIND to stop all threads.  Claudio Jeker
SINGLE_UNWIND unwinds to the kernel boundary. On the other hand SINGLE_SUSPEND will sleep inside tsleep(9) and other sleep functions. Since the code will exit1() very soon after, it is better to already unwind. Now one could argue that for coredumps all threads should stop asap to get a clean dump. Using SINGLE_UNWIND the sleep will fail with ERESTART and no copyout should happen in that case. This is a bit of a workaround since SINGLE_SUSPEND has a small race where single_thread_wait() returns before all threads are really stopped. When SINGLE_EXIT is called quickly after, this can blow up inside sleep_finish(). Reported-by: syzbot+3ef066fcfaf991f2ac2c@syzkaller.appspotmail.com OK mpi@ kettenis@
2023-09-17  clockintr.h: forward-declare "struct cpu_info" for clockintr_establish()  Scott Soule Cheloha
With input from claudio@ and deraadt@.