path: root/sys/kern
2023-12-12  Theo de Raadt: remove support for syscall(2) -- the "indirection system call" because
it is a dangerous alternative entry point for all system calls, and thus incompatible with the precision system call entry point scheme we are heading towards. This has been a 3-year mission: First perl needed a code-generated wrapper to fake syscall(2) as a giant switch table, then all the ports were cleaned with relatively minor fixes, except for "go". "go" required two fixes -- 1) a framework issue with old library versions, and 2) like perl, a fake syscall(2) wrapper to handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over the place in the "go" ecosystem because the "go developers" are plan9-loving unix-hating folk who tried to build an ecosystem without allowing "ioctl". ok kettenis, jsing, afresh1, sthen
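The fake syscall(2) wrappers mentioned above are essentially giant switch tables over syscall numbers that dispatch to the direct entry points. A minimal, hypothetical userland sketch of the idea (the function name and the choice of handled syscalls are illustrative, not the actual perl or go code):

    /* Hypothetical shim standing in for the removed syscall(2). */
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <sys/sysctl.h>
    #include <sys/ioctl.h>
    #include <stdarg.h>
    #include <errno.h>

    long
    fake_syscall(long num, ...)
    {
        va_list ap;
        long ret = -1;

        va_start(ap, num);
        switch (num) {
        case SYS_ioctl: {
            int fd = va_arg(ap, int);
            unsigned long req = va_arg(ap, unsigned long);
            void *data = va_arg(ap, void *);

            ret = ioctl(fd, req, data);
            break;
        }
        case SYS_sysctl: {
            int *mib = va_arg(ap, int *);
            u_int miblen = va_arg(ap, u_int);
            void *oldp = va_arg(ap, void *);
            size_t *oldlenp = va_arg(ap, size_t *);
            void *newp = va_arg(ap, void *);
            size_t newlen = va_arg(ap, size_t);

            ret = sysctl(mib, miblen, oldp, oldlenp, newp, newlen);
            break;
        }
        default:
            /* anything else was ported away from syscall(2) */
            errno = ENOSYS;
            break;
        }
        va_end(ap);
        return ret;
    }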
2023-12-11  Mark Kettenis: Implement per-CPU caching for the page table page (vp) pool and the PTE
descriptor (pted) pool in the arm64 pmap implementation. This significantly reduces the side-effects of lock contention on the kernel map lock that is (incorrectly) translated into excessive page daemon wakeups. This is not a perfect solution but it does lead to significant speedups on machines with many CPU cores. This requires adding a new pmap_init_percpu() function that gets called at the point where the kernel is ready to set up the per-CPU pool caches. Dummy implementations of this function are added for all non-arm64 architectures. Some other architectures can probably benefit from providing an actual implementation that sets up per-CPU caches for pmap pools as well. ok phessler@, claudio@, miod@, patrick@
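A sketch of what the new hook might look like, assuming the arm64 pool names pmap_vp_pool and pmap_pted_pool and the pool_cache_init(9) interface; every other architecture gets an empty stub for now:

    #include <sys/pool.h>

    extern struct pool pmap_vp_pool, pmap_pted_pool;    /* assumed names */

    void
    pmap_init_percpu(void)
    {
    #ifdef __aarch64__
        /* Per-CPU magazines take pressure off the pool locks. */
        pool_cache_init(&pmap_vp_pool);
        pool_cache_init(&pmap_pted_pool);
    #endif
        /* Dummy implementation everywhere else, for now. */
    }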
2023-12-10  Theo de Raadt: sync
2023-12-10  Theo de Raadt: pinsyscalls(2) 2nd argument can be "uint *" instead of "void *"
ok kettenis
2023-12-07  Theo de Raadt: sync
2023-12-07  Theo de Raadt: Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call. Putting this in the kernel ahead of time will save some grief. ok kettenis
2023-11-29  Alexander Bluhm: regen syscalls
2023-11-29  Alexander Bluhm: Unlock bind(2) syscall.
For internet sockets sobind() runs with exclusive net lock due to solock(). For unix domain sockets uipc_bind() grabs the kernel lock itself. So sys_bind() is MP safe. Add NOLOCK flag to avoid kernel lock. OK mvs@
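The flag itself lives in sys/kern/syscalls.master; a sketch of what the entry looks like, assuming bind's historical syscall number 104 and the existing argument list:

    104	STD	NOLOCK	{ int sys_bind(int s, const struct sockaddr *name, \
			    socklen_t namelen); }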
2023-11-29  Claudio Jeker: Cleanup kmeminit_nkmempages().
NKMEMPAGES_MIN was removed a long time ago on all archs so there is no need to keep it. Also initialize nkmempages_max at compile time, since sparc (with its variable page size) is long gone as well. No objection from miod@
2023-11-28  Jonathan Gray: correct spelling of FALLTHROUGH
2023-11-24  ASOU Masato: Fix comments longer than 80 columns.
ok miod@
2023-11-21  Alexander Bluhm: Fix kernel build without option PTRACE, but with dt(4).
Since revision 1.26 dt_ioctl_get_auxbase() is calling process_domem(). Build the latter function into the kernel if the pseudo device dt is enabled. from Matthias Pitzl; OK claudio@
2023-11-15  Klemens Nanni: Constify disk_map()'s path argument
The disklabel UID passed in is not modified; reflect that and allow callers to use 'const char *'. OK miod
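The resulting prototype, with argument names assumed for illustration:

    int	disk_map(const char *path, char *mappath, int size, int flags);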
2023-10-30  Alexander Bluhm: Do not truncate MSG_EOR in recvmsg().
The soreceive() code depends on the fact that MSG_EOR is set on the last mbuf of the chain. In sbappendcontrol() move MSG_EOR to the end like sbcompress() does it. This fixes MSG_EOR handling for SOCK_SEQPACKET sockets with control message. bug reported by Eric Wong analysed, tested and OK claudio@
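In mbuf terms, MSG_EOR corresponds to the M_EOR flag on the last mbuf of a chain. A sketch of the idea with illustrative variable names (m0 being the chain appended by sbappendcontrol()):

    struct mbuf *n;

    /* Find the tail of the appended chain. */
    for (n = m0; n->m_next != NULL; n = n->m_next)
        continue;
    /* Keep M_EOR on the tail, mirroring what sbcompress() does. */
    if (n != m0 && (m0->m_flags & M_EOR)) {
        m0->m_flags &= ~M_EOR;
        n->m_flags |= M_EOR;
    }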
2023-10-30  Claudio Jeker: Use ERESTART for any single_thread_set() error in sys_execve().
If single-thread mode is already held by another thread, just unwind to userret(), wait there, and retry the system call later (if at all). OK mpi@
2023-10-24  Claudio Jeker: Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them machine independent.
- when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine-dependent assembler code, change the MP-specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop.
Since the last two cases require a context switch to the idle proc, factor out the common code to sched_toidle() and use it in those places (a sketch follows below). Tested by many on all archs. OK miod@ mpi@ cheloha@
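A minimal sketch of the factored-out helper, assuming the usual schedstate_percpu fields; the real sched_toidle() also updates runqueue state and statistics:

    void
    sched_toidle(void)
    {
        struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
        struct proc *idle = spc->spc_idleproc;

        /* Switch away for good; control never returns here. */
        idle->p_stat = SRUN;
        cpu_switchto(NULL, idle);
        panic("cpu_switchto returned");
    }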
2023-10-20  Alexander Bluhm: Avoid assertion failure when splitting mbuf cluster.
m_split() calls m_align() to initialize the data pointer of the newly allocated mbuf. If the new mbuf will be converted to a cluster, this is not necessary. If additionally the new mbuf is larger than MLEN, this can lead to a panic. Only call m_align() when a valid m_data is needed. This is the case if we do not reference the existing cluster, but memcpy() the data into the new mbuf. Reported-by: syzbot+0e6817f5877926f0e96a@syzkaller.appspotmail.com OK claudio@ deraadt@
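The corrected shape of the code, sketched with simplified names (m is the original mbuf, n the new one, len the split point, remain the bytes moving to n; cluster reference bookkeeping omitted):

    if (m->m_flags & M_EXT) {
        /* n will reference the existing cluster; m_data is
         * derived from the original, so no m_align() here. */
        n->m_data = m->m_data + len;
    } else {
        /* Copy path: n needs a valid, aligned m_data. */
        m_align(n, remain);
        memcpy(mtod(n, caddr_t), mtod(m, caddr_t) + len, remain);
    }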
2023-10-17  Scott Soule Cheloha: clockintr: move callback-specific API behaviors to "clockrequest" namespace
The API's behavior when invoked from a callback function is impossible to document. Move the special behavior into a distinct namespace, "clockrequest".
- Add a 'struct clockrequest'. Basically a stripped-down 'struct clockintr' for exclusive use during clockintr_dispatch().
- In clockintr_queue, replace the "cq_shadow" clockintr with a "cq_request" clockrequest. They serve the same purpose.
- CLST_SHADOW_PENDING -> CR_RESCHEDULE; different namespace, same meaning.
- CLST_IGNORE_SHADOW -> CLST_IGNORE_REQUEST; same meaning.
- Move shadow branch in clockintr_advance() to clockrequest_advance().
- clockintr_request_random() becomes clockrequest_advance_random().
- Delete dead shadow branches in clockintr_cancel(), clockintr_schedule().
- Callback functions now get a clockrequest pointer instead of a special clockintr pointer: update all prototypes, callers.
No functional change intended.
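After the change a callback looks roughly like this (callback name, period variable, and the work helper are hypothetical):

    void
    example_clockintr(struct clockrequest *cr, void *frame, void *arg)
    {
        uint64_t count;

        /* Reschedule via the request API; count is the number of
         * periods that have elapsed since the last dispatch. */
        count = clockrequest_advance(cr, example_period_ns);
        if (count > 0)
            example_work(count);    /* hypothetical helper */
    }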
2023-10-12  Scott Soule Cheloha: timeout: add TIMEOUT_MPSAFE flag
Add a TIMEOUT_MPSAFE flag to signal that a timeout is safe to run without the kernel lock. Currently, TIMEOUT_MPSAFE requires TIMEOUT_PROC. When the softclock() is unlocked in the future this dependency will be removed. On MULTIPROCESSOR kernels, softclock() now shunts TIMEOUT_MPSAFE timeouts to a dedicated "timeout_proc_mp" bucket for processing by the dedicated softclock_thread_mp() kthread. Unlike softclock_thread(), softclock_thread_mp() is not pinned to any CPU and runs at IPL_NONE. Prompted by bluhm@. Lots of input from bluhm@. Joint work with mvs@.
Prompt: https://marc.info/?l=openbsd-tech&m=169646019109736&w=2
Thread: https://marc.info/?l=openbsd-tech&m=169652212131109&w=2
ok mvs@
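A usage sketch, assuming the five-argument timeout_set_flags(9) with a kclock parameter (timeout struct and callback names are illustrative):

    struct timeout example_to;

    void	example_tmo(void *);

    void
    example_init(void *arg)
    {
        /* TIMEOUT_MPSAFE currently requires TIMEOUT_PROC. */
        timeout_set_flags(&example_to, example_tmo, arg, KCLOCK_NONE,
            TIMEOUT_PROC | TIMEOUT_MPSAFE);
        timeout_add_sec(&example_to, 1);
    }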
2023-10-11  Scott Soule Cheloha: kernel: expand fixed clock interrupt periods to 64-bit values
Technically, all the current fixed clock interrupt periods fit within an unsigned 32-bit value. But 32-bit multiplication is an accident waiting to happen. So, expand the fixed periods for hardclock, statclock, profclock, and roundrobin to 64-bit values. One exception: statclock_mask remains 32-bit because random(9) yields 32-bit values. Update the initclocks() comment to make it clear that this is not an accident.
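The hazard being avoided, sketched: with hz=100 the hardclock period is 10,000,000 ns, so a 32-bit product of period and tick count wraps after only ~429 ticks. Keeping the period 64-bit makes such products safe (function and variable names are illustrative):

    extern int hz;

    uint64_t
    example_deadline(uint64_t now, uint64_t ticks_elapsed)
    {
        uint64_t hardclock_period = 1000000000ULL / hz;

        /* With a 32-bit period, period * ticks_elapsed would
         * overflow after ~429 ticks at hz=100. */
        return now + hardclock_period * ticks_elapsed;
    }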
2023-10-11  Scott Soule Cheloha: clockintr: move clockintr_schedule() into public API
Prototype clockintr_schedule() in <sys/clockintr.h>.
2023-10-11  Scott Soule Cheloha: clockintr_stagger: rename parameters: "n" -> "numer", "count" -> "denom"
Rename these parameters to align the code with the forthcoming manpage. No functional change.
2023-10-08  Scott Soule Cheloha: clockintr: move intrclock wrappers from sys/clockintr.h to kern_clockintr.c
intrclock_rearm() and intrclock_trigger() are not part of the public API, so there's no reason to implement them in sys/clockintr.h. Move them to kern_clockintr.c.
2023-10-06  Claudio Jeker: In sys___thrsigdivert() switch tsleep_nsec() to use the nowake ident
channel instead of inventing its own. OK kettenis@ mvs@
2023-10-01  Kenneth R Westerback: Add sysctl hw.ucomnames to list 'fixed' paths to USB serial
ports. Suggested by deraadt@, USB route idea from kettenis@. Feedback from anton@, man page improvements from deraadt@, jmc@, schwarze@. ok deraadt@ kettenis@
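Reading the new node from C looks roughly like this, assuming the mib name HW_UCOMNAMES for the second-level identifier:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        int mib[2] = { CTL_HW, HW_UCOMNAMES };  /* assumed mib name */
        size_t len;
        char *names;

        /* First call sizes the buffer, second fills it. */
        if (sysctl(mib, 2, NULL, &len, NULL, 0) == -1)
            err(1, "sysctl");
        if ((names = malloc(len)) == NULL)
            err(1, NULL);
        if (sysctl(mib, 2, names, &len, NULL, 0) == -1)
            err(1, "sysctl");
        printf("%s\n", names);
        free(names);
        return 0;
    }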
2023-09-29  Claudio Jeker: Extend single_thread_set() mode with additional flag attributes.
The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter the behaviour of single_thread_set(). This allows explicit control of the SINGLE_DEEP behaviour.
If SINGLE_DEEP is set, the deep flag is passed to the initial check call, and by that the check will error out instead of suspending (SINGLE_UNWIND) or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to single_thread_set() outside of userret, e.g. at the start of sys_execve, because the proc is not allowed to call exit1() in that location.
SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore returns BEFORE all threads have been parked. Currently this is only used by the ptrace code and should not be used anywhere else. Not waiting for all threads to settle is asking for trouble.
This solves an issue by using SINGLE_UNWIND in the coredump case, where the code should actually exit in case another thread crashed moments earlier. Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since the call to pledge_fail() is for sure not at the kernel boundary. OK mpi@
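A usage sketch: suspending the other threads from a spot deep in the kernel, where exiting or unwinding on the initial check is not allowed:

    int error;

    /* SINGLE_DEEP: the initial check errors out instead of
     * suspending or exiting, since we are not at the boundary. */
    error = single_thread_set(p, SINGLE_UNWIND | SINGLE_DEEP);
    if (error)
        return (error);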
2023-09-25  Scott Soule Cheloha: ddb(4): clockintr: print cl_arg address when displaying a clockintr
2023-09-24  Scott Soule Cheloha: kern_clockintr.c: remove extra newline
2023-09-23  Jan Klemkow: Fix unreliable sys_setsockopt() with consistent use of M_WAIT
Also remove useless NULL check. ok bluhm@
2023-09-22  Vitaliy Makkoveev: Make `logread_filterops' MP safe. For that purpose use `log_mtx' mutex(9)
protecting the message buffer. ok bluhm
2023-09-21  Claudio Jeker: Move code inside exit1() to better spots.
- PS_PROFIL bit is moved into the process cleanup block where it belongs.
- The proc read-only limit cache cleanup is moved up right after clearing p->p_fd cache. lim_free() can potentially sleep and so needs to be above the line where p_stat is set to SDEAD.
With and OK jca@
2023-09-19  Claudio Jeker: Improve the output of ddb "show proc" command
Include missing fields -- like the sleep channel and message -- and show both the PID and TID of the proc. Also add '/t' as an argument that can be used to specify a proc by TID instead of by address. OK mpi@
2023-09-19  Claudio Jeker: Add a KASSERT for p->p_wchan == NULL to setrunqueue()
There is the same check in sched_chooseproc() but that is too late to know where the bad insertion into the runqueue was done. OK mpi@
2023-09-19  Claudio Jeker: Before coredump or in pledge_fail use SINGLE_UNWIND to stop all threads.
SINGLE_UNWIND unwinds to the kernel boundary. On the other hand, SINGLE_SUSPEND will sleep inside tsleep(9) and other sleep functions. Since the code will call exit1() very soon after, it is better to unwind right away. Now one could argue that for coredumps all threads should stop asap to get a clean dump. Using SINGLE_UNWIND, the sleep will fail with ERESTART and no copyout should happen in that case. This is a bit of a workaround since SINGLE_SUSPEND has a small race where single_thread_wait() returns before all threads are really stopped. When SINGLE_EXIT is called quickly after, this can blow up inside sleep_finish(). Reported-by: syzbot+3ef066fcfaf991f2ac2c@syzkaller.appspotmail.com OK mpi@ kettenis@
2023-09-17  Scott Soule Cheloha: clockintr.h: forward-declare "struct cpu_info" for clockintr_establish()
With input from claudio@ and deraadt@.
2023-09-17  Scott Soule Cheloha: struct clockintr_queue: rename "cq_est" to "cq_all"
"cq_all" is a more obvious name than "cq_est". It's the list of all established clockintrs. Duh.
2023-09-17  Scott Soule Cheloha: clockintr: remove clockintr_init(), clockintr_flags
All the state initialization once done in clockintr_init() has been moved to other parts of the kernel. It's a dead function. Remove it. Likewise, the clockintr_flags variable no longer sports any meaningful flags. Remove it. This frees up the CL_* flag namespace, which might be useful to the clockintr frontend if we ever need to add behavior flags to any of those functions.
2023-09-17  Scott Soule Cheloha: scheduler_start: move static timeout structs into callback functions
Move the schedcpu() and update_loadavg() timeout structs from scheduler_start() into their respective callback functions and statically initialize them with TIMEOUT_INITIALIZER(9). The structs are already hidden from the global namespace and the timeouts are already self-managing, so we may as well fully consolidate things. Thread: https://marc.info/?l=openbsd-tech&m=169488184019047&w=2 "Sure." claudio@
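The resulting pattern, sketched for a hypothetical periodic callback (the real schedcpu() and update_loadavg() follow the same shape):

    void
    example_periodic(void *unused)
    {
        /* Statically initialized, file-local, self-managing. */
        static struct timeout to =
            TIMEOUT_INITIALIZER(example_periodic, NULL);

        /* ... do the periodic work ... */

        timeout_add_sec(&to, 5);    /* reschedule ourselves */
    }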
2023-09-16  Martin Pieuchot: Allow counters_read(9) to take an optional scratch buffer.
Using a scratch buffer makes it possible to take a consistent snapshot of per-CPU counters without having to allocate memory. Makes ddb(4) show uvmexp command work in OOM situations. ok kn@, mvs@, cheloha@
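A usage sketch with hypothetical counter-group and size names (passing NULL for the last argument would presumably keep the old allocating behavior):

    extern struct cpumem *example_counters;     /* hypothetical group */
    #define EXAMPLE_NCOUNTERS   8               /* hypothetical size */

    static uint64_t scratch[EXAMPLE_NCOUNTERS]; /* preallocated */
    uint64_t vals[EXAMPLE_NCOUNTERS];

    /* Consistent snapshot with no allocation: safe when OOM. */
    counters_read(example_counters, vals, EXAMPLE_NCOUNTERS, scratch);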
2023-09-15  Theo de Raadt: work around cpu.h not coming into early scope on all arch
2023-09-14  Scott Soule Cheloha: clockintr: move hz(9)-based initialization out to initclocks()
To separate the hardclock from the clock interrupt subsystem we'll need to move all related state out first. hz(9) is set when we return from cpu_initclocks(), so it's safe to move hardclock_period and roundrobin_period initialization out into initclocks(). Move hardclock_period itself out into kern_clock.c alongside the statclock variables.
2023-09-14  Scott Soule Cheloha: clockintr, scheduler: move statclock handle from clockintr_queue to schedstate_percpu
Move the statclock handle from clockintr_queue.cq_statclock to schedstate_percpu.spc_statclock. Establish spc_statclock during sched_init_cpu() alongside the other scheduler clock interrupts. Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
2023-09-14  Scott Soule Cheloha: clockintr, statclock: eliminate clockintr_statclock() wrapper
- Move remaining statclock variables from kern_clockintr.c to kern_clock.c. Move statclock variable initialization from clockintr_init() into initclocks().
- Change statclock() prototype to make it a legal clockintr callback function and establish the handle with statclock() instead of clockintr_statclock().
- Merge the contents of clockintr_statclock() into statclock(). statclock() can now reschedule itself and handles multiple expirations transparently.
- Make statclock_avg visible from sys/systm.h so that clockintr_cpu_init() can use it to advance the statclock across suspend/hibernate.
Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
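A rough sketch of the self-rescheduling shape of statclock() after the merge, assuming the clockintr callback signature of this era and a hypothetical per-expiration helper (the randomized path is omitted):

    void
    statclock(struct clockintr *cl, void *frame, void *arg)
    {
        uint64_t count, i;

        /* Reschedule and absorb any missed expirations. */
        count = clockintr_advance(cl, statclock_avg);
        for (i = 0; i < count; i++)
            statclock_update(frame);    /* hypothetical helper */
    }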
2023-09-14  Scott Soule Cheloha: clockintr: move clockintr_advance_random() prototype into sys/clockintr.h
statclock() is going to need this. Move the prototype into the public API. Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
2023-09-14  Scott Soule Cheloha: clockintr: replace CL_RNDSTAT with global variable statclock_is_randomized
In order to separate the statclock from the clock interrupt subsystem we need to move all statclock state out into the broader kernel. Start by replacing the CL_RNDSTAT flag with a new global variable, "statclock_is_randomized", in kern_clock.c. Update all clockintr_init() callers to set the boolean instead of passing the flag. Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
2023-09-13  Claudio Jeker: Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;
The change to the single thread API results in crashes inside exit1() as found by Syzkaller. There seems to be a race in the exit codepath. What exactly fails is not really clear, therefore revert for now.
This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.
Reverted commits:
----------------------------
Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex. The single thread API needs to lock the process to enter single thread mode and does not need to stop the scheduler. This code changes ps_singlecount from a count down to zero to ps_singlecnt which counts up until equal to ps_threadcnt (in which case all threads are properly asleep). Tested by phessler@, OK mpi@ cheloha@
----------------------------
Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK. The per process thread list can be traversed (read) by holding either the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK). Abusing the SCHED_LOCK for this makes it impossible to split up the scheduler lock into something more fine grained. Tested by phessler@, ok mpi@
----------------------------
Fix SCHED_LOCK() leak in single_thread_set(). In the (q->p_flag & P_WEXIT) branch is a continue that did not release the SCHED_LOCK. Refactor the code a bit to simplify the places SCHED_LOCK is grabbed and released.
Reported-by: syzbot+ea26d351acfad3bb3f15@syzkaller.appspotmail.com
OK kettenis@
2023-09-10  Scott Soule Cheloha: clockintr: support an arbitrary callback function argument
Callers can now provide an argument pointer to clockintr_establish(). The pointer is kept in a new struct clockintr member, cl_arg. The pointer is passed as the third parameter to clockintr.cl_func when it is executed during clockintr_dispatch(). Like the callback function, the callback argument is immutable after the clockintr is established. At present, nothing uses this. All current clockintr_establish() callers pass a NULL arg pointer. However, I am confident that dt(4)'s profile provider will need this in the near future. Requested by dlg@ back in March.
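Establishment with an argument, sketched (callback, queue-owner, and softc names are hypothetical; the first parameter is shown in its later struct cpu_info flavor):

    void	example_callback(struct clockintr *, void *frame, void *arg);

    struct clockintr *cl;

    /* arg (here: sc) is stored in cl_arg and passed as the third
     * parameter to the callback on every dispatch. */
    cl = clockintr_establish(ci, example_callback, sc);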
2023-09-10  Scott Soule Cheloha: clockintr_dispatch: add intermediate pointer for clockintr_queue.cq_shadow
Adding an intermediate pointer lets me shorten "cq->cq_shadow" to just "shadow". I think it makes the dispatch loop logic a little easier to read. While here, add a clarifying comment.
2023-09-09  Scott Soule Cheloha: kernel: remove schedhz
Now that alpha no longer sets schedhz, schedhz is a dead variable. Remove it. For now, leave the schedclock() call in place in statclock(). It still runs at its default rate of (stathz / 4). Part of mpi@'s WIP scheduler patch. Suggested by mpi@. Thread: https://marc.info/?l=openbsd-tech&m=169419781317781&w=2 ok mpi@
2023-09-09  Scott Soule Cheloha: clockintr_advance: tweak logic to eliminate early-return
With the switch to clockintr_schedule_locked(), clockintr_advance() is now much shorter and the early-return from the non-mutex path doesn't make the function any easier to read. Move the mutex path into the else branch and always return 'count' at the end of the function.