path: root/sys/kern/kern_sysctl.c
Age [Author] Commit message
2024-11-08 [Alexander Bluhm] Use PCB iterator for raw IPv6 input loop.
Implement inpcb iterator in rip6_input(). Factor out the real work to rip6_sbappend(). Now UDP broadcast and multicast, raw IPv4 and IPv6 input work similar. While there, make rip_input() look more like rip6_input(). OK mvs@
2024-11-05 [Alexander Bluhm] Use PCB iterator for raw IP input deliver loop.
Inspired by mvs@ idea of the iterator in the UDP multicast loop, implement the same for raw IP input delivery. This removes an unnecessary rwlock and only uses table mutex. When comparing the inp routing table, address and port, the table lock must be held. So assume that in_pcb_iterator() already has the table mutex and hold it while traversing the list and doing the checks. Release the mutex during mbuf copy, socket buffer append and the upcalls. Adapt the logic for both rip_input() and udp_input(). In rip_input() move the actual work to rip_sbappend(). This can be called without mutex during list traversal and for the final element. OK mvs@
2024-11-05 [Alexander Bluhm] Replace rwlock with iterator in UDP input multicast loop.
The broadcast and multicast loop in udp_input() is protected by the table mutex. The relevant PCBs were collected in a separate list, which was processed while the table notify rwlock was held. When sending UDP multicast packets over vxlan(4) configured over UDP with multicast groups, this lock was taken recursively causing a kernel crash. By using an iterator, traversing the PCB list of the table does not require to hold the mutex all the time. Only while accessing the next element after the iterator, the mutex is taken for a short time. udp_sbappend() and the upcall to vxlan_input() is done with neither mutex nor rwlock. The PCB is reference counted while traversing the list. crash reported by Holger Glaess; iterator implemented by mvs@; tested and fixed by bluhm@; OK mvs@
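The locking pattern described above can be modelled in userland as follows. This is only a sketch under stated assumptions: `struct pcb`, `struct pcb_table`, `pcb_iter_next()` and `deliver_all()` are hypothetical names, a pthread mutex stands in for the table mutex, and the model does not handle an element being unlinked while it is being processed, which the kernel iterator has to deal with.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCB; not the kernel's struct inpcb. */
struct pcb {
	struct pcb *next;
	int refcnt;		/* references held by iterators */
	int wants_packet;	/* result of the address/port match */
	int delivered;		/* packets appended to this PCB */
};

struct pcb_table {
	pthread_mutex_t mtx;	/* models the table mutex */
	struct pcb *head;
};

/*
 * Return the next matching PCB after prev (or the first one when prev
 * is NULL) with a reference taken.  The mutex is held only while the
 * list pointers are followed and the match is evaluated.
 */
struct pcb *
pcb_iter_next(struct pcb_table *t, struct pcb *prev)
{
	struct pcb *p;

	pthread_mutex_lock(&t->mtx);
	p = (prev == NULL) ? t->head : prev->next;
	while (p != NULL && !p->wants_packet)
		p = p->next;
	if (p != NULL)
		p->refcnt++;
	if (prev != NULL) {
		assert(prev->refcnt > 0);
		prev->refcnt--;
	}
	pthread_mutex_unlock(&t->mtx);
	return p;
}

/* Deliver to every matching PCB; returns the number of deliveries. */
int
deliver_all(struct pcb_table *t)
{
	struct pcb *p = NULL;
	int n = 0;

	while ((p = pcb_iter_next(t, p)) != NULL) {
		/* No lock held here: copy, append and upcall would go here. */
		p->delivered++;
		n++;
	}
	return n;
}
```

Because the per-element work runs without the mutex, an upcall that re-enters the same table cannot take the lock recursively, which is the failure mode the commit message describes.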
2024-10-31 [Vitaliy Makkoveev] Unlock fs_sysctl(). It is the only `suid_clear' variable - atomically
accessed integer. ok bluhm
2024-10-28 [Vitaliy Makkoveev] Unlock KERN_ALLOWKMEM. The `allowkmem' is atomically accessed integer.
Also use atomic_load_int(9) to load `securelevel'. sysctl_securelevel() is mp-safe, but will be under kernel lock until all existing `securelevel' loading became mp-safe too. ok mpi
2024-10-25 [Vitaliy Makkoveev] Unlock timeout_sysctl(). `tostat' timeout(9) statistics is already
protected by `timeout_mtx' mutex(9). ok kettenis
2024-09-30 [Claudio Jeker] Use ps_ppid instead of ps_pptr->ps_pid in all places.
OK mpi@
2024-09-24 [Alexander Bluhm] Fix sleeping race during malloc in sysctl hw.disknames.
When mallocarray(9) sleeps, disk_count can change, and diskstatslen gets inconsistent. This caused free(9) to panic. Reported-by: syzbot+36e1f3b306f721f90c72@syzkaller.appspotmail.com OK deraadt@ mpi@
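One common way to close this kind of race is to snapshot the count, allocate, and then re-check the count before using the buffer. The sketch below models that idea in userland; it is not necessarily how the commit fixes it. `item_count`, `pending_attach`, `simulated_sleep()` and `alloc_for_items()` are hypothetical names, and the "sleep" is simulated by a function that may change the count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared state: a count that may change while we sleep. */
int item_count = 3;
int pending_attach = 0;	/* items that "arrive" during the next sleep */

/* Models an allocator that sleeps: the count may change underneath us. */
void
simulated_sleep(void)
{
	item_count += pending_attach;
	pending_attach = 0;
}

/*
 * Allocate one int per item.  The length used for the allocation is
 * remembered and only handed out if the count is unchanged afterwards;
 * otherwise the buffer is released and the allocation is retried.
 */
int *
alloc_for_items(size_t *lenp)
{
	assert(lenp != NULL);
	for (;;) {
		int snapshot = item_count;
		size_t len = (size_t)snapshot;
		int *buf = calloc(len ? len : 1, sizeof(*buf));

		simulated_sleep();
		if (buf == NULL)
			return NULL;
		if (snapshot == item_count) {
			*lenp = len;	/* length matches the allocation */
			return buf;
		}
		free(buf);		/* stale size: drop it and retry */
	}
}
```

The point of the model is that the recorded length always describes the buffer that was actually allocated, so a later release never sees an inconsistent size.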
2024-08-29 [Alexander Bluhm] Show expensive mbuf operations in netstat(1) statistics.
If the memory layout is not optimal, m_defrag(), m_prepend(), m_pullup(), and m_pulldown() will allocate mbufs or copy memory. Count these operations to find possible optimizations. input dhill@; OK mvs@
2024-08-26 [Vitaliy Makkoveev] style(9) fix. No functional changes.
2024-08-23 [Vitaliy Makkoveev] Fix KERN_AUDIO broken in rev 1.440.
2024-08-22 [Vitaliy Makkoveev] Introduce sysctl_securelevel() to modify `securelevel' mp-safe. Keep
KERN_SECURELVL locked until existing `securelevel' checks became moved out of kernel lock. Make sysctl_securelevel_int() mp-safe by using atomic_load_int(9) to unlocked read-only access for `securelevel'. Unlock KERN_ALLOWDT. `allowdt' is the atomically accessed integer used only once in dtopen(). ok mpi
2024-08-20 [Vitaliy Makkoveev] Unlock KERN_MAXFILES.
`maxfiles' is atomically accessed integer which is lockless and read-only accessed in file descriptors layer. lim_startup() called during kernel bootstrap, no need to atomic_load_int() within. ok mpi
2024-08-20 [Vitaliy Makkoveev] Unlock KERN_MAXPROC and KERN_MAXTHREAD from `kern_vars'. Both
`maxprocess' and `maxthread' are atomically accessed integers. ok mpi
2024-08-20 [Vitaliy Makkoveev] Unlock sysctl_audio().
It is the only KERN_AUDIO_RECORD. `audio_record_enable' is atomically accessed integer. Reasonable from deraadt
2024-08-14 [Vitaliy Makkoveev] Push kernel lock down to net_sysctl().
All except PF_MPLS paths are mp-safe:
- net_link_sysctl() and following net_ifiq_sysctl() only return EOPNOTSUPP;
- uipc_sysctl() - mp-safe atomic access to integers;
- bpf_sysctl() - mp-safe atomic access to integers;
- pflow_sysctl() - returns statistics from per-CPU counters;
- pipex_sysctl() - mp-safe atomic access to integer;
Push kernel lock down to mpls_sysctl(). sysctl_int_bounded() does its copying through a local variable, so a context switch is safe. No need to wire memory or take the `sysctl_lock' rwlock(9). Keep protocols locked as they were, including page wiring. Copying will not sleep - no network slowdown while doing it with net lock held. ok bluhm
2024-08-14 [Vitaliy Makkoveev] Make sysctl_int() and sysctl_int_lower() mp-safe and unlock KERN_HOSTID.
The only difference between sysctl_int() and sysctl_int_bounded() is the range check, so sysctl_int() is just sysctl_int_bounded(..., INT_MIN, INT_MAX). sysctl_int() is not the fast path, so this useless check is not significant. Mp-safe sysctl_int() is meaningless for sysctl_int_lower(), so rework it in the sysctl_int_bounded() style. This time all affected paths are kernel locked, but this doesn't make sysctl_int_lower() worse. Change the type of `hostid' to int. It is only stored and never used within the kernel; userland accesses it through sysctl_int(). Nothing changes, but the variable becomes consistent with sysctl_int(). ok bluhm
2024-08-11 [Vitaliy Makkoveev] Make exit1() wait sysctl(2) `allprocess' loops.
Regardless of wired userland memory, KERN_FILE_BYPID and KERN_FILE_BYUID `allprocess' loops have netlock provided sleep points, so concurrent process exit(1) could crash the kernel. The main exit1() problem is that process teardown begins while the process is still linked to the `allprocess' list, and current code doesn't allow to unlink it first. Wait for concurrent sysctl(2) `allprocess' loops between PS_EXITING bit setting and list unlinking. Both KERN_FILE_BYPID and KERN_FILE_BYUID loops do PS_EXITING check and won't deal with dying process. Concurrent exit1() thread will wait for the loops while keeping the process linked to the `allprocess' list. Tested with i386 dpb(1) run. Stress tests and ok bluhm.
2024-08-08 [Alexander Bluhm] In sysctl KERN_FILE_BYPID stop traversal after pid has been found.
When searching for a specific process, there is no need to traverse the list of all processes to the end. Break after pid has been found and the file structure has been filled. Also check for arg >= 0 as this is consistent with the arg < -1 check before. This makes no functional difference as process 0 has PS_SYSTEM set and is skipped anyway. OK millert@ mvs@
2024-08-08 [Vitaliy Makkoveev] Unlock KERN_MSGBUFSIZE and KERN_CONSBUFSIZE.
`msgbufp' and `consbufp' are immutable, as are `msg_magic' and `msg_bufs'. initmsgbuf() and initconsbuf() which initialize these buffers are called during kernel bootstrap, when concurrent sysctl(2) is impossible, so they don't need to be reordered or use barriers. ok bluhm
2024-08-06 [Vitaliy Makkoveev] Unlock KERN_CLOCKRATE.
Read-only access to local `clkinfo' filled with immutable data. ok bluhm
2024-08-05 [Vitaliy Makkoveev] Unlock KERN_BOOTTIME.
microboottime() and following binboottime() are mp-safe and `mb' is local data. ok bluhm
2024-08-05 [Vitaliy Makkoveev] Unlock most of `kern_vars' variables.
Add corresponding cases to the kern_sysctl() switch and unlock read-only variables from `kern_vars'. Unlock KERN_SOMAXCONN and KERN_SOMINCONN which are atomically read-only accessed only from solisten(). ok kettenis
2024-08-05 [Vitaliy Makkoveev] Take `sysctl_lock' before kernel lock.
ok bluhm
2024-08-02 [Vitaliy Makkoveev] Push kernel lock down to sysctl(2).
Unlock a few obvious immutable or read-only variables from "kern.*" and "hw.*" paths. Keep the remaining variables locked as before, including page wiring. Use new sysctl_vs{,un}lock() functions introduced for that purpose.
In kern.* path:
- KERN_OSTYPE, KERN_OSRELEASE, KERN_OSVERSION, KERN_VERSION - immutable;
- KERN_NUMVNODES - read-only access to integer;
- KERN_MBSTAT - read-only access to per-CPU counters;
In hw.* path:
- HW_MACHINE, HW_MODEL, HW_NCPUONLINE, HW_PHYSMEM, HW_VENDOR, HW_PRODUCT, HW_VERSION, HW_SERIALNO, HW_UUID, HW_PHYSMEM64 - immutable;
- HW_USERMEM and HW_USERMEM64 - `physmem' is immutable, uvmexp.wired is mutable but integer; read-only access to locally stored difference between `physmem' and uvmexp.wired;
- `hw_vars' - read-only access to integers; some of them like HW_BYTEORDER and HW_PAGESIZE are immutable;
ok bluhm kettenis
2024-07-11 [Alexander Bluhm] Use atomic operations to access integers in sysctl(2).
In sysctl_int_bounded() use atomic operations to load, store, or swap integer values. By using volatile pointers this will result in a single assembly instruction, no matter how over optimizing compilers will become. Note that this does not solve data dependency problems, nor MP problems in the kernel code using these integers. For full MP safety additional considerations, memory barriers, or locks will be needed where the values are used. But for simple integer in- and output volatile is enough. If new and old value pointers are given to sysctl, atomic swapping guarantees that userland sees the same old value only once. There are more sysctl_int() functions that have to be adapted. OK deraadt@ kettenis@
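The load, store and swap cases described above can be illustrated with a small userland model. This is a sketch only: it uses C11 `<stdatomic.h>` as a stand-in for the kernel's atomic primitives, and `model_int_bounded()` with its argument list is a hypothetical simplification, not the kernel function's real signature.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Userland model of a bounded integer sysctl handler.
 *   oldp: if non-NULL, receives the previous value
 *   newp: if non-NULL, supplies a new value within [minimum, maximum]
 * Returns 0 on success or EINVAL if the new value is out of range.
 */
int
model_int_bounded(int *oldp, const int *newp, atomic_int *valp,
    int minimum, int maximum)
{
	assert(minimum <= maximum);

	if (newp != NULL) {
		if (*newp < minimum || *newp > maximum)
			return EINVAL;
		if (oldp != NULL)
			/* Swap: each old value is observed by one caller. */
			*oldp = atomic_exchange(valp, *newp);
		else
			atomic_store(valp, *newp);
	} else if (oldp != NULL) {
		*oldp = atomic_load(valp);
	}
	return 0;
}
```

When both pointers are supplied, the exchange returns the old value and installs the new one in a single step, so two concurrent writers cannot both read the same previous value. As the commit message notes, this says nothing about how the rest of the code consumes the integer.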
2024-07-08 [Claudio Jeker] Rework per proc and per process time usage accounting
For procs (threads) the accounting happens now lockless by curproc using a generation counter. Callers need to use tu_enter() and tu_leave() for this. To read the proc p_tu struct tuagg_get_proc() should be used. It ensures that the values read is consistent. For processes only the time of exited threads is accumulated in ps_tu and to get the proper process time usage tuagg_get_process() needs to be called. tuagg_get_process() will sum up all procs p_tu plus the ps_tu. This removes another SCHED_LOCK() dependency. Adjust the code in exit1() and exit2() to correctly account for the full run time. For this adjust sched_exit() to do the runtime accounting like it is done in mi_switch(). OK jca@ dlg@
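A generation counter of the kind mentioned above can be sketched in userland like this. The model assumes a single writer per record and uses C11 atomics with default (sequentially consistent) ordering; `struct tu_model` and the `tu_model_*()` functions are hypothetical and only borrow the idea of an enter/leave pair, not the kernel's actual structures or signatures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a per-thread usage record. */
struct tu_model {
	atomic_uint gen;		/* odd while an update is in progress */
	atomic_uint_fast64_t ticks;
	atomic_uint_fast64_t runtime_ns;
};

/* Writer side: only the owning thread calls these. */
void
tu_model_enter(struct tu_model *tu)
{
	atomic_fetch_add(&tu->gen, 1);	/* even -> odd */
}

void
tu_model_leave(struct tu_model *tu)
{
	assert(atomic_load(&tu->gen) & 1);
	atomic_fetch_add(&tu->gen, 1);	/* odd -> even */
}

void
tu_model_account(struct tu_model *tu, uint64_t ticks, uint64_t ns)
{
	tu_model_enter(tu);
	atomic_fetch_add(&tu->ticks, ticks);
	atomic_fetch_add(&tu->runtime_ns, ns);
	tu_model_leave(tu);
}

/* Reader side: retry until both fields were read under one even generation. */
void
tu_model_read(struct tu_model *tu, uint64_t *ticks, uint64_t *ns)
{
	unsigned int before, after;

	do {
		before = atomic_load(&tu->gen);
		*ticks = atomic_load(&tu->ticks);
		*ns = atomic_load(&tu->runtime_ns);
		after = atomic_load(&tu->gen);
	} while (before != after || (before & 1));
}
```

The writer never blocks; a reader that overlaps an update simply retries, which is what makes the accounting path lock-free for the owning thread.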
2024-04-12 [Alexander Bluhm] Split single TCP inpcb table into IPv4 and IPv6 parts.
With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-03-29 [Miod Vallat] Remove one global variable duplicating uvmexp.pagesize.
ok guenther@ deraadt@
2024-02-10 [Theo de Raadt] On kernels without ucom(4) support, 'sysctl hw.ucomnames' should return
the empty string, rather than error. ok krw
2024-01-19 [Alexander Bluhm] Backout priterator() for walking allprocess list.
This approach does not work as LIST_NEXT() of a removed element does not return NULL. It causes a crash in syzkaller and triggers kernel diagnostic assertion "vp->v_uvcount == 0" in sys/kern/kern_unveil.c line 845 during reboot. Unfortunately the backout brings back the race in fill_file() and fstat(1) may crash the kernel. Reported-by: syzbot+54fba1c004d7383d5e85@syzkaller.appspotmail.com
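The LIST_NEXT() behaviour mentioned above can be reproduced in userland with the `<sys/queue.h>` list macros. The sketch below is an illustration, not kernel code: `struct node` and `next_is_null_after_remove()` are hypothetical, and the function only compares pointers because some builds of the BSD macros poison the links of a removed element instead of leaving them stale.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct node {
	int id;
	LIST_ENTRY(node) link;
};
LIST_HEAD(node_list, node);

/*
 * Build the list a -> b -> c, unlink b, then ask b for its successor.
 * Returns 1 if LIST_NEXT() on the removed element yields NULL, else 0.
 */
int
next_is_null_after_remove(void)
{
	struct node_list head;
	struct node a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

	LIST_INIT(&head);
	LIST_INSERT_HEAD(&head, &c, link);
	LIST_INSERT_HEAD(&head, &b, link);
	LIST_INSERT_HEAD(&head, &a, link);	/* a -> b -> c */

	LIST_REMOVE(&b, link);			/* a -> c */
	assert(LIST_NEXT(&a, link) == &c);	/* the list itself is fine */

	/* b is no longer on the list, yet its link is not reset to NULL. */
	return LIST_NEXT(&b, link) == NULL;
}
```

A traversal that resumes from a remembered element therefore cannot use a NULL check to detect that the element was removed while the walker slept.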
2024-01-18 [Vitaliy Makkoveev] Use solock() instead of netlock within fill_file(). This makes all
socket types protected. The netlock is still used while fill_file() called through *table.inpt_queue walkthroughs, but this is the inet sockets case. ok bluhm
2024-01-15 [Vitaliy Makkoveev] Introduce priterator(), the `ps_list' iterator. Some of `allprocess'
list walkthroughs have context switch within, so make exit1() wait until the last reference released. Reported-by: syzbot+0e9dda76c42c82c626d7@syzkaller.appspotmail.com ok bluhm claudio
2024-01-10 [Alexander Bluhm] Split UDP PCB table into IPv4 and IPv6.
Having two hash tables instead of a common one, reduces table size and contention on the per table lock. The address family is always known in advance. The lookups and loops are more specific. OK sashan@
2023-10-01 [Kenneth R Westerback] Add sysctl hw.ucomnames to list 'fixed' paths to USB serial
ports. Suggested by deraadt@, USB route idea from kettenis@. Feedback from anton@, man page improvements from deraadt@, jmc@, schwarze@. ok deraadt@ kettenis@
2023-09-16 [Martin Pieuchot] Allow counters_read(9) to take an optional scratch buffer.
Using a scratch buffer makes it possible to take a consistent snapshot of per-CPU counters without having to allocate memory. Makes ddb(4) show uvmexp command work in OOM situations. ok kn@, mvs@, cheloha@
2023-07-16 [YASUOKA Masahiko] Make the mbstat preserve the same size which is actually used. Also
revert the previous that the mbstat is located on the stack. ok claudio
2023-07-07 [Alexander Bluhm] Expand the counters in struct mbstat from u_short to u_long. Use
malloc(9) memory instead of kernel stack for sysctl kern.mbstat. from yasuoka@; chunk missed in previous commit; OK claudio@ tb@
2023-07-02 [Scott Soule Cheloha] all platforms, kernel: remove __HAVE_CLOCKINTR symbol
Every platform made the clockintr switch at least six months ago. The __HAVE_CLOCKINTR symbol is now redundant. Remove it. Prompted by claudio@. Link: https://marc.info/?l=openbsd-tech&m=168826181015032&w=2 "makes sense" mlarkin@
2023-05-21 [Claudio Jeker] In sysctl_hwchargestop() check that hw_battery_setchargestop is set
and not hw_battery_setchargestart. OK kettenis@
2023-05-18 [Vitaliy Makkoveev] Backout sysctl(2) unlocking. Lock order issue was triggered in UVM
layer.
2023-05-17 [Mark Kettenis] Implement battery management sysctl. This will provide a set of sysctls
to control the charging of laptop batteries:
* hw.battery.chargemode (int)
  -1: force discharge
  0: inhibit charge
  1: auto
In auto mode charging may be controlled by:
* hw.battery.chargestop (int)
  Percentage (0-100) of last full capacity at which the battery should stop charging.
* hw.battery.chargestart (int)
  Percentage (0-100) of last full capacity at which the battery should start charging.
The idea is that with
  hw.battery.chargemode=1
  hw.battery.chargestop=80
  hw.battery.chargestart=75
the battery would be kept charged within the range between 75% and 80%. Allowable settings and some details of the behavior may differ between hardware implementations. Committing this early to ease testing of further diffs that implement this functionality in acpithinkpad(4) and aplsmc(4). ok kn@
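The start/stop thresholds above amount to a hysteresis band. The following sketch shows one plausible reading of "auto" mode; `should_charge()` is a hypothetical helper, not code from any driver, and as the commit message says the real behavior may differ between hardware implementations.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether to charge in "auto" mode.
 *   level:       current charge as a percentage of last full capacity
 *   chargestart: start charging at or below this percentage
 *   chargestop:  stop charging at or above this percentage
 *   charging:    whether the battery is charging right now
 */
bool
should_charge(int level, int chargestart, int chargestop, bool charging)
{
	assert(chargestart <= chargestop);

	if (level >= chargestop)
		return false;
	if (level <= chargestart)
		return true;
	/* Between the thresholds: keep doing what we were doing. */
	return charging;
}
```

With chargestart=75 and chargestop=80, a battery at 78% keeps charging if it was already charging and stays idle otherwise, which keeps the level inside the 75% to 80% range without toggling at every percent.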
2023-05-04 [Vitaliy Makkoveev] Push kernel lock deep down to sys_sysctl(). At least network subset of
sysctl(8) MIBs relies on netlock or another locks and doesn't require kernel lock, so unlock it. The protocols layer *_sysctl()s are left under kernel lock and will be sequentially unlocked later. ok bluhm@
2023-01-22 [Vitaliy Makkoveev] Move SS_CANTRCVMORE and SS_RCVATMARK bits from `so_state' to `sb_state' of
receive buffer. As was done for the SS_CANTSENDMORE bit, the definitions are kept as is, but now these bits belong to the `sb_state' of the receive buffer. `sb_state' is ORed with `so_state' when socket data is exported to userland. ok bluhm@
2023-01-21 [Vitaliy Makkoveev] Introduce per-sockbuf `sb_state' to use it with SS_CANTSENDMORE.
This time, socket's buffer lock requires solock() to be held. As a part of socket buffers standalone locking work, move socket state bits which represent its buffers state to per buffer state. Opposing the previous reverted diff, the SS_CANTSENDMORE definition left as is, but it used only with `sb_state'. `sb_state' ored with original `so_state' when socket's data exported to the userland, so the ABI kept as it was. Inputs from deraadt@. ok bluhm@
2023-01-14 [Scott Soule Cheloha] sysctl(2): KERN_CPUSTATS: zero struct cpustats before copyout
2022-11-07 [Robert Nagy] introduce a new kern.autoconf_serial sysctl that can be used by userland
to monitor state changes of the kernel device tree. input from dnd; ok dlg@, deraadt@
2022-11-05 [Scott Soule Cheloha] clockintr(9): initial commit
clockintr(9) is a machine-independent clock interrupt scheduler. It emulates most of what the machine-dependent clock interrupt code is doing on every platform. Every CPU has a work schedule based on the system uptime clock. For now, every CPU has a hardclock(9) and a statclock(). If schedhz is set, every CPU has a schedclock(), too.
This commit only contains the MI pieces. All code is conditionally compiled with __HAVE_CLOCKINTR. This commit changes no behavior yet.
At a high level, clockintr(9) is configured and used as follows:
1. During boot, the primary CPU calls clockintr_init(9). Global state is initialized.
2. Primary CPU calls clockintr_cpu_init(9). Local, per-CPU state is initialized. An "intrclock" struct may be installed, too.
3. Secondary CPUs call clockintr_cpu_init(9) to initialize their local state.
4. All CPUs repeatedly call clockintr_dispatch(9) from the MD clock interrupt handler. The CPUs complete work and rearm their local interrupt clock, if any, during the dispatch.
5. Repeat step (4) until the system shuts down, suspends, or hibernates.
6. During resume, the primary CPU calls inittodr(9) and advances the system uptime.
7. Go to step (2). This time around, clockintr_cpu_init(9) also advances the work schedule on the calling CPU to skip events that expired during suspend. This prevents a "thundering herd" of useless work during the first clock interrupt.
In the long term, we need an MI clock interrupt scheduler in order to (1) provide control over the clock interrupt to MI subsystems like timeout(9) and dt(4) to improve their accuracy, (2) provide drivers like acpicpu(4) a means for slowing or stopping the clock interrupt on idle CPUs to conserve power, and (3) reduce the amount of duplicated code in the MD clock interrupt code. Before we can do any of that, though, we need to switch every platform over to using clockintr(9) and do some cleanup.
Prompted by "the vmm(4) time bug," among other problems, and a discussion at a2k19 on the subject.
Lots of design input from kettenis@. Early versions reviewed by kettenis@ and mlarkin@. Platform-specific help and testing from kettenis@, gkoehler@, mlarkin@, miod@, aoyama@, visa@, and dv@. Babysitting and spiritual guidance from mlarkin@ and kettenis@. Link: https://marc.info/?l=openbsd-tech&m=166697497302283&w=2 ok kettenis@ mlarkin@
2022-08-16 [Visa Hankala] Remove obsolete kern.nselcoll sysctl.
OK millert@ deraadt@
2022-08-14 [Jonathan Gray] remove unneeded includes in sys/kern
ok mpi@ miod@